Scientific Soul Searching: Branding for Data Scientists (With a Miyazaki Bonus)

I know, I know. The word "branding" makes me gag a little too. But finding your data niche is so important because data science is a broad field that serves simultaneously as a category of work and a job title. So try not to go blind from rolling your eyes just yet; I'm going to put this in terms a Data Scientist can understand. 

Take the below equation:

y = Xβ + ε

In order to find y, we need the optimal combination of X and β. Apply this to finding a model for your brand: let y be equal to "a career path in data science that I actually want." Can you picture what the terms describing y should be? To me, X is equal to skills, β is equal to interests, and ε is equal to the unknown (more on that later). I call this method scientific soul searching.

Let's get searching

We'll walk through developing these features together. Think hard: the more honest you are with yourself about your skills, interests, and intentions², the stronger and more memorable your brand will be. 

Don't fill up on bread

X - knowing your skills

Data science is a buffet. There are libraries and languages aplenty to keep your eyes glued to a terminal for years. Search job listings and you'll see a broad distribution of skills: Python, R, SQL, viz, ETL, NLP, statistical analysis, deep learning, Hadoop, etc.

It would be tempting to try learning all of these to appear competitive, but this is the absolute wrong strategy. Each is associated with an entirely different skill set.

Instead, consider: what are your strongest skills? Forget about Python, R, SQL and any of the other basic prerequisites for now³. Think of an interesting analysis you did, or the time you took a novel approach to dealing with missing features. What is at the core of each instance? Dig deep - use your soft skills to highlight your hard skills. Set yourself apart. 

These guys would never lie about knowing Hadoop on their resumes, amirite? 

𝛽 - knowing your interests and what's important to you

Take a moment to write down a handful of your interests, both professional and personal. Study what they have in common. Now go down the list and explain why you are interested in each. 

Be true to yourself here. This step is key. I say this because "people don't care what you do, they care why you do it."¹ It's easy to get swept up in what's hot and pretend to be into something simply because there's an opportunity this week. Your honest enthusiasm about a subject and how data science can impact that subject will be apparent to the community at large. 

You've got the X and 𝛽, now prove it with projects

This seems obvious, but I often receive messages from people who don't realize this to be true. You've got the brand, now you have to get to work! Here's my story to demonstrate this:

I had a handful of ideas for my Metis capstone project. I considered the formula I've presented to you: I wanted to highlight my NLP skills, and I had a list of ideas, but none of them felt quite right. I did know, though, that I was drawn to the Supreme Court, a topic I've been personally fascinated with for almost 20 years. 

I pushed forward with an idea to topic model every Supreme Court case in history, excited to talk about it to anyone who would listen. This enthusiasm was memorable: my project was passed around social media hundreds of times, and the buzz resulted in a number of recruiter calls from tech firms like Google. The project has also resulted in plenty of speaking opportunities. 

I'm writing this post now as a pioneer for NLP in legal aid - building complex multi-layer models that connect clients with the appropriate pro bono lawyer. This has been the most rewarding career experience I've had so far, and it came to me because the founder of the startup recognized my enthusiasm through my blog on the project.

Communication is how you tie the whole thing up into a nice brand

You've done the project, now tell us about it. Present it in a compelling way. Blog about it. Consider what would make you want to keep reading if this wasn't your project. Choose something you want to talk about because the more you write and talk about the subject, the more synonymous your name will be with it.

ε is kind of like Chihiro's parents at the unattended buffet. The food is great but they don't consider the outside factors for why it's there. 

But wait, the dreaded ε

It's easy to enter the market wide-eyed and ready to impress. You've sorted out your angles and are confident in your unique voice. But you see rejection after rejection for the coveted title of Data Scientist.

Just like a regression run without accounting for the unknown, you've ignored lurking variables. Take a step back and assess what roles are within your reach right now. Your ε is a good reminder to consider where you want to go, even if you aren't quite there yet. Take the role that fits your career (pro)gression while continuing to build your brand. 

 

 


Parting words: always ask why

If you've made it this far, you should understand branding as more about knowing who you are and what work moves you than it is about choosing the right color scheme for the fall campaign.

When you take it seriously, developing your data science brand is very fun. You get to meet others like you. So always ask yourself why you are interested; doors will open to a much more exciting career path if you stay true to you. 


footnotes

  1. Simon Sinek's excellent talk on how great leaders inspire action.
  2. I didn't want to get too into intentions in this post, but don't be One of Those People who enters the industry solely for the money. If you are currently considering data science as a career, write down your reasons for pursuing it and what you've liked and not liked about previous jobs. Be realistic about your current skills and abilities. Once you have this information, line up informational interviews to confirm your direction. Hint: I have been known to occasionally accept informational interview requests.
  3. Are you an expert at any of the prerequisites? Can you describe a situation that proves it? Then absolutely highlight that skill! For example: you've been writing data processing modules in Python for 3 years and have contributed to scikit-learn - your Python skill is your brand. The majority of Python programmers can't say that about themselves! 

Recent Talks and Resources

I have had the honor of presenting my Supreme Court project, as well as presenting on the impact of data science on law and government a few times throughout the fall and winter (Open Data Science Conference West, Women Who Code, and most recently Demystifying Deep Learning and AI). I am continually impressed by how much the data science community is interested in making positive social impacts, and I look forward to expanding on these talks in the future.

Resources

I wanted to share a few resources for finding civic data online. I will be updating this list periodically as I find more resources. 

  1. data.gov -- general and science oriented

  2. usa.gov/statistics -- census, social bureaus

  3. free.law -- court documents

  4. Human Rights Data Analysis Group -- international data

  5. National Archives -- Military records, ancestry records

  6. American Presidency Project -- actions of presidents

Demystifying Deep Learning Talk

A number of people requested that I put the slides from my Demystifying Deep Learning presentation online. If you have any lingering questions, please reach out to me through the contact form.

To those who came here to learn: pay special attention to the slides explaining non-negative matrix factorization - understanding how algorithms like this produce their outputs will help immensely when learning how to produce good models. 

GitHub repo for the Supreme Court project

visualization -- please let me know if you use my code to visualize a new dataset, I'd love to see it!

Thank you! 🙌

Applying Data Science to the Supreme Court: Topic Modeling Over Time with NMF (and a D3.js bonus)

I was 12 when Gore went up against Bush in the presidential election. For weeks, the only thing anyone seemed to be talking about was "hanging chads." (Who even is Chad, I wondered.) And the nightly news wasn't complete without a cut of an old dude holding up a ballot and peering into its negative space, inexplicably pointing out spots to another old dude. Then suddenly, the election was over and I didn't understand why. What was this "Supreme" Court, and who gave them the power to decide our election?

This experience was the inspiration for my final project for Metis. The Supreme Court is arguably the most important branch of government for guiding our future, but it's incredibly difficult for the average American to get a grasp of what's happening. I decided that a good start in closing this gap would be to model topics over time and create an interactive visualization for anyone with an interest and an internet connection to utilize to educate themselves.

If you'd like to skip the explanation and get straight to the javascript, click on the image below. I'll go over a few interesting points at the end of this post, but I'd love to hear from you about your experience studying this graph. 

How was this done?

The pipeline is below. All code for this project is here. You'll find several ipython notebooks of webscraping code I wrote in this repo. Supreme Court cases are publicly available, but it's not as though you can find them on a spreadsheet!

Natural Language Processing: a rundown

With natural language processing, we have a pile of documents (that's Supreme Court cases in this project), and we need to get to their true essence.

Most words aren't helpful in this process, so we drop them (stopwords). We also know that words like "liking" are really the same as "like" in this context (shh don't tell my literature professors from college I said that), so we lemmatize, which means we replace all those -ings with their roots.
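To make that concrete, here's a rough sketch of the preprocessing step using NLTK. The example sentence and exact library choices are just illustrative; they're one way to do it, not necessarily what's in my notebooks.

```python
# Rough preprocessing sketch with NLTK: drop stopwords, lemmatize the rest.
# One-time setup: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(document):
    """Lowercase, tokenize, drop stopwords and punctuation, then lemmatize."""
    tokens = word_tokenize(document.lower())
    return [
        lemmatizer.lemmatize(token)  # pass pos="v" to reduce verbs like "liking" to "like"
        for token in tokens
        if token.isalpha() and token not in stop_words
    ]

print(preprocess("The petitioners were appealing the rulings."))
# e.g. ['petitioner', 'appealing', 'ruling']
```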

After this, we have a few choices. We need to turn the words into "vectors" (a fancy term for numbers, really) and use those vectors to inform our topic groups. The simplest form of this is CountVectorizer (which counts the words in each document as if they were differently colored marbles in a jar) - more on this and why I didn't use it later. The process I used was TF-IDF, or Term Frequency-Inverse Document Frequency. This takes that simple counting style and down-weights words that are common across all documents, while up-weighting unique words.
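Here's a small scikit-learn sketch of both vectorizers side by side. The placeholder documents stand in for the real corpus of roughly 22,000 full opinions, and the parameters are illustrative rather than the ones I actually used.

```python
# Sketch of the vectorization step with scikit-learn; placeholder documents
# stand in for ~22,000 full Supreme Court opinions.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "The petitioner sought a writ of habeas corpus",
    "The tax commissioner assessed the corporation",
    "The defendant was convicted of murder",
]

# "Marbles in a jar": raw counts of each word per document.
counts = CountVectorizer(stop_words="english").fit_transform(documents)

# TF-IDF: the same counts, but words common across all cases get down-weighted
# and case-specific words get up-weighted.
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vectorizer.fit_transform(documents)
```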

From here, paths diverge based on what the project is. Herein lies my explanation for why I chose NMF (non-negative matrix factorization).

Why NMF and not LDA or LSA?

This particular ipython notebook is a great tool to follow along with this section. LDA was the obvious choice to try first, as is evident when you google "topic modeling algorithm." It's Bayesian-based and returns an actual probability that a document belongs in a topic. But every iteration I tried had it pulling out the legal terms rather than the true essence of the cases themselves. It turns out that LDA is known to be not great at pulling true meaning from documents with similar language (such as 22,000 cases with a ton of legal term overlap). I wanted to know why, and realized that you can't use TF-IDF with LDA (it expects raw word counts), which means that if your documents are similar, LDA will likely not pick up the important words because they don't occur as often.
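For what it's worth, here's what that LDA baseline looks like in scikit-learn. Note that it fits on the raw counts from the CountVectorizer sketch above, not the TF-IDF matrix; the number of topics is a placeholder.

```python
# LDA baseline sketch: scikit-learn's LDA works on raw counts (the `counts`
# matrix from the CountVectorizer sketch above), not TF-IDF weights.
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topic_lda = lda.fit_transform(counts)  # each row is a topic-probability distribution per case
```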

I won't go much into LSA here as its performance was so bad (documented in my ipython notebooks) that I didn't get very far with it.

Then I read about Non-negative Matrix Factorization (NMF) and found that in use cases similar to mine, its robustness far surpassed LDA. NMF extracts latent features via matrix decomposition, and you can use TF-IDF with it, which is a huge plus.
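And here's the corresponding NMF sketch on the TF-IDF matrix from the vectorization step above. The number of topics and the top-terms printout are illustrative, not the settings I ultimately settled on.

```python
# NMF sketch on the TF-IDF matrix from the vectorization sketch above.
from sklearn.decomposition import NMF

n_topics = 10  # illustrative; tune for your corpus
nmf = NMF(n_components=n_topics, random_state=42)
doc_topic = nmf.fit_transform(tfidf)   # W: one row of topic weights per case
topic_term = nmf.components_           # H: one row of term weights per topic

# Peek at the top terms per topic to sanity-check what each topic "means".
terms = tfidf_vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(topic_term):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")
```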

The making of the visualization

There are differing ideas as to how viz fits into data science. As far as I'm concerned, it's an integral part of sealing the deal on the pipeline. With this particular project, it simply wouldn't be enough to say "I did topic modeling of the Supreme Court" with a hard stop. I knew I needed to present this in a way that would be fun and as simple as possible. As my ideas about the "how" progressed, the project revealed itself to have a strong story with time as a factor, so I decided to present the topics as an area chart you could click and drag, with the x axis as time.
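The chart itself is D3.js, but the data shaping behind it is straightforward. Here's a rough pandas/matplotlib sketch of the share-over-time aggregation, assuming the doc_topic matrix from the NMF sketch above and one decision year per case; the years shown are placeholders.

```python
# Sketch of the share-over-time aggregation behind the area chart. The real
# chart is D3.js; this only shows the data shaping, assuming `doc_topic` from
# the NMF sketch above and one decision year per case.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

years = np.array([1937, 1952, 1968])  # placeholder: one year per case
topic_cols = [f"topic_{i}" for i in range(doc_topic.shape[1])]
df = pd.DataFrame(doc_topic, columns=topic_cols)
df["year"] = years

# Sum topic weights per year, then normalize each year to a share of total.
by_year = df.groupby("year")[topic_cols].sum()
shares = by_year.div(by_year.sum(axis=1), axis=0)

plt.stackplot(shares.index, shares.T.values, labels=topic_cols)
plt.xlabel("Year")
plt.ylabel("Share of topic weight")
plt.legend(loc="upper left", fontsize="small")
plt.show()
```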

Before I go any further, I wanted to say a special thanks to Metis TA Ramesh Sampath for helping me get my ideas into javascript, as I was still pretty new to D3.js at the time.

Here are a few interesting points I noticed once I got my data into my viz. My comments for each are in the captions.

First, let's look at a topic for the entire history of the court. For instance, see how cases related to violent crimes and the death penalty just exploded between 1970 and 1990. There are many conclusions you can draw about what was happening in our country during this time period from seeing this trend. 

Now, I'd like to show you a trend that's evident when you zoom in on a certain time period. The box is a bit in our way here (on the actual viz you can drag the area from side to side, so it's more clear there), but we can see that bankruptcy cases, while consistently a thing, became extremely prevalent in the Great Depression era. 

The last thing I'd like to point out is for you to choose your own adventure. I've pointed out two peaks in the number of cases in a given year. The first peak is 1850. The second peak is the greatest number of cases in the entire court history: 1967. Go and explore these years and learn what kinds of things were happening in American history!

Next steps and other use cases

I shared this on Facebook a few weeks ago and was overwhelmed with the positive response it received, demonstrating that there's a real need here. I have already agreed to make a version of this for the Irish Supreme Court, and would love to do this for more countries.

Currently, there are a few things I'd really love to add to the project and viz. For the sake of time before presenting this, I only included a single example case for each topic, which is a real weakness in that it limits how much you can learn. I'd really love for you to be able to click through to all the cases in a topic group that look interesting to you. Additionally, I'd like to add a toggle on/off feature for the way the vote swung (i.e., for plaintiff or defendant). I'm not interested in adding a conservative/liberal element, in order to stay true to the 'opinion-free version of the Supreme Court' theme. Finally, I'd love to add a way to see landmark cases clearly, to see how those tend to impact case trends.

There are a ton of other ways this could be used both in terms of the natural language processing and visualization. I'd love to see more people use this style of viz for share-to-total over time, which is why I've made my code publicly available on github. (I use the term 'my' loosely here, since the majority of it was built off of various Bostock bl.ocks anyway).

NMF appears to be relatively new to topic modeling and there aren't a ton of great references to its API. However, given its apparent power in cutting through millions of words to get to the true essence of the documents, I think we'll be seeing more of it in the future.

If you made it all the way to here, you are a hero among TL;DR internet attention spans. bravo and thanks for reading! ✌️🏁

Better Know a Justice

Supreme Court data is bigger than Justice Taft's bathtub

Note: This is a project post, so if you're not interested in learning how I did this, just scroll down and mouse over the viz 💁 (and then scroll back up and read this when you realize you need an explanation of what you're looking at).

The Supreme Court is a very complex and difficult-to-understand topic for most, but is arguably the most important government entity for determining the direction of this country. It's difficult to follow the news about the current nominee, but so important for the average American to grasp how the next justice will fare when he (or she!) takes a seat. 

The visualization below is the first iteration of an ongoing passion project in which I've gathered every Supreme Court opinion since 1790 (don't worry, I only took a sample of 10k opinions for this iteration), and am mapping the opinions in various ways in an attempt to simplify the court for those who want to gain an understanding of this topic without cracking open a history book. 


The Visual

What am I looking at?   Every Supreme Court justice in the history of the court grouped by the similarity of their speech patterns. The size of each of their bubbles represents the uniqueness of their own speech. Hover over each bubble for more info!

How did you do this?    For the data science part of this project, I used Term Frequency-Inverse Document Frequency (TF-IDF), which means that unique words a justice shares with another justice are strongly up-weighted, while very common words that all justices say all the time are strongly down-weighted (for example, "habeas" holds more weight than "the").

The visualization pulls from this analysis to group similar justices into associated bubbles (I utilized a K-means clustering algorithm, if you want to get technical)  - the code for all of this is in various Jupyter Notebooks in this repo. The visualization was made with D3 - my code for this visualization can be found here.
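If you want the gist without digging through the notebooks, here's a minimal sketch of the idea: TF-IDF over each justice's concatenated writings, then K-means. The justice names, placeholder text, and cluster count are illustrative, not the project's actual settings.

```python
# Minimal sketch of the justice-clustering idea: TF-IDF over each justice's
# concatenated opinions, then K-means. Names, text, and k are placeholders.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

justice_texts = {
    "Roberts": "... all opinions authored by Roberts, concatenated ...",
    "Garland": "... placeholder writings ...",
    "Taft": "... placeholder writings ...",
}

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(justice_texts.values())

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for justice, label in zip(justice_texts, kmeans.labels_):
    print(f"{justice} -> cluster {label}")
```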

Why did you do this?    For this iteration of the project, I wanted to get a sense of similarities in speech pattern as a predictor for what we could expect a nominee to be like once he's on the court. These clusters demonstrate that Garland is more similar to Chief Justice Roberts in language patterns than any other justice currently on the court. 

What's next?   There are so many things I want to do with this, both with data science and dataviz. Next I will be pulling in voting record and mapping that against the opinion text. After this, I plan to do a timeline of cases, associated news stories and text summarization for each opinion.