Recent Talks and Resources

I have had the honor of presenting my Supreme Court project, as well as speaking on the impact of data science on law and government, a few times throughout the fall and winter (Open Data Science Conference West, Women Who Code, and most recently Demystifying Deep Learning and AI). I am continually impressed by how interested the data science community is in making a positive social impact, and I look forward to expanding on these talks in the future.

Resources

I wanted to share a few resources for finding civic data online. I will be updating this list periodically as I find more resources. 

  1. data.gov -- general and science oriented

  2. usa.gov/statistics -- census, social bureaus

  3. free.law -- court documents

  4. Human Rights Data Analysis Group -- international data

  5. National Archives -- military and ancestry records

  6. American Presidency Project -- actions of presidents

Demystifying Deep Learning Talk

A number of people requested that I put the slides from my Demystifying Deep Learning presentation online. If you have any lingering questions, please reach out to me through the contact form.

To those who came here to learn: pay special attention to the slides explaining non-negative matrix factorization - understanding how algorithms like this produce their outputs will help immensely when learning how to produce good models. 

GitHub repo for the Supreme Court project

Visualization -- please let me know if you use my code to visualize a new dataset, I'd love to see it!

Thank you! 🙌

Applying Data Science to the Supreme Court: Topic Modeling Over Time with NMF (and a D3.js bonus)

I was 12 when Gore went up against Bush in the presidential election. For weeks, the only thing anyone seemed to be talking about was "hanging chads." (Who even is Chad, I wondered.) And the nightly news wasn't complete without a cut of an old dude holding up a ballot and peering into its negative space, inexplicably pointing out spots to another old dude. Then suddenly, the election was over and I didn't understand why. What was this "Supreme" Court, and who gave them the power to decide our election?

This experience was the inspiration for my final project for Metis. The Supreme Court is arguably the most important branch of government for guiding our future, but it's incredibly difficult for the average American to get a grasp of what's happening. I decided that a good start in closing this gap would be to model topics over time and create an interactive visualization that anyone with an interest and an internet connection can use to educate themselves.

If you'd like to skip the explanation and get straight to the JavaScript, click on the image below. I'll go over a few interesting points at the end of this post, but I'd love to hear from you about your experience studying this graph.

How was this done?

The pipeline is below. All code for this project is here. You'll find several IPython notebooks of web-scraping code I wrote in this repo. Supreme Court cases are publicly available, but it's not as though you can find them in a spreadsheet!
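If you're curious what the scraping step looks like, here's a minimal sketch with requests and BeautifulSoup. The URL and the CSS class are placeholders, not the actual source or selectors from my notebooks.

```python
# Minimal web-scraping sketch -- the URL and the "opinion-text" selector are
# hypothetical placeholders, not the real source used in the project notebooks.
import requests
from bs4 import BeautifulSoup

def scrape_opinion(url):
    """Download one opinion page and pull out its plain text."""
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    body = soup.find("div", class_="opinion-text")  # assumed page structure
    return body.get_text(separator=" ", strip=True) if body else ""

text = scrape_opinion("https://example.com/opinions/1")  # placeholder URL
```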

Natural Language Processing: a rundown

With natural language processing, we have a pile of documents (that's Supreme Court cases in this project), and we need to get to their true essence.

Most words aren't helpful in this process, so we drop them (stopwords). We also know that words like "liking" are really the same as "like" in this context (shh don't tell my literature professors from college I said that), so we lemmatize, which means we replace all those -ings with their roots.
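As a rough sketch of that preprocessing step (I'm using NLTK here for illustration; the exact pipeline lives in the notebooks):

```python
# Preprocessing sketch with NLTK: drop stopwords, then lemmatize what's left.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = [tok for tok in text.lower().split() if tok.isalpha()]
    # pos="v" so verb forms like "liking" collapse to their root "like"
    return [lemmatizer.lemmatize(tok, pos="v")
            for tok in tokens if tok not in stop_words]

print(preprocess("the court was liking the arguments"))
```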

After this, we have a few choices. We need to turn the words into "vectors" (a fancy term for lists of numbers, really) and use those vectors to inform our topic groups. The simplest form of this is Count Vectorizer (takes a count of words in the document as if they were differently colored marbles in a jar) - more on this and why I didn't use it later. The process I used was TF-IDF, or Term Frequency-Inverse Document Frequency. This takes that simple counting style and downweights words that are common across all documents, while upweighting unique words.
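In sklearn terms, the difference looks roughly like this (a toy sketch, not my project code):

```python
# Count vectors vs. TF-IDF vectors in sklearn, on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the court held the statute unconstitutional",
    "the court affirmed the bankruptcy ruling",
    "the defendant appealed the habeas petition",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)  # marbles in a jar
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)   # counts, reweighted

# Words that show up everywhere ("court") get downweighted by TF-IDF,
# while rarer, distinctive words ("habeas") keep more of their weight.
```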

From here, paths diverge based on what the project is. Herein lies my explanation for why I chose NMF (non-negative matrix factorization).

Why NMF and not LDA or LSA?

This particular IPython notebook is a great tool to follow along with this section. LDA was the obvious choice to try first, as is evident when you google "topic modeling algorithm." It's Bayesian-based and returns an actual probability that a document belongs in a topic. But every iteration I tried had it pulling out the legal terms but not the true essence of the cases themselves. It turns out that LDA is known to be not great at pulling true meaning out of documents with similar language (such as 22,000 cases with a ton of legal-term overlap). I wanted to know why, and realized that LDA works on raw word counts rather than TF-IDF weights, which means that if your documents share a lot of language, LDA will likely miss the important words because they don't occur as often.

I won't go much into LSA here as its performance was so bad (documented in my ipython notebooks) that I didn't get very far with it.

Then I read about non-negative matrix factorization (NMF) and found that in use cases similar to mine, its robustness far surpassed LDA. NMF extracts latent features via matrix decomposition, and you can feed it TF-IDF, which is a huge plus.
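To make the comparison concrete, here's a rough sklearn sketch of both approaches on a toy corpus (the real notebooks have the actual parameters and the full 22,000-case corpus):

```python
# LDA vs. NMF sketch: LDA takes raw counts, NMF happily takes TF-IDF weights.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [
    "the court held the death penalty statute unconstitutional",
    "the defendant challenged the death penalty sentence",
    "the bankruptcy court discharged the debtor's claims",
    "creditors appealed the bankruptcy discharge ruling",
]

# LDA: returns a probability distribution over topics per document,
# but it works on raw counts, so shared legal boilerplate dominates.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda_doc_topics = LatentDirichletAllocation(
    n_components=2, random_state=42).fit_transform(counts)  # each row sums to 1

# NMF: factorizes the TF-IDF matrix, so rare-but-distinctive words carry weight.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs)
nmf = NMF(n_components=2, random_state=42)  # use many more components on a real corpus
nmf_doc_topics = nmf.fit_transform(tfidf)   # documents x topics

# nmf.components_ is (topics x vocabulary); the biggest weights name each topic.
words = tfidf_vec.get_feature_names_out()
for i, topic in enumerate(nmf.components_):
    top = [words[j] for j in topic.argsort()[::-1][:5]]
    print(f"Topic {i}: {', '.join(top)}")
```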

The making of the visualization

There are differing ideas as to how viz fits into data science. As far as I'm concerned, it's integral to sealing the deal on the pipeline. With this particular project, it simply wouldn't be enough to say "I did topic modeling of the Supreme Court" with a hard stop. I knew I needed to present this in a way that would be fun and as simple as possible. As my concept of the "how" progressed, my project revealed itself to have a strong story with time as a factor, so I decided to present the topics as an area chart you could click and drag, with the x-axis as time.

Before I go any further, I wanted to say a special thanks to Metis TA Ramesh Sampath for helping me get my ideas into JavaScript, as I was still pretty new to D3.js at the time.

Here are a few interesting points I noticed once I got my data into my viz. My comments for each are in the captions.

First, let's look at a topic across the entire history of the court. For instance, see how cases related to violent crimes and the death penalty just exploded between 1970 and 1990. There are many conclusions you can draw about what was happening in our country during this time period from seeing this trend.

Now, I'd like to show you a trend that's evident when you zoom in on a certain time period. The box is a bit in our way here (on the actual viz you can drag the area from side to side, so it's more clear there), but we can see that bankruptcy cases, while consistently a thing, became extremely prevalent in the Great Depression era.

The last thing I'd like to point out is a choose-your-own-adventure. I've pointed out two peaks in the number of cases in a given year. The first peak is 1850. The second peak is the greatest number of cases in the entire court history: 1967. Go explore these years and learn what kinds of things were happening in American history!

Next steps and other use cases

I shared this on Facebook a few weeks ago and was overwhelmed with the positive response it received, demonstrating that there's a real need here. I have already agreed to make a version of this for the Irish Supreme Court, and would love to do this for more countries.

Currently, there are a few things I'd really love to add to the project and viz. For the sake of time before presenting this, I only did a single example case for each topic. This is a real weakness in that it limits how much you can learn. I'd love for you to be able to click through to all the cases in a topic group that looks interesting to you. I'd also like to add a toggle on/off feature for the way the vote swung (i.e., for plaintiff or defendant). I'm not interested in adding a conservative/liberal element, in order to stay true to the 'opinion-free version of the Supreme Court' theme. Finally, I'd love to add a way to see landmark cases clearly, to see how those tend to impact case trends.

There are a ton of other ways this could be used, both in terms of the natural language processing and the visualization. I'd love to see more people use this style of viz for share-to-total over time, which is why I've made my code publicly available on GitHub. (I use the term 'my' loosely here, since the majority of it was built off of various Bostock bl.ocks anyway.)

NMF appears to be relatively new to topic modeling and there aren't a ton of great references for its API. However, given its apparent power in cutting through millions of words to get to the true essence of the documents, I think we'll be seeing more of it in the future.

If you made it all the way here, you are a hero among TL;DR internet attention spans. Bravo, and thanks for reading! ✌️🏁

Better Know a Justice

Supreme Court data is bigger than Justice Taft's bathtub

Note: This is a project post, so if you're not interested in learning how I did this, just scroll down and mouse over the viz 💁 (and then scroll back up and read this when you realize you need an explanation of what you're looking at).

The Supreme Court is a complex and difficult-to-understand topic for most, but it is arguably the most important government entity for determining the direction of this country. It's difficult to follow the news about the current nominee, but so important for the average American to grasp how the next justice will fare when he (or she!) takes a seat.

The visualization below is the first iteration of an ongoing passion project in which I've gathered every Supreme Court opinion since 1790 (don't worry, I only took a sample of 10k opinions for this iteration) and am mapping the opinions in various ways, in an attempt to simplify the court for those who want to gain an understanding of this topic without cracking open a history book.


The Visual

What am I looking at?   Every Supreme Court justice in the history of the court grouped by the similarity of their speech patterns. The size of each of their bubbles represents the uniqueness of their own speech. Hover over each bubble for more info!

How did you do this?    For the data science part of this project, I used Term Frequency-Inverse Document Frequency (TF-IDF), which means that the distinctive words a justice uses are strongly up-weighted, while very common words that all justices say all the time are strongly down-weighted (for example, "habeas" holds more weight than "the").

The visualization pulls from this analysis to group similar justices into associated bubbles (I utilized a K-means clustering algorithm, if you want to get technical)  - the code for all of this is in various Jupyter Notebooks in this repo. The visualization was made with D3 - my code for this visualization can be found here.
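If you want the gist without opening the notebooks, the core of it looks something like this -- a sketch with made-up inputs and an arbitrary cluster count, not my exact code:

```python
# Sketch: TF-IDF over each justice's combined opinion text, then K-means clustering.
# The justice names, texts, and n_clusters below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

justice_texts = {
    "Justice A": "habeas corpus petition denied on procedural grounds ...",
    "Justice B": "commerce clause regulation upheld by the majority ...",
    "Justice C": "habeas relief granted and the conviction vacated ...",
}

names = list(justice_texts)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(justice_texts.values())

labels = KMeans(n_clusters=2, random_state=42).fit_predict(tfidf)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```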

Why did you do this?    For this iteration of the project, I wanted to get a sense of similarities in speech pattern as a predictor for what we could expect a nominee to be like once he's on the court. These clusters demonstrate that Garland is more similar to Chief Justice Roberts in language patterns than any other justice currently on the court. 

What's next?   There are so many things I want to do with this, both with data science and dataviz. Next I will be pulling in voting record and mapping that against the opinion text. After this, I plan to do a timeline of cases, associated news stories and text summarization for each opinion.

 

 

The Making of a Cheatsheet: Emoji Edition

I've mentioned this before, but I really love emoji. I spend so much of my time communicating with friends and family on chat that emoji bring necessary animation to my words that might otherwise look flat on the screen. 💁

Another thing I love is data science. The more I learn about machine learning algorithms, the more challenging it is to keep these subjects organized in my brain to recall at a later time. So, I decided to marry these two loves in as productive a fashion as possible.

(Obvious caveat: this is by no means a comprehensive guide to machine learning, but rather a study in the basics for myself and the likely small overlap of people who like machine learning and love emoji as much as I do).

All code snippets use the (extremely dope) Python library sklearn. This sheet was made the manual way, with Photoshop.

I didn't set out to make a cheatsheet, really. Nor did I set out to make an emoji cheatsheet. But a few things in my research on this subject led me down this path:

  1. There aren't that many (good) machine learning cheatsheets that I could find. If you have one please share it!
  2. The machine learning cheatsheets I did find did nothing to demystify how to actually use the algorithms.
  3. The cheatsheets I found weren't fun to look at! I'm a highly visual person and having a box for regression, a box for classification and a box for clustering makes a lot of sense to me.

the details

The making of this sheet was much like building a model. I initially thought it would make most sense to split the algorithms up based on type of learning, but realized the amount of overlap would make that impossible.

Once I decided to split them based on type, it became shockingly obvious how many classification algorithms there are. It makes sense: a huge share of data science problems are classification problems, and the volume of classifiers reflects that.

If you understand everything on this sheet, feel free to copy the image and leave this page forever. Otherwise, I'm going to walk through my logic below.

Note about the emoji: While my emoji choices are highly intentional, I'd prefer to leave my reasoning open-ended because you may gain a different understanding than I have for the choices, and I wouldn't want to spoil it for you.


a step-by-step guide for this sheet

Learning Styles

Initially, I just had the dots for supervised, unsupervised and reinforcement, but it was requested that I add a box for what makes each different. Like many fancy words (such as stochastic), supervised and unsupervised learning are tossed around like no one's business. Make sure you know what they mean when you dish them out. 

Regressions

Lots of things are called "fundamental data science", but these things really are. Linear regression especially is something you learn constantly in other contexts and may not realize is used in data science.
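As a reminder of how little ceremony this takes in sklearn (a minimal sketch on toy numbers):

```python
# Minimal linear regression sketch with sklearn.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])   # one feature
y = np.array([2.1, 3.9, 6.2, 8.1])   # roughly y = 2x

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # learned slope and intercept
print(model.predict([[5]]))           # prediction for a new point
```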

 

Classification

I've done a lot of visual design, and this may be one of the pieces I'm most proud of. I think this cell communicates a lot of important information and really makes sense of what I was trying to accomplish. Starting from the learning-style categories, it's clear that the neural net is the Queen Bey of complicated algorithms, but with great power comes great responsibility.

The emoji choices throughout this sheet were very intentional. I could explain my logic for each, but we might be here all day. Random forest is really the least serious of them (but also my favorite). 

Also, shoutout to Naive Bayes for having at least three ways to call it from sklearn. 
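For the record, here are those three -- which one you reach for depends on what your features look like:

```python
# Naive Bayes, three ways in sklearn -- pick the flavor that matches your features.
from sklearn.naive_bayes import GaussianNB      # continuous features
from sklearn.naive_bayes import MultinomialNB   # count features (e.g. word counts)
from sklearn.naive_bayes import BernoulliNB     # binary features

X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [1, 0, 1, 0]

for Model in (GaussianNB, MultinomialNB, BernoulliNB):
    clf = Model().fit(X, y)
    print(Model.__name__, clf.predict([[1, 1]]))
```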

 

Clustering

Clustering is an extremely useful subset of data science that's like classification, but not quite. Therefore, it needs its own cell. With teddy bears.
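The "not quite" is mostly about labels: a classifier needs labeled examples, while a clusterer finds groups on its own. A tiny K-means sketch to make the point:

```python
# Clustering sketch: no labels go in -- K-means finds the groups itself.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],    # one blob
                   [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])   # another blob

labels = KMeans(n_clusters=2, random_state=0).fit_predict(points)
print(labels)  # cluster ids the algorithm invented, not pre-defined classes
```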

 

The Curse of Dimensionality

I added this section because the more research I did on the algorithms themselves, the more I realized that feature reduction is mega key to making any of them work. I have experienced this during projects and I would be surprised if any data scientist hadn't.

Note: tradeoff between calling t-SNE and possibly forgetting what the acronym stands for, and actually spelling "neighbor" correctly. 
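Here's a quick sketch of what feature reduction looks like in practice -- PCA to shrink the feature space, t-SNE to squash it down to two dimensions for plotting (the data and parameters are stand-ins):

```python
# Dimensionality reduction sketch: PCA first, then t-SNE for a 2-D picture.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(200, 50)  # 200 samples, 50 features (stand-in data)

X_pca = PCA(n_components=10).fit_transform(X)                     # keep the 10 strongest directions
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_pca)  # squash to 2-D

print(X_2d.shape)  # (200, 2) -- ready to scatter-plot
```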

 

Our * Wildcard * Section

There are a few important things in data science that would each have ended up with their own section in a perfect world where a 3-D cheatsheet is a thing.

1. Bias-variance tradeoff is the most baseline element of what makes data science an art - you will have to strike a balance between flexible models that chase noise (high variance) and simpler models that miss some of the signal (high bias).

2. Underfitting/overfitting - this is closely related to the bias-variance tradeoff: you need enough data (and enough restraint) to avoid overfitting a model, while keeping the model descriptive enough to generalize. See the quick sketch after this list.

3. Inertia - in clustering, the within-cluster sum of squared distances; in its simplest form, a measure of how tight your clusters are.

4.  We talk about these four items a lot with classification - I thought it was important to know what we're really talking about.
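For the overfitting point, the simplest safeguard is holding out a test set and comparing scores -- a big gap between training and test accuracy is the tell. A minimal sketch:

```python
# Overfitting check sketch: compare training accuracy to held-out test accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier().fit(X_train, y_train)   # unpruned trees overfit easily
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy:", tree.score(X_test, y_test))     # a big drop here means overfitting
```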

 

So that's that! Please feel free to use this cheatsheet yourself or share it with others who might find it helpful. Please credit me wherever appropriate.

If you spot errors or think I've left something important out, debate with me about it on Twitter or send me an email.