I was 12 when Gore went up against Bush in the presidential election. For weeks, the only thing anyone seemed to be talking about was "hanging chads." (Who even is Chad, I wondered.) And the nightly news wasn't complete without a cut of an old dude holding up a ballot and peering into its negative space, inexplicably pointing out spots to another old dude. Then suddenly, the election was over and I didn't understand why. What was this "Supreme" Court, and who gave them the power to decide our election?
This experience was the inspiration for my final project for Metis. The Supreme Court is arguably the most important branch of government for guiding our future, but it's incredibly difficult for the average American to get a grasp of what's happening. I decided that a good start in closing this gap would be to model topics over time and create an interactive visualization that anyone with an interest and an internet connection can use to educate themselves.
How was this done?
The pipeline is below. All code for this project is here. You'll find several IPython notebooks of web-scraping code I wrote in this repo. Supreme Court cases are publicly available, but it's not as though you can find them on a spreadsheet!
Natural Language Processing: a rundown
With natural language processing, we have a pile of documents (that's Supreme Court cases in this project), and we need to get to their true essence.
Most words aren't helpful in this process, so we drop them (these are called stopwords). We also know that words like "liking" are really the same as "like" in this context (shh, don't tell my literature professors from college I said that), so we lemmatize, which means we replace all those inflected -ing forms with their roots.
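The cleanup steps above can be sketched in a few lines of Python. This is a deliberately hand-rolled illustration: the stopword set and lemma table below are tiny made-up stand-ins (a real pipeline would use a library like NLTK or spaCy for both), but the shape of the step is the same:

```python
# Toy preprocessing sketch: stopword removal + a stand-in for lemmatization.
# STOPWORDS and LEMMAS here are tiny illustrative samples, not a real lexicon.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "was"}
LEMMAS = {"liking": "like", "rulings": "ruling", "held": "hold", "courts": "court"}

def preprocess(text: str) -> list[str]:
    tokens = text.lower().split()
    tokens = [t.strip(".,;:()") for t in tokens]              # crude punctuation strip
    tokens = [t for t in tokens if t and t not in STOPWORDS]  # drop stopwords
    return [LEMMAS.get(t, t) for t in tokens]                 # map inflected forms to roots

print(preprocess("The courts held a liking to the ruling."))
# ['court', 'hold', 'like', 'ruling']
```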
After this, we have a few choices. We need to turn the words into "vectors" (really just lists of numbers) and use those vectors to inform our topic groups. The simplest approach is a CountVectorizer, which counts the words in each document as if they were differently colored marbles in a jar; more on this and why I didn't use it later. The process I used was TFIDF, or Term Frequency-Inverse Document Frequency. This starts from that simple counting style, then downweights words that are common across all documents and upweights words that are unique to a few.
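The count-vs-TFIDF distinction can be shown in a few lines of scikit-learn. The three mini "cases" below are made up for illustration; the point is that a term appearing in every document (here, "court") gets a lower IDF weight than one appearing in a single document (here, "commerce"):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up mini "cases" sharing lots of legal boilerplate.
docs = [
    "the court held the statute unconstitutional",
    "the court affirmed the lower court ruling",
    "the court said the statute regulates commerce",
]

# Plain counts: every occurrence weighs the same, like marbles in a jar.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# TFIDF: starts from counts, then downweights terms common to all documents.
tfidf = TfidfVectorizer(stop_words="english")
weights = tfidf.fit_transform(docs)

vocab = tfidf.vocabulary_
# "court" appears in every document, so its IDF is lower than that of
# "commerce", which appears in only one.
assert tfidf.idf_[vocab["court"]] < tfidf.idf_[vocab["commerce"]]
```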
From here, paths diverge based on what the project is. Herein lies my explanation for why I chose NMF (non-negative matrix factorization).
Why NMF and not LDA or LSA?
This particular ipython notebook is a great tool to follow along with this section. LDA was the obvious choice to try first, as is evident when you google "topic modeling algorithm." It's Bayesian-based and returns an actual probability that a document belongs to a topic. But every iteration I tried had it pulling out the legal terms rather than the true essence of the cases themselves. It turns out that LDA is known to struggle at pulling real meaning out of documents that share similar language (such as 22,000 cases with a ton of legal-term overlap). I wanted to know why, and realized that LDA's generative model works on raw word counts, so you can't feed it TFIDF weights; that means if your documents are similar, LDA will likely not pick up the important words, because they don't occur as often.
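To make the counts-not-TFIDF constraint concrete: scikit-learn's LDA implementation expects a count matrix as input, and what it returns for each document is a probability distribution over topics. A minimal sketch (toy documents and an arbitrary topic count, not the project's real corpus):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the court held the statute unconstitutional",
    "the court affirmed the lower court ruling",
    "congress passed a statute on interstate commerce",
]

# LDA's generative model is over word *counts*, so we feed it CountVectorizer
# output rather than TFIDF weights.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)

# Each row is a proper probability distribution over the 2 topics.
print(doc_topic.shape)     # (3, 2)
print(doc_topic[0].sum())  # ~1.0
```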
I won't go much into LSA here as its performance was so bad (documented in my ipython notebooks) that I didn't get very far with it.
Then I read about Non-negative Matrix Factorization (NMF) and found that in use cases similar to mine, its robustness far surpassed LDA's. NMF extracts latent features via matrix decomposition, and it works with TFIDF, which is a huge plus.
The making of the visualization
There are differing ideas as to how viz fits into data science. As far as I'm concerned, it's an integral part of sealing the deal on the pipeline. With this particular project, it simply wouldn't be enough to say "I did topic modeling of the Supreme Court" with a hard stop. I knew I needed to present this in a way that would be fun and as simple as possible. As my concepts of the "how" progressed, my project revealed itself to have a strong story with time as a factor, so I decided to present the topics as an area chart you could click and drag, with time on the x-axis.
Here are a few interesting points I noticed once I got my data into my viz. My comments for each are in the captions.
Next steps and other use cases
I shared this on Facebook a few weeks ago and was overwhelmed with the positive response it received, demonstrating that there's a real need here. I have already agreed to make a version of this for the Irish Supreme Court, and would love to do this for more countries.
Currently, there are a few things I'd really love to add to the project and viz. For the sake of time before presenting this, I only did a single example case for each topic. This is a real weakness in that it limits how much you can learn. I'd love for you to be able to click through to all the cases in a topic group that look interesting to you. Additionally, I'd like to add an on/off toggle for the way the vote swung (i.e., for plaintiff or defendant). I'm not interested in adding a conservative/liberal element, in order to stay true to the 'opinion-free version of the Supreme Court' theme. Finally, I'd love to add a way to see landmark cases clearly, to see how those tend to impact case trends.
There are a ton of other ways this could be used, both in terms of the natural language processing and the visualization. I'd love to see more people use this style of viz for share-to-total over time, which is why I've made my code publicly available on GitHub. (I use the term 'my' loosely here, since the majority of it was built off of various Bostock bl.ocks anyway.)
NMF appears to be relatively new to topic modeling, and there aren't a ton of great references to its API. However, given its apparent power in cutting through millions of words to get at the true essence of documents, I think we'll be seeing more of it in the future.
If you made it all the way to here, you are a hero among TL;DR internet attention spans. Bravo, and thanks for reading! ✌️🏁