Natural Language Processing: a rundown
With natural language processing, we have a pile of documents (Supreme Court cases, in this project), and we need to get to their true essence.
Many words aren't helpful in this process, so we drop them (these are called stopwords: "the," "of," "and," and friends). We also know that words like "liking" are really the same as "like" in this context (shh, don't tell my literature professors from college I said that), so we lemmatize, which means we replace inflected forms like those -ings with their dictionary roots.
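As a rough sketch (not my actual notebook code), the cleaning step might look something like this with NLTK; the function name and sample sentence are just illustrative:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads for the resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(document):
    """Lowercase, tokenize, drop stopwords/punctuation, and lemmatize."""
    lemmas = []
    for tok in word_tokenize(document.lower()):
        if not tok.isalpha() or tok in stop_words:
            continue
        # Lemmatize as a verb first ("liking" -> "like"),
        # then as a noun ("cases" -> "case").
        lemmas.append(lemmatizer.lemmatize(lemmatizer.lemmatize(tok, pos="v"), pos="n"))
    return lemmas

print(clean("The justices were liking the arguments in these cases"))
# -> ['justice', 'like', 'argument', 'case']
```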
After this, we have a few choices. We need to turn the words into "vectors" (a fancy term for lists of numbers, really) and use those vectors to inform our topic groups. The simplest form of this is CountVectorizer, which counts the words in each document as if they were differently colored marbles in a jar (more on this, and why I didn't use it, later). The process I used was TF-IDF, or Term Frequency-Inverse Document Frequency. It starts from that simple counting style, then downweights words that are common across all documents and upweights words that are distinctive to just a few.
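To make that concrete, here's a minimal sketch of both options in scikit-learn (the three toy documents are just stand-ins for the case texts):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the court affirmed the judgment",
    "the court reversed the judgment on appeal",
    "the patent claims were held invalid",
]

# CountVectorizer: every occurrence weighs the same, like counting marbles.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# TfidfVectorizer: words that show up everywhere ("court", "judgment") get
# downweighted; words unique to a few documents ("patent", "appeal") get
# upweighted.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

print(counts.shape, tfidf.shape)  # both: (number of documents, vocabulary size)
```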
From here, paths diverge based on what the project is. Herein lies my explanation of why I chose NMF (non-negative matrix factorization).
Why NMF and not LDA or LSA?
This particular IPython notebook is a great tool to follow along with this section. LDA was the obvious choice to try first, as is evident when you google "topic modeling algorithm." It's Bayesian, and it returns an actual probability that a document belongs to a topic. But every iteration I tried pulled out the legal terms rather than the true essence of the cases themselves. It turns out LDA is known to be not great at pulling meaning out of documents with similar language (such as 22,000 cases with a ton of legal-term overlap). I wanted to know why, and realized that you can't use TF-IDF with LDA: LDA's generative model assumes raw word counts, so there's no reweighting step. That means if your documents are similar, LDA will likely miss the important words, because they don't occur as often as the shared vocabulary.
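Here's a sketch of what that LDA setup looks like in scikit-learn; note that it consumes raw counts, not TF-IDF. The tiny corpus and two topics are stand-ins (the real runs used all the cases and many more topics):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

case_texts = [
    "the court affirmed the judgment of the lower court",
    "the court reversed the judgment on appeal",
    "the patent claims were held invalid by the court",
    "the defendant appealed the conviction to the court",
]

count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(case_texts)  # raw counts, as LDA expects

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row is a probability distribution over topics

# Top words per topic: with heavy vocabulary overlap, shared legal terms
# ("court", "judgment") tend to dominate every topic.
terms = count_vec.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(f"topic {i}: {', '.join(top)}")
```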
I won't go into LSA much here, as its performance was so bad (documented in my IPython notebooks) that I didn't get very far with it.
Then I read about non-negative matrix factorization (NMF) and found that in use cases similar to mine, its robustness far surpassed LDA's. NMF extracts latent features via matrix decomposition, and you can feed it TF-IDF, which is a huge plus.
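A sketch of the NMF route, this time on TF-IDF weights (toy corpus and topic count are illustrative, as above):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

case_texts = [
    "the court affirmed the judgment of the lower court",
    "the court reversed the judgment on appeal",
    "the patent claims were held invalid by the court",
    "the defendant appealed the conviction to the court",
]

tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(case_texts)

# NMF factors the (documents x terms) matrix V into two non-negative
# matrices, W (documents x topics) and H (topics x terms), with V ~ W @ H.
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(tfidf)  # document-topic weights (not probabilities)
H = nmf.components_           # topic-term weights

terms = tfidf_vec.get_feature_names_out()
for i, weights in enumerate(H):
    top = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(f"topic {i}: {', '.join(top)}")
```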
The making of the visualization
There are differing ideas as to how viz fits into data science. As far as I'm concerned, it's an integral part of sealing the deal on the pipeline. With this particular project, it simply wouldn't be enough to say "I did topic modeling of the Supreme Court" with a hard stop. I knew I needed to present this in a way that would be fun and as simple as possible. As my ideas about the "how" progressed, the project revealed itself to have a strong story of time as a factor, so I decided to present the topics as an area chart you could click and drag, with time on the x axis.
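The published chart is interactive, but the underlying shape is simple enough to sketch statically. Assuming a matrix of document-topic weights like the W from the NMF step and a decision year per case (both stand-ins below, not my real data), a matplotlib version might look like:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
W = rng.random((200, 5))                    # stand-in for NMF document-topic weights
years = rng.integers(1900, 2015, size=200)  # stand-in for each case's decision year

# Sum topic weight per year, then stack the topics along the time axis.
year_range = np.arange(years.min(), years.max() + 1)
by_year = np.array([W[years == y].sum(axis=0) for y in year_range])

plt.stackplot(year_range, by_year.T, labels=[f"topic {i}" for i in range(W.shape[1])])
plt.xlabel("year")
plt.ylabel("topic weight")
plt.legend(loc="upper left")
plt.show()
```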
Here are a few interesting points I noticed once I got my data into my viz. My comments for each are in the captions.