Visualizing Null Data (in GIFs)

I recently did a presentation for my classmates at Metis that was so fun, I had to turn it into a blog post. Ever since college, where I was frequently shown worldwide data in which at least 5 key nations had 'no data' as their data point, I've been very interested in the topic of visualizing no data. As Data Scientists, we must regularly decide how to handle a lack of data (my favorite approach being something KLQ said during a lecture: "close your eyes and pretend it's not there"). So why aren't there more resources, articles and research on visualizing this phenomenon?

So, I'm going to tell you my favorite story about visualizing a lack of data. This will be more interesting than it sounds, I promise. I've armed myself with GIFs to keep you interested.

1854, London. (Oh dear reader, did I lose you already?) Dr. John Snow is assigned the saddening task of visiting each building in the Broad Street vicinity in order to take a tally of all of the people who perished during the Cholera outbreak. At this point, it's widely accepted that Cholera is spread through 'bad air' (whatever that means). John Snow hypothesizes that Cholera is in fact spread through water, but everyone is like

John Snow


Vindication Through Visualization

So John Snow walks around the Broad Street region and talks to the residents, taking a tally of all deaths. He keeps a record of this on a map (the one that's flashing at you on this page, to be exact) marking a tick for each death.

Do you notice anything peculiar about this map? For the sake of time, I've circled it in green. Because John Snow created a data visualization for the Cholera outbreak, he was easily able to notice that no one in the brewery died of Cholera, despite the brewery's proximity to the neighborhood water pump. John Snow is a curious fellow, so he asks the brewers about their water drinking habits, and I imagine they drunkenly laughed at him, since every worker at the brewery got a daily ration of beer and thus didn't bother with their local water pump.

🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻

John Snow dealwithit

I won't go into details about how the Cholera got there in the first place (because it's super gross and off-topic, but still worth mentioning for context), but it is well-documented here.

So now John Snow is vindicated, solidified in history for standing by his instincts, and no one cares to ask for the names of his dissenters.

 

Applying this to modern data science

It's easy to think of a lot of ways John Snow is important to modern statistics - and it's obvious considering his data visualization is tossed around social science classrooms more often than hacky sacks. If you can see past the Cholera part of his work, it's possible to apply this strategy to anything in data science.

Data visualization as an exploration

When we first receive a dataset, we can visualize the relationships in it to look for outliers and to surface patterns we likely wouldn't see without a visualization. Sometimes the pattern is a lack of data.
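
If you want to try this on your own dataset, here is a minimal sketch of a "missingness map" in plain Python. It is not anything John Snow (or I) actually used; the records, the column names and the two helper functions are all made up for illustration, and missing values are assumed to be stored as None.

```python
def missing_summary(rows, columns):
    """Count how many rows have no value (None) in each column."""
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}


def missing_grid(rows, columns):
    """Draw one text line per row: '#' means present, '.' means missing."""
    return [
        "".join("." if row.get(col) is None else "#" for col in columns)
        for row in rows
    ]


# Hypothetical toy records, invented purely for this example.
records = [
    {"region": "A", "population": 1200, "cases": 14},
    {"region": "B", "population": None, "cases": 3},
    {"region": "C", "population": 800, "cases": None},
    {"region": "D", "population": None, "cases": None},
]
columns = ["region", "population", "cases"]

summary = missing_summary(records, columns)
grid = missing_grid(records, columns)

print(summary)
for line in grid:
    print(line)
```

Each printed line is one record, so gaps that cluster in a column or in a group of rows stand out at a glance, the same way the empty brewery does on the map.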

Data visualization as an explanation

We can provide context to a dataset's missing data through a visualization.

I recently did a project in which my team created a learning algorithm to categorize tweets based on a corpus of Yelp data (more on this later). Our algorithm proved extremely accurate against Yelp holdout data, but through rough human vs. algorithm testing (we harassed our classmates into reading tweets about Red Lobster), we found that our accuracy dropped significantly on tweets. There are a lot of theoretical explanations for this, but visualizing the difference in vocabulary among the top 100 words from each corpus provided the context that was missing in our analysis.

This visualization was made with D3 (and is a work in progress). I removed stopwords (aside from me, she and he, as I feel those provide context for restaurant reviews). With this visualization, you can easily see the lack of overlap between the two corpora. Don't get too caught up on the words themselves - the most relevant aspect of this visualization is recognizing the difference in vocabularies.
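
For the curious, the comparison behind a chart like this can be sketched in a few lines of Python. To be clear, this is not the code from our project: the stopword list, the sample texts and the function names are all assumptions made up for illustration. The only details taken from above are the top-100 cutoff and keeping me, she and he.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list. "me", "she" and "he" are
# deliberately NOT in it, so they survive the filtering.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "was", "for", "on", "at", "with", "this", "that", "i", "my", "we",
}


def top_words(docs, n=100):
    """Return the set of the n most frequent non-stopword tokens in docs."""
    counts = Counter()
    for doc in docs:
        tokens = re.findall(r"[a-z']+", doc.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return {word for word, _ in counts.most_common(n)}


def vocabulary_overlap(corpus_a, corpus_b, n=100):
    """Compare the top-n vocabularies of two corpora."""
    top_a = top_words(corpus_a, n)
    top_b = top_words(corpus_b, n)
    shared = top_a & top_b
    union = top_a | top_b
    return {
        "shared": shared,
        "only_a": top_a - top_b,
        "only_b": top_b - top_a,
        # Jaccard similarity: shared words divided by all distinct words.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Made-up sample texts standing in for the two corpora.
yelp_reviews = [
    "The service was friendly and the biscuits were amazing",
    "Amazing lobster, friendly staff",
]
tweets = [
    "omg red lobster biscuits tho",
    "me at red lobster rn lol",
]

result = vocabulary_overlap(yelp_reviews, tweets)
print(sorted(result["shared"]))
print(round(result["jaccard"], 2))
```

A low Jaccard score here is the numeric version of what the chart shows: the words one corpus leans on are mostly missing from the other.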


Note: An earlier version of this post did not credit the person who pointed out the excellent Game of Thrones reference, Rumman.