I recently did a presentation for my classmates at Metis that was so fun, I had to turn it into a blog post. Ever since college, where I was frequently shown worldwide data in which at least 5 key nations had 'no data' as their data point, I've been very interested in the topic of visualizing no data. As Data Scientists, we must regularly decide how to handle a lack of data (one option being my favorite thing KLQ has said during a lecture: "close your eyes and pretend it's not there"). So why aren't there more resources, articles and research on the topic of visualizing this phenomenon?
So, I'm going to tell you my favorite story about visualizing a lack of data. This will be more interesting than it sounds, I promise. I've armed myself with GIFs to keep you interested.
1854, London. (Oh dear reader, did I lose you already?) Dr. John Snow is assigned the saddening task of visiting each building in the Broad Street vicinity in order to take a tally of all of the people who perished during the Cholera outbreak. Up until now, it's been widely accepted that Cholera is spread through 'bad air' (whatever that means). John Snow hypothesizes that Cholera is in fact spread through water, but everyone is like
Vindication Through Visualization
So John Snow walks around the Broad Street region and talks to the residents, taking a tally of all deaths. He keeps a record of this on a map (the one that's flashing at you on this page, to be exact) marking a tick for each death.
Do you notice anything peculiar about this map? For the sake of time, I've circled it in green. Because John Snow created a data visualization for the Cholera outbreak, he was easily able to notice that no one in the brewery died of Cholera, despite the brewery's proximity to the neighborhood water pump. John Snow is a curious fellow, so he asks the brewers about their water-drinking habits, and I imagine they drunkenly laughed at him, since every worker at the brewery got a daily ration of beer and thus didn't bother with the local water pump.
I won't go into details about how the Cholera got there in the first place (because it's super gross and off-topic, but still worth mentioning for context), but it is well-documented here.
So now John Snow is vindicated, solidified in history for standing by his instincts, and no one cares to ask for the names of his dissenters.
Applying this to modern data science
It's easy to think of a lot of ways John Snow is important to modern statistics - and it's obvious considering his data visualization is tossed around social science classrooms more often than hacky sacks. If you can see past the Cholera part of his work, it's possible to apply this strategy to anything in data science.
Data visualization as an exploration
When we first receive a dataset, we can visualize its relationships to look for outliers and to surface patterns we likely wouldn't see without a visualization. Sometimes the pattern is the absence of data.
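As a rough sketch of what that first exploratory look could be, here is a tiny text-based "missingness map" in plain Python. Everything in it is an assumption for illustration: the function names `missing_counts` and `missingness_grid` are made up, and the three country rows are invented, not real data. It draws one line per row, with `#` where a value exists and `.` where it is missing, so the gaps jump out visually.

```python
def missing_counts(rows, columns):
    """Count how many rows have no value (None) for each column."""
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}


def missingness_grid(rows, columns):
    """Return one string per row: '#' for a present value, '.' for a missing one."""
    return [
        "".join("." if row.get(col) is None else "#" for col in columns)
        for row in rows
    ]


# Invented example rows (hypothetical countries, not real figures).
columns = ["population", "gdp", "literacy"]
rows = [
    {"country": "A", "population": 10, "gdp": 5.0, "literacy": None},
    {"country": "B", "population": None, "gdp": None, "literacy": 0.9},
    {"country": "C", "population": 7, "gdp": 2.5, "literacy": 0.8},
]

counts = missing_counts(rows, columns)
grid = missingness_grid(rows, columns)

for row, line in zip(rows, grid):
    print(row["country"], line)
```

In a real project you would probably reach for a plotting library instead of printed characters, but the idea is the same: plot the holes, not just the values.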
Data visualization as an explanation
We can provide context to a dataset's missing data through a visualization.
I recently did a project in which my team created a learning algorithm to categorize tweets based on a corpus of Yelp data (more on this later). Our algorithm proved extremely accurate against Yelp holdout data, but through rough human-versus-algorithm testing (we harassed our classmates into reading tweets about Red Lobster), we found that its accuracy dropped significantly on tweets. There are a lot of theoretical explanations for this, but visualizing the difference in vocabulary among the top 100 words from each corpus provided the context that was missing in our analysis.
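For the curious, a comparison like that can be sketched in a few lines of Python. This is not our project code, just a minimal illustration under some assumptions: the helper names `top_words` and `vocabulary_gap` are hypothetical, the tokenizer is a deliberately crude regex, and the two tiny corpora are invented stand-ins for Yelp reviews and tweets.

```python
import re
from collections import Counter


def top_words(texts, n=100):
    """Return the n most frequent lowercase words across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [word for word, _ in counts.most_common(n)]


def vocabulary_gap(corpus_a, corpus_b, n=100):
    """Split the top-n words of two corpora into shared and corpus-only sets."""
    top_a = set(top_words(corpus_a, n))
    top_b = set(top_words(corpus_b, n))
    return {
        "shared": sorted(top_a & top_b),
        "only_a": sorted(top_a - top_b),
        "only_b": sorted(top_b - top_a),
    }


# Invented stand-in corpora, just to show the shape of the output.
yelp_like = ["The lobster was great", "Great service"]
tweet_like = ["lol red lobster tho", "great biscuits lol"]

gap = vocabulary_gap(yelp_like, tweet_like, n=100)
print(gap)
```

The words that land in `only_b` are the ones the model never really saw during training, and charting the three groups side by side is one way to make that gap visible.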
Note: An earlier version of this post did not credit the person who pointed out the excellent Game of Thrones reference, Rumman.