The Making of a Cheatsheet: Emoji Edition

I've mentioned this before, but I really love emoji. I spend so much of my time communicating with friends and family on chat that emoji bring necessary animation to my words that might otherwise look flat on the screen. 💁

Another thing I love is data science. The more I learn about machine learning algorithms, the more challenging it is to keep these subjects organized in my brain to recall at a later time. So, I decided to marry these two loves in as productive a fashion as possible.

(Obvious caveat: this is by no means a comprehensive guide to machine learning, but rather a study in the basics for myself and the likely small overlap of people who like machine learning and love emoji as much as I do).

All code snippets are in the (extremely dope) Python library sklearn. This sheet was made the manual way, with Photoshop.

I didn't set out to make a cheatsheet, really. Nor did I set out to make an emoji cheatsheet. But a few things in my research on this subject led me down this path:

  1. There aren't that many (good) machine learning cheatsheets that I could find. If you have one please share it!
  2. The machine learning cheatsheets I did find did nothing to demystify how to actually use the algorithms.
  3. The cheatsheets I found weren't fun to look at! I'm a highly visual person and having a box for regression, a box for classification and a box for clustering makes a lot of sense to me.

the details

The making of this sheet was much like building a model. I initially thought it would make most sense to split the algorithms up based on type of learning, but realized the amount of overlap would make that impossible.

Once I decided to split them based on type, it became shockingly obvious how many classification algorithms there are. The sheer volume of classifiers compared to the other types suggests that a huge share of data science problems are classification problems.

If you understand everything on this sheet, feel free to copy the image and leave this page forever. Otherwise, I'm going to walk through my logic below.

Note about the emoji: While my emoji choices are highly intentional, I'd prefer to leave my reasoning open-ended because you may gain a different understanding than I have for the choices, and I wouldn't want to spoil it for you.


a step-by-step guide for this sheet

Learning Styles

Initially, I just had the dots for supervised, unsupervised and reinforcement, but it was requested that I add a box for what makes each different. Like many fancy words (such as stochastic), supervised and unsupervised learning are tossed around like no one's business. Make sure you know what they mean when you dish them out. 

Regressions

Lots of things get called "fundamental data science," but these really are. Linear regression in particular is something you learn constantly in other contexts, possibly without realizing it's also used in data science.
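To make the sklearn snippets on the sheet concrete, here's a minimal linear regression sketch. The data is made up by me for illustration (roughly y = 2x + 1 with a little noise), not from the sheet:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 2x + 1 with a little noise
X = np.array([[0], [1], [2], [3], [4]])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)  # slope ≈ 2, intercept ≈ 1
```

After fitting, `coef_` and `intercept_` hold the learned slope and intercept, and `model.predict` gives you values for new x's.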

 

Classification

I've done a lot of visual design, and this may be the cell I'm most proud of. I think it communicates a lot of important information and really makes sense of what I was trying to accomplish. Starting from the categories of learning styles, it's clear that the neural net is queen bey of complicated algorithms - but with great power comes great responsibility.

The emoji choices throughout this sheet were very intentional. I could explain my logic for each, but we might be here all day. Random forest is really the least serious of them (but also my favorite). 

Also, shoutout to Naive Bayes for having at least three ways to call it from sklearn. 
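Speaking of which, here's what those three flavors look like side by side - a quick sketch with toy data I made up, where each variant assumes a different kind of feature:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy data: two classes, two features
X = np.array([[1.0, 2.0], [1.2, 1.9], [3.0, 4.0], [3.1, 4.2]])
y = np.array([0, 0, 1, 1])

# Gaussian: continuous features
g_pred = GaussianNB().fit(X, y).predict([[1.1, 2.0]])  # → [0]

# Multinomial: count features (think word counts)
counts = np.array([[3, 0], [4, 1], [0, 5], [1, 6]])
m_pred = MultinomialNB().fit(counts, y).predict([[4, 0]])  # → [0]

# Bernoulli: binary present/absent features
binary = (counts > 0).astype(int)
b_pred = BernoulliNB().fit(binary, y).predict([[1, 0]])  # → [0]
```

Same `.fit`/`.predict` interface every time - the difference is what distribution each one assumes your features follow.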

 

Clustering

Clustering is an extremely useful subset of data science that's like classification, but not quite. Therefore, it needs its own cell. With teddy bears.
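For instance, a minimal KMeans sketch on two made-up blobs of points (my toy data, not the sheet's):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of points
X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
              [8, 8], [8.1, 7.9], [7.8, 8.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # a cluster label for each point
print(km.cluster_centers_)  # the center of each cluster
```

Unlike the classifiers above, there's no `y` here - the algorithm finds the groups on its own, which is the whole unsupervised point.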

 

The Curse of Dimensionality

I added this section because the more research I did on the algorithms themselves, the more I realized that feature reduction is mega key to making any of them work. I have experienced this during projects and I would be surprised if any data scientist hadn't.

Note: tradeoff between calling t-SNE and possibly forgetting what the acronym stands for, and actually spelling "neighbor" correctly. 
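A quick sketch of both reducers, on random data I generated just for illustration (most of its variance lives in the first two dimensions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# 50 samples in 10 dimensions, but most variance lives in 2 of them
X = rng.normal(size=(50, 10))
X[:, 2:] *= 0.01

# PCA: linear projection onto the directions of highest variance
X_pca = PCA(n_components=2).fit_transform(X)
print(X_pca.shape)  # (50, 2)

# t-SNE: nonlinear embedding, mostly useful for visualization
X_tsne = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(X_tsne.shape)  # (50, 2)
```

PCA gives you components you can reuse on new data; t-SNE is a one-off embedding for looking at your data, not a transform to build models on.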

 

Our * Wildcard * Section

There are a few important things in data science that would have gotten their own sections in a perfect world where a 3-D cheatsheet is a thing.

1. Bias-variance tradeoff is the most baseline element that makes data science an art - for every model you create, you'll have to strike a balance between a low-bias model that's noisy (high variance) and a biased but low-variance one.

2. Underfitting/overfitting - this is related to the bias-variance tradeoff: you need enough data to avoid overfitting a model, while keeping the model descriptive enough to generalize.

3. Inertia - in sklearn's KMeans, a measure of how internally coherent your clusters are (the within-cluster sum of squared distances).

4. True positives, false positives, true negatives and false negatives - we talk about these four a lot with classification, and I thought it was important to know what we're really talking about.
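Assuming the four items are the confusion-matrix counts (true/false positives and negatives), sklearn will tally them for you - a small sketch with labels I made up:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # → 3 1 1 3
```

Precision, recall, accuracy - all the classification metrics you hear about are just ratios of these four counts.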

 

So that's that! Please feel free to use this cheatsheet for yourself or share it with others who might find it helpful. Please credit me wherever appropriate.

If you spot errors or think I've left something important out, debate with me about this on Twitter or send me an email.

Visualizing Null Data (in GIFs)

I recently did a presentation for my classmates at Metis that was so fun, I had to turn it into a blog post. Ever since the days of college, where I was frequently shown worldwide data in which at least 5 key nations had 'no data' as their data point, I've been very interested in the topic of visualizing no data. As data scientists, we must regularly decide how to handle a lack of data (my favorite thing KLQ has said during a lecture: "close your eyes and pretend it's not there"). So why aren't there more resources, articles and research on the topic of visualizing this phenomenon?

So, I'm going to tell you my favorite story about visualizing a lack of data. This will be more interesting than it sounds, I promise. I've armed myself with GIFs to keep you interested.

1854, London. (Oh dear reader, did I lose you already?) Dr. John Snow is assigned the saddening task of visiting each building in the Broad Street vicinity in order to take a tally of all of the people who perished during the cholera outbreak. Up until now, it's widely accepted that cholera is spread through 'bad air' (whatever that means). John Snow hypothesizes that cholera is in fact spread through water, but everyone is like

(GIF: John Snow)


Vindication Through Visualization

So John Snow walks around the Broad Street region and talks to the residents, taking a tally of all deaths. He keeps a record of this on a map (the one that's flashing at you on this page, to be exact) marking a tick for each death.

Do you notice anything peculiar about this map? For the sake of time, I've circled it in green. Because John Snow created a data visualization for the Cholera outbreak, he was easily able to notice that no one in the brewery died of Cholera, despite the brewery's proximity to the neighborhood water pump. John Snow is a curious fellow, so he asks the brewers about their water drinking habits, and I imagine they drunkenly laughed at him, since every worker at the brewery got a daily ration of beer and thus didn't bother with their local water pump.

🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻🍻

(GIF: John Snow, deal with it)

I won't go into details about how the Cholera got there in the first place (because it's super gross and off-topic, but still worth mentioning for context), but it is well-documented here.

So now John Snow is vindicated, solidified in history for standing by his instincts, and no one cares to ask for the names of his dissenters.

 

Applying this to modern data science

It's easy to think of a lot of ways John Snow is important to modern statistics - and it's obvious considering his data visualization is tossed around social science classrooms more often than hacky sacks. If you can see past the Cholera part of his work, it's possible to apply this strategy to anything in data science.

Data visualization as an exploration

When we first receive a dataset, we can visualize the relationships to look for outliers, and show patterns we likely wouldn't see without a visualization. Sometimes these patterns are a lack of data.

Data visualization as an explanation

We can provide context to a dataset's missing data through a visualization.

I recently did a project in which my team created a learning algorithm to categorize tweets based on a corpus of Yelp data (more on this later). Our algorithm proved extremely accurate against Yelp holdout data, but through rough human-vs-algorithm testing (we harassed our classmates into reading tweets about Red Lobster), we found that accuracy dropped significantly. There are a lot of theoretical explanations for this, but visualizing the difference in vocabulary among the top 100 words from each corpus provided the context that was missing in our analysis.

This visualization was made with D3 (and is a work in progress). I removed stopwords (aside from me, she and he, as I feel those provide context for restaurant reviews). With this visualization, you can easily see the lack of overlap between the two corpuses. Don't get too caught up on the words themselves - the most relevant aspect of this visualization is recognizing the difference in vocabularies.



Note: An earlier version of this post did not credit the person who pointed out the excellent Game of Thrones reference, Rumman.