I've mentioned this before, but I really love emoji. I spend so much of my time communicating with friends and family on chat that emoji bring necessary animation to my words that might otherwise look flat on the screen. 💁
Another thing I love is data science. The more I learn about machine learning algorithms, the more challenging it is to keep these subjects organized in my brain to recall at a later time. So, I decided to marry these two loves in as productive a fashion as possible.
(Obvious caveat: this is by no means a comprehensive guide to machine learning, but rather a study in the basics for myself and the likely small overlap of people who like machine learning and love emoji as much as I do.)
I didn't set out to make a cheatsheet, really. Nor did I set out to make an emoji cheatsheet. But a few things in my research on this subject led me down this path:
- There aren't that many (good) machine learning cheatsheets that I could find. If you have one please share it!
- The machine learning cheatsheets I did find did nothing to demystify how to actually use the algorithms.
- The cheatsheets I found weren't fun to look at! I'm a highly visual person and having a box for regression, a box for classification and a box for clustering makes a lot of sense to me.
The making of this sheet was much like building a model. I initially thought it would make most sense to split the algorithms up based on type of learning, but realized the amount of overlap would make that impossible.
Once I decided to split them based on type of algorithm instead (regression, classification, clustering), it became shockingly obvious how many classification algorithms there are. It occurred to me that many data science problems are classification-based, which would explain the volume of classifiers compared to the other types.
If you understand everything on this sheet, feel free to copy the image and leave this page forever. Otherwise, I'm going to walk through my logic below.
Note about the emoji: While my emoji choices are highly intentional, I'd prefer to leave my reasoning open-ended because you may gain a different understanding than I have for the choices, and I wouldn't want to spoil it for you.
A Step-by-Step Guide for This Sheet
Initially, I just had the dots for supervised, unsupervised and reinforcement, but it was requested that I add a box for what makes each different. Like many fancy words (such as stochastic), supervised and unsupervised learning are tossed around like no one's business. Make sure you know what they mean when you dish them out.
Lots of things get called "fundamental data science", but these things really are. Linear regression especially is something you run into constantly in other contexts without necessarily realizing it's used in data science.
I've done a lot of visual design, and this may be one of those I'm most proud of. I think this cell communicates a lot of important information, and really makes sense of what I was trying to accomplish. Beginning with the categories of learning styles - it's clear that neural net is queen bey of complicated algorithms, but with great power comes great responsibility.
The emoji choices throughout this sheet were very intentional. I could explain my logic for each, but we might be here all day. Random forest is really the least serious of them (but also my favorite).
Also, shoutout to Naive Bayes for having at least three ways to call it from sklearn.
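For the curious, here's a rough sketch of three of those ways to call Naive Bayes. The toy count data is made up purely for illustration; the three classes (`GaussianNB`, `MultinomialNB`, `BernoulliNB`) all live in `sklearn.naive_bayes` and differ in what they assume about the features.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Made-up toy data: 6 samples, 3 non-negative count features, 2 classes.
X = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [4, 1, 0],
    [0, 3, 2],
    [0, 2, 3],
    [1, 4, 2],
])
y = np.array([0, 0, 0, 1, 1, 1])

models = {
    "GaussianNB": GaussianNB(),        # treats each feature as normally distributed per class
    "MultinomialNB": MultinomialNB(),  # built for non-negative counts (e.g. word counts)
    "BernoulliNB": BernoulliNB(),      # treats features as binary (binarizes at 0 by default)
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(name, predictions[name])
```

Same `fit`/`predict` calls every time; the only thing that changes is the assumption about the features, so which one fits best depends on your data.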
Clustering is an extremely useful subset of data science that's like classification, but not quite. Therefore, it needs its own cell. With teddy bears.
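To show the "like classification, but not quite" part, here's a minimal sketch using k-means from sklearn. The two blobs of points are synthetic and chosen just for this example. Notice that no labels are passed in: the algorithm has to find the groups on its own.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for this example): two tight blobs, far apart.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
X = np.vstack([blob_a, blob_b])

# No y here: clustering is unsupervised, so we only hand over the features.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)
# Sum of squared distances from each point to its closest cluster center.
print(kmeans.inertia_)
```

The cluster numbers themselves (0 or 1) are arbitrary; what matters is which points end up grouped together.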
The Curse of Dimensionality
I added this section because the more research I did on the algorithms themselves, the more I realized that feature reduction is mega key to making any of them work. I have experienced this during projects and I would be surprised if any data scientist hadn't.
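As one illustration of feature reduction, here's a small sketch using PCA from sklearn. The data is synthetic and deliberately rigged: 10 features that are really driven by 2 hidden factors plus a little noise, so 2 components should capture nearly everything. Real data is rarely this tidy.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic setup (an assumption for this example): 100 samples, 10 features,
# all generated from just 2 hidden factors plus a small amount of noise.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 10))

# Squash 10 features down to 2.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the original variance the 2 components keep.
explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, explained)
```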
Note: there's a tradeoff between calling it t-SNE (and possibly forgetting what the acronym stands for) and writing out t-distributed Stochastic Neighbor Embedding (and actually having to spell "neighbor" correctly).
Our * Wildcard * Section
There are a few important things in data science that might have ended up with their own section in a perfect world where a 3-D cheatsheet is a thing.
1. Bias-variance tradeoff is the most baseline element that describes data science as an art - you will have to strike a balance between models that are too simple and miss the pattern (high bias) and models that chase the noise in the training data (high variance).
2. Underfitting/overfitting - this is similar to the bias-variance tradeoff, but you need to make sure you have enough data to not overfit a model, and that the model is descriptive enough not to underfit, so that it generalizes.
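3. Inertia - a measure of how spread out (disordered) the points within each cluster are. In sklearn's k-means, it is the sum of squared distances from each point to its closest cluster center.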
4. We talk about these four items a lot with classification - I thought it was important to know what we're really talking about.
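To make items 1 and 2 a bit more concrete, here's a minimal sketch of underfitting vs. overfitting using plain numpy polynomial fits. The data is synthetic (a noisy sine curve I picked for illustration) and the degrees 1, 3 and 9 are arbitrary choices. Training error can only go down as the polynomial gets more flexible; error on fresh data is the thing to watch.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples from sin(3x) on [-1, 1] (made-up data for this example)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

# Small training set, larger held-out set from the same process.
x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}: train MSE={results[degree][0]:.4f}, "
          f"test MSE={results[degree][1]:.4f}")
```

A straight line (degree 1) is too rigid for a sine curve, which is the high-bias, underfitting end. A high-degree polynomial on only 15 points has room to chase the noise, which is the high-variance, overfitting end. Comparing the train and test columns is how you tell which side you're on.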