Data Science Interview Cheat Sheets
Quick reference guides organized by topic. These are meant for last-minute review before interviews.
Statistics & Probability
Cheat Sheet: Descriptive Statistics
| Metric | Formula | Use Case |
|---|---|---|
| Mean | Σx / n | Central tendency, continuous data |
| Median | Middle value | Skewed distributions, outliers present |
| Mode | Most frequent | Categorical data |
| Variance | Σ(x - μ)² / n | Data spread (population form; sample variance divides by n - 1) |
| Std Dev | √Variance | Spread in the same units as the data |
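
The formulas in this table map directly onto Python's standard `statistics` module. A minimal sketch, assuming a made-up sample (`data` is hypothetical, with one deliberate outlier to show why the median is preferred for skewed data):

```python
import statistics

# Hypothetical sample; the value 40 is a deliberate outlier.
data = [2, 3, 3, 5, 7, 40]

mean = statistics.mean(data)            # Σx / n, pulled upward by the outlier
median = statistics.median(data)        # middle value, robust to the outlier
mode = statistics.mode(data)            # most frequent value
pop_var = statistics.pvariance(data)    # Σ(x - μ)² / n (population form)
pop_std = statistics.pstdev(data)       # √variance, same units as the data
sample_var = statistics.variance(data)  # divides by n - 1 instead of n

print(f"mean={mean}, median={median}, mode={mode}")
print(f"population variance={pop_var:.2f}, sample variance={sample_var:.2f}")
```

Note the gap between the mean and the median here: a single outlier is enough to make the mean a poor summary.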
Cheat Sheet: Probability Distributions
| Distribution | Type | Parameters | Use Case |
|---|---|---|---|
| Normal | Continuous | μ, σ | Natural phenomena, errors |
| Binomial | Discrete | n, p | Success/failure trials |
| Poisson | Discrete | λ | Rare events over time |
| Uniform | Continuous | a, b | Equal probability |
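
To make the parameters concrete, here is a small standard-library sketch that evaluates each distribution once. The helper names (`binomial_pmf`, `poisson_pmf`, `uniform_pdf`) are illustrative, not from any particular library:

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """P(X = k): k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k): k events in an interval where lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def uniform_pdf(x, a, b):
    """Density of a continuous uniform distribution on [a, b]."""
    return 1 / (b - a) if a <= x <= b else 0.0

# Normal(μ=0, σ=1): probability mass within one standard deviation of the mean.
standard_normal = NormalDist(mu=0, sigma=1)
within_one_sigma = standard_normal.cdf(1) - standard_normal.cdf(-1)

p_three_heads = binomial_pmf(3, 10, 0.5)  # exactly 3 heads in 10 fair coin flips
p_no_events = poisson_pmf(0, 2.0)         # no events when 2 are expected
```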
Hypothesis Testing Decision Tree
Start: Do you have a question about relationships?
│
├─ YES → What type of data?
│  │
│  ├─ Categorical vs Categorical → Chi-Square Test
│  │
│  ├─ Numerical vs Categorical (2 groups) → t-test
│  │  ├─ Known population σ? → Z-test
│  │  └─ Unknown σ, small sample → t-test
│  │
│  ├─ Numerical vs Categorical (3+ groups) → ANOVA
│  │
│  └─ Numerical vs Numerical → Correlation / Regression
│
└─ NO → EDA / Descriptive Statistics
Cheat Sheet: Hypothesis Tests Comparison
| Test | Data Types | Null Hypothesis | When to Use |
|---|---|---|---|
| Chi-Square | Cat vs Cat | No association | Independence, goodness-of-fit |
| t-test | Num vs Cat (2 groups) | Means are equal | Compare 2 group means |
| Z-test | Num vs Cat (2 groups) | Means are equal | Large sample, known σ |
| ANOVA | Num vs Cat (3+ groups) | All means equal | Compare 3+ group means |
| F-test | Num vs Cat (2 groups) | Variances equal | Compare variances of two groups |
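
As one worked instance of the table, the Z-test row can be sketched with the standard library alone. This assumes both population standard deviations are known; the function name `two_sample_z_test` and its arguments are illustrative:

```python
import math
from statistics import NormalDist, mean

def two_sample_z_test(group_a, group_b, sigma_a, sigma_b):
    """Two-sided Z-test of H0: the two group means are equal.

    Assumes the population standard deviations sigma_a and sigma_b are known.
    Returns the z statistic and the two-sided p-value.
    """
    standard_error = math.sqrt(
        sigma_a**2 / len(group_a) + sigma_b**2 / len(group_b)
    )
    z = (mean(group_a) - mean(group_b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical measurements for two groups, each with known σ = 1.
z, p_value = two_sample_z_test([11, 12, 12, 13], [9, 10, 10, 11], 1.0, 1.0)
print(f"z={z:.3f}, p={p_value:.4f}")
```

A small p-value (commonly below 0.05) is evidence against the null hypothesis that the means are equal. With an unknown σ and a small sample, the t-test row applies instead.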
Machine Learning Algorithms
Cheat Sheet: Supervised Learning Algorithm Selection
| Algorithm | Problem Type | Pros | Cons | When to Use |
|---|---|---|---|---|
| Linear Regression | Regression | Fast, interpretable | Assumes linearity | Linear relationships |
| Logistic Regression | Classification | Interpretable, probabilities | Linear boundary | Binary/multi-class, need probabilities |
| Decision Tree | Both | Non-linear, interpretable | Overfits easily | Complex patterns, explainability needed |
| Random Forest | Both | Reduces overfitting, robust | Slow, black box | High accuracy needed, interpretability less important |
| KNN | Both | Simple, no training | Slow prediction, sensitive to scale | Small datasets, simple patterns |
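
The KNN row ("no training", "slow prediction") is easy to see in code: all the work happens at query time. A minimal sketch, assuming Euclidean distance and majority voting; `knn_predict` and the toy data are illustrative:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # No training step: every prediction scans the whole training set.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D data with two well-separated classes.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))
```

Because the vote depends on raw distances, features on larger scales dominate unless the data is standardized first.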
Cheat Sheet: Clustering Algorithms
| Algorithm | Type | Pros | Cons | When to Use |
|---|---|---|---|---|
| K-Means | Partitioning | Fast, scalable | Need to set K, spherical clusters | Large datasets, known # clusters |
| Hierarchical | Agglomerative/Divisive | No need to set K, dendrogram | Slow, memory intensive | Small datasets, explore # clusters |
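
For intuition on why K-Means needs K up front, here is a one-dimensional sketch of the usual assign-then-update loop (Lloyd's algorithm). The function name `kmeans_1d` and the starting centroids are assumptions for illustration; real use would typically go through a library implementation:

```python
def kmeans_1d(points, initial_centroids, max_iterations=100):
    """Cluster 1-D points; K is fixed by the number of initial centroids."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], initial_centroids=[0, 5])
print(centroids, clusters)
```

The result depends on the starting centroids, which is why library implementations usually run several random initializations.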
Model Evaluation Metrics
Cheat Sheet: Regression Metrics
| Metric | Formula | Range | Interpretation | When to Use |
|---|---|---|---|---|
| RMSE | √(Σ(y - ŷ)² / n) | [0, ∞) | Same units as target | Penalize large errors |
| MAE | Σ\|y - ŷ\| / n | [0, ∞) | Same units as target | Treat all errors equally |
| MAPE | (100/n) * Σ\|y - ŷ\|/\|y\| | [0, ∞)% | Percentage error | Relative error important |
| R² | 1 - (SS_res / SS_tot) | (-∞, 1] | Variance explained | Model comparison |
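
The four formulas above can be computed side by side in plain Python. This is a sketch rather than a library API: `regression_metrics` is a made-up helper, and it assumes no true value is zero (MAPE is undefined there) and that the targets are not all identical (R² would divide by zero):

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (in percent) and R² for paired values."""
    n = len(y_true)
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    y_bar = sum(y_true) / n
    ss_res = sum(r**2 for r in residuals)           # residual sum of squares
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        # Assumes every y in y_true is non-zero.
        "mape": 100 / n * sum(abs(r) / abs(y) for r, y in zip(residuals, y_true)),
        "r2": 1 - ss_res / ss_tot,
    }

# Hypothetical targets and predictions.
metrics = regression_metrics([2, 4, 6, 8], [3, 3, 7, 7])
print(metrics)
```

Predicting the mean of `y_true` for every point gives R² = 0, and anything worse than that goes negative, which is why the range is open below.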
Cheat Sheet: Classification Metrics
| Metric | Formula | Range | When to Use |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | [0, 1] | Balanced classes |
| Precision | TP / (TP + FP) | [0, 1] | Minimize false alarms |
| Recall | TP / (TP + FN) | [0, 1] | Find all positives (e.g., disease detection) |
| F1-Score | 2 * (Prec * Rec) / (Prec + Rec) | [0, 1] | Balance precision & recall |
| AUC-ROC | Area under ROC curve | [0, 1] | Overall classifier performance |
Confusion Matrix Quick Reference
| | Predicted Pos | Predicted Neg |
|---|---|---|
| Actual Pos | TP | FN |
| Actual Neg | FP | TN |
- Precision = “Of all predicted positives, how many were correct?”
- Recall = “Of all actual positives, how many did we find?”
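
A quick way to internalize these definitions is to compute them from the four confusion-matrix counts. A minimal sketch; `classification_metrics` and the counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical imbalanced problem: 12 actual positives out of 100 samples.
scores = classification_metrics(tp=8, fp=2, fn=4, tn=86)
print(scores)
```

Here accuracy looks strong because negatives dominate, while recall shows that a third of the positives were missed. That gap is the usual reason to avoid accuracy on imbalanced classes.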
Overfitting vs Underfitting
| Aspect | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Training Error | High | Low | Very Low |
| Validation Error | High | Low | High |
| Model Complexity | Too simple | Just right | Too complex |
| What’s happening | Not learning patterns | Learning generalizable patterns | Memorizing noise |
| Fix | More features, more complex model | Good to go | Regularization, more data, simpler model |
Regularization
| Technique | Type | Formula | Effect | When to Use |
|---|---|---|---|---|
| Ridge (L2) | Linear | + λΣβ² | Shrinks coefficients | Multicollinearity, keep all features |
| Lasso (L1) | Linear | + λΣ\|β\| | Sets some β to 0 | Feature selection needed |
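
The difference between "shrinks" and "sets to 0" shows up even with one feature. The sketch below assumes a single-feature model with no intercept and a loss of Σ(y - βx)² plus the penalty from the table. Under that assumption both solutions have a closed form. The names `ridge_1d` and `lasso_1d` are illustrative:

```python
import math

def ridge_1d(x, y, lam):
    """Minimize Σ(y - βx)² + λβ²: the coefficient shrinks but never reaches 0."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi**2 for xi in x)
    return sxy / (sxx + lam)

def lasso_1d(x, y, lam):
    """Minimize Σ(y - βx)² + λ|β|: soft-thresholding can set β exactly to 0."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi**2 for xi in x)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx

# Hypothetical data with a true slope of 2.
x, y = [1, 2, 3], [2, 4, 6]
for lam in (0, 14, 56):
    print(lam, ridge_1d(x, y, lam), lasso_1d(x, y, lam))
```

With λ = 0 both recover the ordinary least-squares slope. As λ grows, the ridge estimate keeps shrinking toward zero, while the lasso estimate hits exactly zero once λ/2 exceeds |Σxy|.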
Ensemble Methods
| Method | Type | How it Works | Best For |
|---|---|---|---|
| Random Forest | Bagging | Average of many trees | Reduce variance, high accuracy |
| AdaBoost | Boosting | Sequential, focus on errors | Weak learners, binary classification |
| Gradient Boosting | Boosting | Sequential, fit residuals | High accuracy, regression/classification |
| XGBoost | Boosting | Optimized gradient boosting | Competitions, production systems |
# Standard Error
SE = σ / √n
# Z-Score
z = (x - μ) / σ
# Confidence Interval
CI = x̄ ± (z * SE)
# R² (coefficient of determination)
R² = 1 - (SS_residual / SS_total)
# Bias-Variance Tradeoff
Total Error = Bias² + Variance + Irreducible Error
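
The standard error, z-score and confidence interval formulas chain together naturally. A minimal sketch, assuming the population σ is known so a z-based interval is appropriate; `z_confidence_interval` and the sample are hypothetical:

```python
import math
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, confidence=0.95):
    """Return (low, high) for the mean, assuming a known population sigma."""
    standard_error = sigma / math.sqrt(len(sample))  # SE = σ / √n
    # Critical z for a two-sided interval, e.g. about 1.96 for 95%.
    z_critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    x_bar = mean(sample)
    margin = z_critical * standard_error             # z * SE
    return x_bar - margin, x_bar + margin            # CI = x̄ ± (z * SE)

# Hypothetical sample of 25 observations with known σ = 5.
low, high = z_confidence_interval([10] * 25, sigma=5)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

Quadrupling the sample size halves the standard error, and with it the width of the interval.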
Pro Tip: Print these cheat sheets and review them the night before your interview!