Data Science Interview Cheat Sheets

Quick reference guides organized by topic. These are meant for last-minute review before interviews.


📊 Statistics & Probability

Cheat Sheet: Descriptive Statistics

| Metric | Formula | Use Case |
|---|---|---|
| Mean | Σx / n | Central tendency, continuous data |
| Median | Middle value | Skewed distributions, outliers present |
| Mode | Most frequent | Categorical data |
| Variance | Σ(x - μ)² / n | Data spread |
| Std Dev | √Variance | Same units as data |
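
As a quick check of the formulas in the table above, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for illustration, and the population forms (`pvariance`, `pstdev`) are used because the table divides by n rather than n - 1.

```python
import statistics

# Made-up sample, for illustration only
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)           # Σx / n
median = statistics.median(data)       # middle value of the sorted data
mode = statistics.mode(data)           # most frequent value
variance = statistics.pvariance(data)  # population variance: Σ(x - μ)² / n
std_dev = statistics.pstdev(data)      # square root of the population variance

print(mean, median, mode, variance, std_dev)
```

For a sample estimate (dividing by n - 1), `statistics.variance` and `statistics.stdev` are the counterparts.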

Cheat Sheet: Probability Distributions

| Distribution | Type | Parameters | Use Case |
|---|---|---|---|
| Normal | Continuous | μ, σ | Natural phenomena, errors |
| Binomial | Discrete | n, p | Success/failure trials |
| Poisson | Discrete | λ | Rare events over time |
| Uniform | Continuous | a, b | Equal probability |
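
To make the parameters in the table above concrete, the sketch below writes out the textbook density or mass function for each distribution using only the `math` module. The function names are illustrative choices, not part of any library.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma) at x."""
    coef = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coef * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def uniform_pdf(x, a, b):
    """Density of Uniform(a, b) at x: constant inside [a, b], zero outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

print(binomial_pmf(2, 4, 0.5))  # chance of 2 heads in 4 fair coin flips
```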

🔬 Hypothesis Testing Decision Tree

Start: Do you have a question about relationships?
  │
  ├─ YES → What type of data?
  │   │
  │   ├─ Categorical vs Categorical → Chi-Square Test
  │   │
  │   ├─ Numerical vs Categorical (2 groups) → t-test
  │   │   ├─ Known population σ? → Z-test
  │   │   └─ Unknown σ, small sample → t-test
  │   │
  │   ├─ Numerical vs Categorical (3+ groups) → ANOVA
  │   │
  │   └─ Numerical vs Numerical → Correlation / Regression
  │
  └─ NO → EDA / Descriptive Statistics
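
The decision tree above can be sketched as a small lookup function. This is only an illustration of the branching logic; the function name, argument names, and string labels are assumptions made for this example.

```python
def choose_test(relationship_question, data_types=None, n_groups=None, sigma_known=False):
    """Walk the decision tree and return the suggested analysis.

    data_types is a pair such as ("numerical", "categorical");
    n_groups is only used for the numerical-vs-categorical branch.
    """
    if not relationship_question:
        return "EDA / Descriptive Statistics"
    if data_types == ("categorical", "categorical"):
        return "Chi-Square Test"
    if data_types == ("numerical", "numerical"):
        return "Correlation / Regression"
    if data_types == ("numerical", "categorical"):
        if n_groups == 2:
            # Known population sigma -> Z-test, otherwise t-test
            return "Z-test" if sigma_known else "t-test"
        if n_groups is not None and n_groups >= 3:
            return "ANOVA"
    raise ValueError("Combination not covered by the decision tree")

print(choose_test(True, ("numerical", "categorical"), n_groups=3))
```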

Cheat Sheet: Hypothesis Tests Comparison

| Test | Data Types | Null Hypothesis | When to Use |
|---|---|---|---|
| Chi-Square | Cat vs Cat | No association | Independence, goodness-of-fit |
| t-test | Num vs Cat (2 groups) | Means are equal | Compare 2 group means |
| Z-test | Num vs Cat (2 groups) | Means are equal | Large sample, known σ |
| ANOVA | Num vs Cat (3+ groups) | All means equal | Compare 3+ group means |
| F-test | Num vs Num | Variances equal | Compare variances |
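
As one worked instance of the table above, here is a sketch of a two-sample Z-test for equal means with known population standard deviations. It uses only the standard library (`statistics.NormalDist` for the normal CDF); the function name and the numbers in the usage line are invented for illustration.

```python
import math
from statistics import NormalDist

def two_sample_z_test(mean1, mean2, sigma1, sigma2, n1, n2):
    """Two-sided Z-test of H0: the two population means are equal.

    Assumes the population standard deviations sigma1 and sigma2 are known.
    Returns the z statistic and the two-sided p-value.
    """
    se = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)  # SE of the difference in means
    z = (mean1 - mean2) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_sample_z_test(103.0, 100.0, 15.0, 15.0, 200, 200)
print(z, p)
```

When σ is unknown and estimated from the data, the same statistic is compared against a t distribution instead, which is the t-test row of the table.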

🤖 Machine Learning Algorithms

Cheat Sheet: Supervised Learning Algorithm Selection

| Algorithm | Problem Type | Pros | Cons | When to Use |
|---|---|---|---|---|
| Linear Regression | Regression | Fast, interpretable | Assumes linearity | Linear relationships |
| Logistic Regression | Classification | Interpretable, probabilities | Linear boundary | Binary/multi-class, need probabilities |
| Decision Tree | Both | Non-linear, interpretable | Overfits easily | Complex patterns, explainability needed |
| Random Forest | Both | Reduces overfitting, robust | Slow, black box | High accuracy, less interpretable OK |
| KNN | Both | Simple, no training | Slow prediction, sensitive to scale | Small datasets, simple patterns |
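
The KNN row is the easiest one to show from scratch. The sketch below is a minimal majority-vote classifier with Euclidean distance; the function name and the toy data are assumptions for illustration. It also shows why prediction is slow (every query scans all training points) and why feature scale matters (distances are computed on raw values).

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "No training": all the work happens at prediction time
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two well-separated groups
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (2, 2)))
```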

Cheat Sheet: Clustering Algorithms

| Algorithm | Type | Pros | Cons | When to Use |
|---|---|---|---|---|
| K-Means | Partitioning | Fast, scalable | Need to set K, spherical clusters | Large datasets, known # clusters |
| Hierarchical | Agglomerative/Divisive | No need to set K, dendrogram | Slow, memory intensive | Small datasets, explore # clusters |
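
For the K-Means row, here is a rough sketch of the assign-then-update loop (Lloyd's algorithm). It assumes the caller supplies the initial centroids, which is where "need to set K" shows up; the function name and toy points are illustrative only, and real implementations add smarter initialization and a convergence check.

```python
import math

def k_means(points, centroids, iterations=10):
    """Run a fixed number of assign/update steps and return (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```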

📈 Model Evaluation Metrics

Cheat Sheet: Regression Metrics

| Metric | Formula | Range | Interpretation | When to Use |
|---|---|---|---|---|
| RMSE | √(Σ(y - ŷ)² / n) | [0, ∞) | Same units as target | Penalize large errors |
| MAE | Σ\|y - ŷ\| / n | [0, ∞) | Same units as target | Treat all errors equally |
| MAPE | (100/n) * Σ\|y - ŷ\|/\|y\| | [0, ∞)% | Percentage error | Relative error matters |
| R² | 1 - (SS_res / SS_tot) | (-∞, 1] | Variance explained | Model comparison |
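
The four formulas in the table above translate directly into code. This is a plain-Python sketch with an illustrative function name and made-up values; it assumes no true value is zero (MAPE is undefined there) and that the targets are not all identical (SS_tot would be zero).

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, MAPE (in percent), and R² from paired values."""
    n = len(y_true)
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    ss_res = sum(r ** 2 for r in residuals)
    mean_y = sum(y_true) / n
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "mape": (100.0 / n) * sum(abs(r) / abs(y) for r, y in zip(residuals, y_true)),
        "r2": 1.0 - ss_res / ss_tot,
    }

metrics = regression_metrics([2, 4, 6, 8], [3, 4, 5, 8])
print(metrics)
```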

Cheat Sheet: Classification Metrics

| Metric | Formula | Range | When to Use |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | [0, 1] | Balanced classes |
| Precision | TP / (TP + FP) | [0, 1] | Minimize false alarms |
| Recall | TP / (TP + FN) | [0, 1] | Find all positives (e.g., disease detection) |
| F1-Score | 2 * (Prec * Rec) / (Prec + Rec) | [0, 1] | Balance precision & recall |
| AUC-ROC | Area under ROC curve | [0, 1] | Overall classifier performance |

Confusion Matrix Quick Reference

                Predicted
              Pos     Neg
Actual  Pos   TP      FN
        Neg   FP      TN
  • Precision = “Of all predicted positives, how many were correct?”
  • Recall = “Of all actual positives, how many did we find?”
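
Given the four confusion-matrix counts, the classification metrics follow directly from their formulas. The sketch below uses an illustrative function name and made-up counts, and assumes every denominator is non-zero (at least one predicted positive and one actual positive).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of all predicted positives, how many were correct
    recall = tp / (tp + fn)     # of all actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

scores = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(scores)
```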

🎯 Overfitting vs Underfitting

| Aspect | Underfitting | Good Fit | Overfitting |
|---|---|---|---|
| Training Error | High | Low | Very Low |
| Validation Error | High | Low | High |
| Model Complexity | Too simple | Just right | Too complex |
| What's happening | Not learning patterns | Learning generalizable patterns | Memorizing noise |
| Fix | More features, complex model | ✓ Good to go | Regularization, more data, simpler model |

🔧 Regularization

| Technique | Type | Formula | Effect | When to Use |
|---|---|---|---|---|
| Ridge (L2) | Linear | + λΣβ² | Shrinks coefficients | Multicollinearity, keep all features |
| Lasso (L1) | Linear | + λΣ\|β\| | Sets some β to 0 | Feature selection needed |
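
The "Formula" column above is the term added to the model's loss. A minimal sketch of both penalty terms follows, with illustrative function names and made-up coefficients; it shows only the penalties, not a full fitting procedure.

```python
def ridge_penalty(coefficients, lam):
    """L2 penalty: λ * Σβ²."""
    return lam * sum(b ** 2 for b in coefficients)

def lasso_penalty(coefficients, lam):
    """L1 penalty: λ * Σ|β|."""
    return lam * sum(abs(b) for b in coefficients)

coefficients = [0.5, -2.0, 0.0]  # made-up values
print(ridge_penalty(coefficients, lam=0.1))
print(lasso_penalty(coefficients, lam=0.1))
```

Note how the squared term penalizes the large coefficient (-2.0) far more than the small one, while the absolute-value term charges every unit of coefficient equally.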

🎲 Ensemble Methods

| Method | Type | How it Works | Best For |
|---|---|---|---|
| Random Forest | Bagging | Average of many trees | Reduce variance, high accuracy |
| AdaBoost | Boosting | Sequential, focus on errors | Weak learners, binary classification |
| Gradient Boosting | Boosting | Sequential, fit residuals | High accuracy, regression/classification |
| XGBoost | Boosting | Optimized gradient boosting | Competition-winning models, production systems |

⚡ Quick Interview Formulas

Must-Know Formulas

# Standard Error
SE = σ / √n

# Z-Score
z = (x - μ) / σ

# Confidence Interval
CI = x̄ ± (z * SE)

# R² (coefficient of determination)
R² = 1 - (SS_residual / SS_total)

# Bias-Variance Tradeoff
Total Error = Bias² + Variance + Irreducible Error
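
The first three formulas above can be checked numerically with the standard library. In this sketch the function names and input numbers are illustrative, and the critical z value is taken from `statistics.NormalDist`, which assumes a normal sampling distribution with known σ.

```python
import math
from statistics import NormalDist

def standard_error(sigma, n):
    """SE = σ / √n."""
    return sigma / math.sqrt(n)

def z_score(x, mu, sigma):
    """z = (x - μ) / σ."""
    return (x - mu) / sigma

def confidence_interval(x_bar, sigma, n, confidence=0.95):
    """CI = x̄ ± z * SE, with z the two-sided normal critical value."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    margin = z * standard_error(sigma, n)
    return x_bar - margin, x_bar + margin

low, high = confidence_interval(x_bar=100.0, sigma=15.0, n=100)
print(standard_error(15.0, 100), z_score(130.0, 100.0, 15.0), (low, high))
```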

Pro Tip: Print these cheat sheets and review them the night before your interview!