Arun Murali

I Ditched Obsidian Sync and Built My Own — Here's What Actually Happened

Sun, 31 May 2026 00:00:00 +0000

The $96/Year Problem

Obsidian Sync costs $8/month. That’s $96/year to sync markdown files — plain text — across your devices.

I already pay for a VPS. I already run Docker. And I was already staring at a bill for a note-syncing service that, fundamentally, moves .md files between computers.

So I asked myself: how hard could this be?

The answer: harder than it should be. But completely worth it.

My First Attempt Was Embarrassingly Wrong

My first instinct was to run Obsidian itself in a Docker container on the VPS and access it through a browser. No local app needed — just open a tab, take notes.

I Just Wanted to Review My Chess Games. I Built a Multiplayer App Instead.

Sun, 31 May 2026 00:00:00 +0000

It Started With a Simple Frustration

I lose a chess game. I want to know why.

Lichess and Chess.com both have analysis boards, but they’re cluttered, slow to load, and I don’t control my data. I just want to paste a PGN — the standard text format chess games are saved in — and step through my moves on a clean board.

That’s it. That’s the whole requirement.

What I shipped six weeks later: a retro terminal-style chess app with 3D WebGPU rendering, a Stockfish bot at three difficulty levels, real-time online multiplayer, JWT authentication, a PostgreSQL database tracking win/loss stats, and an AI lesson generator powered by LiteLLM.

Building a Couple Outfit Configurator with Layered SVG Avatars in React

Fri, 06 Mar 2026 00:00:00 +0000

A browser-based side-by-side bride and groom outfit configurator, built as a proof-of-concept to validate layered SVG avatar rendering, outfit switching, and coordinated couple palette application.

Live Demo: outfits.anmious.cloud

🎯 What It Does

The configurator lets you customize two avatars simultaneously:

Skin tone — light, medium, dark
Body type — petite to plus-size (bride), lean to stocky (groom)
Height — short through very tall, with realistic proportions
Hair style and color — updo, short bob, long straight (bride); multiple buzz cuts and spikes (groom)
Outfits — Western, Indian, and Casual categories per avatar
Glasses toggle — overlaid frame layer
Couple palettes — curated coordinated color sets (Classic White, Blush & Rose, Sage Garden, Midnight Blue, Golden Hour)
Manual color override — per-avatar primary color picker
Export to PNG — download the full couple preview as a retina-quality image

🏗️ Architecture: Why Layered SVGs?

The Core Idea

Each avatar is rendered as a stack of independent SVG layer components, all sharing the same viewBox="0 0 200 450" coordinate space:

How I Built Smart Debt Planner with AI Prompts: FastAPI, Docker, and CI/CD Deployment

Sun, 01 Mar 2026 00:00:00 +0000

If you want to build a production-ready app quickly, this is the workflow I used to create Smart Debt Planner: a FastAPI app for debt payoff simulations with a clean deployment pipeline.

This guide covers:

How I used prompts to build features faster
The exact project architecture
Docker + VPS deployment
GitHub Actions CI/CD
Real-world troubleshooting (SSH auth, Docker Compose mismatch, 502 errors)

🔄 Latest Update (March 2026)

New additions were shipped after the initial backend-only deployment:

Build a 3D Chess Replay Viewer with WebGPU in Under 30 Minutes

Sat, 24 Jan 2026 00:00:00 +0000

A complete step-by-step guide to creating an interactive 3D chess replay viewer using Babylon.js, React, and TypeScript. Watch chess games come to life with smooth animations, glowing highlights, and WebGPU-accelerated rendering!

🎯 What You’ll Build

An interactive 3D chess board that can:

Parse chess games in PGN notation
Replay games move by move with smooth 3D animations
Switch between different chess piece sets
Auto-play games with pause/resume controls
Support both WebGPU (modern) and WebGL (fallback) rendering

Live Demo Features:

Accuracy

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Accuracy is the proportion of correct predictions: (TP + TN) / Total. Simple and intuitive, but MISLEADING for imbalanced datasets. If 95% of emails are not spam, a model that always predicts “not spam” gets 95% accuracy but is useless. Use accuracy only for balanced datasets; prefer precision, recall, or F1 for imbalanced data.
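To make the spam example concrete, here is a minimal sketch in plain Python. The 95/5 inbox and the helper names `accuracy` and `recall` are illustrative assumptions, not part of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    actual_pos = sum(1 for t in y_true if t == positive)
    true_pos = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical inbox: 95 legitimate emails (0) and 5 spam emails (1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts "not spam".
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

The always-negative model scores 95% accuracy while finding none of the spam, which is why recall (or precision, or F1) is the better lens here.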

🟦 Core Notes (Must-Know)

Formula

[Content to be filled in]

When to Use Accuracy

[Content to be filled in]

AdaBoost (Adaptive Boosting)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

AdaBoost builds models sequentially, each focusing on examples the previous models got wrong by adjusting their weights. Combines weak learners (usually shallow trees/stumps) into a strong learner. Each model’s influence is weighted by its accuracy. Pros: simple, works well with weak learners. Cons: sensitive to outliers and noise, slower than Random Forest.
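The reweighting step can be sketched as a single boosting round. This assumes the classic discrete AdaBoost update (model weight alpha = 0.5 * ln((1 - error) / error)), sample weights that already sum to 1, and an error strictly between 0 and 1; the function name `adaboost_round` is hypothetical.

```python
import math


def adaboost_round(weights, is_correct):
    """One discrete-AdaBoost round: return (alpha, renormalized weights)."""
    # Weighted error of the current weak learner.
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # The learner's say in the final vote grows as its error shrinks.
    alpha = 0.5 * math.log((1 - error) / error)
    # Up-weight mistakes, down-weight correct predictions.
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, is_correct)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five equally weighted samples; the weak learner misses only the last one.
alpha, new_weights = adaboost_round(
    [0.2] * 5, [True, True, True, True, False]
)
print(round(alpha, 4))
print([round(w, 3) for w in new_weights])
```

After one round the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.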

🟦 Core Notes (Must-Know)

How AdaBoost Works

[Content to be filled in]

The Algorithm

[Content to be filled in]

ANOVA (Analysis of Variance)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

ANOVA (Analysis of Variance) tests if means of 3+ groups are significantly different. Instead of multiple t-tests (which inflates Type I error), ANOVA does one omnibus test by comparing variance between groups to variance within groups. If significant, use post-hoc tests to find which specific groups differ. One-way ANOVA has one factor; two-way has two factors.
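The "variance between groups vs variance within groups" comparison is just a ratio of two mean squares. A minimal sketch for the one-way case, with made-up group data and a hypothetical helper name:

```python
def one_way_anova_f(groups):
    """F statistic for one-way ANOVA: between-group MS / within-group MS."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares: spread of points around their own group mean.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Illustrative data: three groups of three observations each.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

A large F means the group means differ by more than the within-group noise would explain; turning F into a p-value needs the F-distribution with (k - 1, N - k) degrees of freedom, which this sketch leaves out.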

🟦 Core Notes (Must-Know)

What is ANOVA?

[Content to be filled in]

Chi-Square Tests (χ² Tests)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Chi-square tests determine if there’s a significant association between categorical variables. Common types: (1) Test of Independence (are two variables related?), (2) Goodness of Fit (does data match expected distribution?). Use when both variables are categorical. The test compares observed frequencies to expected frequencies.
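Comparing observed to expected frequencies looks like this for a test of independence. A minimal sketch with an invented 2x2 table; `chi_square_statistic` is a hypothetical helper, and no continuity correction is applied.

```python
def chi_square_statistic(observed):
    """Return (chi2, degrees_of_freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


# Made-up counts: rows are two groups, columns are two outcomes.
chi2, dof = chi_square_statistic([[30, 10], [20, 40]])
print(round(chi2, 3), dof)
```

The statistic is then compared against the chi-square distribution with `dof` degrees of freedom to get a p-value.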

🟦 Core Notes (Must-Know)

What is a Chi-Square Test?

[Content to be filled in]

Chi-Square Test of Independence

[Content to be filled in]

Classification Report (Precision, Recall, F1-Score)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Classification report provides a comprehensive view of model performance: precision, recall, F1-score, and support for each class. Shows how well the model performs overall and per-class. Essential for imbalanced datasets where accuracy alone is misleading. Use to identify which classes the model struggles with.

🟦 Core Notes (Must-Know)

What’s in a Classification Report?

[Content to be filled in]

Metrics Explained

[Content to be filled in]

Precision
Recall
F1-Score
Support

Macro vs Weighted Averages

[Content to be filled in]

Clustering Overview - Unsupervised Learning

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Clustering is unsupervised learning that groups similar data points together without predefined labels. Common algorithms: K-Means (fast, needs K), Hierarchical (dendrogram, no K needed), DBSCAN (density-based, finds arbitrary shapes). Use cases: customer segmentation, anomaly detection, data exploration. Unlike supervised learning, there’s no “correct” answer - evaluate with silhouette score, elbow method, domain knowledge.

🟦 Core Notes (Must-Know)

What is Clustering?

[Content to be filled in]

Supervised vs Unsupervised Learning

[Content to be filled in]

Confusion Matrix

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

A confusion matrix shows the performance of a classification model by comparing actual vs predicted labels. Four quadrants: True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN). All classification metrics (precision, recall, accuracy, F1) derive from these four values. Essential for understanding where your model makes mistakes.
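As a quick illustration of how the metrics fall out of the four counts, here is a plain-Python sketch. The labels are invented and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary classification problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
c = confusion_counts(y_true, y_pred)

precision = c["TP"] / (c["TP"] + c["FP"])
recall = c["TP"] / (c["TP"] + c["FN"])
accuracy = (c["TP"] + c["TN"]) / len(y_true)
f1 = 2 * precision * recall / (precision + recall)
print(c, precision, recall, accuracy, f1)
```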

🟦 Core Notes (Must-Know)

Structure of Confusion Matrix

[Content to be filled in]

Actual \ Predicted	Pos	Neg
Pos	TP	FN
Neg	FP	TN

The Four Quadrants

[Content to be filled in]

Cross Validation (CV)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Cross-validation (CV) evaluates model performance by splitting data into K folds, training on K-1 folds and testing on the remaining fold, repeating K times. Averages results for more robust estimate than single train-test split. Common: K-Fold (K=5 or 10), Stratified K-Fold (preserves class distribution), Leave-One-Out. Use to detect overfitting and compare models fairly.
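The fold mechanics can be sketched without any library. This version uses contiguous, unshuffled folds for simplicity; real workflows usually shuffle or stratify first. `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for K contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [
            i for i in range(n_samples) if i < start or i >= start + size
        ]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in the test fold exactly once; you would fit on `train_idx`, score on `test_idx`, and average the K scores.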

🟦 Core Notes (Must-Know)

What is Cross Validation?

[Content to be filled in]

Types of Cross Validation

[Content to be filled in]

Data Science Interview Cheat Sheets Index

Sat, 10 Jan 2026 00:00:00 +0000

Data Science Interview Cheat Sheets

Quick reference guides organized by topic. These are meant for last-minute review before interviews.

📊 Statistics & Probability

Cheat Sheet: Descriptive Statistics

Metric	Formula	Use Case
Mean	`Σx / n`	Central tendency, continuous data
Median	Middle value	Skewed distributions, outliers present
Mode	Most frequent	Categorical data
Variance	`Σ(x - μ)² / n`	Data spread
Std Dev	`√Variance`	Same units as data

Cheat Sheet: Probability Distributions

Distribution	Type	Parameters	Use Case
Normal	Continuous	μ, σ	Natural phenomena, errors
Binomial	Discrete	n, p	Success/failure trials
Poisson	Discrete	λ	Rare events over time
Uniform	Continuous	a, b	Equal probability

🔬 Hypothesis Testing Decision Tree

Start: Do you have a question about relationships?
 │
 ├─ YES → What type of data?
 │ │
 │ ├─ Categorical vs Categorical → Chi-Square Test
 │ │
 │ ├─ Numerical vs Categorical (2 groups) → t-test
 │ │ ├─ Known population σ? → Z-test
 │ │ └─ Unknown σ, small sample → t-test
 │ │
 │ ├─ Numerical vs Categorical (3+ groups) → ANOVA
 │ │
 │ └─ Numerical vs Numerical → Correlation / Regression
 │
 └─ NO → EDA / Descriptive Statistics

Cheat Sheet: Hypothesis Tests Comparison

Test	Data Types	Null Hypothesis	When to Use
Chi-Square	Cat vs Cat	No association	Independence, goodness-of-fit
t-test	Num vs Cat (2 groups)	Means are equal	Compare 2 group means
Z-test	Num vs Cat (2 groups)	Means are equal	Large sample, known σ
ANOVA	Num vs Cat (3+ groups)	All means equal	Compare 3+ group means
F-test	Num vs Cat (2 groups)	Variances equal	Compare variances

🤖 Machine Learning Algorithms

Cheat Sheet: Supervised Learning Algorithm Selection

Algorithm	Problem Type	Pros	Cons	When to Use
Linear Regression	Regression	Fast, interpretable	Assumes linearity	Linear relationships
Logistic Regression	Classification	Interpretable, probabilities	Linear boundary	Binary/multi-class, need probabilities
Decision Tree	Both	Non-linear, interpretable	Overfits easily	Complex patterns, explainability needed
Random Forest	Both	Reduces overfitting, robust	Slow, black box	High accuracy, less interpretable OK
KNN	Both	Simple, no training	Slow prediction, sensitive to scale	Small datasets, simple patterns

Cheat Sheet: Clustering Algorithms

Algorithm	Type	Pros	Cons	When to Use
K-Means	Partitioning	Fast, scalable	Need to set K, spherical clusters	Large datasets, known # clusters
Hierarchical	Agglomerative/Divisive	No need to set K, dendrogram	Slow, memory intensive	Small datasets, explore # clusters

📈 Model Evaluation Metrics

Cheat Sheet: Regression Metrics

Metric	Formula	Range	Interpretation	When to Use
RMSE	`√(Σ(y - ŷ)² / n)`	[0, ∞)	Same units as target	Penalize large errors
MAE	`Σ|y - ŷ| / n`	[0, ∞)	Same units as target	Treat all errors equally
MAPE	`(100/n) * Σ|y - ŷ|/|y|`	[0, ∞)%	Percentage error	Relative error important
R²	`1 - (SS_res / SS_tot)`	(-∞, 1]	Variance explained	Model comparison

Cheat Sheet: Classification Metrics

Metric	Formula	Range	When to Use
Accuracy	`(TP + TN) / Total`	[0, 1]	Balanced classes
Precision	`TP / (TP + FP)`	[0, 1]	Minimize false alarms
Recall	`TP / (TP + FN)`	[0, 1]	Find all positives (e.g., disease detection)
F1-Score	`2 * (Prec * Rec) / (Prec + Rec)`	[0, 1]	Balance precision & recall
AUC-ROC	Area under ROC curve	[0, 1]	Overall classifier performance

Confusion Matrix Quick Reference

Actual \ Predicted	Pos	Neg
Pos	TP	FN
Neg	FP	TN

Precision = “Of all predicted positives, how many were correct?”
Recall = “Of all actual positives, how many did we find?”

🎯 Overfitting vs Underfitting

Aspect	Underfitting	Good Fit	Overfitting
Training Error	High	Low	Very Low
Validation Error	High	Low	High
Model Complexity	Too simple	Just right	Too complex
What’s happening	Not learning patterns	Learning generalizable patterns	Memorizing noise
Fix	More features, complex model	✓ Good to go	Regularization, more data, simpler model

🔧 Regularization

Technique	Type	Formula	Effect	When to Use
Ridge (L2)	Linear	`+ λΣβ²`	Shrinks coefficients	Multicollinearity, keep all features
Lasso (L1)	Linear	`+ λΣ|β|`	Sets some β to 0	Feature selection needed

🎲 Ensemble Methods

Method	Type	How it Works	Best For
Random Forest	Bagging	Average of many trees	Reduce variance, high accuracy
AdaBoost	Boosting	Sequential, focus on errors	Weak learners, binary classification
Gradient Boosting	Boosting	Sequential, fit residuals	High accuracy, regression/classification
XGBoost	Boosting	Optimized gradient boosting	Competition winning, production systems

⚡ Quick Interview Formulas

Must-Know Formulas

# Standard Error
SE = σ / √n

# Z-Score
z = (x - μ) / σ

# Confidence Interval
CI = x̄ ± (z * SE)

# R² (coefficient of determination)
R² = 1 - (SS_residual / SS_total)

# Bias-Variance Tradeoff
Total Error = Bias² + Variance + Irreducible Error
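
The first three formulas translate directly into code. A small sketch with invented numbers; the function names are illustrative, and the default z = 1.96 assumes a 95% interval with known σ.

```python
import math


def standard_error(sigma, n):
    """SE = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def z_score(x, mu, sigma):
    """z = (x - mu) / sigma."""
    return (x - mu) / sigma


def confidence_interval(x_bar, sigma, n, z=1.96):
    """CI = x_bar +/- z * SE."""
    margin = z * standard_error(sigma, n)
    return x_bar - margin, x_bar + margin


print(standard_error(10, 100))          # 1.0
print(z_score(130, 100, 15))            # 2.0
low, high = confidence_interval(50, 10, 100)
print(round(low, 2), round(high, 2))    # 48.04 51.96
```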

Pro Tip: Print these cheat sheets and review them the night before your interview!

Data Science Interview Notes (Full-Stack Roadmap)

Sat, 10 Jan 2026 00:00:00 +0000

How to use this site

🟪 1-minute Summary = skim mode
🟦 Core Notes = must-know
🟨 Interview Triggers = what interviewers really test
🟥 Common Mistakes = traps
🟩 Mini Example = quick application

0) Start Here (Read First)

A) Statistics Foundations

A1: Statistics Basics

A2: Probability

A3: Random Variables & Distributions

B) Statistical Inference & Hypothesis Testing

B1: Hypothesis Testing Core (Master Template)

Hypothesis Testing: General Step-by-Step Framework

B2: Test Families (Each page repeats the same framework)

C) EDA & Data Preparation

C1: EDA Workflow

C2: Data Cleaning Modules

D) Machine Learning Core

D1: Unsupervised Learning (Clustering)

D2: Supervised Learning (Prediction)

E) Model Evaluation & Model Selection

E1: Regression Evaluation

E2: Classification Evaluation

E3: Model Selection Workflows

F) Generalization, Regularization, and Fit

F1: Fit & Generalization

F2: Regularization (Linear Models)

G) Feature Engineering & Non-Linear Modeling

H) Imbalanced Data Toolkit

I) Ensemble Methods

I1: Ensemble Overview

Ensemble Methods Overview

I2: Bagging & Forests

Random Forest

I3: Boosting Family

Decision Tree

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Decision trees make predictions by learning decision rules from features, creating a tree structure of if-then conditions. Splits data recursively to maximize purity (using Gini or entropy). Pros: highly interpretable, handles non-linear relationships, no scaling needed. Cons: prone to overfitting, unstable (small data changes = big tree changes). Control depth to prevent overfitting.
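"Purity" has a concrete definition. Here is a minimal sketch of the two criteria mentioned above plus the weighted impurity of a candidate split; the helper names and toy labels are assumptions for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, impurity=gini):
    """Impurity of a split, weighting each child by its share of samples."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


mixed = ["a", "a", "b", "b"]
print(gini(mixed), entropy(mixed))                   # 0.5 1.0
print(weighted_impurity(["a", "a"], ["b", "b"]))     # 0.0, a perfect split
```

A tree picks, at each node, the split with the lowest weighted impurity among the candidates it tries.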

🟦 Core Notes (Must-Know)

How Decision Trees Work

[Content to be filled in]

Splitting Criteria

[Content to be filled in]

Dendrogram and Hierarchical Clustering

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Hierarchical clustering builds a tree (dendrogram) showing how data points group together at different similarity levels. Two types: Agglomerative (bottom-up: merge) and Divisive (top-down: split). Advantage: don’t need to specify K beforehand, dendrogram visualizes structure. Disadvantage: slow for large datasets. Cut the dendrogram at desired height to get clusters.

🟦 Core Notes (Must-Know)

What is Hierarchical Clustering?

[Content to be filled in]

Agglomerative vs Divisive

[Content to be filled in]

Descriptive vs Inferential Statistics

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Descriptive statistics summarize and describe the features of a dataset (mean, median, charts), while inferential statistics make predictions or inferences about a population based on a sample (hypothesis testing, confidence intervals). Descriptive = “What happened?” | Inferential = “What does this mean for the bigger picture?”

🟦 Core Notes (Must-Know)

What are Descriptive Statistics?

[Content to be filled in]

What are Inferential Statistics?

[Content to be filled in]

Drawing General Conclusions from Data (EDA)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

After completing EDA, you must synthesize findings into conclusions that guide modeling decisions. Key outputs: (1) Data quality assessment, (2) Feature insights (which matter, which don’t), (3) Recommended transformations, (4) Potential model approaches, (5) Known limitations. Good conclusions bridge EDA and modeling.

🟦 Core Notes (Must-Know)

What to Conclude From EDA

Data Quality Summary

[Content to be filled in]

Feature Insights

[Content to be filled in]

Distribution Patterns

[Content to be filled in]

Duplicate Treatment

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Duplicates are rows that appear multiple times in a dataset. Types: (1) Exact duplicates (all columns identical), (2) Partial duplicates (key columns identical). Detection: df.duplicated(). Treatment depends on context: remove if errors, keep if legitimate (e.g., multiple purchases by same customer). Always investigate before blindly dropping.

🟦 Core Notes (Must-Know)

Types of Duplicates

[Content to be filled in]

Why Duplicates Occur

[Content to be filled in]

Detection Methods

[Content to be filled in]

EDA General Steps - Master Checklist

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

EDA (Exploratory Data Analysis) is the systematic examination of data before modeling. Standard workflow: (1) Load and understand structure, (2) Check data types and memory, (3) Identify missing values, (4) Detect duplicates, (5) Find outliers, (6) Analyze distributions (univariate), (7) Explore relationships (bivariate/multivariate), (8) Document findings. EDA informs cleaning, feature engineering, and model selection.

🟦 Core Notes (Must-Know)

The EDA Checklist

Step 1: Load and Understand Structure

[Content to be filled in]

Ensemble Methods Overview

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Ensemble methods combine multiple models to improve performance. Two main types: (1) Bagging (parallel, reduce variance) - Random Forest, (2) Boosting (sequential, reduce bias) - AdaBoost, Gradient Boosting, XGBoost. Generally outperform single models. Trade-off: better performance but less interpretable, slower, more complex.

🟦 Core Notes (Must-Know)

What are Ensemble Methods?

[Content to be filled in]

Bagging vs Boosting

[Content to be filled in]

Common Ensemble Algorithms

[Content to be filled in]

F-test - Comparing Variances

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

The F-test compares the variances of two groups to determine whether they differ significantly. The test statistic is the ratio of the two sample variances: F = variance₁ / variance₂. It is commonly used to check the equal-variance assumption before a t-test and is the foundation of ANOVA. The F-distribution is right-skewed and takes only non-negative values.
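A minimal sketch of the ratio itself, using the standard library's sample variance. The two samples are invented and `f_statistic` is a hypothetical helper; converting F into a p-value needs the F-distribution, which is not covered here.

```python
import statistics


def f_statistic(sample_a, sample_b):
    """Ratio of two sample variances (n - 1 in the denominator of each)."""
    return statistics.variance(sample_a) / statistics.variance(sample_b)


wide = [2, 4, 6, 8, 10]    # sample variance 10
narrow = [4, 5, 6, 7, 8]   # sample variance 2.5
f_value = f_statistic(wide, narrow)
print(f_value)
```

A ratio near 1 suggests similar spread; the further F drifts from 1, the stronger the evidence that the variances differ.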

🟦 Core Notes (Must-Know)

What is an F-test?

[Content to be filled in]

F-statistic Formula

[Content to be filled in]

F-distribution

[Content to be filled in]

FPR (False Positive Rate)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

FPR (False Positive Rate) measures “of all actual negatives, how many did we incorrectly predict as positive?” Formula: FP / (FP + TN). Also called “fall-out”. Used in ROC curves (FPR on x-axis, TPR/Recall on y-axis). Lower FPR is better. Complement of specificity (Specificity = 1 - FPR = TN / (TN + FP)).

🟦 Core Notes (Must-Know)

Formula

[Content to be filled in]

Interpretation

[Content to be filled in]

Gradient Boosting

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Gradient Boosting builds models sequentially, each trying to correct the errors (residuals) of the previous ensemble. It uses gradient descent to minimize a loss function. More flexible than AdaBoost: any differentiable loss can be plugged in, so it handles regression as well as classification. Hyperparameters: learning rate (shrinkage), n_estimators, max_depth. Pros: state-of-the-art performance. Cons: prone to overfitting, slow training, many hyperparameters to tune.

🟦 Core Notes (Must-Know)

How Gradient Boosting Works

[Content to be filled in]

The Algorithm

[Content to be filled in]

Grid Search CV (Hyperparameter Tuning)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Grid Search CV exhaustively tries all combinations of hyperparameters you specify, using cross-validation to evaluate each combination. Returns the best parameters and best score. Automates hyperparameter tuning. Pros: thorough, easy to use. Cons: computationally expensive (exponential with parameters). Alternative: RandomizedSearchCV for faster search.
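The "all combinations" part is a Cartesian product. In this sketch, `iter_grid`, `grid_search`, and `toy_score` are hypothetical; in practice the scoring function would be a cross-validated model score rather than this made-up formula.

```python
from itertools import product


def iter_grid(param_grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(param_grid, score_fn):
    """Return (best_params, best_score) over the full grid."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(param_grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Stand-in for a cross-validated score; peaks at max_depth=4, learning_rate=0.1.
    return -abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)


grid = {"max_depth": [2, 4, 8], "learning_rate": [0.01, 0.1]}
print(len(list(iter_grid(grid))))   # 3 * 2 = 6 combinations
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

With 5-fold CV those 6 combinations become 30 model fits, and every extra hyperparameter multiplies that count, which is where the cost comes from.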

🟦 Core Notes (Must-Know)

What is Grid Search?

[Content to be filled in]

How It Works

[Content to be filled in]

Parameters vs Hyperparameters

[Content to be filled in]

How to Study This Blog for Data Science Interviews

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

This blog is designed as a structured interview prep system, not a traditional textbook. Each lesson follows a 5-part color-coded template: Summary (skim) → Core Notes (must-know) → Interview Triggers (what’s tested) → Common Mistakes (traps) → Mini Example (application). Use tags and reading paths to navigate based on your timeline.

🟦 Core Study Strategies

Strategy 1: Sprint Mode (1-2 weeks before interview)

Follow Path 1 from the roadmap
Focus on 🟪 Summaries and 🟨 Interview Triggers only
Skip deep dives; prioritize breadth over depth
Review all 🟥 Common Mistakes sections

Strategy 2: Deep Study Mode (1-3 months prep)

Follow Path 2 from the roadmap
Read every section in sequence (A → I)
Complete all 🟩 Mini Examples with actual code
Create flashcards from 🟦 Core Notes

Strategy 3: Topic-Specific Review

Use tags to find related content:
- statistics - Statistical foundations
- hypothesis-testing - All hypothesis tests
- eda - Exploratory data analysis
- ml-supervised / ml-clustering - Machine learning
- evaluation - Model metrics
- ensembles - Ensemble methods
Search by category: “Data Science”

🟨 Interview Triggers (When to Use This Blog)

Use this blog when you need to:

Hypothesis Testing - General Step-by-Step Framework

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Hypothesis testing is a systematic way to determine if observed data provides enough evidence to reject a claim (null hypothesis). The universal framework: (1) State hypotheses (H₀ and H₁), (2) Choose significance level (α), (3) Calculate test statistic, (4) Find p-value or critical value, (5) Make decision (reject or fail to reject H₀), (6) Interpret in context. This same structure applies to all tests.
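The six steps can be walked through with a two-sided one-sample z-test. This is a sketch under stated assumptions: a known population σ, invented numbers, and a hypothetical helper name.

```python
import math


def one_sample_z_test(sample_mean, mu_0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mu = mu_0 against H1: mu != mu_0."""
    # Steps 1-2 are the arguments: mu_0 defines H0, alpha is the significance level.
    # Step 3: test statistic.
    z = (sample_mean - mu_0) / (sigma / math.sqrt(n))
    # Step 4: two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Step 5: decision.
    reject = p_value < alpha
    return z, p_value, reject


z, p_value, reject = one_sample_z_test(sample_mean=52, mu_0=50, sigma=10, n=100)
print(round(z, 2), round(p_value, 4), reject)
```

Step 6 is the part code cannot do: stating what "reject H₀" means for the actual question, for example "the average differs from 50 at the 5% level".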

🟦 Core Notes (Must-Know)

The 6-Step Framework

Step 1: State the Hypotheses

[Content to be filled in]

Imbalanced Data Overview

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Imbalanced data occurs when classes have very different frequencies (e.g., 95% no-fraud, 5% fraud). Problems: model biased toward majority class, accuracy misleading. Solutions: (1) Resampling (over/undersample), (2) Different metrics (precision/recall/F1, not accuracy), (3) Class weights, (4) Anomaly detection. Choose based on data size and importance of minority class.

🟦 Core Notes (Must-Know)

What is Imbalanced Data?

[Content to be filled in]

Why It’s a Problem

[Content to be filled in]

K-Means Clustering

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

K-Means partitions data into K clusters by minimizing within-cluster variance. Algorithm: (1) Initialize K centroids randomly, (2) Assign points to nearest centroid, (3) Update centroids, (4) Repeat until convergence. Choose K using elbow method or silhouette score. Pros: fast, scalable. Cons: need to specify K, assumes spherical clusters, sensitive to initialization and outliers.
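The four steps map onto a short loop. In this sketch the starting centroids are passed in rather than chosen at random so the run is reproducible; that choice, the toy points, and the `k_means` name are all assumptions for illustration.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm: returns (final_centroids, clusters)."""
    centroids = list(centroids)  # Step 1: initial centroids (supplied by caller).
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid alone
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Different starting centroids can converge to different answers, which is the "sensitive to initialization" caveat in practice.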

🟦 Core Notes (Must-Know)

How K-Means Works

[Content to be filled in]

Algorithm Steps

[Content to be filled in]

K-Nearest Neighbors (KNN)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

KNN classifies a data point by looking at the K nearest neighbors and taking a majority vote (classification) or average (regression). It’s a “lazy learner” - no training phase, just stores data. Pros: simple, no assumptions, works for multi-class. Cons: slow prediction, sensitive to scale and irrelevant features, needs optimal K. Always scale features first!
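A minimal classification sketch in plain Python. The training points and labels are made up, `knn_predict` is a hypothetical helper, and feature scaling is skipped because both toy features already share a scale.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    top_labels = [label for _, label in by_distance[:k]]
    return Counter(top_labels).most_common(1)[0][0]


train_X = [(1, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (2, 1)))  # a
print(knn_predict(train_X, train_y, (8, 7)))  # b
```

Notice there is no `fit` step: all the work (sorting every training point by distance) happens at prediction time, which is why prediction is the slow part.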

🟦 Core Notes (Must-Know)

How KNN Works

[Content to be filled in]

Lasso Regression (L1 Regularization)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Lasso Regression adds L1 penalty (sum of absolute coefficients) to linear regression. Can shrink coefficients to EXACTLY zero, performing automatic feature selection. Hyperparameter α controls strength. Use when you have many features and want to identify the important ones. Sparse solutions make model more interpretable. Must scale features first.
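To see why L1 produces exact zeros, it helps to put the penalty next to the soft-thresholding step. This is a sketch, not the source's method: soft-thresholding is the per-coefficient update used by coordinate-descent Lasso solvers on standardized features, and all names and numbers below are illustrative.

```python
def l1_penalty(coefs, alpha):
    """Lasso penalty: alpha * sum of absolute coefficients."""
    return alpha * sum(abs(b) for b in coefs)


def l2_penalty(coefs, alpha):
    """Ridge penalty, for comparison: alpha * sum of squared coefficients."""
    return alpha * sum(b * b for b in coefs)


def soft_threshold(value, alpha):
    """Shrink `value` toward zero by alpha; anything within alpha of zero becomes 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


coefs = [3.0, -2.0, 0.3]
print(l1_penalty(coefs, 0.1), l2_penalty(coefs, 0.1))
print([soft_threshold(b, 0.5) for b in coefs])  # [2.5, -1.5, 0.0]
```

The small coefficient is set to exactly zero while the large ones are only shrunk, which is the feature-selection behavior described above.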

🟦 Core Notes (Must-Know)

How Lasso Works

[Content to be filled in]

Formula

[Content to be filled in]

Linear Regression

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Linear regression models the relationship between independent variables (X) and a continuous dependent variable (y) using a straight line: y = β₀ + β₁x₁ + … + βₙxₙ. Finds coefficients that minimize error (typically using least squares). Assumptions: linearity, independence, homoscedasticity, normality of residuals. Evaluate with R², RMSE, MAE. Simple but powerful baseline model.
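For a single feature, least squares has a closed form. A minimal sketch with invented data and a hypothetical function name:

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x), both without the 1/n factor.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Points that lie exactly on y = 1 + 2x.
intercept, slope = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)  # 1.0 2.0
```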

🟦 Core Notes (Must-Know)

What is Linear Regression?

[Content to be filled in]

Linear Regression Score (R² and Adjusted R²)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

R² (coefficient of determination) measures the proportion of variance in the target explained by the model. Range: -∞ to 1 (1 = perfect fit, 0 = model is no better than mean, negative = worse than mean). Formula: 1 - (SS_residual / SS_total). Adjusted R² penalizes adding useless features. Use for model comparison, but not as the only metric.
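The three regimes (perfect, no better than the mean, worse than the mean) are easy to reproduce. A sketch with toy numbers; the adjusted R² line uses the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of features.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R^2 for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


y = [1, 2, 3, 4, 5]
print(r_squared(y, [1, 2, 3, 4, 5]))  # 1.0, perfect fit
print(r_squared(y, [3, 3, 3, 3, 3]))  # 0.0, same as predicting the mean
print(r_squared(y, [5, 4, 3, 2, 1]))  # -3.0, worse than the mean
print(adjusted_r_squared(0.8, n_samples=50, n_features=5))
```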

🟦 Core Notes (Must-Know)

What is R²?

[Content to be filled in]

Logistic Regression

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Logistic regression predicts binary outcomes (0/1, yes/no) by estimating probabilities using the sigmoid function. Despite the name, it’s CLASSIFICATION, not regression. It outputs a probability between 0 and 1; apply a threshold (default 0.5) to make the final decision. Pros: interpretable, outputs probabilities, works well for linearly separable data. Evaluate with accuracy, precision, recall, ROC-AUC.
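The probability-then-threshold flow in a few lines. The weights and bias here are invented rather than fitted, and `predict` is a hypothetical helper.

```python
import math


def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, class) for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


print(sigmoid(0))                          # 0.5
print(predict([2.0], [1.0], bias=-1.0))    # z = 1, class 1
print(predict([0.0], [1.0], bias=-1.0))    # z = -1, class 0
```

Raising the threshold above 0.5 makes the model more conservative about predicting the positive class, which is the usual lever for trading recall for precision.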

🟦 Core Notes (Must-Know)

What is Logistic Regression?

[Content to be filled in]

Sigmoid Function

[Content to be filled in]

MAPE (Mean Absolute Percentage Error)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

MAPE expresses error as a percentage of actual values: (100/n) * Σ|actual - predicted|/|actual|. Useful when relative error matters more than absolute error. Pros: scale-independent, interpretable. Cons: undefined when actual = 0, and asymmetric: for non-negative forecasts an under-prediction costs at most 100% while an over-prediction is unbounded, so optimizing MAPE biases a model toward low forecasts. Use for business metrics (sales, revenue) where % error is meaningful.
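A direct translation of the formula, with the zero-actual case handled explicitly. The numbers are invented and raising `ValueError` is one possible design choice, not a standard.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is 0")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100 / len(actual) * total


# Both forecasts are off by 10% of the actual value.
score = mape([100, 200], [110, 180])
print(round(score, 6))  # 10.0
```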

🟦 Core Notes (Must-Know)

Formula

[Content to be filled in]

Interpretation

[Content to be filled in]

Mean, Median, and Mode - Measures of Central Tendency

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Mean (average), median (middle value), and mode (most frequent) are the three ways to describe where the “center” of your data lies. Mean is sensitive to outliers, median is robust to them, and mode works for categorical data. In interviews, you’ll be asked when to use each.
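Python's standard library covers all three, which makes the outlier effect easy to see. The salary figures are invented.

```python
import statistics

salaries = [40, 45, 50, 55, 60]
with_outlier = salaries + [1000]  # one extreme value

print(statistics.mean(salaries), statistics.median(salaries))
print(statistics.mean(with_outlier), statistics.median(with_outlier))

# Mode is the only one of the three that works on categorical data.
favorite = statistics.mode(["red", "blue", "red", "green"])
print(favorite)
```

One extreme value drags the mean from 50 to over 200, while the median only moves to 52.5.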

🟦 Core Notes (Must-Know)

Mean (Average)

[Content to be filled in]

Median (Middle Value)

[Content to be filled in]

Mode (Most Frequent)

[Content to be filled in]

Non-Linear Modeling Overview

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Not all relationships are linear. Non-linear modeling captures curves and interactions. Options: (1) Polynomial features (x, x², x³), (2) Interaction terms (x₁*x₂), (3) Non-linear algorithms (trees, neural nets), (4) Transformations (log, sqrt). Still use linear regression with polynomial features - it’s linear in coefficients, not features.

🟦 Core Notes (Must-Know)

When Linear Models Fail

[Content to be filled in]

Approaches to Non-Linearity

[Content to be filled in]

Normal Distribution (Gaussian Distribution)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

The normal distribution (bell curve) is a symmetric, continuous probability distribution defined by its mean (μ) and standard deviation (σ). It’s the foundation of many statistical methods because many natural phenomena approximate it, and the Central Limit Theorem says sample means tend toward normality. The 68-95-99.7 rule describes how data spreads.
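The 68-95-99.7 rule can be checked from the error function: for a normal variable, the probability of landing within k standard deviations of the mean is erf(k / √2). A small sketch; `prob_within` is a hypothetical helper.

```python
import math


def prob_within(k):
    """P(|X - mu| <= k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

This prints roughly 0.6827, 0.9545, and 0.9973, matching the rule.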

🟦 Core Notes (Must-Know)

What is the Normal Distribution?

[Content to be filled in]

Key Properties

[Content to be filled in]

Null and Missing Value Treatment

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Missing values are inevitable in real datasets. Treatment options: (1) Drop (if < 5% missing or MCAR), (2) Impute (mean/median/mode for numerical, mode for categorical, or advanced methods like KNN/MICE), (3) Create missing indicator (if missingness is informative). Choice depends on missingness mechanism (MCAR, MAR, MNAR) and percentage missing.

🟦 Core Notes (Must-Know)

Types of Missingness

[Content to be filled in]

MCAR (Missing Completely At Random)
MAR (Missing At Random)
MNAR (Missing Not At Random)

Detection Strategies

[Content to be filled in]

Outlier Treatment

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Outliers are data points significantly different from others. Detection methods: (1) IQR method (flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR), (2) Z-score (|z| > 3), (3) Visual inspection (box plots, scatter plots). Treatment: (1) Remove if errors, (2) Cap/floor (winsorization), (3) Transform (log), (4) Keep if legitimate. NEVER blindly remove - investigate first!
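A minimal sketch of the IQR method using the standard library. The data is invented, `iqr_outliers` is a hypothetical helper, and quartile values depend slightly on the interpolation method (this uses the default of `statistics.quantiles`).

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
flagged = iqr_outliers(readings)
print(flagged)  # [95]
```

The function only flags values; whether 95 is a typo or a legitimate reading is still a judgment call that needs investigation.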

🟦 Core Notes (Must-Know)

What are Outliers?

[Content to be filled in]

Types of Outliers

[Content to be filled in]

Overfitting

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Overfitting occurs when a model learns the training data too well, including noise and outliers, performing poorly on new data. Signs: high training accuracy, low validation/test accuracy. Causes: model too complex, too little data, training too long. Solutions: regularization, more data, simpler model, cross-validation, early stopping, dropout (neural nets).

🟦 Core Notes (Must-Know)

What is Overfitting?

[Content to be filled in]

How to Detect Overfitting

[Content to be filled in]

Polynomial Features

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Polynomial features transform input features into higher-degree terms (x → x, x²) and interactions (x₁, x₂ → x₁, x₂, x₁², x₂², x₁x₂). Allows linear regression to fit curves. Degree 2 = quadratic, degree 3 = cubic. Warning: the number of features grows rapidly with both degree and input count (2 features at degree 3 already give 9 terms, not counting the bias). Use regularization to prevent overfitting. Visualize first to choose appropriate degree.
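Enumerating the terms makes the growth visible. A sketch that lists term names only (bias excluded); `polynomial_terms` is a hypothetical helper, not a library function.

```python
from itertools import combinations_with_replacement


def polynomial_terms(feature_names, degree):
    """List every product of features with total degree 1..degree."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(feature_names, d):
            terms.append("*".join(combo))
    return terms


print(polynomial_terms(["x1", "x2"], 2))
# ['x1', 'x2', 'x1*x1', 'x1*x2', 'x2*x2']
print(len(polynomial_terms(["x1", "x2"], 3)))              # 9
print(len(polynomial_terms([f"x{i}" for i in range(10)], 3)))
```

Ten input features at degree 3 already produce a few hundred terms, which is why regularization usually accompanies this transform.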

🟦 Core Notes (Must-Know)

What are Polynomial Features?

[Content to be filled in]

Precision

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Precision measures “of all predicted positives, how many were actually positive?” Formula: TP / (TP + FP). High precision means low false alarm rate. Use when false positives are costly (e.g., spam filter marking important emails as spam, recommending irrelevant products). Trade-off with recall: being more selective (higher precision) means catching fewer positives (lower recall).
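The trade-off with recall shows up as soon as you move the decision threshold. A sketch with invented scores and labels; `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall after thresholding predicted scores."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 0, 1, 0, 1]
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30]

print(precision_recall(y_true, scores, threshold=0.5))  # (0.75, 0.75)
print(precision_recall(y_true, scores, threshold=0.8))  # (1.0, 0.5)
```

Raising the threshold from 0.5 to 0.8 removes the one false positive (precision goes to 1.0) but also drops a true positive (recall falls to 0.5).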

🟦 Core Notes (Must-Know)

Formula

[Content to be filled in]

Interpretation

[Content to be filled in]

Probability Distributions Overview

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

A probability distribution describes how likely different outcomes are for a random variable. Common discrete distributions include binomial (number of successes in a fixed number of yes/no trials) and Poisson (count of events in a fixed interval, often used for rare events). Common continuous distributions include normal (bell curve) and uniform (equal probability density across a range). Choosing the right distribution is crucial for modeling and hypothesis testing.
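As a rough illustration, the binomial and Poisson probability mass functions can be written directly from their textbook formulas, and the standard library's `statistics.NormalDist` covers the normal case. The function names and parameter values below are illustrative assumptions.

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at an average rate of lam per interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Discrete: probability of exactly 5 heads in 10 fair coin flips.
print(binomial_pmf(5, 10, 0.5))   # 252 / 1024 = 0.24609375

# Discrete: probability of zero events when the average is 2 per interval.
print(poisson_pmf(0, 2.0))        # exp(-2)

# Continuous: probability mass of a standard normal within one SD of the mean.
standard_normal = NormalDist(mu=0.0, sigma=1.0)
within_one_sd = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)
print(round(within_one_sd, 4))    # about 0.6827
```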

🟦 Core Notes (Must-Know)

What is a Probability Distribution?

[Content to be filled in]

Common Discrete Distributions

[Content to be filled in]

Probability Fundamentals

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Probability measures the likelihood of an event occurring, ranging from 0 (impossible) to 1 (certain). Key concepts include sample space, events, independent vs dependent events, and conditional probability. Understanding probability is foundational for hypothesis testing, Bayes theorem, and machine learning.
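A short sketch of sample space, events, conditional probability, and independence, using two fair dice and exact fractions. The helper `prob` and the event functions are hypothetical names introduced for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two dice.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event, given=None):
    """P(event), or P(event | given) when a conditioning event is supplied."""
    outcomes = [o for o in sample_space if given is None or given(o)]
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

def sum_is_8(o):
    return o[0] + o[1] == 8

def first_is_6(o):
    return o[0] == 6

def second_is_even(o):
    return o[1] % 2 == 0

def both(o):
    return first_is_6(o) and second_is_even(o)

print(prob(sum_is_8))                    # 5/36
print(prob(sum_is_8, given=first_is_6))  # 1/6: conditioning changes the answer
# Independent events multiply: P(A and B) == P(A) * P(B).
print(prob(both) == prob(first_is_6) * prob(second_is_even))  # True
```

Because knowing the first die changes P(sum is 8) from 5/36 to 1/6, those two events are dependent, while the first and second dice are independent of each other.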

🟦 Core Notes (Must-Know)

Basic Definitions

[Content to be filled in]

Sample Space and Events

[Content to be filled in]

Probability Rules

[Content to be filled in]

Random Forest

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Random Forest builds multiple decision trees on random subsets of data and features, then averages predictions (regression) or votes (classification). Bagging reduces variance and overfitting. Pros: high accuracy, handles non-linearity, robust to outliers, feature importance. Cons: less interpretable, slower, memory intensive. Often a go-to algorithm for tabular data.

🟦 Core Notes (Must-Know)

How Random Forest Works

[Content to be filled in]

Key Hyperparameters

[Content to be filled in]

Random Variables - Discrete vs Continuous

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

A random variable is a variable whose values are determined by chance. Discrete random variables take countable values (e.g., number of customers, dice rolls), while continuous random variables can take any value within a range (e.g., height, temperature). This distinction determines which probability distributions and statistical methods you use.

🟦 Core Notes (Must-Know)

What is a Random Variable?

[Content to be filled in]

Discrete Random Variables

[Content to be filled in]

Range, IQR, Variance, and Standard Deviation - Measures of Spread

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Measures of spread tell you how dispersed your data is. Range (max - min) is simple but sensitive to outliers. IQR (interquartile range) is robust. Variance measures average squared deviation from the mean. Standard deviation is the square root of variance and shares the same units as your data, making it most interpretable.
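The four measures can be computed with the standard library's `statistics` module. This sketch assumes population variance and standard deviation (dividing by n) and the module's default quartile method; the function name and data are illustrative.

```python
import statistics

def spread_summary(values):
    """Range, IQR, population variance, and population standard deviation."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.pvariance(values),  # average squared deviation
        "std_dev": statistics.pstdev(values),      # same units as the data
    }

data = [2, 4, 4, 4, 5, 5, 7, 9]
summary = spread_summary(data)
print(summary)  # range 7, IQR 2.5, variance 4, standard deviation 2.0
```

Use `statistics.variance` and `statistics.stdev` instead when the data is a sample and you want the n - 1 denominator.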

🟦 Core Notes (Must-Know)

Range

[Content to be filled in]

Interquartile Range (IQR)

[Content to be filled in]

Recall (Sensitivity, True Positive Rate)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Recall measures “of all actual positives, how many did we find?” Formula: TP / (TP + FN). High recall means low miss rate. Use when false negatives are costly (e.g., disease detection - missing a sick patient is worse than false alarm). Trade-off with precision: being less selective (higher recall) means more false alarms (lower precision).
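The trade-off with precision shows up when the decision threshold moves. A minimal sketch with made-up labels and scores; the function name and the thresholds are assumptions for illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

strict = precision_recall_at(y_true, scores, threshold=0.75)
lenient = precision_recall_at(y_true, scores, threshold=0.35)
print(strict)   # precision 1.0, recall 2/3: selective, but misses a positive
print(lenient)  # precision 0.75, recall 1.0: finds everything, one false alarm
```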

🟦 Core Notes (Must-Know)

Formula

[Content to be filled in]

Interpretation

[Content to be filled in]

Regression Metrics Overview

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Regression metrics measure how well your model predicts continuous values. Common metrics: RMSE (penalizes large errors, same units as the target), MAE (average absolute error, more robust to outliers than RMSE), MAPE (percentage error, undefined when an actual value is 0), R² (share of variance explained; at most 1, and negative when the model does worse than predicting the mean). Choose based on context: RMSE for penalizing large errors, MAE for a typical-error view, MAPE for relative error, R² for overall fit.
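A compact sketch that computes all four metrics from their standard definitions. The function name and the toy numbers are illustrative; the sketch assumes no actual value is 0 (MAPE) and that the actual values are not all identical (R²).

```python
import math

def regression_metrics(actual, predicted):
    """MAE, RMSE, MAPE (in percent), and R^2 for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "mape": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "r2": 1 - ss_res / ss_tot,
    }

actual = [100, 200, 300, 400]
good = regression_metrics(actual, [110, 190, 310, 390])
bad = regression_metrics(actual, [400, 300, 200, 100])
print(good)  # MAE 10, RMSE 10, MAPE about 5.2%, R^2 0.992
print(bad)   # R^2 is -3.0: worse than always predicting the mean
```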

🟦 Core Notes (Must-Know)

Common Regression Metrics

[Content to be filled in]

When to Use Each Metric

[Content to be filled in]

Regularization Overview

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Regularization adds a penalty term to the loss function to discourage complex models and prevent overfitting. Main types: L1 (Lasso) adds |coefficient| penalty, L2 (Ridge) adds coefficient² penalty. L1 can zero out coefficients (feature selection), L2 shrinks all coefficients. Elastic Net combines both. Hyperparameter λ controls strength.
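One way to see why L1 zeroes coefficients while L2 only shrinks them is the special case of orthonormal features, where each coefficient can be solved separately from its unregularized (OLS) value b. The sketch below assumes the objectives ½(β - b)² + λ|β| for L1 and ½(β - b)² + (λ/2)β² for L2; the function names and numbers are illustrative, and real data rarely has orthonormal features.

```python
def lasso_shrink(b, lam):
    """L1 solution (soft-thresholding): small coefficients become exactly 0."""
    if abs(b) <= lam:
        return 0.0
    return b - lam if b > 0 else b + lam

def ridge_shrink(b, lam):
    """L2 solution: every coefficient is scaled toward 0 but stays non-zero."""
    return b / (1.0 + lam)

ols_coefficients = [3.0, -0.4, 0.2]
lam = 0.5

lasso = [lasso_shrink(b, lam) for b in ols_coefficients]
ridge = [ridge_shrink(b, lam) for b in ols_coefficients]
print(lasso)  # [2.5, 0.0, 0.0]: the two weak features are dropped
print(ridge)  # all three shrink by the same factor, none reaches 0
```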

🟦 Core Notes (Must-Know)

What is Regularization?

[Content to be filled in]

Why Regularization Works

[Content to be filled in]

Reusable Lesson Template (Data Science Interview Prep)

Sat, 10 Jan 2026 00:00:00 +0000

[Topic Name Here]

🟪 1-Minute Summary

A 2-3 sentence explanation that captures the absolute essence. If you only read this section, you’d know enough to recognize when the topic is mentioned in an interview.

Example: “Linear Regression models the relationship between a dependent variable and one or more independent variables using a straight line. The goal is to find the best-fit line that minimizes prediction errors. It’s used when you need to predict continuous numerical values.”

Ridge Regression (L2 Regularization)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Ridge Regression adds L2 penalty (sum of squared coefficients) to linear regression. Shrinks all coefficients toward zero but never exactly zero. Good for multicollinearity. Hyperparameter α controls strength (higher α = more regularization). Must scale features first. Reduces variance at cost of slight bias. Use when you want to keep all features but reduce overfitting.
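For a single centered feature with no intercept, minimizing Σ(y - βx)² + αβ² has the closed form β = Σxy / (Σx² + α), which makes the effect of α easy to see. A minimal sketch under those assumptions; the function name and data are made up for illustration.

```python
def ridge_coefficient(x, y, alpha):
    """Closed-form ridge slope for one centered feature and no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

# Centered data lying exactly on y = 2x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]

for alpha in (0.0, 10.0, 90.0):
    print(alpha, ridge_coefficient(x, y, alpha))
# alpha = 0  -> 2.0 (ordinary least squares)
# alpha = 10 -> 1.0
# alpha = 90 -> 0.2 (shrinks toward 0 but never reaches it)
```

Because α is added to Σx², the amount of shrinkage depends on the scale of x, which is why features should be standardized before fitting.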

🟦 Core Notes (Must-Know)

How Ridge Works

[Content to be filled in]

RMSE (Root Mean Squared Error)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

RMSE measures the average magnitude of prediction errors, penalizing large errors more heavily due to squaring. Formula: √(Σ(actual - predicted)² / n). Same units as target variable. Lower is better. Use when large errors are particularly bad (e.g., price prediction). More sensitive to outliers than MAE.
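The sensitivity to large errors is easiest to see by comparing two sets of predictions with the same MAE. A small sketch with made-up numbers; the function names are illustrative.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt(sum((actual - predicted)^2) / n)."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error, for comparison."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 10, 10, 10]
evenly_off = [12, 12, 12, 12]   # four errors of 2
one_big_miss = [10, 10, 10, 18] # three perfect predictions, one error of 8

print(mae(actual, evenly_off), rmse(actual, evenly_off))      # 2.0 2.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))  # 2.0 4.0
```

Both prediction sets have the same total absolute error, but squaring makes the single large miss count for more, so RMSE doubles.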

🟦 Core Notes (Must-Know)

Formula

[Content to be filled in]

Interpretation

[Content to be filled in]

When to Use RMSE

[Content to be filled in]

ROC Curve and AUC (Area Under the Curve)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

ROC (Receiver Operating Characteristic) curve plots TPR (recall) vs FPR at all classification thresholds. AUC (Area Under Curve) summarizes ROC in one number (0 to 1). AUC = 1 is perfect, 0.5 is random guessing. Use to evaluate model’s ability to distinguish classes across all thresholds, independent of class distribution. Better than accuracy for imbalanced data.
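AUC also has a direct interpretation: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it that way by comparing every positive/negative pair, which is fine for small data; the function name and scores are illustrative.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(roc_auc(y_true, scores))                  # 8 of 9 pairs ranked correctly
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # perfect separation -> 1.0
```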

🟦 Core Notes (Must-Know)

What is ROC Curve?

[Content to be filled in]

Standard Error (SE)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Standard Error (SE) measures the variability of a sample statistic (like the sample mean). Formula: SE = σ / √n, estimated as s / √n when the population σ is unknown. It gets smaller as sample size increases. SE is crucial for calculating confidence intervals and test statistics. Don’t confuse with standard deviation: SD describes data spread, SE describes estimate precision.
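A brief sketch of both forms of the formula: with a known σ, and with σ estimated by the sample standard deviation. The function names and numbers are illustrative assumptions.

```python
import math
import statistics

def standard_error_known_sigma(sigma, n):
    """SE of the mean when the population standard deviation is known."""
    return sigma / math.sqrt(n)

def standard_error(sample):
    """SE of the mean, estimating sigma with the sample SD (n - 1 denominator)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Quadrupling the sample size halves the standard error.
print(standard_error_known_sigma(10, 100))  # 1.0
print(standard_error_known_sigma(10, 400))  # 0.5

sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_error(sample))               # about 0.756
```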

🟦 Core Notes (Must-Know)

What is Standard Error?

[Content to be filled in]

Formula

[Content to be filled in]

Standard Normal Distribution and Z-Scores

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

The standard normal distribution is a special case of the normal distribution with mean = 0 and standard deviation = 1. Z-scores transform any normal distribution to this standard form, allowing you to compare values from different distributions and look up probabilities in standard tables. Formula: z = (x - μ) / σ.
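A minimal sketch of the z-score formula, with `statistics.NormalDist` standing in for a printed z-table. The exam scores, means, and standard deviations are made up for illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations x lies above (or below) the mean."""
    return (x - mu) / sigma

# Compare scores from two exams with different distributions.
z_exam_a = z_score(85, mu=70, sigma=10)  # 1.5
z_exam_b = z_score(90, mu=80, sigma=10)  # 1.0
print(z_exam_a, z_exam_b)  # the 85 is the stronger result relative to its exam

# Table lookup: P(Z <= 1.5) for a standard normal variable.
p_below = NormalDist().cdf(z_exam_a)
print(round(p_below, 4))   # about 0.9332
```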

🟦 Core Notes (Must-Know)

What is the Standard Normal Distribution?

[Content to be filled in]

Supervised Learning Overview

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Supervised learning trains models on labeled data (input-output pairs) to predict outcomes for new data. Two types: Regression (continuous target: price, temperature) and Classification (categorical target: yes/no, categories). Process: train on labeled data → validate → test on unseen data. Success requires good features, sufficient data, and appropriate algorithm selection.

🟦 Core Notes (Must-Know)

What is Supervised Learning?

[Content to be filled in]

Regression vs Classification

[Content to be filled in]

t-test and z-test - Comparing Group Means

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

t-tests and z-tests compare means between groups or against a known value. Use z-test when you know the population standard deviation and have a large sample (n > 30). Use t-test when you don’t know population σ or have small samples. Common types: one-sample, two-sample (independent), and paired t-tests.
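The two one-sample test statistics differ only in which standard deviation goes into the standard error. A sketch using the standard library; the function names, the sample, and the assumed known σ = 2 are illustrative. The standard library has no t-distribution, so only the z-test p-value is computed here.

```python
import math
import statistics
from statistics import NormalDist

def one_sample_t(sample, mu0):
    """t statistic and degrees of freedom, using the sample SD."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.fmean(sample) - mu0) / se, n - 1

def one_sample_z(sample, mu0, sigma):
    """z statistic and two-sided p-value, using a known population SD."""
    n = len(sample)
    z = (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # sample mean is 5
t_stat, dof = one_sample_t(sample, mu0=4)
z_stat, p_value = one_sample_z(sample, mu0=4, sigma=2)
print(t_stat, dof)      # about 1.32 with 7 degrees of freedom
print(z_stat, p_value)  # about 1.41, p about 0.157
```

To turn the t statistic into a p-value, look it up against a t-distribution with the returned degrees of freedom (a table or a statistics library).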

🟦 Core Notes (Must-Know)

When to Use t-test vs z-test

[Content to be filled in]

Types of t-tests

[Content to be filled in]

Types of Probability

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

There are three main types of probability: Classical (theoretical, based on equally likely outcomes), Empirical (based on observed data/experiments), and Subjective (based on judgment/belief). Data scientists primarily use empirical probability when working with real-world data.

🟦 Core Notes (Must-Know)

Classical Probability

[Content to be filled in]

Empirical Probability

[Content to be filled in]

Subjective Probability

[Content to be filled in]

When to Use Each

[Content to be filled in]

Underfitting

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Underfitting occurs when a model is too simple to capture underlying patterns in data. Both training and validation performance are poor. Causes: model too simple, insufficient features, over-regularization. Solutions: more complex model, add features, reduce regularization, train longer. Less common than overfitting but equally problematic.

🟦 Core Notes (Must-Know)

What is Underfitting?

[Content to be filled in]

How to Detect Underfitting

[Content to be filled in]

Causes of Underfitting

[Content to be filled in]

Undersampling

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

Undersampling reduces the majority class to match minority class size. Simple and fast. Types: random undersampling, Tomek links, NearMiss. Pros: faster training, reduces class imbalance. Cons: loses information, may underfit. Use when you have abundant data and can afford to discard some. Alternative: oversampling (SMOTE) when data is limited.
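Random undersampling, the simplest of the listed types, can be sketched with the standard library. The function name, the seed parameter, and the 10-versus-90 toy dataset are assumptions for illustration; Tomek links and NearMiss need distance computations and are not shown.

```python
import random
from collections import Counter, defaultdict

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = min(len(members) for members in by_class.values())
    kept_rows, kept_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):  # sample without replacement
            kept_rows.append(row)
            kept_labels.append(label)
    return kept_rows, kept_labels

rows = list(range(100))
labels = [1] * 10 + [0] * 90  # 10 minority rows, 90 majority rows

balanced_rows, balanced_labels = random_undersample(rows, labels)
counts = Counter(balanced_labels)
print(counts)  # 10 rows of each class; 80 majority rows were discarded
```

Apply this to the training split only, so the validation and test sets keep the real class distribution.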

🟦 Core Notes (Must-Know)

What is Undersampling?

[Content to be filled in]

Types of Undersampling

[Content to be filled in]

Work Dashboard

Sat, 10 Jan 2026 00:00:00 +0000

Content Status Dashboard

Enter password to view all posts and their completion status.

XGBoost (Extreme Gradient Boosting)

Sat, 10 Jan 2026 00:00:00 +0000

🟪 1-Minute Summary

XGBoost is an optimized implementation of gradient boosting with built-in regularization, parallel processing, and tree pruning. Dominates Kaggle competitions. Key features: handles missing values, L1/L2 regularization, early stopping, feature importance. Faster than sklearn’s GradientBoosting. Hyperparameters similar to GB but with extras (reg_alpha, reg_lambda). Default choice for structured/tabular data competitions.

🟦 Core Notes (Must-Know)

What Makes XGBoost Special?

[Content to be filled in]

Key Features

[Content to be filled in]

Schedule a Meeting

Sun, 26 Oct 2025 00:00:00 +0000

Schedule a Meeting

Welcome! Use the calendar below to book a Google Hangout or video call with me. Pick a time that works for you, and you’ll receive a confirmation email with the meeting link.

Click the button below to book a meeting:

Book a Meeting

My Calendar (Booked & Upcoming Slots)

If you have any questions or need a different time, please contact me.

Setting Up Your Personal AI Playground: OpenWebUI + LiteLLM + Multiple LLM Models

Tue, 30 Sep 2025 00:00:00 +0000

Introduction

“The best investment you can make is in tools that create leverage for yourself.” - Naval Ravikant

Have you ever wanted to try out different AI models without paying for multiple subscriptions? Or perhaps share access to these powerful tools with family members without breaking the bank?

I recently discovered a brilliant solution after watching a NetworkChuck video: using APIs for various LLM (Large Language Model) services and displaying them all in one interface through OpenWebUI. The best part? You can evaluate all these models with just $5 worth of API credits and share access with your entire family!

Arun Murali - Senior Staff Engineer

Arun Murali

I build and scale distributed systems that handle millions of transactions. Currently at Gap Inc, I’ve scaled systems 10× in throughput while making them more reliable and faster.

Livermore, CA • arun.murali@outlook.com • LinkedIn • GitHub

What I Do

Backend architecture • Event-driven systems • Performance optimization • Reliability engineering

Core technologies: Java, Spring Boot, Kafka, PostgreSQL, Kubernetes, Python, React

Gap Inc

Staff Software Engineer → Senior Staff Engineer, Jul 2018 – Present

Arun Murali

I Ditched Obsidian Sync and Built My Own — Here's What Actually Happened

The $96/Year Problem

My First Attempt Was Embarrassingly Wrong

I Just Wanted to Review My Chess Games. I Built a Multiplayer App Instead.

It Started With a Simple Frustration

Building a Couple Outfit Configurator with Layered SVG Avatars in React

🎯 What It Does

🏗️ Architecture: Why Layered SVGs?

The Core Idea

How I Built Smart Debt Planner with AI Prompts: FastAPI, Docker, and CI/CD Deployment

🔄 Latest Update (March 2026)

Build a 3D Chess Replay Viewer with WebGPU in Under 30 Minutes

🎯 What You’ll Build

Accuracy

🟪 1-Minute Summary

🟦 Core Notes (Must-Know)

Formula

When to Use Accuracy

AdaBoost (Adaptive Boosting)

🟪 1-Minute Summary

🟦 Core Notes (Must-Know)

How AdaBoost Works

The Algorithm

ANOVA (Analysis of Variance)

🟪 1-Minute Summary

🟦 Core Notes (Must-Know)

What is ANOVA?

Chi-Square Tests (χ² Tests)

🟪 1-Minute Summary

🟦 Core Notes (Must-Know)

What is a Chi-Square Test?

Chi-Square Test of Independence

Classification Report (Precision, Recall, F1-Score)

🟪 1-Minute Summary

🟦 Core Notes (Must-Know)

What’s in a Classification Report?

Metrics Explained

Macro vs Weighted Averages

Clustering Overview - Unsupervised Learning

🟪 1-Minute Summary

🟦 Core Notes (Must-Know)

What is Clustering?

Supervised vs Unsupervised Learning

Confusion Matrix

🟪 1-Minute Summary

🟦 Core Notes (Must-Know)

Structure of Confusion Matrix

The Four Quadrants

Cross Validation (CV)

🟪 1-Minute Summary

🟦 Core Notes (Must-Know)

What is Cross Validation?

Types of Cross Validation

Data Science Interview Cheat Sheets Index

Data Science Interview Cheat Sheets

📊 Statistics & Probability

Cheat Sheet: Descriptive Statistics

Cheat Sheet: Probability Distributions

🔬 Hypothesis Testing Decision Tree

Cheat Sheet: Hypothesis Tests Comparison

🤖 Machine Learning Algorithms

Cheat Sheet: Supervised Learning Algorithm Selection

Cheat Sheet: Clustering Algorithms

📈 Model Evaluation Metrics

Cheat Sheet: Regression Metrics

Cheat Sheet: Classification Metrics

Confusion Matrix Quick Reference

🎯 Overfitting vs Underfitting

🔧 Regularization

🎲 Ensemble Methods

⚡ Quick Interview Formulas

Must-Know Formulas

🗺️ Navigation

Data Science Interview Notes (Full-Stack Roadmap)

Data Science Interview Notes (Full-Stack Roadmap)

0) Start Here (Read First)

A) Statistics Foundations

A1: Statistics Basics

A2: Probability