- Missing values (NaN), duplicates, inconsistent formats
- The 80/20 rule: 80% of effort is data preparation
df.isnull().sum() — count missing per columndf.duplicated().sum() — count duplicate rows
df.dropna() — remove rows/columns with NaNdf.fillna(value) — fill with mean, median, or constantdf.fillna(method="ffill") — forward fill time series- Decide: drop vs. fill based on % missing and context
df["col"].astype(float) — convert dtypepd.to_datetime(df["date"]) — parse dates- Strip whitespace:
df["col"].str.strip() - Standardize case:
.str.lower(), .str.title()
- IQR method: Q1 − 1.5×IQR to Q3 + 1.5×IQR fence
- Z-score: flag values beyond ±3 standard deviations
- Visual check: boxplot before deciding to remove
- Business context determines if outlier is error or signal
- Derive new columns: revenue × margin = profit
- Bin continuous values:
pd.cut(), pd.qcut() - Encode categoricals:
pd.get_dummies() - Date features: year, month, day_of_week from datetime
- Part A: Create a messy dataset with intentional errors
- Part B: Data quality assessment — find all issues
- Part C: Step-by-step cleaning — fix each problem
- Part D: Outlier analysis with IQR and visualization