Module 1 · Foundations of Python and Data Handling

Section 4: Data Cleaning and Preparation

Missing Data · Duplicates · Type Conversion · Outliers · Feature Engineering

🩺 Data Quality Issues
  • Missing values (NaN), duplicates, inconsistent formats
  • The 80/20 rule: 80% of effort is data preparation
  • df.isnull().sum() — count missing per column
  • df.duplicated().sum() — count duplicate rows
🕳️ Handling Missing Data
  • df.dropna() — remove rows/columns with NaN
  • df.fillna(value) — fill with mean, median, or constant
  • df.fillna(method="ffill") — forward fill time series
  • Decide: drop vs. fill based on % missing and context
🔄 Type Conversion
  • df["col"].astype(float) — convert dtype
  • pd.to_datetime(df["date"]) — parse dates
  • Strip whitespace: df["col"].str.strip()
  • Standardize case: .str.lower(), .str.title()
📐 Outlier Detection
  • IQR method: Q1 − 1.5×IQR to Q3 + 1.5×IQR fence
  • Z-score: flag values beyond ±3 standard deviations
  • Visual check: boxplot before deciding to remove
  • Business context determines if outlier is error or signal
🛠️ Feature Engineering
  • Derive new columns: revenue × margin = profit
  • Bin continuous values: pd.cut(), pd.qcut()
  • Encode categoricals: pd.get_dummies()
  • Date features: year, month, day_of_week from datetime
🧪 Lab 4 — Data Cleaning Pipeline
  • Part A: Create a messy dataset with intentional errors
  • Part B: Data quality assessment — find all issues
  • Part C: Step-by-step cleaning — fix each problem
  • Part D: Outlier analysis with IQR and visualization