Statistical Visualization Workshop

Matplotlib and Seaborn for Analysis-Ready Charts

1 Section 1: Basic Plotting with Matplotlib

1.1 Creating Fundamental Plots

Explanation: Start with core plot types to explore relationships: line plots for trends and scatter plots for correlations.

Requirements: Generate a line plot and scatter plot from simple numerical data, customize labels, title, colors, and legend. One snippet, one chart.

import matplotlib.pyplot as plt
import numpy as np

# Sample data
time = np.linspace(0, 10, 100)
signal = np.sin(time) + 0.2 * np.random.randn(100)

# Line plot
plt.figure(figsize=(9, 4))
plt.plot(time, signal, color='steelblue', linewidth=2, label='Sensor signal')
plt.title('Signal Over Time')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Scatter plot
sales = np.linspace(50, 200, 50)
costs = 40 + 1.8 * sales + np.random.normal(0, 30, 50)

plt.figure(figsize=(9, 4))
plt.scatter(sales, costs, c='darkorange', edgecolor='black', alpha=0.8, label='Samples')
plt.title('Costs vs Sales')
plt.xlabel('Sales ($k)')
plt.ylabel('Costs ($k)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Chart Descriptions: The line plot shows a noisy sine-like wave over time, illustrating signal fluctuation. The scatter plot shows a positive linear relationship between sales and costs with moderate spread.
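To put a number on the relationship the scatter plot suggests, one option is to compute the Pearson correlation and a least-squares line with NumPy. The sketch below regenerates the same kind of synthetic data as the scatter plot. The seed and the names `r`, `slope`, and `intercept` are assumptions added for illustration.

```python
import numpy as np

# Assumption: a fixed seed so the numbers are reproducible
rng = np.random.default_rng(0)

# Same recipe as the scatter plot data above
sales = np.linspace(50, 200, 50)
costs = 40 + 1.8 * sales + rng.normal(0, 30, 50)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(sales, costs)[0, 1]

# Least-squares line: polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(sales, costs, 1)

print(f"r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.1f}")
```

The fitted slope should land near the 1.8 used to generate the data, with the gap reflecting the added noise.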

1.2 Understanding Distributions

Explanation: Histograms summarize distribution shape, while box plots highlight quartiles and outliers.

Requirements: Plot a histogram with custom bin selection and a box plot of the same data to contrast views. One snippet, one chart.

data = np.random.normal(loc=70, scale=15, size=200)

plt.figure(figsize=(7, 4))
plt.hist(data, bins=15, color='slateblue', edgecolor='white', alpha=0.8)
plt.title('Histogram of Scores')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(5, 4))
plt.boxplot(data, patch_artist=True,
            boxprops=dict(facecolor='lightgreen', color='darkgreen'),
            medianprops=dict(color='black'))
plt.title('Box Plot of Scores')
plt.ylabel('Score')
plt.show()

Chart Descriptions: The histogram shows a roughly bell-shaped distribution centered near 70 with moderate spread. The box plot displays the median, interquartile range, and points beyond the whiskers as potential outliers.
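The box plot's elements can also be computed directly, which helps when reading the chart. This is a minimal sketch assuming Matplotlib's default whisker rule (1.5 times the interquartile range). The seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumption: seed chosen for reproducibility
data = rng.normal(loc=70, scale=15, size=200)

# Quartiles and interquartile range
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# With the default whis=1.5, whiskers reach to the most extreme data
# points that still lie within 1.5 * IQR of the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones drawn as individual fliers
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1={q1:.1f}, median={median:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}")
print(f"{outliers.size} point(s) outside [{lower_fence:.1f}, {upper_fence:.1f}]")
```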

1.3 Bar Plots and Subplots

Explanation: Bar plots compare categories, while subplots help arrange multiple views in one figure.

Requirements: Build a categorical bar plot and a simple two-panel subplot layout. One snippet, one chart.

categories = ['A', 'B', 'C', 'D']
values = [120, 95, 140, 110]

plt.figure(figsize=(7, 4))
plt.bar(categories, values, color=['#4C72B0', '#55A868', '#C44E52', '#8172B3'])
plt.title('Category Performance')
plt.ylabel('Value')
plt.show()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

x = np.linspace(0, 5, 50)
axes[0].plot(x, np.exp(-x), color='teal', linewidth=2)
axes[0].set_title('Exponential Decay')
axes[0].grid(True, alpha=0.3)

axes[1].scatter(np.random.randn(50), np.random.randn(50), c='firebrick', alpha=0.7)
axes[1].set_title('Random Scatter')
axes[1].grid(True, alpha=0.3)

fig.suptitle('Two-Panel Subplot Layout', y=1.02)
plt.tight_layout()
plt.show()

Chart Descriptions: The bar chart compares category values. The two-panel subplot shows a decaying curve and a random scatter in one figure for side-by-side comparison.

2 Section 2: Statistical Visualizations with Seaborn

2.1 Distribution Analysis

Explanation: Seaborn simplifies distribution plots with attractive defaults, offering histograms, KDE, and combined views.

Requirements: Create a histogram with KDE overlay and compare two distributions with filled KDE plots. One snippet, one chart.

import seaborn as sns
import pandas as pd

sns.set_theme()

dist1 = np.random.gamma(shape=2, scale=2, size=400)
dist2 = np.random.gamma(shape=4, scale=1.3, size=400)

plt.figure(figsize=(7, 4))
sns.histplot(dist1, kde=True, color='skyblue')
plt.title('Gamma Distribution (k=2, theta=2)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(7, 4))
sns.kdeplot(dist1, fill=True, alpha=0.5, label='Group 1')
sns.kdeplot(dist2, fill=True, alpha=0.5, label='Group 2')
plt.title('Comparing Two Distributions')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.show()

Chart Descriptions: The first chart shows a right-skewed gamma distribution with a smooth KDE curve. The second overlays two filled KDE curves: Group 2 is shifted to the right (theoretical mean about 5.2 versus 4.0) and is less skewed, while the two peaks are of similar height.
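A KDE curve is an average of small bumps, one centered on each observation. The sketch below builds a Gaussian KDE by hand with NumPy, using Scott's rule of thumb for the bandwidth. It is meant to show the idea only. Seaborn's own curve can differ in detail, and the seed and grid size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)  # assumption: seed for reproducibility
sample = rng.gamma(shape=2, scale=2, size=400)

# Scott's rule of thumb for a 1D bandwidth: sample std times n^(-1/5)
bandwidth = sample.std(ddof=1) * sample.size ** (-1 / 5)

# Evaluation grid extending a few bandwidths beyond the data
grid = np.linspace(sample.min() - 4 * bandwidth,
                   sample.max() + 4 * bandwidth, 500)

# One Gaussian bump per observation, averaged over all observations
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (
    sample.size * bandwidth * np.sqrt(2 * np.pi)
)

# A density should integrate to roughly 1 (trapezoid rule over the grid)
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(f"bandwidth = {bandwidth:.3f}, area under curve = {area:.3f}")
```

A larger bandwidth gives a smoother but flatter curve; a smaller one follows the sample more closely and looks noisier.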

2.2 Categorical Visualizations

Explanation: Count plots, bar plots, strip plots, violin plots, and swarm plots reveal structure in categorical data.

Requirements: Visualize counts per category, show average values with error bars, and compare distributions with violin and swarm plots. One snippet, one chart.

np.random.seed(0)
orders = pd.DataFrame({
    'product': np.random.choice(['Alpha', 'Beta', 'Gamma', 'Delta'], 300),
    'channel': np.random.choice(['Online', 'Retail'], 300),
    'units': np.random.randint(1, 8, 300),
    'price': np.random.normal(45, 12, 300)
})
orders['revenue'] = orders['units'] * orders['price']

plt.figure(figsize=(8, 4))
sns.countplot(data=orders, x='product', hue='channel')
plt.title('Order Counts by Product and Channel')
plt.xlabel('Product')
plt.ylabel('Count')
plt.show()

plt.figure(figsize=(8, 4))
sns.barplot(data=orders, x='product', y='revenue', errorbar='sd', hue='channel')
plt.title('Average Revenue by Product (Std Dev)')
plt.xlabel('Product')
plt.ylabel('Revenue')
plt.show()

plt.figure(figsize=(9, 4))
sns.violinplot(data=orders, x='product', y='revenue', inner=None, color='lightgray')
sns.swarmplot(data=orders.sample(120, random_state=1), x='product', y='revenue',
              hue='channel', dodge=True, size=3)
plt.title('Revenue Distribution with Individual Points')
plt.xlabel('Product')
plt.ylabel('Revenue')
plt.legend(title='Channel', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

Chart Descriptions: The count plot shows order counts per product, split by channel. The bar plot shows mean revenue per product and channel, with error bars spanning one standard deviation. The violin plot outlines each product's revenue distribution, and the overlaid swarm of 120 sampled orders, colored by channel, shows individual values.

2.3 Joint Plots, Heatmaps, and Pair Plots

Explanation: Seaborn streamlines bivariate and multivariate analysis with joint plots, correlation heatmaps, and pair plots.

Requirements: Produce a joint plot with KDE contours, a correlation heatmap, and a pair plot for multivariate exploration. One snippet, one chart.

metrics = pd.DataFrame({
    'sessions': np.random.poisson(200, 150),
    'conversion_rate': np.random.beta(2, 8, 150),
    'avg_order_value': np.random.normal(80, 15, 150),
    'region': np.random.choice(['NA', 'EU', 'APAC'], 150)
})
metrics['revenue_per_session'] = metrics['conversion_rate'] * metrics['avg_order_value']

# Joint plot
sns.jointplot(data=metrics, x='sessions', y='revenue_per_session',
              kind='kde', fill=True, cmap='mako')
plt.suptitle('Traffic vs Revenue Density', y=1.02)
plt.show()

# Correlation heatmap
corr = metrics[['sessions', 'conversion_rate', 'avg_order_value', 'revenue_per_session']].corr()
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, linewidths=0.8, square=True)
plt.title('Metric Correlations')
plt.show()

# Pair plot
sns.pairplot(metrics, hue='region', diag_kind='kde', height=2.2)
plt.suptitle('Pair Plot by Region', y=1.02)
plt.show()

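Chart Descriptions: The joint plot shows filled density contours for sessions against revenue per session; the two are generated independently here, so the contours show little or no tilt. The heatmap visualizes correlation strengths, highlighting how revenue per session depends on both conversion rate and order value. The pair plot colors every pairwise view by region; because region is assigned at random, the regional groups overlap rather than cluster.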

3 Section 3: Exploring Relationships and Patterns

3.1 Regression and Confidence Intervals

Explanation: Regression lines and confidence intervals help quantify relationships in scatter plots.

Requirements: Build a scatter plot with a regression line and 95 percent confidence interval.

student_perf = pd.DataFrame({
    'hours': np.random.uniform(0, 12, 80),
    'prep_quality': np.random.uniform(0.6, 1.0, 80)
})
student_perf['score'] = 55 + 4.5 * student_perf['hours'] * student_perf['prep_quality'] + np.random.normal(0, 8, 80)

plt.figure(figsize=(9, 4))
sns.regplot(data=student_perf, x='hours', y='score', scatter_kws={'alpha': 0.6})
plt.title('Study Hours vs Exam Score')
plt.xlabel('Hours studied')
plt.ylabel('Exam score')
plt.show()

Chart Description: Points trend upward with a fitted line and shaded band, indicating higher study hours align with higher scores and showing the uncertainty around the fit.
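The shaded band around a regression line can be understood through resampling. The sketch below estimates a 95 percent percentile interval for the slope by bootstrapping rows with NumPy. It illustrates the general idea rather than reproducing seaborn's internals, and the data (true slope of 3), seed, and `n_boot` value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)  # assumption: seed for reproducibility

# Hypothetical data with a known slope of 3 points per hour
hours = rng.uniform(0, 12, 80)
score = 55 + 3.0 * hours + rng.normal(0, 8, 80)

# Point estimate of the slope from an ordinary least-squares line
slope, intercept = np.polyfit(hours, score, 1)

# Bootstrap: refit on resampled (hours, score) pairs
n_boot = 1000
boot_slopes = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, hours.size, hours.size)  # rows drawn with replacement
    boot_slopes[b] = np.polyfit(hours[idx], score[idx], 1)[0]

# Percentile interval covering the central 95% of bootstrap slopes
ci_low, ci_high = np.percentile(boot_slopes, [2.5, 97.5])
print(f"slope = {slope:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

A narrow interval means the slope is well determined by the data; fewer points or more noise widen it.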

3.2 Pair Plots and Facet Grids

Explanation: Pair plots reveal multivariate structure; facet grids split plots by category to show conditional patterns.

Requirements: Generate a pair plot by segment and a facet grid of scatter plots conditioned on a categorical variable.

cars = pd.DataFrame({
    'mpg': np.random.normal(30, 5, 120),
    'weight': np.random.normal(3200, 400, 120),
    'hp': np.random.normal(200, 40, 120),
    'origin': np.random.choice(['Domestic', 'Import'], 120)
})

sns.pairplot(cars, hue='origin', diag_kind='hist', height=2.2)
plt.suptitle('Vehicle Metrics by Origin', y=1.02)
plt.show()

grid = sns.FacetGrid(cars, col='origin', height=4, aspect=1.1)
grid.map_dataframe(sns.scatterplot, x='weight', y='mpg', alpha=0.7)
grid.set_axis_labels('Weight (lbs)', 'MPG')
grid.figure.suptitle('MPG vs Weight by Origin', y=1.05)
plt.show()

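Chart Descriptions: The pair plot shows pairwise relationships among mpg, weight, and horsepower, colored by origin. The facet grid draws one MPG-versus-weight panel per origin. Because each column here is sampled independently, expect no real trend or origin clusters; with real vehicle data, heavier cars would typically show lower mpg.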

3.3 Residual and Anomaly Checks

Explanation: Residual plots assess regression fit; anomaly highlighting points out unusual observations.

Requirements: Fit a simple linear model, plot residuals, and mark potential anomalies.

from sklearn.linear_model import LinearRegression

x = cars[['weight']]
y = cars['mpg']
model = LinearRegression().fit(x, y)
cars['pred'] = model.predict(x)
cars['residual'] = cars['mpg'] - cars['pred']

plt.figure(figsize=(9, 4))
sns.residplot(x='pred', y='residual', data=cars, lowess=True, scatter_kws={'alpha': 0.6})
plt.axhline(0, color='red', linestyle='--')
plt.title('Residual Plot for MPG Model')
plt.xlabel('Predicted MPG')
plt.ylabel('Residual')
plt.show()

Chart Description: Residuals scatter around zero with a smooth lowess curve; points far from zero highlight possible anomalies or non-linear effects.
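One simple way to mark potential anomalies is to flag points whose residual lies several standard deviations from zero. The sketch below uses hypothetical data with one planted anomaly, and `np.polyfit` stands in for the `LinearRegression` fit. The threshold of 3 is a common rule of thumb, not a fixed standard.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)  # assumption: seed for reproducibility

# Hypothetical data: mpg falls with weight, plus one planted anomaly
weight = rng.normal(3200, 400, 120)
mpg = 30 - 0.008 * (weight - 3200) + rng.normal(0, 2, 120)
mpg[0] += 20  # planted anomaly so there is something to flag

# Straight-line fit and residuals (np.polyfit stands in for LinearRegression)
slope, intercept = np.polyfit(weight, mpg, 1)
pred = slope * weight + intercept
residual = mpg - pred

# Flag points whose residual is more than 3 standard deviations from zero
z = residual / residual.std(ddof=1)
is_anomaly = np.abs(z) > 3

fig, ax = plt.subplots(figsize=(9, 4))
ax.scatter(pred[~is_anomaly], residual[~is_anomaly], alpha=0.6, label='Typical')
ax.scatter(pred[is_anomaly], residual[is_anomaly], color='red', label='Flagged')
ax.axhline(0, color='gray', linestyle='--')
ax.set_xlabel('Predicted MPG')
ax.set_ylabel('Residual')
ax.legend()
plt.show()
```

Flagged points are candidates for review rather than automatic deletions; a data-entry check usually comes first.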

4 Section 4: Advanced Statistical Visualizations

4.1 Error Bars and Significance

Explanation: Error bars show variability; annotations can mark significance thresholds.

Requirements: Plot group means with standard error bars and add significance markers.

grouped = orders.groupby(['product', 'channel'])['revenue'].agg(['mean', 'sem']).reset_index()

plt.figure(figsize=(8, 5))
sns.barplot(data=grouped, x='product', y='mean', hue='channel', palette='muted', errorbar=None)

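# Add manual error bars; rows are ordered product by product, Online before Retail,
# so Online is the left bar (-0.2) and Retail the right bar (+0.2) in each pair
for i, row in grouped.iterrows():
    plt.errorbar(i // 2 + (-0.2 if row['channel'] == 'Online' else 0.2),
                 row['mean'], yerr=row['sem'], fmt='none', ecolor='black', capsize=4)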

plt.title('Mean Revenue with Standard Error')
plt.xlabel('Product')
plt.ylabel('Mean revenue')
plt.legend(title='Channel')
plt.text(-0.3, grouped['mean'].max() + 10, '* p < 0.05 (illustrative)', fontsize=9)
plt.tight_layout()
plt.show()

Chart Description: Bar heights show mean revenue per product and channel with caps indicating standard error; a text note signals where a significant difference could be annotated.
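The standard error of the mean behind these error bars is the sample standard deviation divided by the square root of the sample size. The sketch below computes it with NumPy for two hypothetical groups. The helper name `standard_error`, the seed, and the sample values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)  # assumption: seed for reproducibility

# Hypothetical revenue samples for two channels
online = rng.normal(180, 90, 40)
retail = rng.normal(175, 90, 35)

def standard_error(values):
    """Sample standard deviation (ddof=1) divided by sqrt(n)."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / np.sqrt(values.size)

for name, sample in [('Online', online), ('Retail', retail)]:
    mean = sample.mean()
    sem = standard_error(sample)
    # A rough 95% interval is mean +/- 1.96 * SEM for reasonably large samples
    print(f"{name}: mean = {mean:.1f}, SEM = {sem:.1f}, "
          f"approx. 95% CI = [{mean - 1.96 * sem:.1f}, {mean + 1.96 * sem:.1f}]")
```

Heavily overlapping intervals are a hint, not a test; a formal comparison would still be needed before annotating a chart with a significance marker.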

4.2 Time Series and Decomposition

Explanation: Time series visuals show trends and seasonality; decomposition separates components.

Requirements: Plot a synthetic seasonal time series, add a rolling mean, and decompose trend-seasonal-noise.

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

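# Month-start frequency; the older 'M' alias is deprecated in recent pandas
dates = pd.date_range('2023-01-01', periods=48, freq='MS')
trend = np.linspace(100, 180, 48)
# Exact 12-month cycle so it matches period=12 in the decomposition
season = 15 * np.sin(2 * np.pi * np.arange(48) / 12)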
noise = np.random.normal(0, 8, 48)
series = trend + season + noise
ts = pd.Series(series, index=dates, name='kpi')

fig, ax = plt.subplots(figsize=(10, 4))
ts.plot(ax=ax, label='KPI', color='navy')
ts.rolling(window=6).mean().plot(ax=ax, label='6-month mean', color='orange')
ax.set_title('Time Series with Rolling Mean')
ax.set_ylabel('Value')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

decomp = seasonal_decompose(ts, model='additive', period=12)
decomp.plot()
plt.suptitle('Additive Decomposition', y=1.02)
plt.show()

Chart Descriptions: The time series plot shows oscillating seasonality on top of an upward trend with a smoothed rolling mean. The decomposition charts separate observed data into trend, seasonal pattern, and residual noise.
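To see exactly what a rolling mean computes, it helps to run it on a series small enough to check by hand. The sketch below compares a trailing window with a centered one. The values and the window size of 3 are hypothetical.

```python
import pandas as pd

# Small hypothetical series so the arithmetic is easy to check by hand
values = pd.Series([10, 12, 14, 16, 18, 20, 22, 24], dtype=float)

# Trailing window: each result averages the current point and the 2 before it
trailing = values.rolling(window=3).mean()

# Centered window: each result averages the point and one neighbor on each side
centered = values.rolling(window=3, center=True).mean()

print(pd.DataFrame({'value': values, 'trailing': trailing, 'centered': centered}))
```

The trailing version leaves the first `window - 1` entries empty and lags behind turning points, which is why a 6-month trailing mean starts at the sixth observation. The centered version stays aligned with the data but leaves gaps at both ends.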

4.3 Density, Contour, and Q-Q Plots

Explanation: Density and contour plots capture 2D distributions; Q-Q plots give a visual check of normality.

Requirements: Draw a 2D density/contour plot and a Q-Q plot against the normal distribution.

from scipy import stats

x = np.random.normal(0, 1, 500)
y = 0.6 * x + np.random.normal(0, 0.8, 500)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
sns.kdeplot(x=x, y=y, fill=True, cmap='Purples', thresh=0.05)
plt.title('2D Density and Contours')
plt.xlabel('X')
plt.ylabel('Y')

plt.subplot(1, 2, 2)
stats.probplot(x, dist="norm", plot=plt)
plt.title('Q-Q Plot vs Normal')

plt.tight_layout()
plt.show()

Chart Descriptions: The density plot shows contour lines and filled regions for joint probability of X and Y. The Q-Q plot compares sample quantiles to theoretical normal quantiles; deviations from the diagonal reveal non-normality.
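The points of a Q-Q plot can be built by hand: sort the sample, then pair each value with the normal quantile at a matching probability. The sketch below uses the plotting positions (i - 0.5) / n, which is one common convention. `scipy.stats.probplot` uses a slightly different formula, so its points will not match these exactly. The seed is an assumption.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(6)  # assumption: seed for reproducibility
sample = rng.normal(0, 1, 500)

# Sample quantiles are just the sorted observations
sample_q = np.sort(sample)

# Plotting positions (i - 0.5) / n, one common convention among several
n = sample_q.size
probs = (np.arange(1, n + 1) - 0.5) / n

# Theoretical quantiles of the standard normal at those probabilities
std_normal = NormalDist()
theory_q = np.array([std_normal.inv_cdf(p) for p in probs])

# For normal data the paired quantiles fall close to a straight line,
# so their correlation is near 1
r = np.corrcoef(theory_q, sample_q)[0, 1]
print(f"Q-Q correlation = {r:.4f}")
```

Plotting `theory_q` against `sample_q` as a scatter gives the Q-Q plot itself; heavy tails show up as points bending away from the line at both ends.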

5 Section 5: Best Practices and Real-World Project

5.1 Visualization Guidelines

Explanation: Choosing the right chart and avoiding pitfalls improves clarity and credibility.

Requirements: Summarize dos and don'ts and show a clean, publication-ready style example.

# Clean style example: minimalist bar plot for publication
plt.style.use('seaborn-v0_8-whitegrid')

channels = ['Email', 'Social', 'Search', 'Display']
ctr = [4.2, 3.1, 5.6, 2.8]

plt.figure(figsize=(7, 4))
bars = plt.bar(channels, ctr, color='#4C72B0', edgecolor='black')
plt.title('Click-Through Rate by Channel')
plt.ylabel('CTR (%)')

for bar, val in zip(bars, ctr):
    plt.text(bar.get_x() + bar.get_width() / 2, val + 0.1, f'{val:.1f}',
             ha='center', va='bottom')

plt.ylim(0, 6.5)
plt.tight_layout()
plt.show()

Chart Description: A minimalist bar chart with crisp gridlines, clear labels, and value annotations demonstrates publication-ready styling without clutter.

Best Practices Checklist:

Match chart type to question: trend (line), distribution (hist/kde/violin), relationship (scatter/regplot), composition (stacked bar).
Avoid misleading axes; start at zero for bars and label any breaks.
Use consistent color palettes; rely on hue sparingly and respect accessibility contrast.
Limit ink: remove redundant spines, reduce grid opacity, and label directly when possible.
Document data source and transformations; save figures with tight layout and sufficient DPI (e.g., 300+ for print).
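Two of the checklist items, limiting ink and labeling directly, can be sketched on a line chart. The channel values below are hypothetical, and the styling choices are one reasonable option rather than a required house style.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical monthly values for two channels
months = np.arange(1, 13)
series = {
    'Search': 4.0 + 0.15 * months,
    'Email': 3.5 + 0.05 * months,
}

fig, ax = plt.subplots(figsize=(7, 4))
for name, values in series.items():
    line, = ax.plot(months, values, linewidth=2)
    # Label directly at the end of each line instead of using a legend
    ax.text(months[-1] + 0.2, values[-1], name,
            color=line.get_color(), va='center')

# Limit ink: drop the top and right spines and keep the grid faint
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)
ax.grid(True, alpha=0.2)

ax.set_xlim(1, 13.5)  # leave room on the right for the direct labels
ax.set_xlabel('Month')
ax.set_ylabel('CTR (%)')
ax.set_title('Click-Through Rate by Channel')
fig.tight_layout()
plt.show()
```

Direct labels save the reader a trip to the legend, which matters most when a chart has only a few series.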

6 Section 6: Hands-on Project

6.1 Project Setup

Explanation: Complete a mini-workflow: load data, explore with visuals, test hypotheses visually, and assemble a report.

Requirements: Use a synthetic customer dataset to perform EDA with both Matplotlib and Seaborn, and create a small visualization report.

# Synthetic dataset
np.random.seed(42)
customers = pd.DataFrame({
    'customer_id': range(1, 301),
    'segment': np.random.choice(['SMB', 'Enterprise', 'Consumer'], 300),
    'tenure_months': np.random.randint(1, 48, 300),
    'monthly_spend': np.random.normal(120, 35, 300),
    'support_tickets': np.random.poisson(2, 300)
})
customers['churned'] = np.where(np.random.rand(300) < 0.25, 'Yes', 'No')
customers['total_value'] = customers['monthly_spend'] * customers['tenure_months']

# 1. Distribution of spend
plt.figure(figsize=(8, 4))
sns.histplot(customers['monthly_spend'], kde=True, color='seagreen')
plt.title('Monthly Spend Distribution')
plt.xlabel('Monthly spend ($)')
plt.show()

# 2. Relationship: tenure vs total value by churn
plt.figure(figsize=(8, 4))
sns.scatterplot(data=customers, x='tenure_months', y='total_value',
                hue='churned', alpha=0.7)
plt.title('Lifetime Value vs Tenure by Churn')
plt.xlabel('Tenure (months)')
plt.ylabel('Total value ($)')
plt.show()

# 3. Categorical comparison: tickets by segment
plt.figure(figsize=(8, 4))
sns.boxplot(data=customers, x='segment', y='support_tickets', hue='churned')
plt.title('Support Tickets by Segment and Churn')
plt.xlabel('Segment')
plt.ylabel('Tickets')
plt.show()

# 4. Correlation heatmap
numeric_cols = ['tenure_months', 'monthly_spend', 'support_tickets', 'total_value']
corr_proj = customers[numeric_cols].corr()
plt.figure(figsize=(6, 5))
sns.heatmap(corr_proj, annot=True, cmap='RdBu', center=0)
plt.title('Customer Metrics Correlation')
plt.show()

Chart Descriptions:

The histogram shows the central tendency and spread of customer spend.
The scatter plot shows total value against tenure, colored by churn status, so any separation between churned and retained customers is visible.
The box plots compare ticket volume across segments, separated by churn state.
The heatmap summarizes linear relationships among the numeric metrics.

6.2 Reporting Steps

Capture each chart with fig.savefig('name.png', dpi=300, bbox_inches='tight').
Write short interpretations beneath each figure focusing on the business question.
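Summarize key findings. On real data these might read "churners have lower tenure and value" or "SMB tickets are higher"; the synthetic dataset above assigns churn and segment at random, so it will not show such effects.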
Recommend next actions (targeted retention offers, support staffing for high-ticket segments).
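The capture step above can be wrapped in a small helper so every chart is saved the same way. This is a minimal sketch: the function name `save_report_figure`, the file name, and the temporary output folder are all hypothetical and should be swapped for the report's real locations.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

def save_report_figure(fig, name, out_dir):
    """Save `fig` as <out_dir>/<name>.png at print resolution and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.png"
    fig.savefig(path, dpi=300, bbox_inches='tight')
    return path

# Keep a handle on the figure so it can be saved explicitly
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot([1, 2, 3], [2, 4, 3])
ax.set_title('Placeholder chart')

# Assumption: a temporary folder stands in for the real report directory
report_dir = Path(tempfile.mkdtemp())
saved_path = save_report_figure(fig, 'placeholder_chart', report_dir)
plt.close(fig)  # free memory once the file is written

print(saved_path)
```

Saving through the figure handle avoids depending on which figure happens to be current when several charts are created in one script.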

Final Deliverable: A concise Quarto/Notebook report combining code, figures, and written insights suitable for stakeholders.