Predicting Dropouts: How Regression Models Reveal Hidden Patterns in New York’s High Schools

A step-by-step data science journey from messy datasets to meaningful insights



Quick Overview — What You’ll Learn

Before we dive in, here’s what this article covers:

  1. Understanding the Problem: Predicting high school dropout rates — why it matters and how data can help.

  2. From Raw to Ready: Cleaning and preparing real-world education data for machine learning.

  3. Exploring the Story in the Data (EDA): How visualization uncovers trends and data integrity issues.

  4. Choosing the Right Model: Why “one-size-fits-all” doesn’t work for predicting counts like dropouts.

  5. Modeling in Action: Comparing Linear Regression, Poisson, and Negative Binomial models.

  6. Evaluating and Validating: How cross-validation ensures that your model isn’t just lucky.

  7. Lessons and Takeaways: What this project teaches us about modeling, data storytelling, and real-world decision-making.


 1. Understanding the Problem

Each year, educators and policymakers grapple with the same question: why do some students drop out while others persist?

For this project, we explored a dataset of New York State high schools (2018-2019). Each row represented a school district or subgroup — with metrics on enrollment, academic performance, and demographics.

The goal: build regression models to predict the number of student dropouts based on those attributes.

This isn’t just an academic exercise. Accurately predicting dropout risk helps school districts:

  • Identify struggling schools before crisis points,

  • Allocate resources more effectively, and

  • Craft targeted intervention programs.

Our task was to turn raw data into actionable insight — the essence of every good data science story.


 2. From Raw to Ready — Data Preparation

Real data is never pristine. The original dataset had:

  • Missing values,

  • Percentage columns unsuitable for direct modeling,

  • Inconsistent subgroup names, and

  • Large differences in district sizes (from tiny rural schools to massive city systems).

Cleaning involved:

  • Handling missing and invalid entries,

  • Encoding categorical variables (like subgroup names and district codes),

  • Removing percentage columns, and

  • Creating a crucial new variable: an exposure term, representing enrollment size.

Why exposure? Because a school with 1,000 students and 10 dropouts (1%) isn’t the same as a school with 50 students and 10 dropouts (20%).
Modeling counts per exposure helps us compare fairly across districts of different sizes.
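
To make the exposure idea concrete, here is a minimal pandas sketch; the column names (enrollment, dropouts) are illustrative stand-ins, not the dataset’s actual field names.

```python
import pandas as pd

# Hypothetical column names, used only to illustrate the exposure idea.
df = pd.DataFrame({
    "district": ["Big City HS", "Small Rural HS"],
    "enrollment": [1000, 50],   # exposure: number of enrolled students
    "dropouts": [10, 10],       # raw dropout counts
})

# The same raw count means very different things once exposure is considered.
df["dropout_rate_per_100"] = df["dropouts"] / df["enrollment"] * 100
print(df[["district", "dropout_rate_per_100"]])
#          district  dropout_rate_per_100
# 0     Big City HS                   1.0
# 1  Small Rural HS                  20.0
```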



 3. Exploring the Story in the Data (EDA)

Before running models, we explored the data visually and statistically:

  • Histograms revealed a right-skewed target — most districts had few dropouts, but a few had very high counts.

  • Scatter plots showed clear positive correlations between district size and dropout counts.

  • Missingness maps highlighted which columns needed imputation or removal.

After cleaning, a Post-Data Prep EDA confirmed improvement:

  • The target distribution (dropouts) became more stable when scaled per 100 students.

  • Outliers were less influential after transformation.

  • The cleaned dataset was now model-ready — balanced, interpretable, and exposure-aware.

📈 Insert Visual 2: Histogram comparing raw dropout counts vs. dropout rates (per 100 students) — showing variance stabilization.
📉 Insert Visual 3: Cook’s Distance plot — visualizing high-leverage observations to ensure no single school dominates model learning.
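
As a rough illustration of the “before vs. after” comparison behind Visual 2, the sketch below builds a small synthetic stand-in for the cleaned data (the real dataset has many more columns) and plots raw counts next to rates per 100 students:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the cleaned dataset (illustrative only).
rng = np.random.default_rng(0)
n = 500
enrollment = rng.integers(50, 5000, size=n)
attendance_rate = rng.uniform(0.80, 0.99, size=n)
# Dropout risk falls as attendance rises; counts scale with enrollment (exposure).
dropouts = rng.poisson(enrollment * (1.05 - attendance_rate) * 0.2)
df = pd.DataFrame(
    {"enrollment": enrollment, "attendance_rate": attendance_rate, "dropouts": dropouts}
)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df["dropouts"], bins=50)
axes[0].set_title("Raw dropout counts (right-skewed)")
axes[1].hist(df["dropouts"] / df["enrollment"] * 100, bins=50)
axes[1].set_title("Dropouts per 100 students (more stable)")
plt.tight_layout()
plt.show()
```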


⚙️ 4. Choosing the Right Model

Not all regression models are created equal. For count data, like dropout numbers, classic Linear Regression struggles because:

  • It assumes normally distributed residuals,

  • It can predict negative counts, and

  • It doesn’t adapt to skewed, non-negative outcomes.

So, we explored three main model families:

  • Multiple Linear Regression (OLS): simple and interpretable, but a poor fit for counts and assumes constant variance. Ideal use: as a baseline.

  • Poisson Regression (GLM): designed for count data and interprets effects multiplicatively, but sensitive to over-dispersion. Ideal use: when mean ≈ variance.

  • Negative Binomial Regression (GLM): handles over-dispersion by adding an extra dispersion parameter, at the cost of slightly more complexity. Ideal use: when variance > mean (common in education data).
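
A quick way to choose between the last two families is a dispersion check: compare the mean and variance of the target. The snippet below is a rough check, reusing the synthetic DataFrame from the EDA sketch above.

```python
# Over-dispersion check: the Poisson family assumes variance ≈ mean.
# Reuses the synthetic DataFrame `df` from the EDA sketch above.
mean_dropouts = df["dropouts"].mean()
var_dropouts = df["dropouts"].var()

print(f"mean = {mean_dropouts:.2f}, variance = {var_dropouts:.2f}")
if var_dropouts > mean_dropouts:
    print("Variance exceeds the mean -> Negative Binomial is the safer choice.")
```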

5. Modeling in Action

5.1 The Baseline: Multiple Linear Regression

Our OLS model achieved:

  • R² ≈ 0.842, and

  • AIC ≈ 246,900.

However, the residuals showed heteroskedasticity, and the model handled high-variance districts poorly. OLS gave us a reference point but clearly wasn’t capturing the count structure.
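
For reference, a baseline OLS fit in statsmodels might look like the sketch below; the formula is a placeholder built from the synthetic columns used earlier, not the project’s full feature set.

```python
import statsmodels.formula.api as smf

# Baseline OLS on the synthetic stand-in; the real model used far more predictors.
ols_model = smf.ols("dropouts ~ enrollment + attendance_rate", data=df).fit()
print(ols_model.rsquared)   # R²
print(ols_model.aic)        # AIC
```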

5.2 The Poisson Model

The Poisson regression improved interpretability and produced a tighter fit:

  • Log-Likelihood ≈ −90,703, and

  • Deviance ≈ 100,600.

However, the variance in dropout counts was still much greater than the mean, a red flag for over-dispersion.
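
In statsmodels, the Poisson model can be fit as a GLM with log(enrollment) supplied as the exposure offset. The sketch below again uses the placeholder columns from the earlier synthetic example.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Poisson GLM with log(enrollment) supplied as the exposure offset.
poisson_model = smf.glm(
    "dropouts ~ attendance_rate",          # placeholder predictor, as before
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df["enrollment"]),
).fit()
print(poisson_model.llf)        # log-likelihood
print(poisson_model.deviance)   # residual deviance
```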

5.3 The Negative Binomial Model — Our Champion

Switching to a Negative Binomial (NB) model dramatically improved fit:

  • Log-Likelihood ≈ −71,972,

  • Deviance ≈ 15,841,

  • Hold-out RMSE ≈ 23.64.

These metrics — along with the residual patterns — confirmed that NB was capturing the underlying variability better than OLS or Poisson.

Interpretation was straightforward:

For example, if a feature’s coefficient is β = 0.12, then exp(β) ≈ 1.13 → roughly a 13% increase in expected dropouts, holding other factors constant.
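
A hedged sketch of the Negative Binomial fit and this multiplicative interpretation is shown below; the formula and columns are placeholders, and the dispersion parameter alpha=1.0 is purely illustrative (in practice it would be estimated from the data).

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Negative Binomial GLM with the same exposure offset.
# alpha=1.0 is illustrative; in practice the dispersion is estimated from the data.
nb_model = smf.glm(
    "dropouts ~ attendance_rate",
    data=df,
    family=sm.families.NegativeBinomial(alpha=1.0),
    offset=np.log(df["enrollment"]),
).fit()

# Multiplicative interpretation: exp(beta) is the rate ratio for a one-unit change.
rate_ratios = np.exp(nb_model.params)
print(rate_ratios)
```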



6. Evaluating Model Stability with K-Fold Cross-Validation

To ensure the NB model wasn’t just lucky on one split, we ran 5-Fold Cross-Validation.

Each fold trained on 80% of the data and validated on the remaining 20%, producing per-fold log-likelihood and deviance metrics on the validation split. The results:

Fold 1: log-likelihood −14,319.26, deviance 3,144.95
Fold 2: log-likelihood −14,401.47, deviance 3,212.94
Fold 3: log-likelihood −14,651.54, deviance 3,047.33
Fold 4: log-likelihood −14,272.19, deviance 3,297.41
Fold 5: log-likelihood −14,371.44, deviance 3,226.63
Mean ± SD: log-likelihood −14,403.18 ± 131.83, deviance 3,185.85 ± 84.50

Such consistent results across folds demonstrate stability and generalizability — the hallmark of a reliable model.
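
For readers who want to reproduce this kind of check, the cross-validation loop can be sketched roughly as follows, scoring each held-out fold with the family’s log-likelihood and deviance (same placeholder formula and synthetic columns as above).

```python
import numpy as np
from sklearn.model_selection import KFold
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 5-fold cross-validation of the Negative Binomial model on the synthetic stand-in.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_llf, fold_dev = [], []

for train_idx, val_idx in kf.split(df):
    train, val = df.iloc[train_idx], df.iloc[val_idx]
    model = smf.glm(
        "dropouts ~ attendance_rate",
        data=train,
        family=sm.families.NegativeBinomial(alpha=1.0),
        offset=np.log(train["enrollment"]),
    ).fit()

    # Score the held-out fold with the family's log-likelihood and deviance.
    mu = np.asarray(model.predict(val, offset=np.log(val["enrollment"].to_numpy())))
    y_val = val["dropouts"].to_numpy()
    fold_llf.append(model.family.loglike(y_val, mu))
    fold_dev.append(model.family.deviance(y_val, mu))

print(f"log-likelihood: {np.mean(fold_llf):.2f} ± {np.std(fold_llf):.2f}")
print(f"deviance:       {np.mean(fold_dev):.2f} ± {np.std(fold_dev):.2f}")
```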



7. Key Insights & Takeaways

a. Modeling counts requires the right tool

Linear regression is intuitive but inadequate for count-based targets. Poisson and Negative Binomial GLMs naturally handle non-negative, skewed outcomes and interpret coefficients in intuitive multiplicative terms.

b. Exposure terms make comparisons fair

Normalizing by enrollment (our exposure) ensured that large schools didn’t unfairly skew predictions.

c. Stability matters as much as accuracy

Cross-validation confirmed that the model’s performance was consistent across data subsets — crucial for real-world reliability.

d. Transparency beats complexity

Even though NB is more advanced than OLS, its coefficients remain interpretable — letting educators and policymakers trust and act on the insights.

e. Data storytelling drives action

Visualizing “before vs. after” improvements (target distributions, leverage checks, residuals) turns abstract math into clear narratives — a must-have skill for any data analyst.


Final Thoughts

This project demonstrates the full data science lifecycle — from raw messy data to a validated predictive model ready for deployment or policy use.

What makes it special is not the code, but the thought process:

  • Understanding the problem,

  • Respecting the data’s structure, and

  • Choosing methods that fit reality, not convenience.

By the end, we didn’t just predict dropouts; we built a replicable framework for how data-driven thinking can uncover patterns hidden in plain sight.



Reflection for Data Practitioners

Projects like this go beyond passing a course. They sharpen three vital muscles for data scientists:

  1. Critical modeling judgment — knowing why a model fits.

  2. Reproducible analysis — structuring notebooks cleanly for peer review.

  3. Narrative translation — turning technical results into clear, actionable insights.

In short, great analysts don’t just build models — they tell the story of what the data means.



