I took Intro to Statistical Modeling in the 2025 spring semester as a follow-up to Statistics and Applied Data Analysis (STAT S-100) and as another building block toward my Data Analytics Graduate Certificate. This was a very important class for me, and one I took very seriously, because it is the one "required" course you must take to obtain the Data Analytics Graduate Certificate. The other three courses are "intro" and/or "elective" classes.
Per the official description, this is a second course in statistical inference—deeper, sharper, and much more “hands-on” than an intro. We covered:
- t-tools & permutation/bootstrapping alternatives: classic one- and two-sample t-tests and confidence intervals, plus nonparametric routes when assumptions wobble
- ANOVA: comparing more than two groups, interpreting F-tests, and following up with multiple comparisons while minding Type I error
- Linear regression: simple → multiple; interpreting coefficients vs. practical effect sizes; interaction terms; transformations when relationships aren’t linear on the raw scale
- Model checking & refinement: residual plots, leverage/influence (hello, Cook’s distance), diagnosing heteroskedasticity and non-normality, and knowing when to transform, re-specify, or simplify
- Simulation & statistical computing: using R to simulate sampling distributions, verify theoretical results, and sanity-check analytic work
- R fundamentals (applied): data wrangling, reproducible scripts, clear comments, and workflow hygiene so your work stands up when a TF (or future you) revisits it
By the end of the course we were expected to state hypotheses, explore data rigorously, choose appropriate models/tests, check assumptions, and communicate conclusions in plain, defensible language.
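The simulation bullet above is the habit that stuck with me most: instead of trusting a formula, you simulate and check. As a hedged illustration (my own sketch, not course material), here's a minimal R example that simulates the sampling distribution of the mean from a skewed exponential population and compares it to what the CLT predicts:

```r
# Sketch: simulate the sampling distribution of the mean from a
# skewed (exponential) population and sanity-check the CLT.
set.seed(42)

n_sims <- 5000   # number of simulated samples
n      <- 30     # sample size per draw

# Draw n_sims samples of size n and record each sample mean
sample_means <- replicate(n_sims, mean(rexp(n, rate = 1)))

# Theory says: mean of sample means ~ 1, standard error ~ 1/sqrt(n)
mean(sample_means)   # should land close to 1
sd(sample_means)     # should land close to 1/sqrt(30), about 0.18

# Despite the skewed population, this histogram looks roughly normal
hist(sample_means, breaks = 40,
     main = "Sampling distribution of the mean (n = 30)")
```

Ten lines of base R, and you've verified a theorem instead of taking it on faith. That's the spirit of the whole course.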
Structure, pacing, and assessments
This class was difficult—in a good way. The cadence of regular quizzes and weekly assignments kept me honest. It starts gently (reviewing inference) and then ramps fast once regression, diagnostics, and resampling land. The rhythm looked like:
- Short quizzes to confirm you really absorbed the week’s ideas
- Problem sets that force you to operationalize them in R
- Frequent encounters with "what if assumptions fail?"—and then adapting with bootstrap or permutation methods, or by re-specifying the model
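To make the "assumptions fail" fallback concrete, here's a small sketch of a two-sample permutation test in base R. The data are invented for illustration; the idea is that instead of leaning on a t-distribution, you re-shuffle the group labels and ask how extreme the observed difference really is:

```r
# Sketch of a two-sample permutation test (toy data, for illustration).
set.seed(1)
group_a <- c(5.1, 6.3, 5.8, 7.0, 6.1, 5.5)
group_b <- c(6.8, 7.4, 6.9, 8.1, 7.2, 7.7)

observed <- mean(group_b) - mean(group_a)
pooled   <- c(group_a, group_b)
n_a      <- length(group_a)

# Re-shuffle the group labels many times and rebuild the difference
perm_diffs <- replicate(10000, {
  shuffled <- sample(pooled)
  mean(shuffled[-(1:n_a)]) - mean(shuffled[1:n_a])
})

# Two-sided p-value: how often does a random relabeling look as extreme?
p_value <- mean(abs(perm_diffs) >= abs(observed))
p_value
```

No normality assumption required, which is exactly why the course keeps reaching for this tool when the t-test's assumptions wobble.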
My final project: predicting diabetes with logistic regression and random forest
The "last hurrah" of the course was a wide-open final project: tackle any dataset we wanted and use it to answer a research question. I chose a dataset from Kaggle called the Healthcare Diabetes Dataset, which includes variables like glucose, insulin, BMI, age, blood pressure, and family history. My research question was: What factors are most predictive of diabetes risk in adults, and how accurately can we model that risk using statistical methods?
I applied two different modeling approaches: logistic regression and random forest. Logistic regression gave me interpretable coefficients and confirmed that glucose level was a clear driver of diabetes risk. However, the model’s limitations showed up in diagnostic plots, where non-linear relationships were evident. Logistic regression delivered test accuracy around 78% with an AUC of 0.84, solid but not perfect.
The random forest model, on the other hand, blew those results out of the water. It achieved 98% test accuracy, with near-perfect sensitivity and specificity, plus an AUC of 0.99. The variable importance plot from the random forest ranked glucose as the most important predictor, followed by Diabetes Pedigree Function and age—a finding that challenged my original hypothesis about insulin being a top driver.
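For readers who want the shape of the workflow: below is a hedged sketch of the logistic-regression half on simulated toy data, not my actual project code. The real analysis ran on the Kaggle dataset, and the random forest fit (via the randomForest package) followed the same train/test pattern; here I fabricate a small data frame so the example is self-contained:

```r
# Toy sketch of the train/test + logistic regression workflow.
# Data are simulated here; the real project used the Kaggle dataset.
set.seed(2025)
n <- 500
glucose <- rnorm(n, mean = 120, sd = 25)
age     <- rnorm(n, mean = 45, sd = 12)

# Simulate an outcome where glucose drives risk (echoing the finding)
p       <- plogis(-8 + 0.05 * glucose + 0.03 * age)
outcome <- rbinom(n, 1, p)
dat     <- data.frame(glucose, age, outcome)

# 80/20 train/test split
idx   <- sample(n, 0.8 * n)
train <- dat[idx, ]
test  <- dat[-idx, ]

# Logistic regression: interpretable coefficients
fit  <- glm(outcome ~ glucose + age, data = train, family = binomial)
summary(fit)$coefficients   # glucose coefficient comes out positive

# Classify at a 0.5 threshold and score on held-out data
prob <- predict(fit, newdata = test, type = "response")
acc  <- mean((prob > 0.5) == test$outcome)
acc
```

Swapping in `randomForest::randomForest()` on the same split, plus `varImpPlot()` for the importance ranking, is what produced the comparison in the project.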
This project was deeply personal for me, as I’ve seen the devastating effects of diabetes in my family. Beyond the technical exercise, it was a meaningful chance to validate what drives risk and to better understand which variables matter most in prediction. I’ll be adding the paper and R code to my portfolio section.
Who this course is for
- Builders who want applied, code-first statistical modeling
- Analysts who plan to use R professionally and need a disciplined approach to inference and modeling
- Anyone pursuing the Data Analytics Graduate Certificate—this is the keystone
Keys for success
- Arrive strong in R. Comfort with data frames, transformation, plotting, writing clean scripts, and troubleshooting errors will save you hours every week
- Consider skipping the textbook. I didn't need it: the course materials, slides, and labs covered everything I used. If you love a reference, great, but it's not required to succeed
- Keep up. The course starts slow but accelerates very quickly. Don't bank on "catching up later." Do the quizzes and problem sets on pace; they stack
- Start early on the final project. You're building from scratch: data selection, cleaning, model refinement, diagnostics, revisions. It all takes multiple iterations
Final take
Challenging? Oh yeah. Worth it? Absolutely. Intro to Statistical Modeling sharpened my statistical instincts, leveled up my R workflow, and produced a portfolio-grade project I’m proud of. If you’re serious about the Data Analytics Graduate Certificate—or you want to do real, defensible analysis at work—this is the course that turns “I know some stats” into “I can model this properly and explain it.”
Cheers,
Tom