Predicting Cardiovascular Diseases

Using machine learning on behavioral risk factor data to predict heart disease

Cardiovascular diseases (CVD) are among the leading causes of death worldwide. In this project, we developed machine learning models to help with early detection of CVDs using data from the Behavioral Risk Factor Surveillance System (BRFSS), a large-scale survey of U.S. residents’ health behaviors and risk factors.

All models were implemented from scratch using native Python libraries - no machine learning frameworks such as scikit-learn or PyTorch were used.

The Challenge

The BRFSS dataset contains responses from 328,135 individuals across 322 features, covering demographics, health conditions, lifestyle factors, and behaviors. Our goal was to predict whether someone has been diagnosed with coronary heart disease (MICHD), which includes heart attacks and angina.

The dataset presented several challenges:

  • Class imbalance: Less than 9% of respondents had CVD, making accuracy a poor metric
  • Feature heterogeneity: Mix of binary (34%), categorical (18%), and numerical (48%) features
  • Missing values: The survey structure meant many missing values carried semantic meaning (e.g., a skipped follow-up question)
  • Confounders: Many features unrelated to the prediction task

Three Preprocessing Strategies

We experimented with three different approaches to handling this complex dataset:

“The Good”: Carefully selected 122 relevant features with informed preprocessing that respected the semantic meaning of each feature. This included mapping invalid values, converting answers to meaningful scales, and intelligently filling missing values based on question context.
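
As an illustration, here is a minimal sketch of this kind of informed preprocessing for a single BRFSS-style ordinal feature (the special codes below are hypothetical, not our exact mapping):

```python
import numpy as np

def clean_ordinal(col, dont_know=7, refused=9, max_valid=5):
    """Map survey special codes to NaN, then fill with the column median.

    Hypothetical example: a 1-5 ordinal scale where 7 means
    "don't know" and 9 means "refused".
    """
    col = col.astype(float)
    col[(col == dont_know) | (col == refused) | (col > max_valid)] = np.nan
    col[np.isnan(col)] = np.nanmedian(col)  # a context-aware fill would go here
    return col
```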

“The Bad”: Used all 322 raw features with minimal preprocessing - simply mapping missing values to -1 and applying min-max normalization.
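
In code, this minimal strategy is only a few lines (a sketch, assuming `X` is a NumPy array with NaN marking missing entries):

```python
import numpy as np

def preprocess_minimal(X):
    """Map missing values to -1, then min-max normalize each column."""
    X = np.where(np.isnan(X), -1.0, X)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return (X - mins) / ranges
```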

“The Ugly”: A hybrid approach using all features, where our selected 122 were preprocessed with the informed strategy and the rest with the default approach.

Model and Optimization

We trained logistic regression models with L2 regularization, optimizing the following loss function:

\[\mathcal{L}(w):=\frac{1}{N} \sum_{n=1}^N\left[-y_n x_n^{\top} w+\log \left(1+e^{x_n^{\top} w}\right)\right] + \frac{\lambda}{2} \|w\|_2^2\]
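
This is the standard L2-regularized logistic loss. A numerically stable NumPy implementation might look like the following sketch (our actual helper names may differ):

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def reg_logistic_loss(w, X, y, lam):
    """L2-regularized logistic loss, matching the formula above."""
    z = X @ w
    # log(1 + e^z) computed stably via logaddexp(0, z)
    return np.mean(-y * z + np.logaddexp(0.0, z)) + 0.5 * lam * (w @ w)

def reg_logistic_grad(w, X, y, lam):
    """Gradient: (1/N) X^T (sigmoid(Xw) - y) + lam * w."""
    return X.T @ (sigmoid(X @ w) - y) / len(y) + lam * w
```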

For each preprocessing strategy, we performed grid search over:

  • Regularization coefficient: $\lambda \in \{0.00001, 0.0001, 0.001, 0.01, 0.1\}$
  • Learning rate: $\gamma \in \{0.01, 0.05, 0.1, 0.5, 1\}$
  • Batch size: $b \in \{500, 5000, 10000\}$
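
The search itself is a simple exhaustive loop over the grid; a minimal sketch, where `train_and_score` is an assumed helper that runs mini-batch gradient descent with the given hyperparameters and returns the validation F1:

```python
from itertools import product

# Hyperparameter grids listed above
lambdas = [0.00001, 0.0001, 0.001, 0.01, 0.1]
gammas = [0.01, 0.05, 0.1, 0.5, 1.0]
batch_sizes = [500, 5000, 10000]

def grid_search(train_and_score):
    """train_and_score(lam, gamma, b) -> validation F1 (assumed helper)."""
    best_f1, best_config = -1.0, None
    for lam, gamma, b in product(lambdas, gammas, batch_sizes):
        f1 = train_and_score(lam, gamma, b)
        if f1 > best_f1:
            best_f1, best_config = f1, (lam, gamma, b)
    return best_f1, best_config
```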

We trained each model for 5000 epochs using mini-batch gradient descent on a 90-10 train-validation split. Since our primary concern was catching cases of CVD (high recall), we optimized the decision threshold for F1 score on the validation set rather than using the default of 0.5.
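
The threshold tuning reduces to a sweep over candidate cutoffs on the validation predictions; a minimal sketch:

```python
import numpy as np

def best_f1_threshold(probs, y_true, thresholds=np.linspace(0.01, 0.99, 99)):
    """Pick the decision threshold that maximizes F1 on validation data."""
    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        preds = (probs >= t).astype(int)
        tp = np.sum((preds == 1) & (y_true == 1))
        fp = np.sum((preds == 1) & (y_true == 0))
        fn = np.sum((preds == 0) & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```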

Results

Model      Accuracy   Precision   Recall    F₁ Score   Threshold
The Good   86.78%     34.99%      57.96%    43.63%     0.186
The Ugly   87.83%     36.91%      53.26%    43.60%     0.206
The Bad    85.95%     33.07%      57.71%    42.05%     0.181

Best model performance on the validation set. The best test submission was “The Ugly”, achieving $F_1 = 0.442$.

Surprisingly, all three preprocessing strategies achieved similar performance metrics. “The Ugly” performed best on the test set, though “The Good” achieved the highest recall, which is critical for identifying at-risk individuals.

The Real Value: Interpretability

While the performance differences were minimal, the preprocessing strategy had a dramatic impact on model interpretability. By examining the features with the largest absolute weights, we discovered why careful preprocessing matters:
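
Extracting this ranking is straightforward once training is done; a minimal sketch, assuming `w` is the learned weight vector and `feature_names` is the matching list of column names:

```python
import numpy as np

def top_features(w, feature_names, k=5):
    """Return the k features with the largest absolute weights."""
    order = np.argsort(np.abs(w))[::-1][:k]
    return [(feature_names[i], float(w[i])) for i in order]
```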

"The Good" - Top Features

  • Age (w = 0.82)
  • Cardiac rehabilitation (w = 0.72)
  • General health (w = 0.43)
  • Sex (w = -0.42)
  • Physical activity (w = -0.39)

"The Ugly" - Top Features

  • Cardiac rehabilitation (w = 0.61)
  • Missing fruit responses (w = 0.60)
  • Out of range fruit data (w = 0.55)
  • Phone number confirmation (w = 0.45)
  • High blood pressure (w = 0.44)
The missing fruit responses, out-of-range fruit data, and phone number confirmation features are confounders - artifacts of data collection rather than meaningful health indicators.

The difference is striking. “The Good” model highlights medically meaningful factors like age, rehabilitation history, and physical activity. “The Ugly” model, despite similar predictive performance, gives high importance to data collection artifacts like missing survey responses and phone number confirmations.