Predicting Malaria Incidence from Climate Data Using Machine Learning

  • Client or Institution: Research Collaboration on Malaria
  • Project Goal: Develop a predictive system and visualization platform for malaria incidence based on climatic and geographical data across 98+ countries
  • Tools Used: Python, Scikit-learn, CatBoost, Streamlit, Plotly
  • Duration: 1 month
  • Outcome: Deployed ML models for regression with a Streamlit web app for predictions and global data exploration
  • Live App: Malaria Predictor Web App
  • Publication: Malaria Incidence Prediction Using Climate Factors with Machine Learning Models, UBMK 2024
  • GitHub: Visit Codebase

A Global Web Tool for Environmental Health Insight

Abstract

Malaria incidence is strongly influenced by environmental factors such as rainfall, temperature, and location. In this project, we developed machine learning models to predict malaria incidence from these parameters, using curated data from WHO, the World Bank Climate Change Knowledge Portal (CCKP), and Google Public Data. The models were trained on multi-year datasets from more than 98 countries. Our best model (CatBoost Regressor) achieved a correlation score of 96.7%. This model powers a user-facing Streamlit app for country-specific predictions and global map visualizations.

🧠 Background & Problem Statement

With over 600,000 deaths annually, malaria remains a major global health challenge, especially in sub-Saharan Africa. Climate conditions significantly influence mosquito breeding and malaria outbreaks, but few real-time tools use this data to predict incidence.

This project aimed to:

  • Analyze how climate (precipitation, air temperature) and geography (lat/lon) affect malaria incidence
  • Train and evaluate both regression and classification models
  • Deploy the best-performing model in a web app
  • Offer a globally accessible tool for researchers, policymakers, and the public

🗂 Dataset

📌 Sources:

  1. WHO Global Health Repository: Malaria incidence data (2000–2021)
  2. World Bank CCKP: Monthly rainfall and temperature (1901–2022)
  3. Google Public Data: Country latitude and longitude

🧹 Preprocessing Steps:

  • Filtered data for matching country entries
  • Aggregated monthly values into yearly averages
  • Created new features through pairwise interaction, mean, std, etc.
  • Created a binary label: high incidence (≥10) vs low incidence (<10)
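
The preprocessing steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual pipeline: the tiny data frames, the column names (country, mean_temp) and all values are hypothetical placeholders. Only the yearly averaging, the country matching, and the threshold of 10 come from the description above.

```python
import pandas as pd

# Hypothetical monthly climate records (names and values are illustrative only).
monthly = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "year": [2019, 2019, 2020, 2020] * 2,
    "precipitation": [100.0, 140.0, 90.0, 110.0, 10.0, 30.0, 20.0, 40.0],
    "mean_temp": [25.0, 27.0, 26.0, 28.0, 15.0, 17.0, 16.0, 18.0],
})

# Hypothetical yearly incidence records; country "C" has no climate data.
incidence = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "year": [2019, 2020, 2019, 2019],
    "incidence": [250.0, 230.0, 4.0, 12.0],
})

# Aggregate monthly values into yearly averages per country.
yearly = monthly.groupby(["country", "year"], as_index=False).mean(numeric_only=True)

# Keep only the country-year entries present in both sources.
data = yearly.merge(incidence, on=["country", "year"], how="inner")

# Binary label: 1 for high incidence (>= 10), 0 for low incidence (< 10).
data["group"] = (data["incidence"] >= 10).astype(int)

print(data)
```

The inner join drops any country-year pair that is missing from either source, which is one simple way to implement "matching country entries".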

Final features included:

  • year, precipitation, AvMeanSurAirTemp, AvMaxSurAirTemp, AvMinSurAirTemp, longitude, latitude
  • Target: incidence (for regression) or group (for classification)
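
As a rough sketch of the feature engineering mentioned in the preprocessing steps, the snippet below derives pairwise interactions and row-wise aggregates from the climate columns listed above. It assumes an interaction is the product of two features and that the aggregates are taken across the three temperature columns; the original write-up does not specify either, and the engineered column names and sample values are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Two illustrative rows using the climate feature names from the list above.
df = pd.DataFrame({
    "precipitation": [120.0, 20.0],
    "AvMeanSurAirTemp": [26.0, 16.0],
    "AvMaxSurAirTemp": [31.0, 22.0],
    "AvMinSurAirTemp": [21.0, 10.0],
})

temp_cols = ["AvMeanSurAirTemp", "AvMaxSurAirTemp", "AvMinSurAirTemp"]
base_cols = ["precipitation"] + temp_cols

engineered = df.copy()

# Pairwise interactions (assumed here to be products of feature pairs).
for a, b in combinations(base_cols, 2):
    engineered[f"{a}_x_{b}"] = df[a] * df[b]

# Row-wise aggregates across the temperature columns (hypothetical names).
engineered["temp_mean"] = df[temp_cols].mean(axis=1)
engineered["temp_std"] = df[temp_cols].std(axis=1)
engineered["temp_range"] = df[temp_cols].max(axis=1) - df[temp_cols].min(axis=1)

print(engineered.columns.tolist())
```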

⚙️ Approach & Methodology

Algorithms Used

Regression Models (9 total):

  • Linear, Ridge, Lasso
  • K-Nearest Neighbors, Decision Tree, Random Forest
  • XGBoost, CatBoost, AdaBoost

Classification Models (10 total):

  • Random Forest, Gradient Boosting, MLP, SVC
  • Logistic Regression, XGBoost, CatBoost, AdaBoost, Decision Tree, KNN

🔬 Evaluation Metrics

  • Regression: RMSE, MAE, R²
  • Classification: F1 Score, Accuracy, ROC AUC
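
For reference, the metrics listed above can all be computed with scikit-learn. The snippet is a small sketch on made-up numbers (the arrays are placeholders, not project results), and the 0.5 cut-off for turning scores into class labels is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

# Regression: placeholder true and predicted incidence values.
y_true = np.array([250.0, 230.0, 4.0, 12.0])
y_pred = np.array([240.0, 235.0, 6.0, 9.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Classification: placeholder labels and predicted probabilities of "high".
labels = np.array([1, 1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.3, 0.4, 0.6])
predicted = (scores >= 0.5).astype(int)  # assumed decision threshold

f1 = f1_score(labels, predicted)
accuracy = accuracy_score(labels, predicted)
roc_auc = roc_auc_score(labels, scores)  # uses scores, not hard labels

print(rmse, mae, r2, f1, accuracy, roc_auc)
```

Note that ROC AUC is computed from the continuous scores, while F1 and accuracy need thresholded class labels.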

Cross-validation: 5-fold stratified
Hyperparameter tuning: GridSearchCV
Feature Engineering: Pairwise interactions, max/min/mean aggregations
Feature Selection: Recursive addition/elimination
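
A minimal sketch of how stratified 5-fold cross-validation and GridSearchCV fit together for the classification task is shown below. Several things here are assumptions rather than the project's setup: the data is synthetic, RandomForestClassifier is used as a stand-in so the example depends only on scikit-learn, and the parameter grid is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic stand-in for a table with 7 features.
X = rng.normal(size=(200, 7))
# Synthetic binary label loosely tied to the first two columns.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200) > 0).astype(int)

# Stratified folds keep the high/low class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Arbitrary illustrative grid; the project's actual grid is not documented here.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",
    cv=cv,
)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_  # mean F1 across the 5 folds
print(best_params, round(best_f1, 3))
```

Stratification applies naturally to the binary label; for the regression target, a plain KFold splitter would be the usual scikit-learn counterpart.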

📈 Key Results

  • Best Regression Model: CatBoost Regressor (Test R²: 0.9774)
  • Best Classifier: MLPClassifier (Test F1 Score: 0.990)
  • Predictions closely mirrored real-world WHO data across years
  • Web app includes:
    • 🌍 Interactive choropleth maps (per year)
    • 🧮 Prediction panel with live model inference
    • 🕹 Input sliders for temperature, rainfall, location

🌍 Application Features

Prediction Panel

Users can input:

  • Country
  • Year
  • Precipitation
  • Average/Max/Min surface temperatures
  • Location (lat/lon)

The app then returns the predicted malaria incidence.

[Figure: example prediction for Congo in 2020]
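
Behind a panel like this, the user inputs have to be arranged in the same column order the model was trained on before calling predict. The sketch below illustrates that step under several assumptions: the model is a scikit-learn RandomForestRegressor fitted on synthetic data (not the deployed CatBoost model), predict_incidence is a hypothetical helper, and the input values are placeholders. It uses the seven final features listed in the dataset section, so country is not passed to the model here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

FEATURES = ["year", "precipitation", "AvMeanSurAirTemp", "AvMaxSurAirTemp",
            "AvMinSurAirTemp", "longitude", "latitude"]

# Synthetic training table standing in for the real dataset.
rng = np.random.default_rng(0)
n = 100
train = pd.DataFrame({
    "year": rng.integers(2000, 2022, size=n),
    "precipitation": rng.uniform(0, 3000, size=n),
    "AvMeanSurAirTemp": rng.uniform(10, 30, size=n),
    "AvMaxSurAirTemp": rng.uniform(15, 38, size=n),
    "AvMinSurAirTemp": rng.uniform(0, 25, size=n),
    "longitude": rng.uniform(-180, 180, size=n),
    "latitude": rng.uniform(-60, 60, size=n),
})[FEATURES]
# Made-up target, only so the stand-in model has something to fit.
target = 0.1 * train["precipitation"] + 2.0 * train["AvMeanSurAirTemp"]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(train, target)


def predict_incidence(model, **inputs):
    """Build a one-row frame in training column order and return the prediction."""
    missing = [name for name in FEATURES if name not in inputs]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    row = pd.DataFrame([[inputs[name] for name in FEATURES]], columns=FEATURES)
    return float(model.predict(row)[0])


# Placeholder inputs of the kind a user might enter in the panel.
prediction = predict_incidence(
    model, year=2020, precipitation=1500.0, AvMeanSurAirTemp=24.5,
    AvMaxSurAirTemp=29.0, AvMinSurAirTemp=20.0, longitude=15.8, latitude=-0.2,
)
print(prediction)
```

In a Streamlit app, the keyword arguments would come from widgets such as sliders and number inputs instead of literals.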

🗺 Interactive Map

Choropleth maps of actual vs predicted malaria incidence across years. Users can animate across 2000–2020.

💡 Lessons Learned / Innovations

  • Feature selection improved interpretability but not always model performance
  • Normalization gave only a marginal benefit, mainly for linear models
  • Model accuracy dropped significantly when removing year or country from features
  • Feature engineering boosted performance for tree-based models, especially CatBoost
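
To make the feature-selection point concrete, here is a small sketch of recursive feature elimination using scikit-learn's RFE. It is an illustration under assumptions, not the project's procedure: the data is synthetic, the target is constructed to depend on only two columns, and RandomForestRegressor stands in for the models actually used. For the "addition" direction, scikit-learn's SequentialFeatureSelector with direction="forward" would be one comparable option.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

FEATURES = ["year", "precipitation", "AvMeanSurAirTemp", "AvMaxSurAirTemp",
            "AvMinSurAirTemp", "longitude", "latitude"]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(FEATURES)))
# Synthetic target that depends only on columns 1 and 2
# (labelled precipitation and AvMeanSurAirTemp here).
y = 3.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Repeatedly fit the model and drop the least important feature.
selector = RFE(
    RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=2,
    step=1,
)
selector.fit(X, y)

selected = [name for name, keep in zip(FEATURES, selector.support_) if keep]
print(selected)
```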

💬 Discussion

This project demonstrates how environmental datasets can be used to predict malaria burden with high accuracy. Unlike black-box epidemiological models, this ML-driven approach is transparent, testable, and scalable.

Its greatest strength lies in its accessibility: anyone can simulate future scenarios or test real-world conditions using the live app.

👉 Try the tool now

malaria-incidence-predictor-webapp.streamlit.app

