AutomatedMLPack: A Python Package for End-to-End Automated Machine Learning

  • Client or Institution: Internal (self-initiated project)
  • Project Goal: To streamline the end-to-end machine learning workflow—from data ingestion to evaluation and visualization—using a command-line-enabled Python package
  • Tools Used: Python, Scikit-learn, Matplotlib, Seaborn, NumPy, Pandas, argparse
  • Duration: 2 weeks
  • Outcome: Fully functional and reusable Python package supporting classification and regression with CLI control
  • Link: GitHub Repository

Command-Line Simplicity Meets Full-Stack ML Functionality

Abstract

AutomatedMLPack is a Python-based automation toolkit for machine learning that simplifies the data science pipeline. I built this package to support automated training and evaluation of multiple models on tabular datasets, using simple command-line arguments. The package includes modular scripts for ingestion, preprocessing, model training, performance comparison, and result visualization. Its CLI (run_train_pipeline.py) enables fast, repeatable experiments on classification or regression tasks, with feature selection, model tuning, and logging.

🧠 Background & Problem Statement

Despite the growing demand for ML solutions, many workflows remain inefficient—requiring repetitive code and manual parameter tracking. The goal of this project was to automate key ML stages while keeping the interface accessible via a simple CLI.

The challenge was to design a package that is flexible for both beginners and experienced practitioners—supporting any tabular dataset, custom feature engineering, optional scaling, and advanced model benchmarking.

🗂 Dataset

  • Dataset used: Heart Disease Classification Dataset
  • Data Compatibility: Any structured/tabular dataset (CSV/TSV)
  • Preprocessing Options:
    • Standard scaling (optional)
    • Train/test split (customizable)
    • Feature selection using multiple strategies
    • Support for engineering new features
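
The splitting and optional scaling steps listed above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the package's own code: the function name split_and_scale, its parameters, and the synthetic data are all assumptions made for the example. The scaler is fitted on the training split only, so no information from the test split leaks into preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale(X, y, test_size=0.2, scale=True, seed=0):
    """Hypothetical helper: stratified split, then optional standard scaling."""
    # stratify=y keeps class proportions equal in both splits
    # (appropriate for classification; drop it for regression targets).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    if scale:
        # Fit on the training data only, then apply to both splits.
        scaler = StandardScaler().fit(X_train)
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)
    return X_train, X_test, y_train, y_test


# Synthetic stand-in for a tabular dataset: 100 rows, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = split_and_scale(X, y, test_size=0.25)
print(X_train.shape, X_test.shape)
```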

⚙️ Approach & Methodology

📁 Project Structure

The core codebase is organized into:

  • /modules/components: ingestion, transformation, and model training classes
  • /utils: visualization functions and object serialization
  • run_train_pipeline.py: the CLI tool that orchestrates the workflow
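
To give a feel for what an argparse-driven entry point like run_train_pipeline.py might look like, here is a minimal sketch. The flag names (--input, --target, --task, --test-size, --scale) and the file name data.csv are illustrative assumptions, not the package's actual options; the real script may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical training CLI (flag names are illustrative)."""
    parser = argparse.ArgumentParser(
        prog="run_train_pipeline.py",
        description="Train and evaluate models on a tabular dataset.",
    )
    parser.add_argument("--input", required=True, help="Path to a CSV/TSV file.")
    parser.add_argument("--target", required=True, help="Name of the target column.")
    parser.add_argument(
        "--task",
        choices=["classification", "regression"],
        default="classification",
        help="Type of learning problem.",
    )
    parser.add_argument(
        "--test-size", type=float, default=0.2,
        help="Fraction of rows held out for testing.",
    )
    parser.add_argument(
        "--scale", action="store_true",
        help="Apply standard scaling before training.",
    )
    return parser


def main(argv=None) -> argparse.Namespace:
    """Parse arguments; a real pipeline would then run ingestion and training."""
    args = build_parser().parse_args(argv)
    print(f"task={args.task} target={args.target} scale={args.scale}")
    return args


# Passing the argument list explicitly mimics a terminal invocation.
args = main(["--input", "data.csv", "--target", "label", "--scale"])
```

Keeping the parser in its own function makes it easy to test the flags without touching the training code.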

🛠 Pipeline Features

  • Flexible CLI with argparse
    Define input files, target column, model type, feature selection strategy, and more directly from the terminal.
    (Figure: CLI script with customizable training flags and pipeline logic.)
  • Modeling
    Supports ensemble methods, boosting, neural nets, SVMs, logistic regression, and decision trees using scikit-learn.
  • Feature Selection
    Offers 8 rankers and 13 aggregation methods for optimal subset selection.
  • Visualization
    Generates comparative bar plots with error bars for model performance.
  • Evaluation Reports
    Generates .txt reports including precision, recall, F1, support, and accuracy for all classes.
    (Figure: Exported text report with detailed classification metrics.)
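
The modeling and reporting features above can be approximated in a few lines of scikit-learn. The sketch below is an assumption-laden illustration, not the package's implementation: the three models, the synthetic dataset from make_classification, and the report file names are all chosen for the example. It trains each model, records test accuracy, and writes a text report with precision, recall, F1, and support per class.

```python
from pathlib import Path
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a real tabular dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A small, illustrative subset of the model families mentioned above.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

report_dir = Path(tempfile.mkdtemp())  # hypothetical output location
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)
    # classification_report returns a plain-text table of per-class metrics.
    report = classification_report(y_test, y_pred)
    (report_dir / f"{name}_report.txt").write_text(report)

# Print models from best to worst test accuracy.
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```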

📈 Key Results

  • 📦 Built a reusable, installable Python library for automated ML
  • 🧪 Enabled quick benchmarking across 10+ models
  • 📊 Delivered plug-and-play visualizations and performance comparisons
  • 🖥 CLI allows running complex experiments with a single command

💡 Lessons Learned / Innovations

  • Implementing ensemble feature selection required handling rankers whose scores sit on different scales, which I solved by applying robust scaling to the scores before combining them
  • Learned to write clean, modular code compatible with argparse and reusable for scripting
  • The CLI dramatically reduced experimentation time compared to manual scripting
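
To illustrate the score-scale problem from the first lesson, here is a small sketch of one way to combine rankers after robust scaling. I am assuming "robust scaling" means centering on the median and dividing by the interquartile range, and that the aggregation is a simple mean; the ranker names, feature names, and scores are made up for the example, and the package's own rankers and aggregation methods may work differently.

```python
from statistics import mean, median, quantiles


def robust_scale(scores):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    med = median(scores)
    if iqr == 0:
        # All central values are identical; no spread to scale by.
        return [0.0 for _ in scores]
    return [(s - med) / iqr for s in scores]


def aggregate_rankers(ranker_scores):
    """Average robust-scaled scores per feature and sort best-first."""
    features = list(next(iter(ranker_scores.values())))
    scaled = {}
    for ranker, by_feature in ranker_scores.items():
        values = robust_scale([by_feature[f] for f in features])
        scaled[ranker] = dict(zip(features, values))
    combined = {f: mean(scaled[r][f] for r in scaled) for f in features}
    return sorted(combined, key=combined.get, reverse=True)


# Two hypothetical rankers: one scores in [0, 1], the other in the hundreds.
ranker_scores = {
    "ranker_a": {"feat_a": 0.9, "feat_b": 0.2, "feat_c": 0.6, "feat_d": 0.1},
    "ranker_b": {"feat_a": 120.0, "feat_b": 15.0, "feat_c": 200.0, "feat_d": 30.0},
}
ranking = aggregate_rankers(ranker_scores)
print(ranking)
```

Without the scaling step, ranker_b's larger raw values would dominate any average and ranker_a would have almost no say in the final order.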

💬 Discussion

AutomatedMLPack brings the principles of AutoML into a lightweight, customizable toolkit suited for research, hackathons, and rapid prototyping. It balances automation with flexibility, allowing users to override defaults and control how their pipeline behaves.

The ability to quickly test multiple models, visualize outcomes, and export reports without writing redundant code is the package's main productivity gain.

📣 Call to Action

This was a self-initiated project aimed at improving machine learning productivity. If you’re building ML pipelines or teaching ML to students, check out the repo or start a conversation.

🔭 Final Thoughts & Future Directions

There’s great potential to extend this tool:

  • Add Docker support for containerized experiments
  • Introduce hyperparameter search grids with cross-validation
  • Extend support for deep learning via TensorFlow or PyTorch
  • Integrate Streamlit or Gradio as a web interface wrapper

