< Machine Learning / Data Science / >

Telecom Customer Churn Prediction (ML + Streamlit)


A team-based, end-to-end machine learning project that predicts telecom customer churn. The pipeline builds a reproducible preprocessing workflow, compares Logistic Regression against Random Forest, persists the best model with joblib, and deploys an interactive Streamlit app with real-time risk scoring and SHAP-based explainability. I led the modeling, preprocessing, and deployment pipeline work.

Project Overview

This team project builds a complete churn prediction workflow from raw telecom data to a deployed web app. The pipeline performs robust preprocessing (cleaning, encoding, scaling, and stratified splitting), trains and evaluates multiple models, selects the best model using ROC-AUC, and serves predictions through a professional Streamlit UI. The app includes consistent feature alignment with saved preprocessing artifacts to ensure training/serving parity and provides explainability via SHAP feature contributions. I was responsible for leading the modeling and deployment pipeline and resolving key data and production issues.

Core Technologies

Python, pandas, NumPy, scikit-learn (StandardScaler, LogisticRegression, RandomForestClassifier, metrics), joblib for model/artifact persistence, and Streamlit for deployment. SHAP is integrated for interpretability with compatibility handling for newer NumPy versions.

My Role & Contributions

🗹Designed and implemented the main preprocessing script (data_preprocessing.py): dropping irrelevant columns, handling missing values, and preparing features for modeling.

🗹Implemented feature engineering for churn: binary encoding of Yes/No-style variables and one-hot encoding of categorical features (including State) with drop_first to reduce multicollinearity.
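A minimal sketch of that encoding step; the column names here are illustrative stand-ins, not the actual dataset schema:

```python
import pandas as pd

# Toy frame standing in for the telecom data.
df = pd.DataFrame({
    "International plan": ["Yes", "No", "No"],
    "Voice mail plan": ["No", "Yes", "No"],
    "State": ["KS", "OH", "KS"],
})

# Binary-encode the Yes/No-style variables.
for col in ["International plan", "Voice mail plan"]:
    df[col] = df[col].map({"Yes": 1, "No": 0})

# One-hot encode remaining categoricals; drop_first drops one dummy per
# feature to avoid perfect multicollinearity among the indicator columns.
df = pd.get_dummies(df, columns=["State"], drop_first=True)
```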

🗹Improved numeric robustness by explicitly converting key numeric columns to numeric (errors='coerce') and applying median imputation, which fixed production errors caused by blanks and malformed strings appearing in numeric fields.
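The coercion-plus-median pattern can be sketched with a toy series standing in for a real numeric column:

```python
import pandas as pd

# Strings and blanks sneaking into a numeric field (illustrative values).
s = pd.Series(["10.5", "bad", "20.5", None])

# Coerce: anything unparseable becomes NaN instead of raising later.
s = pd.to_numeric(s, errors="coerce")

# Median imputation; persisting the median lets prediction time
# reuse the exact training-time fill value.
median = s.median()
s = s.fillna(median)
```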

🗹Set up consistent scaling by selecting numeric columns only before StandardScaler, and persisted scaler, feature column order, and numeric medians for reuse at prediction time.
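A sketch of the scaling and persistence step, assuming toy column names; feature_columns.pkl matches the artifact named in the text, while scaler.pkl is an assumed file name:

```python
import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"calls": [1, 2, 3, 4],
                   "minutes": [10.0, 20.0, 30.0, 40.0],
                   "State_OH": [0, 1, 0, 1]})

# Scale numeric columns only; one-hot dummies stay untouched.
numeric_cols = ["calls", "minutes"]  # illustrative selection
scaler = StandardScaler().fit(df[numeric_cols])
df[numeric_cols] = scaler.transform(df[numeric_cols])

# Persist everything serving needs to reproduce this step exactly.
joblib.dump(scaler, "scaler.pkl")
joblib.dump(list(df.columns), "feature_columns.pkl")
```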

🗹Implemented stratified Train/Validation/Test splitting to preserve churn ratio across splits.
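The two-stage stratified split can be sketched as follows, with synthetic data standing in for the telecom set; stratify=y keeps the churn ratio consistent across all three splits:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (~15% positive class, like churn).
X, y = make_classification(n_samples=1000, weights=[0.85], random_state=0)

# Carve out the test set first, then split the rest into train/val.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)
```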

🗹Developed the training pipeline (train.py / pipeline.py) comparing Logistic Regression and Random Forest, handling class imbalance using class_weight='balanced', and selecting the best model by ROC-AUC.
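A condensed sketch of that comparison-and-selection loop on synthetic data, including persisting the winner as best_model.pkl; hyperparameter values are illustrative:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.85], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights the minority (churn) class.
candidates = {
    "logreg": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "rf": RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                 random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# Select by validation ROC-AUC and persist the best model.
best_name = max(scores, key=scores.get)
joblib.dump(candidates[best_name], "best_model.pkl")
```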

🗹Refactored model evaluation and artifact saving so the chosen best model is persisted as best_model.pkl along with scaler and feature_columns.pkl for reliable deployment.

🗹Reworked the Streamlit app’s preprocessing to exactly mirror the training pipeline, fixed feature-mismatch bugs, and aligned all inputs with the saved scaler and model.
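The serving-side alignment can be sketched roughly as below; the align helper, column lists, and median values are hypothetical stand-ins for the saved artifacts:

```python
import pandas as pd

# Stand-ins for artifacts saved during training.
feature_columns = ["calls", "minutes", "State_OH", "State_TX"]
numeric_medians = {"calls": 2.0, "minutes": 25.0}

def align(raw: pd.DataFrame) -> pd.DataFrame:
    """Reproduce training-time preprocessing for one prediction request."""
    df = raw.copy()
    # Fill numeric gaps with the *training* medians, never request-local ones.
    for col, med in numeric_medians.items():
        if col in df:
            df[col] = pd.to_numeric(df[col], errors="coerce").fillna(med)
        else:
            df[col] = med
    # reindex enforces the exact training column set and order: dummy
    # columns unseen in this request are added as 0, extras are dropped.
    return df.reindex(columns=feature_columns, fill_value=0)

row = align(pd.DataFrame([{"calls": "3", "State_OH": 1, "junk": 9}]))
```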

🗹Redesigned the Streamlit UI sections (Customer Profile, Services & Features, Contract & Billing) with professional HTML/CSS icons and a 3-tier churn risk display (Low/Medium/High).

🗹Integrated SHAP-based explainability for both Random Forest and Logistic Regression, including NumPy 2.x compatibility fixes and visualization of top contributing features.

Key Outcomes

Delivered a stable training-to-deployment pipeline by saving and reusing preprocessing artifacts (scaler, feature columns, numeric medians) and ensuring the Streamlit app applies exactly the transformations used in training. Resolved multiple data type and feature-mismatch issues, and added SHAP-based interpretability so users can see which features drive churn vs retention for each prediction.

Next Improvements (Planned)

Add hyperparameter tuning (GridSearchCV/RandomizedSearchCV), calibration for better probability estimates, more robust categorical handling via ColumnTransformer/Pipeline, and monitoring for data drift if deployed in production.
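The planned tuning step might look like this minimal RandomizedSearchCV sketch on synthetic data; the parameter ranges are illustrative, not tuned values:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, weights=[0.85], random_state=0)

# Sample a few configurations and keep the best by cross-validated ROC-AUC.
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions={"n_estimators": randint(50, 300),
                         "max_depth": randint(3, 15)},
    n_iter=5, scoring="roc_auc", cv=3, random_state=0)
search.fit(X, y)
```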