An end-to-end machine learning system that predicts loan default risk using the Home Credit dataset, combining LightGBM modeling, SHAP/LIME explainability, and RAG for regulatory-aligned reporting.
Key findings: Borrowers requesting larger loan amounts and those carrying existing debt burdens were significantly more likely to default.
- Author
- Table of Contents
- Business Context
- Data source
- Methods
- Live Demo
- Tech Stack
- Quick glance at the results
- Lessons learned and recommendation
- Limitation and what can be improved
- Repository structure
This model predicts whether a loan applicant will repay or default, using data from 250k+ applications. Predictions are made with LightGBM and explained with SHAP and LIME for transparency, and outputs are structured in line with EBA compliance standards.
- Feature Engineering: Cleaned and merged data from 8 data sources, then engineered 58 predictive features.
- Automated Optimization: Used Optuna for hyperparameter tuning and benchmarked Logistic Regression against XGBoost and LightGBM to find the best model.
- Explainable AI (XAI): Integrated SHAP and LIME to highlight the major financial drivers behind predictions.
- Regulatory RAG: Built a RAG pipeline to cross-reference model outputs against EBA (European Banking Authority) standards.
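To illustrate the aggregate-then-join pattern behind the feature engineering step, here is a minimal sketch. It is not the project's code: the standard-library `sqlite3` module stands in for DuckDB so the snippet runs without extra dependencies, the two tiny tables and their rows are made up, and the feature names `n_bureau_loans` and `total_bureau_debt` are hypothetical. The column names follow the Home Credit naming style and should be treated as assumptions.

```python
import sqlite3

# In-memory database; sqlite3 is a stand-in for DuckDB in this sketch.
con = sqlite3.connect(":memory:")

# Made-up miniature versions of two source tables (assumed column names).
con.executescript("""
CREATE TABLE application (SK_ID_CURR INTEGER, AMT_CREDIT REAL, TARGET INTEGER);
CREATE TABLE bureau (SK_ID_CURR INTEGER, AMT_CREDIT_SUM_DEBT REAL);
INSERT INTO application VALUES (1, 200000, 0), (2, 500000, 1), (3, 100000, 0);
INSERT INTO bureau VALUES (1, 10000), (1, 5000), (2, 300000);
""")

# Aggregate the one-to-many table to one row per applicant, then LEFT JOIN
# so applicants without any bureau history are kept (with zero-filled features).
rows = con.execute("""
SELECT a.SK_ID_CURR,
       a.AMT_CREDIT,
       COALESCE(b.n_loans, 0)      AS n_bureau_loans,
       COALESCE(b.total_debt, 0.0) AS total_bureau_debt
FROM application AS a
LEFT JOIN (
    SELECT SK_ID_CURR,
           COUNT(*)                 AS n_loans,
           SUM(AMT_CREDIT_SUM_DEBT) AS total_debt
    FROM bureau
    GROUP BY SK_ID_CURR
) AS b ON a.SK_ID_CURR = b.SK_ID_CURR
ORDER BY a.SK_ID_CURR
""").fetchall()

for row in rows:
    print(row)
```

Aggregating before joining keeps one row per applicant, which avoids duplicating the target label when a borrower has several previous credits.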
The trained LightGBM model is deployed as an interactive Streamlit web application hosted on Hugging Face Spaces, where you can:
- Generate random credit risk predictions
- Upload CSV files for batch predictions
- View model performance metrics
- Python (refer to requirements.txt for the packages used in this project)
- DuckDB (aggregating and joining multiple CSVs)
- Scikit-learn, XGBoost, LightGBM (machine learning)
- SHAP & LIME (model explainability)
- Optuna (hyperparameter tuning)
- LangChain (for RAG explanations)
- Pinecone (vector database)
- Streamlit (interactive web application & deployment)
- Hugging Face (model deployment)
Target distribution between the features.
Summary bar of major features
Confusion matrix of LightGBM.
ROC curve of LightGBM.
Top 3 models
| Model | AUC-ROC score |
|---|---|
| LightGBM (tuned) | 72.53% |
| XGBoost (tuned) | 72.56% |
| LightGBM (baseline) | 72.08% |
- The final model used is LightGBM, because it maximizes recall, catching a larger fraction of potential defaults.
- Metrics used: Recall, AUC-ROC, AUC-PR, Precision, F1-score, KS, Gini
Primary Metric: AUC-ROC. Credit risk data is highly imbalanced, so AUC-ROC is best suited here, as it measures how well the model separates defaulters from non-defaulters.
Supporting Metrics: Precision, Recall, F1
- Recall is critical as missing a high-risk borrower leads to real financial loss.
- Precision helps ensure we don't wrongly reject too many good borrowers.
- F1 balances both precision and recall.
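The metrics listed above can be sketched in a few lines of plain Python. This is an illustrative example on made-up toy labels and scores, not the project's evaluation code (the tech stack lists scikit-learn for that). All function names are hypothetical. Gini is computed as `2 * AUC - 1` and KS as the largest gap between true-positive and false-positive rate over all score cut-offs, which are the usual credit-scoring definitions.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Probability that a random defaulter outscores a random non-defaulter
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Gini coefficient as commonly defined in credit scoring."""
    return 2 * auc - 1


def ks_statistic(y_true, scores):
    """Largest absolute gap between TPR and FPR over all score cut-offs."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    best = 0.0
    for thr in sorted(set(scores)):
        tpr = sum(1 for s in pos if s >= thr) / len(pos)
        fpr = sum(1 for s in neg if s >= thr) / len(neg)
        best = max(best, abs(tpr - fpr))
    return best


# Toy data (made up): 3 defaulters among 10 applicants.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.35, 0.1, 0.6, 0.05, 0.7, 0.3, 0.15]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
gini = gini_from_auc(auc)
ks = ks_statistic(y_true, scores)
print(precision, recall, f1, auc, gini, ks)
```

Note that precision, recall, and F1 depend on the chosen cut-off (0.5 here), while AUC-ROC, Gini, and KS are computed from the raw scores.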
What I found:
- Loan amount, existing debt ratio, and age were the strongest predictors of default.
- Hyperparameter tuning barely improved model performance, which suggests that features matter more than tuning.
- For imbalanced data, AUC-ROC matters far more than accuracy, and the default 0.5 threshold doesn't hold up. For example, the optimal threshold for Logistic Regression was 0.121.
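A small sketch of why the 0.5 cut-off fails on imbalanced data and how a lower threshold can be found. The README does not state which criterion produced the 0.121 value, so this example assumes F1 maximization over a fixed grid. The data is made up and the function names are hypothetical.

```python
def f1_at_threshold(y_true, scores, thr):
    """F1 score when every score >= thr is flagged as a default."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < thr)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, candidates=None):
    """Grid-search the cut-off that maximizes F1 (assumed criterion)."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda thr: f1_at_threshold(y_true, scores, thr))


# Toy imbalanced data (made up): 4 defaulters among 20 applicants.
# As is typical with rare positives, predicted probabilities are all low.
pos_scores = [0.125, 0.175, 0.225, 0.305]
neg_scores = [0.025, 0.035, 0.045, 0.055, 0.055, 0.065, 0.075, 0.085,
              0.085, 0.095, 0.105, 0.115, 0.135, 0.145, 0.165, 0.195]
y_true = [1] * len(pos_scores) + [0] * len(neg_scores)
scores = pos_scores + neg_scores

f1_default = f1_at_threshold(y_true, scores, 0.5)
thr_best = best_threshold(y_true, scores)
f1_best = f1_at_threshold(y_true, scores, thr_best)
print(f1_default, thr_best, f1_best)
```

With the default 0.5 cut-off no applicant is flagged at all, so recall and F1 are zero. In practice the threshold should be chosen on a validation set rather than on the data used for final evaluation.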
Recommendations:
- Focus more on the loan amount when making lending decisions, since larger loans carry the most risk. Also accept that precision will be low: some good customers will be rejected in order to catch defaults.
- Low precision means 80% of rejected applicants are false positives (lost customers)
- Hyperparameter tuning with Optuna takes 1+ hours
- Incorporate additional external data sources
- Monitor model performance over time and retrain quarterly
Repository Structure
credit-risk-model/
├── assets/  # Images used in the README
│   ├── confusion_matrix_lgbm_tuned.png
│   ├── credit.gif
│   ├── roc_curve.png
│   ├── shap_summary_bar.png
│   └── target_dist.png
│
├── data/  # All data (raw, processed, samples)
│   ├── data_sample/  # small samples for quick loading
│   │   ├── application_test_sample.csv
│   │   ├── application_train_sample.csv
│   │   ├── bureau_balance_sample.csv
│   │   ├── bureau_sample.csv
│   │   ├── credit_card_balance_sample.csv
│   │   ├── installments_payments_sample.csv
│   │   ├── POS_CASH_balance_sample.csv
│   │   └── previous_application_sample.csv
│   │
│   ├── processed/  # cleaned + feature engineered datasets (not tracked in git)
│   │   ├── agg_main.csv
│   │   ├── cleaned_train.csv
│   │   ├── cleaned_val.csv
│   │   ├── feature_engineered_val.csv
│   │   ├── feature_engineered.csv
│   │   ├── target_train.csv
│   │   └── target_val.csv
│   │
│   └── raw/  # original home credit datasets (not tracked in git)
│       ├── application_test.csv
│       ├── application_train.csv
│       ├── bureau_balance.csv
│       ├── bureau.csv
│       ├── credit_card_balance.csv
│       ├── installments_payments.csv
│       ├── POS_CASH_balance.csv
│       ├── previous_application.csv
│       └── README.md  # Download instructions
│
├── models/  # Saved trained models (not tracked in git)
│   ├── LightGBM.joblib
│   ├── log_reg.joblib
│   └── XGBoost.joblib
│
├── notebooks/  # Jupyter notebooks for analysis + modelling + interpretation
│   ├── 01_eda.ipynb
│   ├── 02_modelling.ipynb
│   └── 03_explainability.ipynb
│
├── results/  # Generated plots and outputs
│   ├── EDA/  # EDA visualisations
│   │   ├── CODE_GENDER_target_relationship.png
│   │   ├── CODE_GENDER_value_counts.png
│   │   ├── correlation_matrix.png
│   │   ├── missing_value_map.png
│   │   ├── NAME_CONTRACT_TYPE_target_relationship.png
│   │   ├── NAME_CONTRACT_TYPE_value_counts.png
│   │   ├── numeric_boxplots.png
│   │   ├── numeric_histograms.png
│   │   ├── OCCUPATION_TYPE_target_relationship.png
│   │   ├── OCCUPATION_TYPE_value_counts.png
│   │   └── target_dist.png
│   │
│   └── explainability/  # Model interpretation outputs
│       ├── lime_0.png
│       ├── lime_1.png
│       ├── lime_2.png
│       ├── roc_curve.png
│       ├── shap_dependence_age_years.png
│       ├── shap_dependence_avg_debt_ratio.png
│       ├── shap_dependence_CODE_GENDER.png
│       ├── shap_dependence_total_credit_requested.png
│       ├── shap_dependence_value_of_goods_financed.png
│       ├── shap_summary_bar.png
│       └── shap_summary.png
│
├── src/  # Python modules
│   └── credit_risk_model/  # Package folder for imports
│       ├── __init__.py
│       ├── aggregations.py  # Aggregations
│       ├── config.py  # Paths and constants
│       ├── data_cleaning.py  # Cleaning + preprocessing logic
│       ├── data_ingestion.py  # DuckDB ingestion + joins
│       ├── feat_eng.py  # Feature engineering functions
│       └── model.py  # Training + evaluation
│
├── .gitignore  # Files/folders ignored by git
├── home_credit.duckdb  # DuckDB database file
├── pyproject.toml  # Build system config
├── README.md  # Project overview
└── requirements.txt  # Required python packages




