Bank Marketing Classification 🏦

An end-to-end Machine Learning pipeline built with PySpark's MLlib to predict whether a bank customer will subscribe to a term deposit.

🧠 Overview

This script performs a binary classification task using historical bank marketing data. It constructs a full machine learning pipeline, from data ingestion and preprocessing to model training and evaluation, all within the distributed computing framework of PySpark.

Goal: Predict the binary target variable y ("yes" or "no") which indicates if the client has subscribed to a term deposit.

⚙️ ML Pipeline Steps

The solution is structured into three main phases; the preprocessing and modeling stages are chained with PySpark's Pipeline API so they run as a single unit.

1. Dataset Ingestion and Schema Inference

The dataset is loaded directly into a Spark DataFrame.

| Setting | Value | Description |
| --- | --- | --- |
| Source | CSV file | UCI Bank Marketing Data (Assumed) |
| Delimiter | `;` (Semicolon) | Original dataset format. |
| Headers | `True` | The first row is treated as column names. |
| Schema | `inferSchema=True` | PySpark automatically detects column data types. |

Key Columns (Target):

y (string: "yes" or "no") → Target Variable

2. Data Preprocessing Pipeline 🧹

A sequential pipeline transforms the raw data into the format required for the ML model.

A. Categorical Feature Encoding

All categorical features (e.g., job, marital, education) are transformed:

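String Indexing: Converts string categories into numeric indices (e.g., "admin." → 0, "technician" → 1). By default, StringIndexer assigns indices by descending frequency, so the most common category receives index 0.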
One-Hot Encoding: Converts the indices into a sparse binary vector, preventing the model from assuming an ordinal relationship between categories.

B. Label Transformation

The target variable y is converted into a numeric label column:

"yes" → $1.0$
"no" → $0.0$

C. Feature Assembly

A VectorAssembler combines all encoded categorical vectors and numeric columns (e.g., age, balance, duration) into a single features vector, which is the required input format for PySpark's MLlib models.

3. Model Training & Evaluation 📈

A. Data Split

The prepared DataFrame is split for training and testing:

$80\%$ for Training
$20\%$ for Testing
seed=42 for reproducible results.

B. Model Selection

A LogisticRegression model is chosen for this binary classification task. The final Pipeline object includes all preprocessing steps and the model.

C. Model Fitting and Prediction

The pipeline is fitted on the training data, and then used to generate predictions on the unseen test data.

D. Performance Metric

Evaluation uses the BinaryClassificationEvaluator to compute the area under the Receiver Operating Characteristic (ROC) curve, abbreviated AUC.

Result: An AUC $\approx 0.91$ indicates a strong ability to distinguish between customers who will and will not subscribe.

💾 Final Output

The script generates the following output files in the output/ directory:

| File | Description |
| --- | --- |
| `model_auc.txt` | A text file containing the final calculated AUC score. |
| `roc_curve.png` | A visualization of the ROC curve, showing the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR). |
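A rough sketch of how two such files could be written, assuming matplotlib is used for the plot (the README does not name the plotting library). The AUC value and the ROC points are placeholders, and the files go to a temporary directory here rather than the repository's output/ folder.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

# Placeholder values; in the script these would come from the evaluation step.
auc = 0.75
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]

out_dir = Path(tempfile.mkdtemp()) / "output"
out_dir.mkdir(parents=True, exist_ok=True)

# Text file with the score.
(out_dir / "model_auc.txt").write_text(f"{auc:.4f}\n")

# ROC curve with the chance diagonal for reference.
fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC (AUC = {auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.legend()
fig.savefig(out_dir / "roc_curve.png")
plt.close(fig)
```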
