
Bank Marketing Classification 🏦

An end-to-end Machine Learning pipeline built with PySpark's MLlib to predict whether a bank customer will subscribe to a term deposit.


🧠 Overview

This script performs a binary classification task using historical bank marketing data. It constructs a full machine learning pipeline, from data ingestion and preprocessing to model training and evaluation, all within the distributed computing framework of PySpark.

Goal: Predict the binary target variable y ("yes" or "no"), which indicates whether the client has subscribed to a term deposit.


βš™οΈ ML Pipeline Steps

The solution is structured into four main phases, implemented using PySpark's Pipeline API for seamless execution.

1. Dataset Ingestion and Schema Inference

The dataset is loaded directly into a Spark DataFrame.

| Setting | Value | Description |
| --- | --- | --- |
| Source | CSV file | UCI Bank Marketing Data (assumed) |
| Delimiter | `;` (semicolon) | Original dataset format. |
| Headers | `True` | The first row is treated as column names. |
| Schema | `inferSchema=True` | PySpark automatically detects column data types. |

Key Columns (Target):

  • y (string: "yes" or "no") → Target Variable

2. Data Preprocessing Pipeline 🧹

A sequential pipeline transforms the raw data into the format required for the ML model.

A. Categorical Feature Encoding

All categorical features (e.g., job, marital, education) are transformed:

  1. String Indexing: Converts string categories into a numeric index ("admin." → 0, "technician" → 1).
  2. One-Hot Encoding: Converts the indices into a sparse binary vector, preventing the model from assuming an ordinal relationship between categories.

B. Label Transformation

The target variable y is converted into a numeric label column:

  • "yes" → 1.0
  • "no" → 0.0

C. Feature Assembly

A VectorAssembler combines all encoded categorical vectors and numeric columns (e.g., age, balance, duration) into a single features vector, which is the required input format for PySpark's MLlib models.

3. Model Training & Evaluation 📈

A. Data Split

The prepared DataFrame is split for training and testing:

  • 80% for Training
  • 20% for Testing
  • seed=42 for reproducible results.

B. Model Selection

A LogisticRegression model is chosen for this binary classification task. The final Pipeline object includes all preprocessing steps and the model.

C. Model Fitting and Prediction

The pipeline is fitted on the training data, and then used to generate predictions on the unseen test data.

D. Performance Metric

Evaluation is performed using the BinaryClassificationEvaluator to calculate the Area Under the Curve (AUC) metric based on the Receiver Operating Characteristic (ROC).

Result: An AUC ≈ 0.91 indicates a strong ability to distinguish between customers who will and will not subscribe.


💾 Final Output

The script generates the following output files in the output/ directory:

| File | Description |
| --- | --- |
| `model_auc.txt` | A text file containing the final calculated AUC score. |
| `roc_curve.png` | A visualization of the ROC curve, showing the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR). |
