An end-to-end Machine Learning pipeline built with PySpark's MLlib to predict whether a bank customer will subscribe to a term deposit.
This script performs a binary classification task using historical bank marketing data. It constructs a full machine learning pipeline, from data ingestion and preprocessing to model training and evaluation, all within the distributed computing framework of PySpark.
Goal: Predict the binary target variable y ("yes" or "no") which indicates if the client has subscribed to a term deposit.
The solution is structured into four main phases, implemented using PySpark's Pipeline API for seamless execution.
The dataset is loaded directly into a Spark DataFrame.
| Setting | Value | Description |
|---|---|---|
| Source | CSV file | UCI Bank Marketing Data (Assumed) |
| Delimiter | `;` (semicolon) | Original dataset format. |
| Headers | `True` | The first row is treated as column names. |
| Schema | `inferSchema=True` | PySpark automatically detects column data types. |
Key Columns (Target):
- `y` (string: "yes" or "no") → Target Variable
A sequential pipeline transforms the raw data into the format required for the ML model.
All categorical features (e.g., job, marital, education) are transformed:
- String Indexing: Converts string categories into a numeric index (`"admin."` → 0, `"technician"` → 1).
- One-Hot Encoding: Converts the indices into a sparse binary vector, preventing the model from assuming an ordinal relationship between categories.
The target variable y is converted into a numeric label column:
- `"yes"` → $1.0$
- `"no"` → $0.0$
A VectorAssembler combines all encoded categorical vectors and numeric columns (e.g., age, balance, duration) into a single features vector, which is the required input format for PySpark's MLlib models.
The prepared DataFrame is split for training and testing:
- $80\%$ for Training
- $20\%$ for Testing
- `seed=42` for reproducible results.
A LogisticRegression model is chosen for this binary classification task. The final Pipeline object includes all preprocessing steps and the model.
The pipeline is fitted on the training data, and then used to generate predictions on the unseen test data.
Evaluation uses the BinaryClassificationEvaluator to calculate the area under the Receiver Operating Characteristic (ROC) curve, abbreviated AUC.
Result: An AUC $\approx 0.91$ indicates a strong ability to distinguish between customers who will and will not subscribe.
The script generates the following output files in the output/ directory:
| File | Description |
|---|---|
| `model_auc.txt` | A text file containing the final calculated AUC score. |
| `roc_curve.png` | A visualization of the ROC curve, showing the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR). |
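
A sketch of how these two files might be written, assuming the AUC value and the ROC points (lists of FPR and TPR values) are already available as plain Python objects. The helper name `save_outputs`, the four-decimal text format, and the plot styling are assumptions for illustration; the script's actual formatting is not specified.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend: render to a file without a display
import matplotlib.pyplot as plt


def save_outputs(auc, fpr, tpr, out_dir="output"):
    """Write model_auc.txt and roc_curve.png into out_dir (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Assumed format: the AUC rounded to four decimals on a single line.
    (out / "model_auc.txt").write_text(f"{auc:.4f}\n")

    fig, ax = plt.subplots()
    ax.plot(fpr, tpr, label=f"ROC (AUC = {auc:.3f})")
    ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")  # diagonal reference line
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.legend(loc="lower right")
    fig.savefig(out / "roc_curve.png")
    plt.close(fig)
    return out
```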