
πŸ“Š Data Analytics Internship

A collection of five end-to-end data science projects covering exploratory data analysis, classification, and regression β€” implemented in Python using real-world datasets.


πŸ“ Projects Overview

| # | Project | Type | Dataset | Key Algorithm |
|---|---------|------|---------|---------------|
| 1 | Iris Dataset Visualization | EDA & Classification | Iris Dataset | K-Nearest Neighbors |
| 2 | Credit Risk Prediction | Binary Classification | Loan Prediction Dataset | Logistic Regression & Decision Tree |
| 3 | Customer Churn Prediction | Binary Classification | Churn Modelling Dataset | Random Forest |
| 4 | Insurance Claim Amount Prediction | Regression | Medical Cost Personal Dataset | Linear Regression & Random Forest |
| 5 | Personal Loan Acceptance Prediction | Binary Classification | UCI Bank Marketing Dataset | Logistic Regression & Decision Tree |

πŸ”§ Tech Stack

  • Language: Python 3.10+
  • Data Manipulation: pandas, numpy
  • Visualization: matplotlib, seaborn
  • Machine Learning: scikit-learn
  • Environment: Jupyter Notebook

πŸ“‚ Repository Structure

πŸ“¦ Data-Analytics-Internship/
β”œβ”€β”€ πŸ““ task1_iris_visualization.ipynb
β”œβ”€β”€ πŸ““ task2_credit_risk_prediction.ipynb
β”œβ”€β”€ πŸ““ task3_customer_churn_prediction.ipynb
β”œβ”€β”€ πŸ““ task4_insurance_claim_prediction.ipynb
β”œβ”€β”€ πŸ““ task5_loan_acceptance_prediction.ipynb
β”œβ”€β”€ πŸ“„ task2_dataset.csv
β”œβ”€β”€ πŸ“„ task3_dataset.csv
β”œβ”€β”€ πŸ“„ task4_dataset.csv
β”œβ”€β”€ πŸ“„ task5_dataset.csv
└── πŸ“„ README.md

πŸ—‚οΈ Project Details

1. Iris Dataset Visualization

Notebook: task1_iris_visualization.ipynb

Objective: Explore and visualize the classic Iris dataset to understand feature distributions and inter-species relationships.

Highlights:

  • Loaded and inspected dataset using .shape, .columns, and .head()
  • Built scatter plots, histograms, and box plots to analyze feature distributions
  • Generated a full pair plot and correlation heatmap
  • Trained a KNN classifier as a baseline β€” confirmed strong species separability

Key Insight: Petal length and petal width are far more discriminative than sepal features. Setosa is linearly separable from the other two species.
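As a flavor of the baseline step, a minimal KNN fit on the copy of the Iris data that ships with scikit-learn (the notebook may load it differently) could look like:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# scikit-learn bundles the Iris dataset, so no CSV is needed here
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Default 5-neighbor KNN as a quick separability check
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))
print(f"Test accuracy: {acc:.3f}")
```

The strong species separability shows up directly in the held-out accuracy.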

Skills Demonstrated: Data loading & inspection Β· Visualization with matplotlib & seaborn Β· Basic classification


2. Credit Risk Prediction

Notebook: task2_credit_risk_prediction.ipynb
Dataset: task2_dataset.csv β€” 367 loan applicants (Kaggle Loan Prediction Dataset)

Objective: Predict whether a loan applicant has good or bad credit history as a proxy for default risk.

Highlights:

  • Handled missing values in 6 columns using mode/median imputation
  • Encoded categorical features (Gender, Education, Property_Area, etc.) with Label Encoding
  • Visualized loan amount, income, and credit distributions with histograms and box plots
  • Compared Logistic Regression and Decision Tree classifiers
  • Evaluated using accuracy, confusion matrix, ROC-AUC, and classification report

Key Insight: Credit_History, Property_Area, and LoanAmount are the strongest predictors of credit risk. Semiurban applicants and graduates show higher good-credit rates.
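The imputation-and-encoding steps above can be sketched on a toy frame; the column names (Gender, Property_Area, LoanAmount, Credit_History) follow the Kaggle dataset, but the values here are invented:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy applicants frame; column names follow the Kaggle loan data, values invented
df = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "Property_Area": ["Urban", "Semiurban", "Rural", "Semiurban"],
    "LoanAmount": [120.0, np.nan, 66.0, 150.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# Mode imputation for categorical-style columns, median for numeric amounts
for col in ("Gender", "Credit_History"):
    df[col] = df[col].fillna(df[col].mode()[0])
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())

# Label-encode the remaining string columns so the models can consume them
for col in ("Gender", "Property_Area"):
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

After this, the frame is fully numeric and NaN-free, ready for Logistic Regression or a Decision Tree.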

Skills Demonstrated: Missing value imputation Β· Categorical encoding Β· Binary classification Β· Model evaluation


3. Customer Churn Prediction

Notebook: task3_customer_churn_prediction.ipynb
Dataset: task3_dataset.csv β€” 10,000 bank customers (Churn Modelling Dataset)

Objective: Identify bank customers likely to leave (churn) based on their profile and account behaviour.

Highlights:

  • Dropped non-predictive identifier columns (RowNumber, CustomerId, Surname)
  • Applied Label Encoding for Gender and One-Hot Encoding for Geography
  • Analyzed churn by age, balance, number of products, active membership, and geography
  • Trained a Random Forest classifier with feature importance analysis
  • Validated with 5-fold cross-validation and ROC curve

Key Insight: Age, balance, and number of products are the top churn drivers. Customers holding 3+ products and inactive members both churn at significantly higher rates, and customers in Germany show the highest churn rate by geography.
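A minimal sketch of the Random Forest plus 5-fold cross-validation step, using synthetic stand-in features rather than the real churn CSV:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded churn features (the notebook uses the CSV)
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated ROC-AUC, as in the notebook
scores = cross_val_score(rf, X, y, cv=5, scoring="roc_auc")
print("Mean CV ROC-AUC:", round(scores.mean(), 3))

# Fit on the full data to read off feature importances
rf.fit(X, y)
print("Importances:", rf.feature_importances_.round(3))
```

`feature_importances_` sums to 1 across features, which is what makes the churn-driver ranking in the notebook possible.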

Skills Demonstrated: Categorical encoding (Label + One-Hot) Β· Feature importance analysis Β· Random Forest Β· Cross-validation


4. Insurance Claim Amount Prediction

Notebook: task4_insurance_claim_prediction.ipynb
Dataset: task4_dataset.csv β€” 1,338 policyholders (Medical Cost Personal Dataset)

Objective: Estimate individual medical insurance charges based on personal and lifestyle attributes.

Highlights:

  • Explored charge distributions (raw and log-transformed)
  • Visualized the impact of BMI, age, and smoking status on charges with scatter plots and box plots
  • Trained and compared Linear Regression and Random Forest Regressor
  • Evaluated models using MAE, RMSE, and RΒ²
  • Plotted residuals to assess model fit

Key Insight: Smoking is the single most powerful predictor β€” smokers are charged 3Γ— more on average. BMI and smoking interact strongly: high-BMI smokers face the highest charges. Linear Regression shows clear heteroscedasticity, while Random Forest handles non-linearities better.

| Metric | Linear Regression | Random Forest |
|--------|-------------------|---------------|
| MAE | ~$4,200 | ~$2,600 |
| RMSE | ~$6,000 | ~$4,500 |
| R² | ~0.78 | ~0.87 |

Skills Demonstrated: Regression modeling Β· MAE & RMSE evaluation Β· Residual analysis Β· Feature importance


5. Personal Loan Acceptance Prediction

Notebook: task5_loan_acceptance_prediction.ipynb
Dataset: task5_dataset.csv β€” 41,188 customers (UCI Bank Marketing Dataset)

Objective: Predict which bank customers are likely to accept a personal loan / term deposit offer from a marketing campaign.

Highlights:

  • Loaded semicolon-delimited UCI dataset with 21 features including economic indicators
  • Replaced 'unknown' string values with NaN and imputed with mode
  • Encoded all categorical variables with Label Encoding
  • Performed EDA on age, call duration, job type, marital status, and campaign contacts
  • Trained Logistic Regression and Decision Tree models
  • Evaluated with accuracy, ROC-AUC, confusion matrix, and 5-fold cross-validation

Key Insight: Call duration and economic indicators (euribor3m, nr.employed) are the strongest predictors. Previous campaign successes are highly predictive of future acceptance. The dataset has an ~11.3% acceptance rate, so ROC-AUC is a far more informative evaluation metric than raw accuracy.
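The semicolon-delimited loading and 'unknown'-to-NaN cleanup can be sketched on an inline two-row sample (the column subset is assumed from the UCI layout):

```python
import io
import numpy as np
import pandas as pd

# Two-row inline sample in the UCI bank-marketing layout (column subset assumed)
raw = '"age";"job";"marital";"y"\n56;"housemaid";"married";"no"\n41;"unknown";"married";"no"\n'
df = pd.read_csv(io.StringIO(raw), sep=";")

# Treat the literal string 'unknown' as missing, then impute with the column mode
df = df.replace("unknown", np.nan)
for col in df.columns[df.isna().any()]:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df["job"].tolist())
```

The key detail is `sep=";"`: the UCI file is semicolon-delimited, so the default comma separator would load everything into a single column.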

Skills Demonstrated: Large dataset handling Β· Feature encoding Β· Imbalanced classification Β· Cross-validation Β· ROC analysis


πŸš€ Getting Started

Prerequisites

pip install pandas numpy matplotlib seaborn scikit-learn jupyter

Running the Notebooks

  1. Clone this repository:

    git clone https://github.com/MUsman054/Data-Analytics-Internship.git
    cd Data-Analytics-Internship
  2. Launch Jupyter Notebook:

    jupyter notebook
  3. Open any .ipynb file and run all cells (Cell β†’ Run All).

Note: Each notebook loads its dataset from a CSV file in the same directory. Make sure the .csv files are present alongside the notebooks before running.


πŸ“ˆ Skills Demonstrated Across All Projects

| Skill | Projects |
|-------|----------|
| Data loading & inspection (pandas) | All |
| Handling missing values | 2, 3, 5 |
| Label Encoding & One-Hot Encoding | 2, 3, 4, 5 |
| Exploratory Data Analysis (EDA) | All |
| Data visualization (matplotlib, seaborn) | All |
| Binary classification | 2, 3, 5 |
| Regression modeling | 4 |
| Model evaluation (Accuracy, Confusion Matrix) | 2, 3, 5 |
| MAE & RMSE evaluation | 4 |
| ROC-AUC & ROC curves | 2, 3, 5 |
| Feature importance analysis | 2, 3, 4, 5 |
| Cross-validation | 3, 5 |

This portfolio was built as part of a hands-on data science learning journey, demonstrating practical skills in EDA, machine learning, and model evaluation using real-world datasets.
