Hello

I'm S M Boni Sadar
a Data Scientist
based in Bangladesh.

About Me.

A Data Scientist and former electrical engineer with 7+ years of hands-on experience operating large-scale power plants—where precision, systems thinking, and problem-solving were part of everyday life. After transitioning to run my family's retail business, I unexpectedly fell in love with coding—especially Python—and found my next calling in data science. I recently completed an intensive, project-driven Data Science program at Springboard, where I developed practical skills in Python, SQL, machine learning, and turning raw data into real insights. My engineering mindset, business experience, and passion for lifelong learning now fuel my drive to solve real-world problems with data.

Besides being a Data Scientist, I'm a proud father of two amazing daughters, living in Rajshahi—a riverside city known for its summer fruits and unforgettable sunsets over the Padma River. Whether debugging a neural network or hiking along the riverbank, I’m always driven by curiosity, clarity, and the pursuit of elegant solutions.

When I'm not working with data, you'll find me immersed in books — especially Bengali detective novels. I'm a huge fan of Feluda by the legendary Satyajit Ray and Kakababu by Sunil Gangopadhyay, both of which sparked my early love for mysteries. I also enjoy the works of Saratchandra Chattopadhyay, whose stories offer timeless reflections on life and society. Travel is another passion of mine — though occasional, it's always enriching. I've had the chance to explore stunning destinations like Kenya, Nepal, and India. The Maasai Mara Reserve in Kenya stands out as an unforgettable experience — witnessing lions roaming in prides, leopards lounging in the shade, and zebras thundering across the plains was truly breathtaking. Above all, I value quality time with my family. Whether it’s a quiet evening at home or a shared adventure, those moments mean the most to me.

Skills

  • Languages: Python, SQL
  • Machine Learning Libraries: scikit-learn, TensorFlow
  • Visualization & BI Tools: Matplotlib, Seaborn, Tableau
  • Databases & Tools: MySQL, PostgreSQL
  • Version Control & Collaboration: Git, GitHub

Certificates

  • Springboard Data Science Career Track – Verified Certificate of Completion (ID: 147699764)
  • SQL Certificate – 365 Data Science (Certificate ID: CC-E8A57DEC50) | Covers SQL querying, joins, subqueries & real-world data handling
  • EF SET English Certificate – C2 Proficient | Issued by Education First (EF); CEFR-aligned, native-level proficiency

Selected Works.

Phishing URL Detection

A machine learning model trained to distinguish phishing URLs from legitimate ones.

View GitHub Repo

Project Overview

Phishing URLs are deceptive links crafted by cybercriminals to steal sensitive user information. These malicious URLs are commonly spread via spam emails, fraudulent messages, and compromised websites.

Phishing attacks pose severe risks to individuals and organizations, leading to financial losses and data breaches. One 2023 industry report found phishing attacks up 173% over the previous quarter.

This project aims to build an efficient and scalable machine learning model to detect phishing URLs and mitigate cybersecurity threats.

Methodology

This approach focuses on analyzing the inherent characteristics of phishing URLs, avoiding dependence on external data sources like robots.txt or WHOIS records. Instead, it extracts features directly from the URL, ensuring faster and more robust detection.

Key Methodological Choices
  • URL-Based Feature Engineering
  • Caching with Python’s @lru_cache() for performance (see the sketch after this list)
  • Machine Learning Classifiers trained on structured URL features
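
A minimal sketch of the caching idea, assuming a hypothetical tld_similarity helper and a toy TLD list (the project's actual scoring function likely differs):

from difflib import SequenceMatcher
from functools import lru_cache

LEGIT_TLDS = ("com", "org", "net", "edu", "gov")  # toy subset for illustration

@lru_cache(maxsize=None)
def tld_similarity(tld: str) -> float:
    # Best match against known legitimate TLDs; cached because the
    # same TLDs recur across millions of URLs
    return max(SequenceMatcher(None, tld, legit).ratio() for legit in LEGIT_TLDS)

tld_similarity("com")   # 1.0, exact match, computed once and then cached
tld_similarity("c0m")   # about 0.67, a lookalike of "com"
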
Features Used for Detection
  • URL Length
  • Fully Qualified Domain Name (FQDN) Extraction

    # Extract the Fully Qualified Domain Name (FQDN)
    # (containing subdomain, domain, TLD, and port, if present);
    # assumes the scheme (https://) and leading www. were already stripped
    df['FQDN'] = df['url'].str.split('/').str[0]
  • Top-Level Domain (TLD) Similarity Scores, computed against a pandas DataFrame of legitimate top-level domains extracted from Wikipedia (https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains)
  • Special Characters Distribution
  • Entropy Analysis
  • Obfuscation Detection (Hex IPs, Encoded Characters; see the sketch after this list)
  • Text Similarity (Levenshtein, Jaro-Winkler, etc.)
  • Word Patterns (via NLTK Corpus)
  • Log Transformations for normalization
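
As a sketch of the obfuscation checks, here are two illustrative flags, a hex-encoded IP pattern and percent-encoded characters (the regexes and names are assumptions, not the project's exact rules):

import re

HEX_IP = re.compile(r'0x[0-9a-fA-F]{2}(?:\.0x[0-9a-fA-F]{2}){3}')  # e.g. 0xC0.0xA8.0x00.0x01
PCT_ENCODED = re.compile(r'%[0-9a-fA-F]{2}')                       # e.g. %2F instead of /

def obfuscation_flags(url: str) -> dict:
    # Flag common tricks used to disguise a URL's true destination
    return {
        'hex_ip': bool(HEX_IP.search(url)),
        'pct_encoded_count': len(PCT_ENCODED.findall(url)),
    }

obfuscation_flags('0xC0.0xA8.0x00.0x01/login%2Fverify')
# {'hex_ip': True, 'pct_encoded_count': 1}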

Data Source

Dataset sourced from Kaggle containing labeled phishing and legitimate URLs.

Data Cleaning

Cleaning removed non-ASCII characters, redundant prefixes (such as duplicated https:// or www.), and malformed entries.
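
A minimal sketch of that cleaning, assuming df['url'] holds the raw URL strings (the actual rules may be more involved):

df['url'] = (
    df['url']
    .str.encode('ascii', errors='ignore').str.decode('ascii')  # drop non-ASCII
    .str.replace(r'^(https?://)+', '', regex=True)             # strip scheme(s)
    .str.replace(r'^(www\.)+', '', regex=True)                 # strip leading www.
    .str.strip()
)
df = df[df['url'].str.len() > 0]  # drop rows left empty or malformed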

Feature Engineering Steps

  • FQDN & TLD Extraction
  • Shannon Entropy Calculation — for randomness measurement (sketched after this list)
  • Text Similarity Matching using:
    • Levenshtein Distance
    • Damerau-Levenshtein
    • Jaro-Winkler
  • Obfuscation Patterns (e.g. %2F instead of /, hex IPs)
  • Log Transforms to reduce skewness in features
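
A sketch of the entropy feature (implementation details assumed, not taken from the project):

import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    # Bits per character; higher values suggest random-looking,
    # possibly machine-generated strings
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

shannon_entropy('google.com')          # low: natural-looking domain
shannon_entropy('xj9q2a8zk7qpw.biz')   # higher: random-looking domain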

Machine Learning Models

We tested multiple classifiers using scikit-learn and the XGBoost package:

  • K-Nearest Neighbors (KNN)
  • Support Vector Classifier (SVC)
  • XGBoost Classifier
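
A minimal comparison sketch; X and y stand for the engineered feature matrix and labels, and the hyperparameters are illustrative rather than the project's tuned values:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVC': SVC(),
    'XGBoost': XGBClassifier(),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold accuracy
    print(f'{name}: {scores.mean():.3f}')
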
TF-IDF + Dimensionality Reduction

Text-based feature extraction using TF-IDF and compression via TruncatedSVD.


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# urls: the cleaned URL strings from the dataset
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(urls)  # sparse TF-IDF matrix

# TruncatedSVD works directly on sparse input (unlike PCA),
# compressing the TF-IDF features to 100 dense components
svd = TruncatedSVD(n_components=100)
X_reduced = svd.fit_transform(X_text)

Future Improvements

  • Integrate Word2Vec for smarter domain comparisons
  • Dynamic similarity index from live web data
  • Explore BERT or LSA via NLTK for deeper language modeling
  • Apply TF-GNN for graph-based URL analysis

Defective Jar Lid Detection

Computer vision project detecting defective jar lids using image classification.

View GitHub Repo

Project Overview

In modern manufacturing, automation plays a critical role in enhancing efficiency, consistency, and quality control. One key area where automation can add significant value is in defect detection during the production process. This project focuses on automating the inspection of jar lids by developing a Convolutional Neural Network (CNN) model to classify defective versus non-defective lids.

Manual inspection is time-consuming and error-prone, making it unsuitable for large-scale operations. By leveraging deep learning and computer vision techniques, we aim to streamline the quality control process, reduce inspection time, and increase the accuracy of defect identification.

Our CNN-based model is trained on labeled image data to learn distinguishing features between acceptable and defective lids, enabling real-time and scalable quality assessment on the production line — contributing to the broader vision of smart manufacturing and Industry 4.0.

Data Source

A labeled image dataset sourced from Kaggle was used for the binary classification task. The dataset included images of both intact and damaged jar lids.

Methodology

1. Data Augmentation

To expand and diversify the dataset, the following image transformation techniques were used:

  • Rotations: 90°, +35°, and -35° angles
  • Flipping: Horizontal mirroring of images
  • Lighting Adjustments: Brightness increased and decreased to simulate varied lighting
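
A sketch of these transformations using NumPy/SciPy; parameter values are illustrative, and the project's actual pipeline may differ:

import numpy as np
from scipy.ndimage import rotate

img = np.zeros((128, 128), dtype=np.uint8)  # placeholder grayscale lid image

rot_90 = np.rot90(img)                           # 90° rotation
rot_p35 = rotate(img, angle=35, reshape=False)   # +35° rotation
rot_m35 = rotate(img, angle=-35, reshape=False)  # -35° rotation
flipped = np.fliplr(img)                         # horizontal mirror
brighter = np.clip(img.astype(int) + 40, 0, 255).astype(np.uint8)  # lighter
darker = np.clip(img.astype(int) - 40, 0, 255).astype(np.uint8)    # darker
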
2. Image Manipulation for Feature Enhancement
  • X-ray Style Transformation: Inverting color channels to enhance contrast
  • Edge Detection using Sobel Filter
  • Image Sharpening: To emphasize defect features
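
And a sketch of the enhancement steps with OpenCV; the kernel values and parameters are assumptions, not the project's exact settings:

import cv2
import numpy as np

img = np.zeros((128, 128), dtype=np.uint8)  # placeholder grayscale lid image

inverted = 255 - img  # "X-ray" style: invert intensities to boost contrast

sobel_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
sobel_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
edges = cv2.magnitude(sobel_x, sobel_y)  # Sobel edge magnitude

sharpen_kernel = np.array([[0, -1, 0],
                           [-1, 5, -1],
                           [0, -1, 0]])
sharpened = cv2.filter2D(img, -1, sharpen_kernel)  # emphasize defect edges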

These preprocessing techniques helped create a more robust training dataset by exposing the model to a variety of realistic scenarios.

Model Architecture

The CNN model was built using TensorFlow/Keras for binary classification.

Input
  • Image Size: 128×128
  • Channels: 1 (Grayscale)
  • Input Shape: (128, 128, 1)
Convolutional Layers
  • Conv2D(32, kernel_size=(5, 5), padding='same', kernel_initializer='he_normal') → LeakyReLU(alpha=0.1) → MaxPooling2D(pool_size=(2, 2))
  • Conv2D(64, kernel_size=(5, 5), padding='same') → LeakyReLU → MaxPooling2D
  • Conv2D(128, kernel_size=(5, 5), padding='same') → LeakyReLU → MaxPooling2D
Dense Layers
  • Flatten()
  • Dense(256, activation='relu', kernel_initializer='he_normal')
  • Dropout(0.3)
  • Dense(2) (output logits)
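
Assembled as a Keras Sequential model, the architecture above looks roughly like this (a sketch consistent with the listed layers, not the exact project code):

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),  # grayscale input
    layers.Conv2D(32, (5, 5), padding='same', kernel_initializer='he_normal'),
    layers.LeakyReLU(alpha=0.1),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (5, 5), padding='same'),
    layers.LeakyReLU(alpha=0.1),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (5, 5), padding='same'),
    layers.LeakyReLU(alpha=0.1),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation='relu', kernel_initializer='he_normal'),
    layers.Dropout(0.3),
    layers.Dense(2),  # output logits for intact vs. damaged
])
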
Compilation & Training

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # model outputs logits
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    metrics=['accuracy']
)

model.fit(
    train_data,
    validation_data=val_data,
    batch_size=256,
    epochs=50,
    callbacks=[EarlyStopping(patience=3)]  # stop when val loss stalls for 3 epochs
)

Model Evaluation

Accuracy Trends
  • Training Accuracy: Improved from ~51% to ~93% over 32 epochs
  • Validation Accuracy: Plateaued around 90% by epoch 20
  • Training Loss: Decreased to ~0.05
  • Validation Loss: Plateaued around ~0.35
Validation Set Performance

Class     Correct  Incorrect  Total   Accuracy (%)
Intact    2017     149        2166    93.1%
Damaged   1769     249        2018    87.7%
Overall   —        —          4184    90.4%
Test Set Performance

Class     Correct  Incorrect  Total   Accuracy (%)
Intact    2035     132        2167    93.9%
Damaged   1759     259        2018    87.2%
Overall   —        —          4185    90.5%
ROC Curve & AUC
  • AUC Score: 0.97
  • High true positive rate, low false positive rate
  • Excellent class separation ability
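
As a sketch, the AUC can be computed from the model's logits like this (test_images and test_labels are placeholder names, and class index 1 is assumed to be "damaged"):

import tensorflow as tf
from sklearn.metrics import roc_auc_score

logits = model.predict(test_images)
p_damaged = tf.nn.softmax(logits, axis=1).numpy()[:, 1]  # per-image P(damaged)
print(roc_auc_score(test_labels, p_damaged))             # 0.97 reported above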

Summary

  • Validation & test accuracy ~90%
  • AUC score of 0.97
  • Mild overfitting after epoch 20, but performance remained stable
  • Model is reliable and production-ready for automated inspection

Future Improvements

  • Transfer Learning with VGG16, ResNet50, EfficientNet
  • Deeper CNN Architectures for abstract feature learning
  • Mixed Activation Functions for better non-linearity
  • Inception Modules to capture multiscale features
  • Advanced Augmentation: elastic deformation, occlusion, noise injection

These future directions aim to improve performance and generalization, making the system even more suitable for real-time deployment in modern manufacturing pipelines.

600 +
Hours of Course Materials

Over the course of the 9-month Springboard Data Science bootcamp, I completed 600+ hours of rigorous study — including structured modules from DataCamp, LinkedIn Learning, and original blog content, supplemented by countless hours of deep dives into YouTube tutorials and independent exploration.

15 +
Hands-on Projects

Built 15+ hands-on projects using real-world datasets, applying techniques like Bayesian Optimization, Unsupervised Learning, and probabilistic modeling to solve meaningful data challenges.

9K +
Lines of Code

In total, I've written over 9,000 lines of code, each one sharpening my skills in Python, machine learning, and data storytelling.