Hello

I'm S M Boni Sadar
a Data Scientist
based in Bangladesh.

About Me.

A Data Scientist and former electrical engineer with 7+ years of hands-on experience operating large-scale power plants—where precision, systems thinking, and problem-solving were part of everyday life. After transitioning to run my family's retail business, I unexpectedly fell in love with coding—especially Python—and found my next calling in data science. I recently completed an intensive, project-driven Data Science program at Springboard, where I developed practical skills in Python, SQL, machine learning, and turning raw data into real insights. My engineering mindset, business experience, and passion for lifelong learning now fuel my drive to solve real-world problems with data.

Besides being a Data Scientist, I'm a proud father of two amazing daughters, living in Rajshahi—a riverside city known for its summer fruits and unforgettable sunsets over the Padma River. Whether debugging a neural network or hiking along the riverbank, I’m always driven by curiosity, clarity, and the pursuit of elegant solutions.

When I'm not working with data, you'll find me immersed in books — especially Bengali detective novels. I'm a huge fan of Feluda by the legendary Satyajit Ray and Kakababu by Sunil Gangopadhyay, both of which sparked my early love for mysteries. I also enjoy the works of Saratchandra Chattopadhyay, whose stories offer timeless reflections on life and society. Travel is another passion of mine — though occasional, it's always enriching. I've had the chance to explore stunning destinations like Kenya, Nepal, and India. The Maasai Mara Reserve in Kenya stands out as an unforgettable experience — witnessing lions roaming in prides, leopards lounging in the shade, and zebras thundering across the plains was truly breathtaking. Above all, I value quality time with my family. Whether it’s a quiet evening at home or a shared adventure, those moments mean the most to me.

Skills

  • Languages: Python, SQL
  • Machine Learning Libraries: scikit-learn, TensorFlow
  • Visualization & BI Tools: Matplotlib, Seaborn, Tableau
  • Databases & Tools: MySQL, PostgreSQL
  • Version Control & Collaboration: Git, GitHub

Certificates

  • Springboard Data Science Career Track – Verified Certificate of Completion (ID: 147699764)
  • SQL Certificate – 365 Data Science (Certificate ID: CC-E8A57DEC50) | Covers SQL querying, joins, subqueries & real-world data handling
  • EF SET English Certificate – C2 Proficient | Issued by Education First (EF); CEFR-aligned, native-level proficiency

Selected Works.

Phishing URL Detection

A machine learning model trained to distinguish phishing URLs from legitimate ones.

View GitHub Repo

Project Overview

Phishing URLs are deceptive links crafted by cybercriminals to steal sensitive user information. These malicious URLs are commonly spread via spam emails, fraudulent messages, and compromised websites.

Phishing attacks pose severe risks to individuals and organizations, leading to financial losses and data breaches. One 2023 industry report found phishing attacks up 173% over the previous quarter.

This project aims to build an efficient and scalable machine learning model to detect phishing URLs and mitigate cybersecurity threats.

Methodology

This approach focuses on analyzing the inherent characteristics of phishing URLs, avoiding dependence on external data sources like robots.txt or WHOIS records. Instead, it extracts features directly from the URL, ensuring faster and more robust detection.

Key Methodological Choices
  • URL-Based Feature Engineering
  • Caching with Python’s @lru_cache() for performance (see the sketch after this list)
  • Machine Learning Classifiers trained on structured URL features
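
A minimal sketch of the caching idea, assuming a hypothetical tld_similarity helper and a toy TLD list (the project's actual scoring function likely differs):

from difflib import SequenceMatcher
from functools import lru_cache

LEGIT_TLDS = ("com", "org", "net", "edu", "gov")  # toy subset for illustration

@lru_cache(maxsize=None)
def tld_similarity(tld: str) -> float:
    # Best match against known legitimate TLDs; cached because the
    # same TLDs recur across millions of URLs
    return max(SequenceMatcher(None, tld, legit).ratio() for legit in LEGIT_TLDS)

tld_similarity("com")   # 1.0, exact match, computed once and then cached
tld_similarity("c0m")   # about 0.67, a lookalike of "com"
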
Features Used for Detection
  • URL Length
  • Fully Qualified Domain Name (FQDN) Extraction

    # Extract the Fully Qualified Domain Name (FQDN)
    # (containing subdomain, domain, TLD, and port, if present);
    # assumes the scheme (https://) and leading www. were already stripped
    df['FQDN'] = df['url'].str.split('/').str[0]
  • Top-Level Domain (TLD) Similarity Scores, computed against a pandas DataFrame of legitimate top-level domains extracted from Wikipedia (https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains)
  • Special Characters Distribution
  • Entropy Analysis
  • Obfuscation Detection (Hex IPs, Encoded Characters; see the sketch after this list)
  • Text Similarity (Levenshtein, Jaro-Winkler, etc.)
  • Word Patterns (via NLTK Corpus)
  • Log Transformations for normalization
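
As a sketch of the obfuscation checks, here are two illustrative flags, a hex-encoded IP pattern and percent-encoded characters (the regexes and names are assumptions, not the project's exact rules):

import re

HEX_IP = re.compile(r'0x[0-9a-fA-F]{2}(?:\.0x[0-9a-fA-F]{2}){3}')  # e.g. 0xC0.0xA8.0x00.0x01
PCT_ENCODED = re.compile(r'%[0-9a-fA-F]{2}')                       # e.g. %2F instead of /

def obfuscation_flags(url: str) -> dict:
    # Flag common tricks used to disguise a URL's true destination
    return {
        'hex_ip': bool(HEX_IP.search(url)),
        'pct_encoded_count': len(PCT_ENCODED.findall(url)),
    }

obfuscation_flags('0xC0.0xA8.0x00.0x01/login%2Fverify')
# {'hex_ip': True, 'pct_encoded_count': 1}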

Data Source

Dataset sourced from Kaggle containing labeled phishing and legitimate URLs.

Data Cleaning

Cleaning removed non-ASCII characters, redundant prefixes (such as duplicated https:// or www.), and malformed entries.
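
A minimal sketch of that cleaning, assuming df['url'] holds the raw URL strings (the actual rules may be more involved):

df['url'] = (
    df['url']
    .str.encode('ascii', errors='ignore').str.decode('ascii')  # drop non-ASCII
    .str.replace(r'^(https?://)+', '', regex=True)             # strip scheme(s)
    .str.replace(r'^(www\.)+', '', regex=True)                 # strip leading www.
    .str.strip()
)
df = df[df['url'].str.len() > 0]  # drop rows left empty or malformed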

Feature Engineering Steps

  • FQDN & TLD Extraction
  • Shannon Entropy Calculation — for randomness measurement (sketched after this list)
  • Text Similarity Matching using:
    • Levenshtein Distance
    • Damerau-Levenshtein
    • Jaro-Winkler
  • Obfuscation Patterns (e.g. %2F instead of /, hex IPs)
  • Log Transforms to reduce skewness in features
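
A sketch of the entropy feature (implementation details assumed, not taken from the project):

import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    # Bits per character; higher values suggest random-looking,
    # possibly machine-generated strings
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

shannon_entropy('google.com')          # low: natural-looking domain
shannon_entropy('xj9q2a8zk7qpw.biz')   # higher: random-looking domain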

Machine Learning Models

We tested multiple classifiers using scikit-learn and the XGBoost package:

  • K-Nearest Neighbors (KNN)
  • Support Vector Classifier (SVC)
  • XGBoost Classifier
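
A minimal comparison sketch; X and y stand for the engineered feature matrix and labels, and the hyperparameters are illustrative rather than the project's tuned values:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVC': SVC(),
    'XGBoost': XGBClassifier(),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold accuracy
    print(f'{name}: {scores.mean():.3f}')
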
TF-IDF + Dimensionality Reduction

Text-based feature extraction using TF-IDF and compression via TruncatedSVD.


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# urls: the cleaned URL strings from the dataset
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(urls)  # sparse TF-IDF matrix

# TruncatedSVD works directly on sparse input (unlike PCA),
# compressing the TF-IDF features to 100 dense components
svd = TruncatedSVD(n_components=100)
X_reduced = svd.fit_transform(X_text)

Future Improvements

  • Integrate Word2Vec for smarter domain comparisons
  • Dynamic similarity index from live web data
  • Explore BERT or LSA via NLTK for deeper language modeling
  • Apply TF-GNN for graph-based URL analysis

Defective Jar Lid Detection

Computer vision project detecting defective jar lids using image classification.

View GitHub Repo

Project Overview

In modern manufacturing, automation plays a critical role in enhancing efficiency, consistency, and quality control. One key area where automation can add significant value is in defect detection during the production process. This project focuses on automating the inspection of jar lids by developing a Convolutional Neural Network (CNN) model to classify defective versus non-defective lids.

Manual inspection is time-consuming and error-prone, making it unsuitable for large-scale operations. By leveraging deep learning and computer vision techniques, we aim to streamline the quality control process, reduce inspection time, and increase the accuracy of defect identification.

Our CNN-based model is trained on labeled image data to learn distinguishing features between acceptable and defective lids, enabling real-time and scalable quality assessment on the production line — contributing to the broader vision of smart manufacturing and Industry 4.0.

Data Source

A labeled image dataset sourced from Kaggle was used for the binary classification task. The dataset included images of both intact and damaged jar lids.

Methodology

1. Data Augmentation

To expand and diversify the dataset, the following image transformation techniques were used:

  • Rotations: 90°, +35°, and -35° angles
  • Flipping: Horizontal mirroring of images
  • Lighting Adjustments: Brightness increased and decreased to simulate varied lighting
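
A sketch of these transformations using NumPy/SciPy; parameter values are illustrative, and the project's actual pipeline may differ:

import numpy as np
from scipy.ndimage import rotate

img = np.zeros((128, 128), dtype=np.uint8)  # placeholder grayscale lid image

rot_90 = np.rot90(img)                           # 90° rotation
rot_p35 = rotate(img, angle=35, reshape=False)   # +35° rotation
rot_m35 = rotate(img, angle=-35, reshape=False)  # -35° rotation
flipped = np.fliplr(img)                         # horizontal mirror
brighter = np.clip(img.astype(int) + 40, 0, 255).astype(np.uint8)  # lighter
darker = np.clip(img.astype(int) - 40, 0, 255).astype(np.uint8)    # darker
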
2. Image Manipulation for Feature Enhancement
  • X-ray Style Transformation: Inverting color channels to enhance contrast
  • Edge Detection using Sobel Filter
  • Image Sharpening: To emphasize defect features
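
And a sketch of the enhancement steps with OpenCV; the kernel values and parameters are assumptions, not the project's exact settings:

import cv2
import numpy as np

img = np.zeros((128, 128), dtype=np.uint8)  # placeholder grayscale lid image

inverted = 255 - img  # "X-ray" style: invert intensities to boost contrast

sobel_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
sobel_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
edges = cv2.magnitude(sobel_x, sobel_y)  # Sobel edge magnitude

sharpen_kernel = np.array([[0, -1, 0],
                           [-1, 5, -1],
                           [0, -1, 0]])
sharpened = cv2.filter2D(img, -1, sharpen_kernel)  # emphasize defect edges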

These preprocessing techniques helped create a more robust training dataset by exposing the model to a variety of realistic scenarios.

Model Architecture

The CNN model was built using TensorFlow/Keras for binary classification.

Input
  • Image Size: 128×128
  • Channels: 1 (Grayscale)
  • Input Shape: (128, 128, 1)
Convolutional Layers
  • Conv2D(32, kernel_size=(5, 5), padding='same', kernel_initializer='he_normal') → LeakyReLU(alpha=0.1) → MaxPooling2D(pool_size=(2, 2))
  • Conv2D(64, kernel_size=(5, 5), padding='same') → LeakyReLU → MaxPooling2D
  • Conv2D(128, kernel_size=(5, 5), padding='same') → LeakyReLU → MaxPooling2D
Dense Layers
  • Flatten()
  • Dense(256, activation='relu', kernel_initializer='he_normal')
  • Dropout(0.3)
  • Dense(2) (output logits)
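
Assembled as a Keras Sequential model, the architecture above looks roughly like this (a sketch consistent with the listed layers, not the exact project code):

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),  # grayscale input
    layers.Conv2D(32, (5, 5), padding='same', kernel_initializer='he_normal'),
    layers.LeakyReLU(alpha=0.1),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (5, 5), padding='same'),
    layers.LeakyReLU(alpha=0.1),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (5, 5), padding='same'),
    layers.LeakyReLU(alpha=0.1),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation='relu', kernel_initializer='he_normal'),
    layers.Dropout(0.3),
    layers.Dense(2),  # output logits for intact vs. damaged
])
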
Compilation & Training

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # model outputs logits
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    metrics=['accuracy']
)

model.fit(
    train_data,
    validation_data=val_data,
    batch_size=256,
    epochs=50,
    callbacks=[EarlyStopping(patience=3)]  # stop when val loss stalls for 3 epochs
)

Model Evaluation

Accuracy Trends
  • Training Accuracy: Improved from ~51% to ~93% over 32 epochs
  • Validation Accuracy: Plateaued around 90% by epoch 20
  • Training Loss: Decreased to ~0.05
  • Validation Loss: Plateaued around ~0.35
Validation Set Performance

Class     Correct  Incorrect  Total   Accuracy (%)
Intact    2017     149        2166    93.1%
Damaged   1769     249        2018    87.7%
Overall   —        —          4184    90.4%
Test Set Performance

Class     Correct  Incorrect  Total   Accuracy (%)
Intact    2035     132        2167    93.9%
Damaged   1759     259        2018    87.2%
Overall   —        —          4185    90.5%
ROC Curve & AUC
  • AUC Score: 0.97
  • High true positive rate, low false positive rate
  • Excellent class separation ability
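
As a sketch, the AUC can be computed from the model's logits like this (test_images and test_labels are placeholder names, and class index 1 is assumed to be "damaged"):

import tensorflow as tf
from sklearn.metrics import roc_auc_score

logits = model.predict(test_images)
p_damaged = tf.nn.softmax(logits, axis=1).numpy()[:, 1]  # per-image P(damaged)
print(roc_auc_score(test_labels, p_damaged))             # 0.97 reported above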

Summary

  • Validation & test accuracy ~90%
  • AUC score of 0.97
  • Mild overfitting after epoch 20, but performance remained stable
  • Model is reliable and production-ready for automated inspection

Future Improvements

  • Transfer Learning with VGG16, ResNet50, EfficientNet
  • Deeper CNN Architectures for abstract feature learning
  • Mixed Activation Functions for better non-linearity
  • Inception Modules to capture multiscale features
  • Advanced Augmentation: elastic deformation, occlusion, noise injection

These future directions aim to improve performance and generalization, making the system even more suitable for real-time deployment in modern manufacturing pipelines.

600 +
Hours of Course Materials

Over the course of the 9-month Springboard Data Science bootcamp, I completed 600+ hours of rigorous study — including structured modules from DataCamp, LinkedIn Learning, and original blog content, supplemented by countless hours of deep dives into YouTube tutorials and independent exploration.

15 +
Hands-on Projects

Built 15+ hands-on projects using real-world datasets, applying techniques like Bayesian Optimization, Unsupervised Learning, and probabilistic modeling to solve meaningful data challenges.

9K +
Lines of Code

In total, I've written over 9,000 lines of code, each one sharpening my skills in Python, machine learning, and data storytelling.