Sayan Banerjee

ABOUT ME

I am a data analyst and data science enthusiast with a strong foundation in statistical analysis, machine learning, and deep learning. Proficient in Excel, SQL, and Power BI, I specialize in transforming data into actionable insights and building data-driven solutions. Currently pursuing a Master's in Statistics, I combine analytical rigor with practical experience to solve real-world problems. My work focuses on uncovering patterns, automating workflows, and visualizing data to support strategic decision-making across domains.

EXPERIENCE

The Sparks Foundation

Data Science Intern

February 2024 - April 2024

Chegg Inc.

Subject Matter Expert

October 2022 - December 2024

PROJECTS

Swiggy Delivery Time Prediction

Built a regression model to predict delivery times using an ensemble of XGBoost and LightGBM, achieving an R² score of 0.88. This end-to-end project included:

Data Preprocessing: Tested multiple imputation strategies for missing values and ultimately dropped incomplete entries, which gave the best baseline results. Applied standardization and encoding techniques.

Exploratory Data Analysis: Identified key delivery drivers, outliers, and skewed variables using distribution plots, correlation heatmaps, and pairwise analysis.

Feature Selection: Compared Forward Selection, RFE, and VIF; retained all features because each contributed positively to model performance.

Modeling & Optimization: Evaluated XGBoost, LightGBM, Random Forest, SVR, and Linear Regression. The final model is a Voting Regressor with Bayesian hyperparameter tuning via Optuna.

🔥 Result: The ensemble outperformed the individual learners, reaching an R² of 0.88.

Multiple Disease Prediction System

Developed a full-stack, web-based diagnostic tool that predicts five critical diseases from clinical data and medical imaging. It uses machine learning for tabular predictions and deep learning (CNNs) for image classification. Built with Scikit-learn, TensorFlow/Keras, and Streamlit.

🧠 Capabilities:

Heart Disease, Diabetes, Kidney Disease: Real-time predictions using Random Forest, SVM, and Decision Trees.

Pneumonia & Diabetic Retinopathy: Image-based detection via CNNs trained on Kaggle datasets.

User Interface: Tab-based navigation, responsive UI, and instant feedback with model confidence scores.

🔬 Key Features:

Tabular input for clinical diseases; image upload for vision-based diagnostics

Preprocessing pipelines for both structured and unstructured data

Streamlined model inference and deployment via Streamlit

Modular code design that can be extended to additional diseases

Customer Churn Prediction

Analyzed customer churn behavior by combining Survival Analysis and ML-based classification techniques. Used Kaplan-Meier curves and a Stratified Cox Proportional Hazards model to uncover time-to-churn patterns and covariate effects over time.

🧪 Key Steps:

Performed deep EDA to identify churn drivers across segments

Modeled survival probabilities and risk using KM curves and Cox models

Trained multiple classifiers (e.g., XGBoost) to predict churn labels

🎯 Results: Achieved 87% accuracy and 82% F1 score with the XGBoost classifier
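As an illustration of the Kaplan-Meier step, here is a minimal sketch of the estimator in plain Python. It is not the project's code: the function name `kaplan_meier` and the toy tenure data are hypothetical. It assumes each customer has a tenure and a flag that is true for churned and false for censored.

```python
from collections import Counter


def kaplan_meier(durations, churned):
    """Return [(time, survival_probability)] at each distinct churn time.

    durations: tenure of each customer (any sortable numeric unit).
    churned:   True if the customer churned at that time, False if censored.
    """
    events = Counter(t for t, e in zip(durations, churned) if e)
    exits = Counter(durations)  # churned or censored, both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit step: multiply by the share of at-risk customers
            # who did NOT churn at time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data: five customers, two of them censored.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

Censored customers lower the at-risk count without lowering the survival estimate, which is what distinguishes this from a plain churn rate.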

Optimizing Revenue Leakage in Hospitality Sector

Developed a data-driven solution to identify revenue leakage and improve profitability in the hospitality sector. Used advanced SQL to analyze data from multiple sources, focusing on KPIs such as RevPAR, ADR, occupancy rate, and cancellation rate.

Built interactive Power BI dashboards to uncover booking and channel trends

Proposed data-backed pricing and operational strategies

Demonstrated clear potential for revenue uplift and cost control
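For readers unfamiliar with these KPIs, the sketch below shows their standard definitions in Python. It is illustrative only: the project itself used SQL and Power BI, and the `PeriodStats` fields, the function name, and the sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PeriodStats:
    room_revenue: float      # total room revenue for the period
    rooms_sold: int          # room-nights sold
    rooms_available: int     # room-nights available
    bookings_total: int      # all bookings made
    bookings_cancelled: int  # bookings later cancelled


def _ratio(numerator, denominator):
    # Guard against empty periods instead of raising ZeroDivisionError.
    return numerator / denominator if denominator else 0.0


def hospitality_kpis(s: PeriodStats) -> dict:
    return {
        "adr": _ratio(s.room_revenue, s.rooms_sold),          # Average Daily Rate
        "occupancy": _ratio(s.rooms_sold, s.rooms_available),
        "revpar": _ratio(s.room_revenue, s.rooms_available),  # Revenue per Available Room
        "cancellation_rate": _ratio(s.bookings_cancelled, s.bookings_total),
    }


# Hypothetical sample figures, not taken from the project data.
sample = hospitality_kpis(PeriodStats(12000.0, 80, 100, 120, 30))
```

RevPAR equals ADR multiplied by occupancy, so a drop in RevPAR can be traced to either pricing or unsold rooms.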

Sentiment Based Product Recommendations

Built an end-to-end recommendation system that combines sentiment analysis and user behavior to deliver personalized product suggestions.

Achieved 94% sentiment classification accuracy using a Stacking Classifier (Random Forest, XGBoost, LightGBM)

Applied robust NLP preprocessing: POS-aware lemmatization, TF-IDF vectorization

Developed a hybrid recommendation engine using sentiment scores and user-product interactions

Deployed via Streamlit with two core functionalities:

🔹 Review Sentiment Prediction

🔹 Top-5 Product Recommendations for a given user

Stock Price Forecasting Using Time Series Models

EDUCATION

Pursuing a Master's in Statistics and Computing at Banaras Hindu University (BHU)

August 2024 — Present

Graduated from Sister Nivedita University with a Bachelor of Science in Statistics (Hons)

August 2021 — June 2024

SKILLS