Data Analyst · MSc Data Science
I turn messy, fragmented data into pipelines, dashboards, and ML systems that actually hold up. Previously at IIT Bombay's Development & Relations Foundation (DRF). Looking for analyst roles where the problems are hard and the data is real.
Final-semester MSc Data Science student at the University of Mumbai, with a BA in Mathematics from the University of Delhi. The maths background isn't decoration — it means I understand why models behave the way they do, not just how to call them.
Most substantial work to date: IIT Bombay's DRF, where I built an entity resolution pipeline across Salesforce, Raiser's Edge, and Almabase. I benchmarked embedding approaches, found that field-level fuzzy matching outperformed semantic methods for exact-identity verification, and shipped a pipeline that was used in production.
Outside of data, I'm a wildlife and bird photographer — jungle safaris, waiting for the right shot. That patience makes me a better analyst. You notice what's actually there when you stop trying to force an answer.
Reconciled 60,000+ alumni records across Salesforce, Raiser's Edge, and Almabase via field-level RapidFuzz token-set matching. Benchmarked all-MiniLM-L6-v2 and TF-IDF — both underperformed on exact-identity matching. Per-field confidence scores made manual review structured rather than guesswork.
STEM research discovery platform on arXiv and Semantic Scholar. RAG for paper summarisation, PAIS influence scoring via LightGBM, BERTopic for daily topic clusters. Open-source embeddings + Groq inference, Streamlit frontend with ranked recommendations.
Personal project tracking system built to fix a real LLM limitation: no memory between sessions. LLMs forget context and have no sense of time, so deadlines slip undetected. This fixes both.
MCP server with six read/write tools so Claude Desktop can query live project state, current time, and log history mid-conversation. Single SQLite database shared between Streamlit app and MCP server.
// yes, .lol is intentional.
delhi/ncr · data analyst
Everything I've built, shipped, or experimented with — pipelines, ML systems, LLM tooling, and whatever comes next. More of a field notebook than a portfolio.
Reconciled 60,000+ alumni records across Salesforce, Raiser's Edge, and Almabase using field-level RapidFuzz token-set matching. Benchmarked all-MiniLM-L6-v2 and TF-IDF; both underperformed on exact-identity matching. Per-field confidence scores turned weeks of manual reconciliation into a structured, reviewable output. Built during my internship at IIT Bombay's DRF.
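For a sense of what field-level token-set matching with per-field confidence looks like, here is a minimal sketch. It is not the production code: it uses Python's standard-library difflib as a stand-in for RapidFuzz so it runs without dependencies, and the field names, sample records, and 90-point review threshold are all made up for illustration.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> float:
    """Character-level similarity between two strings, scaled to 0-100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_set_score(a: str, b: str) -> float:
    """Order-insensitive similarity based on shared and unshared tokens.

    Stand-in for a token-set ratio: word order and repeated words are ignored,
    and a value whose tokens are a subset of the other's scores 100.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0  # a missing value is treated as no evidence of a match

    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()

    return max(
        _ratio(common, combined_a),
        _ratio(common, combined_b),
        _ratio(combined_a, combined_b),
    )


def compare_records(rec_a: dict, rec_b: dict, fields: list) -> dict:
    """Score each field separately so a reviewer can see where two records differ."""
    return {
        field: token_set_score(rec_a.get(field, ""), rec_b.get(field, ""))
        for field in fields
    }


def fields_to_review(scores: dict, threshold: float = 90.0) -> list:
    """Fields whose confidence falls below the (hypothetical) review threshold."""
    return [field for field, score in scores.items() if score < threshold]


# Invented sample records; real CRM exports would have different columns.
FIELDS = ["name", "email", "department"]
crm_a = {
    "name": "Priya K Sharma",
    "email": "priya.sharma@example.com",
    "department": "Electrical Engineering",
}
crm_b = {
    "name": "Sharma Priya",
    "email": "priya.sharma@example.com",
    "department": "Mechanical Engineering",
}

scores = compare_records(crm_a, crm_b, FIELDS)
print(scores)
print("Needs review:", fields_to_review(scores))
```

Name and email score 100.0 here because one side's tokens are contained in the other's, while department scores lower and gets flagged. Keeping the scores per field, rather than collapsing them into one number, is what lets a reviewer see exactly which field is in doubt.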
STEM research discovery platform indexing arXiv and Semantic Scholar. RAG pipeline for paper summarisation, a PAIS influence scoring model (LightGBM trained on h-index, citation count, venue score), and BERTopic for daily topic cluster detection. Open-source embeddings with Groq-backed LLM inference. MSc thesis project.
Personal project tracking system built to fix a real LLM limitation: no memory between sessions. LLMs forget context and have no sense of time, so deadlines slip undetected. An MCP server with six read/write tools lets Claude Desktop query live project state, current time, and log history mid-conversation. A single SQLite backend is shared between the Streamlit app and the MCP server.
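The shared-SQLite idea can be sketched roughly as below. This is an illustration under assumptions, not the actual server: the table layout, function names, and sample data are hypothetical, and the functions are plain Python that an MCP server could register as tools while a Streamlit app calls the same ones against the same database file.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one table for project state, one for log history.
SCHEMA = """
CREATE TABLE IF NOT EXISTS projects (
    name     TEXT PRIMARY KEY,
    status   TEXT NOT NULL,
    deadline TEXT            -- ISO date, e.g. 2025-01-31
);
CREATE TABLE IF NOT EXISTS logs (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    project   TEXT NOT NULL,
    note      TEXT NOT NULL,
    logged_at TEXT NOT NULL
);
"""


def connect(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database that both the app and the tool server would share."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)
    return conn


def get_current_time() -> str:
    """Read tool: give the model the current UTC time as an ISO string."""
    return datetime.now(timezone.utc).isoformat()


def upsert_project(conn, name: str, status: str, deadline=None) -> None:
    """Write tool: create a project or overwrite its status and deadline."""
    conn.execute(
        "INSERT OR REPLACE INTO projects (name, status, deadline) VALUES (?, ?, ?)",
        (name, status, deadline),
    )
    conn.commit()


def add_log(conn, project: str, note: str) -> None:
    """Write tool: append a timestamped note to a project's history."""
    conn.execute(
        "INSERT INTO logs (project, note, logged_at) VALUES (?, ?, ?)",
        (project, note, get_current_time()),
    )
    conn.commit()


def get_project_state(conn, name: str, limit: int = 5):
    """Read tool: current status plus the most recent log entries, or None."""
    row = conn.execute(
        "SELECT name, status, deadline FROM projects WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return None
    logs = conn.execute(
        "SELECT note, logged_at FROM logs WHERE project = ? ORDER BY id DESC LIMIT ?",
        (name, limit),
    ).fetchall()
    return {**dict(row), "logs": [dict(entry) for entry in logs]}


def overdue_projects(conn, today: str) -> list:
    """Read tool: names of projects whose ISO-date deadline is before `today`."""
    rows = conn.execute(
        "SELECT name FROM projects "
        "WHERE deadline IS NOT NULL AND deadline < ? ORDER BY name",
        (today,),
    ).fetchall()
    return [row["name"] for row in rows]


# Invented example data, using an in-memory database.
conn = connect()
upsert_project(conn, "thesis", "in progress", "2000-01-01")
add_log(conn, "thesis", "drafted chapter 2")
print(get_project_state(conn, "thesis"))
print(overdue_projects(conn, "2000-01-02"))
```

Because state lives in the database rather than in the conversation, a new session can pick up where the last one stopped, and comparing deadlines against an explicit date is what surfaces slipped deadlines.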
Final-semester MSc Data Science student with a data analytics internship at IIT Bombay's Development & Relations Foundation (DRF), where I built a pipeline reconciling 60,000+ alumni records across three CRM systems. Comfortable working across Python, SQL, and Power BI — from messy raw data to dashboards senior stakeholders actually use. Background in Mathematics gives me a sharper read on statistical outputs than most analytics candidates at this level.
Final-semester MSc Data Science student with hands-on experience building NLP pipelines, information retrieval systems, and ML-scored ranking models. Built RAG-based research discovery tools using LightGBM, BERTopic, and Sentence Transformers; developed entity resolution systems benchmarking embedding approaches against fuzzy matching baselines. Background in Mathematics underpins a first-principles approach to model design and evaluation.
Reconciled 60,000+ alumni records across three CRM platforms via field-level RapidFuzz token-set matching. Benchmarked embedding and TF-IDF approaches; selected fuzzy matching for reliable exact-identity resolution. Returned 43,000+ matches with per-field confidence scores, collapsing weeks of manual reconciliation into structured, reviewable output.
STEM research discovery platform indexing arXiv and Semantic Scholar. RAG pipeline for paper summarisation, PAIS influence scoring model (LightGBM trained on h-index, citation count, venue score), and BERTopic for daily topic cluster detection. Open-source embeddings with Groq-backed LLM inference; results served via Streamlit with ranked paper recommendations.
STEM research discovery platform with RAG-based summarisation, influence-scored ranking (LightGBM), and BERTopic topic trend detection across arXiv and Semantic Scholar data. Groq API for LLM inference; Streamlit frontend with daily cluster outputs and ranked recommendations.
Entity resolution system reconciling 60,000+ alumni records across Salesforce, Raiser's Edge, and Almabase. Benchmarked sentence embedding and TF-IDF against field-level fuzzy matching — semantic approaches underperformed; final system uses RapidFuzz with per-field confidence scoring.
MCP server with six read/write tools giving Claude Desktop live access to project state, timestamps, and log history. Fixes the fundamental LLM context-loss problem between sessions. Single SQLite backend shared between Streamlit app and MCP server.
MCP (Model Context Protocol) server enabling persistent LLM context across sessions — addressing fundamental statelessness in LLM inference. Six read/write tool endpoints expose project state, temporal context, and log history to the model mid-conversation.