
Data Analyst · MSc Data Science

Chinmayi
Dixit.

I turn messy, fragmented data into pipelines, dashboards, and ML systems that actually hold up. Previously at IIT Bombay's DRF. Looking for analyst roles where the problems are hard and the data is real.

education
MSc Data Science
University of Mumbai
2024 → Expected 2026
BA Mathematics
Kalindi College, University of Delhi
2020 → 2023
skills

What I work
with.

Languages
Python · SQL
ML / AI
Scikit-learn · LightGBM · Regression · Classification · Clustering · NLP · RAG · BERTopic · Sentence Transformers · Fuzzy Matching · Record Linkage · TF-IDF
Data Eng
Pandas · NumPy · ETL Pipelines · Data Cleaning · Power Query · API Integration · MongoDB · SQLite
Viz & Tools
Power BI · Streamlit · Matplotlib · Seaborn · Git · Jupyter · Linux · Excel
currently learning
Linux (deep) · DSA · Computer Vision
about

The person
behind the pipeline.

Final-semester MSc Data Science student at the University of Mumbai, with a BA in Mathematics from the University of Delhi. The maths background isn't decoration — it means I understand why models behave the way they do, not just how to call them.

Most substantial work to date: IIT Bombay's DRF, where I built an entity resolution pipeline across Salesforce, Raiser's Edge, and Almabase. Benchmarked embedding approaches, found field-level fuzzy matching outperformed semantic methods for exact-identity verification, and shipped something that was used in production.

Outside of data, I'm a wildlife and bird photographer — jungle safaris, waiting for the right shot. That patience makes me a better analyst. You notice what's actually there when you stop trying to force an answer.

experience

Where I've
done the work.

Feb 2026 → May 2026
Data Management & Analytics Intern
IIT Bombay Development & Relations Foundation (DRF)
  • Cleaned and standardised 70,000+ alumni records across 700+ fields using Python, SQL, Pandas, NumPy, and Power Query.
  • Built Power BI dashboards covering donation trends, donor segmentation, and campaign performance for senior stakeholder reporting.
  • Built an ID mapping pipeline unifying alumni identities across two institutional databases, producing Salesforce-ready output.
  • Pulled Hurun + Forbes data into a prospect scoring dataset for the institute's HNI outreach programme.
Mar 2025 → Apr 2025
Data Engineering Intern
National Centre of Nano Sciences and Nanotechnology, University of Mumbai
  • Built a MongoDB backend for lab resource and facility data, cutting manual lookup time for internal teams.
  • Connected the pipeline to a web app that replaced the team's spreadsheet workflow entirely.
projects

Things built
and shipped.

Entity Resolution
Alumni Entity Resolution Tool

Reconciled 60,000+ alumni records across Salesforce, Raiser's Edge, and Almabase via field-level RapidFuzz token-set matching. Benchmarked all-MiniLM-L6-v2 and TF-IDF — both underperformed on exact-identity matching. Per-field confidence scores made manual review structured rather than guesswork.

Python · RapidFuzz · Pandas · Recordlinkage · Streamlit
Information Retrieval
The Re-Search Engine

STEM research discovery platform on arXiv and Semantic Scholar. RAG for paper summarisation, PAIS influence scoring via LightGBM, BERTopic for daily topic clusters. Open-source embeddings + Groq inference, Streamlit frontend with ranked recommendations.

Python · LightGBM · BERTopic · RAG · Groq API · Streamlit
LLM Tooling · Systems
Reality Tracker

Personal project tracking system built to fix a real LLM limitation: no memory between sessions. LLMs forget context, have no sense of time — deadlines slip undetected. This fixes both.

MCP server with six read/write tools so Claude Desktop can query live project state, current time, and log history mid-conversation. Single SQLite database shared between Streamlit app and MCP server.

Python · SQLite · MCP · Streamlit
// personal tooling — no public repo

reach me at
chinmayi.lol

// yes, .lol is intentional.

delhi/ncr · data analyst

the lab.

Everything I've built, shipped, or experimented with — pipelines, ML systems, LLM tooling, and whatever comes next. More of a field notebook than a portfolio.

03 projects
02 public repos
status: ongoing
01
Entity Resolution · Analytics
Alumni Entity Resolution Tool
// 60K records. three CRMs. one clean output.

Reconciled 60,000+ alumni records across Salesforce, Raiser's Edge, and Almabase using field-level RapidFuzz token-set matching. Benchmarked all-MiniLM-L6-v2 and TF-IDF — both underperformed on exact-identity matching. Per-field confidence scores turned weeks of manual reconciliation into a structured, reviewable output. Built during internship at IIT Bombay's DRF.

Python · RapidFuzz · Pandas · Recordlinkage · Streamlit
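
For readers who want the shape of the idea: the field-level matching described above can be sketched roughly as below. This is a minimal illustration, not the pipeline that ran at DRF. It assumes three hypothetical fields (`name`, `email`, `degree`), made-up thresholds, and a standard-library stand-in for RapidFuzz's token-set comparison built on `difflib`, so it runs without extra installs. The records are invented.

```python
from difflib import SequenceMatcher

# Hypothetical field names; the real CRM schemas are not shown in this write-up.
FIELDS = ("name", "email", "degree")


def _ratio(a: str, b: str) -> float:
    """Character-level similarity scaled to 0-100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_set_score(a: str, b: str) -> float:
    """Order- and duplicate-insensitive comparison of two field values.

    A difflib-based approximation of a token-set ratio, used here only
    as a stand-in for the RapidFuzz scorer.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    common = " ".join(sorted(tokens_a & tokens_b))
    rest_a = " ".join(sorted(tokens_a - tokens_b))
    rest_b = " ".join(sorted(tokens_b - tokens_a))
    full_a = f"{common} {rest_a}".strip()
    full_b = f"{common} {rest_b}".strip()
    return max(_ratio(common, full_a), _ratio(common, full_b), _ratio(full_a, full_b))


def field_scores(rec_a: dict, rec_b: dict) -> dict:
    """One confidence score per field, so a reviewer can see where records disagree."""
    return {f: round(token_set_score(rec_a.get(f, ""), rec_b.get(f, "")), 1) for f in FIELDS}


def classify(scores: dict, accept: float = 90.0, review: float = 70.0) -> str:
    """Decide on the weakest field; the thresholds are illustrative assumptions."""
    worst = min(scores.values())
    if worst >= accept:
        return "match"
    if worst >= review:
        return "review"
    return "no_match"


# Invented records: same person with reordered tokens, then a namesake.
crm_a = {"name": "Asha R. Menon", "email": "asha.menon@example.com", "degree": "B.Tech Computer Science"}
crm_b = {"name": "Menon Asha R.", "email": "asha.menon@example.com", "degree": "Computer Science B.Tech"}
crm_c = {"name": "Asha Menon", "email": "asha.m@example.org", "degree": "M.Tech Civil Engineering"}

for other in (crm_b, crm_c):
    scores = field_scores(crm_a, other)
    print(scores, classify(scores))
```

Scoring each field separately is the point: the namesake record agrees fully on `name` but not on the other fields, and that disagreement stays visible instead of being averaged into one whole-record similarity number.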
02
ML · NLP
The Re-Search Engine
// rank papers. detect trends. skip the noise.

STEM research discovery platform indexing arXiv and Semantic Scholar. RAG pipeline for paper summarisation, a PAIS influence scoring model (LightGBM trained on h-index, citation count, venue score), and BERTopic for daily topic cluster detection. Open-source embeddings with Groq-backed LLM inference. MSc thesis project.

Python · LightGBM · BERTopic · RAG · Groq API · Streamlit
03
LLM Tooling · Systems
Reality Tracker
// giving LLMs memory they were never built to have.

Personal project tracking system built to fix a real LLM limitation: no memory between sessions. LLMs forget context, have no sense of time — deadlines slip undetected. MCP server with six read/write tools so Claude Desktop can query live project state, current time, and log history mid-conversation. Single SQLite backend shared between Streamlit app and MCP server.

Python · SQLite · MCP Protocol · Streamlit
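
A rough sketch of the storage side of that design, under stated assumptions: the table layout and every function name below are hypothetical, and the functions are plain Python rather than registered MCP tools (no MCP SDK is used here). It only shows how a single SQLite database can hold project state, deadlines, and a log that a server process and a Streamlit app could both open.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def connect(path: str = ":memory:") -> sqlite3.Connection:
    """Open the shared database; pass a file path so two processes can share it."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS projects (
            name TEXT PRIMARY KEY,
            status TEXT NOT NULL,
            deadline TEXT              -- ISO date, e.g. 2026-01-15
        );
        CREATE TABLE IF NOT EXISTS log (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            project TEXT NOT NULL,
            note TEXT NOT NULL,
            logged_at TEXT NOT NULL
        );
        """
    )
    return conn


def current_time() -> str:
    """Read tool: gives the model a clock it does not otherwise have."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def upsert_project(conn: sqlite3.Connection, name: str, status: str,
                   deadline: Optional[str] = None) -> None:
    """Write tool: create or overwrite a project's state."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO projects (name, status, deadline) VALUES (?, ?, ?)",
            (name, status, deadline),
        )


def add_log(conn: sqlite3.Connection, project: str, note: str) -> None:
    """Write tool: append a timestamped note."""
    with conn:
        conn.execute(
            "INSERT INTO log (project, note, logged_at) VALUES (?, ?, ?)",
            (project, note, current_time()),
        )


def get_project(conn: sqlite3.Connection, name: str) -> Optional[dict]:
    """Read tool: current state of one project, or None if unknown."""
    row = conn.execute(
        "SELECT name, status, deadline FROM projects WHERE name = ?", (name,)
    ).fetchone()
    return dict(row) if row else None


def recent_logs(conn: sqlite3.Connection, project: str, limit: int = 5) -> list:
    """Read tool: newest notes first."""
    rows = conn.execute(
        "SELECT note, logged_at FROM log WHERE project = ? ORDER BY id DESC LIMIT ?",
        (project, limit),
    ).fetchall()
    return [dict(r) for r in rows]


def overdue(conn: sqlite3.Connection, today: str) -> list:
    """Read tool: unfinished projects whose ISO deadline is before `today`."""
    rows = conn.execute(
        "SELECT name FROM projects WHERE deadline IS NOT NULL "
        "AND deadline < ? AND status != 'done' ORDER BY name",
        (today,),
    ).fetchall()
    return [r["name"] for r in rows]
```

ISO-formatted dates sort correctly as plain strings, which is why `overdue` can compare deadlines in SQL without any date parsing.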
Chinmayi Dixit chinmayi.lol
Data Analytics AI · Machine Learning
Summary

Final-semester MSc Data Science student with a data analytics internship at IIT Bombay's Development & Relations Foundation (DRF), where I built a pipeline reconciling 60,000+ alumni records across three CRM systems. Comfortable working across Python, SQL, and Power BI — from messy raw data to dashboards senior stakeholders actually use. Background in Mathematics gives me a sharper read on statistical outputs than most analytics candidates at this level.

Final-semester MSc Data Science student with hands-on experience building NLP pipelines, information retrieval systems, and ML-scored ranking models. Built RAG-based research discovery tools using LightGBM, BERTopic, and Sentence Transformers; developed entity resolution systems benchmarking embedding approaches against fuzzy matching baselines. Background in Mathematics underpins a first-principles approach to model design and evaluation.

Experience
Data Management & Analytics Intern · Feb 2026 – May 2026
IIT Bombay Development & Relations Foundation (DRF)
  • Cleaned and standardised 70,000+ alumni records across 700+ fields using Python (Pandas, NumPy), SQL, and Power Query — establishing a single source of truth for alumni engagement.
  • Built Power BI dashboards covering donation trends, donor segmentation, and campaign performance, used directly in senior stakeholder reporting.
  • Benchmarked all-MiniLM-L6-v2 sentence embeddings and TF-IDF against field-level fuzzy matching for alumni identity resolution — fuzzy outperformed semantic approaches on exact-identity matching tasks.
  • Developed an entity resolution pipeline unifying alumni identities across Salesforce, Raiser's Edge, and Almabase, producing Salesforce-ready output.
  • Integrated financial and biographical data from Hurun and Forbes into a prospect scoring dataset for the institute's HNI outreach programme.
Data Engineering Intern · Mar 2025 – Apr 2025
National Centre of Nano Sciences and Nanotechnology, University of Mumbai
  • Built a MongoDB data pipeline to organise and serve the lab's resource and facility data, reducing manual lookup effort for internal teams.
  • Connected the pipeline to a web app that replaced the team's spreadsheet-based workflow entirely.
Projects
Alumni Entity Resolution Tool
Python · Pandas · RapidFuzz · Recordlinkage · Streamlit

Reconciled 60,000+ alumni records across three CRM platforms via field-level RapidFuzz token-set matching. Benchmarked embedding and TF-IDF approaches; selected fuzzy matching for reliable exact-identity resolution. Returned 43,000+ matches with per-field confidence scores, collapsing weeks of manual reconciliation into structured, reviewable output.

The Re-Search Engine
Python · all-MiniLM-L6-v2 · BERTopic · LightGBM · Groq API · Streamlit

STEM research discovery platform indexing arXiv and Semantic Scholar. RAG pipeline for paper summarisation, PAIS influence scoring model (LightGBM trained on h-index, citation count, venue score), and BERTopic for daily topic cluster detection. Open-source embeddings with Groq-backed LLM inference; results served via Streamlit with ranked paper recommendations.

The Re-Search Engine
Python · LightGBM · BERTopic · Sentence Transformers · Groq API · Streamlit

STEM research discovery platform with RAG-based summarisation, influence-scored ranking (LightGBM), and BERTopic topic trend detection across arXiv and Semantic Scholar data. Groq API for LLM inference; Streamlit frontend with daily cluster outputs and ranked recommendations.

Alumni Entity Resolution Tool
Python · Pandas · RapidFuzz · Recordlinkage · Sentence Transformers · Streamlit

Entity resolution system reconciling 60,000+ alumni records across Salesforce, Raiser's Edge, and Almabase. Benchmarked sentence embedding and TF-IDF against field-level fuzzy matching: semantic approaches underperformed; the final system uses RapidFuzz with per-field confidence scoring.

Reality Tracker — LLM Project Tracking System
Python · SQLite · MCP Protocol · Streamlit

MCP server with six read/write tools giving Claude Desktop live access to project state, timestamps, and log history. Fixes the fundamental LLM context-loss problem between sessions. Single SQLite backend shared between Streamlit app and MCP server.

MCP (Model Context Protocol) server enabling persistent LLM context across sessions — addressing fundamental statelessness in LLM inference. Six read/write tool endpoints expose project state, temporal context, and log history to the model mid-conversation.

Skills
Languages
Python · SQL
Analytics & BI
Power BI · Power Query · Excel · Matplotlib · Seaborn
Data Engineering
Pandas · NumPy · ETL Pipelines · Data Cleaning · MongoDB · SQLite · API Integration
ML / Statistics
Regression · Classification · Clustering · Fuzzy Matching · Record Linkage · Scikit-learn
Tools
Streamlit · Git · Jupyter · Linux
Learning
DSA · Computer Vision · Linux (deep)
Languages
Python · SQL
ML / AI
LightGBM · Scikit-learn · NLP · RAG · BERTopic · Sentence Transformers · Fuzzy Matching · Record Linkage · TF-IDF
Modelling
Regression · Classification · Clustering · Embedding Models · LLM Inference
Data & Engineering
Pandas · NumPy · ETL Pipelines · MongoDB · SQLite · API Integration
Tools
Streamlit · Git · Jupyter · Linux
Learning
Computer Vision · DSA · Linux (deep)
Education
MSc Data Science
University of Mumbai
2024 – Expected 2026
BA Mathematics
Kalindi College, University of Delhi
2020 – 2023
Other
Led Android dev team and peer study sessions — GDSC Kalindi, University of Delhi
Wildlife & bird photographer — jungle safari field trips and Kalindi College photography circle
Personal site: chinmayi.lol