Data Analyst · MSc Data Science

Chinmayi
Dixit.

I turn messy, fragmented data into pipelines, dashboards, and ML systems that actually hold up. Previously at IIT Bombay's DRF. Looking for analyst roles where the problems are hard and the data is real.

education

MSc Data Science

University of Mumbai

2024 → Expected 2026

BA Mathematics

Kalindi College, University of Delhi

2020 → 2023

skills

What I work
with.

Languages

Python, SQL

ML / AI

Scikit-learn, LightGBM, Regression, Classification, Clustering, NLP, RAG, BERTopic, Sentence Transformers, Fuzzy Matching, Record Linkage, TF-IDF

Data Eng

Pandas, NumPy, ETL Pipelines, Data Cleaning, Power Query, API Integration, MongoDB, SQLite

Viz & Tools

Power BI, Streamlit, Matplotlib, Seaborn, Git, Jupyter, Linux, Excel

currently learning

Linux (deep), DSA, Computer Vision

about

The person
behind the pipeline.

Final-semester MSc Data Science student at the University of Mumbai, with a BA in Mathematics from the University of Delhi. The maths background isn't decoration — it means I understand why models behave the way they do, not just how to call them.

My most substantial work to date was at IIT Bombay's DRF, where I built an entity resolution pipeline across Salesforce, Raiser's Edge, and Almabase. I benchmarked embedding approaches, found that field-level fuzzy matching outperformed semantic methods for exact-identity verification, and shipped a pipeline that was used in production.

Outside of data, I'm a wildlife and bird photographer — jungle safaris, waiting for the right shot. That patience makes me a better analyst. You notice what's actually there when you stop trying to force an answer.

experience

Where I've
done the work.

Feb 2026 – May 2026

Data Management & Analytics Intern

IIT Bombay Development & Relations Foundation (DRF)

Cleaned and standardised 70,000+ alumni records across 700+ fields using Python, SQL, Pandas, NumPy, and Power Query.
Built Power BI dashboards covering donation trends, donor segmentation, and campaign performance for senior stakeholder reporting.
Built an ID mapping pipeline that unified alumni identities across two institutional databases and produced Salesforce-ready output.
Pulled Hurun + Forbes data into a prospect scoring dataset for the institute's HNI outreach programme.

Mar 2025 – Apr 2025

Data Engineering Intern

National Centre of Nano Sciences and Nanotechnology, University of Mumbai

Built a MongoDB backend for lab resource and facility data, cutting manual lookup time for internal teams.
Connected the pipeline to a web app that replaced the team's spreadsheet workflow entirely.

projects

Things built
and shipped.

Entity Resolution

Alumni Entity Resolution Tool

Reconciled 60,000+ alumni records across Salesforce, Raiser's Edge, and Almabase via field-level RapidFuzz token-set matching. Benchmarked all-MiniLM-L6-v2 and TF-IDF — both underperformed on exact-identity matching. Per-field confidence scores made manual review structured rather than guesswork.

Python · RapidFuzz · Pandas · Recordlinkage · Streamlit
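For illustration, a minimal sketch of field-level matching with per-field confidence scores. It is not the production pipeline: the standard library's `difflib.SequenceMatcher` stands in for RapidFuzz's token-set scorer, and the field names, weights, and record shape are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical fields and weights, chosen only for this sketch.
FIELD_WEIGHTS = {"name": 0.5, "email": 0.3, "department": 0.2}


def token_set_similarity(a: str, b: str) -> float:
    """Order-insensitive similarity in [0, 100].

    Both strings are lowercased, split on whitespace, de-duplicated and
    sorted before comparison, so "Rao Asha" and "Asha Rao" compare equal.
    This approximates a token-set scorer; it is not RapidFuzz itself.
    """
    tokens_a = " ".join(sorted(set(a.lower().split())))
    tokens_b = " ".join(sorted(set(b.lower().split())))
    if not tokens_a or not tokens_b:
        return 0.0  # a missing field contributes no evidence
    return 100.0 * SequenceMatcher(None, tokens_a, tokens_b).ratio()


def score_pair(rec_a: dict, rec_b: dict) -> tuple[dict, float]:
    """Return (per-field scores, weighted overall score) for two records."""
    per_field = {
        field: token_set_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field in FIELD_WEIGHTS
    }
    overall = sum(FIELD_WEIGHTS[f] * s for f, s in per_field.items())
    return per_field, overall
```

Keeping the per-field scores next to the overall score lets a reviewer see which field pulled a borderline pair up or down.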

Information Retrieval

The Re-Search Engine

STEM research discovery platform on arXiv and Semantic Scholar. RAG for paper summarisation, PAIS influence scoring via LightGBM, BERTopic for daily topic clusters. Open-source embeddings + Groq inference, Streamlit frontend with ranked recommendations.

Python · LightGBM · BERTopic · RAG · Groq API · Streamlit
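The retrieval step of a RAG pipeline can be sketched as below. This is a toy under stated assumptions, not the project's code: TF-IDF cosine similarity stands in for the open-source embedding model, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer; deliberately naive."""
    return text.lower().split()


def build_tfidf(docs: list[str]) -> tuple[list[dict], dict]:
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF so terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized
    ]
    return vectors, idf


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    vectors, idf = build_tfidf(docs)
    # Query terms unseen in the corpus get zero weight.
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    ranked = sorted(
        range(len(docs)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True
    )
    return [docs[i] for i in ranked[:k]]
```

In a full pipeline the retrieved passages would then be passed to the LLM as context for summarisation; that step is omitted here.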

LLM Tooling · Systems

Reality Tracker

Personal project tracking system built to fix a real LLM limitation: no memory between sessions. LLMs forget context, have no sense of time — deadlines slip undetected. This fixes both.

MCP server with six read/write tools so Claude Desktop can query live project state, current time, and log history mid-conversation. Single SQLite database shared between Streamlit app and MCP server.

Python · SQLite · MCP · Streamlit

// personal tooling — no public repo

chinmayi.lol — bash — 80×40

// system info

~$whoami

Chinmayi Dixit — Data Analyst, MSc Data Science

~$cat status.json

location Delhi/NCR, India

target entry-level Data Analyst roles

email chinmayid2810@gmail.com

~$cat about.txt

MSc Data Science, University of Mumbai (2024–2026)

BA Mathematics, Kalindi College, University of Delhi (2020–2023)

Maths background = understanding WHY models behave, not just how to call them.

Also: wildlife + bird photographer. Jungle safaris. Patience is a data skill.

~$python -c "print(skills)"

languages Python, SQL

ml_ai Scikit-learn, LightGBM, Regression, Classification, Clustering, NLP, RAG, BERTopic, Sentence Transformers, TF-IDF, Fuzzy Matching, Record Linkage

data_eng Pandas, NumPy, ETL Pipelines, Data Cleaning, Power Query, API Integration, MongoDB, SQLite

viz_bi Power BI, Streamlit, Matplotlib, Seaborn

tools Git, Jupyter, Linux, Excel

learning Linux (deep), DSA, Computer Vision // in progress

// experience

~$ls -lt ./experience/

Data Management & Analytics Intern

IIT Bombay Development & Relations Foundation (DRF)

Feb 2026 – May 2026

Cleaned 70,000+ alumni records across 700+ fields — Python, SQL, Pandas, Power Query
Power BI dashboards: donation trends, donor segmentation, campaign performance
ID mapping pipeline → unified two institutional databases → Salesforce-ready output
Prospect scoring dataset from Hurun + Forbes for HNI outreach

Data Engineering Intern

National Centre of Nano Sciences and Nanotechnology, UoM

Mar 2025 – Apr 2025

MongoDB backend for lab resource + facility data — cut manual lookup time
Connected pipeline to web app → replaced spreadsheet workflow entirely

// projects

~$ls ./projects/ --verbose

Alumni_Entity_Resolution/

[ entity-resolution · record-linkage ]

Reconciled 60K+ alumni records across 3 CRMs via field-level RapidFuzz token-set matching. Benchmarked MiniLM + TF-IDF, both underperformed fuzzy for exact-identity work.

stack: Python · RapidFuzz · Pandas · Recordlinkage · Streamlit

→ github.com/Chinmayi2801/Alumni_Entity_Resolution

Research_Engine/

[ information-retrieval · nlp · ml ]

STEM research discovery on arXiv + Semantic Scholar. RAG summarisation, PAIS influence scoring via LightGBM, BERTopic daily clusters. Groq inference → Streamlit frontend.

stack: Python · LightGBM · BERTopic · RAG · Groq API · Streamlit

→ github.com/Chinmayi2801/Research_Engine

reality_tracker/ [private]

[ llm-tooling · mcp · systems ]

MCP server + Streamlit giving Claude Desktop persistent context. 6 read/write tools, SQLite backend.

stack: Python · SQLite · MCP · Streamlit

// personal tooling — no public repo

// contact

~$cat contact.json

email chinmayid2810@gmail.com

linkedin linkedin.com/in/dixitchinmayi

github github.com/Chinmayi2801

web chinmayi.lol

reach me at
chinmayi.lol

// yes, .lol is intentional.

delhi/ncr · data analyst

Email: chinmayid2810@gmail.com · LinkedIn: dixitchinmayi · GitHub: Chinmayi2801

the lab.

Everything I've built, shipped, or experimented with — pipelines, ML systems, LLM tooling, and whatever comes next. More of a field notebook than a portfolio.

03 projects · 02 public repos · status: ongoing

01 · Entity Resolution · Analytics
Alumni Entity Resolution Tool
// 60K records. three CRMs. one clean output.
Reconciled 60,000+ alumni records across Salesforce, Raiser's Edge, and Almabase using field-level RapidFuzz token-set matching. Benchmarked all-MiniLM-L6-v2 and TF-IDF — both underperformed on exact-identity matching. Per-field confidence scores turned weeks of manual reconciliation into a structured, reviewable output. Built during internship at IIT Bombay's DRF.

Python · RapidFuzz · Pandas · Recordlinkage · Streamlit

github ↗ · shipped

02 · ML · NLP
The Re-Search Engine
// rank papers. detect trends. skip the noise.
STEM research discovery platform indexing arXiv and Semantic Scholar. RAG pipeline for paper summarisation, a PAIS influence scoring model (LightGBM trained on h-index, citation count, venue score), and BERTopic for daily topic cluster detection. Open-source embeddings with Groq-backed LLM inference. MSc thesis project.

Python · LightGBM · BERTopic · RAG · Groq API · Streamlit

github ↗ · shipped

03 · LLM Tooling · Systems
Reality Tracker
// giving LLMs memory they were never built to have.
Personal project tracking system built to fix a real LLM limitation: no memory between sessions. LLMs forget context, have no sense of time — deadlines slip undetected. MCP server with six read/write tools so Claude Desktop can query live project state, current time, and log history mid-conversation. Single SQLite backend shared between Streamlit app and MCP server.

Python · SQLite · MCP Protocol · Streamlit

// repo coming soon · active

Chinmayi Dixit chinmayi.lol

Data Analytics AI · Machine Learning

chinmayid2810@gmail.com · linkedin.com/in/dixitchinmayi · github.com/Chinmayi2801 · Delhi / NCR

Summary

Final-semester MSc Data Science student with a data analytics internship at IIT Bombay's Development & Relations Foundation (DRF), where I built a pipeline reconciling 60,000+ alumni records across three CRM systems. Comfortable working across Python, SQL, and Power BI — from messy raw data to dashboards senior stakeholders actually use. Background in Mathematics gives me a sharper read on statistical outputs than most analytics candidates at this level.

Final-semester MSc Data Science student with hands-on experience building NLP pipelines, information retrieval systems, and ML-scored ranking models. Built RAG-based research discovery tools using LightGBM, BERTopic, and Sentence Transformers; developed entity resolution systems benchmarking embedding approaches against fuzzy matching baselines. Background in Mathematics underpins a first-principles approach to model design and evaluation.

Experience

Data Management & Analytics Intern · Feb 2026 – May 2026

IIT Bombay Development & Relations Foundation (DRF)

Cleaned and standardised 70,000+ alumni records across 700+ fields using Python (Pandas, NumPy), SQL, and Power Query — establishing a single source of truth for alumni engagement.
Built Power BI dashboards covering donation trends, donor segmentation, and campaign performance, used directly in senior stakeholder reporting.
Benchmarked all-MiniLM-L6-v2 sentence embeddings and TF-IDF against field-level fuzzy matching for alumni identity resolution — fuzzy outperformed semantic approaches on exact-identity matching tasks.
Developed an entity resolution pipeline unifying alumni identities across Salesforce, Raiser's Edge, and Almabase, producing Salesforce-ready output.
Integrated financial and biographical data from Hurun and Forbes into a prospect scoring dataset for the institute's HNI outreach programme.

Data Engineering Intern · Mar 2025 – Apr 2025

National Centre of Nano Sciences and Nanotechnology, University of Mumbai

Built a MongoDB data pipeline to organise and serve the lab's resource and facility data, reducing manual lookup effort for internal teams.
Connected the pipeline to a web app that replaced the team's spreadsheet-based workflow entirely.

Projects

Alumni Entity Resolution Tool · github ↗

Python · Pandas · RapidFuzz · Recordlinkage · Streamlit

Reconciled 60,000+ alumni records across three CRM platforms via field-level RapidFuzz token-set matching. Benchmarked embedding and TF-IDF approaches; selected fuzzy matching for reliable exact-identity resolution. Returned 43,000+ matches with per-field confidence scores, collapsing weeks of manual reconciliation into structured, reviewable output.

The Re-Search Engine · github ↗

Python · all-MiniLM-L6-v2 · BERTopic · LightGBM · Groq API · Streamlit

STEM research discovery platform indexing arXiv and Semantic Scholar. RAG pipeline for paper summarisation, PAIS influence scoring model (LightGBM trained on h-index, citation count, venue score), and BERTopic for daily topic cluster detection. Open-source embeddings with Groq-backed LLM inference; results served via Streamlit with ranked paper recommendations.

The Re-Search Engine · github ↗

Python · LightGBM · BERTopic · Sentence Transformers · Groq API · Streamlit

STEM research discovery platform with RAG-based summarisation, influence-scored ranking (LightGBM), and BERTopic topic trend detection across arXiv and Semantic Scholar data. Groq API for LLM inference; Streamlit frontend with daily cluster outputs and ranked recommendations.

Alumni Entity Resolution Tool · github ↗

Python · Pandas · RapidFuzz · Recordlinkage · Sentence Transformers · Streamlit

Entity resolution system reconciling 60,000+ alumni records across Salesforce, Raiser's Edge, and Almabase. Benchmarked sentence embedding and TF-IDF against field-level fuzzy matching. Semantic approaches underperformed; the final system uses RapidFuzz with per-field confidence scoring.

Reality Tracker — LLM Project Tracking System

Python · SQLite · MCP Protocol · Streamlit

MCP server with six read/write tools giving Claude Desktop live access to project state, timestamps, and log history. Fixes the fundamental LLM context-loss problem between sessions. Single SQLite backend shared between Streamlit app and MCP server.

MCP (Model Context Protocol) server enabling persistent LLM context across sessions — addressing fundamental statelessness in LLM inference. Six read/write tool endpoints expose project state, temporal context, and log history to the model mid-conversation.

Skills

Languages

Python, SQL

Analytics & BI

Power BI, Power Query, Excel, Matplotlib, Seaborn

Data Engineering

Pandas, NumPy, ETL Pipelines, Data Cleaning, MongoDB, SQLite, API Integration

ML / Statistics

Regression, Classification, Clustering, Fuzzy Matching, Record Linkage, Scikit-learn

Tools

Streamlit, Git, Jupyter, Linux

Learning

DSA, Computer Vision, Linux (deep)

Languages

Python, SQL

ML / AI

LightGBM, Scikit-learn, NLP, RAG, BERTopic, Sentence Transformers, Fuzzy Matching, Record Linkage, TF-IDF

Modelling

Regression, Classification, Clustering, Embedding Models, LLM Inference

Data & Engineering

Pandas, NumPy, ETL Pipelines, MongoDB, SQLite, API Integration

Tools

Streamlit, Git, Jupyter, Linux

Learning

Computer Vision, DSA, Linux (deep)

Education

MSc Data Science

University of Mumbai

2024 – Expected 2026

BA Mathematics

Kalindi College, University of Delhi

2020 – 2023

Other

Led Android dev team and peer study sessions — GDSC Kalindi, University of Delhi

Wildlife & bird photographer — jungle safari field trips and Kalindi College photography circle

Personal site: chinmayi.lol
