Building a Personal CRM: Parsing 20 Years of Chat Data with LLMs

1.2 Million Messages, One Life Timeline

A developer named Drobinin spent years tracking his life through chat logs. He exported archives from Telegram, VK, Instagram, and Facebook—covering 2008 to 2024. The result: a structured personal CRM built from 1.2 million messages, 52,000 unique lemmas, and 5,695 conversation-days tagged with directional sentiment.

The Data Pipeline

Exporting and Parsing

Each platform has quirks. Instagram double-encodes Cyrillic through latin-1. Telegram assigns different internal message IDs between exports. Facebook E2E encryption scatters messages across three folders. VK exports everything without asking. Instagram doesn't differentiate broadcasts from personal chats.
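The Instagram quirk is the classic mojibake case. Here is a minimal sketch of a repair, assuming the damage is UTF-8 bytes that were decoded as latin-1 somewhere in the export; the function name is illustrative and not taken from Drobinin's code.

```python
def fix_double_encoding(text: str) -> str:
    """Undo UTF-8 text that was mis-decoded as latin-1.

    Returns the input unchanged when it does not round-trip, so
    already-correct strings pass through safely.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the damage, then repair it.
garbled = "Привет".encode("utf-8").decode("latin-1")
print(fix_double_encoding(garbled))
```

The try/except matters in a mixed corpus: correctly decoded Cyrillic cannot be encoded as latin-1, so it falls through untouched instead of being corrupted a second time.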

Drobinin parsed all exports into a uniform tab-separated format. The corpus includes DMs, story interactions, follower graphs, and reply/mention graphs. For Twitter, he used reply graphs to filter out support requests and conference coordination.
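As a rough illustration of what a uniform tab-separated format could look like, the sketch below writes one message per row. The column names and the `Message` type are assumptions; the article does not describe the actual schema.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields


@dataclass
class Message:
    # Hypothetical columns; the real schema is not given in the article.
    platform: str
    chat_id: str
    msg_id: str
    timestamp: str  # ISO 8601
    sender: str
    text: str


def write_tsv(messages, out):
    """Write messages as TSV with a header row, one message per line."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([f.name for f in fields(Message)])
    for m in messages:
        # Collapse embedded tabs and newlines so a message never spans rows.
        writer.writerow([" ".join(str(v).split()) for v in astuple(m)])


buf = io.StringIO()
write_tsv(
    [Message("telegram", "c1", "42", "2015-03-01T10:00:00", "me", "hi\tthere\nfriend")],
    buf,
)
print(buf.getvalue())
```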

Noise Filtering

His longest thread—486,000+ messages with a partner across ten years—breaks down as:

58.7% substantive text
28.4% short fillers
9.1% media
2.4% links
1.5% emoji-only

Filtering short messages was tricky. A three-word minimum would miss "he died". A denylist of "hahaha" and "noice" failed across languages. The solution: sample the corpus at five offset positions, frequency-count the short tokens, manually review the top 80, and pair the resulting denylist with a protected set of short phrases that signal life events.
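The filtering steps above can be sketched as follows. This is one plausible way to combine the pieces, not the author's implementation: the entries in `DENYLIST` and `PROTECTED`, the thresholds, and the order of the checks are all assumptions.

```python
from collections import Counter

# Hypothetical entries; the real lists came out of the manual review.
DENYLIST = {"hahaha", "noice", "ok", "lol"}
PROTECTED = {"he died", "she said yes"}


def sample_offsets(messages, n_positions=5, window=1000):
    """Take one window of messages at each of n evenly spaced offsets."""
    step = max(1, len(messages) // n_positions)
    sample = []
    for i in range(n_positions):
        start = i * step
        sample.extend(messages[start:start + min(window, step)])
    return sample


def short_token_counts(messages, max_words=2, top_n=80):
    """Frequency-count short messages so the top ones can be reviewed by hand."""
    counts = Counter(
        m.strip().lower() for m in messages if 0 < len(m.split()) <= max_words
    )
    return counts.most_common(top_n)


def is_filler(message, min_words=3):
    """Protected phrases always survive; denylisted or very short ones do not."""
    text = message.strip().lower()
    if text in PROTECTED:
        return False
    words = text.split()
    if all(w in DENYLIST for w in words):
        return True
    return len(words) < min_words


msgs = ["hahaha", "he died", "ok", "see you at the station tomorrow", "noice", "hahaha"]
print(short_token_counts(msgs)[:3])
print([m for m in msgs if not is_filler(m)])
```

Checking the protected set first is what lets "he died" through despite the three-word minimum.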

After cleaning, the novelty rate—share of words never used before in any chat—plateaued at 6% six years ago. Most vocabulary was locked in by his early 20s.
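A novelty rate of this kind can be computed in a few lines. The sketch below makes two assumptions: it treats whitespace-separated tokens as words (the article's figures are based on lemmas), and it measures novelty as the share of a period's distinct words that never appeared in any earlier period.

```python
from collections import defaultdict


def novelty_by_period(messages):
    """messages: iterable of (period, text). Returns {period: novelty share}."""
    vocab = defaultdict(set)
    for period, text in messages:
        vocab[period].update(text.lower().split())

    seen = set()
    rates = {}
    for period in sorted(vocab):
        words = vocab[period]
        # Share of this period's distinct words not seen in earlier periods.
        rates[period] = len(words - seen) / len(words) if words else 0.0
        seen |= words
    return rates


rates = novelty_by_period([
    (2008, "hello world"),
    (2009, "hello again world"),
    (2010, "hello world"),
])
print(rates)
```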

Entity Resolution: Which Sasha?

People use multiple platforms with different usernames. Alexander becomes Al, Alex, Xander, Sandy, Alec—or Sasha, which is gender-neutral in Slavic languages. Morphological analyzers handle case inflection but not slang. NER models need hand-labeled training sets.

Drobinin used LLMs for name resolution. A prompt reads a chunk of messages and produces a structured JSON manifest with daily note bullets, entity facts, and a list of ambiguities ("msg 833006: 'John' without surname"). A deterministic Python script injects the bullets with provenance markers linking back to source messages.

Classification with LLMs

Keyword matching on first-person verbs ("bought", "moved") produced false positives. "I moved" said to his mom is a relocation; in a friends' chat, it's interior design; after a breakup, it's an emotional milestone. Fine-tuning a BERT classifier would yield ~70-80% F1, and at 1.2M messages even a 1% false-positive rate means 12,000 fake events.

Drobinin used LLMs (Opus, Qwen3-30B-A3B locally via MLX) for classification. He ran 200+ sessions, roughly 15-20 billion tokens. On Opus, that's ~$15k. On a local M5 Pro, 10-15 weeks of continuous inference. The false-positive rate was under 1% on chunks below 6,000 messages.

A closure gate catches orphan wikilinks and duplicate citations. Five to ten outputs per batch are sampled and checked against the source messages. The model's self-reported confidence is never trusted.
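A closure gate like this can be a small deterministic check over the generated notes. The sketch below assumes `[[Name]]` wikilinks and a `(msg 123)` citation marker; the marker format in particular is hypothetical, since the article does not show it.

```python
import re
from collections import Counter

WIKILINK = re.compile(r"\[\[([^\]|#]+)")
CITATION = re.compile(r"\(msg (\d+)\)")  # hypothetical provenance marker


def closure_gate(note_text, known_notes):
    """Report wikilinks with no target note and citations used more than once."""
    links = {name.strip() for name in WIKILINK.findall(note_text)}
    cites = Counter(CITATION.findall(note_text))
    return {
        "orphan_links": sorted(links - set(known_notes)),
        "duplicate_citations": sorted(c for c, n in cites.items() if n > 1),
    }


note = (
    "- Met [[Sasha K]] at the station (msg 101)\n"
    "- Moved flats with [[Unknown Person]] (msg 102)\n"
    "- Moved flats (msg 102)"
)
result = closure_gate(note, {"Sasha K"})
print(result)
```

Because the check is plain string matching, it gives the same answer on every run, which is the point of gating LLM output with something that is not itself an LLM.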

Directional Sentiment

Standard sentiment assigns one polarity per message. But close friendships are warm by default—the signal is departure from baseline. Drobinin used 18 tags with three directional prefixes: my emotional state, counterpart's, and mutual.

He initially let the LLM free-tag, getting 5,700+ unique values like "WWDC-binge-mode". He redid it with the 18-tag system. Result: 66% of conversation-days are M:warm. In an average month, 12.9% of conversations are transactional; in March it is 17%, thanks to the UK tax-year end.
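With a closed tag set, both checks described above become simple aggregations. The sketch below assumes each conversation-day is an ISO date plus a set of tags such as "M:warm"; the data layout and function names are illustrative, and only tags named in the article appear in the example.

```python
from collections import defaultdict


def unknown_tags(days, allowed):
    """Tags outside the closed vocabulary, e.g. leftovers from free-tagging."""
    used = {tag for _, tags in days for tag in tags}
    return sorted(used - set(allowed))


def monthly_share(days, tag):
    """Share of conversation-days carrying `tag`, grouped by calendar month."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for iso_date, tags in days:
        month = int(iso_date[5:7])
        totals[month] += 1
        hits[month] += tag in tags
    return {m: hits[m] / totals[m] for m in sorted(totals)}


days = [
    ("2023-03-01", {"M:transactional"}),
    ("2023-03-02", {"M:warm"}),
    ("2023-04-01", {"M:warm", "WWDC-binge-mode"}),
    ("2024-03-05", {"M:transactional"}),
]
print(unknown_tags(days, {"M:warm", "M:playful", "M:transactional"}))
print(monthly_share(days, "M:transactional"))
```

Grouping by calendar month across years is what surfaces a seasonal effect like the March spike; a per-year breakdown would hide it.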

What the Data Shows

Message volume drops don't always mean friendship decay. Average message length can increase as relationships mature. Vocabulary overlap—Jaccard similarity of top-100 words—dropped from 69.5% to 8.7% in some relationships, indicating drifting interests.
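For reference, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal sketch over top-N word sets follows, assuming whitespace tokenization and that the two sets being compared are each participant's most frequent words in a given period.

```python
from collections import Counter


def top_words(messages, n=100):
    """The n most frequent whitespace-separated tokens, as a set."""
    counts = Counter(word for m in messages for word in m.lower().split())
    return {word for word, _ in counts.most_common(n)}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


mine = top_words(["swift xcode release", "swift bug"], n=100)
theirs = top_words(["swift conference", "garden release"], n=100)
print(jaccard(mine, theirs))
```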

A friendship shifting from M:playful to M:transactional across 18 months is a drift that's hard to notice one conversation at a time.

Technical Takeaways

LLMs beat fine-tuned classifiers for noisy, multilingual, contextual classification—if you can afford the inference cost.
Provenance tracking is critical for rollback. Every bullet links back to source messages via SQLite.
Directional sentiment reveals relationship health where absolute sentiment fails.
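To make the provenance takeaway concrete, here is a minimal sketch of a SQLite store that links each bullet to its source messages and can roll back a whole batch. The table names, column names, and the notion of a `batch_id` are assumptions; the article only says that bullets link back to messages via SQLite.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) provenance tables if they do not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bullets ("
        "id INTEGER PRIMARY KEY, batch_id TEXT NOT NULL, "
        "note_date TEXT NOT NULL, text TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bullet_sources ("
        "bullet_id INTEGER NOT NULL, msg_id INTEGER NOT NULL, "
        "PRIMARY KEY (bullet_id, msg_id))"
    )
    return conn


def add_bullet(conn, batch_id, note_date, text, msg_ids):
    """Store one bullet together with the message IDs it was derived from."""
    cur = conn.execute(
        "INSERT INTO bullets (batch_id, note_date, text) VALUES (?, ?, ?)",
        (batch_id, note_date, text),
    )
    conn.executemany(
        "INSERT INTO bullet_sources (bullet_id, msg_id) VALUES (?, ?)",
        [(cur.lastrowid, m) for m in msg_ids],
    )
    conn.commit()
    return cur.lastrowid


def rollback_batch(conn, batch_id):
    """Delete every bullet from a bad batch, plus its source links."""
    conn.execute(
        "DELETE FROM bullet_sources WHERE bullet_id IN "
        "(SELECT id FROM bullets WHERE batch_id = ?)",
        (batch_id,),
    )
    deleted = conn.execute(
        "DELETE FROM bullets WHERE batch_id = ?", (batch_id,)
    ).rowcount
    conn.commit()
    return deleted


conn = open_store()
add_bullet(conn, "batch-1", "2019-05-01", "Moved flats", [101, 102])
add_bullet(conn, "batch-2", "2019-05-02", "Started new job", [201])
print(rollback_batch(conn, "batch-1"))
```

Keying rollback on the batch rather than on individual bullets means a bad LLM session can be undone in one call without touching output from other sessions.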

The Code

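# Sketch of LLM-based classification. load_messages, llm, inject_bullets and
# provenance_store are placeholders for project-specific code, not a real API.
import json

chunk = load_messages(chat_id, start_msg, end_msg)

# Literal braces are doubled so the f-string only interpolates {chunk}.
prompt = f"""
Analyze these messages. Return JSON with:
- daily_notes: list of {{date, bullets, sentiment}}
- events: list of {{date, event_type, description}}
- ambiguities: list of {{msg_id, issue}}
Messages: {chunk}
"""
response = llm.generate(prompt)
manifest = json.loads(response)  # raises json.JSONDecodeError on malformed output

# Deterministic step: write bullets with links back to their source messages.
inject_bullets(manifest, provenance_store)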

Why This Matters

Drobinin's approach shows how to turn noisy, multi-platform personal data into structured insights. For developers building personal analytics, CRM tools, or lifelogging apps, this is a blueprint: parse exports, filter noise, resolve entities with LLMs, and track provenance for debuggability.

Editor's Take

I've been meaning to do something similar with my own chat archives for years. The $15k Opus bill gave me pause—but the local MLX approach is promising. I might try a smaller model like Llama 3.1 8B first. The provenance tracking is a must: without it, you're flying blind when the LLM hallucinates. I think the biggest insight here is that directional sentiment beats absolute sentiment for relationship tracking. I'm going to steal that idea for my own project.
