Hoovik: A Distributed WebRTC Platform with Real-Time Emotion AI

Anupam Kumar built Hoovik, a multi-party video meeting platform combining WebRTC, distributed Node.js with Redis, real-time emotion recognition via Wav2Vec2 and MediaPipe, and a RAG pipeline over transcripts. This article breaks down the architecture, tradeoffs, and code behind each service.

3 min read · Jun 6, 2026

The Four Services of Hoovik

Hoovik is not a monolith. It's four services wired together:

React frontend (Vite)
Node.js backend (Express + Socket.IO)
Python transcript service (FastAPI)
Python emotion service (FastAPI + Socket.IO)

The backend runs as multiple PM2 processes sharing state via Redis and MongoDB. The Socket.IO Redis Adapter enables cross-process event delivery.

Distributed Room State with Redis

Room state can't live in process memory when multiple Node.js instances handle requests. Instead, mutable meeting state is stored in Redis:

HSET meeting:participants:  
HDEL meeting:participants:

Join order is stored separately for WebRTC role assignment. Room joins are serialized using a Redis-backed distributed lock to prevent race conditions:

await withRoomLock(meetingCode, async () => {
  // join logic
});

The lock uses SET NX PX acquisition, token-based ownership, and Lua-script compare-and-delete release.
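To make the three lock properties concrete, here is a minimal sketch rather than Hoovik's actual implementation. The LockStore interface, the in-memory store, the key name, and the timing defaults are all assumptions made for illustration. Unlike the two-argument call shown above, this version takes the store as an explicit parameter so it can run without a Redis server. With a real client, setIfAbsent would map to SET key token NX PX ttl and compareAndDelete to the Lua script in the comment.

```typescript
// Hypothetical minimal store interface (not a real client API).
interface LockStore {
  setIfAbsent(key: string, token: string, ttlMs: number): Promise<boolean>;
  compareAndDelete(key: string, token: string): Promise<boolean>;
}

// Lua script typically used for compare-and-delete release:
//   if redis.call("GET", KEYS[1]) == ARGV[1] then
//     return redis.call("DEL", KEYS[1])
//   end
//   return 0

// In-memory stand-in so the sketch runs without Redis.
class InMemoryLockStore implements LockStore {
  private entries = new Map<string, { token: string; expiresAt: number }>();

  async setIfAbsent(key: string, token: string, ttlMs: number): Promise<boolean> {
    const held = this.entries.get(key);
    if (held && held.expiresAt > Date.now()) return false; // still locked
    this.entries.set(key, { token, expiresAt: Date.now() + ttlMs });
    return true;
  }

  async compareAndDelete(key: string, token: string): Promise<boolean> {
    const held = this.entries.get(key);
    if (!held || held.token !== token) return false; // owned by someone else
    this.entries.delete(key);
    return true;
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRoomLock<T>(
  store: LockStore,
  meetingCode: string,
  fn: () => Promise<T>,
  { ttlMs = 5000, retryMs = 10, waitMs = 2000 } = {},
): Promise<T> {
  const key = `lock:room:${meetingCode}`; // hypothetical key name
  // Unique token per acquisition; production code should use a CSPRNG.
  const token = `${Date.now().toString(36)}-${Math.random().toString(36).slice(2)}`;
  const deadline = Date.now() + waitMs;

  // Acquire: retry until the key is free or the wait budget is spent.
  while (!(await store.setIfAbsent(key, token, ttlMs))) {
    if (Date.now() >= deadline) throw new Error(`Timed out waiting for ${key}`);
    await sleep(retryMs);
  }
  try {
    return await fn();
  } finally {
    // Release only if the token still matches, so an expired lock that
    // another process re-acquired is never deleted by mistake.
    await store.compareAndDelete(key, token);
  }
}
```

The TTL matters because a process that crashes mid-join would otherwise hold the lock forever. The token check matters because a slow holder whose TTL has lapsed must not release a lock that now belongs to someone else.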

WebRTC Signaling and Perfect Negotiation

Peer connections are managed through React hooks implementing the perfect negotiation pattern. The frontend supports multi-party video, ICE restarts, screen sharing, and remote participant management.
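The core of perfect negotiation is a small decision made when a remote description arrives. The sketch below isolates that decision as pure functions so it can be read without a browser. The NegotiationState shape and the join-order rule in isPolite are assumptions for illustration; the article only says join order is used for role assignment, not which side ends up polite.

```typescript
// Snapshot of the local peer's negotiation state (hypothetical shape).
interface NegotiationState {
  polite: boolean;        // role of the local peer
  makingOffer: boolean;   // true while our own offer is being created or sent
  signalingState: string; // value of RTCPeerConnection.signalingState
}

// Assumed role rule: the peer that joined later yields on collisions.
function isPolite(localJoinIndex: number, remoteJoinIndex: number): boolean {
  return localJoinIndex > remoteJoinIndex;
}

// Classify an incoming description the way the perfect negotiation pattern does.
function classifyIncoming(
  state: NegotiationState,
  descriptionType: "offer" | "answer",
): { offerCollision: boolean; ignoreOffer: boolean } {
  // A collision happens when an offer arrives while we are mid-offer ourselves.
  const offerCollision =
    descriptionType === "offer" &&
    (state.makingOffer || state.signalingState !== "stable");
  // The impolite peer ignores the colliding offer; the polite peer accepts it
  // (setRemoteDescription then rolls back its own pending offer implicitly).
  const ignoreOffer = !state.polite && offerCollision;
  return { offerCollision, ignoreOffer };
}
```

In a hook, the result would gate the call to setRemoteDescription: when ignoreOffer is true, the handler returns early and the local offer proceeds.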

Active speaker detection uses two paths:

SSRC Path: RTCRtpReceiver.getSynchronizationSources() for RTP audio levels
RMS Fallback: Web Audio API AnalyserNode for RMS energy calculations

The application selects between the two paths at runtime based on browser support.
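For the RMS fallback, the arithmetic is short enough to sketch. The functions below are illustrative only: in a browser the samples would come from AnalyserNode.getFloatTimeDomainData, and on the SSRC path the level would be read from the audioLevel field of each synchronization source instead. The 0.02 silence threshold and the function names are assumptions, not values from Hoovik.

```typescript
// Root-mean-square energy of one block of time-domain samples in [-1, 1].
function rmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

interface ParticipantLevel {
  participantId: string;
  level: number; // RMS energy or SSRC audioLevel, both on a 0..1 scale
}

// Pick the loudest participant above the threshold, or null if all are quiet.
function pickActiveSpeaker(levels: ParticipantLevel[], threshold = 0.02): string | null {
  let bestId: string | null = null;
  let bestLevel = threshold;
  for (const entry of levels) {
    if (entry.level > bestLevel) {
      bestId = entry.participantId;
      bestLevel = entry.level;
    }
  }
  return bestId;
}
```

Because both paths reduce to a level per participant, the selection step does not need to know which path produced the numbers.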

Real-Time Emotion Recognition

The emotion service runs Wav2Vec2 for audio, MediaPipe for video, and XGBoost ensemble models. The frontend sends emotion.frame and audio_chunk events directly to the service via dedicated Socket.IO connections.

Modality-aware processing: inference continues even if a participant disables camera or microphone. The service emits server.status and backpressure events so the frontend can adjust capture rates.
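One way a client might react to backpressure is to widen its capture interval quickly under pressure and narrow it slowly once pressure clears. The article does not describe the event payload or Hoovik's policy, so the boolean signal, the doubling and 50 ms recovery steps, and the limits below are all assumptions.

```typescript
interface CaptureLimits {
  minIntervalMs: number; // fastest allowed capture rate
  maxIntervalMs: number; // slowest allowed capture rate
}

// Compute the next frame-capture interval from the latest service signal.
function nextCaptureInterval(
  currentMs: number,
  underPressure: boolean,
  limits: CaptureLimits = { minIntervalMs: 200, maxIntervalMs: 2000 },
): number {
  // Back off multiplicatively under pressure, recover additively otherwise.
  const proposed = underPressure ? currentMs * 2 : currentMs - 50;
  return Math.min(limits.maxIntervalMs, Math.max(limits.minIntervalMs, proposed));
}
```

A Socket.IO handler for the backpressure event would call this with true, and a periodic timer would call it with false to drift back toward the normal rate.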

Emotion events collected during a meeting are stored locally and later submitted when generating an AI summary. The backend combines transcript-derived emotion with live captured emotion to highlight discrepancies between spoken content and observed emotions.
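The discrepancy check can be pictured as aligning two timelines. The sketch below is an illustration under assumed data shapes, not the backend's actual logic: for each transcript span it takes the most frequent live label observed during that span and reports the span when the two labels differ.

```typescript
interface TranscriptEmotionSpan { startSec: number; endSec: number; emotion: string; }
interface LiveEmotionSample { atSec: number; emotion: string; }
interface EmotionMismatch { startSec: number; endSec: number; spoken: string; observed: string; }

function findEmotionMismatches(
  spans: TranscriptEmotionSpan[],
  samples: LiveEmotionSample[],
): EmotionMismatch[] {
  const mismatches: EmotionMismatch[] = [];
  for (const span of spans) {
    // Count live labels whose timestamp falls inside this span.
    const counts = new Map<string, number>();
    for (const sample of samples) {
      if (sample.atSec >= span.startSec && sample.atSec < span.endSec) {
        counts.set(sample.emotion, (counts.get(sample.emotion) ?? 0) + 1);
      }
    }
    // Majority vote; ties go to the label seen first.
    let observed: string | null = null;
    let best = 0;
    for (const [emotion, count] of Array.from(counts.entries())) {
      if (count > best) {
        best = count;
        observed = emotion;
      }
    }
    // Spans with no live samples are skipped rather than flagged.
    if (observed !== null && observed !== span.emotion) {
      mismatches.push({ startSec: span.startSec, endSec: span.endSec, spoken: span.emotion, observed });
    }
  }
  return mismatches;
}
```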

Transcription Pipeline

The transcript service (FastAPI) processes meeting recordings asynchronously:

Audio upload → HTTP 202 Accepted
FFmpeg conversion
Whisper transcription
Segment merging
NLP emotion classification with DistilRoBERTa
Callback to Node.js backend with retry logic
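The last step, a callback with retry logic, is commonly done with exponential backoff. The transcript service itself is Python, and the article does not say which retry policy it uses, so the TypeScript sketch below only illustrates the general technique. The option names and defaults are assumptions, and sleep is injectable so the example can run without real waiting.

```typescript
interface RetryOptions {
  maxAttempts?: number;
  baseDelayMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

const realSleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run an operation, retrying failures with delays of base, 2*base, 4*base, ...
async function retryWithBackoff<T>(
  operation: (attempt: number) => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const { maxAttempts = 4, baseDelayMs = 200, sleep = realSleep } = options;
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      // No delay after the final attempt; the error is rethrown below.
      if (attempt < maxAttempts) {
        await sleep(baseDelayMs * Math.pow(2, attempt - 1));
      }
    }
  }
  throw lastError;
}
```

The operation passed in would be the HTTP POST that delivers the finished transcript to the Node.js backend. The backend endpoint should be idempotent, since a retry can arrive after a request that actually succeeded.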

RAG Pipeline

After transcripts are stored, they are indexed for semantic retrieval. Chunking preserves speaker attribution and timestamps when speaker segments are available; otherwise, a sliding-window strategy is used.
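The two chunking strategies can be sketched as follows. The data shapes, the 800-character budget, and the 120-word window with 30-word overlap are assumptions for illustration; the article gives no sizes.

```typescript
interface SpeakerSegment { speaker: string; startSec: number; endSec: number; text: string; }
interface TranscriptChunk { text: string; speaker?: string; startSec?: number; endSec?: number; }

// Merge consecutive segments from the same speaker until the size budget is hit,
// keeping the speaker name and the covered time range on each chunk.
function chunkBySpeaker(segments: SpeakerSegment[], maxChars = 800): TranscriptChunk[] {
  const chunks: TranscriptChunk[] = [];
  for (const seg of segments) {
    const last = chunks[chunks.length - 1];
    if (last && last.speaker === seg.speaker && last.text.length + 1 + seg.text.length <= maxChars) {
      last.text += " " + seg.text;
      last.endSec = seg.endSec;
    } else {
      chunks.push({ text: seg.text, speaker: seg.speaker, startSec: seg.startSec, endSec: seg.endSec });
    }
  }
  return chunks;
}

// Fallback: fixed-size word windows that overlap so context spans chunk borders.
function chunkBySlidingWindow(text: string, windowWords = 120, overlapWords = 30): TranscriptChunk[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = Math.max(1, windowWords - overlapWords);
  const chunks: TranscriptChunk[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push({ text: words.slice(start, start + windowWords).join(" ") });
    if (start + windowWords >= words.length) break; // this window reached the end
  }
  return chunks;
}

// Use speaker-aware chunking when segments exist, otherwise fall back.
function chunkTranscript(text: string, segments?: SpeakerSegment[]): TranscriptChunk[] {
  return segments && segments.length > 0 ? chunkBySpeaker(segments) : chunkBySlidingWindow(text);
}
```

Keeping speaker and timestamps on the chunk lets an answer cite who said something and when, which the sliding-window chunks cannot do.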

Embeddings come from nomic-embed-text-v1.5, cached in Redis. Indexing runs asynchronously via BullMQ workers to avoid blocking API requests.

Retrieval combines MongoDB Vector Search with Maximum Marginal Relevance (MMR) for relevance and diversity. Retrieved context is passed to Groq-hosted LLMs for answer generation. Session history supports multi-turn conversations.
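MMR re-ranks vector search candidates by repeatedly picking the item with the best score of lambda × relevance to the query minus (1 − lambda) × similarity to anything already picked. The sketch below is a plain in-memory version for illustration; the type names and the default lambda of 0.7 are assumptions, and in Hoovik the candidates would come from MongoDB Vector Search.

```typescript
interface EmbeddedChunk { id: string; embedding: number[]; }

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Greedy Maximum Marginal Relevance selection of k candidates.
function mmrSelect(query: number[], candidates: EmbeddedChunk[], k: number, lambda = 0.7): EmbeddedChunk[] {
  const selected: EmbeddedChunk[] = [];
  const remaining = candidates.slice();
  while (selected.length < k && remaining.length > 0) {
    let bestIndex = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < remaining.length; i++) {
      const relevance = cosineSimilarity(query, remaining[i].embedding);
      // Redundancy is the closest match among items already selected.
      const redundancy = selected.length === 0
        ? 0
        : Math.max(...selected.map((s) => cosineSimilarity(remaining[i].embedding, s.embedding)));
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = i;
      }
    }
    selected.push(remaining.splice(bestIndex, 1)[0]);
  }
  return selected;
}
```

With lambda at 1 the result is a pure relevance ranking. Lower values trade relevance for variety, which helps when several transcript chunks say nearly the same thing.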

Access control follows the same authorization model as transcript access: transcript owner, approved transcript request, or legacy transcripts without ownership metadata.
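Read literally, the three rules amount to a small predicate. The record shape and field names below are assumptions, and treating a missing owner as "legacy, therefore accessible" is one reading of the sentence above rather than confirmed behavior.

```typescript
// Hypothetical access metadata stored alongside a transcript.
interface TranscriptAccessRecord {
  ownerId?: string;               // absent on legacy transcripts
  approvedRequesterIds: string[]; // users whose transcript request was approved
}

function canAccessTranscript(userId: string, transcript: TranscriptAccessRecord): boolean {
  // Legacy transcripts carry no ownership metadata and stay accessible.
  if (transcript.ownerId === undefined) return true;
  // The owner always has access.
  if (transcript.ownerId === userId) return true;
  // Otherwise an approved transcript request is required.
  return transcript.approvedRequesterIds.indexOf(userId) !== -1;
}
```

Running the same predicate before both transcript reads and RAG queries keeps the two paths from drifting apart.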

Tradeoffs and Future Improvements

Meeting cleanup jobs execute independently in each backend process.
BullMQ workers run alongside the application server, not in dedicated worker processes.
The transcript service lacks a centralized job queue.
Safari media preview workarounds remain.

These decisions were acceptable at the current scale, but dedicated workers and queue-based processing are natural next steps.

Authentication

Authentication uses JWT access tokens with refresh token rotation. Login issues a short-lived JWT and an opaque refresh token stored only in an HttpOnly cookie. Refresh tokens are rotated on every refresh request to reduce replay risk.
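Rotation means each refresh token is single-use: presenting it yields a new one and invalidates the old one, so a stolen copy stops working once the legitimate client refreshes. The sketch below models only that rule, in memory. The class and method names are hypothetical, token generation is injected, and a real store would also hash tokens and track expiry.

```typescript
class RefreshTokenStore {
  private active = new Map<string, string>(); // refresh token -> user id
  private generateToken: () => string;

  constructor(generateToken: () => string) {
    this.generateToken = generateToken;
  }

  // Called at login: create the first refresh token for a user.
  issue(userId: string): string {
    const token = this.generateToken();
    this.active.set(token, userId);
    return token;
  }

  // Called on refresh: swap the presented token for a new one.
  // Returns null when the token is unknown or was already used.
  rotate(presented: string): { userId: string; refreshToken: string } | null {
    const userId = this.active.get(presented);
    if (userId === undefined) return null;
    this.active.delete(presented); // the old token can never be replayed
    return { userId, refreshToken: this.issue(userId) };
  }
}
```

On a successful rotate, the server would set the new token in the HttpOnly cookie and return a fresh short-lived JWT in the response body.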

Try It

Explore the interactive demo or browse the source code on GitHub.

Editor's Take

I've built a few WebRTC apps, and the distributed lock design for room joins is exactly the kind of detail that separates a hobby project from a real service. The emotion-aware summary feature is clever but I wonder about privacy implications—users might not expect their facial expressions to be analyzed and stored. Still, the modular service boundary design is worth studying. I'd love to see a follow-up on how they handle horizontal scaling under load.

— DevDigest Editorial

Key Takeaways

•Use Redis distributed locks with SET NX PX and Lua scripts to serialize room joins across multiple Node.js instances.
•Implement modality-aware ML inference to maintain emotion tracking when participants disable camera or microphone.
•Keep API endpoints responsive by processing transcription asynchronously and running embedding indexing in background workers (BullMQ).

Why It Matters

Hoovik demonstrates a production-grade architecture combining WebRTC, real-time ML, and RAG. If you're building a multi-service platform with video and AI, the distributed locking, modality-aware emotion inference, and async transcription pipeline are directly applicable patterns.

#nodejs #redis #WebRTC #RAG #real-time AI
