The Four Services of Hoovik
Hoovik is not a monolith. It's four services wired together:
- React frontend (Vite)
- Node.js backend (Express + Socket.IO)
- Python transcript service (FastAPI)
- Python emotion service (FastAPI + Socket.IO)
The backend runs as multiple PM2 processes sharing state via Redis and MongoDB. The Socket.IO Redis Adapter enables cross-process event delivery.
Distributed Room State with Redis
Room state can't live in process memory when multiple Node.js instances handle requests. Instead, mutable meeting state is stored in Redis:
HSET meeting:participants:
HDEL meeting:participants:
Join order is stored separately for WebRTC role assignment. Room joins are serialized using a Redis-backed distributed lock to prevent race conditions:
await withRoomLock(meetingCode, async () => {
  // join logic
});
The lock uses SET NX PX acquisition, token-based ownership, and Lua-script compare-and-delete release.
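The three lock properties named above can be sketched roughly as follows. This is a minimal illustration, not Hoovik's implementation: `FakeRedis` is a hypothetical in-memory stand-in so the example runs without a server, the `lock:room:` key format, the timeouts, and the extra `redis` parameter are assumptions, and the token generator is a placeholder.

```javascript
// Hypothetical in-memory stand-in for the two Redis operations the lock needs.
// A real client would send `SET key token NX PX ttl` and EVAL a Lua script.
class FakeRedis {
  constructor() {
    this.store = new Map(); // key -> { value, expiresAt }
  }

  // Mirrors SET key value NX PX ttlMs: succeeds only if the key is absent or expired.
  setNxPx(key, value, ttlMs) {
    const now = Date.now();
    const entry = this.store.get(key);
    if (entry && entry.expiresAt > now) return false;
    this.store.set(key, { value, expiresAt: now + ttlMs });
    return true;
  }

  // Mirrors the usual compare-and-delete Lua script:
  //   if redis.call("GET", KEYS[1]) == ARGV[1] then
  //     return redis.call("DEL", KEYS[1]) else return 0 end
  compareAndDelete(key, expected) {
    const entry = this.store.get(key);
    if (!entry || entry.expiresAt <= Date.now() || entry.value !== expected) return 0;
    this.store.delete(key);
    return 1;
  }
}

let tokenCounter = 0;
// Placeholder token; production code would use a cryptographically random value.
const makeToken = () => `${Date.now()}-${++tokenCounter}-${Math.random().toString(36).slice(2)}`;

async function withRoomLock(redis, meetingCode, fn, options = {}) {
  const { ttlMs = 5000, retryMs = 50, maxWaitMs = 2000 } = options; // assumed defaults
  const key = `lock:room:${meetingCode}`; // hypothetical key format
  const token = makeToken();
  const deadline = Date.now() + maxWaitMs;

  // Acquisition: retry SET NX PX until it succeeds or the wait budget runs out.
  while (!redis.setNxPx(key, token, ttlMs)) {
    if (Date.now() >= deadline) throw new Error(`could not lock room ${meetingCode}`);
    await new Promise((resolve) => setTimeout(resolve, retryMs));
  }

  try {
    return await fn();
  } finally {
    // Release only if the stored token is still ours, so a holder whose lock
    // expired cannot delete a lock that now belongs to another process.
    redis.compareAndDelete(key, token);
  }
}
```

The token check on release is what makes the TTL safe: without it, a slow join handler whose lock had already expired would delete whichever lock currently exists.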
WebRTC Signaling and Perfect Negotiation
Peer connections are managed through React hooks implementing the perfect negotiation pattern. The frontend supports multi-party video, ICE restarts, screen sharing, and remote participant management.
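The core of perfect negotiation is the offer-collision rule, which can be isolated as a pure function. The sketch below assumes the hook tracks a `makingOffer` flag and reads `signalingState` from the `RTCPeerConnection`. The returned action names and the `isPolite` rule (later joiner is polite) are illustrative assumptions, not taken from the Hoovik source.

```javascript
// Decide how to react to an incoming SDP description.
//   polite:         this peer's role in the pair
//   makingOffer:    true while our own offer is being created or sent
//   signalingState: RTCPeerConnection.signalingState when the description arrives
function decideOnDescription({ polite, makingOffer, signalingState }, description) {
  const offerCollision =
    description.type === "offer" && (makingOffer || signalingState !== "stable");

  // The impolite peer wins collisions: it drops the incoming offer and keeps its own.
  if (offerCollision && !polite) return "ignore";

  // The polite peer (or any peer without a collision) applies the remote offer
  // with setRemoteDescription, which rolls back its own pending offer if needed,
  // and then answers.
  if (description.type === "offer") return "accept-and-answer";

  // Answers are always applied.
  return "accept";
}

// Hypothetical role assignment from stored join order: the later joiner is polite.
function isPolite(localJoinIndex, remoteJoinIndex) {
  return localJoinIndex > remoteJoinIndex;
}
```

Because the two peers in a pair always disagree on who is polite, exactly one side yields when both send offers at once, which is why a stable ordering such as join order is a convenient source for the role.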
Active speaker detection uses two paths:
- SSRC Path: RTCRtpReceiver.getSynchronizationSources() for RTP audio levels
- RMS Fallback: Web Audio API AnalyserNode for RMS energy calculations
The application selects dynamically based on browser support.
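As a rough illustration of the fallback path and the feature check, the sketch below computes RMS energy over one frame of time-domain samples (the kind of buffer `AnalyserNode.getFloatTimeDomainData()` fills) and picks the loudest participant. The function names and the `0.02` silence threshold are assumptions made for the example.

```javascript
// Root-mean-square energy of one audio frame with samples in [-1, 1].
function computeRms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

// Pick the loudest participant whose level clears a silence threshold.
// levelsById maps participant id -> latest level; the threshold is an assumed value.
function pickActiveSpeaker(levelsById, threshold = 0.02) {
  let best = null;
  for (const [id, level] of Object.entries(levelsById)) {
    if (level >= threshold && (best === null || level > best.level)) {
      best = { id, level };
    }
  }
  return best ? best.id : null;
}

// Feature check for choosing a path. In a browser this would be called as
// supportsSsrcPath(window.RTCRtpReceiver && window.RTCRtpReceiver.prototype).
function supportsSsrcPath(receiverPrototype) {
  return typeof receiverPrototype?.getSynchronizationSources === "function";
}
```

Both paths end in the same place: a per-participant level that `pickActiveSpeaker` can compare, so the rest of the UI does not need to know which one produced it.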
Real-Time Emotion Recognition
The emotion service runs Wav2Vec2 for audio, MediaPipe for video, and XGBoost ensemble models. The frontend sends emotion.frame and audio_chunk events directly to the service via dedicated Socket.IO connections.
Processing is modality-aware: inference continues even if a participant disables their camera or microphone. The service emits server.status and backpressure events so the frontend can adjust capture rates.
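One plausible way for a client to react to backpressure is to widen the capture interval quickly and shrink it back slowly. The sketch below is an assumption about the adjustment policy, not a description of Hoovik's frontend; the limits, the factors, and the boolean `underBackpressure` input are all hypothetical.

```javascript
// Assumed bounds for the delay between captured frames or audio chunks.
const MIN_INTERVAL_MS = 200;
const MAX_INTERVAL_MS = 2000;

// Double the interval while the service reports backpressure,
// otherwise recover by 10% per step, always staying within the bounds.
function nextCaptureInterval(currentMs, underBackpressure) {
  const next = underBackpressure ? currentMs * 2 : currentMs * 0.9;
  return Math.min(MAX_INTERVAL_MS, Math.max(MIN_INTERVAL_MS, Math.round(next)));
}
```

A Socket.IO `backpressure` handler would call `nextCaptureInterval(current, true)` and reschedule the capture timer, while a periodic tick without backpressure would call it with `false`.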
Emotion events collected during a meeting are stored locally and later submitted when generating an AI summary. The backend combines transcript-derived emotion with live captured emotion to highlight discrepancies between spoken content and observed emotions.
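The comparison step might look something like the following. The data shapes are assumptions: transcript segments are taken to carry `start`, `end`, and a text-derived `emotion`, and live events a timestamp `t` and an observed `emotion`. Majority vote within each segment is one simple aggregation choice among many.

```javascript
// Most frequent emotion label among a set of live events, or null if there are none.
function dominantEmotion(events) {
  const counts = new Map();
  for (const e of events) counts.set(e.emotion, (counts.get(e.emotion) || 0) + 1);
  let best = null;
  let bestCount = 0;
  for (const [emotion, count] of counts) {
    if (count > bestCount) {
      best = emotion;
      bestCount = count;
    }
  }
  return best;
}

// For each transcript segment, compare the text-derived emotion with the
// dominant live emotion observed during the same time window.
function findDiscrepancies(segments, liveEvents) {
  const discrepancies = [];
  for (const seg of segments) {
    const inWindow = liveEvents.filter((e) => e.t >= seg.start && e.t < seg.end);
    const observed = dominantEmotion(inWindow);
    if (observed !== null && observed !== seg.emotion) {
      discrepancies.push({ start: seg.start, end: seg.end, spoken: seg.emotion, observed });
    }
  }
  return discrepancies;
}
```

Segments with no live events are skipped rather than flagged, since a missing observation is not evidence of a mismatch.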
Transcription Pipeline
The transcript service (FastAPI) processes meeting recordings asynchronously:
- Audio upload → HTTP 202 Accepted
- FFmpeg conversion
- Whisper transcription
- Segment merging
- NLP emotion classification with DistilRoBERTa
- Callback to Node.js backend with retry logic
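The segment-merging step can be illustrated with a short sketch. It is written in JavaScript to match the other examples here, although the transcript service itself is Python, and the merge rule (same speaker, gap of at most `maxGapSec`) is an assumption about what merging means in this pipeline.

```javascript
// Merge consecutive segments when they share a speaker (if any) and the
// silence between them is at most maxGapSec. Input segments are not mutated.
function mergeSegments(segments, maxGapSec = 0.5) {
  const merged = [];
  for (const seg of segments) {
    const last = merged[merged.length - 1];
    const canMerge =
      last !== undefined &&
      last.speaker === seg.speaker &&
      seg.start - last.end <= maxGapSec;

    if (canMerge) {
      last.end = seg.end;
      last.text = `${last.text} ${seg.text}`.trim();
    } else {
      merged.push({ ...seg }); // copy so the caller's objects stay untouched
    }
  }
  return merged;
}
```

Merging before emotion classification gives the classifier longer, more coherent inputs than the short fragments a speech model typically emits.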
RAG Pipeline
After transcripts are stored, they are indexed for semantic retrieval. Chunking preserves speaker attribution and timestamps when speaker segments are available; otherwise, a sliding-window strategy is used.
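The two chunking strategies could be sketched like this. The transcript shape (`segments` with `speaker`, `start`, `end`, `text`, plus a flat `text` field) and the word-based window sizes are assumptions for illustration.

```javascript
// Speaker-aware chunking: one chunk per speaker segment,
// keeping attribution and timestamps alongside the text.
function chunkBySpeaker(segments) {
  return segments.map((s) => ({
    text: s.text,
    speaker: s.speaker,
    start: s.start,
    end: s.end,
  }));
}

// Sliding-window fallback over words; consecutive chunks share `overlap` words.
function chunkSlidingWindow(text, windowSize = 200, overlap = 50) {
  if (overlap >= windowSize) throw new Error("overlap must be smaller than windowSize");
  const words = text.split(/\s+/).filter(Boolean);
  const step = windowSize - overlap;
  const chunks = [];
  for (let i = 0; i < words.length; i += step) {
    chunks.push({ text: words.slice(i, i + windowSize).join(" ") });
    if (i + windowSize >= words.length) break; // last window reached the end
  }
  return chunks;
}

// Prefer speaker segments when present, otherwise fall back to the sliding window.
function chunkTranscript(transcript, options = {}) {
  const hasSegments = Array.isArray(transcript.segments) && transcript.segments.length > 0;
  return hasSegments
    ? chunkBySpeaker(transcript.segments)
    : chunkSlidingWindow(transcript.text, options.windowSize, options.overlap);
}
```

The overlap exists so that a sentence cut by a window boundary still appears whole in at least one chunk.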
Embeddings come from nomic-embed-text-v1.5, cached in Redis. Indexing runs asynchronously via BullMQ workers to avoid blocking API requests.
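A cache-aside lookup for embeddings might be shaped as follows. Everything beyond the model name is an assumption: the `emb:` key format, the `cache` object (anything with `get` and `set`, standing in for a Redis client), and the injected `embed` function. The inline FNV-1a hash only keeps the example dependency-free; a real cache key would more likely use SHA-256.

```javascript
// 32-bit FNV-1a over UTF-16 code units. Non-cryptographic and collision-prone;
// used here only to avoid imports.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16);
}

// Cache-aside: return the cached vector if present, otherwise compute and store it.
// Including the model name in the key avoids mixing vectors from different models.
async function getEmbedding(cache, embed, text, model = "nomic-embed-text-v1.5") {
  const key = `emb:${model}:${fnv1a(text)}`; // hypothetical key format
  const hit = await cache.get(key);
  if (hit !== undefined && hit !== null) return JSON.parse(hit);

  const vector = await embed(text);
  await cache.set(key, JSON.stringify(vector));
  return vector;
}
```

A plain `Map` satisfies the `cache` interface for local experiments, which is how the sketch can be exercised without Redis.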
Retrieval combines MongoDB Vector Search with Maximum Marginal Relevance (MMR) for relevance and diversity. Retrieved context is passed to Groq-hosted LLMs for answer generation. Session history supports multi-turn conversations.
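MMR reranking over a candidate set can be sketched in a few lines. The example assumes the vector search has already returned candidates as `{ id, vector }` objects and uses cosine similarity for both terms. The `lambda` default of 0.5 is an arbitrary choice, not a value from the Hoovik source.

```javascript
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Greedy MMR: repeatedly pick the candidate maximizing
//   lambda * sim(query, c) - (1 - lambda) * max over selected s of sim(c, s)
// lambda = 1 is pure relevance; lower values favor diversity.
function mmr(queryVector, candidates, k, lambda = 0.5) {
  const selected = [];
  const remaining = [...candidates];

  while (selected.length < k && remaining.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    remaining.forEach((cand, idx) => {
      const relevance = cosine(queryVector, cand.vector);
      const redundancy =
        selected.length === 0
          ? 0
          : Math.max(...selected.map((s) => cosine(cand.vector, s.vector)));
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = idx;
      }
    });
    selected.push(remaining.splice(bestIdx, 1)[0]);
  }
  return selected.map((s) => s.id);
}
```

With near-duplicate chunks in the candidate set, the redundancy term pushes the second pick toward a chunk that covers different content, which tends to give the LLM a broader context window for the same token budget.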
Access control follows the same authorization model as transcript access: transcript owner, approved transcript request, or legacy transcripts without ownership metadata.
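The three access paths map naturally onto a single predicate. The field name `ownerId` and the list of approved requester ids are hypothetical, and treating legacy transcripts as readable by any authenticated user is an assumption about what the legacy rule permits.

```javascript
// Returns true if the user may query this transcript through the RAG pipeline.
function canAccessTranscript(userId, transcript, approvedRequestUserIds = []) {
  // Legacy transcripts carry no ownership metadata (assumed to be open).
  if (transcript.ownerId === undefined || transcript.ownerId === null) return true;

  // The owner always has access.
  if (transcript.ownerId === userId) return true;

  // Otherwise the user needs an approved transcript request.
  return approvedRequestUserIds.includes(userId);
}
```

Sharing one predicate between transcript reads and retrieval keeps the two from drifting apart, so a user can never retrieve chunks of a transcript they could not open directly.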
Tradeoffs and Future Improvements
- Meeting cleanup jobs execute independently in each backend process.
- BullMQ workers run alongside the application server, not in dedicated worker processes.
- The transcript service lacks a centralized job queue.
- Safari media preview workarounds remain.
These decisions are acceptable at the current scale, but dedicated worker processes and queue-based processing are natural next steps.
Authentication
Authentication uses JWT access tokens with refresh token rotation. Login issues a short-lived JWT and an opaque refresh token stored only in an HttpOnly cookie. Refresh tokens are rotated on every refresh request to reduce replay risk.
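Rotation itself is a small state machine: each refresh consumes the presented token and issues a new one. The sketch below keeps tokens in memory and takes the token generator as a parameter. The class name, the storage, and the generator are all assumptions; a real deployment would persist tokens with an expiry and generate them from a cryptographically secure source.

```javascript
class RefreshTokenStore {
  // makeToken: function returning a fresh opaque token string (injected by the caller).
  constructor(makeToken) {
    this.makeToken = makeToken;
    this.active = new Map(); // refresh token -> userId
  }

  // Called at login: create the first refresh token for a user.
  issue(userId) {
    const token = this.makeToken();
    this.active.set(token, userId);
    return token;
  }

  // Called on every refresh request: the old token is invalidated immediately
  // and a new one is returned, so a captured token can be replayed at most once.
  rotate(oldToken) {
    const userId = this.active.get(oldToken);
    if (userId === undefined) return null; // unknown or already-used token
    this.active.delete(oldToken);
    return { userId, refreshToken: this.issue(userId) };
  }
}
```

On a successful `rotate`, the server would set the new refresh token in the HttpOnly cookie and return a new short-lived JWT in the response body.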
Try It
Explore the interactive demo or browse the source code on GitHub.


