Building WebRTC Recording Infrastructure using GStreamer and Temporal

Vivek Chandela, Shantanu Sharma, Geetish Nayak
11 Jun, 2025

In Part 1 of this series, we detailed how we built an in-house streaming platform using GStreamer and Temporal to support multi-host live streams with low-latency RTMP/HLS delivery. In this second part, we extend that pipeline with recording, supporting both composite and individual recordings of a livestream.

Overview

Our recording infrastructure focuses on WebRTC video and audio. The goal is to allow moderation teams to access clean, archived footage—composite or individual—for compliance, post-production, or analysis.

We output:

Video: 640x360 resolution @15 fps, H.264 codec @5 Mbps
Audio: AAC codec @44 kbps
Container: MP4 format

Recordings are chunked into 15-second MP4 snippets, available shortly after capture.

Why WebRTC Recording Is Hard

Daily.co put it well in this post. There are 3 main reasons:

Synchronizing the participants' video and audio streams and handling layout changes at runtime.
Testing livestreams is cumbersome because of the number of edge cases.
Choosing the point of view:
- Client-side direct recording: represents what one specific participant saw on their device. This is brittle across environments.
- Server-rendered recording: captures the call from a neutral viewpoint. This is what we used for composite and individual recording.
- Raw track recording: records each participant's streams into separate files, leaving the final video to be assembled in post-production. This is hard to sync and multiplex.

Recording Is Just Another Stream

Instead of forwarding encoded AV streams to an RTMP server, we send them to disk/cloud for later use. This is done by leveraging the tee element in GStreamer, which allows the stream to be duplicated. We introduced a mux stage using mp4mux or flvmux to synchronize video/audio streams. One path goes to rtmpsink (for streaming) and another to filesink (for storage).

Note: by default, splitmuxsink wraps a muxer (mp4mux) and a sink (filesink). Either can be replaced with another pre-built element or with a custom one.
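To make the tee layout concrete, here is a minimal sketch that only builds a gst-launch-style pipeline description as a string. It is not our production pipeline: the test sources stand in for the composited WebRTC feed, the function name, RTMP URL and file pattern are hypothetical, and the description has not been validated against a specific GStreamer version. The encoder settings mirror the output spec above (x264enc takes kbit/s, fdkaacenc takes bit/s, and splitmuxsink's max-size-time is in nanoseconds).

```python
SEGMENT_NS = 15 * 1_000_000_000  # 15 seconds, expressed in nanoseconds


def build_pipeline(rtmp_url: str, out_pattern: str) -> str:
    """Return a pipeline description with one streaming and one recording branch."""
    parts = [
        # Stand-in sources (assumption): replace with the real composited AV feed.
        "videotestsrc is-live=true ! videoconvert",
        "! video/x-raw,width=640,height=360,framerate=15/1",
        "! x264enc bitrate=5000 key-int-max=15 tune=zerolatency",
        "! h264parse ! tee name=vt",
        "audiotestsrc is-live=true ! audioconvert",
        "! fdkaacenc bitrate=44000 ! aacparse ! tee name=at",
        # Branch 1: mux to FLV and push to the RTMP server.
        f"flvmux name=live streamable=true ! rtmpsink location={rtmp_url}",
        "vt. ! queue ! live.video",
        "at. ! queue ! live.audio",
        # Branch 2: splitmuxsink (mp4mux + filesink by default) writes 15 s files.
        f"splitmuxsink name=rec location={out_pattern} max-size-time={SEGMENT_NS}",
        "vt. ! queue ! rec.video",
        "at. ! queue ! rec.audio_0",
    ]
    return " ".join(parts)


if __name__ == "__main__":
    print(build_pipeline("rtmp://example.invalid/live/key", "/tmp/segment_%05d.mp4"))
```

Each tee output gets its own queue so that a slow consumer on one branch is decoupled from the other branch.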

From Cloud Sink to Filesink: Why We Changed

Initially, we used awss3sink as the sink for splitmuxsink to write directly to cloud storage. However, we ran into issues:

File naming race conditions
Hard to cleanly signal 'file is complete'
Lack of atomic file visibility

We switched to a filesink that writes to disk, and then used a file watcher (which uses inotify internally on Linux) as part of a Temporal job to:

Detect completed MP4s
Rename them with session metadata
Upload to the cloud atomically
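The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not our watcher: it polls the directory with the standard library instead of using inotify, it treats every segment except the highest-numbered one as complete (on the assumption that the muxer writes segments strictly in order), and the segment_NNNNN.mp4 naming, the session ID format and the upload callback are all hypothetical.

```python
import os
import re
from pathlib import Path
from typing import Callable, List, Tuple

# Hypothetical naming scheme for files written by the recording branch.
SEGMENT_RE = re.compile(r"^segment_(\d+)\.mp4$")


def completed_segments(directory: Path) -> List[Tuple[int, Path]]:
    """Return (index, path) for every segment except the newest one.

    Assumption: the highest-numbered segment may still be open for writing,
    so it is skipped until a later segment appears.
    """
    found = []
    for path in directory.iterdir():
        match = SEGMENT_RE.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    found.sort()
    return found[:-1]


def finalize(directory: Path, session_id: str,
             upload: Callable[[Path], None]) -> List[Path]:
    """Rename completed segments with session metadata, then hand them to upload."""
    finalized = []
    for index, path in completed_segments(directory):
        target = directory / f"{session_id}_{index:05d}.mp4"
        # os.replace is atomic when source and target share a filesystem,
        # and the new name no longer matches SEGMENT_RE, so it is not re-processed.
        os.replace(path, target)
        upload(target)  # hypothetical callback, e.g. a cloud storage client
        finalized.append(target)
    return finalized
```

Running finalize on a timer, or from an inotify close-write event, keeps the "file is complete" decision in one place, outside the GStreamer pipeline.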

Composite vs Individual Recording

Composite Recording

One unified canvas with all participants rendered.
Dimensions fixed at 360x640.
Represents the entire conversation from a neutral viewpoint.

Individual Recording

Each host is recorded separately.
Triggered dynamically based on camera on/off events.
Two variants: MP4 (audio+video), MP3 (audio-only).

Individual recording is just a special case of composite recording. With 4 hosts in a livestream, for example, we need 5 Temporal workflows:

One for composite recording + RTMP streaming.
Four for individual recording, one per host. We disable RTMP streaming in these workflows.
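The fan-out can be sketched as a small planning function. This is only an illustration: it does not call the Temporal SDK, and the class, field names and workflow ID format are hypothetical. Each returned spec would be the input to one workflow start.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RecordingWorkflow:
    workflow_id: str
    kind: str                # "composite" or "individual"
    host_id: Optional[str]   # None for the composite workflow
    rtmp_enabled: bool


def plan_workflows(stream_id: str, host_ids: List[str]) -> List[RecordingWorkflow]:
    """One composite workflow with RTMP, plus one RTMP-less workflow per host."""
    plan = [RecordingWorkflow(f"{stream_id}-composite", "composite", None, True)]
    for host in host_ids:
        plan.append(
            RecordingWorkflow(f"{stream_id}-individual-{host}", "individual", host, False)
        )
    return plan


if __name__ == "__main__":
    for spec in plan_workflows("stream42", ["h1", "h2", "h3", "h4"]):
        print(spec)
```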

Real-time Scoring and Moderation

In addition to cloud storage, we also publish 15-second snippet metadata to the Data Science team via Pub/Sub. Their moderation jobs fetch these snippets, perform analysis, and score the audio and video quality.

If any score falls below a quality threshold, the moderation system triggers a follow-up request to retrieve a longer segment (based on start and end time) from the cloud. These extended snippets are then merged and routed to a human reviewer through a moderation service API.
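As a rough sketch of the follow-up step, the helper below maps a requested time range to the 15-second snippets that cover it. It assumes snippets are numbered from 0 and aligned to exact 15-second boundaries from the start of the recording; real segments are cut on keyframes, so boundaries can drift slightly. The score names and threshold are hypothetical.

```python
import math
from typing import Dict, List

CHUNK_SECONDS = 15


def needs_review(scores: Dict[str, float], threshold: float) -> bool:
    """True if any score falls below the quality threshold."""
    return any(value < threshold for value in scores.values())


def snippet_indices(start_s: float, end_s: float,
                    chunk_s: int = CHUNK_SECONDS) -> List[int]:
    """Indices of the snippets that overlap the range [start_s, end_s)."""
    if start_s < 0 or end_s <= start_s:
        return []
    first = int(start_s // chunk_s)
    last = math.ceil(end_s / chunk_s)  # exclusive upper bound
    return list(range(first, last))


if __name__ == "__main__":
    if needs_review({"audio": 0.9, "video": 0.4}, threshold=0.5):
        print(snippet_indices(20, 50))  # [1, 2, 3]
```

The snippets at the returned indices are the ones that would be fetched from the cloud and merged before being sent to a reviewer.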

Final Thoughts

Across these two posts, we walked through how to build a complete, in-house livestreaming and recording solution using GStreamer and Temporal.

In Part 1, we focused on the real-time streaming infrastructure, which involves compositing multiple hosts, encoding with x264 and fdk-aac, and broadcasting via RTMP and HLS.

In Part 2, we extended this architecture with robust recording support, covering composite and individual recordings, file pipeline design, moderation scoring via Pub/Sub, and workflow orchestration with Temporal.

Together, these systems form a resilient and extensible foundation for both live video experiences and scalable post-stream workflows.
