
Implement audio video synchronization #4325

Merged
nanangizz merged 21 commits into master from avsync on Mar 27, 2025
Conversation

@nanangizz
Member

@nanangizz nanangizz commented Feb 26, 2025

How to use

For app using PJSUA-LIB

By default it is enabled for audio & video calls. It can be disabled on a per-call basis by setting PJSUA_CALL_NO_MEDIA_SYNC in pjsua_call_setting.flag.

For app using PJMEDIA

  1. App creates the AV sync using pjmedia_av_sync_create().
  2. App adds all media to be synchronized using pjmedia_av_sync_add_media(). If the media is an audio/video stream, use pjmedia_stream_common_set_avsync() instead.
  3. Each time a media receives an RTCP-SR, update the AV sync using pjmedia_av_sync_update_ref().
  4. Each time a media returns a frame to be rendered, e.g: via port.get_frame(), update the AV sync using pjmedia_av_sync_update_pts(); the function may request a delay adjustment (increase or decrease the delay).
  5. Remove media from the AV sync using pjmedia_av_sync_del_media().
  6. Destroy the AV sync using pjmedia_av_sync_destroy().

For app using PJMEDIA AVI File Player

By default it is enabled. It can be disabled by setting the PJMEDIA_AVI_FILE_NO_SYNC flag when creating the AVI file player.

Global macro settings

Configurable via config_site.h.

/**
 * Maximum tolerable presentation lag from the earliest to the latest media,
 * in milliseconds, in inter-media synchronization. When the delay is
 * higher than this setting, the media synchronizer will request the slower
 * media to speed up. And if after a number of speed up requests the delay
 * is still beyond this setting, the fastest media will be requested to
 * slow down.
 *
 * Default: 45 ms
 */
#ifndef PJMEDIA_AVSYNC_MAX_TOLERABLE_LAG_MSEC
#   define PJMEDIA_AVSYNC_MAX_TOLERABLE_LAG_MSEC    45
#endif


/**
 * Maximum number of speed up requests to synchronize presentation time,
 * before a slow down request to the fastest media is issued.
 *
 * Default: 10
 */
#ifndef PJMEDIA_AVSYNC_MAX_SPEEDUP_REQ_CNT
#   define PJMEDIA_AVSYNC_MAX_SPEEDUP_REQ_CNT       10
#endif
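For example, both defaults can be overridden from config_site.h before building (the values below are illustrative only, not recommendations):

```c
/* config_site.h: tighten the tolerable inter-media lag and allow more
 * speed-up attempts before slowing down the fastest media.
 * (Illustrative values, not recommendations.) */
#define PJMEDIA_AVSYNC_MAX_TOLERABLE_LAG_MSEC   30
#define PJMEDIA_AVSYNC_MAX_SPEEDUP_REQ_CNT      15
```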

Logic in pjmedia_av_sync_update_pts()

  • Calculate the absolute timestamp (or NTP timestamp) of the frame playback based on the reference NTP+RTP timestamp from RTCP-SR packet and the frame RTP timestamp from RTP packet. This timestamp is usually called presentation time (PTS).
  • If the PTS is the largest seen so far, store it as the AV sync's maximum PTS. Otherwise, calculate the difference, or delay, of this PTS from the maximum PTS, then request the media to speed up by as much as the delay.
  • If, after a specific number of speed up requests (configurable via PJMEDIA_AVSYNC_MAX_SPEEDUP_REQ_CNT), the lag is still beyond the tolerable value (configurable via PJMEDIA_AVSYNC_MAX_TOLERABLE_LAG_MSEC), the function will issue a slow down request to the fastest media (the one that set the AV sync's maximum PTS).
  • To prevent an unnecessarily large delay from being applied to all media for the synchronization, a mechanism is implemented where slow-down requests are marked down and speed-up requests are marked up.

@sauwming
Member

An early comment since it's still a draft.
The term 'av' seems to imply the sync will apply only to audio and video, even though the description looks like it can cover general cases.
Especially with the upcoming text media, which may also need to be synced (the sync may not be available in the early version, but perhaps in the future).

Perhaps removing the 'av' and calling it pjmedia_sync would be sufficient?

@nanangizz
Member Author

An early comment since it's still a draft.

Sure; actually, early feedback is one of the goals of creating this draft.

The term 'av' seems to imply the sync will apply only to audio and video, even though the description looks like it can cover general cases. Especially with the upcoming text media, which may also need to be synced (the sync may not be available in the early version, but perhaps in the future).

Perhaps removing the 'av' and calling it pjmedia_sync would be sufficient?

Yes, the sync module is generic for any media, especially delivered via RTP/RTCP and using a shared/synchronized NTP source.

IMO the 'av_sync' prefix enhances intuitiveness, i.e: it sounds related to presentation time synchronization, while the 'sync' prefix may be somewhat ambiguous (e.g: what to sync?). I recall how the 'ssl_sock' term was chosen over tls_sock for catchiness/intuitiveness reasons; when it was being developed, TLS was already around replacing SSL (even the PJSIP transport was already using 'TLS', see also #957). What do you think?

@sauwming
Member

The term should be inter-media synchronization, but yes, it's not quite as catchy, so I suppose it's okay if the 'av' stays.

The jitter buffer is designed for audio, while in video it is just a normal/dummy buffer. Hence managing delay in an audio stream is much easier than in a video stream. As video presentation is normally later than audio, it should be fine to adjust the delay in the audio stream only for the synchronization.
ms_diff = ntp_to_ms(&ntp_diff);

/* Smoothen and round down the delay */
ms_diff = ((ms_diff + 19 * media->smooth_diff) / 200) * 10;
Contributor


Should an upper bound be considered? Where do the magic numbers come from?

Member Author


Just added the upper bound of 60s in the last commit; actually not really sure if it is needed.

The magic numbers are weights for the smoothing: the last delay uses a weight of 19, the newly received delay uses a weight of 1, and the sum is divided by 20 for the new delay. Instead of dividing by 20, it divides by 200 and then multiplies by 10, rounding down to the nearest 10.

Thanks for the feedback.

@nanangizz
Member Author

Ready to review perhaps :)

Currently the delay adjustment is implemented in audio stream only, because:

  • video playback is essentially play-as-soon-as-possible (with a built-in minimum delay for stability); the delay/burst/optimal-delay calculations in the jitter buffer do not really apply (due to multiple RTP packets per video frame)
  • it is easier to manage delay in audio as the jitter buffer already has info about the delay & optimal delay
  • it should be sufficient for synchronization.

@nanangizz nanangizz marked this pull request as ready for review March 5, 2025 08:13
@andreas-wehrmann
Contributor

Especially with the upcoming text media, which may also need to be synced (the sync may not be available in the early version, but perhaps in the future).

Sorry, quick question about that: are there any specific plans to implement this (more specifically: RTT with T.140)?
I'm asking because I will need this later (probably this year) and was thinking about implementing it myself and maybe bringing it upstream, unless there are any efforts underway already which I could support?

@sauwming
Member

sauwming commented Mar 5, 2025

Yes, I'm currently implementing it. The basic functionality is already working (can send and receive text), but there's still a lot more to do, such as redundancy, docs, etc.

@andreas-wehrmann
Contributor

Yes, I'm currently implementing it.

Is this work available publicly? From a quick glance I couldn't find a proper branch.
I'd like to take a look at it if that's alright with you (even if it's in its very early stages).

@sauwming
Member

sauwming commented Mar 5, 2025

Not yet public. Will create a branch early next week.

Member

@sauwming sauwming left a comment


First stage of the CR; I haven't checked av_sync.c yet.

sizeof(call->media_prov[0]) * call->med_prov_cnt);

/* Create synchronizer */
if ((maudcnt+mvidcnt) > 1 && !call->av_sync) {
Member


Do we need a pjsua config to enable/disable the sync, as well as to choose which media will be synced?

Member Author

@nanangizz nanangizz Mar 6, 2025


Yes, enable/disable should be needed. As this is a new feature, I guess we can postpone fine-tuning settings such as selecting which media to sync or allowing two or more synchronizers in a call; we can add those settings later based on user feedback?

Member


The thing is, I'm not sure if the text stream should be synchronized by default, so that's why I proposed the setting to let the user decide.

Having said that, I'm not sure whether we should also implement text sync, so I suppose enable/disable is sufficient for now, with the docs noting that it currently only applies to audio and video.

Member Author


In the future, text sync may be useful for streaming a video with subtitles (it may require something like a text player, similar to the wav/video player). However, for now I think text sync is not so urgent.

Just to add a sample scenario of two or more synchronizers in a call: in a conference call, there can be two sets of audio+video streams (so in total there will be 4 streams). Rather than synchronizing all 4 streams using one synchronizer, perhaps it is better to employ a separate synchronizer for each set (due to different original sources, they may have different codec delays, network delays, NTP sources, etc). So in the future, when designing fine-tune settings per media, perhaps we need to also consider this scenario.

@sauwming
Member

sauwming commented Mar 6, 2025

Currently the delay adjustment is implemented in audio stream only, because:

  • video playback is essentially play-as-soon-as-possible (with a built-in minimum delay for stability); the delay/burst/optimal-delay calculations in the jitter buffer do not really apply (due to multiple RTP packets per video frame)
  • it is easier to manage delay in audio as the jitter buffer already has info about the delay & optimal delay
  • it should be sufficient for synchronization.

Actually I was thinking the other way around. Because while it's true that video is initially played asap without buffering/prefetching, eventually it will be video that will be lagging due to its high bandwidth, and overall processing power required.
And when it happens (the video lags behind), then I imagine the mechanism would be to start dropping frames.

So perhaps it should be tested in such scenario, for example using HD video and/or minimal connection speed such as mobile data.

@nanangizz
Member Author

Currently the delay adjustment is implemented in audio stream only, because:

  • video playback is essentially play-as-soon-as-possible (with a built-in minimum delay for stability); the delay/burst/optimal-delay calculations in the jitter buffer do not really apply (due to multiple RTP packets per video frame)
  • it is easier to manage delay in audio as the jitter buffer already has info about the delay & optimal delay
  • it should be sufficient for synchronization.

Actually I was thinking the other way around. Because while it's true that video is initially played asap without buffering/prefetching, eventually it will be video that will be lagging due to its high bandwidth, and overall processing power required. And when it happens (the video lags behind), then I imagine the mechanism would be to start dropping frames.

So perhaps it should be tested in such scenario, for example using HD video and/or minimal connection speed such as mobile data.

Actually the description above is a bit misleading; it was supposed to explain that the video stream does not have delay management, and implementing it is not so simple for now because the jitter buffer does not really help, so it would need to be implemented in the video stream itself from scratch. In audio, the optimal delay calculation is already done by the jbuf, and an increased delay can easily be inserted into the jbuf.

Yes, in the real world the video tends to lag behind the audio (from the codec & delivery aspects). That's why it is sufficient to have delay adjustment implemented only in audio, as mostly we need to add delay to the audio (beyond its optimal delay).

Btw, adding delay to video is perhaps not so complex or risky; will try to implement it.

@sauwming
Member

I'm not sure whether the video "delay" adjustment is complete, but it seems that the current patch can only add/decrease delay to the video? I believe a more realistic usage would be to speed up the video, i.e. by dropping or skipping frames.

In other words, instead of adding delay to audio, which will increase the latency/lag of the entire av streams, there can be another option for the user, which is to speed up the video, so the streams still feel like real time.

Also move speed-up markup by 4/3 to the synchronizer (was in each stream)
@nanangizz
Member Author

I'm not sure whether the video "delay" adjustment is complete, but it seems that the current patch can only add/decrease delay to the video? I believe a more realistic usage would be to speed up the video, i.e. by dropping or skipping frames.

In other words, instead of adding delay to audio, which will increase the latency/lag of the entire av streams, there can be another option for the user, which is to speed up the video, so the streams still feel like real time.

Previously I assumed that video decoding is always done as fast as possible, with no room for decreasing the delay. However, after implementing your idea here, the log showed some frames being skipped.

@nanangizz
Member Author

Next, I will try to integrate the AV sync into the AVI player on the source side (for streaming & local playback). Currently the existing AVI player synchronization seems to be done on the rendering side (for local playback only?).

@sauwming
Member

Right, currently sync is only done in aviplay.

@nanangizz nanangizz merged commit 078bfea into master Mar 27, 2025
42 checks passed
@nanangizz nanangizz deleted the avsync branch March 27, 2025 08:28
@nanangizz nanangizz requested a review from Copilot April 10, 2025 09:08
Contributor

Copilot AI left a comment


Copilot reviewed 19 out of 24 changed files in this pull request and generated 1 comment.

Files not reviewed (5)
  • pjmedia/build/Makefile: Language not supported
  • pjmedia/build/pjmedia.vcproj: Language not supported
  • pjmedia/build/pjmedia.vcxproj: Language not supported
  • pjmedia/build/pjmedia.vcxproj.filters: Language not supported
  • pjsip-apps/src/swig/symbols.i: Language not supported

pjmedia_port_destroy(&fport[i]->base);
}

if (*p_streams && (*p_streams)->avsync) {

Copilot AI Apr 10, 2025


In the cleanup block where pjmedia_av_sync_del_media is called, a NULL is passed as the synchronizer parameter. To ensure proper removal of media from the synchronizer, pass the actual AV sync instance (e.g., (*p_streams)->avsync) instead of NULL.

@nanangizz nanangizz added this to the release-2.16 milestone Apr 16, 2025
BarryYin pushed a commit to BarryYin/pjproject that referenced this pull request Feb 3, 2026