Introduction
Many streaming providers are looking for ways to offer a more premium, higher-quality experience to their users. One often overlooked component of streaming quality is audio: specifically, which audio bitrates, channel layouts, and languages are available, and how these options can be delivered to viewers on a range of devices. While there are many ways of improving video quality and experience, such as Per-Title Encoding, Multi-Bitrate Video, High Dynamic Range (HDR), and high resolutions, there are also some great ways of enhancing a user’s experience with premium HLS audio. Some of the most important considerations for audio streaming are:
- Adaptive Streaming: serving multiple audio bitrates for various streaming conditions
- Reduced Bandwidth & Device Compatibility: multi-codec audio for better compression at reduced bitrates
- Improved User Experience: 5.1 (or greater) surround sound or even lossless audio
- Accessibility and Localization: such as multi-language or descriptive audio
You can learn even more about how audio encoding affects the streaming experience in this blog.
In Bitmovin’s 2023-24 Video Developer Report, we saw that immersive audio ranked in the top 15 areas for innovation, while audio transcription was the #1 ranked use case for AI and ML. Furthermore, though AAC remains the most widely used audio codec, mostly due to its wide device support, Dolby Digital/+ and Dolby Atmos are the #2 and #3 ranked audio codecs that streaming companies either currently support or plan to support in the near future.
With HLS and its multivariant approach, this is all possible, but understanding how to construct and organize your HLS multivariant playlist can be tricky at first. In this tutorial we will look at some best practices in HLS for serving alternate audio renditions, and at the end of the article we will show an example of how to do this with the Bitmovin Encoder.
Basic audio stream packaging
The most basic way to package audio for HLS is to mux the audio track with each video track. This works for very simple configurations where you only output a single AAC Stereo audio track at a single bitrate. While the benefit of this approach is simplicity, it has many limitations: it cannot support multi-channel surround sound, advanced codecs, or multiple languages. Keeping audio and video muxed together also makes storage and delivery inefficient, because the audio is duplicated in every video variant. Demuxing audio and video, by contrast, allows the use of containers such as fragmented MP4 or CMAF, which are more performant for client devices since they don’t have to demux or transmux the segments in real time.
A multivariant playlist output for this would look something like:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-STREAM-INF:BANDWIDTH=4255267,AVERAGE-BANDWIDTH=4255267,CODECS="avc1.4d4032,mp4a.40.2",RESOLUTION=2560x1440
manifest_1.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3062896,AVERAGE-BANDWIDTH=3062896,CODECS="avc1.4d4028,mp4a.40.2",RESOLUTION=1920x1080
manifest_2.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1591232,AVERAGE-BANDWIDTH=1591232,CODECS="avc1.4d4028,mp4a.40.2",RESOLUTION=1600x900
manifest_3.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1365632,AVERAGE-BANDWIDTH=1365632,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=1280x720
manifest_4.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=862995,AVERAGE-BANDWIDTH=862995,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=960x540
manifest_5.m3u8
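To inspect a playlist like the one above programmatically, the attribute list of each EXT-X-STREAM-INF tag has to be split with care, because quoted values such as CODECS contain commas themselves. The following is a minimal Python sketch of that idea, not a full HLS parser; the helper names `parse_attributes` and `list_variants` are made up for this illustration.

```python
import re

# Matches KEY=VALUE pairs where VALUE is either a quoted string (which may
# contain commas, as CODECS does) or an unquoted token.
ATTRIBUTE_PATTERN = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attributes(tag_line: str) -> dict[str, str]:
    """Return the attributes of a tag such as #EXT-X-STREAM-INF as a dict."""
    _, _, attribute_list = tag_line.partition(":")
    return {
        key: value.strip('"')
        for key, value in ATTRIBUTE_PATTERN.findall(attribute_list)
    }


def list_variants(playlist_text: str) -> list[dict[str, str]]:
    """Pair every #EXT-X-STREAM-INF tag with the URI on the following line."""
    lines = [line.strip() for line in playlist_text.splitlines() if line.strip()]
    variants = []
    for index, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF:") and index + 1 < len(lines):
            attributes = parse_attributes(line)
            attributes["URI"] = lines[index + 1]
            variants.append(attributes)
    return variants


playlist = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=862995,AVERAGE-BANDWIDTH=862995,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=960x540
manifest_5.m3u8
"""

for variant in list_variants(playlist):
    # A muxed variant lists a video codec and an audio codec in CODECS.
    print(variant["RESOLUTION"], variant["CODECS"].split(","), variant["URI"])
```

Seeing both an `avc1` and an `mp4a` entry in CODECS without an AUDIO attribute is a reasonable hint that the variant carries muxed audio.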
Audio/Video demuxing
A better approach is to demux the audio and video tracks. Luckily, HLS makes this simple with the EXT-X-MEDIA tag, which is the standard way of declaring alternate content renditions for audio, subtitles, closed captions, or video (mostly used for alternative viewing angles, such as in live sports). By using EXT-X-MEDIA to decouple audio from video, we can add many great audio features, such as alternate/dubbed language tracks, surround sound tracks, multiple audio qualities, and multi-codec audio.
By supplying audio tracks with EXT-X-MEDIA tags, we can explicitly add each audio track that we want to output and group them together. Then we can associate each Video Variant (EXT-X-STREAM-INF) with one of the Audio Media groups.
Using the previous example of a single AAC Stereo Audio track, a demuxed audio/video output would look like:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="AAC_Stereo",LANGUAGE="en",NAME="English - Stereo",AUTOSELECT=YES,DEFAULT=YES,URI="audio_aac.m3u8"
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4032,mp4a.40.2",RESOLUTION=2560x1440,AUDIO="AAC_Stereo"
manifest_1.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4028,mp4a.40.2",RESOLUTION=1920x1080,AUDIO="AAC_Stereo"
manifest_2.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4028,mp4a.40.2",RESOLUTION=1600x900,AUDIO="AAC_Stereo"
manifest_3.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=1280x720,AUDIO="AAC_Stereo"
manifest_4.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=960x540,AUDIO="AAC_Stereo"
manifest_5.m3u8
Here, you can see that we first declare a single Audio Media (EXT-X-MEDIA) playlist for our audio track and give its GROUP-ID attribute the value “AAC_Stereo”. Then each Video Variant’s EXT-X-STREAM-INF tag uses the AUDIO attribute to associate its video track with the Audio Media group “AAC_Stereo”.
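The link between the two tags is just a shared string: the GROUP-ID on EXT-X-MEDIA and the AUDIO attribute on EXT-X-STREAM-INF. As a rough Python sketch of how such a playlist could be assembled (the function names are hypothetical, and the BANDWIDTH value is a placeholder because the example above elides it):

```python
def quoted(value: str) -> str:
    """Wrap a value in the double quotes HLS uses for string attributes."""
    return f'"{value}"'


def audio_media_tag(group_id: str, name: str, language: str, uri: str,
                    default: bool = True) -> str:
    """Build an #EXT-X-MEDIA tag that declares one audio rendition."""
    attributes = [
        "TYPE=AUDIO",
        f"GROUP-ID={quoted(group_id)}",
        f"LANGUAGE={quoted(language)}",
        f"NAME={quoted(name)}",
        "AUTOSELECT=YES",
        f"DEFAULT={'YES' if default else 'NO'}",
        f"URI={quoted(uri)}",
    ]
    return "#EXT-X-MEDIA:" + ",".join(attributes)


def stream_inf_tag(bandwidth: int, codecs: str, resolution: str,
                   audio_group: str) -> str:
    """Build an #EXT-X-STREAM-INF tag that points at an audio group."""
    return (
        f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},CODECS={quoted(codecs)},"
        f"RESOLUTION={resolution},AUDIO={quoted(audio_group)}"
    )


def build_playlist() -> str:
    lines = [
        "#EXTM3U",
        "#EXT-X-INDEPENDENT-SEGMENTS",
        audio_media_tag("AAC_Stereo", "English - Stereo", "en", "audio_aac.m3u8"),
        # BANDWIDTH is a placeholder value, not a figure from the article.
        stream_inf_tag(1_000_000, "avc1.4d401f,mp4a.40.2", "960x540", "AAC_Stereo"),
        "manifest_5.m3u8",
    ]
    return "\n".join(lines) + "\n"


print(build_playlist())
```

Keeping the group name in one place (here, the string passed to both helpers) avoids the most common mistake, an AUDIO value that matches no declared GROUP-ID.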
Multiple audio bitrates
But now let’s imagine we want to better optimize our Adaptive Streaming by delivering our AAC Stereo audio in multiple bitrates, such as a high (196kbps) and a low (64kbps) rendition, so that the higher-resolution Video Variants can take advantage of higher-quality, higher-bitrate audio given the extra bandwidth available when streaming those variants. We can accomplish this by encoding our audio at both a low and a high bitrate, placing the two outputs in separate groups, and then deciding which Video Variant gets which audio bitrate. For example, our 720p or below variants get the lower-quality audio by default, and our variants above 720p get the higher-quality audio by default. Think of these as defaults only, because most modern Players that stream HLS can independently pick which audio quality to play based on Adaptive-Bitrate streaming conditions.
An example using low and high AAC Stereo tracks would look like:
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac-stereo-64",LANGUAGE="en",NAME="English - Stereo",AUTOSELECT=YES,DEFAULT=YES,URI="audio_aac_64k.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac-stereo-196",LANGUAGE="en",NAME="English - Stereo",AUTOSELECT=YES,DEFAULT=NO,URI="audio_aac_196k.m3u8"
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4032,mp4a.40.2",RESOLUTION=2560x1440,AUDIO="aac-stereo-196"
manifest_1.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4028,mp4a.40.2",RESOLUTION=1920x1080,AUDIO="aac-stereo-196"
manifest_2.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4028,mp4a.40.2",RESOLUTION=1600x900,AUDIO="aac-stereo-196"
manifest_3.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=1280x720,AUDIO="aac-stereo-64"
manifest_4.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d401f,mp4a.40.2",RESOLUTION=960x540,AUDIO="aac-stereo-64"
manifest_5.m3u8
In this example, we now have two audio tracks, one for each bitrate, and therefore two Audio Media (EXT-X-MEDIA) playlists defined, each with a unique GROUP-ID attribute but the same NAME attribute. This is a good way of declaring that the audio tracks share the same language, channel configuration, and codec, but differ in quality. Now we can declare that each Video Variant (EXT-X-STREAM-INF) that is 720p or below sets its AUDIO group to the low-bitrate audio track (GROUP-ID="aac-stereo-64"), while the variants above 720p get the higher-bitrate AUDIO group (GROUP-ID="aac-stereo-196") by default (but again, most Players can manage the audio tracks independently for optimal adaptive streaming).
This is at least an improvement on the previous single-bitrate audio packaging, but there are still plenty of enhancements we can make!
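The 720p cut-off used above is a simple rule that can be expressed in a few lines. This Python sketch assumes the two group names from the example and treats the threshold as a tunable parameter; the function name `audio_group_for` is invented for illustration.

```python
def audio_group_for(resolution: str, threshold_height: int = 720) -> str:
    """Pick the default audio group for a variant from its RESOLUTION value."""
    _, height = resolution.lower().split("x")
    if int(height) <= threshold_height:
        return "aac-stereo-64"   # low-bitrate group for 720p and below
    return "aac-stereo-196"      # high-bitrate group for everything above


resolutions = ["2560x1440", "1920x1080", "1600x900", "1280x720", "960x540"]
assignment = {resolution: audio_group_for(resolution) for resolution in resolutions}

for resolution, group in assignment.items():
    print(resolution, "->", group)
```

The result matches the AUDIO attributes in the playlist above: the three variants above 720p map to the 196k group, and the 720p and 540p variants map to the 64k group.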
More efficient AAC
The previous examples all rely on Low Complexity AAC (AAC-LC) because this basic audio codec is supported by virtually every playback device. It is necessary to always have at least one AAC-LC track to be able to support older devices. However, most devices these days can support more efficient versions of AAC such as High Efficiency AAC (AAC-HE), which comes in two main versions: v2, which is used for bitrates up to 48kbps, and v1, which is used for bitrates up to 96kbps.
So let’s adapt our previous example to not rely on two (or more) different AAC-LC audio tracks, and instead output one AAC-HE v1, one AAC-HE v2, and one AAC-LC rendition. The tricky part here is that we want to put each of these into a different GROUP-ID so that the client Player can decide which to use based on which codecs it supports, but we also want each Video Variant to be able to use any of those audio tracks. To accomplish this, all we need to do is duplicate each Video Variant for each of the three unique Audio Media GROUP-IDs.
A note on grouping audio renditions
The Apple authoring spec recommends creating one audio group for each pair of codec and channel count.
We now have 3 different versions of the AAC codec, so we will have 3 different audio groups.
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac_lc-stereo-128k",LANGUAGE="en",NAME="English - Stereo",AUTOSELECT=YES,DEFAULT=YES,URI="audio_aaclc_128k.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac_he1-stereo-64k",LANGUAGE="en",NAME="English - Stereo",AUTOSELECT=YES,DEFAULT=NO,URI="audio_aache1_64k.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac_he2-stereo-32k",LANGUAGE="en",NAME="English - Stereo",AUTOSELECT=YES,DEFAULT=NO,URI="audio_aache2_32k.m3u8"
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4032,mp4a.40.2",RESOLUTION=2560x1440,AUDIO="aac_lc-stereo-128k"
manifest_1.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4032,mp4a.40.5",RESOLUTION=2560x1440,AUDIO="aac_he1-stereo-64k"
manifest_1.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4032,mp4a.40.29",RESOLUTION=2560x1440,AUDIO="aac_he2-stereo-32k"
manifest_1.m3u8
## Repeat above approach for each additional Video Variant
In this example, you can see that we replicated the 1440p variant 3 times, once for each Audio Media GROUP-ID, and the same would then be repeated for each additional Video Variant. This allows the client Player to decide, for a given Video Variant, which audio group to use based on codec support and streaming conditions. Also note how each Video Variant’s CODECS attribute is updated to carry the matching audio codec identifier.
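The "repeat for each Video Variant" step is a cross product of video variants and audio groups. A small Python sketch of that expansion follows; it emits records rather than finished tags, since BANDWIDTH is elided in the example above, and the second video variant and the function name `expand_variants` are assumptions added for illustration.

```python
from itertools import product

# Audio groups from the example above, mapped to their CODECS identifiers.
AUDIO_GROUPS = {
    "aac_lc-stereo-128k": "mp4a.40.2",   # AAC-LC
    "aac_he1-stereo-64k": "mp4a.40.5",   # AAC-HE v1
    "aac_he2-stereo-32k": "mp4a.40.29",  # AAC-HE v2
}

# (video playlist URI, video codec string, resolution)
VIDEO_VARIANTS = [
    ("manifest_1.m3u8", "avc1.4d4032", "2560x1440"),
    ("manifest_2.m3u8", "avc1.4d4028", "1920x1080"),
]


def expand_variants(video_variants, audio_groups):
    """Return one EXT-X-STREAM-INF record per (video variant, audio group)."""
    entries = []
    for (uri, video_codec, resolution), (group_id, audio_codec) in product(
        video_variants, audio_groups.items()
    ):
        entries.append({
            "CODECS": f"{video_codec},{audio_codec}",  # audio codec follows the group
            "RESOLUTION": resolution,
            "AUDIO": group_id,
            "URI": uri,  # the same video playlist is reused for every group
        })
    return entries


for entry in expand_variants(VIDEO_VARIANTS, AUDIO_GROUPS):
    print(entry)
```

Two video variants and three audio groups yield six entries, and only the CODECS and AUDIO values differ between entries that share a URI.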
Surround sound audio
Now, let’s say we also want to support 5.1 surround sound for those clients which can benefit from it. First we need to decide which surround sound codec we want to support. Let’s use Dolby Digital AC-3 for this example. Since we are now relying on a more advanced audio codec for the optimal surround experience, it is also important to consider devices that have 5.1 or greater speaker setups but can NOT decode Dolby Digital. For these we will also include a secondary 5.1 track using the basic AAC-LC codec. Now, we will create 2 new Audio Media playlists with unique GROUP-ID and NAME attributes.
A note on downmixing from 5.1 audio sources
In this example, we will assume the source has a Dolby Digital surround audio track. From that single audio source, we will create our AC-3 surround track, implicitly convert to our AAC surround track, and automatically downmix the source 5.1 to our various AAC 2.0 Stereo outputs using the Bitmovin Encoder, as shown in the sample code at the bottom of this article. Alternatively, you can do all sorts of mixing and channel-swapping, or work with distinct audio input files, such as a separate file for each channel. You can learn more about that here.
Don’t forget about grouping audio renditions
As previously mentioned, the Apple authoring spec recommends creating one audio group for each pair of codec and channel count.
We now have 5 unique combinations of codecs and channel counts, so we will have 5 different audio groups.
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac_lc-stereo-128k",LANGUAGE="en",NAME="English - Stereo",AUTOSELECT=YES,DEFAULT=YES,URI="audio_aac_128k.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac_he1-stereo-64k",LANGUAGE="en",NAME="English - Stereo",AUTOSELECT=YES,DEFAULT=NO,URI="audio_aache1_64k.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac_he2-stereo-32k",LANGUAGE="en",NAME="English - Stereo",AUTOSELECT=YES,DEFAULT=NO,URI="audio_aache2_32k.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac_lc-5_1-320k",LANGUAGE="en",NAME="English - 5.1",AUTOSELECT=YES,DEFAULT=NO,URI="audio_aac_lc_5_1_320k.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="dolby",LANGUAGE="en",NAME="English - Dolby",CHANNELS="6",URI="audio_dolbydigital.m3u8"
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4032,mp4a.40.2",RESOLUTION=2560x1440,AUDIO="aac_lc-stereo-128k"
manifest_1.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4032,mp4a.40.5",RESOLUTION=2560x1440,AUDIO="aac_he1-stereo-64k"
manifest_1.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4032,mp4a.40.29",RESOLUTION=2560x1440,AUDIO="aac_he2-stereo-32k"
manifest_1.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4032,mp4a.40.2",RESOLUTION=2560x1440,AUDIO="aac_lc-5_1-320k"
manifest_1.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4d4032,ac-3",RESOLUTION=2560x1440,AUDIO="dolby"
manifest_1.m3u8
## Repeat above approach for each additional Video Variant
Here you can see that we now have the 1440p variant replicated a total of 5 times, once for each Audio Media GROUP-ID, which allows the client Player to select the most appropriate audio and video track combination.
Again, note how each duplicated Video Variant has an updated CODECS attribute to represent the audio codec associated with it. One major reason we duplicate each Video Variant for each Audio Media GROUP-ID is that most devices cannot handle switching between audio codecs during playback, so as the Adaptive-Bitrate logic on the Player switches between different Video Variants, it will pick the variant that has the same audio codec it has been using. Additionally, in HLS, we cannot simply list the Video Variant once and add all of the various audio codecs to the CODECS attribute. This is because, per HLS, the client device MUST be able to support all of the CODECS listed on a given Video Variant (EXT-X-STREAM-INF) to avoid possible playback failures. So instead, we separate out the Video Variants for each codec and channel-count combination.
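The "must support every listed codec" rule is easy to see from the client's side. The Python sketch below is a simplified model of that filtering, not how any particular player implements it; the device capability sets and the combined variant at the end are hypothetical.

```python
def is_supported(codec: str, supported: set[str]) -> bool:
    """True if the codec equals a supported entry or is a more specific
    form of it (for example "avc1.4d4032" for the entry "avc1")."""
    return any(
        codec == entry or codec.startswith(entry + ".") for entry in supported
    )


def playable_variants(variants, supported):
    """Keep only variants whose CODECS entries are ALL supported."""
    return [
        variant for variant in variants
        if all(is_supported(codec.strip(), supported)
               for codec in variant["CODECS"].split(","))
    ]


VARIANTS = [
    {"CODECS": "avc1.4d4032,mp4a.40.2", "AUDIO": "aac_lc-stereo-128k"},
    {"CODECS": "avc1.4d4032,mp4a.40.29", "AUDIO": "aac_he2-stereo-32k"},
    {"CODECS": "avc1.4d4032,ac-3", "AUDIO": "dolby"},
]

# Hypothetical capability sets for two kinds of devices.
legacy_device = {"avc1", "mp4a.40.2"}
dolby_device = {"avc1", "mp4a.40.2", "ac-3"}

print([v["AUDIO"] for v in playable_variants(VARIANTS, legacy_device)])
print([v["AUDIO"] for v in playable_variants(VARIANTS, dolby_device)])

# A single variant that lists every audio codec would be rejected by the
# legacy device, which is why the variants are duplicated per audio group.
combined = [{"CODECS": "avc1.4d4032,mp4a.40.2,ac-3", "AUDIO": "all"}]
print(playable_variants(combined, legacy_device))
```

Note that the match is exact for audio identifiers, so a device that lists only `mp4a.40.2` is not treated as supporting `mp4a.40.29`.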
Multi-language audio
This is all great, but what if we want to support additional dubbed audio language tracks or even Descriptive Audio tracks? Luckily, that is rather simple to do. We can create additional Audio Media playlists for each language and add them to the existing GROUP-IDs, which are already logically grouped by codec and channel pairing per the Apple authoring spec, depending on which codecs and formats we want to support for each language.
#EXTM3U
#EXT-X-INDEPENDENT-SEGMENTS
#EXT-X-VERSION:6
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="AAC-HE-V1-Stereo",NAME="English-Stereo",LANGUAGE="en",DEFAULT=NO,URI="audio_aache1_stereo.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="AAC-HE-V1-Stereo",NAME="Spanish-Stereo",LANGUAGE="es",DEFAULT=NO,URI="audio_aache1_stereo_es.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="AAC-HE-V2-Stereo",NAME="English-Stereo",LANGUAGE="en",DEFAULT=NO,URI="audio_aache2_stereo.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="AAC-HE-V2-Stereo",NAME="Spanish-Stereo",LANGUAGE="es",DEFAULT=NO,URI="audio_aache2_stereo_es.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="AAC-LC-5.1",NAME="English-5.1",LANGUAGE="en",DEFAULT=NO,URI="audio_aaclc-5_1.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="AAC-LC-5.1",NAME="Spanish-5.1",LANGUAGE="es",DEFAULT=NO,URI="audio_aaclc-5_1_es.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="AAC-LC-Stereo",NAME="English-Stereo",LANGUAGE="en",DEFAULT=NO,URI="audio_aaclc_stereo.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="AAC-LC-Stereo",NAME="Spanish-Stereo",LANGUAGE="es",DEFAULT=NO,URI="audio_aaclc_stereo_es.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="AC-3-5.1",NAME="English-Dolby",LANGUAGE="en",CHANNELS="6",DEFAULT=NO,URI="dolby-ac3-5_1.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="AC-3-5.1",NAME="Spanish-Dolby",LANGUAGE="es",CHANNELS="6",DEFAULT=NO,URI="dolby-ac3-5_1_es.m3u8"
#EXT-X-STREAM-INF:...,CODECS="avc1.4D401F,ac-3",RESOLUTION=1280x720,AUDIO="AC-3-5.1"
video_720_3000000.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4D401F,mp4a.40.29",RESOLUTION=1280x720,AUDIO="AAC-HE-V2-Stereo"
video_720_3000000.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4D401F,mp4a.40.2",RESOLUTION=1280x720,AUDIO="AAC-LC-Stereo"
video_720_3000000.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4D401F,mp4a.40.2",RESOLUTION=1280x720,AUDIO="AAC-LC-5.1"
video_720_3000000.m3u8
#EXT-X-STREAM-INF:...,CODECS="avc1.4D401F,mp4a.40.5",RESOLUTION=1280x720,AUDIO="AAC-HE-V1-Stereo"
video_720_3000000.m3u8
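Adding a language therefore only adds EXT-X-MEDIA lines; the EXT-X-STREAM-INF entries stay as they were. The Python sketch below generates the renditions for two of the groups above. The URI suffix rule (no suffix for English, `_es` for Spanish) is inferred from the sample playlist, and the function name `rendition_tags` is invented for illustration.

```python
LANGUAGES = {"en": "English", "es": "Spanish"}

# group id -> (label used in NAME, base name of the audio playlist)
GROUPS = {
    "AAC-LC-Stereo": ("Stereo", "audio_aaclc_stereo"),
    "AAC-HE-V1-Stereo": ("Stereo", "audio_aache1_stereo"),
}


def rendition_tags(groups, languages, primary_language="en"):
    """Emit one #EXT-X-MEDIA tag per (audio group, language) pair."""
    tags = []
    for group_id, (label, base_uri) in groups.items():
        for code, language_name in languages.items():
            # Assumed naming rule: only non-primary languages get a suffix.
            suffix = "" if code == primary_language else f"_{code}"
            tags.append(
                f'#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="{group_id}",'
                f'NAME="{language_name}-{label}",LANGUAGE="{code}",'
                f'DEFAULT=NO,URI="{base_uri}{suffix}.m3u8"'
            )
    return tags


for tag in rendition_tags(GROUPS, LANGUAGES):
    print(tag)
```

Within a group, each rendition needs its own NAME, which the language prefix takes care of here.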
How does this differ from DASH?
In DASH, demuxed Audio and Video tracks are grouped into separate AdaptationSets for a given Period. This means a given Video AdaptationSet is not directly linked to one specific Audio track; rather, the client Player independently picks a Video Representation from the Video AdaptationSet and an Audio Representation from the Audio AdaptationSet. So with DASH, we don’t have to worry about re-stating Video tracks for each group of Audio tracks, as they are managed independently of each other.
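To put rough numbers on the difference, the following Python sketch counts manifest entries for a ladder of 5 video variants, 5 audio groups, and 2 languages (the sizes used in the examples in this article). It is a simplified counting model under those assumptions; real manifests may be organized differently.

```python
def hls_entry_count(video_variants: int, audio_groups: int, languages: int) -> dict[str, int]:
    """Every video variant is restated once per audio group in HLS."""
    return {
        "EXT-X-STREAM-INF": video_variants * audio_groups,
        "EXT-X-MEDIA": audio_groups * languages,
    }


def dash_entry_count(video_variants: int, audio_groups: int, languages: int) -> dict[str, int]:
    """Video and audio Representations are listed independently in DASH."""
    return {
        "video Representations": video_variants,
        "audio Representations": audio_groups * languages,
    }


print(hls_entry_count(5, 5, 2))   # {'EXT-X-STREAM-INF': 25, 'EXT-X-MEDIA': 10}
print(dash_entry_count(5, 5, 2))  # {'video Representations': 5, 'audio Representations': 10}
```

The audio side is the same size in both cases; the multiplication only affects how often the video variants are listed in HLS.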
Additional notes
The video codecs you choose to support may also determine which audio codecs and container formats you use. For example, if you encode video to VP9, you may want to consider using the Vorbis or Opus audio codecs.
In this example, we used AC-3 for Dolby Digital 5.1, but you may consider using Enhanced AC-3, more commonly referred to as E-AC-3, for additional channel support (such as 7.1 or more) or spatial audio support like Dolby Atmos. Other premium surround sound codec options are DTS-HD and DTS:X.
Premium HLS audio example with the Bitmovin Encoder & Manifest Generator
The GitHub sample linked below is a pseudo-code example using the Bitmovin JavaScript/TypeScript SDK that demonstrates outputting multi-bitrate, multi-codec, multi-channel, and multi-language audio tracks. This can greatly enhance the user’s experience, as it allows for streaming the best-quality and most appropriate audio for each device’s codec support and speaker channel configuration.
With the Bitmovin Encoder, we can use one master audio file/stream (Dolby Digital surround in this example) for each language and easily downmix it to 2.0 stereo or implicitly convert it to AAC 5.1. Then, once we have created each desired audio track, we will use the Bitmovin Manifest Generator to create our HLS multivariant playlists.