Media Servers

MPEG Audio Standards

It is quite difficult to gather a holistic view of MPEG audio standards. This article is intended to serve as a resource for high-level understanding of the most common MPEG audio standards.

MPEG is an abbreviation for Moving Pictures Experts Group, an ISO/IEC workgroup. Formally known as ISO/IEC JTC1/SC29/WG11, MPEG was established by the ISO/IEC standardization body in 1988 to develop generic standards for the coded representation of moving pictures, associated audio, and their combination. If you think that's a jargon mouthful, it gets better.

MPEG's full, official name sounds like a bureaucracy department in the former Soviet Union: International Organization for Standardization and the International Electrotechnical Commission Joint Technical Committee One Standards Committee 29 Working Group 11. Its website is https://jtc1info.org/. Personally, I'll continue calling it by its nickname: MPEG.

Standards by Committee

The MPEG committee describes itself as, "... the standards development environment where experts come together to develop worldwide Information and Communication Technology (ICT) standards for business and consumer applications."1 Sounds rather vague, eh?

There are multiple MPEG standards for audio and video. You may be familiar with the terms MPEG-2 and MPEG-4, which are very common in multimedia discussions. What you might not be aware of is the fact these standards are broken down into sections - called "parts" - and in some cases there are multiple parts for both video and audio. Point being, when referencing an MPEG standard it's best to also specify the exact Part of the standard. Does the author/speaker/etc. referring to the audio or video section of MPEPG-x? Which one?

Illustrating this point, here's a table of the MPEG-2 and MPEG-4 standards relevant to A/V (audio/visual) interests.

Amd = Amendment
BC = Backward Compatible; format is {version}-{part}
History of MPEG Audio Standards
Release Part Edition Codecs Max
Channels
Year BC? Standard
MPEG-21 -- 2nd -- -- 2019 -- ISO/IEC 21000:2019
MPEG-4 3 5th ALS, SLS 65,536 2019 M2-P7 ISO/IEC 14496-3:2019
MPEG-21 -- 2nd -- -- 2018 -- ISO/IEC 21000:2016/Amd 1:2018
MPEG-21 -- 2nd -- -- 2016 -- ISO/IEC 21000:2016
MPEG-4 3 4th SLS
(HD-AAC)
65,536 2015 M2-P7 ISO/IEC 14496-3:2009/Amd 5:2015
MPEG-4 3 4th ALS 65,536 2013 M2-P7 ISO/IEC 14496-3:2009/Amd 4:2013
MPEG-D
MPEG-4
3
3
1st
4th
eAAC++
(xAAC-HE)
65,536 2012 M4-P3 ISO/IEC 23003-3:2012/Amd 3:2016
IEC 14496-3:2009/Amd 3:2012
MPEG-4 3 4th AAC-ELDv2 48 + 16 2012 -- IEC 14496-3:2009/Amd 3:2012
MPEG-4 3 3rd AAC-ELD 48 + 16 2008 -- ISO/IEC 14496-3:2009/COR 7
MPEG-21 -- 1st -- -- 2008 -- ISO/IEC 21000:2005/Amd 1:2008
MPEG-2 7 4th AAC
(AAC-LC)
8 2007 -- ISO/IEC 13818-7:2006/AMD 1:2007
MPEG-4 3 3rd SLS
(HD-AAC)
65,536 2006 M2-P7 ISO/IEC 14496-3:2005/Amd 3:2006
MPEG-4 3 3rd AAC++
(HE-AAC v2)
48 + 16 2006 M2-P7 ISO/IEC 14496-3:2005/Amd 2:2006
MPEG-2 7 4th AAC
(AAC-LC)
8 2006 -- ISO/IEC 13818-7:2006
MPEG-21 -- 1st -- -- 2005 -- ISO/IEC 21000
MPEG-4 3 2nd ALS 65,536 2005 M2-P7 ISO/IEC 14496-3:2005
MPEG-2 7 3rd AAC-LC
(AAC)
8 2004 -- ISO/IEC 13818-7:2004
MPEG-4 3 2nd AAC+
(HE-AAC v1)
48 + 16 2003 M2-P7 ISO/IEC 14496-3:2001/Amd 1:2003
MPEG-4 3 2nd AAC-LD 48 + 16 2000 -- ISO/IEC 14496-3:1999/Amd 1:2000
MPEG-4 3 1st AAC
(AAC-LC)
48 + 16 1999 M2-P7 ISO/IEC 14496-3
MPEG-2 3 2nd ADIF (MP3) 8 1998 M1-P3 ISO/IEC 13818-3:1998
MPEG-2 7 1st AAC
(ADIF, ADTS)
8 1997 -- ISO/IEC 13818-7
MPEG-2 3 1st ADIF, LOAS
MP3, MP2, MP1
8 1995 M1-P3 ISO/IEC 13818-3
MPEG-1 3 Layer III MP3 2 1992 -- ISO/IEC 11172-3
MPEG-1 3 Layer II MP2 2 1992 -- ISO/IEC 11172-3
MPEG-1 3 Layer I PASC (MP1) 2 1992 -- ISO/IEC 11172-3

The current version of any given MPEG audio protocol is always the most recent iteration of its MPEG standard.

As you can see in the table above, the MPEG standards are not chronological relative to their common names (e.g. MPEG-2, MPEG-4). They don't align neatly in order. Though they do change over time, MPEG standards remain with their particular iteration. For example, the MPEG-1 standard from 1992 is still considered to be current, even though its use is obsolete. MPEG-2 superceded MPEG-1, yet MPEG-2 was not replaced by MPEG-4. These standards continue to cohabitate within the multimedia universe.

Audio Standards Evolution

The MPEG standards are connected. From their humble beginning in 1995 as MPEG-1 Part 3 Layer I to the lofty goals of MPEG-21 and MPEG-DASH, these standards are very important because they represent attempts to bring order to chaos in a world frequently dominated by proprietary multimedia codecs pushed by the manufacturers who produce, license, and profit from them. MPEG standards on-the-other-hand are "open," meaning they are designed to accommodate different types of content.

MP3 and AAC

Although MP3 remains a ubiquitous audio format and file standard in terms of usage, it is quite outdated. AAC was designed to replace MP3 and improves on it in every respect. AAC is also a modular audio format. It makes use of a number of different codecs, which are applied based on "profiles." The default AAC mode is the Low Complexity codec; also known as AAC-LC. Although not as old as MP3, the core AAC codec is also very old, though it is still relevant today. More recent AAC codecs and profiles (pre-defined groups of AAC codecs) are primarily oriented toward low-bitrate streaming applications where the target device is either processor-limited and/or bandwidth limited. There are also a few bucking that trend, eschewing bandwidth concerns in favor of higher fidelity, such as ALS, SLS, and HD-AAC. The latter provide the highest quality audio reproductions with the fewest audio artifacts.

* Total (all channels combined); Constant Bit Rate
MPEG Audio Format Comparison (Chronological Order)
File Ext Standard Max
Channels
Bitrates (Kbps)* Sampling Rates (kHz) Year
Low High Typical
.mp1 | .m1a MP1 (MPEG-1 Part 3, Layer I) 2 32 448 384 32 / 44.1 / 48 1992
.mp2 MP2 (MPEG-1 Part 3, Layer II) 2 32 384 192 32 / 44.1 / 48 1992
.mp3 MP3 (MPEG-1 Part 3, Layer III) 2 32 320 128 32 / 44.1 / 48 1992
.mp3 MP3 (MPEG-2 Part 3) 5.1 + 2 32 320 128 32 / 44.1 / 48 1995
.aac AAC (MPEG-2 Part 7) 5.1 + 2 8 576 192 32 / 44.1 / 48 1997
.aac AAC-LC (MPEG-4 Part 3) 48 8 1,024 192 8 - 96 1999
.aac | .m4a | .mp4
.3gp | .m4a | .m4p
AAC+ / HE-AAC v1 (MPEG-4) 48 + 16 6 96 24 16 - 96 1999
.vqf TwinVQ (AAC) [MPEG-4] 48 + 16 80 192 96 8 / 11.025 1999
.aac AAC-LD (MPEG-4) 48 + 16 32 32 64 22.05 - 48 2000
.aac | .m4a | .mp4
.3gp | .m4a | .m4p
AAC++ / e-AAC+
HE-AAC v2 (MPEG-4)
48 + 16 2 128 64 24 - 96 2004
.als ALS (MPEG-4) 65,536 8 6,144 128 8 - 192 2005
.sls SLS / HD-AAC (MPEG-4) 65,536 8 2,304 768 8 - 192 2006
.aac AAC-ELD 48 + 16 6 96 64 16 - 48 2008
.aac AAC-ELDv2 48 + 16 6 96 64 16 - 48 2012
.aac | .m4a | .mp4
.3gp | .m4a | .m4p
eAAC++ / xHE-AAC (MPEG-D) 65,536 12 512 48 22.05 / 32 / 44.1 2012

High vs. Low Bitrate MPEG Codecs

Modern MPEG audio formats should be grouped by low vs. high bitrates. This provides an idea of which AAC flavors are best suited for a particular application. For instance, home theater enthusiasts will typically want to playback content enocded with AAC-LC (AAC core), SLS, or possibly HD-AAC. Mobile device users will encounter a better experience with HE-AAC, ALS, AAC-LD, and their variants.

* Only if an SLS stream contains a "core" or lossy layer (if not, the AAC decoder cannot read SLS)
Decoder Compatability Matrix
Codec Format
AAC-LC AAC+ ALS SLS
AAC (AAC-LC) --- No No Maybe*
AAC+ (HE-AAC v1) No --- No No
ALS Yes ? --- Maybe*
SLS Maybe* No Maybe* ---
AAC+ (HE-AAC v2) No Yes ? ?
eAAC++ No Yes No No
HD-AAC Yes No No Yes

Low Density (bitrate) MPEG Audio Formats

  • ALS
  • MP2
  • MP3
  • SLS
  • AAC
    • AAC-ELD
    • AAC-ELDv2
    • AAC-LD
    • eAAC++ (xAAC-HE)
    • HE-AAC (v1)
    • HE-AACv2
    • TwinVQ

High Quality MPEG Audio Formats

  • MP1
  • SLS
  • AAC
    • AAC-LC (AAC base)
    • HD-AAC2

Caution is advised when contemplating the use of proprietary codecs - such as HD-AAC - due to the potential for future compatibility challenges. On the other hand, SLS is versatile enough to be useful as either a low or high bitrate codec. It may be utilized in either capacity.

AAC vs AC-3

Try not to get confused between these two similar sounding, but very different audio codecs. AC-3 is a competing standard developed by Dolby Laboratories. AAC and AC-3 are both transform coders, but AAC uses a filterbank with a finer frequency resolution that enables superior signal compression.

Launched in 1997 as the successor to MP3, Advanced Audio Coding (AAC) offers smaller files and faster encoding while maintaining as-good-as or better audio fidelity. Although widely supported, AAC remains relatively unknown to most consumers. Over time, the original AAC standard has evolved into a family of proprietary MPEG-2 and MPEG-4 lossy audio codecs. The earliest (MPEG-2) iteration can be thought of as AAC Version 1, as it is the original (main) AAC standard. The MPEG-4 reference is commonly also referred to as AAC, but in order to clarify the version, one may refer to it as AAC Version 2. The official names for the MPEG-4 main AAC version are AAC-LC (where "LC" means "Low Complexity") and/or simply AAC. Other forms of AAC include ALS (MPEG-4 Audio Lossless Coding), SLS (MPEG-4 Scalable Lossless Coding), and ADTS (Audio Data Transport Stream).

Today, the term AAC has several connotations, though this rarely matters to modern decoders. It is unfortunately often difficult to fully comprehend which version of AAC someone is talking about unless they are pointed with regards to nomenclature, such as AACv1, AACv2, etc. so the distinction is clear. For the most part, it doesn't matter as MPEG-4 based decoders should be able to decode the MPEG-2 AAC codecs as well, however that is not guaranteed.

AAC-LC is sometimes used to explicitly differentiate between AAC and AAC+ (another name for HE-AAC).

Making matters potentially more confusing at times, AAC-LC is also known as AAC version 2 and refers to the default codec of the revised AAC standard introduced in MPEG-4, while the original AAC specification may be referred to as AAC version 1, denoting the original AAC specification defined under MPEG-2.

AAC streams are often embedded inside audio and multimedia containers, though AAC is capable of acting as a file format (stand-alone audio stream). AAC is commonly found inside CAF, Matroska, and MPEG-4 containers such as M4A. AAC audio content can even be inside MP4, 3GP, or ADTS containers and bear the .aac file extension. Obviously, this is potentially confusing for playback/decoding applications, which may expect a .aac file to be a raw AAC stream and not wrapped in one of the aforementioned containers. Other combinations normally take on the file extension of the corresponding container.

AAC has developed over time into a set of standards, many of which are designed for streaming over low-bandwidth or unstable network connections. Supported codecs include AAC (AAC-LC) version 1 (MPEG-2 Part 7), AAC-LC version 2 (MPEG-4 Part 3), HE-AAC (also known as AAC Plus or AAC+), ALS, and HE-AAC, ALS, SLS, and/or the ADTS file format, or various external documents such as the MPEG-2 Part 7 [ISO/IEC 13818-7:2006 - Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC)], and/or MPEG-4 Part 3 for information on ALS and LOAS (Low Overhead Audio Stream) - file formats that supplanted ADIF and ADTS in MPEG-7.

Bits, Samples, and Bandwidth

Bitrate is a measurement of the amount of digital data transferred between two (2) points over a specified unit of time. Multimedia bitrates are normally measured in bits-per-second (bps).

Bit Depth

Sometimes referred to as resolution, bit depth is a term you will hear from time to time in audio discussions. This value doesn't truly matter unless the bitstream is uncompressed, raw audio such as Pulse Code Modulation (PCM) or Linear PCM (LPCM). Bit depth is irrelevant with lossy compression formats because the original (lossless) audio has been compressed before being packaged into the codec's delivery stream. On-the-other-hand, sampling rate and the bitrate still do matter. Maximum available throughput (bandwidth) for all channels is also obviously relevant for any application type.

Bit Depth only matters for uncompressed recording formats. It can be relevant to lossy formats, but normally that is not the case as lossy formats are usually also compressed. Most codecs where bit depth is a factor also use a Constant Bit Rate (CBR), but they don't have to.

The maximum bit depth for lossless encoding is 32-bits per sample. ALS and SLS are examples of lossless AAC codecs where bit depth is a significant metric. ALS support up to 32 bits of resolution per sample; SLS supports up to 24-bits.

Sampling Rate

Sampling Rate - more accurately referred to as Sampling Frequency - is the number of samples taken per second from a continuous audio signal. Each "sample" is recorded digitally, as a one (1) or zero (0). Sound wave recording is measured in Hertz, which is just another way of saying samples per second. One hertz = one sample per second. Thus, a codec's sampling rate is normally expresed as kilohertz (kHz) or thousands of samples per second.

So, why is this sampling rate thing so important for audio? In a nutshell, some rather complicated math boils down to a few key concepts:

  • Each "sample" is simply a 1 or 0, stored digitally
  • The range of human hearing is from approximately 20 Hz - 20,000 Hz (20 kHz)
  • Each sample records the state (on - 1 or off - 0) of a single audio hertz frequency during the one (1) second time duration of the sampling process
  • An important scientific principle called the Nyquist–Shannon sampling theorem (also known as the "Nyquist principle")

On the surfact this sounds very straighforward, eh? If any audio is detected at a given audible frequency (hertz) in a given time span (1 second), a one (1) or "on" bit is recorded. If not, a zero (0) or "off" bit is recorded. Well, it doesn't take a genius to realize two (2) things:

  1. You only need about 20,000 samples to cover every frequency in the human hearing range
  2. Since music almost always has more than 1 beat (signal) per second, you're going to lose a lot of signals if you only capture one per second

Obviously, this seems like a significant problem. Now, let's add more fuel to the fire. Recall the aforementioned Nyquist–Shannon theorem. This states that in order to capture the perfect reconstuction of an audio signal, the sampling frequency needs to be greater than double the maximum frequency of the signal being sampled.3 So, here we have a problem where it appears one will need about 40,000 samples per second simply to obtain a sampling within one second of any given audio. Thinking about how music tends to sound, that's not going to work. There's no way you can only sample music every second and capture all its nuances, and of course if you've ever listened to a digital music or audio recording, you know there is much more there than sounds that last a full second or are silent, over and over. So, how are more signals recorded per second?

The answer has to do with the fact that although the range of human hearing is theoretically about 20,000 hertz, but most people cannot hear audio across the entire range. Furthermore, most recorded audio only covers a small portion of this large frequency band. For instance, human speech often falls within a range of 85 - 255 Hz for most people, though some languages are much higher pitched and when you factor in shouting and singing, the 2 kHz - 4 kHz range comes into play.4,5 Either way, the range for any given language is rather small, resulting in a range that is considerably less than the full spectrum. Music tends to predominantly cover a range of about 60 Hz to 4 kHz, with some jumps outside of that range and also some contextual audio information presented outside that range.

However, that's only half the story. The majority of time, there is a much smaller range one should focus on for any given recording. If you think about it, a full range recording would effectively waste a lot of data on zeroes because nothing is normally being recorded outside certain bounds, depending on what is being recorded. Wouldn't it be great if there was a way to do this? Well, of course there is! Herein lies the magic of audio codecs and their sampling rates. By limiting the scope of the frequency range being sampled, suddenly there are a lot more available samples per time period (second) with which to take a reading. For example, with a sampling rate of 96,000 times per second (hertz), and sampling human speech, you should be able to glean ~564 samples per second per frequency. Now, we have gone from a very unrealistic and minimalistic one (1) sample per second per frequency to 564/second. Obviously, with a broader range of frequencies the sampling number becomes much smaller. However, even then there are mathethematical tricks the codec can play on the sampling methodology, and therein is much of the "magic" per se of modern audio codecs. This is also why audio compression becomes important, because it means a lossy codec is able to work even more magic by fudging the data collection a bit and skipping or ignoring certain data points in order to either collect more data and/or use less space to store it.

This is how and why audio codecs tend to focus on application-specific functions. Need a speech-only or primarily speech codec? Then you can use one with a smaller sampling rate, because the range you need to capture is very small. Want to record a variety of sounds and effects in an action movie? Then you will need a broader range of recorded frequencies. That means some combination of a higher sampling rate, and/or some other fancy codec work, such as dividing the sounds into different channels. See where this is going?

Multi-Channel and Signal Multiplexing

This is where one begins to enter the deep end of audio technology and especially codecs and encoding algorithms. Channels can be used to segment the capture of audio signals, because sampling rate is measured per channel. Thus, a codec supporting more channels will allow a sound engineer to multiplex a very active and broad signal into many channels, which is very helpful if your codec only has a very low sampling rate for example. Of course, this also depends on the ability of the decoder on the other end to handle a large number of simultaneous channels. Audio objects are also useful for this, and both can be used in conjunction with frequency ranges to include directional sound and other goodies. All of these things are well beyond the scope of this article. The intent here is simply to scratch the surface and drive home the point that a higher sampling frequency means greater audio fidelity and/or audible range of the given recording. Or you could just say more = better. The Achilles' Heel however, is higher sampling rate = more samples = more data to be transferred. Again, why we have multiple codecs (even for AAC) with different goals. One size does not fit all.

Bitrate

Bitrates measure the amount of data output by a decoder, per channel.

Why is bitrate important? Also written as "bit rate" and "bit-rate," the bitrate of all channels combined is limited by a codec's bandwidth. When the bandwidth limit is reached, data flow will be cut off. Therefore, it is important to pay attention to how bitrate, bit depth (if applicable), and sample rate work together to saturate whatever bandwidth is available. Most codecs allow a high enough sample rate and bitrate that when maximized, it's not physically possible to use those high settings on every channel. Why do codec authors do this? Because there are circumstances where a sound engineer might want to take advantage of higher encoding bitrates for a particular reason. The most common example is when encoding stereo sound at very high bitrates. There is normally sufficient bandwith in modern audio codes to max out two (2) channels of audio, but if the engineer needs to encode six (6) or eight (8) channels for example, they will likely have to use a much lower bitrate per channel. Otherwise, they will very quickly saturate the bandwidth, and whatever information didn't get decoded before hitting the limit will simply disappear.

The majority of AAC codecs are lossy and use a Variable Bit Rate (VBR). This means there is no specific standard or required bitrate, or range of bitrates. Instead, it varies constantly even within a single file, and will depend on the selected sampling rate.

{# of channels} x {bitrate per channel} = bandwidth required

Bitrate of What? Lack of Clear Scope is Commonplace

Unfortunately, it's quite common to find references to audio bitrates without a clear explanation of whether the discussion is about a single channel, multiple channels, or the underlying codec. Yet this important detail makes a huge difference in understanding the context. Adding insult to injury, you can't simply add up the bitrate for all channels to arrive at a total bandwidth for an audio codec or a given stream. Why not? Because channel bitrates are not additive. For starters, lossy codecs in particular don't work that way. You could have a format with 64 kbps mono and 96 kbps stereo signals. If you only knew a single channel had a 64 kbps bitrate, you would think a stereo stream using the same codec would be 64x2 or 128 kbps, but that is not a given and you'd likely be wrong.

Overhead

All multimedia codecs have overhead. This is the portion of bandwidth (and by inference, bitrate) occupied by administrative information integral to the operation of the codec to begin with. Examples include headers and frame metadata. However, when multiple channels are multiplexed into a single stream, this overhead isn't duplicated. There becomes an efficiency of scale to some extent as more channels are added to the same stream. This is one reason why some codecs have a very high range of independent channels within a single stream. At some point, the addition of each new channel takes up less administrative space and requires less additional processing to multiplex it than if fewer channels were transmitted. The "savings" per se vary depending on the codec and application. Other efficiencies are also possible - and relatively common in modern audio codecs - such as shared frequency overlaps, compression, and limiting "dead space" occupied by large portions of unused frequency ranges. On the other hand, just because a codec can support 65,536 channels for example does not mean it's a practical feat. It is exceptionally rare to find any use for more than about 16 channels in a single stream, under any circumstance.

Bandwidth

Audio Bandwidth is frequently misunderstood by end users. Bandwidth is the maximum amount of data which can be transferred from one point to another in a given period of time. Bandwidth is impacted by:

  • Delivery medium of the audio stream
  • Speed of decoding platform (e.g. processor)
  • Constraints of the codec
  • Number of channels transmitted
  • Frequency range of each channel
  • Lossy versus lossless compression, if any

Looking at the bandwidth problem from a top-down perspective, the most important limiting factor is the physical transport layer. How will audio data be transmitted from the source to the playback device? Common examples include:

  • HDMI cables (e.g. home theater)
  • Network cables
  • WiFi
  • Cellular (2G/3G/4G/5G)
  • RCA cables
* Theoretical limit at maximum 96 kbps sampling rate
MP3 and AAC Bandwidth
Format Standard Part Edition Max
Throughput
MP1 MPEG-1 Layer I -- 1,500 kbps
MP2 MPEG-1 Layer II -- 1,066 kbps
MP3 MPEG-1 Layer III -- 1,002 kbps
MP3 MPEG-2 3 5,102 kbps
AAC MPEG-2 7 5,102 kbps
AAC MPEG-4 3 ~256,000 kbps*

Every codec has a maximum bandwidth capability. A codec's bandwidth is how much data it is capable of processing per second. What happens when you exceed the maximum bandwidth of a codec? The remaining information is lost. Poof.

A simple way to visualize this is by examining what would happen if you attempted to encode a multi-channel data signal at the maximum possible bitrate the codec supports. This would likely work fine for two (2) channels. Perhaps four (4) or five (5). However, at some point what happens is the large amount of data generated by each channel input hits the overall limitation of the codec; it's bandwidth. This figure represents the maximum amount of data the codec is capable of processing within a specified amount of time. Normally, one second. Imagine expecting six (6) channels of audio playback for example, but only hearing four (4) of them because audio from the fifth and sixth channels never gets decoded and transmitted to the playback device. This is what happens if the total bandwidth you attempt to utilize exceeds the bandwidth limitation of the codec. The formula to calculate the required bandwidth is:

{# of channels} x {bitrate per channel} = bandwidth required

If the bandwidth required exceeds the bandwidth available, audio data will be lost.

Audio Codecs, Containers, Profiles, and Formats

AAC is a very broad and deep topic. A brief modicum of information on AAC formats was mentioned above simply to provide a level of context and highlight the fact AAC is a multi-faceted hydra. Continuing the discussion at a high level, this section provides a cursory explanation of codecs and container formats.

A broader discussion of audio codecs and containers may be found in the related article How Audio Files Work: Codecs and Containers.

What is a Codec?

Codecs are used to encode and decode multimedia, audio, or video streams. Codecs aren't a physical device. They are standards and methodologies. As long as an encoder or decoder produces the same input (encoder) and output (decoder) as prescribed by the format standard, they are acceptable. Except for open-source versions, codecs must be licensed from their author. They are intellectual property. Likewise, a device that uses an encoder or decoder is normally proprietary (some open-source versions exist, especially of decoders). When you buy an audio/visual (A/V) receiver for example that claims it supports various surround sound formats (Dolby Digital for example), this is because the manufacturer of that A/V receiver has licensed respective codecs, AND they have built the corresponding decoders into the equipment you've purchased. In order for the device manufacturer to be able to do this, they must pay a license fee to the codec owner (Dolby Laboratories, in the example of Dolby Digital).

Audio Containers

AAC files may be stored in different types of containers, depending on whether they are the MPEG-2 or MPEG-4 variety. These may be audio-only or audio/visual container types. The question is: Does it matter? And the answer is: Sometimes, yes.

First there is the question of which MPEG standard was used to encode the AAC file. If it was MPEG-2, then the AAC audio will be encoded using either Audio Data Interchange Format (ADIF) or Audio Data Transport Stream (ADTS).

The history of ADIF and ADTS is discussed briefly in MPEG-2 Part 7.

ADIF

What is ADIF? Audio Data Interchange File (ADIF) is an audio bitstream format consisting of headers follwed by raw audio data. Part of MPEG-2 Part 3 standard, ADIF is a primitive file structure. The length is not known until the next header or end of file is reached.

ADIF containers may encapsulate audio data encoded only per the old AAC base standard (MPEG-2 Part 3 as amended or original MPEG-2 Part 7).

ADTS

What is ADTS? Audio Data Transport Stream (ADTS) is an audio bitstream format consisting of a sequence of frames with headers similar to MPEG-1 audio frame headers. The encoded audio data of each frame is always contained between two sync words. This allows the number of bits in a frame to be variable. Defined by the MPEG-2 Part 7 standard, ADTS was the first MPEG file structure developed for streaming audio.

LATM and LOAS

The MPEG-4 version of AAC introduced a far greater ability to multiplex audio channels into an AAC stream. With this change in function, a corresponding change in file architecture was required. MPEG-4 contains its own file and streaming structures, which apply to audio only or a combination of audio and video content. It also introduced two (2) new transport standards for MPEG-4 audio only: Low overhead Audio Transport Multiplex (LATM) and Low Overhead Audio Stream (LOAS). The latter was based on the former. Both are defined in the MPEG-4 Part 3 (audio) standard.

LATM and LOAS are single stream transport formats. This means that unlike the native MPEG-4 transport format, LATM and LOAS cannot matrix 5.1 + 2. This means they will contain a multi-channel encoded stream OR a stereo stream, but not both. Furthermore, while they are capable of holding almost any variant of ACC (such as TwinVQ, ALS, or any other AAC codec that uses SBR or PS extensions), LATM and LOAS are not compatible with AAC-LC.

3GP/3GPP

3GP is another audio container format worth mentioning. Targetted solely at mobile and low-power devices, 3GP (also known as 3GPP) is explained in How Audio Files Work: Codecs and Containers.

Audio Profiles

AAC takes a modular approach to encoding. So-called codec "profiles" are defined based on complexity of the bitstream to be encoded, desired performance, and acceptance criteria of the output. The most commonly implemented profiles are the AAC Main Audio Profile (the first MPEG-4 Part 3 AAC profile created in 2003), the AAC Profile (what is typically referred to in modern day references to "AAC"; defined by MPEG-4 Part 3 2nd edition and uses AAC-LC only, HE-AACv1, and HE-AACv2.

Audio Formats

AAC's audio formats are the workhorses of the standard. Profiles apply these formats to encode and decode AAC streams. This article's discussion pertaining to AAC's capabilities in terms of numbers of channels and its encoding capabilities stems primarily from these formats. The most prevalent are described briefly below.

ALS

ALS (Audio Lossless Coding) is backwards-compatible format standard with both MPEG-4 and MPEG-2 variants of AAC.6

SLS (HD-AAC)

Scalable Lossless Coding (SLS) - also known as HD-AAC - is a proprietary codec capable of running in either lossless or lossy modes. In lossless mode, it can handle up to a 24-bit depth at 192 kHz sampling rate. HD-AAC is a derivative of AAC-LC and SLS. The SLS file format standard is unique. It can be lossless only (not backwards compatible with ALS and AAC) or contain both a lossy layer and a lossless correction stream (backwards compatible with ALS and AAC). SLS is backward compatible with MPEG AAC-compliant bitstreams.

High Definition Advanced Audio Coding (HD-AAC) is a proprietary form of Scalable to Lossless (SLS) developed by the Fraunhofer Institute and released in 2006. HD-AAC enhances the ISO/IEC 14496-3:2005/Amd 3:2006 SLS standard with higher bitrates and improved error detection algorithms.2

The MPEG Standards

This section provides brief summaries of each MPEG standard related to audio.

MPEG-1

Current Standard : ISO/IEC 11172-3
Released: 1992
MIME Type: mpeg
File Extensions:.mp1 | .mp2 | .mp3

MPEG-1 was the first audio/visual format released by MPEG. Its audio form contains three (3) sections called Layers (I/II/II), and began what has become a tradition of Part 3 of MPEG multimedia standards containing the audio portion of those standards.

The very first MP3 file standard orginated with MPEG-1, as MPEG-1 Part 3 Layer III. It was 2-channel stereo-only with an average 32 kbps bitrate and offered sampling frequencies of 32, 44.1, or 48 kHz. It's maximum bitrate is 128 kbps (both channels). Even though the standard's theoretical bandwidth is 1.5 Mbps, this includes allowing room for a video stream. MPEG-1 Part 3 also introduced Layer I (.mp1) and Layer II (.mp2), although these formats were never widely used and are virtually unknown today.

MPEG-2: The Gold Standard

Current Standard : ISO/IEC 13818-3:1998
: ISO/IEC 13818-7:2006/AMD 1:2007
Released: 1995
MIME Type: mpeg
File Extensions:.mp3

The Godfather of multimedia codecs. MPEG-1 gave birth to the second generation of MPEG audio codecs, and what has become the most famous audio file type on the Internet: MP3. It was MPEG-2 that cemented MP3 as the de-facto standard. Beyond audio, MPEG-2 quickly became the most prolific multimedia file type for many years, and this popularity brought the stand-alone MP3 standard along for the ride. MPEG-2 audio and video codecs are used to encode DVDs and were one of the first digital broadcast multimedia standards. Even today, MPEG-2 is still one of the most popular formats for the generic coding of moving pictures and associated audio.

MPEG documents group their specifications into independent sub-sections called "Parts." Each part of a format standard is an exclusive set of rules. MPEG-2 has two (2) audio parts, resulting in two (2) distinctly different audio format criteria.

Part 3: MP3 Multi-channel

MPEG-2 Part 3 is also referred to as MPEG-2 BC, where "BC" means "Backward Compatible." Part 3 contains Audio Layers that mimic the MP formats introduced by MPEG-1 Part 3 (MP1/MP2/MP3).

MPEG-2 Part 3 improved on the original MP3 format (MPEG-1 Part 3 Layer III) by adding Variable Bit Rate (VBR) capability and lifting the bitrate ceiling to a total of 320 Kbps (from MPEG-1's 128 kbps), though it did lower the sampling frequency range to 16, 22.05, and 24 kHz. And Part 3 marked another watershed for the MP3 format by incorporating support for 5.1 surround sound channels (the primary reason for the lift of its maximum audio bitrate to 320 kbps total).

An MPEG-2 Part 3 stereo MP3 file can be encoded at 320 kbps, but doing so is pointless as there is no appreciable difference in audio quality.

The "MP" in MPx format abbreviations refers to "Music Player."

Why did MPEG-2 Audio (Part 3) continue the MP3 (and MP1, MP2) standards while its companion Part 7 created an entirely new audio format? The primary purpose of Part 3 was to maintain backward compatibility with MPEG-1 audio Layers I/II/III, more commonly known by their standard file extensions: MP1, MP2, and the venerable MP3. MPEG-2 Part 3 enhanced MPEG-1's audio layers by allowing more than two channels (up to 5.1 multi-channel) for the Layer III format (MP3).

Part 7: AAC

The next chapter in MPEG audio evolution, MPEG-2 Part 7 introduced the Advanced Audio Coding format (AAC).

MPEG-2 Part 7 and AAC differs substantially from Part 3. Also referred to as MPEG-2 NBC (where "NBC" means "Not Backward Compatible"), MPEG-2 Part 7 allows a 5.1 signal and a 2-channel downmix to be multiplexed together into a single stream. This simplifies delivery, at the expense of incompatability with older decoders (which cannot split the muxxed stream). This dual-stream capability was at the heart of the earliest AAC implementation. The ability greatly simplified audio content distribution, by allowing a single stream to deliver multiple content variations.

AAC remains a very popular audio format today. The format offers superior compression and audio quality performance compared to MPEG-2 Part 3 (MP3), which it was purposefully designed to replace. To date, MP3 continues to remain the most common audio file format online in spite of the presence of superior formats such as AAC.

AAC is NOT backward compatible with MP3/MP2/MP1

Capable of transmitting up to 48 simultaneous channels of full-range audio and 16 channels of Low Frequency Effects (LFE), AAC streams are broken down into groups of 16-channels each: 16 mono (full range) channels; 16 LFE channels; and 16 commentary channels (dialogue). The LFE and so-called "commentary" channels are frequency range restricted. LFE channels top out around 200 Hz and the commentary channels are limited to the range of human speech. Commentary channels are typically used to provide multi-lingual versions of a spoken soundtrack, such as a narrator or spoken language dubbed over the soundtrack by a translator. It's important to note that just because there are 48 channels available doesn't mean you'll ever see that many used in an MPEG-2 stream (you won't). Not only would that be unnecessary, but when a physical media type is involved, there is a technical limit to how much data can be crammed onto it (e.g. DVD or Blu-ray). Even when that is not the case, there are practical limitations. How many language versions of a soundtrack do you need???

MPEG-2 Part 7 encoded AAC streams streams can be read by most MPEG-4 decoders, but their support of ADIF and ADTS profiles is optional.

AAC files encoded under MPEG-2 Part 7 are container types and use one of two (2) architecture types: Audio Data Interchange Format (ADIF) or Audio Data Transport Stream (ADTS). What's the difference? While, both are frame-based data architectures, ADIF is used when the file is a coded by an MPEG-2 encoder as AAC and the file only contains raw AAC audio data. If the file is encoded as a transport layer with the AAC audio content embedded inside, it must be created using the ADTS profile, as the latter allows self-synchronization of frames inside the container. MPEG-2 AAC (multichannel) uses ADIF.

MPEG-3

What happened to MPEG-3? Why do the MPEG standards jump from MPEG-2 to MPEG-4? The answer is MPEG-3 is a 'non-standard.' It was a standard that began in parallel to MPEG-2, but at some point no longer made sense on its own. Hence, it was rolled into MPEG-2. There never was an explicit audio format under MPEG-3. There are plenty of other numbers skipped over as well.7

MPEG-4: Modern AAC and Multimedia

Current Standard : ISO/IEC 14496-3:2019
Released: 1999
MIME Type: aac
File Extensions: .aac | .m4a | .mp4
: .3gp | .m4a | .m4p
: .vqf | .als | .sls

MPEG-4 audio is a significant upgrade and departure from MPEG-2, and presents a myriad of new AAC codecs, called "profiles."

MPEG-4 is the first MPEG multimedia format designed specifically for streaming. It allows MPEG-4 decoders to support (but does not require) MPEG-1 and MPEG-2 multimedia backward compatibility. Likewise, it can (but is not required to) support Audio Data Interchange Format (ADIF), which is used by MPEG-1 and MPEG-2.

MPEG-4 is platform agnostic. Although not as sophisticated in this regard as MPEG-21, it does include specialized codecs for applications such as speech, natural audio, multi-channel vs stereo, and high quality recordings. These variances are called "profiles," and are distinguished by their accompanying codec frequency range, bitrate, and compression level.

The most common MPEG-4 codec or profile is AAC LC. The "LC" means, "Low Complexity." AAC LC - also known as AAC Version 1 (v1) - is an extension of MPEG-2's AAC codec. It remains the most common audio codec used in MPEG-4 recordings. When you read or hear of "AAC" audio in the context of MPEG-4, it is nearly always AAC LC, and this should be presumed unless proven otherwise. AAC LC (AAC v1) has the highest audio fidelity and bitrate of the AAC varieties. If you're watching a DVD or Blu-ray encoded with AAC audio, it's this version.

MPEG-4 introduced another variant of AAC called HE AAC, where the "HE" portion means, "High Efficiency." AAC HE - also known as AAC Plus (or AAC+) and is sometimes confusingly (and erroneously) referred to as "AAC v2" - is AAC + SBR (Spectral Band Replication). AAC+ provides a roughly 30% increase in data compression over AAC LC).

Containerization

This information doesn't matter to 99.9% of people and should be irrelevant to nearly any end user. It really doesn't matter unless you are encoding, but in the event you're curious....

MPEG-4 audio utilizes two (2) container types. Low Overhead Audio Transport Multiplex (LATM) is used when an MPEG-4 contains audio only. Otherwise, Low Overhead Audio Stream (LOAS) is used to encapsulate one or more AAC (either SBR or Parametric Stereo streams), ALS (Audio Lossless coding Stream), TwinVQ (Transform-domain Weighted Interleave Vector Quantization), or LATM streams.

Known as AAC version 2, MPEG-4 Part 3 permits backward compatibility with MPEG-2's version of AAC (AAC version 1), but it is not required. It's important to understand this, becasuse not all MPEG-4 decoders support AAC v1, yet nearly all MPEG audio decoders you will encounter are MPEG-4 based.

MPEG-7

Current Standard : ISO/IEC 15938:2020
Released: 2006
MIME Type: n/a
File Extensions: n/a

MPEG-7 is not an audio or video codec. In fact, it contains no codecs or compression algorithms at all. MPEG-7 is different. It defines multimedia content representation (metadata). MPEG-7 is defined under ISO/IEC 15938: Multimedia Content Description Interface. The section pertaining specifically to audio is ISO/IEC 15938-4:2006 Audio section (Part 4, 2006). The standard requires:

  • Complementary functionality to previous MPEG standards
  • Standardization of multimedia content descriptions, representing information about the content, not the content itself
  • Descriptive metadata to be separated from the content
  • Descriptions be XML based

MPEG-D

Current Standard : ISO/IEC 23003:2012/Amd 3:2016
Released: 2012
MIME Type: n/a
File Extensions: n/a

MPEG-D is not a container format. The standard is divided into five (5) distinct parts. Each part defines a different, highly efficient audio codec. This article is primarily concerned with Parts 1 and 3.

MPEG-D multi-channel audio is defined in Part 1 and is called, MPEG Surround. It uses a hybrid content model of compressed (lossy) mono or stereo, plus a side-channel with audio object information. This allows MPEG-D decoders to playback multi-channel (5.1) audio, while retaining backward compatibility for most other formats as a stereo signal. In essence, MPEG-D supports a single stream containing 5.1 multi-channel surround coupled with a 2-channel stereo downmix. Although this parallel is somewhat of an oversimplification, it accurately states the net effect for compatible decoders predating MPEG-D support. As with MPEG-4 AAC, the client device selects which stream should be played. Both sub-streams are transmitted from the host device. MPEG claims the resulting MPEG-D multi-channel stream is a "high-quality multi-channel surround experience."8,9

Part 3 defines the Unified Speech and Audio Coding (USAC) extension; used by the xHE-AAC codec.10

Incidentally, MPEG-D Part 4 defines MPEG-D DRC (Dynamic Range Control) and was added to the standard in 2015. DRC is a form of automatic volume adjustment designed to reduce the impact of background noise from interfering with the intended focus of the listener.

MPEG-21: An Audio Object Standard

Current Standard : ISO/IEC 21000:2019
Released: 2003
MIME Type: mp21
File Extensions: .m21 | .mp21

Introducing the concept of Digital Items, MPEG-21 is not a group of codecs like MPEG-2 and MPEG-4. It is a multimedia framework where users interact with one another, and the object of their interaction is a Digital Item. To facilitate this interaction, ISO 21000 defines a "Rights Expression Language." This standard is a set of rules that manage restrictions for digital content usage. This "language" is XML based. In a sense, MPEG-21 is closer in architecture to MPEG-7, which describes content via metadata.

MPEG-21's primary objective is to define technology needed to support users in exchanging, accessing, consuming, trading or manipulating Digital Items. In this sense, a "user" is anyone who interacts with Digital Items inside the MPEG-21 Multimedia Framework.

For a more detailed explanation of MPEG-21, see the related article on MPEG-21.

MPEG-21 is mentioned in this article because it may be encountered as a conduit for other MPEG formats and/or other non-MPEG audio formats. As a packaging protocol, it functions as a container.

MPEG-47

What the heck is MPEG-47?

MPEG-47 is an obscure term. It is not a real standard. It is an amalgamation of "MPEG-4" and "MPEG-7" and is best thought of as a sort of slang term for the use of both formats in a single application. Recall MPEG-4 is a content standard while MPEG-7 is a content description standard.

MPEG-Dash (not MPEG-D)

MPEG-DASH is a standard that defines a system for the dynamic adaptive streaming of multimedia content over HTTP. It is mentioned here only for the sake of completeness. Although it is capable of streaming audio, MPEG-DASH is not an audio format standard by definition. MPEG-DASH is defined by ISO/IEC 23009:2014, and has been updated several times between 2014-2018.

MPEG-DASH is sometimes referred to as MPEG-DA.

References

A closer look into MPEG-4 High Efficiency AAC. (n.d.). Telos Alliance.

Bosi, Marina and Goldberg, Richard E. (2003). Introduction to Digital Audio Coding Standards. Kluwer Academic Publishers.

Brandenburg, K. (1999). MP3 and AAC Explained.

Brandenburg, Karlheinz; Kunz, Oliver; Sugiyama, Akihiko (1999). "MPEG-4 Natural Audio Coding - General Audio Coding (AAC based)". chiariglione.org.

Calabrese, Robert. (13 November 2019). Home DJ Studio.

Dietz, Martin and Herre, Jürgen. (May 2008). MPEG-4 High-Efficiency AAC Coding. IEEE Signal Processing Magazine, 137-142.

Eureka People: How Karlheinz Brandenburg invented the MP3. (29 January 2013).

Grill, B. and Quackenbush, S. (October 2005). MPEG-2 Audio.

HD-AAC. (2020). Fraunhofer Institute for Digital Media Technology.

ISO/IEC 13818-7. (15 October 2004). Information technology — Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC).

ISO/IEC 14496-12:2015

Meltzer, Stefan and Moser, Gerald. (January 2006). MPEG-4 HE-AAC v2 — audio coding for today’s digital media world. Audio Compression.

Proceedings of the AES 17th International Conference: High-Quality Audio Coding, Florence, Italy, 1999 September 2-5.

Purnhagen, Heiko. (7 November 2001). MPEG Audio FAQ. Massachusetts Institute of Technology.

Rose, Matthias. Understanding MPEG Audio Codecs From MP3 To xHE-AAC. (n.d.). MODULATION INDEX, LLC.

Schnell, Markus; et al. (October 2008). MPEG-4 Enhanced Low Delay AAC - A New Standard for High Quality Communication (PDF). 125th AES Convention. Fraunhofer IIS. Audio Engineering Society.

Timmerer, C.; Hellwagner, H. (2008) MPEG-21 Multimedia Framework. In: Furht B. (eds) Encyclopedia of Multimedia. Springer, Boston, MA

Van der Meer, Jan. (20 March 2014). Fundamentals and Evolution of MPEG-2 Systems: Paving the MPEG Road. John Wiley & Sons.

End Notes

1 ISO/IEC JTC 1: Information Technology. (n.d.). International Organization for Standardization.

2 HD-AAC. (2020). Fraunhofer Institute for Digital Media Technology IDMT.

3 Nyquist–Shannon sampling theorem. (8 June 2020). Wikipedia. Wikimedia Foundation.

4 Facts About Speech Intelligibility. (20 January 2016). DPA Microphones, Inc.

5 Neuendorf, Max; et al. (April 2012). MPEG Unified Speech and Audio Coding – The ISO/MPEG Standard for High-Efficiency Audio Coding of all Content Types. Audio Engineering Society. 132nd Convention 2012 April 26–29. Budapest, Hungary.

6 ISO/IEC Copyright Office. (1 September 1999). International standard ISO/IEC 14496-3:2005/Amd 2:2006 Information technology — Coding of audio-visual objects — Part 3: Audio. AMENDMENT 2: Audio Lossless Coding (ALS), new audio and BSAC extensions. (3rd ed.).

7 Full list of MPEG Standards. (n.d.). The Moving Picture Experts Group.

8 Chiariglione, Leonardo. (24 February 2019). There is more to say about MPEG standards.

9 MPEG-D. (n.d.). The Moving Picture Experts Group.

10 USAC (Unified speech and audio coding). (2019).