This is a lengthy article about metadata intended to describe multimedia content.
The purpose of metadata is to convey information about an object to a user interacting with the object. An "object" in this case may be any multimedia content, such as a video, an audio recording, a series of images, or sounds that represent digital art. To be effective, metadata must articulate characteristics of the object in question. Effective metadata should meet the following criteria:
- Complies with one or more commonly accepted schema
- Allows embedded information to be edited, retains any changes, and faithfully reproduces them during any subsequent read requests
- Stores information in a format designed to be read (viewed/used) by a commonly available reader
- Describes one or more characteristics of its associated content
- May be freely used, read, written, and freely exchanged without encumbrance upon the end user
Contents
- Chunks, Frames, and Packets
- Purpose of Metadata
- Types of Metadata
- Audio-Only Metadata
- Multimedia Metadata
- Broadcast Metadata
- MPEG Standards
What is Metadata?
Metadata is one of the most misunderstood characteristics of audio and video content files.
Metadata is simply information describing the characteristics of content. As consumers, we tend to think of metadata as information such as a song's title, author, and/or artist name. While this is true, it is in fact a much broader topic. Metadata is easier to understand when divided into broad categories:
- Technical Metadata: a set of information that defines the specifications and objective characteristics of the content. It refers to indisputable information such as the resolution of a video, frequency range of audio, etc. It tells you how the content is composed.
- Contextual Metadata: a set of information that describes, parameterizes or catalogs content, such as title, album, artist name, copyright holder, etc. Also known as descriptive metadata, content metadata, or semantic metadata, this is what most consumers think of when they hear the term "metadata." It describes what the content is.
- Dark Metadata: metadata that is unknown to the decoding application at the time of processing. This may occur for many reasons, such as private cataloging (also known as private metadata), unknown extensions to MXF, or standardized metadata items that are simply not understood by the decoder. The term is a tongue-in-cheek reference to Dark Matter.1
Some codecs embed technical metadata, but rarely contain contextual metadata. Notable exceptions are MP3 and True Audio codecs. Most containers harbor technical metadata, and nearly all of them allow contextual metadata as well. Dark metadata can be in either, but is more likely to be hidden in container formats.
Metadata vs. Metrics
Metadata should not be confused with metrics. Metrics are measurements of data. They quantify statistics about a data set, such as frequency or correlation, and can also capture qualitative information, such as efficiency. Metrics compare and group information. Metadata, on the other hand, doesn't care about those details; it only cares about what something is. While metadata can sometimes look like a metric, it's not. Metadata is static: no matter how you use the associated information, the metadata remains the same. It tags along for the ride. Metrics, however, change all the time. How many people listened to a particular song last week? That is a metric. The name of that song is metadata; it doesn't change regardless of how many times the song is played.
Is there such a thing as "normal" metadata? No. By its very nature, metadata paints with a very broad brush. Metadata may be delineated by purpose, type, and/or structure.
Chunks, Frames, and Packets
Chunks and frames are virtually identical in meaning. They are simply containers within a file that store specific types of data. Typically, a chunk is larger and may contain a group of frames or group of keys (tags), whereas most frame-based codecs store only one key/tag per frame. Examples of chunk-style containers include RIFF, AIFF, WAVE, OGG-based formats (FLAC/Vorbis/Opus), and CAF. ADTS is an example of a frame-style container. Generic container types such as RIFF, AIFF, AIFC, WAVE, ANI, and AVI normally use chunks.
Metadata needs to be compartmentalized such that playback/file reading applications can clearly understand where your metadata is, that it's metadata, and what type it is. One of the oldest methods of delineating different content portions of a file is the concept of chunks. Simply a method of segmenting a file into organized pieces, each chunk begins with a series of data that signals the beginning of a new chunk. Depending on the format, it is possible for chunks to be nested.
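The chunk layout described above is simple enough to walk by hand. Below is a minimal sketch in Python (stdlib only) of iterating the top-level chunks of a RIFF file; the function and variable names are my own, not part of any standard:

```python
import struct

def walk_riff_chunks(data: bytes):
    """Yield (chunk_id, payload) pairs from a RIFF file's top level.

    Each chunk starts with a 4-byte ASCII ID and a 4-byte little-endian
    payload size; payloads are padded to an even byte boundary.
    """
    if data[:4] != b"RIFF":
        raise ValueError("not a RIFF file")
    # Bytes 4-7 hold the overall size; bytes 8-11 hold the form type (e.g. WAVE).
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack("<I", data[pos + 4:pos + 8])
        payload = data[pos + 8:pos + 8 + size]
        yield chunk_id.decode("ascii"), payload
        pos += 8 + size + (size & 1)  # skip the pad byte after odd-sized payloads
```

Nested chunk types such as LIST would need a recursive pass, but the 8-byte header convention is the same at every level.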
Frames are a cousin of the chunk concept. Frames work in a similar fashion, but allow more flexibility in the content within each frame. Otherwise, they are very similar to chunks. Their main difference is that sets of tags are used to delineate the beginning and end of each frame.
Some codecs use what they refer to as "packets." In the true sense of the networking term, packets are a uniform size. However, this often isn't the case when multimedia codecs describe their composition as consisting of packets, and such descriptions should be viewed with skepticism. In general, it's best to treat them as frame types. For example, the Musepack (MPC) standard claims to use packets, but in reality its metadata storage methodology is more characteristic of frames.
Purpose of Metadata
Why does anyone bother with metadata? What's the big deal? Metadata is independent of content. It provides the means to organize its associated content. Have a large digital music collection? How do you organize it, such as keeping track of where songs are and what they are? Almost undoubtedly, you are using metadata to do so. While you certainly could use a simpler scheme such as filenames, methods like that break down easily. One accidental widespread rewrite of filenames renders your cataloging system useless, forcing you either to live with all your content being unidentifiable without listening to it and trying to recall what it is, or to wipe it out and start over.
Types of Metadata
When metadata was first introduced into media files, it was a mixture of a few fixed, pre-defined pieces of data at the beginning of a file. It has since come a long way and metadata schemas now exist that can be highly complex. Let's begin by examining the main categories of metadata. There are three broad types:
- Structural Metadata: a set of information that defines the object's structure; i.e., how the object was edited, and which source components were included in which derivation chain.
- Descriptive Metadata: a set of information that describes, parameterizes or catalogs content; for example, an episode number, a copyright holder name.
- Dark Metadata: another term for unknown metadata; information unknown to the reading application at the time of processing. It is often added after content creation by an application that standard media players cannot identify. This could be private metadata, unknown extensions to a codec, or even standardized metadata items not handled by the media player application.
Structural metadata is information that is expected, typically as part of the form factor of an underlying container or codec, containing basic media information. Presumed and sometimes required, it is almost always a very limited set of fields. For instance, if a codec or container format includes a structural metadata field called "Author," it might expect that field to contain letters, but not numbers or symbols. It might also specify that the field may be a maximum of 32 characters and may contain only UTF-8 characters. This is an example of structural metadata. Its form is rigid and inflexible.
When Contextual Metadata is Also Structural
At times, metadata crosses the boundary between multiple types. For example, RIFF is a common chunk-style multimedia container format in which metadata, when present, is always found inside a chunk named INFO. In RIFF's case, the metadata type declaration is itself structural, but any metadata within this structure is in fact descriptive (or potentially dark) metadata. Confused? The RIFF container format requires that metadata, if it exists, be contained inside an INFO chunk, yet within the INFO chunk the format of the metadata is undefined. At the RIFF container level, the INFO chunk either exists or it doesn't. Any metadata inside the INFO chunk is not structural, because RIFF doesn't understand metadata, period. At best, it's contextual/descriptive. At worst, it's dark metadata. Which one depends on what created the metadata and what is trying to read it. Does the decoding application accurately reproduce the metadata? Then it's likely descriptive. If not, it's likely dark. I say "likely" in both cases because there is one more possibility. If the RIFF container is not actually the top-level container, it could be nested inside another container. In that scenario, any metadata in RIFF's INFO chunk might turn out to be structural after all: not to RIFF per se, but to its parent container type. Is this confusing? Yes. Yes, it is; and that is often the case with metadata, which is why it is so difficult to get a handle on in practical use unless one sticks to one particular format.
Skeleton metadata is a type of structural metadata sometimes erroneously referred to as required metadata. Skeleton metadata is structural data that is expected and presumed by a decoder, but is not necessarily required.
The term skeleton metadata is a reference to the presence of pre-defined "bare bones" (minimal) metadata information, which may or may not be required (but normally is, though the data fields may be empty).
Descriptive metadata is what most people think of when they hear the term, "metadata." Also called contextual, semantic, and temporal metadata, descriptive metadata is information describing associated content. Descriptive metadata is sometimes also structural, and vice-versa. For example, a song title and primary artist name could be both. The information naturally describes characteristics of a song, and therefore adds context. It attributes some meaning or concept to the work of art. When a particular codec or container requires this sort of information to be present, it also becomes structural in nature.
Dark metadata is unknown metadata. The decoder doesn't see it and has no awareness of it. Dark metadata only has meaning to the application that created it. Hence, to most applications it is useless information that is skipped over. This could be arbitrary metadata included by accident, or metadata that can only be read by a non-standard, specific application. The most common form is created when a proprietary encoder inserts metadata to record some sort of technical, analytical, or documentation information that the original encoder/decoder understands but other decoders don't. An example would be an encoder inserting a custom tag (key) with the application's version number stored as the value. If the encoder has a corresponding decoder application, the decoder might look for the encoding version tag when reading a file, allowing it to determine which program created the file and which version of the program it was. Another example would be a RIFF container with an INFO chunk, and within that INFO chunk, a string of data that no standard media player is able to decipher. Standard reading applications will simply ignore it or skip over it. Yet another possibility would be a chunk in a RIFF container file with an illegal chunk category. While it will be skipped by RIFF-compliant media players, this chunk could still contain valuable information to an application that expects to find it there. Thus, dark metadata can be utilized as a form of hiding information in plain sight within a file.
Private metadata is a form of dark metadata. The term refers to arbitrary metadata inserted by an individual or entity to record information about the content; only the recording agent is aware of its existence, its purpose, and how to retrieve and apply it. An example of private metadata could be an arbitrary chunk in a WAVE/RIFF file where the author chooses to store the names of family members who prefer listening to the song stored in the file. Private metadata isn't useful to anyone other than the author. It may be passed along inadvertently or on purpose when the file is shared, depending on the application that created it. Decoders should ignore it.
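Creating private metadata in a chunk-style container amounts to emitting one more chunk. Below is a sketch of serializing such a chunk; the chunk ID `fAmZ` is invented for illustration (any four ASCII bytes not claimed by the format would do):

```python
import struct

def make_private_chunk(chunk_id: bytes, text: str) -> bytes:
    """Serialize one RIFF-style chunk: 4-byte ID, little-endian size, payload.

    A payload of odd length gets a pad byte so the next chunk stays
    word-aligned, as the RIFF rules require.
    """
    if len(chunk_id) != 4:
        raise ValueError("chunk IDs are exactly four bytes")
    payload = text.encode("utf-8")
    blob = chunk_id + struct.pack("<I", len(payload)) + payload
    if len(payload) & 1:
        blob += b"\x00"  # pad byte, not counted in the stored size
    return blob
```

When appending such a chunk to an existing RIFF file, the top-level RIFF size field must also grow by the serialized length, or compliant readers will stop short of the new chunk.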
Technical metadata describes characteristics of its associated media content pertaining to the type of information a playback device might need to know. End users and consumers don't normally care much about technical metadata, as it doesn't usually pertain to cataloging and organizing content, such as one might want to do with a digital library. In fact, users rarely see it. When it is available in some fashion to end users, it may appear more like metrics than metadata; however, some codecs and containers require or prefer this information to be stored as metadata for the purpose of aligning playback devices. In some cases, predictive models built into containers and codecs will use this type of information to perform functions such as calculating rewind and fast-forward content offsets in a file. The bottom line is most users ignore this information, if they ever see it. This article does not address technical metadata.
A good example of technical metadata is Dolby's Program Metadata. There is a standard version and a professional version, depending on your focus.
Multimedia formats containing both video and audio content split them from one another. They are recorded in parallel, as separate objects. This is what allows two people to watch the same film, but in different languages. The video portion is common (shared) while the audio portion is different.
Text and XML
The majority of metadata is stored as either ASCII text or XML, making it relatively simple to read (provided it's not dark metadata).
We've established one can have multiple audio streams in a single file, and in fact this is not uncommon (e.g. English and Spanish languages or stereo and 5.1 multi-channel sound tracks embedded in the same media file). Is it possible to have multiple video streams in a single file? Yes; provided the file type supports it. First, it must be a container type. Second, the container type must support this functionality. Two good examples are MPEG-4 and Matroska containers. Both treat any sort of content inside them as objects, and thus they don't care if an object is audio, video, an image, or even something like MPEG-7 metadata. It's just an object to those container formats.
Multiple video streams are rarely found in a single file. The practice can make sense in situations such as an archival file for a broadcast organization, where there is a need to transmit either a high- or low-resolution version of a video at any given time, but consumers would exceptionally rarely require this feature. It also has the disadvantage of creating very large files, which, aside from their obvious increased storage needs, can make playback functions such as seek, fast-forward, and reverse more difficult if the playback device has to buffer a larger amount of data. The latter really depends on the implementation of the playback device or software and how the multi-video file is structured, but such challenges are likely. In general, the practice should be avoided. It's almost always better to let the playback device moderate the video stream (down-convert or up-convert) rather than recording multiple streams. The concept only makes sense in very specialized use cases.
Multiple video streams embedded in the same file are possible now. MPEG-4 Part 14 specifies support for this function. However, it's exceptionally rare to encounter this because it tends to result in very large files. The concept is primarily designed for circumstances that involve delivering multiple content streams in a single file or application, and to support streaming and display models where the content distributor chooses to act as a gatekeeper for which version of particular content a display device will receive. For instance, a direct broadcaster beaming a video to your cell phone might send down a lower resolution video as compared with if the same broadcaster were sending a video down to your 65" wide TV. Either way, this concept really boils down to shifting the control of display resolution from the client device to the distributor of the content.
A common question when it comes to metadata and comparing standards is, "Does it support cover art?" If it does, depending on the metadata format and file format, this could mean just a small thumbnail image, or it might support a full-size image (normally album cover art). In the case of the latter, the full-size image is normally reduced automatically for display at thumbnail size in file viewing or previewing programs. The reverse is not true; thumbnail images cannot be enlarged without severe loss of quality.
Which metadata formats support embedded artwork? Cover Art is a feature most consumers find very important. Yet, in spite of this most metadata standards don't support it. Multi-frame tagging is even rarer.
[Chart: Cover Art & Multi-tagging support by metadata format]
Multi-frame tagging is the ability to specify multiple tags with the same name but different values, such as two tags in the same file reading ARTIST=artist name1 and ARTIST=artist name2. This might seem inconsequential at first, but as your multimedia library grows, the lack of a good method of associating multiple values with the same key or tag name quickly becomes apparent.
Interpreting the chart: the chart indicates which metadata standards support embedded artwork and multiple identically named tags. A blue background denotes recommended metadata solutions for audio tagging. A gold background indicates recommendations for video and audio+video metadata solutions. A pink background means the metadata format has limited usage potential; while its use is not frowned upon, it should be used with caution because the format is either proprietary or limited in some other way.
The alternative is to simply join the two names into a single tag using delimiters, such as ARTIST=artist name1 / artist name2. The problem with this approach (a 1:1 relationship between tags and values) is that if one or more artist names contain the same character(s) as the delimiter, it breaks down. This is where multi-frame tags help tremendously; what separates the good from the best metadata formats is the ability to incorporate flexible choices, such as cover art and multiple tag frames in the same file. To get the maximum benefit out of metadata, you should use a metadata format that natively supports multi-frames when possible.
How do you handle multiple instances of a tag? For instance, if multiple artists contributed to a song, how can that be represented in metadata? The answer is: it depends on the metadata type and the file format. Some metadata formats allow multiple copies of the same tag in the same file, such as multiple ARTIST tags. Each tag with the same name is aggregated. When displayed in playback or tagging software, those tags should be displayed sequentially, often with some sort of pre-defined character or string between them. Multi-tags are sometimes referred to as multi-frame. This is particularly true for metadata standards such as ID3 that use frames.
If you're using or planning to use a metadata format that does not support multi-frames/multi-tagging, you will have to list all the associated names of a tag value in a single tag. For example, if you want to specify more than one artist associated with a music track and you're working with a QuickTime file that uses Apple iTunes metadata, you will need to include all the names in a single tag and use some type of delimiter to separate the artist names and make it clear you're referring to multiple names. Unfortunately, this approach has a number of disadvantages for playback, and especially when searching through a multimedia collection trying to find associated content. For starters, you may not use the same character or string of characters the metadata format uses to indicate the end of a tag name/value combination, or the character/string used to denote the end of a tag name and the beginning of its stored value. Metadata formats supporting multi-frames are without question superior in this regard, as they eliminate these problems. First, you're able to clearly delineate different content contributors, comments, and other information. Second, when searching metadata you won't get false hits or miss appropriate matches due to how different values were sandwiched together.
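The delimiter hazard is easy to demonstrate. A toy sketch in Python (the " / " delimiter is an arbitrary choice for illustration, not any format's rule):

```python
DELIMITER = " / "  # arbitrary delimiter chosen for illustration

def join_artists(artists):
    """Collapse several artist names into one tag value (single-tag formats)."""
    return DELIMITER.join(artists)

def split_artists(value):
    """Recover the individual names -- only safe if no name contains the delimiter."""
    return value.split(DELIMITER)

# Round-trips cleanly when names avoid the delimiter...
assert split_artists(join_artists(["Artist One", "Artist Two"])) == ["Artist One", "Artist Two"]

# ...but a name containing the delimiter is silently mangled.
mangled = split_artists(join_artists(["Simon / Garfunkel", "Guest"]))
assert mangled == ["Simon", "Garfunkel", "Guest"]  # three "artists" instead of two
```

A multi-frame format sidesteps this entirely: each name lives in its own frame, so no in-band delimiter is needed.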
This section is focused on audio-only codecs and containers that support metadata. Below is a chart correlating audio metadata formats to the metadata types they support and the standards they are compatible with.
|Audio Metadata Formats : Metadata Standards|
| RF64 / BW64 | Yes | Yes | No | Yes | Yes | -- | -- | -- | -- | -- | -- | Yes | -- | -- | -- | -- |
| Core Audio Format | Yes | Yes | Yes | No | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | Yes | -- |
| True Audio (TTA) | Yes | Yes | No | No | -- | v2 | -- | Yes | -- | -- | -- | -- | -- | -- | -- | -- |
Opus Interactive Audio Codec - normally abbreviated as simply Opus - is a lossy audio codec from the same organization that maintains FLAC (Xiph.org). Opus competes with MP3, AAC, and Vorbis for music.
Opus allows the use of Vorbis Comments for metadata.
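The Vorbis Comment layout is one of the simplest metadata structures around, which is part of why so many Xiph formats reuse it. Below is a sketch of parsing a raw comment block (assuming the trailing framing bit used in Ogg streams has already been stripped):

```python
import struct

def parse_vorbis_comments(blob: bytes) -> dict:
    """Parse a raw Vorbis comment block into {FIELD: [values]}.

    Layout: a 32-bit little-endian vendor-string length, the vendor string,
    a 32-bit comment count, then length-prefixed UTF-8 "FIELD=value" entries.
    Field names are case-insensitive, so they are upper-cased here; repeated
    fields are legal and collected into lists.
    """
    pos = 0
    (vendor_len,) = struct.unpack_from("<I", blob, pos); pos += 4 + vendor_len
    (count,) = struct.unpack_from("<I", blob, pos); pos += 4
    tags = {}
    for _ in range(count):
        (length,) = struct.unpack_from("<I", blob, pos); pos += 4
        field, _, value = blob[pos:pos + length].decode("utf-8").partition("=")
        tags.setdefault(field.upper(), []).append(value)
        pos += length
    return tags
```

Note that repeated fields come out as a list, which is exactly the multi-value behavior discussed later in this article.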
OptimFROG consists of two (2) obscure and uncommon audio compression formats (codecs). One is lossy and the other is lossless. Both use APEv2 tags to store metadata.
True Audio (2007)
True Audio (TTA) is a compressed, lossless audio codec. It was invented primarily for the purpose of compressing multi-channel WAVE files.
TTA files do not contain explicit metadata, but allow the incorporation of ID3v1, ID3v2, and APEv2 tags. TTA decoders will recognize those metadata formats on playback if the file was encoded as such.
Apple Core Audio Format (CAF) is a chunk-based audio container format developed by Apple, Inc. CAF files contain a number of chunks devoted to technical metadata. Contextual metadata is possible via an 'INFO' chunk (very similar to the more common RIFF and AIFF standards). CAF has limited structural metadata, which is a hybrid of technical and contextual metadata fields. Apple refers to its tag names (keys) as "Information Entry Keys."
Apple Lossless Audio Codec (ALAC) is an audio file format developed by Apple, Inc., and proprietary for most of its life (Apple open-sourced the codec in 2011). What is the purpose of ALAC in terms of metadata? In a nutshell, it's a lossless audio format that Apple created in order to exercise control over iOS users. Why would I say that? iOS does not natively support WAVE and RIFF file metadata. In fact, it ignores virtually any metadata from any PCM file format (PCM = Pulse Code Modulation; translation: raw audio). This isn't due to native compatibility issues such as a hardware-related problem. It is designed this way on purpose by Apple, to promote the company's own format, which accomplishes the same thing. ALAC is a container format designed to encapsulate PCM-based content for Apple devices. Simultaneously, Apple chooses to exclude native support for many industry-standard formats that are also lossless audio.
For a list of supported ALAC tags, see the section on MP4.
ALAC allows a combination of structured and unstructured (free-form) contextual metadata, and it conforms to the MP4 standard (i.e., MPEG-4 Part 14). Ironically, MP4 is an open standard, while ALAC was for years a closed one. ALAC supports MP4-compliant tags, but also has its own unique tags that other MPEG-4 compliant formats cannot understand.
ADIF (Audio Data Interchange Format) was originally designed to contain AAC audio codec content.
Free Lossless Audio Codec (FLAC) is exactly what its name implies: a free, lossless audio codec. FLAC is very popular and often compared with MP3; the biggest difference is that FLAC is lossless while MP3 is a lossy codec. FLAC has its own structural and technical metadata, and it permits the use of Vorbis Comments. FLAC is a block-based format; "blocks" are similar in functionality to chunks. FLAC stores embedded images in a dedicated block called the PICTURE block. It supports ReplayGain.
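FLAC's block headers are compact enough to enumerate in a few lines. A sketch in stdlib Python that lists the block types in a file without decoding their payloads:

```python
BLOCK_TYPES = {0: "STREAMINFO", 1: "PADDING", 2: "APPLICATION",
               3: "SEEKTABLE", 4: "VORBIS_COMMENT", 5: "CUESHEET", 6: "PICTURE"}

def list_flac_blocks(data: bytes):
    """Return the metadata block types in a FLAC file, in order.

    After the 4-byte "fLaC" marker, each block header is one byte
    (high bit = last-block flag, low 7 bits = type) followed by a
    24-bit big-endian payload length.
    """
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC file")
    blocks, pos = [], 4
    while True:
        header = data[pos]
        block_type = header & 0x7F
        length = int.from_bytes(data[pos + 1:pos + 4], "big")
        blocks.append(BLOCK_TYPES.get(block_type, "UNKNOWN"))
        pos += 4 + length
        if header & 0x80:  # last-metadata-block flag set
            break
    return blocks
```

A Vorbis Comment block found this way can be handed to a Vorbis comment parser as-is, since FLAC reuses the layout unchanged (minus the Ogg framing bit).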
Within the digital audio world, APE is often misunderstood. Is it a codec? An MP3 file? A metadata format? The answers are yes, no, and sort of.
APE is first and foremost a file format created by a software application called Monkey's Audio. APE is a proprietary, lossless audio codec intended to compete with MP3 (MPEG-1/MPEG-2 Layer 3). APE files have their own metadata schema, yet they are also designed to be compatible with ID3 metadata. Herein lies the point of confusion.
Just like ID3, the APE format has two (2) generations of metadata tags. ID3v1 and v2 are very different. APEv1 versus v2 differences are more subtle. One of the most important relates to how and where these metadata formats are stored in an MP3 file. Where is MP3 metadata located?
- ID3v1: end of file
- ID3v2: beginning of file
- APEv1: end of file
- APEv2: can be anywhere, but normally placed at end of file
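Those fixed locations make end-of-file tags easy to detect without a full parser. A sketch that checks for the ID3v1 marker and the APE footer preamble (it reports presence only; parsing the fields is a separate job):

```python
def tags_at_end(data: bytes):
    """Report which end-of-file tag structures are present.

    ID3v1 is a fixed 128-byte block whose first three bytes are "TAG";
    an APE tag ends with a 32-byte footer that begins "APETAGEX". When
    both exist, the APE footer sits just before the ID3v1 block.
    """
    found = []
    body = data
    if len(body) >= 128 and body[-128:-125] == b"TAG":
        found.append("ID3v1")
        body = body[:-128]  # an APE footer, if any, precedes the ID3v1 block
    if len(body) >= 32 and body[-32:-24] == b"APETAGEX":
        found.append("APEv1/v2 footer")
    return found
```

This is exactly how many taggers decide what to strip or update before appending their own tag.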
As an audio format, APE is rarely encountered. While it offers superior file compression compared to other lossless legacy formats such as WAVE, APE file players are harder to find, and there are other challenges, such as spotty third-party application support for APE's metadata. Ironically, APE's metadata is rather similar to ID3, and at one time that seemed like a good reason for many users to abandon ID3v2 in favor of APEv2. Due to ID3v2's bumpy launch, in the early 2000s it looked as if APE might wrest ID3's crown from it as the most prominent MP3 metadata standard. As you probably know, this never happened, but it took several years for ID3v2's developers to iron out its kinks. In the interim, APE tags grew substantially in popularity. Ultimately, ID3 sorted itself out, and the limited scope of APE's structure yielded the crown back to ID3 (v2.2+).
Re-capping some important points here:
- MP3 files are most likely to use ID3 tags for metadata, but may use APEv1 tags instead
- Conversely, APE files may contain ID3v1 tags
- APEv2 tags should not be used in MP3 files
Another difference between ID3 and APE pertains to their supported character sets. APEv1 stores tags and their values as ASCII text (a subset of UTF-8), in a similar fashion to Vorbis Comments. APEv2 can handle the full UTF-8 character set (like Vorbis Comments and Matroska). ID3v1 is limited to ISO-8859-1 (Latin-1), while ID3v2 is capable of using multiple character sets, with a default of UTF-16. And, as a final word on ID3v2 vs. APEv2: the latter has poor picture support. ID3v2 is better at handling embedded images such as cover artwork.
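The practical impact of these character sets is easy to see by encoding the same title each way:

```python
title = "Café"

latin1 = title.encode("latin-1")    # ID3v1's fixed character set
utf8 = title.encode("utf-8")        # APEv2, Vorbis Comments
utf16 = title.encode("utf-16-le")   # a common ID3v2 text encoding

assert len(latin1) == 4   # one byte per character, Western European repertoire only
assert len(utf8) == 5     # "é" takes two bytes in UTF-8
assert len(utf16) == 8    # two bytes per character

# Characters outside Latin-1 simply cannot be stored in an ID3v1 field:
try:
    "日本語".encode("latin-1")
except UnicodeEncodeError:
    pass  # this is why ID3v1 tags garble non-Western titles
```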
Vorbis I (2000)
Vorbis I is a free, open-source lossy audio codec. Vorbis (officially Vorbis I) is often confused with the metadata schema of a very similar name, Vorbis Comments. Vorbis competes with MP3 and AAC for music. The fact that Vorbis supports Vorbis Comment tags for metadata doesn't help the confusion.
Dolby Digital (1999)
Dolby Laboratories has two (2) types of audio technical metadata embedded in its audio codec streams: Dolby Digital and Dolby Digital Plus. These aren't directly applicable to consumers for the most part. Dolby Digital AC-3 was the first concerted effort by Dolby to incorporate metadata into an audio stream that has a direct and significant impact on sound reproduction by the decoder. There are 27 metadata characteristics in every AC-3 stream. An interesting side-effect of the process is if a stream's metadata is corrupt or missing, the decoder will use the parameters from the last good set of metadata it received and apply those guidelines to the current stream. For the average person, this fact is inconsequential. However, it is worth noting if you enjoy manipulating audio data streams at the lowest levels.2
WavPack (pronounced "wave pack") is an open-source audio codec capable of operating in either lossy or lossless mode. It has some peculiar characteristics and is not particularly popular. WavPack supports ID3v1 and APEv2 tags for metadata (including ReplayGain).
You can learn more about WavPack from this article: How Audio Files Work: Codecs and Containers.
Advanced Audio Coding (AAC) is a lossy audio codec and a popular alternative to MP3. Developed under the MPEG umbrella as MP3's designated successor (and later popularized by Apple through iTunes), AAC offers superior sound quality to MP3 at comparable bit rates. Ironically, both MP3's last iteration and AAC's first iteration stem from MPEG-2, and both the ADTS and ADIF container types for AAC date from that era. When MPEG-4 was developed, the AAC format was tweaked, and today MP4 containers are another option under the MPEG-4 umbrella.
AAC has a considerable list of variants. Some forms fall under the 3GPP standard instead of MPEG-2 or MPEG-4. The 3GPP varieties were developed to address demand for low-bandwidth applications such as streaming to smartphones and similar devices with limited network bandwidth and limited ability to play back audio content with any modicum of fidelity. Their streams may be multi-channel, but are most often stereo or even mono. The MPEG-4 versions of AAC, on the other hand, are suitable for more complex applications, such as multi-channel surround sound and playback on higher-fidelity systems such as home theaters. These facts have a direct impact on applying metadata to AAC files.
AAC's use of metadata is not straightforward. While officially AAC has no metadata, a de facto format has been established, as with MP3; in this case, Apple's tagging format (à la QuickTime). Tag-value pairs are stored in a particular "atom" (what Apple calls chunks) as UTF-8 characters when the form is the default AAC profile (AAC Low Complexity, or AAC-LC). If the AAC file consists of an ADTS or ADIF container, there is none. If it's a higher-complexity type (such as a 3GPP file/container) there may be some, but it is very limited. A better solution is to forego applying any metadata at the AAC data level and compile the stream inside an MP4 (MPEG-4) or Matroska container type. This method allows a wide selection of metadata to be associated with the file indirectly. Matroska has the added advantage of permitting multiple AAC streams to be muxed into the same file. For example, an entire album can be stored inside a single Matroska file, complete with individual song separation and independent metadata blocks for each song.
Though similar in many respects, AAC and MP3 are incompatible with one another.
Audio Data Transport Stream (ADTS) is an audio container type designed specifically to stream AAC audio content following the MPEG-2 standard. Its frame-header design was modeled on MP3's, and it represents MPEG's early approach to packaging AAC audio content.
ADTS files consist of a very small header, plus the raw AAC audio data. It has no metadata capabilities.
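That small header is still worth knowing how to read. A sketch that pulls two fields from the fixed 7-byte ADTS header (offsets per the ADTS framing used for AAC):

```python
def parse_adts_header(frame: bytes):
    """Extract a few fields from a 7-byte ADTS frame header.

    ADTS frames begin with a 12-bit 0xFFF syncword; the 13-bit frame
    length (header included) is spread across bytes 3-5.
    """
    if frame[0] != 0xFF or (frame[1] & 0xF0) != 0xF0:
        raise ValueError("missing ADTS syncword")
    mpeg_version = "MPEG-2" if frame[1] & 0x08 else "MPEG-4"
    frame_length = ((frame[3] & 0x03) << 11) | (frame[4] << 3) | (frame[5] >> 5)
    return mpeg_version, frame_length
```

Everything a decoder learns from ADTS is technical: version, profile, sample rate, frame length. There is simply no field anywhere in the header for a title or an artist, which is the concrete sense in which ADTS "has no metadata capabilities."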
BWF, RF64, and BW64 (1997/2007/2015)
These broadcast (think TV or cable providers) container formats all use the iXML metadata standard alongside RIFF's own metadata for structured metadata, and follow various standards for technical data.3 BW = Broadcast Wave. BWF = Broadcast WAVE File. RF = RIFF File. BWF is based on RIFF. BW64 and RF64 are 64-bit flavors of BWF with BWF's 4 GB file size limit removed; their files may exceed that boundary through the use of specialized size chunks.
With the exception of WAVE files, RIFF-based files are considered obsolete. RF64, the most recent iteration, was deprecated in 2018.
BW64 and RF64 utilize the Audio Definition Model (ADM) for metadata. ADM provides UTF-8 character encoded XML, applied inside various specified chunks, akin to the RIFF standard.
From a metadata standpoint, the file types are virtually identical with the exception of chunk names and how the chunks are arranged. Otherwise, they all use XML. You can read more about these files in the related article, How Audio Files Work: Codecs and Containers.
ADM is a technical metadata standard and provides no contextual metadata service. The RF64 file type has a bit more flexibility in terms of contextual data, as it does permit incorporation of the very limited metadata found in RIFF files.
Musepack is an open source, lossy audio-only codec typically associated with the .mpc file extension. Also known as MPC and MPEGplus, Musepack supports the optional use of contextual metadata by allowing APEv2 tags to be embedded as "chapter tag packets."4
Official Musepack software development has been dormant since 2011.
The Au file type (.au file extension) is an audio codec invented by Sun Microsystems in 1992. It and .snd are the only official UNIX Audio file formats, though today other formats are widely supported on UNIX platforms.
Au is a legacy file type. It contains raw audio data and has no metadata.
SND (or .snd) stands for SouND. It is an audio file type developed by Apple intended to resemble the behavior of the Au UNIX file format. Like Au, SND has no metadata. Its exact origin timeframe is unclear. It's extremely unlikely you'll ever encounter an .snd file, but they do exist.
WAVE (.wav file format) is another joint venture between Microsoft and IBM from the early 1990s. Sometimes referred to as RIFF/WAVE, the WAVE standard is a more user-friendly, audio-only iteration of the RIFF multimedia container format. While it is also a container format, WAVE differs from RIFF in slight but important ways. WAVE files may contain only uncompressed audio data. WAVE has a more robust set of structured metadata than RIFF and theoretically supports ID3 and XMP, though the latter may not function correctly due to a long-standing bug in WAVE's implementation of the RIFF standard.5
WAVE files have a limited set of structured metadata stored as sub-chunk IDs of the INFO chunk (per the RIFF standard). Technically they are not called "tags" per se, but functionally, and for the purposes of this discussion, they are the same thing. WAVE file reading applications should ignore any metadata chunks they don't understand or can't interpret.
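As a sketch of how those INFO sub-chunks are framed (IDs like INAM for title and IART for artist, little-endian sizes, null-terminated strings, word alignment; the helper names here are mine, and the demo bytes are synthetic):

```python
import struct

def parse_info_list(chunk):
    """Parse the payload of a RIFF LIST chunk of form type 'INFO'.

    Layout: the 4-byte form type 'INFO', then a sequence of sub-chunks,
    each a 4-byte ID + 4-byte little-endian size + data, padded to an
    even byte boundary. Values are null-terminated strings.
    """
    assert chunk[:4] == b"INFO"
    tags, offset = {}, 4
    while offset + 8 <= len(chunk):
        cid, size = struct.unpack_from("<4sI", chunk, offset)
        value = chunk[offset + 8 : offset + 8 + size].rstrip(b"\x00")
        tags[cid.decode("ascii")] = value.decode("utf-8")
        offset += 8 + size + (size & 1)   # word alignment: pad odd sizes
    return tags

def info_subchunk(cid, text):
    """Serialize one INFO sub-chunk, including the alignment pad byte."""
    data = text.encode("utf-8") + b"\x00"
    return struct.pack("<4sI", cid, len(data)) + data + b"\x00" * (len(data) & 1)

demo = b"INFO" + info_subchunk(b"INAM", "My Song") + info_subchunk(b"IART", "Somebody")
tags = parse_info_list(demo)
```

The even-byte padding rule is the detail most ad-hoc readers get wrong; a parser that ignores it drifts out of alignment on the first odd-length value.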
Audio Interchange File Format (AIFF) is a lossless, uncompressed raw audio (PCM) file container format created by Apple in 1988 as a competitor to Microsoft's RIFF standard. You can think of AIFF files as WAVE files for macOS and iOS. AIFF is a chunk-based format, just like RIFF. They are both derived from the ancient IFF standard. AIFF-C (more commonly known as AIFC) is a closely related standard. AIFC files (Audio Interchange Format Compressed) are simply compressed AIFF files.
The remainder of this article explores various multimedia codec, container, and file type metadata standards sorted with the most recent first.
The majority of video codecs and containers are sponsored by MPEG or the ITU Telecommunication Standardization Sector (ITU-T), a technical standards body of the International Telecommunication Union (ITU). The ITU is an international body sanctioned by the United Nations.6 ITU-T is a sub-group of the ITU, and is responsible for coordinating worldwide standards for telecommunications and cybersecurity. The so-called "H-" standards are video compression codecs endorsed by the ITU-T. These are not independent standards. Rather, the H. designation should be thought of as a shorthand way of referring to video codec standards created by other bodies and adopted by the ITU-T. For example, H.264 was developed by MPEG.
This chart correlates multimedia metadata formats to the metadata types they support and the standards they are compatible with.
Table: Multimedia Metadata Formats and the Metadata Standards They Support
AV1 Image File Format (AVIF) is an image container format. AVIF has provisions for technical metadata only (referred to as Properties). While the AVIF standard alludes to structured metadata in several places, no structured metadata is actually defined. Instead, the specifications indicate structured metadata is reserved for future use, and that a decoder encountering a metadata object should ignore it. In effect, AVIF's object-oriented decoding methodology leaves the door open for dark metadata to be inserted.5,6
HEIF (2015) & HEIC (2017)
High Efficiency Image File Format (HEIF) is a proprietary, partially open image container format (free for personal use). HEIF supports animation (timed image presentation) and offers roughly 2x better data compression compared with its predecessor, H.264 (MPEG-AVC). HEIF permits the inclusion of XMP, Exif, and MPEG-7 metadata types.9 HEIF files almost always contain data encoded with some form of HEVC (High Efficiency Video Coding); better known as H.265. HEVC or H.265 is a codec.
HEIC (2017) is an adaptation of HEIF created by Apple, Inc. HEIC stands for High Efficiency Video Coding in HEIF, or HEVC in HEIF. You may think of HEIC as a branded version of HEIF, just like Kleenex is a registered brand name of facial tissue.
Other variants of HEIF containers combined with specific video codecs include Advanced Video Coding (AVC in HEIF, or AVCI) and AV1 Image File Format (AVIF, or AV1 in HEIF). Since these are all HEIF variants, they may contain the same types of metadata at the container level. This distinction is important, as codecs embedded in containers may optionally contain their own metadata in addition to the container holding them.
Daala is a free, open-source video codec intended to replace Theora. While Daala describes itself as a next-generation video codec superior to H.265, its development appears to have sputtered shortly after take-off; although technically still in active development, it does not appear to have seen earnest attention for several years, and its prospects for longevity appear dim. Daala does not have any metadata of its own. It must reside inside an Ogg container, which allows for Vorbis Comments; however, it must be stressed that any such metadata is at the container level and part of Ogg.
H.265 is a video codec, better known as HEVC. It is an implementation of MPEG-H Part 2 (ISO/IEC 23008-2). H.265 is a high efficiency coding and media delivery standard designed to accommodate different environments. Like its predecessor H.264, while capable of storing technical metadata, it has no standard means for doing so, and what it can store is arbitrary, normally highly technical, and of little or no use to an end-user.
WebM allows some structured metadata stored at the end of the file. The following metadata standards are supported: XMP, Exif, iCCP, and a limited set of Matroska tags. WebM does not have official metadata of its own.11
All WebM metadata must be recorded in UTF-8 text format.
H.264 is a video codec. You may also know it as Advanced Video Coding (AVC) or MPEG-4 Part 10. All of these names refer to the same codec. Regardless of what you choose to call it, H.264 has metadata capability, but it is very limited and should be considered dark metadata.
The MPEG-4 Part 10 standard is designed for streaming, and includes as part of its architecture two (2) structures that can be used for metadata: Supplemental Enhancement Information (SEI) and Video Usability Information (VUI). If used at all, these contain technical metadata about the video stream. SEI and VUI data are arbitrary and are inserted into the stream; to read them, you'd have to know they are there and when and where to expect the data.
Regardless, there is no metadata standard for H.264. This means whatever tool you use to apply metadata to an H.264 stream, you're probably only going to be able to read it with the same application. This means tagging and playback applications will generally ignore it because they either won't be looking for it or won't understand it.
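To make the "you'd have to know it's there" point concrete: in an H.264 Annex B byte stream, NAL units are delimited by 3- or 4-byte start codes, and the low five bits of the byte that follows give the NAL unit type (6 is SEI; 7 is the SPS, which carries any VUI). A rough scanner over a synthetic stream might look like this (a sketch, not a production parser):

```python
def nal_unit_types(stream):
    """Return the NAL unit types found in an H.264 Annex B byte stream.

    NAL units are separated by 0x000001 (or 0x00000001) start codes;
    the low 5 bits of the next byte give the NAL unit type. Type 6 is
    SEI (Supplemental Enhancement Information), 7 is SPS (which may
    carry VUI), 5 is an IDR slice.
    """
    types, i = [], 0
    while True:
        i = stream.find(b"\x00\x00\x01", i)
        if i == -1 or i + 3 >= len(stream):
            break
        types.append(stream[i + 3] & 0x1F)  # nal_unit_type
        i += 3
    return types

# Synthetic stream: SPS (7), SEI (6), IDR slice (5) headers, no payloads.
demo = b"\x00\x00\x01\x67" + b"\x00\x00\x01\x06" + b"\x00\x00\x01\x65"
```

A generic tag reader has no reason to run a scan like this, which is exactly why SEI-stored metadata tends to go unread.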
3GP or 3GPP is shorthand for 3rd Generation Partnership Project, a mobile device and telecommunications industry consortium responsible for helping develop the Enhanced AAC Plus (AAC+) and AMR codecs, and defining the 3GP file format.
3GPP metadata is structured. It must either be XML (using pre-defined fields and value types) or follow the ID3v2 standard, or both. It may optionally be encrypted. The scope of proprietary (structured) 3GPP metadata is limited. 3GP files (.3gp, .3gpp) may have the metadata fields shown in the accompanying chart as null-terminated text fields.12
Multi-frames are NOT permitted.
In the interest of brevity, only common Matroska tags are listed below for reference. The entire list may be found here. Alternatively, you may download a PDF list of the tags here. Another useful reference is Matroska's comprehensive comparison with other tagging standards found in this table.
Matroska is one of the most popular container formats. Why? If you had to choose one word that vastly differentiates Matroska from every other multimedia container type, it would be metadata. Matroska fills a gap still left wide open by almost every other digital multimedia file type. Open source and very widely supported, it is capable of nesting multiple video and audio streams, and/or other containers. Matroska (.mkv/.mka) uses a hybrid of structured and unstructured (contextual) and technical metadata, and at the same time any containers and codecs inside it may carry their own metadata as well. This raises some interesting scenarios, such as the possibility of multiple codecs or independent streams providing independent contextual metadata to a playback device.
Matroska's metadata handling is a bit of a split personality. It has strict formatting restrictions pertaining to its technical metadata, and certain technical tags that could arguably be considered contextual are nonetheless treated like technical metadata fields with regard to strict syntax enforcement. Like Vorbis Comment, Matroska supports free-form tags, yet it strongly discourages their use. Matroska's developers clearly have a preference for structured metadata (not a bad thing). Matroska even has its own framework, EBML (Extensible Binary Meta Language), which must be used when storing official Matroska tags.
Matroska's strength lies in its robust structured metadata keys, which it refers to as elements. An "element" is a key or tag. Matroska's structured tagging system can be quite confusing. For better or worse, it utilizes a concept of nested tags (meaning tags can be inside other tags). When nested, lower level tags are related to the upper level they fall under.
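Under the hood, every Matroska element is serialized as EBML: an element ID, then a variable-length size, then the payload, with child elements nested inside a parent's payload. The width of an EBML number is signaled by the leading zero bits of its first byte. A minimal decoder for the data-size form (marker bit stripped) looks roughly like this, as a sketch:

```python
def read_ebml_size(data, offset=0):
    """Decode an EBML variable-length data size starting at `offset`.

    The number of leading zero bits in the first byte, plus one, gives
    the total byte width. The marker bit is stripped, and the remaining
    bits form a big-endian integer. Returns (value, bytes_consumed).
    """
    first = data[offset]
    width, mask = 1, 0x80
    while mask and not (first & mask):
        width += 1
        mask >>= 1
    if mask == 0:
        raise ValueError("invalid EBML varint (first byte is zero)")
    value = first & (mask - 1)              # strip the marker bit
    for b in data[offset + 1 : offset + width]:
        value = (value << 8) | b
    return value, width
```

For example, the single byte 0x81 encodes the size 1, while the two-byte sequence 0x40 0x02 encodes the size 2; both spellings are legal, which is part of what makes hand-parsing Matroska confusing.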
Another twist to Matroska's metadata menu is its ability to include special embedded tags underlying codecs won't be aware of on playback, but which Matroska will recognize. An example is a chapter marker. The underlying content decoder never sees it, but a Matroska playback device does. In fact, it's not uncommon for an MKV file to contain multiple iterations of metadata with each one tied to a particular codec or another container nested inside the Matroska parent container.
Notice anything strange about the list of tags above? There's one very common musical tag that is missing: Album. Why? This seems like such an obvious miss; why would Matroska's developers do this? How could they forget? Well, they didn't. This is an issue that highlights my previous comment that Matroska's metadata schema is confusing. It doesn't operate like nearly every other tagging schema from other file formats. Figuring out how to tag your file with Album information is a prime example of this. Instead of an album tag, the album name falls under a sub-category of metadata Matroska calls Target Types.
Matroska's documentation reads in part, "TargetType element allows tagging of different parts that are inside or outside a given file." What does that mean exactly? Recall I mentioned previously that Matroska may contain nested content (multiple files inside of itself) and it may contain content-related signals inside the MKV container as well, which the underlying content doesn't see. Target Types (sort of) work the same way. TargetType values store information that describes a portion of the content inside the Matroska container. Imagine a chapter marker - the example I gave previously - only these indicators can be other things as well, though the list is limited. Album is one of them, though of course it's only applicable to music. This is one of the nuances of TargetType. It doesn't need to know if the content it is describing is music, a movie, or something else. There is flexibility built-in. However, this can make the process of decoding what the TargetType means a challenge for playback devices, tag readers, and tag writing programs.
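To make this concrete: tools in the MKVToolNix family accept global tags as an XML document, where a TargetTypeValue of 50 marks album/movie-level scope and 30 marks track/song-level scope. A hypothetical tags file supplying an album title alongside a song title might look like the following (element names per the Matroska tagging specification; the titles themselves are placeholders):

```xml
<Tags>
  <!-- Album-level scope: TargetTypeValue 50 = ALBUM -->
  <Tag>
    <Targets>
      <TargetTypeValue>50</TargetTypeValue>
    </Targets>
    <Simple>
      <Name>TITLE</Name>
      <String>The Album Name</String>
    </Simple>
  </Tag>
  <!-- Track-level scope: TargetTypeValue 30 = TRACK/SONG -->
  <Tag>
    <Targets>
      <TargetTypeValue>30</TargetTypeValue>
    </Targets>
    <Simple>
      <Name>TITLE</Name>
      <String>The Song Name</String>
    </Simple>
  </Tag>
</Tags>
```

A file like this could then be applied with something along the lines of `mkvpropedit album.mkv --tags global:tags.xml` (filenames hypothetical). Notice that both entries use the same TITLE name; only the target level distinguishes "album title" from "song title."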
Matroska does support cover art (album/artist/movie thumbnails embedded as metadata). Matroska only supports PNG and JPEG images, stored as attachments. The technical details may be found here for those so inclined.
It's worth noting Matroska permits two (2) images per file: a standard size and a thumbnail size, referred to as the 'cover' art and the 'small cover' artwork. The pictures need not be identical (other than their size). Thus, you may have one type of image for a small thumbnail view and a different image for the larger view. Matroska supports square, portrait, or landscape modes, but a picture must be square or rectangular in shape.
The successor to RIFF, Advanced Systems Format (ASF) is a legacy audio/video streaming container format released by Microsoft in 1998.
ASF containers usually contain .wma and/or .wmv files (Windows Media Audio/Video). ASF treats the media content inside of it as objects, meaning it makes no distinction between audio and video media types. ASF was one of the first container types designed specifically for streaming. Instead of chunks like RIFF-style containers, or frames like many modern container types, ASF stores data as packets. This makes its storage processing somewhat analogous to network packets, which is of course conducive to streaming media across a network.
WMA (Windows Media Audio) files must be inside ASF (audio) or AVI (a/v) containers.
ASF has limited, proprietary metadata and purportedly supports ID3, however its implementation of the ID3 standard is awkward and non-conformist. ASF treats ID3 tags as "attributes," and thus they must be defined as attributes, meaning they don't behave like traditional ID3 tags. How? For starters, no validation is performed on ID3-style tags, so there is no guarantee of the quality or accuracy of reproducing the tags when the file is decoded. Second, within the context of ID3 tags, ASF supports ID3v1.x, ID3v2.2, ID3v2.3, and ID3v2.4; if the content was tagged with ID3v2.0, ASF is unlikely to handle the tags correctly.11,12 Another difference regarding ID3 tags in ASF containers is how they are named. ASF does not support proper ID3 naming specifications. Instead, tags must be prefixed with "ID3/" in the format "ID3/tag-name." For example, "ID3/TPE1" is equivalent to the ID3 tag "TPE1," or the Windows Media/ASF attribute called "Author" (no quotes, of course). Likewise, to represent the AlbumArtist attribute you'd use either "WM/AlbumArtist" (the Windows Media/ASF standard attribute nomenclature) or the ID3 equivalent, "ID3/TPE2." ASF Content Descriptors may theoretically contain other (non-WM/ASF, non-ID3) forms of metadata, however it should be considered dark metadata, as there is no guarantee a reading/decoding application will understand it.
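That dual naming can be modeled as a simple lookup. The WM/ attribute names and ID3 frame IDs below are real (documented Windows Media attributes and ID3v2.3 frames); the mapping table and helper function are purely illustrative:

```python
# Illustrative mapping between Windows Media/ASF attribute names and
# their ID3v2.3 frame equivalents, per the naming convention above.
WM_TO_ID3 = {
    "Author":         "TPE1",  # lead performer
    "WM/AlbumArtist": "TPE2",  # album artist
    "WM/AlbumTitle":  "TALB",  # album title
    "WM/TrackNumber": "TRCK",  # track number
}

def asf_attribute_names(wm_name):
    """Return both spellings under which an ASF reader might store the
    same logical field: the WM/ASF attribute name and the ID3/-prefixed
    frame name."""
    return wm_name, f"ID3/{WM_TO_ID3[wm_name]}"
```

So a tagger writing the album artist might set "WM/AlbumArtist", "ID3/TPE2", or both, and a robust reader has to check for each spelling.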
ASF Metadata Object Types
ASF has its own definition of metadata object types (defined in the ASF specification):
- Content Description Object
- Content Branding Object (some URL storage; not useful)
- Extended Content Description Object
- Metadata Object
- Metadata Library Object
These "object types" are free-form text entry fields. With a minimum size of 26 bytes, most have a maximum length of 65,535 bytes. The Extended Content Description Object and Metadata Library Object types are exceptions to the length limit: their maximum length is 16 exbibytes! The Metadata Library Object includes a GUID (Globally Unique IDentifier) and may be stream-specific and/or language-specific (unlike all the others, which don't care about language). Content Descriptors are stored as Name/Value fields.
WCHAR stands for "Wide Character." It is a Microsoft implementation of 16-bit Unicode text encoded as UTF-16LE, the native character type on Windows operating systems.15 WORD, DWORD, and QWORD all refer to fixed-width integer storage. Historically, in computer science a "word" was a 16-bit string; on Windows, a WORD is 16 bits, a DWORD is 32 bits, and a QWORD is 64 bits. The bottom line for ASF is that objects allowing these types are more flexible with regard to the values they can hold per byte within the data stored in the object, and those values don't have to be intelligible from a language perspective (though, obviously, that is what most people use these fields for). What does this mean for you? Probably nothing. These objects are designed to hold up to 65,535 characters per value for each object. If you need more space than that, perhaps you need to re-evaluate your digital library plans.
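The practical effect of WCHAR storage is that every character costs two bytes, plus a two-byte null terminator. A minimal sketch of the encoding arithmetic (the helper name is mine; the terminator convention follows typical ASF attribute-name storage):

```python
def to_asf_wchar(text):
    """Encode a string the way ASF stores WCHAR data: UTF-16LE with a
    terminating null character. Each character, including the
    terminator, occupies two bytes."""
    return (text + "\x00").encode("utf-16-le")

# "WM/AlbumArtist" is 14 characters, so it occupies 2 * (14 + 1) = 30 bytes.
name = to_asf_wchar("WM/AlbumArtist")
```

This is why ASF attribute sizes on disk are always even and always slightly larger than the visible string length suggests.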
The Metadata Library Object type also permits a GUID (Globally Unique IDentifier) data type.
Extended Content Description Object
Note in particular the last three objects in the table above. They are all free-form text; you can place anything you like in there. This presents an interesting conundrum. On one hand it's a positive, because it means it is possible to apply all sorts of nifty descriptive metadata to ASF files if you'd like. The caveat is that doing so will quite likely leave those tags readable only by the application that wrote them.
You may have noticed that various other websites claim all sorts of metadata tags are applicable to ASF files. The fact is, that is only true if one uses the same application that purports to write or read them. The information in the table above is taken directly from the most recently published ASF Specification (version 1.20.03, circa 2004). No other list of metadata for ASF is "official." Caveat emptor.
As an example, Microsoft's Windows Media Player natively supports a set of arbitrary metadata stored in ASF files. This is probably as close to an expanded, universal list of metadata for ASF files you will find simply given Microsoft's support being behind it. These fields are readily identifiable as Microsoft creations by virtue of their "WM/" prefix, where WM stands for, "Windows Media."16
Windows Media Player Proprietary ASF Metadata
Remember, these fields are arbitrary as they are contained in one of the three (3) free-text objects in an ASF file. However, these can be considered part of an informal standard (Windows Media). These are all string types except for WM/Picture and WM/Text which are binary. WM/Text is a null terminated string.17
- WM/Text [comments???]
- WM/Track [track number]
- WM/TrackNumber [track number]
- WM/PartofSet [disc number]
- WM/Mood [mood]
- WM/Picture [embedded artwork]
- These tags support multiple values (delimited by the | character):
- WM/Mood [mood]
- WM/Picture [embedded artwork]
- These tags are compatible with ID3 in .wma and .wmv files:
- WM/Picture (APIC in ID3)
Note: ASF does not validate ID3 formatted tags
Another key (tag) called WMT_STORAGE_FORMAT indicates if the ASF file metadata is encoded as ASF or ID3. By default, ASF type is presumed if this key is not present or not defined. Either way, ID3 and MP3 storage formats are supported for reading only, and not writing. The exact representation is actually:18
- WMT_Storage_Format_MP3 [MP3 format]
- WMT_Storage_Format_V1 [Windows media format]
If you are using a Windows based computer, you're in luck, as Windows plays back ASF files and reads their metadata via an internal tool called the WM ASF Reader Filter.10
Ogg is a free, open-source container format. Ogg may hold content created with FLAC, Vorbis I (Vorbis), Theora, Opus, and other similar codecs. Ogg has its own Skeleton metadata and supports FLAC metadata blocks. Cover art is supported via FLAC metadata.
Audio Video Interleave (AVI) is another legacy multimedia container format. Like RIFF, it was created by Microsoft in the early 1990s. In spite of its age, AVI remains popular for video archiving. Although its predecessor RIFF format has an overall 4 GB size limit, AVI is fully capable of using nested chunks to expand its storage size maximum well beyond 4 GB. However, software bugs in early AVI implementations hindered this ability, limiting the outer RIFF chunks to approximately 2 GB. Therefore, when working with very old hardware, one should be cognizant of this issue.
Through chunk nesting, AVI can achieve a practically unlimited file size (the largest possible AVI file would exceed the capacity of any disk currently available, if one could create such a file). It works by nesting multiple depths of AVI chunks. A sequence of RIFF chunks is written such that the first has an ID of "AVI " (note the space character at the end) and subsequent nested RIFF chunks have an ID of "AVIX." RIFF chunk IDs must be four (4) bytes in length.
When it comes to AVI metadata, arbitrary metadata may be applied inside any of the root or nested chunks so long as it follows the INFO chunk protocol defined by RIFF. It is common to find XMP chunks placed in AVI files, particularly in any of the outer RIFF chunks.
For detailed information, start with the latest AVI specs, followed by a review of the AVIX Container File Format Extension (1996) and/or the XMP Part 3 specification (2020).
AVI's architecture is an offshoot of RIFF, and its metadata schema is similar to that used by WAVE (INFO chunk and sub-chunks).19 For practical purposes, it is wise to stick with a common RIFF schema such as WAVE's metadata schema in order to retain the highest level of compatibility among tag reading applications. AVI is a very versatile container format, particularly for audio content. The following audio file types may be incorporated into an AVI container: FLAC, DTS, AMR, MPEG-1 Audio Layer I, Layer II, Layer III; AAC, AC3, WMA, Opus, ALAC, WMA Lossless, LPCM, u-law PCM, A-law PCM.
QuickTime is a proprietary MPEG-4 multimedia container format created by Apple, Inc. QuickTime uses the Apple iTunes tagging format. Therefore, its metadata schema is explained fully in that section of this article.
Resource Interchange File Format (RIFF) is a legacy multimedia container format. It supports a very limited set of metadata. RIFF is a chunk-style format, and may contain audio and/or video. RIFF files are limited to ~4 GB in size, including any metadata chunks. RIFF doesn't have a defined group of metadata like most older multimedia file formats. It only defines where metadata must be stored, which is in a specially designated chunk called INFO. Anything else amounts to dark metadata. If you have a reason to want to insert dark metadata into a file, a RIFF-based file format such as WAVE or AVI may be a good choice, depending on your other needs.
The most recent RIFF specification is Version 3 (1994).
AVI (.avi) and WAVE (.wav) files are implementations of the RIFF standard. WAVE especially is sometimes represented as a "WAVE/RIFF" file type.
Interchange File Format (IFF) is the forefather of every multimedia file type. Invented by Electronic Arts, Inc. in 1985, IFF was the first concerted attempt within the commercial computer industry to create a universal file format for the express purpose of sharing images, audio, and text between computers of different types and brands. Prior to IFF, it was virtually impossible for computers manufactured by different companies to exchange data via files. IFF bridged this gap and introduced the concept of chunk-based file formats.
These are certain types of metadata that - as a consumer - you should never consider using. Why? They were invented by and for the professional broadcast industry. Unless you're going to need metadata for that purpose, they are pointless for your usage. Note in this context, "broadcast" does not mean streaming. It means - literally - broadcasting. Such as transmitting real-time television feeds up and down terrestrial-to-satellite links. That is broadcasting. Not a YouTube LiveStream or Twitch, etc. You will probably never encounter these metadata formats, but if you do, now you'll know what they are!
All of these metadata schemas are obscure.
aXML is an obscure European Broadcasting Union (EBU) metadata standard for broadcast audio.
EBU 3285 Supplement 5 defines aXML, which takes its name from an XML expression of the Dublin Core-based core audio descriptive metadata standard. An aXML document may be of any length (up to a container's limits), may appear in any order within a container, and may be used in RIFF containers within the INFO chunk.
ATSC 3.0 (2017)
ATSC 3.0 - also known as "NextGen TV" - is a broadcast multimedia standard created by the Advanced Television Systems Committee (ATSC), an international multimedia broadcast standards body. ATSC 3.0 leverages other existing multimedia standards and technologies, and simply bundles them together for the purpose of standardizing their use within the terrestrial broadcasting industry. ATSC 3.0 aggregates a foundation of three (3) popular standards to create a unified standard. It encompasses AC-4 Dolby audio, HEVC (v1/v2), and MPEG-H.
ATSC 3.0 is effectively an effort to standardize the broadcast industry's application of 3D sound and 4K video. From a technical perspective, it is a fully digital, object-oriented approach to the broadcast streaming of multimedia content. A downside of this approach is it requires complementary hardware support on the receiving end. For example, consumers must have corresponding compatible set-top-boxes or so-called "Smart" TVs in order to receive ATSC 3.0 content. The standard is not backward compatible, and represents a concerted effort within the broadcast industry to modernize across the board.
ATSC 3.0 doesn't have its own metadata. Instead, it relies upon metadata embedded within the objects it carries. ATSC 3.0 should be thought of as a transport layer. It is not a codec or container. It simply specifies how digital data will be delivered and shared.
Audio Definition Model (2014)
The Audio Definition Model (ADM) is an open standard audio metadata model. Designed as an extension to the BWF and BW64 audio container formats used in broadcast distribution and production, ADM is an XML-based general description metadata model. It describes technical characteristics of object-based, scene-based, and channel-based audio. ADM may be included in BWF/BW64 WAVE files (delivered inside a special "AXML" chunk) or used as a stand-alone streaming format.
Designed specifically for the BW64 audio file format, ADM is a technical metadata format that defines multiple loudspeaker positions for the replay of channel-based, scene-based, and object-based audio.
The ADM standard is maintained by the ITU and EBU, where it has historically been managed across several document trees:
- EBU TECH 3364 1.0 (2014)
- ITU-R BS.2076 Version 0 (2015)
- ITU-R BS.2076 Version 2 (2019)
- Workflow for ADM metadata inside BW64 files: ITU-R BS.2388 Version 3: Usage guidelines for the audio definition model and multichannel audio files (2018)
- BW64 technical metadata (not contextual) extension: ITU-R BS.2076: Audio Definition Model renderer for advanced sound systems (2019)
The Broadcast Exchange Format (BXF) is yet another SMPTE standard for data exchange in the broadcasting industry. BXF was developed to replace various archaic types of information exchange for playlists, record lists and other data in broadcasting. The current specification is BXF version 4.0 (2017). The first version was finalized in 2008 and published in 2009. Subsequent versions 2 and 3 were released in 2012 and 2015, respectively.
Like nearly all broadcast-type metadata standards, BXF is XML-based and primarily covers technical metadata. However, BXF is a bit more versatile than many other broadcast metadata schemas because it also has provisions for descriptive (content) metadata, though the latter is quite limited.
The BXF standard was last updated in 2017.
iXML is an XML based metadata format designed exclusively for BWF files. It is more flexible than the Broadcast Audio Extension chunk and can hold more information. It was introduced in 2004.
Material Exchange Format (2004)
Material Exchange Format or MXF is a broadcast container format. It stores metadata based on Key:Value pairs and may be structured and/or unstructured. MXF's structured metadata is regulated by various SMPTE standards and stored in a file header. Optionally, MXF files may contain pointers to external files containing descriptive metadata. The format also supports a metadata plugin called DMS-1 (Descriptive Metadata Schema).
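Under the hood, MXF serializes its key/value metadata (and the essence itself) as KLV triplets per SMPTE 336M: a 16-byte Universal Label key, a BER-encoded length, and then the value. As a hedged sketch, decoding just the BER length portion looks roughly like this:

```python
def read_ber_length(data, offset=0):
    """Decode a BER length as used by MXF KLV triplets.

    Short form: a single byte below 0x80 is the length itself.
    Long form: the low 7 bits of the first byte give how many
    big-endian length bytes follow. Returns (length, bytes_consumed).
    """
    first = data[offset]
    if first < 0x80:
        return first, 1
    count = first & 0x7F
    value = 0
    for b in data[offset + 1 : offset + 1 + count]:
        value = (value << 8) | b
    return value, 1 + count
```

For example, the byte 0x05 is simply the length 5, while the sequence 0x82 0x01 0x00 says "two length bytes follow" and decodes to 256. A full MXF reader repeats key/length/value over the whole file, dispatching on the 16-byte key.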
MXF is a huge standard in and of itself, and it is also interoperable with MPEG-7, TVA, and P/Meta. One of the most comprehensive resources on native MXF metadata structure is the Federal Agencies Digital Guidelines Initiative publication, AS-07: MXF Archive and Preservation Format Application Specification (2017).
Note the MXF metadata standard refers to the associated multimedia content - in whatever form - as "essence."
Descriptive Metadata Schema/DMS-1
SMPTE standard 380M defines a descriptive metadata framework applicable to SMPTE EG42 (the MXF specification) called DMS-1 (Descriptive Metadata Schema).
DMS-1 is a plugin used by MXF. It is designed to circumvent MXF's rigid structured metadata limitations. MXF (Material Exchange Format) is a broadcast, structured metadata format that normally does not allow free-form descriptive metadata. DMS-1 provides a means around this limitation while maintaining compliance with the SMPTE guidelines for MXF.
DMS-1 defines a descriptive, freeform metadata format. The standard is somewhat unique in that it specifically supports local tag values, provided they do not mimic an existing structured tag name already present in the MXF metadata header. Interestingly, one might at first suspect this makes DMS-1 dark metadata, however that is not the case. An MXF decoder supporting the DMS-1 plugin will parse DMS-1 tags in the MXF file header as if they were structured metadata. In other words, as long as the free-form metadata applied via DMS-1 conforms to the key/value formatting MXF expects in its header, and the corresponding keys don't already exist as standard (structured) MXF keys, the decoder will read these semi-structured contextual key/value pairs from the MXF header as well.
MXF is prone to dark metadata through purposeful or accidental use of incompatible extensions and plugins. This is perhaps not too surprising given the breadth of standards it is intended to support (see the MXF section above for more information). If a decoder follows the standard explicitly, but an encoding application does not, the effect often results in creating what is effectively dark metadata, with unpredictable results during playback/metadata reading.
P/Meta Semantic Metadata Schema (P/Meta) is a free, XML-based technical metadata format for broadcast use. Managed and maintained by the European Broadcasting Union (EBU), P/Meta may be used with MXF and BXF. The project began in 1999 and took four (4) years to come to fruition. Given that its technical specifications (e.g. EBU Technical Specification Tech 3295) are now dated (circa 2011) and the format never gained much popularity, it almost constitutes dark metadata in the sense that many modern applications are unlikely to recognize it.
Support for P/Meta ceased in 2011.
Advanced Authoring Format (2000)
Advanced Authoring Format (AAF) is a cross-platform video post-production and authoring schema that encompasses technical metadata only. Created by the Advanced Media Workflow Association (AMWA), AAF was later standardized under the Society of Motion Picture and Television Engineers (SMPTE) standards organization. Its founding members include Avid, BBC, Microsoft, Sony, and CNN.
AAF is a niche product that occupies a unique role in the broadcast media market. Of all the broadcast metadata standards mentioned in this article, AAF is probably the one you're least likely to encounter. AAF was designed for the video post-production and authoring environment, and as such it represents works in progress, whereas every other metadata format discussed here was built for exchanging finished media products.
A file format for professional cross-platform data interchange (like IFF), AAF stores its metadata in a file separate from the content, using a header object designed to facilitate sharing metadata between otherwise incompatible devices working on the same program material.
TeleVision Anytime - also known as TVA and TV-Anytime - is a now-defunct attempt at standardizing broadcast metadata for the purpose of facilitating storage and retrieval of multimedia content to and from consumer multimedia display devices. In other words, it was a metadata standard tied to broadcast content. Begun in 1999, as digital satellite television was beginning to reach critical mass in adoption, this global consortium of organizations sought to develop specifications enabling audio-visual and other services based on mass-market, high-volume digital storage in consumer platforms (i.e. DVRs).
Much like P/Meta, TVA appears to be dead. The most recent overview of TVA on the EBU's website is dated 2016, and the TVA standards have not been refreshed since at least 2012. This is another topic mentioned in this article for the sake of completeness. "It's dead, Jim."
The most recent update for AAF was released in 2011.24
Dublin Core (1995)
Dublin Core is a metadata framework.
The Dublin Core Metadata Initiative (DCMI) is the organization that owns the rights to Dublin Core. DCMI's self-stated mission is to "support innovation in metadata design and best practices across the metadata ecology." Basically, it's another metadata standards organization. It's mentioned here as a reference point because the term occasionally crops up in metadata discussions, and because Dublin Core is designed specifically to be incorporated within other metadata standards.
The original Dublin Core standard was codified under IETF RFC 2413: Dublin Core Metadata for Resource Discovery. Dublin Core stores structured semantic metadata in 15 pre-defined core elements. These core elements may contain multiple values (e.g. multiple artist names, actor names in a film, etc.). Although the original RFC was revised in 2007 (see RFC 5013: The Dublin Core Metadata Element Set), the so-called Core Elements remained the same. The full, current standard is available on DCMI's website.
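Because Dublin Core is meant to be embedded inside other standards, it is usually serialized as XML under the well-known dc: namespace. A minimal sketch of emitting the 15 repeatable elements (the outer metadata wrapper element is an arbitrary choice for illustration; the real wrapper depends on the host format):

```python
import xml.etree.ElementTree as ET

# The 15 Dublin Core elements from RFC 5013
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

DC_NS = "http://purl.org/dc/elements/1.1/"

def dc_record(values):
    """Serialize a dict of Dublin Core fields to a simple XML fragment.
    Values may be lists, since DC elements are repeatable."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")          # illustrative wrapper, not part of DC
    for name, vals in values.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        for v in (vals if isinstance(vals, list) else [vals]):
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = v
    return ET.tostring(root, encoding="unicode")

xml_out = dc_record({"title": "Example Song", "creator": ["Artist A", "Artist B"]})
```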
Modern metadata formats are increasingly moving toward standards that operate completely independently of their associated content. In some cases this means the metadata is stored in a separate file that accompanies the content file. In other situations, both content and metadata effectively co-exist within a single, top-layer multimedia container. These methods (particularly the latter) are becoming more popular because they remove the burden of metadata management from the content container itself, while simultaneously creating a more flexible metadata environment, which benefits both content creators and content consumers. An additional benefit is that a universal metadata type may be applied to multiple different types of media content, making metadata and media library management smoother and more efficient. This gain in function is largely driven by a concurrent process of ignoring metadata in the media container, removing it, or (ideally) not allowing it to be created in the first place.
It may surprise you to learn Apple's venerable iTunes has its own proprietary metadata format. Released in 2004, iTunes Metadata was designed to be used with iTunes (naturally) and Apple's QuickTime. iTunes metadata was a hybrid structured model unlike any other format, which makes it impossible for most 3rd party applications to read and write it. Technically UTF-8 based free-form text with a maximum of 255 characters per field, the format acted very much like Vorbis Comment. However, under Apple's control (such as editing metadata via a user's iTunes account), it was limited to specific fields which Apple's products controlled behind the scenes.21
Apple deprecated the iTunes standard in 2019
Native Apple iTunes metadata is basically Album, Track, Song, Artist, and Title. Linked XML files are another way metadata is sometimes applied, especially for content containing video (e.g. .info files).
QuickTime File Format (.mov and .qt files)
QuickTime files use what is technically called the QuickTime File Format, or QTFF. It is a chunk-based file structure, but of course Apple being Apple, calling it a chunk-based file format (what it is) isn't good enough, so Apple calls its chunks "atoms." According to Apple, an "atom" is a basic structure for storing information, and it is how a file is structured. Viewed technically, atoms are chunks. Apple can call them whatever it wants, but for the purposes of this discussion, they're the same thing. It's just typical Apple-speak, dressing up something mundane as somehow superior when it isn't.
Moving on, QT files may store technical and contextual metadata in a myriad of ways, which adds to the mystique and borders on turning all your metadata into dark metadata, because it is not always obvious where it's stored. Although the metadata lives inside specific atoms (chunks), the metadata atoms may carry various different names, and they don't even need to be embedded in the QuickTime file. Embedding is the most common approach, but the metadata may instead live elsewhere, so long as the file contains a reference pointing to its location, which could be a URL or another file.
One can tell Apple learned quite a bit from the weaknesses of earlier multimedia formats such as RIFF and AIFF, and this is reflected in some portions of QT's superior architecture design. For example, aside from the flexibility in metadata location mentioned above, QT supports nested chunks (atoms) and each top-level chunk (atom) self-indicates if its size should be handled as a 32-bit or 64-bit range of data (i.e. can hold >4 GB of data). Furthermore, if your QuickTime file has 64-bit chunks but your device can't support 64-bit, on playback those chunks will be ignored. The atoms (chunks) may also be presented in any order.
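The atom layout described above is simple to walk: each atom starts with a 4-byte big-endian size and a 4-byte type, a size of 1 signals that a 64-bit extended size follows, and a size of 0 means the atom runs to the end of the enclosing range. A minimal sketch (the sample atoms here are hand-built placeholders, not a playable file):

```python
import struct

def iter_atoms(data, offset=0, end=None):
    """Walk top-level QuickTime atoms: 4-byte big-endian size + 4-byte type.
    size == 1 means a 64-bit extended size follows the type field;
    size == 0 means the atom extends to the end of the enclosing range."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, kind = struct.unpack(">I4s", data[offset:offset + 8])
        header = 8
        if size == 1:                                   # 64-bit atom
            size = struct.unpack(">Q", data[offset + 8:offset + 16])[0]
            header = 16
        elif size == 0:                                 # atom extends to end
            size = end - offset
        yield kind.decode("latin-1"), data[offset + header:offset + size]
        offset += size

# Hypothetical two-atom stream: a 12-byte 'ftyp' atom, then a 64-bit-sized 'mdat'
blob = struct.pack(">I4s", 12, b"ftyp") + b"qt  "
blob += struct.pack(">I4s", 1, b"mdat") + struct.pack(">Q", 20) + b"data"
atoms = list(iter_atoms(blob))
```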
Numerous aspects of the QuickTime format follow the MPEG-4 and JPEG-2000 architecture conventions. Coupled with an optional, special atom designed to identify file format compatibility, it's possible for a suitable QT device to play back any of these video file formats embedded inside a .mov file. Practically speaking, however, this is rare, and encoding .mov files this way should generally be avoided, as most players presume a QuickTime file (.mov) is simply that: QuickTime content.25
QuickTime File Metadata
QuickTime files are obsolete and difficult to work with. If you still have some, I recommend attempting to convert them to a universally adopted format such as MP4 (MPEG-4) and manually re-writing your contextual metadata if you must. Now, at this point if you're really stubborn and insist on using .mov files and manipulating their metadata, these resources may be useful. You should read them.
- QuickTime metadata overview from Apple
- QuickTime technical metadata overview
- Mapping other audio/video file formats to QuickTime
QuickTime has a lot of flexibility. It is capable of containing details about the details, giving the file creator excruciating control over minutiae. The chart shown here only displays select common fields. More are available, but it did not seem prudent to list them here given QuickTime's obscurity. For more details, be sure to examine the information linked above.
Extensible Binary Meta Language (EBML) is the official metadata format of Matroska. It is a binary framework inspired by XML. Matroska's official tags are stored in the EBML format. Other tag types Matroska is capable of handling, such as ID3, don't use EBML. Remember, EBML isn't a set of tags. It is a framework that defines the details of how Matroska's official tags are stored.26
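At the byte level, EBML encodes element IDs and sizes as variable-length integers (VINTs), where the position of the first set bit in the leading byte tells the reader how many bytes the value occupies. A small sketch of that decoding rule:

```python
def read_vint(data, pos=0):
    """Decode one EBML variable-length integer (VINT).
    The position of the first set bit in the leading byte gives the
    total width; that marker bit is stripped from the value."""
    first = data[pos]
    width = 1
    mask = 0x80
    while width <= 8 and not (first & mask):
        width += 1
        mask >>= 1
    value = first & (mask - 1)            # strip the length marker bit
    for b in data[pos + 1:pos + width]:
        value = (value << 8) | b
    return value, pos + width

# 0x82 -> one-byte VINT holding 2; 0x40 0x7F -> two-byte VINT holding 127
v1, _ = read_vint(b"\x82")
v2, _ = read_vint(b"\x40\x7f")
```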
Exif (or EXIF) is an abbreviation for Exchangeable image file format. Unlike most other container file formats, Exif is referred to officially by its abbreviated name, thus the lowercase characters in its full name.
Originally designed as a standard for recording metadata associated with still images (photographs) blended with short audio clips, Exif's popularity has grown steadily since its release in 1995. The latest version is 2.32, released in 2019.
Exif was one of the first metadata standards in digital photography and one of the first to integrate geotagging with digital media. Natively, it supports .jpg, .tif, and .wav files.
Exif has a limited structure. In fact, its structured metadata applies only to still-image content. When used for tagging audio content, Exif may store the audio as uncompressed PCM (Pulse Code Modulation) or ADPCM (Adaptive Differential Pulse Code Modulation), implemented using the RIFF standard (thus its .wav file support; i.e. the WAVE/RIFF standard). This is not surprising if you think about it: when Exif was invented, RIFF was one of the few metadata standards for audio content. Exif does not natively support metadata for moving images (i.e. film). Audio files saved by an application in Exif format will use the RIFF file standard by default (specifically, an INFO chunk). Thus, Exif is not a very useful metadata type for audio files. While it can be used (presumably with the intent of mimicking the storage of structured Exif metadata fields), its focus is normally on technical data such as sampling rate and current time/date stamps. Exif does not understand typical audio tags such as those found in the ID3 specifications.27
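Exif data itself is laid out as a TIFF structure, which opens with a byte-order mark ("II" for little-endian, "MM" for big-endian), the magic number 42, and the offset of the first Image File Directory (IFD). A minimal header check might look like this (the sample header is hand-built for illustration):

```python
import struct

def parse_tiff_header(data):
    """Parse the 8-byte TIFF header that fronts an Exif block:
    2-byte byte-order mark ('II' little-endian, 'MM' big-endian),
    the magic number 42, and the offset of the first IFD."""
    order = data[:2]
    if order == b"II":
        endian = "<"
    elif order == b"MM":
        endian = ">"
    else:
        raise ValueError("not a TIFF/Exif header")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic")
    return endian, ifd_offset

# Hypothetical little-endian header pointing at an IFD 8 bytes in
endian, off = parse_tiff_header(b"II" + struct.pack("<HI", 42, 8))
```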
Most digital cameras using Exif metadata automatically store GPS location, time, date, and a unique device identifier, making the standard somewhat controversial from a privacy perspective.
iCCP is a very particular technical metadata standard you'll likely never encounter. iCCP stands for ICC Profile, and is promulgated by the International Color Consortium (ICC). Part of the PNG image specification, it specifies the color characteristics of an object. iCCP is recognized by some multimedia formats, such as WebM. iCCP contains technical metadata only. It is mentioned here solely for the sake of completeness (it appears in the WebM metadata spec).
ID3 is a very popular metadata format utilized primarily for MP3 audio files. ID3 metadata - commonly referred to as "tags" - is quite versatile. First, it is supported natively in MP3 files - the most popular digital music file format on the planet - and secondly, many other codecs and containers are compatible with it. ID3 is one of the most ubiquitous metadata methodologies.
Newer isn't always better. AAC files only accept single-frame ID3v2 tags, while the older MP3 standard AAC replaced permits multi-frame ID3v2 metadata.
Why is ID3 so popular? The MP3 standard (MPEG-2 Part 3) has never included metadata. An avid user named Eric Kemp invented a creative method of embedding metadata inside MP3 audio files in such a way that MP3 file readers ignore it, but a metadata-aware application that knows what to look for can find it easily.28 Upon discovery by others, it was more or less an instant hit. ID3 is the de facto metadata standard of MP3, even though it's not official. ID3 is both structured and descriptive: while it is structured, it contains specified fields designed to hold arbitrary values, which makes it flexible. Furthermore, no particular field is required, and later versions (e.g. v2.3, v2.4) also act as a framework, allowing custom fields to be created. Even if one user applies arbitrary, custom "tags" (key:value pairs) to a particular file, any application capable of reading the corresponding ID3 version will be able to present the custom metadata if the playback application allows it. In the worst case, the data will simply be ignored.
The most important things you need to know about ID3 metadata:
- ID3v2.3 is the most widely used version
- ID3v2.4 is the most recent version
- ID3v2.4 is not fully backward compatible with v2.3
- The GENRE tag does not officially exist in ID3v2.x; the TCON frame is commonly used to represent genre
- The list of ID3v1 genres is still considered the gold standard of genre lists
Due to significant re-mappings at various points in time, ID3 versions are mixed when it comes to backward compatibility. ID3 has two (2) main versions: ID3v1 and ID3v2. The main versions are very different and are not compatible with one another due to significant changes in where the metadata is stored in a file. The chart below demonstrates the relationships between versions and how they impact users on the "read" side of the equation.
Although ID3 metadata tagging is extremely popular, it's had its fair share of proverbial bumps in the road. Today, those issues should be well ironed out, but as recently as the mid-2010's there were still numerous applications that either failed to write ID3v2.4 tags properly, or sometimes couldn't read them properly. Prior to this, when ID3v2.3 was released, it was the first version to support UTF-16 characters. Many applications have historically had trouble with certain characters, as they mistakenly presumed ID3 only supported Latin characters (the ISO 8859-1 standard). This is all in spite of the fact the v2.4 standard was released in 2000! Aside from these technical challenges, version 2.4.0 also changed some of ID3's dedicated (structural) frame IDs (tags) and added support for UTF-8 characters, resulting in occasional data mismatches when reading v2.3 versus v2.4 tags. There really isn't a completely foolproof way to prevent this other than sticking with one version or the other and being consistent. Adding insult to injury, the ID3 standard has been tweaked over the years in other respects as well.
Only ID3 versions 2.3 and 2.4 should be used unless you have a specific need for v1. Versions 2.0 and 2.2 are obsolete and should be avoided. Consult the version compatibility chart above for additional guidance. Most ID3 aficionados recommend using ID3v2.3, as it has the greatest chance of working seamlessly with applications that read ID3 tags.
ID3v1 vs ID3v2 (1993-2003)
ID3 has two very different and incompatible branches: Version 1 and Version 2. They differ in several important ways. Understanding these nuances is important if you have files encoded with ID3v1.x metadata. Not all programs can read the v1.x format, nor can all tag readers read and/or convert them.
- Tags at end of file
- Maximum tag size 30 characters
- ID3v1.1 (Enhanced Tags)
- Extra data block added in-line before ID3v1 tags
- Free-form genre description
- Track start/end times
- Artist/Album/Title tags increased to 60 character limit each
- Fade in/out flag
- Added Track Number field
- Added 79 pre-defined genres
- WinAMP (a popular MP3 player) later expanded the list to 191 genres
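The ID3v1 layout above is fixed-width and trivially parseable: a 128-byte block at the end of the file beginning with the ASCII marker "TAG". A sketch covering the v1.1 track-number convention (the sample tag is hand-built for illustration; genre 17 is "Rock" in the ID3v1 list):

```python
def parse_id3v1(block):
    """Parse a 128-byte ID3v1 block from the end of an MP3 file.
    If the comment field's next-to-last byte is zero and its last byte
    is not, the tag is ID3v1.1 and that last byte is the track number."""
    if len(block) != 128 or block[:3] != b"TAG":
        return None
    def text(b):
        return b.split(b"\x00")[0].decode("latin-1").strip()
    tag = {
        "title":  text(block[3:33]),
        "artist": text(block[33:63]),
        "album":  text(block[63:93]),
        "year":   text(block[93:97]),
        "genre":  block[127],
    }
    if block[125] == 0 and block[126] != 0:   # ID3v1.1 track number
        tag["track"] = block[126]
        tag["comment"] = text(block[97:125])
    else:
        tag["comment"] = text(block[97:127])
    return tag

# Build a hypothetical v1.1 tag: title "Demo", year 2024, track 7, genre 17
raw = (b"TAG" + b"Demo".ljust(30, b"\x00") + bytes(30) + bytes(30)
       + b"2024" + bytes(28) + bytes([0, 7]) + bytes([17]))
tag = parse_id3v1(raw)
```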
- Tags usually at beginning of file, but may be at the end
- Introduced concept of "frames" separating content types in an audio file
- Numerous 4-character tags were introduced (structured metadata)
- Fixed Genre IDs were dropped, in favor of free-form genre text entry29
- Added "User defined text information frame" capability (custom tag creation)
- Completely different software development group from ID3v1's creators
- Maximum total tag size of 256 MB
- Introduced multi-frame tags
- Maximum frame size (within a multi-frame tag) of 16 MB each (per sub-frame)
- Unicode support (language agnostic)
- Chapter support added in 2005, but support by 3rd party applications is sporadic and unreliable
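An ID3v2 tag announces itself with a 10-byte header whose size field is "syncsafe" (only 7 usable bits per byte, so no size byte can be mistaken for an MPEG frame sync). A short sketch of reading that header (the sample bytes are hand-built):

```python
def parse_id3v2_header(data):
    """Parse the 10-byte ID3v2 header: 'ID3', major/revision version
    bytes, a flags byte, and a 4-byte syncsafe tag size."""
    if data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    size = 0
    for b in data[6:10]:
        size = (size << 7) | (b & 0x7F)   # syncsafe: top bit always clear
    return {"version": (2, major, revision), "flags": flags, "size": size}

# Hypothetical v2.3.0 header declaring a 257-byte tag (257 = 0b10_0000001)
hdr = parse_id3v2_header(b"ID3" + bytes([3, 0, 0]) + bytes([0, 0, 2, 1]))
```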
NFO (.nfo) means "iNFO" or "information." Sometimes presented as .info, these are a form of metadata: informational files describing another multimedia file. Their purpose is to provide a metadata source independent of the content, storing the metadata in a separate file rather than in the content itself. The advantage of this approach is that you may store any type of metadata about any type of media file; it completely decouples the metadata from the content.
However, there are some downsides. First, just as with dark metadata, if the encoding application and decoding application are not on the same page, it won't work. Generally speaking, then, a .nfo file is going to work with only one program or application, and one should presume the metadata inside it is not suitable for sharing or export. MusicBrainz and Kodi are examples of well-known applications capable of using NFO files. However, as mentioned above, these are proprietary implementations, and any use of NFO files should be treated carefully. Always presume such methods are singularly specific to a particular application.
In many respects, NFO files are the antithesis of the concept of metadata. For the most part, the only circumstance where they make sense is when there is a desire to apply metadata to a media format that either does not support any form of metadata, or where the media in question is inside a container type with very limited metadata and migrating the content to a superior container type is not an option. The whole point of metadata is to share information, and NFO files tend to make sharing information very difficult, if not downright impossible.
.nfo files can be a necessary evil at times, but should be avoided if possible
Vorbis comment (often incorrectly referred to as "vorbis tags") is a very simple, UTF-8 based open text metadata framework. You can create any custom tag name you like, although the official standard does include a limited set of pre-defined "official" (suggested) tags. Vorbis' architecture is very similar to RIFF, though they are completely unrelated. Vorbis Comment tag names are normally in ALL CAPS.
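The Vorbis comment block itself is just length-prefixed UTF-8 "KEY=value" strings with little-endian 32-bit lengths: a vendor string, a comment count, then the entries. A round-trip sketch (vendor string and tags are made-up examples):

```python
import struct

def build_vorbis_comment(vendor, tags):
    """Serialize KEY=value pairs into a Vorbis comment block:
    little-endian 32-bit lengths, UTF-8 text throughout."""
    out = struct.pack("<I", len(vendor.encode("utf-8"))) + vendor.encode("utf-8")
    entries = [f"{k.upper()}={v}".encode("utf-8") for k, v in tags]
    out += struct.pack("<I", len(entries))
    for e in entries:
        out += struct.pack("<I", len(e)) + e
    return out

def parse_vorbis_comment(data):
    """Inverse of the above; returns (vendor, list of (KEY, value))."""
    vlen = struct.unpack_from("<I", data, 0)[0]
    vendor = data[4:4 + vlen].decode("utf-8")
    pos = 4 + vlen
    count = struct.unpack_from("<I", data, pos)[0]
    pos += 4
    tags = []
    for _ in range(count):
        elen = struct.unpack_from("<I", data, pos)[0]
        pos += 4
        key, _, val = data[pos:pos + elen].decode("utf-8").partition("=")
        tags.append((key, val))
        pos += elen
    return vendor, tags

blob = build_vorbis_comment("example vendor", [("title", "Demo"), ("artist", "A")])
vendor, tags = parse_vorbis_comment(blob)
```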
"ISRC" is an abbreviation for International Standard Recording Code, a unique identification system for recordings. You may think of it as an ISBN system of sorts. There is a searchable ISRC database. The idea behind ISRC is to uniquely identify works of art, greatly simplifying the process of matching up information such as metadata, cover art, etc. Theoretically, if every item in your entire collection of multimedia content was simply tagged with an ISRC code, it would be a very simple process for a tagging program to correctly tag your entire multimedia collection very quickly.
How do I embed pictures in Vorbis Comment metadata? You don't. Vorbis comment is incompatible with picture data. If you want embedded artwork in your file, you'll need to use a file format that supports embedded images in addition to supporting Vorbis comments. FLAC, for example, is an audio format that supports embedded images and expects Vorbis comments to be used as its metadata. ID3, on the other hand, provides native support for both embedded pictures and metadata.
eXtensible Metadata Platform (XMP) is a metadata framework developed by Adobe. Like most other metadata formats, it works by embedding data in a file. XMP is a well defined, text-based metadata standard consisting of serialized XML. It is broken down into three (3) parts:
- Part 1: Data model
- Part 2: Additional properties
- Part 3: File storage
XMP is an unusual metadata framework. It has its own schema, yet it also supports Dublin Core, ID3, and iXML. Originally developed for applying metadata to photographs, XMP was initially designed to support the digital camera market and compete with Exif. Over time it has evolved substantially. Like MPEG-7, XMP became an International Organization for Standardization (ISO) standard (2012).30 Unfortunately, one cannot rely on the ISO for the most recent version, as is typical with ISO standards. While the ISO has sanctioned XMP, Adobe remains its patent owner and doesn't seem concerned with coordinating the release of standard updates with ISO. This has created occasional disconnects between the current version and the ISO-sanctioned version. For example, Part 3 was updated by Adobe in 2020, yet the current ISO version is circa 2019 (ISO 16684-1:2019) and reflects the 2016 edition of Adobe's documentation. Unless you're obtaining the standard directly from Adobe, it can be difficult to ascertain exactly how current (or old) your documentation actually is. To avoid this pitfall, XMP's guidelines should always be retrieved from Adobe's XMP Developer website.
XMP, Exif, and MPEG-7 are alternative metadata platforms. All offer a wide scope of guidelines for applying metadata. None is particularly better than the others, though XMP is the most widely applied. For hardcore users, in addition to maintaining the standard, Adobe provides a Software Developer Kit (SDK) with C++ libraries to make it easier to create and manage XMP. The current SDK may be found on GitHub.
ReplayGain is a proposed psychoacoustic loudness (volume) adjustment standard designed to normalize the perceived loudness of sound. It is a bit odd in that it straddles the divide between structural and technical metadata types. Its purpose is to normalize the volume of a collection of audio (e.g. music) such that a user perceives the volume of each selection as nearly identical. The idea is to solve a common problem for music listeners playing back digitally recorded audio: the gain (signal strength) of each recording is frequently different from the others. The classic example is a movie soundtrack, where some scenes are very loud and some are very quiet, leaving the user frequently reaching for the volume control to raise or lower it as the case may be.
Playing back audio content on the same device set to the same volume may give the user the perception that some songs are louder than others, when in fact the signal or recording strength may simply be greater for one recording than another. At any rate, the intent of ReplayGain is to normalize this effect by boosting or attenuating elements of a track so that it maintains the same perceived average volume to the listener.
ReplayGain serves a niche purpose and its use remains relatively muted (at least in terms of conscientious usage), though it is widely supported by playback devices and software programs. Many audiophiles, for example, frown upon using it, as at more extreme levels of attenuation it can diminish the original intent of the artist. ReplayGain is not quite a de facto standard like ID3, though it is well known. Furthermore, the current state of ReplayGain is a somewhat fragmented landscape of sub-standards. For example, one variation of the standard promotes the normalization of audio output to a baseline of 89 dB. Thus, a ReplayGain value may be positive or negative, depending on the content's recorded volume. Another version caps the gains or reductions applied by ReplayGain to a maximum swing of -15 to +15 dB, irrespective of the sound level of the original recording. In many ways, this latter method is more akin to a loudness adjustment than a true ReplayGain adjustment, whereas the former more closely aligns with the original spirit of the concept.
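The arithmetic behind all of these variants is straightforward: the stored gain is the difference between a reference loudness (89 dB in the variant described above) and the track's measured loudness, and players convert that dB value into a linear scale factor of 10^(gain/20). A sketch, with the ±15 dB clamp treated as an optional behavior borrowed from the capped variant:

```python
def replaygain_scale(measured_db, reference_db=89.0, limit_db=15.0):
    """Convert a measured loudness into a ReplayGain adjustment.
    gain_db = reference - measured; the swing is clamped here to
    +/- limit_db (the capped-variant behavior described in the text),
    and the linear factor applied to samples is 10**(gain_db / 20)."""
    gain_db = reference_db - measured_db
    gain_db = max(-limit_db, min(limit_db, gain_db))
    return gain_db, 10 ** (gain_db / 20)

# A track measured at 95 dB gets pulled down by 6 dB (roughly half amplitude)
gain, scale = replaygain_scale(95.0)
```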
The Moving Pictures Experts Group (MPEG) was the first international standards organization to make a concerted effort at defining a well organized structure for expressing metadata (MPEG-7). Ironically, most MPEG multimedia formats don't have any metadata associated with them at all. The following are MPEG standards not discussed elsewhere in this article.
The "I" in MPEG-I stands for "immersive." MPEG-I is intended to be a next-generation multimedia format devoted to so-called immersive multimedia technologies. This basically means 3-dimensional immersive experiences, such as Virtual Reality simulation and omni-directional/enveloping audio. MPEG-I's official standard is ISO/IEC 23090 (Coded Representation of Immersive Media). It is a very comprehensive standard that includes media architecture, video coding algorithms, geometric compression, video experience guides, and content on the network transmission of such data. MPEG-I should be considered a draft standard for the time being, as it is still under active development and definition by various MPEG working groups.
MPEG-I is expected to have its own metadata, which will fall under Part 7 of the standard. However, for the time being there is not even a draft metadata standard. MPEG-I is not likely to become mainstream before the mid-2020's, given its current trajectory, market demand, and complexity. Noteworthy to some people, MPEG-I is expected to contain a new type of "green" metadata for the first time.
Green Metadata (MPEG-B)
Yes, believe it or not there is such a thing as Green Metadata. The term "green metadata" refers to the collection and monitoring of metadata related to energy consumption, and is captured under ISO/IEC CD 23001-11. It is also known as MPEG-B. Originally drafted in 2015, the current version is ISO/IEC 23001-11:2019.
MPEG-H is an international standard for 4K video and 3D audio (ISO/IEC 23008-3:2019/AMD 1:2019). It standardizes the structure of files, transmission, and broadcast of 4K video and 3D audio content. Part 1 describes its transport layer. Part 2 is the video standard portion of MPEG-H, namely HEVC (H.265), which achieves approximately 2x the compression of AVC in terms of file size. Part 3 is a 3D audio standard.
One may think of MPEG-H as MPEG's rough equivalent to Dolby Atmos, another 3D audio standard. Like Atmos, MPEG-H Part 3 is an object-based audio format. Although similar standards in some respects, it's important to note Dolby Atmos is a codec/audio delivery format while MPEG-H encompasses the spectrum of how content is packaged, transmitted, and delivered. While MPEG-H impacts the development of codecs and decoders, it is more of a platform versus Dolby's primary focus on implementation. From a metadata perspective, MPEG-H only has limited technical metadata. It does not natively support any contextual metadata standard. If you're delivering MPEG-H audio content and require semantic metadata, you will need to rely on the container level.
MPEG's Dynamic Adaptive Streaming over HTTP - frequently abbreviated as DASH - is a standard for delivering dynamic adaptive streaming of multimedia content over HTTP. Published in 2012 under ISO/IEC 23009, DASH has been updated several times between 2014-2020. MPEG-DASH defines a framework and is not actually a container or codec itself; you may think of it as infrastructure design. Within the context of metadata, MPEG-DASH presentations may include a Media Presentation Description file, or MPD. MPD files contain the metadata for the associated DASH media content as UTF-8-encoded XML.
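Because an MPD is plain XML, inspecting one requires nothing more than an XML parser. The fragment below is a heavily stripped-down illustration (not a complete, valid presentation; element names follow the DASH schema), showing how the available Representation bitrates could be pulled out:

```python
import xml.etree.ElementTree as ET

# A stripped-down, illustrative MPD document
MPD_XML = """<?xml version="1.0"?>
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static"
     mediaPresentationDuration="PT30S">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="720p" bandwidth="2500000"/>
      <Representation id="1080p" bandwidth="5000000"/>
    </AdaptationSet>
  </Period>
</MPD>"""

NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}
root = ET.fromstring(MPD_XML)
# Map each Representation id to its advertised bandwidth in bits/second
bitrates = {r.get("id"): int(r.get("bandwidth"))
            for r in root.findall(".//mpd:Representation", NS)}
```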
Even when DASH content is encountered, it rarely includes MPD metadata. It is much more likely to have metadata that is part of the streaming content, such as an MPEG-4 container.
MPEG-7 is a dedicated metadata standard and marks MPEG's first attempt at standardizing the storage and presentation of semantic metadata in a highly organized and reproducible fashion. A downside to applying any metadata at all is that metadata as a subject is a bit of the Wild West: most metadata formats are used in an ad-hoc fashion. At its introduction, MPEG-7 sought to resolve this issue by finally creating a standard for all MPEG content. Defined in 2002 and launched in 2003 as ISO/IEC 15938, the latest iteration was updated in 2015 as ISO/IEC 15938-5:2003 Amd 5 (2015) (Amendment 5).
MPEG-7 metadata is focused primarily on analytics rather than descriptive or contextual information about a work of art.
MPEG-7 is not a standard for encoding audio/visual content. Rather, it may be used to describe and identify audio (e.g. music), video, and multimedia (audio and video content together). MPEG-7 specifies how to describe content. Let's break it down: MPEG-7 (ISO/IEC 15938 Multimedia content description interface)....
- is a multimedia content description standard
- represents information about the content, not the content itself
- uses XML
- is intended to provide complementary functionality to previous MPEG standards (-1,-2,-4)
- independent of other MPEG standards
- facilitates content management, such as via search
- requires audio/visual content description to be separate from the actual content, but they must have a relationship to one another
Why create a standard for metadata? MPEG's intent was to allow fast and efficient searching for material of interest to the user. MPEG-7 defines a language to specify Descriptors and Description Schemes, called the Description Definition Language ("DDL").
MPEG-7 has three (3) parts of particular importance:
- Part 3: Visual
- Part 4: Audio
- Part 5: Multimedia Description Schemes
While MPEG-7 specifies how metadata is to be recorded and handled, there is no standard when it comes to extracting its metadata. However, since MPEG-7 is stored as XML, any text-based reader or XML parser is capable of reading the information (at least the raw data).
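To illustrate that last point, here is a heavily simplified, hand-built fragment in the style of MPEG-7 (it is an approximation for demonstration, not schema-valid ISO/IEC 15938 output; the namespace shown is the 2001 one) read with a stock XML parser:

```python
import xml.etree.ElementTree as ET

# A simplified, illustrative MPEG-7-style fragment
MPEG7_XML = """<Mpeg7 xmlns="urn:mpeg:mpeg7:schema:2001">
  <Description>
    <CreationInformation>
      <Creation>
        <Title>Sample Clip</Title>
      </Creation>
    </CreationInformation>
  </Description>
</Mpeg7>"""

NS = {"m7": "urn:mpeg:mpeg7:schema:2001"}
# Any namespace-aware XML parser can walk the description tree
title = ET.fromstring(MPEG7_XML).find(".//m7:Title", NS).text
```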
One of the most confusing subjects in metadata is MPEG-4 metadata (more commonly known as MP4). Why? There is no official MP4 contextual/structured metadata standard. None. Doesn't exist. On the other hand, MPEG-4 was the first MPEG standard to incorporate some form of metadata; namely, technical metadata. MPEG-4's self-contained metadata exists solely to facilitate accurate playback of MPEG-4 material.
MPEG-4 is a container format (not a codec). This makes it uniquely qualified to store metadata, even if it doesn't understand its context. MPEG-4 files (e.g. .m4a, .mp4) do not have contextual (structured) metadata, though codecs stored inside an MPEG-4 container may. Now, here's where the confusion tends to set in. Aside from whatever metadata an embedded codec might contain, MP4 files may contain certain descriptive metadata, such as location data, that is part of the container. While MP4 files have no pre-defined metadata of their own, they have built-in reservations for specific metadata fields. Generally speaking, though, MPEG-4 metadata needs to be defined optionally, such as by a particular codec or implementation (e.g. QuickTime has its own set of metadata). Furthermore, other than MP4 files specifically, the MPEG-4 standard itself does not mandate any metadata at all. Again, in what can seem like contradictory requirements, MPEG-4 containers have built-in arbitrary slots that resemble metadata, yet they are totally undefined. In this regard, MPEG-4 acts much like Vorbis Comment.
Let's try to simplify this issue a bit. The golden rule, so to speak, is that metadata in an MPEG-4 file depends on the codec(s) used. However, if the MP4 file format is used, there are additional options specific to that file type. Defined under ISO/IEC 14496 Part 14 (2020),31 MP4 files have their own metadata provisions in addition to those of MPEG-4 generally (which for most practical purposes are non-existent). This includes the ability to hold XMP metadata.
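XMP is itself XML wrapped in an "xpacket" envelope, so a crude but serviceable way to check whether an MP4 carries XMP is to scan for the packet markers. The sketch below searches an in-memory byte buffer rather than walking the container's box structure (which is how a proper reader would locate the packet), so consider it a quick diagnostic, not a parser:

```python
# XMP packets are delimited by xpacket processing instructions.
XMP_BEGIN = b"<?xpacket begin"
XMP_END = b"<?xpacket end"

def find_xmp_packet(data):
    """Return the raw XMP packet bytes if present, else None."""
    start = data.find(XMP_BEGIN)
    if start == -1:
        return None
    end = data.find(XMP_END, start)
    if end == -1:
        return None
    # Include the end processing instruction up to its closing '?>'.
    close = data.find(b"?>", end)
    return data[start : close + 2] if close != -1 else None

# Synthetic buffer containing a minimal (hypothetical) packet:
sample = (
    b"\x00\x00junk"
    + XMP_BEGIN + b'=""?><x:xmpmeta/>' + XMP_END + b'="w"?>'
    + b"trailer"
)
pkt = find_xmp_packet(sample)
print(pkt is not None)  # -> True
```

Once extracted, the packet is ordinary XML and can be handed to any XML parser, just as with MPEG-7.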
MP4 files specifically have inherent metadata separated into two (2) types: timed and non-timed. Timed metadata is stored in its own track alongside the media it describes (e.g. region, location, etc.). Non-timed metadata is stored as items associated with the file as a whole or with an individual track.
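Both kinds of metadata live inside the container's tree of length-prefixed "boxes" (also called atoms): a 4-byte big-endian size followed by a 4-byte type code. The sketch below builds a synthetic buffer rather than reading a real file, and it assumes the simple 32-bit size form (real files may also use size 1 for a 64-bit largesize, or size 0 for "runs to end of file"):

```python
import struct

def walk_boxes(data):
    """Yield (box_type, payload) for each top-level box in the buffer.

    Assumes the simple 32-bit size form; real files may also use
    size == 1 (64-bit largesize) or size == 0 (box runs to EOF).
    """
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        yield box_type.decode("ascii"), data[offset + 8 : offset + size]
        offset += size

def make_box(box_type, payload=b""):
    """Build a box: 4-byte big-endian size, 4-byte type, payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload

# Synthetic file: an 'ftyp' box followed by a 'moov' box containing
# a 'udta' (user data) box, a customary home for non-timed metadata.
sample = make_box("ftyp", b"isom") + make_box("moov", make_box("udta"))

print([t for t, _ in walk_boxes(sample)])  # -> ['ftyp', 'moov']
```

A real reader would recurse into container boxes like `moov` the same way, since the payload of a container box is simply another sequence of boxes.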
ISO/IEC 14496 Part 12 (Information technology — Coding of audio-visual objects — Part 12: ISO base media file format) also defines metadata guidelines for ISO base media files, the format on which most of the files consumers think of as MP4s are built. It is very generalized and simply specifies concepts such as the fact that metadata may be contained in the same file as the media content or in a separate file. From a practical standpoint, it's not a useful reference with regard to multimedia collections. The current standard is ISO/IEC 14496-12:2015/Amd 2:2018.
Some MP4 implementations (e.g. ffmpeg) use their own sets of structured metadata for MP4; the MPEG-4 standard itself, however, defines none.
MPEG-1 and MPEG-2 (1989/1991)
MPEG-1 and MPEG-2 metadata support is a bit confusing to most people. First off, MPEG-1 and MPEG-2 audio-only files (Part 3) do support the ID3 standard. However, neither supports metadata when used as a multimedia format (video with or without audio). A work-around is to place MPEG-1/MPEG-2 multimedia content inside a parent container format that supports metadata properly.
MPEG-2 multimedia files (.mpg) have no contextual metadata support.
Surprisingly, MPEG-1 does actually support embedded dark metadata (though using it as such is strongly discouraged). In fact, ID3 support in MPEG-1 and MPEG-2 Part 3 (audio) is functionally a by-product of how these MPEG standards handle unknown data frames, not their intended use. In other words, ID3 is supported by chance: MPEG-1 and MPEG-2 decoders simply ignore the ID3 frames. Their codecs don't support ID3 natively.
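The "by chance" behavior is easy to see in practice: an ID3v2 tag is just a prefix block that an MPEG audio decoder skips over (or discards as unrecognized data) before finding the first valid audio frame. The tag's length is encoded as a 28-bit "synchsafe" integer (4 bytes, 7 significant bits each) precisely so that the size bytes can never be mistaken for a frame-sync pattern. A minimal sketch of locating the audio start:

```python
def audio_start_offset(data):
    """Return the byte offset where MPEG audio data begins.

    If the buffer starts with an ID3v2 tag, skip the 10-byte header
    plus the tag size, which is stored as a synchsafe integer
    (7 significant bits per byte, high bit always zero).
    """
    if data[:3] != b"ID3":
        return 0  # no tag; audio (or other data) starts immediately
    tag_size = 0
    for b in data[6:10]:
        tag_size = (tag_size << 7) | (b & 0x7F)
    return 10 + tag_size

# Fake ID3v2.3 header declaring a 257-byte tag body:
# 257 = (2 << 7) | 1 -> synchsafe size bytes 0x00 0x00 0x02 0x01
header = b"ID3" + bytes([3, 0, 0]) + bytes([0, 0, 2, 1])
print(audio_start_offset(header + b"\x00" * 257))  # -> 267
```

A decoder that knows nothing about ID3 achieves the same result by resynchronizing: it scans forward until it finds a valid frame header, which is why the tag "works" even though the codec never supported it.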
This inconsistent treatment of metadata by MPEG-1 and MPEG-2, coupled with market demand for it, led to the MPEG-7 standard.32
A Brief Note on Metadata and Digital Rights Management (DRM)
Digital Rights Management (DRM) refers to the protection of digital works and the defense of copyright holders' rights against theft. Metadata ordinarily has nothing to do with DRM beyond possibly carrying copyright information. However, a few organizations do use it to some extent to enable and/or enforce DRM. One example is Microsoft, which in the past has used some forms of RIFF files for this purpose.
1 Devlin, Bruce; Wilkinson, Jim. (2006). The MXF Book. Elsevier.
12 ETSI TS 126 244 V14.1.0 Release 14: Digital cellular telecommunications system (Phase 2+) (GSM); Universal Mobile Telecommunications System (UMTS); LTE; Transparent end-to-end packet switched streaming service (PSS); 3GPP file format (3GP) (3GPP TS 26.244 version 14.1.0 Release 14). (2018). European Telecommunications Standards Institute. pp. 37-45.
19 Bainbridge, D.; Nichols, D.M.; Witten, I.H. (2009). "How to Build a Digital Library": p. 306.
27 Camera and Imaging Products Association (CIPA). "Exchangeable image file format for digital still cameras: Exif Version 2.32." (May 2019). CIPA DC-008-2019. Section 5: Exif Audio File Specification.
29 While there is technically not a Genre tag included in the ID3v2.x standards, the TCON ID3v2.x tag is widely regarded and most frequently used for this purpose. Officially, TCON's value is supposed to be "Content Type."
31 ISO/IEC: ISO: International Organization for Standardization/International Electrotechnical Commission
32 Bainbridge, D.; Nichols, D.M.; Witten, I.H. (7 October 2009). "How to Build a Digital Library." 2nd Ed. p. 306.
Apple, Inc. (13 September 2016). Metadata: QuickTime Metadata Keys. https://developer.apple.com/library/archive/documentation/QuickTime/QTFF/Metadata/Metadata.html#//apple_ref/doc/uid/TP40000939-CH1-SW37
Bainbridge, D.; Nichols, D.M.; Witten, I.H. (2009). "How to Build a Digital Library." 2nd Ed.
Devlin, Bruce; Wilkinson, Jim. (2013). "The MXF Book: An Introduction to the Material eXchange Format." 2nd edition.
Embedding Metadata in Digital Audio Files. (15 September 2009). Federal Agencies Audio-Visual Working Group.
Mauthe, Andreas Ulrich; Thomas, Peter. (2004). "Professional Content Management Systems."