Immersive audio has huge creative potential for music production, but it can be hard to get your head around. Here’s the explanation you’ve been waiting for!
From long‑playing vinyl, eight‑track cartridge and compact cassette through to CD, Minidisc and MP3, most consumer audio formats have had one thing in common: stereo. They contain two discrete audio signals, designed to be played back through two loudspeakers or earpieces. A listener positioned between these speakers hears a ‘sound stage’ or panorama, within which individual sources appear at specific positions. For example, a mono vocal at equal levels in both channels will be heard as being halfway between the two speakers, directly in the centre.
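The centre‑panned vocal can be illustrated with a minimal sketch of a constant‑power pan law. This is one common choice rather than the only one, and the function name and the -1 to +1 pan range are assumptions made for the example.

```python
import math

def pan_gains(pan):
    """Return (left, right) gains for a pan position from -1 (hard left) to +1 (hard right).

    Uses a sine/cosine constant-power law, so left**2 + right**2 is always 1.
    """
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must lie between -1 and +1")
    theta = (pan + 1.0) * math.pi / 4.0  # maps -1..+1 onto 0..pi/2
    return math.cos(theta), math.sin(theta)

# A mono vocal panned dead centre gets the same gain in both channels.
left, right = pan_gains(0.0)
print(round(left, 4), round(right, 4))
```

Equal gains in both channels are what place the phantom image midway between the speakers; moving the pan value towards either extreme shifts the level balance, and with it the apparent position.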
This illusion is impressive, but it’s also limited. It lets us localise sources along a line between the two speakers, and as being near or far away, but it can’t convey a sense of height, or reliably convince us that sound is coming from behind us.
Over the years, there have been several attempts to overcome these limitations, most notably quadraphonic sound in the ’70s, Dolby Stereo (Pro Logic) through the ’80s and ’90s, and 5.1 surround in the early part of this century. However, these achieved lasting success only in the cinema. Despite considerable investment from record companies, domestic audiences didn’t warm to the new formats.
There were several reasons for this. Marketed as premium products, quadraphonic records, DVD‑Audio discs and multichannel SACDs were more expensive than the stereo versions of the same material. They could not easily be enjoyed on headphones, and required specialised playback equipment, including at least four loudspeakers. This, again, was costly, and it was impractical or at least undesirable in many home environments. Moreover, even if you had the money, the space and the domestic goodwill to set up a 5.1 speaker system, the benefits would be confined to a very small ‘sweet spot’.
What stereo, quad and 5.1 all have in common is that they are channel‑based formats, meaning that there is a fixed relationship between channel count and speaker count. Each discrete channel carries a signal that’s destined for a specific loudspeaker, and the loudspeakers themselves need to be configured in a specific physical relationship. In the right space, with everything set up correctly, the experience could be magical; but in practice, such spaces and setups were few and far between.
There are two key features of modern immersive audio formats. One is that they don’t just represent surround in the horizontal plane: they also contain meaningful height information that allows sounds to be perceived as being above the listener. The other is that nearly all of them break the simple relationship between channels and speakers. The delivery format does not simply contain a separate mono channel destined for each speaker, but a more complex data stream that is decoded in real time to map it onto whatever speakers are available in whatever locations. Unlike older surround formats, immersive audio therefore requires an ‘intelligent’ device in the replay chain to carry out this decoding and customised mapping. But in many contexts, this isn’t really a problem, because the stream is being played back from a computer, server or other device that has plenty of spare processing power.
Certainly, the possible need for an additional device in the signal chain is massively outweighed by the benefits. The key plus is that, in principle, immersive content can be decoded for any speaker arrangement you like, from the bandwidth‑limited mono speaker in a smartphone to a full cinema array with many rear and side speakers, overhead speakers and subwoofers. It can also be fed through a binaural encoder to achieve a sense of immersion on headphones. In other words, whereas previous surround formats required listeners to adapt their setups to suit the format, immersive audio adapts itself to suit whatever setup is available.
We can classify immersive audio formats as being channel‑based, scene‑based or object‑based. The first category refers to surround formats beyond 5.1 that incorporate speakers above the listener, and can therefore claim to be immersive whilst retaining the simple direct and exclusive channel‑to‑speaker mapping. Scene‑based formats, by contrast, present a single, complex data stream that describes a complete three‑dimensional soundfield. Finally, object‑based formats package a number of discrete audio streams along with metadata that tells the decoder how these individual streams should be positioned.
Crudely put, channel‑ and scene‑based formats contain fully mixed audio, whereas an object‑based format contains the major elements of a mix plus metadata explaining how that mix should be implemented in a given playback environment. At the time of writing, there are a number of commercial audio production and distribution formats that are billed as immersive, spatial or 3D, all competing to dominate various market sectors. As we’ll see, many of these are in fact hybrid formats that combine object‑based elements with channel‑ or scene‑based elements.
The main example of a pure scene‑based 3D audio format is Ambisonics. Developed as long ago as the 1970s by Michael Gerzon and Peter Craven, Ambisonics can be thought of as an extension of Mid‑Sides stereo. Mid‑Sides is also known as ‘sum and difference’, and first‑order Ambisonics extends the concept by adding two further ‘difference’ channels, representing the front‑back and up‑down axes. The sum or W channel describes the omnidirectional component, while the X, Y and Z channels describe the directional components of the sound along the three orthogonal axes.
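As a rough illustration of what the W, X, Y and Z channels contain, here is a minimal sketch that encodes a single mono sample into first‑order Ambisonics. It assumes the traditional B‑format convention, in which W is scaled by 1/√2 and azimuth is measured anticlockwise from straight ahead; other conventions order and scale the channels differently. The function name is invented for the example.

```python
import math

def encode_first_order(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one mono sample as (W, X, Y, Z) in traditional B-format.

    Assumed convention: azimuth 0 = front, 90 = left; elevation 90 = overhead.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                   # omnidirectional 'sum' component
    x = sample * math.cos(az) * math.cos(el)      # front-back difference
    y = sample * math.sin(az) * math.cos(el)      # left-right difference
    z = sample * math.sin(el)                     # up-down difference
    return w, x, y, z

# A source directly to the listener's left appears almost entirely in W and Y.
print([round(v, 3) for v in encode_first_order(1.0, azimuth_deg=90.0)])
```

Note that the Y channel here plays the same role as the Sides signal in Mid‑Sides stereo, which is why the comparison works.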
Ambisonics can be scaled to an indefinite number of orders. The number of channels needed to implement a given order n is (n+1) squared, so second‑order Ambisonics requires nine channels, third‑order 16 and so on. The advantage of higher orders is that the listener is able to localise sources more precisely within the soundfield. For example, consider two different sound sources starting at the same point and slowly moving apart. The higher the order, the smaller the angular separation at which we can still tell them apart.
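The channel‑count rule is easy to check with a few lines of code. This sketch simply evaluates (n+1) squared for the first few orders; the function name is made up for the example.

```python
def ambisonic_channels(order):
    """Number of channels needed for full 3D Ambisonics of a given order."""
    if order < 0:
        raise ValueError("order must be zero or greater")
    return (order + 1) ** 2

for order in range(1, 5):
    print(order, ambisonic_channels(order))
```

The quadratic growth is the practical cost of higher orders: each step up in order buys finer localisation, but needs noticeably more channels than the step before.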
There is no real‑world playback system that can play back a raw Ambisonics signal. So, just as Mid‑Sides stereo needs to be matrixed into left and right channels for playback on conventional loudspeakers or headphones, an Ambisonics signal must be processed to adapt it to whatever playback system is available. This can involve extracting separate mono signals for each speaker in a surround array, or running the signal through a binaural encoder for headphone listening.
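Both kinds of matrixing can be sketched in a few lines. The first function is the familiar sum‑and‑difference decode for Mid‑Sides. The second is a deliberately basic first‑order decoder that treats each speaker feed as a virtual cardioid microphone pointing at that speaker; it assumes traditional B‑format scaling (W carried at 1/√2) and a horizontal‑only speaker ring. Practical decoders are considerably more refined, so treat this as an illustration of the principle only.

```python
import math

def ms_to_lr(mid, side):
    """Matrix a Mid-Sides pair into left and right."""
    return mid + side, mid - side

def decode_horizontal(w, x, y, speaker_azimuths_deg):
    """Return one feed per speaker using a virtual cardioid aimed at each speaker.

    Assumes traditional B-format, with azimuth measured anticlockwise from front.
    """
    feeds = []
    for azimuth in speaker_azimuths_deg:
        theta = math.radians(azimuth)
        feeds.append(0.5 * (math.sqrt(2.0) * w + x * math.cos(theta) + y * math.sin(theta)))
    return feeds

# A unit-level source encoded straight ahead: W = 1/sqrt(2), X = 1, Y = 0.
w, x, y = 1.0 / math.sqrt(2.0), 1.0, 0.0
square = [45.0, 135.0, 225.0, 315.0]  # front-left, rear-left, rear-right, front-right
print([round(f, 3) for f in decode_horizontal(w, x, y, square)])
```

The same three input channels could be decoded to six or eight speakers just by passing a different list of azimuths, which is the sense in which the signal is independent of the playback layout.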
Ambisonics is thus similar to stereo and 5.1 surround in the sense that each decoded channel describes part of a complete soundfield, rather than a discrete element within that soundfield. Where it differs from those formats is that the one‑to‑one relationship between channels and loudspeakers is broken. An Ambisonic signal is an abstract representation of the soundfield that needs to be matrixed to a particular listening environment.
In an object‑based format, by contrast, each channel describes a specific element of the mix: a vocal, an instrument or group of instruments, a sound effect such as an explosion, or whatever. Each object is packaged with its own set of timecoded metadata, which is derived from and often identical to mix automation data. For example, the metadata for a percussion track might tell it to remain in the distance at the upper left of the soundfield until 30 seconds into the track, then move slowly across the top of the listener’s head and get closer before performing a quick pirouette. To deploy a rather leftfield analogy, a scene‑based audio mix is like a fully baked loaf that only needs to be sliced up, while a channel‑based mix is a pre‑sliced loaf, but an object‑based mix is like a part‑cooked loaf that comes with instructions as to how it should be finished off.
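To make the idea of timecoded positional metadata more concrete, here is a minimal sketch of an object’s position track stored as keyframes, with linear interpolation between them. The data layout, field names and coordinate ranges are all hypothetical and chosen for readability; they are not the schema used by Atmos or any other real format.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass(frozen=True)
class Keyframe:
    time_s: float
    x: float  # -1 = left, +1 = right
    y: float  # -1 = behind, +1 = in front
    z: float  # 0 = ear level, 1 = overhead

def position_at(keyframes, time_s):
    """Interpolate an object's (x, y, z) at a given time.

    Keyframes must be sorted by time; the first and last positions are held
    before and after the automation range.
    """
    times = [k.time_s for k in keyframes]
    if time_s <= times[0]:
        k = keyframes[0]
        return (k.x, k.y, k.z)
    if time_s >= times[-1]:
        k = keyframes[-1]
        return (k.x, k.y, k.z)
    i = bisect_right(times, time_s)          # first keyframe strictly after time_s
    a, b = keyframes[i - 1], keyframes[i]
    t = (time_s - a.time_s) / (b.time_s - a.time_s)
    return (a.x + t * (b.x - a.x), a.y + t * (b.y - a.y), a.z + t * (b.z - a.z))

# Percussion parked at the upper front left for 30 seconds, then gliding overhead.
percussion = [
    Keyframe(0.0, -1.0, 1.0, 1.0),
    Keyframe(30.0, -1.0, 1.0, 1.0),
    Keyframe(40.0, 0.0, 0.0, 1.0),
]
print(position_at(percussion, 35.0))
```

The audio for the object is stored separately and unchanged; it is only at playback that a renderer combines it with positions like these to work out what each speaker should receive.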
Scene‑based and object‑based formats can both deliver very impressive results, but in their pure form, they have different strengths. Higher‑order Ambisonics excels at presenting natural‑sounding ambiences: if you want to create the impression that the listener is really standing in the middle of a jungle or a busy city street, this can be achieved very effectively. This is partly because Ambisonics isn’t just a replay format. It’s also a capture format, and the use of an Ambisonic microphone such as the Soundfield permits direct, natural recording of a three‑dimensional soundfield.
The flip side of this is that Ambisonics can feel a little understated, and even in second and third orders, offers relatively limited localisation. An Ambisonic presentation of a soundfield is analogous to a coincident stereo recording, in that all of the directional information on offer derives solely from level and tone differences, and not from time‑of‑arrival differences. Those who have worked with it say that it’s hard to make a single instrument or mix element really leap out at the listener, or to create a really precise sense of its occupying a specific position.
This is where object‑based formats come into their own. Because the positioning of objects is implemented locally and with reference to the particular speaker layout in use, they can be placed and moved with a precision and ‘in your face’ quality that is otherwise difficult to achieve. Rock and pop mixing is often more about power, drama and immediacy than it is about naturalism, and this sort of larger‑than‑life presentation is easier to achieve with object‑based mixing than through Ambisonics.
Although data bandwidth and storage space are much less pressing issues than they used to be, both are still finite resources. If every single source within a mix is treated as an individual object, channel counts and file sizes rise alarmingly, especially in complex soundtrack‑type music, or where music coexists with dialogue and sound effects. Yet in most music mixes, there are quite a few elements that don’t need the precise localisation or detailed positional control that objects offer; and a workflow that forces the mixer to assign every single instrument in a busy mix to its own object would risk being very slow and inefficient.
Consequently, many of the immersive formats that are currently jostling for space in the market have a hybrid nature, combining object‑based playback for important mix elements with one or more scene‑ or channel‑based streams that can mop up the rest. Dolby Atmos is a good example: in addition to its objects, an Atmos mix can contain one or more ‘beds’. Each bed is, in essence, a conventional channel‑based stream in up to 7.1.2 surround (ie. seven horizontal channels, one LFE channel, and two height channels); when the mix is played back on a system with fewer speakers, it gets intelligently folded down to 5.1, quad, or whatever is appropriate. This has the added benefit that older surround mixes are more or less directly Atmos‑compatible, since they can be represented as a simple Atmos stream containing only one bed and no objects.
A Dolby Atmos mix can contain up to 128 discrete mono audio channels in total, divided between beds and objects. It would be unusual to use more than one bed in a music mix, so assuming that’s a single 7.1.2 bed, you’d be left with a maximum of 118 mono or 59 stereo objects available. The rival DTS:X standard is similar to Atmos in that beds are augmented by objects, but in this case there’s no upper limit on the number of objects. Sony’s 360 Reality Audio is built on the MPEG‑H 3D Audio standard, a more open‑ended format that can contain objects, channel‑based audio and Ambisonics data. Auro Technologies’ Auro‑3D, meanwhile, began as a purely channel‑based system, but the most recent AuroMax iteration has added objects.
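The Atmos channel budget is simple arithmetic, sketched below. The 128‑channel ceiling and the 7.1.2 bed come from the figures above; the helper names are invented for the example.

```python
ATMOS_MAX_CHANNELS = 128  # total mono channels shared between beds and objects

def bed_channels(layout):
    """Count the channels in a bed layout written as, for example, '7.1.2'."""
    return sum(int(part) for part in layout.split("."))

def objects_available(bed_layouts):
    """Return (mono_objects, stereo_objects) left over once the beds are accounted for."""
    remaining = ATMOS_MAX_CHANNELS - sum(bed_channels(b) for b in bed_layouts)
    if remaining < 0:
        raise ValueError("beds alone exceed the channel limit")
    return remaining, remaining // 2

print(bed_channels("7.1.2"))         # 7 horizontal + 1 LFE + 2 height
print(objects_available(["7.1.2"]))
```

A stereo object occupies two of the mono channels, which is why the stereo figure is half the mono one.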
With stereo and channel‑based surround, individual channels within a mixer are routed directly or indirectly to a two‑channel or 5.1 master bus. The outputs from this bus are then routed to individual speakers, and generating a master recording is simply a matter of capturing the output from this multichannel master bus.
In the immersive world, things get more complicated. The whole point of scene‑based and object‑based formats is to be agnostic about the replay format, so in essence, the appropriate collection of scenes or objects needs to be generated and then decoded in real time for monitoring. As a simple example, consider the creation of a third‑order Ambisonics mix from a number of individual mono sources. Each of those sources must be routed through some sort of 3D panning device or algorithm which can address the 16‑channel Ambisonics bus; and there needs to be a further decoding algorithm that can map the output of that bus onto whatever speaker array we happen to have. If we want to audition the results on headphones, yet another step is involved, as the output must be re‑encoded binaurally.
Object‑based audio adds a further level of complexity, because of the ‘part baked’ nature of the format. We’re not simply recording a multichannel output file that consists only of audio data. Instead, or additionally, we are outputting tens or hundreds of individual mono and stereo streams, each with its own collection of metadata. Few DAWs are currently set up to produce this sort of output natively, so the usual approach is to integrate an additional piece of software that can receive all the necessary channels and data. In Atmos‑speak, this device is called the Renderer, and it acts as both monitor controller and master recorder.
Turning a mix source into an object involves assigning it to its own unique mono or stereo path through this virtual device, and replacing the standard panner on the source channel with a surround panner. This panner sends instructions to the renderer, where they are recorded as metadata and interpreted by the monitor controller according to the appropriate speaker mapping. In other words, the panning and movement of objects within the DAW mixer is actually implemented or duplicated within the renderer. In Pro Tools and Nuendo, the standard channel surround panners are ‘Atmos native’ and can be switched on the fly to produce either object‑based or channel‑based metadata. Logic requires the channel itself to be set to object or bed mode, with different panners used in each.
Integration between renderer and DAW can be handled in a number of different ways. Apple have recently integrated a streamlined version of the Dolby Atmos renderer within Logic, along with appropriate tools such as object panners, making it possible to create an Atmos mix and author the necessary ADM (Audio Definition Model) file without leaving the application. (In many cases, the ADM file is formatted as a Broadcast Wave, aka BWF, file for the audio objects, with a large data chunk added containing the associated metadata.) Steinberg’s Nuendo also builds in a version of the Atmos renderer. For other DAWs, Dolby themselves make available two separate products called the Production Suite and the Mastering Suite. The cross‑platform Mastering Suite is designed to run on a separate computer from the DAW, essentially combining the functions of master recorder and monitor controller; typically, you’d pipe audio from the DAW machine to the Mastering Suite machine using a high channel‑count digital format such as MADI or Dante. The Mac‑only Production Suite, by contrast, installs a virtual Core Audio soundcard called the Dolby Audio Bridge on the DAW machine. This in turn pipes audio to the renderer software, running on the same machine, and handles monitor control and so on. Both approaches have their pros and cons.
The nature of object‑based immersive audio means that some techniques common in stereo mixing need to be ‘unlearned’. Many rock and pop mixers depend heavily on master bus processing to craft the sound of a track, but in an object‑based immersive format, there is no master bus. Even in channel‑based surround, processing the master bus can produce very different results, and the idea of using master dynamics and EQ to ‘glue’ a mix together is rarely applicable. A related concern is headroom and metering. One of the reasons for using master bus processing in stereo is to increase the perceived loudness, and as long as we don’t attempt to exceed 0dBFS at the master bus, the resulting mixes will be compatible with all stereo playback systems. In an immersive format, by contrast, headroom is not only desirable but essential: what goes to each speaker in the listener’s system is determined on the fly during playback, so there needs to be sufficient headroom to ensure that these calculations never produce an ‘illegal’ signal level. You’ll typically be working to a target loudness measured in LUFS rather than to a peak level.
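The need for headroom can be shown with a worst‑case calculation. This sketch assumes that several objects happen to be routed to the same speaker at unity gain and that their peaks coincide exactly, which is the most pessimistic case; a real renderer applies its own panning gains, which are not modelled here. The function name is made up for the example.

```python
import math

def worst_case_peak_dbfs(peaks_dbfs):
    """Peak level, in dBFS, if signals with the given peaks add perfectly in phase."""
    if not peaks_dbfs:
        raise ValueError("at least one peak level is required")
    linear_sum = sum(10.0 ** (p / 20.0) for p in peaks_dbfs)
    return 20.0 * math.log10(linear_sum)

# Two objects that each peak at -3dBFS are individually 'legal'...
print(round(worst_case_peak_dbfs([-3.0, -3.0]), 2))
# ...but four objects need to peak around -12dBFS each to stay at or below full scale.
print(round(worst_case_peak_dbfs([-12.1, -12.1, -12.1, -12.1]), 2))
```

Every doubling of the number of coincident signals adds about 6dB in this worst case, so leaving generous headroom on individual objects is the only way to be sure that no speaker feed clips on an unknown playback system.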
Finally, one of the fringe benefits of object‑based immersive audio is that it facilitates archiving of projects and transfer between different DAWs. As mentioned above, the ADM format that is used for Atmos masters is built on the familiar Broadcast Wave format, and represents each object as a mono or stereo WAV file plus associated metadata. So, if you take an ADM file created using Pro Tools with the Dolby renderer, and open it in Logic 10.7, you should find all your objects and their panning information faithfully recreated there (as long as it only has one bed: Logic doesn’t currently support multiple beds).
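Because an ADM file is a Broadcast Wave at heart, its structure can be inspected with very little code. The sketch below walks the top‑level chunks of a WAVE‑family file and lists their names and sizes. In ADM files, the XML metadata is normally carried in a chunk named 'axml', alongside a 'chna' chunk that ties audio tracks to it; those names come from the published file format specifications rather than from anything above. This simple version does not handle the 64‑bit sizes used by very large files, and the file name in the usage note is hypothetical.

```python
import io
import struct

def list_chunks(stream):
    """Return [(chunk_id, size), ...] for the top-level chunks of a WAVE-family file."""
    header = stream.read(12)
    if len(header) < 12:
        raise ValueError("file is too short to be a WAVE file")
    container, _size, form = struct.unpack("<4sI4s", header)
    if container not in (b"RIFF", b"RF64", b"BW64") or form != b"WAVE":
        raise ValueError("not a WAVE-family file")
    chunks = []
    while True:
        head = stream.read(8)
        if len(head) < 8:
            break  # reached the end of the file
        chunk_id, size = struct.unpack("<4sI", head)
        chunks.append((chunk_id.decode("ascii", "replace"), size))
        stream.seek(size + (size & 1), io.SEEK_CUR)  # chunks are padded to an even length
    return chunks
```

To try it on a real master, open the file in binary mode, for instance `with open("my_mix.wav", "rb") as f: print(list_chunks(f))`, and look for the metadata chunk sitting alongside the usual 'fmt ' and 'data' chunks.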