Immersive audio has huge creative potential for music production, but it can be hard to get your head around. Here’s the explanation you’ve been waiting for!
From long‑playing vinyl, eight‑track cartridge and compact cassette through to CD, MiniDisc and MP3, most consumer audio formats have had one thing in common: stereo. They contain two discrete audio signals, designed to be played back through two loudspeakers or earpieces. A listener positioned between these speakers hears a ‘sound stage’ or panorama, within which individual sources appear at specific positions. For example, a mono vocal at equal levels in both channels will be heard as being halfway between the two speakers, directly in the centre.
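To make that concrete: a stereo panner simply splits a mono source between the two channels according to a pan law. The sketch below is illustrative only; the sine/cosine equal‑power law it uses is one common choice, and real consoles and DAWs vary in the exact curve.

```python
import math

def equal_power_pan(sample, pan):
    """Place a mono sample in a stereo field with an equal-power pan law.

    pan runs from -1.0 (hard left) to +1.0 (hard right). The sine/cosine
    law used here is one common choice, not a universal standard.
    """
    angle = (pan + 1.0) * math.pi / 4.0   # map pan to 0 .. pi/2
    left = sample * math.cos(angle)
    right = sample * math.sin(angle)
    return left, right
```

At a pan setting of 0, both channels receive about 70.7 percent of the signal; equal levels in both speakers are what the listener hears as a centred source.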
This illusion is impressive, but it’s also limited. It lets us localise sources along a line between the two speakers, and as being near or far away, but it can’t convey a sense of height, or reliably convince us that sound is coming from behind us.
Over the years, there have been several attempts to overcome these limitations, most notably quadraphonic sound in the ’70s, Dolby Stereo and its domestic Pro Logic decoding through the ’80s and ’90s, and 5.1 surround in the early part of this century. However, these achieved lasting success only in the cinema. Despite considerable investment from record companies, domestic audiences didn’t warm to the new formats.
There were several reasons for this. Marketed as premium products, quadraphonic records, DVD‑Audio discs and multichannel SACDs were more expensive than the stereo versions of the same material. They could not easily be enjoyed on headphones, and required specialised playback equipment, including at least four loudspeakers. This, again, was costly, and it was impractical or at least undesirable in many home environments. Even if you had the money, the space and the domestic goodwill to set up a 5.1 speaker system, moreover, the benefits would be confined to a very small ‘sweet spot’.
What stereo, quad and 5.1 all have in common is that they are channel‑based formats, meaning that there is a fixed relationship between channel count and speaker count. Each discrete channel carries a signal that’s destined for a specific loudspeaker, and the loudspeakers themselves need to be configured in a specific physical relationship. In the right space, with everything set up correctly, the experience could be magical; but in practice, such spaces and setups were few and far between.
There are two key features of modern immersive audio formats. One is that they don’t just represent surround in the horizontal plane: they also contain meaningful height information that allows sounds to be perceived as being above the listener. The other is that nearly all of them break the simple relationship between channels and speakers. The delivery format does not simply contain a separate mono channel destined for each speaker, but a more complex data stream that is decoded in real time to map it onto whatever speakers are available in whatever locations. Unlike older surround formats, immersive audio therefore requires an ‘intelligent’ device in the replay chain to carry out this decoding and customised mapping. But in many contexts, this isn’t really a problem, because the stream is being played back from a computer, server or other device that has plenty of spare processing power.
Certainly, the possible need for an additional device in the signal chain is massively outweighed by the benefits. The key plus is that, in principle, immersive content can be decoded for any speaker arrangement you like, from the bandwidth‑limited mono speaker in a smartphone to a full cinema array with many rear and side speakers, overhead speakers and subwoofers. It can also be fed through a binaural encoder to achieve a sense of immersion on headphones. In other words, whereas previous surround formats required listeners to adapt their setups to suit the format, immersive audio adapts itself to suit whatever setup is available.
We can classify immersive audio formats as being channel‑based, scene‑based or object‑based. The first category refers to surround formats beyond 5.1 that incorporate speakers above the listener, and can therefore claim to be immersive whilst retaining the simple direct and exclusive channel‑to‑speaker mapping. Scene‑based formats, by contrast, present a single, complex data stream that describes a complete three‑dimensional soundfield. Finally, object‑based formats package a number of discrete audio streams along with metadata that tells the decoder how these individual streams should be positioned.
Crudely put, channel‑ and scene‑based formats contain fully mixed audio, whereas an object‑based format contains the major elements of a mix plus metadata explaining how that mix should be implemented in a given playback environment. At the time of writing, there are a number of commercial audio production and distribution formats that are billed as immersive, spatial or 3D, all competing to dominate various market sectors. As we’ll see, many of these are in fact hybrid formats that combine object‑based elements with channel‑ or scene‑based elements.
The main example of a pure scene‑based 3D audio format is Ambisonics. Developed as long ago as the 1970s by Michael Gerzon and Peter Craven, Ambisonics can be thought of as an extension of Mid‑Sides stereo. Mid‑Sides is also known as ‘sum and difference’, and first‑order Ambisonics extends the concept by adding two additional ‘difference’ channels, representing the front‑back and up‑down axes. The sum or W channel describes the omnidirectional component, while the X, Y and Z channels describe the directional components of the sound along the three orthogonal axes.
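As a rough illustration of the ‘sum and difference’ idea, here is how a mono source at a given azimuth and elevation can be encoded into the four first‑order B‑format channels. This sketch uses the traditional Furse‑Malham convention, in which W is scaled by 1/√2; other conventions, such as AmbiX, use different channel orderings and weightings.

```python
import math

def encode_first_order(sample, azimuth, elevation):
    """Encode one mono sample into first-order B-format (W, X, Y, Z).

    Furse-Malham convention: W is the omnidirectional 'sum' channel,
    scaled by 1/sqrt(2); X, Y and Z are the front-back, left-right
    and up-down 'difference' channels. Angles are in radians.
    """
    w = sample / math.sqrt(2.0)                           # omni 'sum'
    x = sample * math.cos(azimuth) * math.cos(elevation)  # front-back
    y = sample * math.sin(azimuth) * math.cos(elevation)  # left-right
    z = sample * math.sin(elevation)                      # up-down
    return w, x, y, z
```

A source dead ahead (azimuth 0, elevation 0) ends up entirely in W and X, just as a centred source in Mid‑Sides stereo ends up entirely in the Mid channel.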
Ambisonics can be scaled to an indefinite number of orders. The number of channels needed to implement a given order n is (n+1)², so first‑order Ambisonics requires four channels, second‑order nine, third‑order 16 and so on. The advantage of higher orders is that the listener is able to localise sources more precisely within the soundfield. For example, consider two different sound sources starting at the same point and slowly moving apart: the higher the order, the smaller the angle at which we can still distinguish them as separate sources.
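The (n+1)² relationship is easy to verify:

```python
def ambisonic_channels(order):
    """Channels needed for full-sphere Ambisonics of a given order: (n + 1) squared."""
    return (order + 1) ** 2

for order in (1, 2, 3, 4):
    print(order, ambisonic_channels(order))  # prints 1 4, 2 9, 3 16, 4 25
```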
There is no real‑world playback system that can play back a raw Ambisonics signal. So, just as Mid‑Sides stereo needs to be matrixed into left and right channels for playback on conventional loudspeakers or headphones, an Ambisonics signal must be processed to adapt it to whatever playback system is available. This can involve extracting separate mono signals for each speaker in a surround array, or running the signal through a binaural encoder for headphone listening.
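A minimal sketch of that decoding step, assuming a simple first‑order ‘projection’ decode and a horizontal ring of speakers. Real decoders apply more sophisticated, order‑dependent weightings, and the function name here is invented for illustration.

```python
import math

def decode_first_order_horizontal(w, x, y, speaker_azimuths):
    """Derive speaker feeds from a horizontal-only first-order signal.

    A very simple 'projection' decode: each feed samples the encoded
    soundfield in its speaker's direction, so speakers nearest the
    source direction receive the most level.
    """
    n = len(speaker_azimuths)
    return [(math.sqrt(2.0) * w
             + 2.0 * (x * math.cos(az) + y * math.sin(az))) / n
            for az in speaker_azimuths]

# A source encoded dead ahead (W = 1/sqrt(2), X = 1, Y = 0),
# decoded for a square quad array at +/-45 and +/-135 degrees:
quad = [math.radians(a) for a in (45, 135, 225, 315)]
feeds = decode_first_order_horizontal(1.0 / math.sqrt(2.0), 1.0, 0.0, quad)
```

As expected, the two front speakers receive equal, dominant levels for a centre‑front source, while the rear pair receives much less.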
Ambisonics is thus similar to stereo and 5.1 surround in the sense that each decoded channel describes part of a complete soundfield, rather than a discrete element within that soundfield. Where it differs from those formats is that the one‑to‑one relationship between channels and loudspeakers is broken. An Ambisonic signal is an abstract representation of the soundfield that needs to be matrixed to a particular listening environment.
In an object‑based format, by contrast, each channel describes a specific element of the mix: a vocal, an instrument or group of instruments, a sound effect such as an explosion, or whatever. Each object is packaged with its own set of timecoded metadata, which is derived from and often identical to mix automation data. For example, the metadata for a percussion track might tell it to remain in the distance at the upper left of the soundfield until 30 seconds into the track, then move slowly across the top of the listener’s head and get closer before performing a quick pirouette. To deploy a rather leftfield analogy, a scene‑based audio mix is like a fully baked loaf that only needs to be sliced up, while a channel‑based mix is a pre‑sliced loaf — but an object‑based mix is like a part‑cooked loaf that comes with instructions as to how it should be finished off.
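The metadata itself is format‑specific and typically binary, but conceptually it is just a stream of timecoded position instructions that the renderer interpolates during playback. The sketch below is purely illustrative: the PositionKeyframe class, its coordinate conventions and its gain field are invented for this example, not taken from any real format.

```python
from dataclasses import dataclass

@dataclass
class PositionKeyframe:
    """One timecoded position instruction for an audio object.

    Invented for illustration; real formats such as Dolby Atmos
    define their own metadata layouts.
    """
    time_s: float   # seconds from the start of the programme
    x: float        # left (-1) to right (+1)
    y: float        # back (-1) to front (+1)
    z: float        # floor (0) to ceiling (+1)
    gain: float     # linear gain, a crude stand-in for distance

# The percussion move described above, as a handful of keyframes:
percussion_path = [
    PositionKeyframe(0.0,  -1.0, 1.0, 1.0, 0.3),  # distant, upper left
    PositionKeyframe(30.0, -1.0, 1.0, 1.0, 0.3),  # hold until 0:30
    PositionKeyframe(35.0,  0.0, 0.0, 1.0, 0.8),  # over the head, closer
    PositionKeyframe(37.0,  0.2, 0.2, 1.0, 0.8),  # a quick pirouette
]
```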
Scene‑based and object‑based formats can both deliver very impressive results, but in their pure form, they have different strengths. Higher‑order Ambisonics excels at presenting natural‑sounding ambiences: if you want to create the impression that the listener is really standing in the middle of a jungle or a busy city street, this can be achieved very effectively. This is partly because Ambisonics isn’t just a replay format. It’s also a capture format, and the use of an Ambisonic microphone such as the Soundfield permits direct, natural recording of a three‑dimensional soundfield.
The flip side of this is that Ambisonics can feel a little understated, and even in second and third orders, offers relatively limited localisation. An Ambisonic presentation of a soundfield is analogous to a coincident stereo recording, in that all of the directional information on offer derives solely from level and tone differences, and not from time‑of‑arrival differences. Those who have worked with it say that it’s hard to make a single instrument or mix element really leap out at the listener, or to create a really precise sense of its occupying a specific position.
This is where object‑based formats come into their own. Because the positioning of objects is implemented locally and with reference to the particular speaker layout in use, they can be placed and moved with a precision and ‘in your face’ quality that is otherwise difficult to achieve. Rock and pop mixing is often more about power, drama and immediacy than it is about naturalism, and this sort of larger‑than‑life presentation is easier to achieve with object‑based mixing than through Ambisonics.
Although data bandwidth and storage space are much less pressing issues than they used to be, both are still finite resources. If every single source within a mix is treated as an individual object, channel counts and file sizes rise alarmingly, especially in complex soundtrack‑type music, or where music coexists with dialogue and sound effects. Yet in most music mixes, there are quite a few elements that don’t need the precise localisation or detailed positional control that objects offer; and a workflow that forces the mixer to assign every single instrument in a busy mix to its own object would risk being very slow and inefficient.
Consequently, many of the immersive formats that are currently jostling for space in the market have a hybrid nature, combining object‑based playback for important mix elements with one or more scene‑ or channel‑based streams that can mop up the rest. Dolby Atmos is a good example: in addition to its objects, an Atmos mix can contain one or more ‘beds’. Each bed is, in essence, a conventional channel‑based stream in up to 7.1.2 surround (i.e. seven horizontal channels, one LFE channel, and two height channels); when the mix is played back on a system with fewer speakers, it gets intelligently folded down to 5.1, quad, or whatever is appropriate. This has the added benefit that older surround mixes are more or less directly Atmos‑compatible, since they can be represented as a simple Atmos stream containing only one bed and no objects.
A Dolby Atmos mix can contain up to 128 discrete mono audio channels in total, divided between beds and objects. It would be unusual to use more than one bed in a music mix, so assuming that’s a single 7.1.2 bed, you’d be left with a maximum of 118 mono or 59 stereo objects available. The rival DTS:X standard is similar to Atmos in that beds are augmented by objects, but in this case there’s no upper limit on the number of objects. Sony’s 360 Reality Audio is built on the MPEG‑H 3D Audio standard, a more open‑ended format that can contain objects, channel‑based audio and Ambisonics data. Auro Technologies’ Auro‑3D, meanwhile, began as a purely channel‑based system, but the most recent AuroMax iteration has added objects.
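The channel arithmetic for a typical Atmos music mix works out as follows:

```python
# Channel budget for an Atmos music mix built on a single 7.1.2 bed.
TOTAL_ATMOS_CHANNELS = 128
bed = 7 + 1 + 2                      # 7 horizontal + 1 LFE + 2 height
mono_objects = TOTAL_ATMOS_CHANNELS - bed
stereo_objects = mono_objects // 2
print(mono_objects, stereo_objects)  # prints 118 59
```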
With stereo and channel‑based surround, individual channels within a mixer are routed directly or indirectly to a two‑channel or 5.1 master bus. The outputs from this bus are then routed to individual speakers, and generating a master recording is simply a matter of capturing the output from this multichannel master bus.
In the immersive world, things get more complicated. The whole point of scene‑based and object‑based formats is to be agnostic about the replay format, so in essence, the appropriate collection of scenes or objects needs to be generated and then decoded in real time for monitoring. As a simple example, consider the creation of a third‑order Ambisonics mix from a number of individual mono sources. Each of those sources must be routed through some sort of 3D panning device or algorithm which can address the 16‑channel Ambisonics bus; and there needs to be a further decoding algorithm that can map the output of that bus onto whatever speaker array we happen to have. If we want to audition the results on headphones, yet another step is involved, as the output must be re‑encoded binaurally.
Object‑based audio adds a further level of complexity, because of the ‘part baked’ nature of the format. We’re not simply recording a multichannel output file that consists only of audio data. Instead, or additionally, we are outputting tens or hundreds of individual mono and stereo streams, each with its own collection of metadata. Few DAWs are currently set up to produce this sort of output natively, so the usual approach is to integrate an additional piece of software that can receive all the necessary channels and data. In Atmos‑speak, this device is called the Renderer, and it acts as both monitor controller and master recorder.
Turning a mix source into an object involves assigning it to its own unique mono or stereo path through this virtual device, and replacing the standard panner on the source channel with a surround panner. This panner sends instructions to the renderer, where they are recorded as metadata and interpreted by the monitor controller according to the appropriate speaker mapping. In other words, the panning and movement of objects within the DAW mixer is actually implemented or duplicated within the renderer. In Pro Tools and Nuendo, the standard channel surround panners are ‘Atmos native’ and can be switched on the fly to produce either object‑based or channel‑based metadata. Logic requires the channel itself to be switched between object and bed modes, with a different panner used in each.
Integration between renderer and DAW can be handled in a number of different ways. Apple have recently integrated a streamlined version of the Dolby Atmos renderer within Logic, along with appropriate tools such as object panners, making it possible to create an Atmos mix and author the necessary ADM (Audio Definition Model) file without leaving the application. (In many cases, the ADM file is formatted as a Broadcast Wave, aka BWF, file for the audio objects, with a large data chunk added containing the associated metadata.) Steinberg’s Nuendo also builds in a version of the Atmos renderer. For other DAWs, Dolby themselves make available two separate products called the Production Suite and the Mastering Suite. The cross‑platform Mastering Suite is designed to run on a separate computer from the DAW, essentially combining the functions of master recorder and monitor controller; typically, you’d pipe audio from the DAW machine to the Mastering Suite machine using a high channel‑count digital format such as MADI or Dante. The Mac‑only Production Suite, by contrast, installs a virtual Core Audio soundcard called the Dolby Audio Bridge on the DAW machine. This in turn pipes audio to the renderer software, running on the same machine, and handles monitor control and so on. Both approaches have their pros and cons.
The nature of object‑based immersive audio means that some techniques common in stereo mixing need to be ‘unlearned’. Many rock and pop mixers depend heavily on master bus processing to craft the sound of a track, but in an object‑based immersive format, there is no master bus. Even in channel‑based surround, processing the master bus can produce very different results, and the idea of using master dynamics and EQ to ‘glue’ a mix together is rarely applicable. A related concern is headroom and metering. One of the reasons for using master bus processing in stereo is to increase the perceived loudness, and as long as we don’t attempt to exceed 0dBFS at the master bus, the resulting mixes will be compatible with all stereo playback systems. In an immersive format, by contrast, headroom is not only desirable but essential: what goes to each speaker in the listener’s system is determined on the fly during playback, so there needs to be sufficient headroom to ensure that these calculations never produce an ‘illegal’ signal level. You’ll typically be working to a target loudness measured in LUFS rather than to a peak level.
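A quick calculation shows why that headroom matters. If the renderer happens to send two objects, each peaking at −1dBFS, to the same speaker, their sum comfortably exceeds full scale:

```python
import math

def dbfs_to_linear(db):
    """Convert a level in dBFS to linear amplitude (0 dBFS = 1.0)."""
    return 10.0 ** (db / 20.0)

# Two objects, each peaking at -1 dBFS, rendered to the same speaker:
summed_peak = dbfs_to_linear(-1.0) + dbfs_to_linear(-1.0)
summed_db = 20.0 * math.log10(summed_peak)
print(round(summed_db, 1))  # prints 5.0 -- about +5 dBFS, an 'illegal' level
```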
Finally, one of the fringe benefits of object‑based immersive audio is that it facilitates archiving of projects and transfer between different DAWs. As mentioned above, the ADM format that is used for Atmos masters is built on the familiar Broadcast Wave format, and represents each object as a mono or stereo WAV file plus associated metadata. So, if you take an ADM file created using Pro Tools with the Dolby renderer, and open it in Logic 10.7, you should find all your objects and their panning information faithfully recreated there (as long as it only has one bed: Logic doesn’t currently support multiple beds).
Most DAWs don’t offer the option to use third‑party stereo panners, because the job of a stereo panner is so basic that there’s not really any point. However, when we move from stereo to surround, and especially to object‑based immersive surround, the panner becomes much more important.
Moving sources in three dimensions raises many new questions: How should that movement be visualised using animation? How should gestures and mouse movements be translated into changes in position? How can complex 3D movement be represented and edited in two‑dimensional automation lanes? Should 3D panning simply mean making sources louder over here and quieter over there, or should those changes also be reflected in the ambience of a virtual space? And how should the perceived size or width of a sound source vary as it is moved closer or further away?
Engineers have different preferences regarding all these issues and more, and so there’s a wealth of third‑party software available that can augment or replace the tools bundled with our DAWs. One example is the Flux IRCAM Tools Spat Revolution software. Much more than ‘just’ a panner, it’s perhaps better thought of as a renderer that can position and move any number of sources within a synthesized acoustic, then generate an Ambisonics, channel‑based or binaural output. And although it can’t directly create object‑based content, it could be integrated within an Atmos workflow, for example to handle the creation of 7.1.2 beds as part of an Atmos mix. The same goes for other products such as Dear Reality’s dearVR.
Even if the ultimate destination format is going to be Dolby Atmos, it’s possible to reach that destination using third‑party tools. A good example here is L‑Acoustics’ L‑ISA system. This is a suite of tools for working with ‘pure’ object‑based immersive audio either in a live or a studio environment. It can be integrated with a DAW in much the same way as the Dolby Atmos Production Suite can, presenting itself as a virtual soundcard and handling speaker management, binaural encoding and so on; and further down the line, L‑ISA objects can be transformed into Atmos objects if you want to render your immersive mix in a format that can be distributed through the usual consumer channels such as Apple Music. Unlike Atmos, L‑ISA also features a Room Engine which allows objects to be positioned within a virtual acoustic space.
Mention of live sound introduces another relevant concern. Dolby Atmos and other formats that originated in movies are typically highly regulated. To be qualified for the production of music or cinematic content in Atmos format, a studio has to meet pretty stringent requirements, which must be certified by Dolby themselves; it must also be calibrated for working at a specific reference sound pressure level. In live sound, by contrast, it’s unreasonable to expect venues and sound systems to exhibit the same uniformity. Creating a workable immersive experience in live or installation sound is about achieving what’s possible in the space that you have to work in, which may have unfriendly acoustics, an awkward shape and an audience too large to fit into any sort of ‘sweet spot’. Hence, systems such as L‑ISA and Spat Revolution are designed to work with non‑standard speaker layouts as well as the ‘approved’ setups you find in studios and cinemas.
As was previously mentioned, master files for immersive formats can be huge. In theory, for example, a Dolby Atmos ADM file can contain up to 128 channels of 48kHz, 24‑bit audio, plus associated metadata. That’s over 18MB per second, or 1.1GB per minute. Clearly, it’s not feasible to stream uncompressed Atmos masters over domestic internet, cable or satellite TV, or phone networks. Some fairly radical data compression is required and, again, the exact approach taken is format‑specific. Since we’re considering music distribution here rather than broadcasting or movies, Atmos is again the most relevant example owing to its adoption in Apple Music.
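The arithmetic behind those figures:

```python
# Worst-case data rate of an uncompressed 128-channel ADM master.
channels = 128
sample_rate = 48_000      # samples per second
bytes_per_sample = 3      # 24-bit audio
rate = channels * sample_rate * bytes_per_sample
print(rate)               # prints 18432000 -- over 18MB per second
print(rate * 60)          # prints 1105920000 -- about 1.1GB per minute
```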
There are actually three Atmos delivery codecs that form part of the Dolby Music framework. TrueHD is used exclusively on Blu‑ray discs, while AC‑4 IMS is a low bit‑rate format designed specifically for streaming: it contains a bitstream that is compatible with conventional stereo devices plus binaural metadata that enables users with headphones to experience the immersive content. The third codec is known variously as Dolby Digital Plus with Atmos Content or Dolby Digital Plus JOC, where the last three letters stand for Joint Object Coding. In essence, this packages a data‑compressed 5.1 surround stream that is compatible with legacy devices, alongside an additional layer that permits object‑based playback to be reconstructed on Atmos‑compatible systems.
Most of us are accustomed by now to knowing how our speaker‑based stereo mixes will translate to headphone listening, but immersive content is a whole different ball game! Relatively few among the target audience will have Atmos‑compatible speaker systems at home, so the vast majority of immersive listening takes place on headphones. It’s therefore vital to audition the binaurally encoded output as part of an immersive mixing or mastering workflow, but even then there are pitfalls. For example, objects and beds within an Atmos mix can be tagged with distance metadata that is specific to binaural encoding. Dolby’s own binaural encoder (as used, for example, by Tidal) responds to this metadata, but Apple Music uses a different encoding system which does not. This is problematic, because the same Atmos mix may sound different when auditioned through the Dolby binaural encoder and through Apple Music, and there’s currently no way to audition a mix through the Apple Music processing in real time. The current workaround is to export a mix from the Dolby renderer in MP4 format, AirDrop it to an iPhone and then play it back over AirPods, which is not ideal.
Both old and new surround formats have, until now, presented a fairly high barrier to entry. Many people who have been excited about the creative possibilities have been frustrated by the cost and difficulty of working in surround. As a result, immersive audio has been almost entirely the province of post‑production until now. If record company budgets permit, artists and producers have been able to go into an Atmos‑certified studio at the mix stage and create an Atmos or other immersive mix to release alongside the stereo mix, but they haven’t been able to work immersively right from the beginning of a project.
With the release of Logic 10.7, that looks set to change. Logic is positioned primarily as a music creation and recording program, and the integration of Atmos tools — at no additional cost — means that anyone with a laptop and a pair of headphones can make music immersively. Music production promises to be very different when immersive listening is borne in mind at every stage of the creative process, and I’ve no doubt that other DAWs will take up the challenge very soon. We live in exciting times!