
An Introduction To Immersive Audio

The SOS Authoritative Guide By Sam Inglis
Published January 2022


Immersive audio has huge creative potential for music production, but it can be hard to get your head around. Here’s the explanation you’ve been waiting for!

From long‑playing vinyl, eight‑track cartridge and compact cassette through to CD, Minidisc and MP3, most consumer audio formats have had one thing in common: stereo. They contain two discrete audio signals, designed to be played back through two loudspeakers or earpieces. A listener positioned between these speakers hears a ‘sound stage’ or panorama, within which individual sources appear at specific positions. For example, a mono vocal at equal levels in both channels will be heard as being halfway between the two speakers, directly in the centre.
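The centred phantom image can be illustrated with a minimal Python sketch of stereo panning. The constant‑power (sine/cosine) pan law and the name `pan_mono` are assumptions made for illustration; the text above does not specify any particular pan law.

```python
import math

def pan_mono(sample, position):
    """Split a mono sample into (left, right) gains.

    position runs from -1.0 (hard left) through 0.0 (centre) to +1.0 (hard right).
    Assumes a constant-power pan law, one common choice among several.
    """
    angle = (position + 1.0) * math.pi / 4.0  # maps position to 0 .. pi/2
    return sample * math.cos(angle), sample * math.sin(angle)

# A centred source receives the same level in both channels.
left, right = pan_mono(1.0, 0.0)
print(left, right)
```

With `position` at 0.0 the two channels carry equal levels, which is the condition described above for a source heard midway between the speakers.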

This illusion is impressive, but it’s also limited. It lets us localise sources along a line between the two speakers, and as being near or far away, but it can’t convey a sense of height, or reliably convince us that sound is coming from behind us.

Over the years, there have been several attempts to overcome these limitations, most notably quadraphonic sound in the ’70s, Dolby Stereo (ProLogic) through the ’80s and ’90s, and 5.1 surround in the early part of this century. However, these achieved lasting success only in the cinema. Despite considerable investment from record companies, domestic audiences didn’t warm to the new formats.

There were several reasons for this. Marketed as premium products, quadraphonic records, DVD‑Audio discs and multichannel SACDs were more expensive than the stereo versions of the same material. They could not easily be enjoyed on headphones, and required specialised playback equipment, including at least four loudspeakers. This, again, was costly, and it was impractical or at least undesirable in many home environments. Moreover, even if you had the money, the space and the domestic goodwill to set up a 5.1 speaker system, the benefits would be confined to a very small ‘sweet spot’.

Breaking The Link

What stereo, quad and 5.1 all have in common is that they are channel‑based formats, meaning that there is a fixed relationship between channel count and speaker count. Each discrete channel carries a signal that’s destined for a specific loudspeaker, and the loudspeakers themselves need to be configured in a specific physical relationship. In the right space, with everything set up correctly, the experience could be magical; but in practice, such spaces and setups were few and far between.

There are two key features of modern immersive audio formats. One is that they don’t just represent surround in the horizontal plane: they also contain meaningful height information that allows sounds to be perceived as being above the listener. The other is that nearly all of them break the simple relationship between channels and speakers. The delivery format does not simply contain a separate mono channel destined for each speaker, but a more complex data stream that is decoded in real time to map it onto whatever speakers are available in whatever locations. Unlike older surround formats, immersive audio therefore requires an ‘intelligent’ device in the replay chain to carry out this decoding and customised mapping. But in many contexts, this isn’t really a problem, because the stream is being played back from a computer, server or other device that has plenty of spare processing power.

Certainly, the possible need for an additional device in the signal chain is massively outweighed by the benefits. The key plus is that, in principle, immersive content can be decoded for any speaker arrangement you like, from the bandwidth‑limited mono speaker in a smartphone to a full cinema array with many rear and side speakers, overhead speakers and subwoofers. It can also be fed through a binaural encoder to achieve a sense of immersion on headphones. In other words, whereas previous surround formats required listeners to adapt their setups to suit the format, immersive audio adapts itself to suit whatever setup is available.


We can classify immersive audio formats as being channel‑based, scene‑based or object‑based. The first category refers to surround formats beyond 5.1 that incorporate speakers above the listener, and can therefore claim to be immersive whilst retaining the simple direct and exclusive channel‑to‑speaker mapping. Scene‑based formats, by contrast, present a single, complex data stream that describes a complete three‑dimensional soundfield. Finally, object‑based formats package a number of discrete audio streams along with metadata that tells the decoder how these individual streams should be positioned.

Crudely put, channel‑ and scene‑based formats contain fully mixed audio, whereas an object‑based format contains the major elements of a mix plus metadata explaining how that mix should be implemented in a given playback environment. At the time of writing, there are a number of commercial audio production and distribution formats that are billed as immersive, spatial or 3D, all competing to dominate various market sectors. As we’ll see, many of these are in fact hybrid formats that combine object‑based elements with channel‑ or scene‑based elements.
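To make the idea of audio plus positional metadata more concrete, here is a minimal Python sketch of how an object and its timecoded position data might be modelled. Every name here (`PositionKeyframe`, `AudioObject`, `position_at`) and every value is hypothetical, and the structure does not correspond to the schema of any real commercial format.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

@dataclass
class PositionKeyframe:
    time_s: float         # when this position takes effect, in seconds
    azimuth_deg: float    # horizontal angle
    elevation_deg: float  # vertical angle
    distance: float       # 0.0 (at the listener) to 1.0 (far away)

@dataclass
class AudioObject:
    name: str
    keyframes: list = field(default_factory=list)  # must be sorted by time_s

    def position_at(self, time_s):
        """Return the most recent keyframe at or before time_s."""
        if not self.keyframes:
            raise ValueError("object has no position metadata")
        times = [k.time_s for k in self.keyframes]
        index = bisect_right(times, time_s) - 1
        return self.keyframes[max(index, 0)]  # before the first keyframe, use the first

# Hypothetical object: a vocal that moves from front-left to overhead at 10 seconds.
vocal = AudioObject("vocal", [
    PositionKeyframe(0.0, 30.0, 0.0, 0.5),
    PositionKeyframe(10.0, 0.0, 90.0, 0.5),
])
print(vocal.position_at(5.0))
print(vocal.position_at(12.0))
```

In this sketch the audio itself is left out; the point is that the position data travels alongside the object, so that a decoder can decide at playback time how to realise each position on the speakers or headphones that are actually present.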

The Scenic Route

The main example of a pure scene‑based 3D audio format is Ambisonics. Developed as long ago as the early 1970s by Michael Gerzon and Peter Craven, Ambisonics can be thought of as an extension of Mid‑Sides stereo. Mid‑Sides is also known as ‘sum and difference’, and first‑order Ambisonics extends the concept by adding two further ‘difference’ channels, representing the front‑back and up‑down axes. The sum or W channel describes the omnidirectional component, while the X, Y and Z channels describe the directional components of the sound along the three orthogonal axes.
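As a rough illustration of how the four first‑order channels relate to a source's direction, the following Python sketch encodes a single mono sample. It assumes the traditional B‑format convention, in which W is scaled by 1/√2, X points forward, Y points left and Z points up; other channel orderings and normalisations exist, and the function name `encode_first_order` is hypothetical.

```python
import math

def encode_first_order(sample, azimuth_deg, elevation_deg):
    """Encode a mono sample into (W, X, Y, Z).

    Assumes azimuth is measured anticlockwise from straight ahead
    and elevation upwards from the horizontal, both in degrees.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                # omnidirectional 'sum' component
    x = sample * math.cos(az) * math.cos(el)   # front-back difference
    y = sample * math.sin(az) * math.cos(el)   # left-right difference
    z = sample * math.sin(el)                  # up-down difference
    return w, x, y, z

# A source directly to the left appears in W and Y, but not in X or Z.
print(encode_first_order(1.0, 90.0, 0.0))
```

Seen this way, W and Y together play the role of the Mid and Sides signals, while X and Z supply the two extra axes.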

Ambisonics is a capture format as well as a replay format, and Ambisonic mics such as this Rode NT‑SF1 can be used to record very convincing surround ambiences.

Ambisonics can be scaled to an indefinite number of orders. The number of channels needed to implement a given order n is (n+1) squared, so first‑order Ambisonics requires four channels, second‑order nine, third‑order 16 and so on. The advantage of higher orders is that the listener is able to localise sources more precisely within the soundfield. For example, consider two different sound sources starting at the same point and slowly moving apart. The higher the order, the smaller the angle between them at which we can tell them apart.
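The channel‑count rule is easy to check with a few lines of Python. The function name `ambisonic_channel_count` is simply an illustrative choice.

```python
def ambisonic_channel_count(order):
    """Number of channels needed for full 3D Ambisonics of the given order."""
    if order < 0:
        raise ValueError("order must be zero or greater")
    return (order + 1) ** 2

for order in range(1, 4):
    print(order, ambisonic_channel_count(order))
# prints:
# 1 4
# 2 9
# 3 16
```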

There is no real‑world playback system that can play back a raw Ambisonics signal. So, just as Mid‑Sides stereo needs to be matrixed into left and right channels for playback on conventional loudspeakers or headphones, an Ambisonics signal must be processed to adapt it to whatever playback system is available. This can involve extracting separate mono signals for each speaker in a surround array, or running the signal through a binaural encoder for headphone listening.
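Both kinds of matrixing can be sketched in a few lines of Python. The Mid‑Sides decode below uses the plain sum and difference without any level scaling. The Ambisonic decode is a deliberately simple 'virtual microphone' approach that points a virtual cardioid at each speaker and assumes the traditional B‑format convention (W scaled by 1/√2, X forward, Y left, Z up). Practical decoders are considerably more sophisticated, and the function names are hypothetical.

```python
import math

def ms_to_lr(mid, side):
    """Matrix Mid-Sides into left and right (no level scaling applied)."""
    return mid + side, mid - side

def decode_to_speakers(w, x, y, z, speaker_directions):
    """Derive one mono feed per speaker from a first-order signal.

    speaker_directions is a list of (azimuth_deg, elevation_deg) pairs,
    with azimuth measured anticlockwise from straight ahead.
    """
    feeds = []
    for azimuth_deg, elevation_deg in speaker_directions:
        az = math.radians(azimuth_deg)
        el = math.radians(elevation_deg)
        directional = (x * math.cos(az) * math.cos(el)
                       + y * math.sin(az) * math.cos(el)
                       + z * math.sin(el))
        # Equal parts omni and figure-of-eight gives a cardioid pickup.
        feeds.append(0.5 * (math.sqrt(2.0) * w + directional))
    return feeds

# A square of four speakers at ear height, and a source encoded straight ahead.
square = [(45.0, 0.0), (135.0, 0.0), (-135.0, 0.0), (-45.0, 0.0)]
print(decode_to_speakers(1.0 / math.sqrt(2.0), 1.0, 0.0, 0.0, square))
```

The same four‑channel signal could equally be passed to this function with a different list of speaker directions, which is the sense in which the signal is independent of the speaker layout.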

Ambisonics is thus similar to stereo and 5.1 surround in the sense that each decoded channel describes part of a complete soundfield, rather than a discrete element within that soundfield. Where it differs from those formats is that the one‑to‑one relationship between channels and loudspeakers is broken. An Ambisonic signal is an abstract representation of the soundfield that needs to be matrixed to a particular listening environment.

Objects Of Desire

In an object‑based format, by contrast, each channel describes a specific element of the mix: a vocal, an instrument or group of instruments, a Foley effect such as an explosion, or whatever. Each object is packaged with its own set of timecoded metadata, which is derived from and often identical to mix automation data. For example, the metadata for a percussion track might tell it to remain in the distance at the upper left of the soundfield until 30...
