
Q. How does data reduction work in digital audio files?

Minidisc recorders use ATRAC data compression, a 'lossy' format.

Can you explain to me how data reduction in audio files is achieved? I understand the general idea behind data-compressed formats like MP3, but I have never seen an explanation of the actual process. Secondly, how important is the data that's lost when audio is compressed?

Jane Reynolds

Technical Editor Hugh Robjohns replies: There are four fundamental ways of reducing the amount of data required to describe audio.

The simplest is to reduce the sampling rate. This also reduces the highest frequency that can be encoded, but in many applications this is an acceptable compromise, when properly implemented — the 15kHz audio bandwidth used for FM radio is a good example of this.
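The idea can be sketched in a few lines of Python. This is a toy decimator for illustration only: a real sample-rate converter would use a proper anti-aliasing filter, for which a crude two-tap average stands in here.

```python
# Toy illustration of sample-rate reduction: halving the rate of a
# PCM stream. The two-tap average is a stand-in for a real
# anti-alias low-pass filter.

def downsample_by_two(samples):
    """Crudely low-pass the signal, then keep every other sample."""
    filtered = [(samples[i] + samples[i + 1]) / 2
                for i in range(len(samples) - 1)]
    return filtered[::2]

halved = downsample_by_two([0, 2, 4, 6, 8, 10, 12, 14])
# Half as many samples means half the data rate, at the cost of
# halving the highest frequency that can be encoded.
```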

The next technique is to reduce the number of quantising levels — instead of using 24 bits, use just 16. This is exactly what happens with commercial CDs, for example. Obviously, the fewer bits you use, the smaller the potential dynamic range and the higher the noise floor, but you can be clever about it. In general, post-produced material has a smaller dynamic range anyway (often to a ridiculous degree in the case of pop music) and as it is so loud it will tend to mask the higher noise floor. Another technique is to use non-linear quantisation. Telephones work this way, using 8-bit non-linear coding to achieve performance similar to that of 12-bit linear coding.

In non-linear quantisation, the quantising levels are closer together for quiet signals, and wider apart for loud signals. The increased quantisation errors in the latter are largely masked by the fact that the signal is louder anyway, although you may become aware of a noise modulation effect — listen out for it the next time you use a normal wired telephone. The main drawback with non-linear quantisation is that it makes signal processing extremely difficult, so it is rarely used in quality audio applications.
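A minimal sketch of this in Python, using the idealised mu-law companding curve (the mu = 255 version used in North American telephony). Note that the real G.711 telephone codec uses a segmented approximation of this curve rather than the continuous formula shown here.

```python
import math

MU = 255  # mu-law constant for 8-bit telephony

def mu_law_compress(x):
    """Map a sample in [-1, 1] through the mu-law curve, so that
    quantising steps are finer for quiet signals, coarser for loud."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """The inverse curve, applied at the receiving end."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def codec(x, bits=8):
    """Compress, quantise to the given word length, then expand."""
    levels = 2 ** bits - 1
    q = round(mu_law_compress(x) * levels) / levels
    return mu_law_expand(q)
```

Running a quiet sample and a loud sample through `codec` shows the trade-off: the quiet one comes back almost exact, while the loud one carries a larger quantisation error that its own loudness tends to mask.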

Yet another variation on the theme of reducing the number of quantising levels is to use gain ranging, a technique the NICAM (Near Instantaneous Companded Audio Multiplex) stereo audio format uses for terrestrial analogue TV. The broadcast sound signal is quantised to 14-bit (the argument being that broadcast sound has a controlled dynamic range anyway, so won't suffer too much) but then reduced further within the NICAM system to a sliding range of just 10 bits. If the signal is pretty quiet then just the bottom 10 bits are sent (the top four bits are all zeros, and these can be reinserted at the receiver to reconstruct the original 14-bit data). If the signal is loud only the top 10 bits are sent, and the increased quantisation noise (from now having a crudely truncated 10-bit signal) is masked by the fact that it is loud. There are also three intermediate ranges, where one or two bits are discarded from both the top and bottom of the 14-bit original. I know this sounds crude, and it is by modern standards, but it actually works extremely well, and I bet few people listening to their TVs at home realise that they are listening to a 10-bit signal much of the time!
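Gain ranging can be sketched as follows. This is a deliberately simplified, single-sample version: real NICAM works on signed audio in blocks of 32 samples that share one scale-factor code, with the five ranges described above.

```python
# Toy NICAM-style gain ranging on a single unsigned 14-bit value:
# slide a 10-bit window over the 14-bit word, discarding either
# redundant top bits or (for loud signals) low-order bits.

def nicam_encode(x14):
    """Pick the smallest right-shift (0-4) that fits x14 in 10 bits."""
    shift = 0
    while (x14 >> shift) >= (1 << 10):
        shift += 1
    return shift, x14 >> shift

def nicam_decode(shift, payload):
    """Reinsert the discarded low-order bits as zeros."""
    return payload << shift
```

A quiet value (fitting in 10 bits) survives the round trip exactly; a loud one loses its bottom bits, but the resulting error is small relative to the loud signal that masks it.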

A more sophisticated approach to data reduction is to discard 'redundant information', meaning data that carries no signal information. A good example of this is the way NICAM discards the most significant bits, as described above: a quiet 14-bit signal might be coded as 00 0010 1010 1110, and those four leading zeros don't carry anything meaningful about the sound (hence their being described as redundant), so NICAM simply doesn't bother to send them, thus reducing the data rate.

The most sophisticated and contentious data-reduction technique is to remove 'irrelevant information', and this is where coders like MP3, ATRAC, AC3, apt-X and others come in.

There are two sub-divisions here: predictive coders (like apt-X, used in DTS cinema releases) and perceptual coders (used in almost everything else). A predictive coder relies on the fact that most audio signals are fairly simple and repetitive, and thus by looking at what has just passed you can make a good stab at what will come next. The coder then subtracts what it thinks will happen from what actually has happened, and that error signal is what is recorded or transmitted. At the receiver, the same predictive coder looks at the error signal and works backwards to reconstruct the original signal. The whole process can be made more accurate by splitting the audio into multiple bands, which makes the predictions easier and more accurate — apt-X uses four bands, for example. Predictive coders are very fast, and a complete encode-decode cycle takes typically just a few milliseconds, which is why they tend to be used for real-time applications such as two-way broadcasts and telephone links.
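The core of a predictive coder can be shown in a few lines. This is a bare-bones DPCM-style sketch using the simplest possible predictor (the previous sample); real systems like apt-X add sub-band splitting and adaptive quantisation, neither of which is modelled here.

```python
# Minimal predictive (DPCM-style) coder: predict each sample from
# the previous one and transmit only the prediction error.

def encode(samples):
    errors, prev = [], 0
    for s in samples:
        errors.append(s - prev)  # transmit the 'surprise', not the sample
        prev = s
    return errors

def decode(errors):
    samples, prev = [], 0
    for e in errors:
        prev += e                # rebuild by accumulating the errors
        samples.append(prev)
    return samples

signal = [10, 12, 13, 13, 12, 10]
errors = encode(signal)
# For a smooth signal the errors are much smaller than the raw
# samples, so they can be coded with fewer bits. A transient or a
# burst of noise defeats the predictor and the errors balloon.
```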

The problem with a predictive coder is that the whole thing falls apart if the signal is not predictable, and both noise and transients are inherently non-predictable. Thus, apt-X and similar systems have a real problem conveying transient signals properly, and noise-based signals can become 'coloured'. It's a problem which is ameliorated to a degree by the band-splitting approach apt-X takes, and it is usually a subtle effect anyway, but it is often easily audible in direct A-B comparisons with suitable material. Well-recorded solo trumpets highlight the transient damage very well, for example.

Perceptual coders, like MP3, AC3, ATRAC and the rest, rely on a model of the temporal and frequency masking characteristics of the human ear/brain, and are inherently much more complicated and processor-intensive. However, they are also potentially more accurate and their effects less audible, and they will improve and become more efficient as our understanding of how human hearing works improves.

The basic idea is that the incoming linear PCM audio is divided into separate 'frames' of samples (anything from a few tens of samples to a hundred or so) so that temporal masking can be determined. For example, we can't hear a quiet signal immediately before a loud one — the brain appears to discard information about the boring quiet bits for the more exciting loud bits — and a similar effect occurs immediately after a loud signal too. So, the first step is to simply not bother to record or transmit any data about quiet stuff that can't be heard in proximity to louder bits before and after!

Each frame is also divided into narrow frequency bands similar to the 'critical bands' that the human ear/brain is thought to use to analyse sounds. This is where something called a polyphase filter bank comes into play. This is a complex array of very selective digital filters, with 100dB/octave slopes and upwards of 32 bands. Some systems employ equal-bandwidth filters, while some vary the bandwidth with centre frequency.

With the audio signal divided into separate frequency bands, the energy content in each band can be compared with the noise masking thresholds for the ear/brain. In the presence of a loud signal at, say, 1kHz, a quieter one at 1.5kHz may become completely inaudible. It's the same with a humming bass guitar amp: when the bassist isn't playing you may be aware of the hum, but when he is playing, the louder bass guitar notes tend to mask the hum. How wide the range of masked frequencies is depends on the masking signal's frequency and volume, hence the complexity of this approach to data reduction.

However, if the brain is unable to detect certain frequencies in the presence of others that are close and louder, those elements in the inaudible bands can be discarded completely: they are deemed 'irrelevant' to the listening experience. This is where this approach to data reduction becomes contentious, because different people have different levels of hearing acuity (as well as monitoring systems of varying resolutions!). Furthermore, whether the discarded information is actually audible or not obviously depends on the accuracy of the coder's noise-masking threshold data.
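The masking decision can be sketched very simply. The flat 30dB margin used here is a made-up illustrative rule: a real psychoacoustic model computes a masking threshold that varies with the masker's frequency and level, as described above.

```python
# Toy masking decision: any band more than a fixed margin below the
# loudest band is deemed inaudible. (Real coders use
# frequency-dependent masking curves, not this flat rule.)

def audible_bands(band_levels_db, masking_margin_db=30):
    """Return the indices of the bands worth keeping."""
    loudest = max(band_levels_db)
    return [i for i, level in enumerate(band_levels_db)
            if level > loudest - masking_margin_db]

# Band 1 is a loud tone; band 2, a much quieter neighbour, falls
# below the masking threshold and is discarded as 'irrelevant'.
levels = [-60, -6, -50, -20]
kept = audible_bands(levels)
```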

Once the irrelevant bands have been discarded, those remaining bands are requantised with just enough bits to ensure that the resulting quantisation noise lies below the assumed noise-masking thresholds. So some loud bands may only need to be coded with three or four bits, while quieter ones might be coded with 12 or 13 bits. This collection of data — all of the remaining requantised bands — is then bundled together and recorded or transmitted as a data-reduced audio file.
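The bit allocation for the surviving bands can be sketched using the usual rule of thumb that each extra bit buys roughly 6dB of signal-to-quantisation-noise ratio. This is a generic illustration, not any specific codec's allocation algorithm.

```python
import math

def bits_for_band(signal_db, mask_db):
    """Bits needed so the band's quantisation noise stays below
    its masking threshold, at roughly 6dB of SNR per bit."""
    headroom = signal_db - mask_db          # required noise clearance
    return max(0, math.ceil(headroom / 6))

# A loud band sitting 20dB above its mask needs only 4 bits; a band
# needing 70dB of clearance needs 12 - matching the 'three or four
# bits' versus '12 or 13 bits' spread described above.
```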

The decoder simply takes this collection of data and reconstructs the audio signal from the remaining parts. However, the signal elements that were thrown away by the coder can't be brought back or recreated. Hence this is known as a 'lossy' data reduction system — it loses information (which, hopefully, you couldn't have heard anyway!). All of this processing — the division of the signal into frames, and all the digital filtering, analysis, processing and coding — takes a considerable time. The more accurate the system, the longer it takes, and most perceptual coders impose an encode-decode delay of several hundred milliseconds. This isn't a problem in non-real-time applications, like MP3 players and DVD-V discs, but can cause problems for two-way live broadcasts and the like.

It's worth mentioning that there are non-lossy data reduction systems too (such as MLP, used on DVD-As), but these can't provide anything like the amount of data reduction that lossy systems are capable of. There are also now more sophisticated data-reduction schemes and 'add-ons' that develop the perceptual coder idea even further, and improve the quality of the audio at very low bit rates. For example, MPEG-4 is significantly more powerful than its forebear MP3, and can be improved even further with add-ons such as 'Advanced Audio Coding' (AAC) and 'Spectral Band Replication' (SBR)... But I think we've covered enough for now!