Do hit songs have a hidden ingredient that makes them successful by appealing directly to our emotions? It's a question that musicians and scientists have sought the answer to for many decades. Acoustic researcher Ernest Cholakis has a new take on the theory...
Occasionally I run into people doing really interesting work, and sometimes I'm lucky enough to convince them that it deserves a wider audience. So at a recent AES convention, when Ernest Cholakis showed me his notes on trying to quantify emotional responses in music, I couldn't resist asking if he would consider sharing his research via the pages of SOS. Ernest is a researcher in Toronto, Canada who is involved in high-level audio analysis and mastering. He's probably best known for his work on the WC Music Research DNA Groove Templates, which arose from analyses done on the importance of small timing differences when creating particular rhythmic grooves and feels. For more about Ernest and his research, see the box below.
Ernest has been analysing, manipulating and deconstructing sound for a long time — in the late '70s, he bought a Digital Equipment PDP-11/10 minicomputer with a 16-bit D-A converter and started creating software to develop computer-generated sounds, which he later ported over to the Synclavier and then to a Macintosh. Lately, though, the topic that's occupied much of his time has been the issue of why some musical performances are emotionally moving, and others are not.
The process of scientifically measuring and analysing emotion is bound to raise a few eyebrows, and perhaps some hostility: it raises the horrid possibility of programs with 'emotion algorithms' built in, and engineers pressing 'Add Emotion' buttons (and you thought auto-tuning vocals from terrible singers was bad enough!). But such concern is misplaced. When using groove templates to improve rhythms, merely invoking a template is no guarantee that your music will indeed groove. Similarly, just because you know the basic rules of harmony doesn't mean that you will therefore be accomplished at writing multi-part orchestral music. Applying any of these concepts in a haphazard manner is easy, but ultimately unsatisfying.
After working with some of Ernest's techniques, I've found that putting theory into practice is not cut and dried; it demands artistic judgement to be effective. It requires going with your feelings more than your intellect, as the intellect can often set up preconceived notions that prevent you from seeing a situation objectively. So while there may be some techniques that can be applied more or less by rote, they probably won't create the intended results unless you can judge whether those changes make artistic sense.
"Essentially, I deconstruct, analyse and separate sound, by recognising individual events and elements or spectral properties, depending on the situation, and use the resulting components to modify existing sounds, or reconstruct new ones. For example, I might separate a tone into its harmonic or partial and percussive components, and then rebuild those elements into something new.
"In the early '90s, I created the technology which enabled me to extract an extremely accurate MIDI or DNA Groove Template file from an acoustic performance. These DNA extractions, along with other forms of deconstruction, have allowed me to remaster and remix in both subtle and extreme ways. To accomplish this type of work, a series of proprietary software processes has been developed in Mac OS, working entirely in non-real-time mode. These custom processes are not compromised by the limitations imposed by real-time operation, and often take between 100 and 200 times the length of the original audio to create the final sound.
"Another important component of analysing a direct sound is understanding its dynamic relation to enclosures. The results allow me to create extremely accurate (non-real-time) recreations of the actual reverberation of many unique and aesthetically inspiring places, so a recording can be placed 'virtually' into an acoustic setting of your choice. Some of the enclosures I have worked on include the King's Chamber inside the Great Pyramid at Giza, the ancient Pantheon in Rome, Giotto's bell tower in Florence, Palladio's San Giorgio Maggiore in Venice, the Crypt of the Pantheon in Paris, and the Glenn Gould Studio in Toronto, Canada. As well as interior enclosures, I am interested in the dynamics of natural settings and exterior reverb, which I have also recorded and analysed. Some of these include: the Valley of the Kings at Luxor, the streets of Venice, a four-foot crack in a massive rock formation located along the northern shores of Lake Superior, and several forests in Canada.
"What I am particularly interested in now is applying aspects of the reverberant 'signature' from sonically moving and visually beautiful spaces to musical passages, in order to evoke harmonic 'shadows' of these spaces in unpredictable ways."
Ernest's research began because he was wondering why certain performances of the same piano piece would evoke very different emotional responses in him, even when the musicians involved had roughly equivalent technique. Of course, some of this was due to timing differences that created different 'feels', but even taking that into account, he felt there were other factors.
In the process of mastering music and creating sounds for his sampling CDs, Ernest often looks at the dynamics in various frequency bands. Out of curiosity, he started applying this kind of analysis to existing recordings that he considered emotionally satisfying, to see if they had any common characteristics, and compared the results with music he didn't find as satisfying. In this way, he came up with a repeatable process for creating what he calls 'sound graphs' that describe the sound, rather as a score describes the notes themselves. Of course, we are all familiar with hearing a sound change or evolve over time, but accurately communicating what we perceive is another issue altogether. The sound graph is one answer, as it can be used to analyse several different aspects of music.
For example, the graphs simplify the process of comparing interpretations of the same composition by two performers; they also reveal the structure and/or the broad orchestration technique of a particular work, or how the overall sound of a recording changes over time. Sound graphs are also helpful when analysing the particular characteristics of a mix, such as which parts of the audio spectrum are compressed, how the overall EQ changes over time, or the properties of the transient material in a recording.
Creating a sound graph is a relatively complex process. Initially, the original sound file is filtered with up to nine band-pass filters, which split the bass, mid-range, and treble into three bands each. The next step is extracting the amplitude envelope for each of these bands, and plotting the results.
The vertical axis units in a sound graph are the logarithm of the amplitude envelope, represented in decibels (this provides a much more accurate display of the apparent loudness of a sound or frequency band compared to a linear amplitude representation). Splitting apart the sound and graphing the amplitude envelope trend line of a particular band is a good analysis technique, for two main reasons:
- It can reveal a track's overall characteristics without delving into the complexity of taking several FFT snapshots of the spectrum, and attempting to interpret the results over time. The amplitude envelope of between three and nine frequency bands is enough to give a clear and concise picture of the overall sound spectrum; Fourier analysis, by contrast, gives the user an 'explosion of data'. It's difficult to talk about how spectra with 512 or more frequency points change over time without overloading the viewer with too much information, and obscuring the important 'sonic trend lines' in a recording.
- Separating the audio into only nine frequency bands minimises the influence of harmonic and melodic elements, so that the sound graphs instead focus on only the broad overall sound element.
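To give a flavour of the process, here's a minimal sketch in Python with NumPy. This is my own generic reconstruction, not Ernest's proprietary software: the brick-wall FFT band-splitting, the frame size and the RMS envelope method are all illustrative assumptions — the spirit (band-split, envelope-follow, plot in dB) is what matters.

```python
import numpy as np

def sound_graph(signal, sample_rate, band_edges, frame=1024):
    """Split a mono signal into frequency bands and return each band's
    amplitude envelope in decibels -- the raw material of a 'sound graph'.
    band_edges: list of (low_hz, high_hz) tuples, e.g. three to nine bands."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    envelopes = []
    for low, high in band_edges:
        # Isolate one band with a brick-wall FFT filter (offline, non-real-time,
        # in the same spirit as Ernest's processes, though far cruder).
        mask = (freqs >= low) & (freqs < high)
        band = np.fft.irfft(spectrum * mask, n=len(signal))
        # Amplitude envelope: RMS level over short frames...
        n_frames = len(band) // frame
        frames = band[:n_frames * frame].reshape(n_frames, frame)
        rms = np.sqrt(np.mean(frames ** 2, axis=1))
        # ...plotted on a log (dB) scale, which tracks apparent loudness
        # far better than a linear amplitude axis.
        envelopes.append(20 * np.log10(rms + 1e-12))
    return envelopes
```

Plotting each returned array against time gives one trace of the graph; with three to nine bands you get the kind of concise 'sonic trend lines' described above, without the data explosion of a full FFT waterfall.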
As we go through these examples, remember that Ernest's criteria were not whether the music was 'good' or 'bad', 'commercial' or 'non-commercial'; he was looking solely for music that has an emotional impact, preferably on a significant number of people, so as to validate a piece's ability to move listeners.
Consider Figure 1, which shows amplitude against time for the treble band of two violin performances of the Adagio movement from Bach's Solo Violin Sonata in G minor. The upper graph is Sandor Vegh's performance, and the lower is Itzhak Perlman's. When Ernest sent me the graphs along with an audio CD of the performances, I decided to listen to the audio first. I definitely found Vegh's performance more emotionally satisfying than Perlman's — and when I looked at the graphs, I noticed that Vegh's dynamics were not only wider, but more tightly controlled, ramping smoothly from peak to valley.
Figure 2 compares two performances of 'Mars, The Bringer Of War' from Holst's The Planets, this time by the Toronto Symphony on the top, and the Montreal Symphony on the bottom. When listening to the CD, I preferred the Montreal version, and again, their graph shows more tightly-controlled dynamics.
Now, for a complete contrast, consider Figure 3 — the graph of The Chemical Brothers' 'Block Rockin' Beats', and its relative lack of dynamics in the treble, mid-range, and bass frequencies. Actually I like the tune, but that's because it moves my body, not necessarily my emotions. Perhaps this is why many people, when reacting to electronica, find it lacking in emotion; the dynamics-killing aspect of excessive compression could be the culprit. Perhaps it would be a good move for electronic artists to put away their compressors for a bit, and work on putting more dynamics into a piece instead of taking them out.
Figure 4 shows the treble, mid-range, and bass graphs for Phil Collins' 'In the Air Tonight', which many people class as a very emotional song. Of course, the vocal performance is a big part of that, but note how the bass anchors the tune during the first two-thirds of the song, while the mid-range and treble build slowly but relentlessly. The tune starts with a mellow pad sound, then adds vocals, electric guitar, cymbals, and vocoded vocal effects. When the drums enter with their sharp transients, the bass guitar expands and lifts the overall sound. After the first verse, the vocals begin to be processed in subtle ways (by double-tracking and use of EQ), and the vocal develops a more cutting quality as the song progresses.
Ernest offered to do a sound graph on a piece of my choosing, so I selected the composition 'Wisteria' from the Linda Cohen/Michael Kac CD Naked Under the Moon, which I produced. The reason for choosing this was that people always seemed to like this tune, whether in concert or the recorded version, and I was curious if Ernest could detect anything in the piece that might explain this.
Referring to the chart in Figure 5, Ernest profiled four frequency bands: bass (the upper graph), low mid-range, high mid-range (the bottom graph), and high treble. He found that the high mid-range and high treble show an increase in dynamics and level over the entire song, while the bass and low mid-range are fairly constant. This is a 'sonic signature' that he's seen on many pop recordings, where the important mid-range is most subject to dynamic range changes, building steadily from beginning to end, while the bass and treble 'anchor' the experience. Perhaps this tune has become one of the best-liked in Linda's repertoire because of its similarity, at least in terms of overall dynamic feel, to many successful pop tunes.
Figure 6 is an excerpt from the intro of Bob Marley's 'Waiting in Vain', from his album Exodus. The amplitude envelope graph of the lowest two frequency bands is shown: Band 1 (top, 20-60Hz) and Band 2 (below, 60-120Hz). What Ernest found especially interesting in these graphs is the location of the primary energy of the kick and bass.
In western pop, the kick typically has its primary energy in Band 1 and the bass guitar in Band 2. In this reggae classic, the emphasis is reversed: in the 60-120Hz range you can see four distinct kick peaks that are not present in the 20-60Hz range. The dynamic range of the bass is also very narrow, with the sustain portion of the seven bass events falling within a 2dB range (quite remarkable). The graph illustrates that the harmonic components of the bass envelope are about 12dB weaker in the 60-120Hz band than in the lowest band (20-60Hz). This dark, deep bass guitar resonance is the type of bass sound often associated with reggae music. Ernest has noticed a similar sonic signature when analysing bass and drum tracks from Sly Dunbar and Robbie Shakespeare.
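The kinds of figures quoted here — a sustain held within a couple of dB, one band roughly 12dB down on another — can be read off a band's dB envelope with very simple statistics. A minimal sketch (the percentile end-points are my own choice, not part of Ernest's method; they simply stop a stray transient or a gap of silence from dominating the range figure):

```python
import numpy as np

def band_stats(env_db):
    """Summarise one frequency band's amplitude envelope (values in dB):
    the band's typical level, and its dynamic range measured between the
    5th and 95th percentiles rather than absolute min/max."""
    return {"level_db": float(np.median(env_db)),
            "range_db": float(np.percentile(env_db, 95) - np.percentile(env_db, 5))}
```

Comparing `level_db` between the 20-60Hz and 60-120Hz envelopes would give the kind of 'about 12dB weaker' reading quoted above, while `range_db` over the bass sustain portions corresponds to the 2dB figure.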
Figures 7 to 9 illustrate the effects of guitar overdrive distortion on the spectrum. Figure 7 is the sound graph for Smashing Pumpkins' 'Bullet With Butterfly Wings', and Figure 9 shows the analysis for Pearl Jam's 'Alive'.
The effects of compression are not equal over the entire spectrum. The low-mid treble band (2000-6000Hz) has the most compression/distortion, while the frequencies below 1000Hz have a much wider dynamic range. With the guitar intro on 'Alive', the bass frequencies (upper graph) have a dynamic range of 15dB, whereas the treble (2-4kHz, lower graph) has a range of 6dB.
In the case of 'Bullet With Butterfly Wings', the song's treble envelope (Figure 7) illustrates that when the three 'overdrive sections' are heard, the dynamic range narrows to typically less than 6dB. The alternating clean-sounding sections have a much greater dynamic range, around 15-30dB. Alternating between wide-dynamic-range and compressed passages gives a wider perceived dynamic range (as well as drama); in a sense this imitates the experience of hearing a live performance.
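One way to make that alternation visible numerically is a rolling dynamic-range measurement over a band's envelope: compressed passages show up as windows with a narrow max-minus-min figure, clean passages as wide ones. A simple sketch of the idea (the window length is an arbitrary assumption of mine):

```python
import numpy as np

def rolling_range_db(env_db, window=10):
    """Max-minus-min dB within each sliding window of envelope frames.
    Readings under about 6dB suggest a heavily compressed passage;
    15-30dB suggests a clean, wide-dynamics one."""
    return np.array([env_db[i:i + window].max() - env_db[i:i + window].min()
                     for i in range(len(env_db) - window + 1)])
```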
A close-up of the Smashing Pumpkins track (Figure 8) between 55 and 70 seconds visually illustrates that the distortion centres on the 2000-6000Hz frequency band (see the middle graph). The lowest graph (upper-treble region, 7kHz) has a much wider dynamic range, with very clean cymbal transients, and the low frequencies (upper graph) also have a wider dynamic range, with less compression than the middle graph.
The Beastie Boys' song 'Sabotage' (see Figure 10) also has the same kind of heavy, low-treble compression (2-5kHz) as the two previous examples, but it also includes a lo-fi sound component: bass distortion when the vocals are present. The graph shows three frequency bands; the top is the bass, the middle is the treble, and the lower graph is the upper treble (7kHz). Notice that at about 17 seconds in, there is a dramatic change into the lo-fi sound. The kick points are present, but with fewer bass transients. There is a distinct bass rumble that is not quite noise-like but is weakly correlated with the rhythm of the song. This could be a result of low-resolution samples, as the frequency is in the 20-120Hz range. The song alternates between this lo-fi and hi-fi sound several times; the contrast is obvious and quite dramatic.
The low-treble range (2-4kHz) is so overloaded that when you listen to this audio band in isolation, it sounds almost like an indistinct cacophony of sound. The low-treble range typically provides vocal intelligibility — however, in this recording, the frequency band that does this is one octave lower (1-2kHz). It also sounds like a high-pass roll-off below 1kHz was added to enhance the clarity. This type of processing enables the rap or 'talking' vocals to be heard clearly in a high-distortion mix.
In the mix for the song 'Hey Nineteen' by Steely Dan, from the album Gaucho (see Figure 11), the lowest and highest frequency transients (kick and hi-hat) are performed/recorded in a tight, low-distortion manner, but with very little dynamic range — which is not easy to do! This song has a real drummer playing within a very narrow dynamic range. The dynamic range of the kick is about 5dB and the hi-hat accent's range is about 3dB for the downbeat (weaker event) and 3dB for the upbeat. The graph illustrates the pattern of hi-hat and kick accents over the course of 60 seconds.
Figure 12 shows Herbert von Karajan's 1963 performance of the first movement of Beethoven's Eroica Symphony. The contrasting 20-35dB swings in the low treble's dynamic range reveal the orchestral scoring technique of the composer more than the recording process, but the sound graph clearly illustrates the constant cyclical change in the dynamics. The alternating dynamic cycles impart a relentless energy to this movement. Not surprisingly, this symphony was written to reflect the spirit of a hero, and was initially dedicated to Napoleon Bonaparte (although the dedication was later removed).
The soft passages are typically 10-45 seconds long. Beethoven concludes the loud/soft cycle by extending the last pianissimo section (720-820 seconds) to almost two minutes before the final forte section, which is not itself extended.
U2's 'With Or Without You' (see Figure 13), Queen's 'Bohemian Rhapsody' (Figure 14), Amanda Marshall's 'Beautiful Goodbye' (Figure 16), and Pandit Ravi Shankar's raga 'Dhun Man Pasand' (Figure 17) recall the graphs for 'In The Air Tonight' and 'Wisteria' mentioned earlier. These songs are performances that build in intensity over the course of the song, and reach a musical climax near the song's end. The individual frequency bands of these performances reflect this buildup in emotive intensity, but not all the frequency bands move in unison with equal energy.
The sound graphs show that the bass and lower mid-range (20-500Hz) have less dynamic range than the higher frequencies. These lower bands carry the melodic/harmonic foundation material that also imparts warmth to the overall sound. The bass in 'With Or Without You' and 'Beautiful Goodbye' clearly reveals this (see the upper graphs). The Amanda Marshall (look around 220-300 seconds) and Queen (240-300 seconds) middle graphs for the 2-4kHz region, and U2's lower graph (180-210 seconds), show high levels of treble compression at the dramatic, high-energy end of these tunes. Differences between the peak transient and sustain levels in these bands are 5-6dB at these points. The Ravi Shankar track is the only one that does not have compression as the song builds.
In these examples, the mix can become progressively brighter in two ways. The first approach is to add new instruments as the song progresses; the second is to play the same instrument in a different manner. With Amanda Marshall (Figure 16), the instrumentation changes over the course of the song after the first verse (bass, then strings, drums, organ and finally violins); together with Amanda singing in the upper registers, this generates the progressively brighter spectrum evident in the lower graph.
A track can become progressively brighter without adding new instruments by playing notes more rapidly, and/or using the upper registers and increasing the sustain of notes in that mid-upper register of an instrument's range. A good example of this is the Ravi Shankar piece (Figure 17); there are only three instruments (sitar, tambura and tabla), yet plotting the treble illustrates a buildup over the course of the 11 minutes of the performance, with the most intense part lasting over three minutes!
U2's example starts with bass, guitar, drums and voice. The singer (Bono) uses the lower register of his voice at the start and gradually uses his upper register more frequently as the track unfolds. The rest of the band also brighten the spectrum; the drummer adds more open hi-hats, then tambourines. The drum post-production changes the snare sound, too (it starts tight, then gets more ambient/larger-sounding with timbre/EQ changes). The guitarist adds more sustain and uses the upper register more often as the song develops. The result of all this is the ramp of about 12dB in the 2-4kHz region between 90 and 210 seconds.
Queen's 'Bohemian Rhapsody' (Figure 14, above) has a much wider dynamic range than one typically sees in recordings of the 1990s and today. The bass (upper graph), mid treble (middle graph), and upper treble (lower graph) all have a dynamic range approaching 40dB. There are many points in the song where the overall sound is darker, due in large part to the choice of instrumentation. However, each consecutive bright passage picks up where the previous bright passage left off (see Figure 15, above). The result is that the song has a low-treble build to the climax (250-270 seconds) of about 20dB. Most of this buildup of treble energy is through the musical arrangement and incredible vocal performance, aided by a judicious amount of compression. The very wide dynamic range of this recording imparts a theatrical and orchestral scope to this work; the range is almost as wide as the Beethoven recording discussed earlier.
Comparing the dynamic range of 'Bohemian Rhapsody' with two late-1990s songs, Radiohead's 'Paranoid Android' (see Figure 18) and Oasis' 'Champagne Supernova' (see Figure 19), it becomes apparent that compression of selected frequency bands is used even more extensively today in pop recordings.
In both 'Champagne Supernova' and 'Paranoid Android' (the graphs show the bass, low-treble, and high-treble regions, from top to bottom), the bass (20-120Hz) has a dynamic range of less than 5dB throughout (once the drums enter, in the Oasis song). The upper mid-range and lower treble (1-4kHz) have a dynamic range of about 20dB for 'Champagne Supernova' and 12dB for 'Paranoid Android'. In the high treble, the transients in the Oasis song are definitely not as compressed as in the lower bands, whereas the top end of 'Paranoid Android' is much tighter, with a dynamic range of less than 5dB. This is due to the arrangement, which features a constant hi-hat track, acoustic guitar, and a bright top end on the vocal.
One of the interesting aspects of these graphs is the restricted dynamic range of many recorded performances. According to Ernest Cholakis, getting musicians to play within this range so that the compressor/limiter has to do less work will often sound better, or at least more natural. For example, listen to the Miles Davis recording Kind of Blue (above) to hear musicians playing within the limited dynamic range of the recording medium. If you want 'attitude' in your music, compress or selectively distort/compress a particular region (1-4kHz seems about optimum) but leave the upper treble (7kHz and up) clean.
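As a thought experiment, that 'attitude' recipe can be sketched as an offline process: split off the 1-4kHz band, compress only that, and leave everything else (including the 7kHz-and-up top end) untouched before recombining. The Python below is my own illustration of the idea, not Ernest's process; the FFT band-split, ratio and threshold values are all assumptions for demonstration.

```python
import numpy as np

def add_attitude(signal, sample_rate, ratio=4.0, threshold_db=-20.0):
    """Compress/distort only the 1-4kHz band of a mono signal, leaving the
    rest of the spectrum (bass and upper treble) clean, then recombine."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mid_mask = (freqs >= 1000) & (freqs < 4000)
    # Brick-wall split: the 1-4kHz band, and everything else as the residual.
    mid = np.fft.irfft(spectrum * mid_mask, n=len(signal))
    rest = signal - mid
    # Crude static compression of the mid band: gain-reduce any sample whose
    # instantaneous level exceeds the threshold (this also adds distortion,
    # in keeping with the 'selectively distort/compress' suggestion).
    thresh = 10 ** (threshold_db / 20)
    level = np.abs(mid)
    over = level > thresh
    gain = np.ones_like(mid)
    gain[over] = (thresh * (level[over] / thresh) ** (1.0 / ratio)) / level[over]
    return rest + mid * gain
```

Because the processing is sample-by-sample, the squashed mid band picks up harmonic distortion as well as level reduction, while anything outside 1-4kHz passes through bit-for-bit.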
As with the feel factor regarding timing, all of this seems obvious in retrospect: "of course dynamics make a difference in music". But by analysing these graphs, it's pretty easy to see that it's not just dynamics that affect our perceptions, but also the degree of control over those dynamics. In other words, practising dynamic control is as important as practising any other aspect of music, such as pitch discrimination. Perhaps this is also why some people feel automated mixing has taken some of the soul out of music; in the pre-automation days, engineers were more prone to add dynamics to the mix with the faders. I vividly remember watching an engineer at CBS Studios (who helped to create several hits, by the way) at work; he kept his eyes completely closed as he mixed, so he could concentrate on moving the faders rhythmically and dynamically. I must say it was a compelling mix.
In any event, in these days when we have so much control over music (be it through digital audio manipulation or MIDI editing), here's one more element that surely deserves further exploration.
Ernest uses the following hardware and software for his analyses:
- Apple Power Mac 7100s (x2).
- Apple Power Mac 9600s (x2).
- 300MHz Pentium II PC.
- Digidesign Audiomedia II soundcard.
- Digidesign Audiomedia III soundcard.
- Digidesign SampleCell II.
- EgoSys Waveterminal 24/96.
- Audio Alchemy Digital Decoding Engine v1.1.
- Digidesign Sound Designer II.
- Digidesign PowerMix.
- Momentum Data Systems' DSP Designer.
- BIAS Peak.
- MOTU Digital Performer.
- Tascam Gigasampler.
- Cakewalk Sonar.