Technology has threatened to put drummers and guitarists out of work, but until now, singers have been safe. Is Vocaloid going to change all that?
We have probably all heard complaints from drummers about how technology, from early drum machines through to software such as Groove Agent, has threatened their livelihood. Steinberg's Virtual Guitarist has caused similar grumbles from the guitar-playing fraternity — and now, with their 'virtual vocalist' Vocaloid, Yamaha seem set to cause the same discontent amongst the ranks of human vocalists.
Vocaloid is a software-based vocal synthesis engine, and Yamaha have entered into licensing agreements with other companies to develop a range of different virtual singers to be used with the engine. Each of these virtual vocalists is, in essence, a sample database built from recordings of a real singer (for example, male or female and suitable for a particular style of music such as soul, dance or rock). First off the blocks are British sample library developers Zero-G. Their first two vocalists, the female Lola and male Leon, were launched in time for the January 2004 Winter NAMM show in Los Angeles; a third singer, Miriam, based upon the voice of Adiemus singer Miriam Stockley, is due in time for the Frankfurt Musikmesse in late March 2004.
Of course, speech synthesis (as opposed to singing synthesis) has been around for many years, but the image most of us have of it is of a somewhat robotic, Stephen Hawking-esque cliche. Zero-G's advertising for Lola and Leon suggests that Vocaloid goes well beyond this, offering the facility to create vocal lines and harmonies that, with suitable editing of the synth engine parameters, can sound very much like a live singer. So, if you can't sing a note or are just sick of the attitude of your singer, is Vocaloid about to become a viable alternative?
While Vocaloid has no direct competition from other products, technology that can enhance an existing vocal performance has been with us for a number of years, and some sort of comparison may be useful here. For example, hardware processors such as the Digitech Vocalist range are capable of generating automatic four-part vocal harmonies from a real vocal input signal in real time, following scales, chords or MIDI control. Products such as Antares's Auto-Tune are now routinely used to rescue performances that are high on emotion but lacking in pitch control. Using MIDI control, it is also possible to re-pitch a melody line using Auto-Tune, but perhaps better suited to this type of application is Celemony's Melodyne, which can give audio an almost elastic property, providing the user is prepared to get stuck into the serious editing work required to keep any extreme pitch-shifting of audio from sounding unnatural. And if you want to change the character of a recorded vocal, the TC-Helicon Voice One hardware processor (and recent software equivalents for the TC Powercore) allows parameters such as breathiness, growl and resonance to be altered.
In some respects, Vocaloid includes elements of all these technologies. It can pitch a vocal melody as accurately as the user requires. Having created a vocal phrase, Vocaloid also makes it possible to create harmony parts. And by using different virtual singers (with Lola and Leon available now) it is also possible to create vocals with different characters. But, of course, the key difference is that Vocaloid doesn't require a real singer to provide the original vocal line, as this is all generated via the synthesis engine. In this sense, therefore, there is also an element of the 'virtual musician' about it. So, how does Vocaloid achieve its vocal magic?
- Vocaloid system requirements: Pentium III 1GHz or faster, 512MB RAM, Windows 2000 or XP, Ethernet LAN card, 600MB hard disk space.
As outlined above, each of Vocaloid's virtual vocalists depends upon two basic elements. The first is Yamaha's singing synthesis engine. This provides a software environment in which notes can be entered into a familiar piano-roll-style MIDI editor and lyrics can be added for each note entered. The software then attempts to translate each syllable into suitable phonemes (see the Phenomenal Phonemes box) and the combination of phonetic sounds is used to form the word that will be sung back by the synthesis engine.
The second element is the singer database, which is what gives each virtual vocalist its particular character. Samples are created of each singer performing all possible phonemes and transitions between phonemes — which is exactly what Zero-G have done to create Lola and Leon. Clearly, given the complexities of the English language, this is a lot of individual 'sound blocks' and, as such, the sample databases for each virtual singer are quite large after extraction from the install CD (550 and 750 MB for Lola and Leon respectively), so you need to allow plenty of hard disk space. Given the lyrics that have been entered, the appropriate phonetic sounds are extracted from the database and assembled to create the chosen words.
The synthesis engine then derives the required pitches by shifting the fundamental and overtone elements of the sounds while leaving the vowel formants relatively intact. In order to reduce the degree of pitch-shifting required, the phonetic sounds within the database are also multisampled at a number of different pitches, helping to improve the realism of the end result. Even so, Vocaloid's initial interpretation of the chosen lyrics hits each note spot on, with no pitch variation during each note. This does, of course, sound somewhat mechanical (imagine Auto-Tune with a very fast Retune and high Tracking settings), so the final step in the process is to add expression such as attack, vibrato and dynamics via the appropriate tools included within the software.
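To make the multisampling idea concrete, here is a minimal Python sketch of how an engine of this kind might choose which recorded pitch to start from, and how far it then has to shift it. The list of sampled pitches and the 'nearest pitch wins' rule are my own assumptions for illustration; Yamaha have not published how Vocaloid actually makes this choice, and the sketch does not attempt to model the formant preservation described above.

```python
def nearest_sample_pitch(target_note, sampled_notes):
    """Return the sampled MIDI note number closest to the target.

    Ties are resolved in favour of the lower sample (an arbitrary choice).
    """
    return min(sampled_notes, key=lambda n: (abs(n - target_note), n))


def shift_ratio(target_note, source_note):
    """Frequency ratio needed to move source_note to target_note,
    assuming equal temperament (12 semitones per octave)."""
    return 2.0 ** ((target_note - source_note) / 12.0)


# Hypothetical database: one phoneme recorded at three pitches
# (MIDI note numbers). These values are invented for illustration.
sampled = [55, 60, 65]
target = 62

source = nearest_sample_pitch(target, sampled)
ratio = shift_ratio(target, source)
```

With samples at notes 55, 60 and 65, a request for note 62 starts from the recording at note 60 and shifts it up by two semitones (a frequency ratio of roughly 1.12), rather than by seven semitones from note 55. The smaller the shift, the less the processing has to do, which is presumably why multisampling helps the realism.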
Given how difficult it is to reproduce real instruments such as solo violins or saxophones, either via synthesis or sample-based technology, delivering all the subtle nuances of a lifelike lead vocal performance from a piece of software seems like a very ambitious undertaking. So just how close does the first generation of Yamaha's Vocaloid technology get to this very difficult target?
Zero-G provided SOS with both the Lola and Leon virtual vocalists for review. Each virtual vocalist is supplied on a single CD, and included in the box is a 100-page printed manual. The full install process includes the Vocaloid Editor application, the various database files for the particular vocalist, and a VST Instrument plug-in. The main Vocaloid Editor is a stand-alone application, but includes Rewire support (see the Sequencer Integration box for details of the Rewire and VSTi support). A custom installation can be used to add further vocalists.
The install process does have one catch. In order to activate the installation (which has to be done within the first five days of use), the host PC needs to have an Ethernet LAN card fitted. The LAN card hardware is identified by Vocaloid and needs to be present whenever the software is run, essentially acting as a DIY dongle. While the LAN card does not need to be connected to a network, users without such a card will face an extra expense to obtain something suitable. As the PC I used for the review only connects to the outside world via a humble dial-up modem, I purchased a USB-based LAN device (costing about £20) and this proved to be perfectly adequate for the purpose. The actual activation process can be completed either on-line (which I did via my modem) or off-line from another computer that does have Internet access.
The printed manual does a reasonable job of describing the main functions of the software, while an introductory video tutorial is included on the CD (this is also available on the Zero-G web site). A few Vocaloid demo files are also present and, while these have a '.MID' file extension, they are clearly not standard MIDI files.
When it comes to pop singers, visual image is, of course, quite an important factor. While Lola and Leon will, unfortunately, be unavailable for video shoots, the appearance of the Vocaloid user interface is neat enough. Indeed, Yamaha have adopted a fairly standard piano-roll-type MIDI editor design that most sequencer users will find very familiar, and which they are calling the sequence track. The sequence track includes the usual scroll bars and, in the horizontal direction, buttons to zoom in and out. As Vocaloid Editor supports multiple tracks for the creation of harmony parts, at the base of the note editor area is a series of tabs allowing the editor to toggle between different tracks. Only one track can be displayed and edited at once, but the Track / Overlay setting allows notes from the underlying tracks to be seen in a semi-transparent fashion, making note selection for harmony parts a little easier.
Along the top of the window are conventional drop-down menu options to access Vocaloid's various functions, and many of the key ones, including a set of transport controls, are represented by icons immediately above the note editor area. Beneath these are the Measure, Tempo and Beat rulers. The first of these can be used to place the Position Indicator or the Start and End markers (to set up a cycle region). The Tempo and Beat lanes can be used to enter beats per minute or time signature changes anywhere in the sequence.
The base of the screen is dominated by a single MIDI controller lane, which Vocaloid refers to as the control track (in the screen shot on the first page of this review, note velocity is displayed), and different controllers can be selected from a drop-down list to the right of the control track itself. This includes the ability to select the required singer if several virtual vocalists are installed. As shown in the screen shot (right), each note can have a variety of annotations added to it, including the lyrics, which appear directly above the note itself. As we'll see, many of these relate to the phonetic and expression functions required to improve the realism of the synthesized vocal. If you need to focus on one particular element when editing, the display of each type of annotation (lyrics, attack, vibrato, dynamics and phoneme) can be toggled off using the appropriately labelled buttons in the toolbar.
On first launch, a small number of initial settings concerned with the audio device being used and the MIDI resolution need to be made via the Settings menu. While Vocaloid operates at a fixed 16-bit resolution, sampling rates can be adjusted depending upon what is supported by the available soundcard.
Phonetics is commonly described as the science of sounds, especially as this relates to the human voice. Most standard dictionaries, as well as providing spellings and meaning of words, include a phonetic breakdown of the word so that its correct pronunciation can be made (the phonetic version of a word can be thought of as 'sound spelling'). Therefore, in turning lyrics into sung words, Vocaloid has to perform a phoneme transformation, converting each syllable into an appropriate combination of phonetic sounds.
In a dictionary, each phonetic sound is usually represented by a combination of letters, with some accents used for emphasis. Vocaloid uses a similar system of representation. Usefully, a phonetic symbol chart is provided in both the software's Word Dictionary and the printed manual, so if the automatic phoneme transformation doesn't produce the desired result, the correct phonetic symbol for the required sound can be looked up. Some of these are obvious. For example, if the 'w' sound is needed as used in the word 'way', then the phonetic symbol is, simply, w. However, the sounds made by the 'l' in the words 'feel' and 'list' are somewhat different, and these therefore have different phonetic symbols.
Once Vocaloid has identified the correct combination of phonetic sounds, the appropriate samples are extracted from the virtual vocalist sample database (Lola or Leon) and then manipulated within the synthesis engine to generate an accurate pronunciation of the words required. It all sounds fairly easy on paper, but when pitch gets added into the equation, it is clear that both the construction of the Lola and Leon sample databases, and the subsequent processing of these samples by Vocaloid, are quite impressive feats.
In very basic terms, creating a vocal track with Vocaloid is as simple as entering the notes of the melody and typing in the lyrics. As in any other piano-roll editor, a combination of the Pencil and Eraser tools can be used to add and delete notes via the mouse. Note entry from a MIDI keyboard is not supported, but MIDI files created in a sequencer can be imported, and MIDI data on different channels is split into the appropriate number of Vocaloid tracks. Notes can be repositioned by clicking and dragging. Somewhat oddly, note length can only be adjusted by dragging the note end point — the same stretching is not possible at the note start, which can be a bit irritating for anyone used to this ability in the equivalent editor in their main sequencer.
The editor includes the usual range of 'snap to grid' options and fixed-length notes (including dotted and triplet notes), although 'freehand' note entry is very straightforward using the mouse. As each track in Vocaloid is strictly monophonic, overlapping notes are not allowed and any notes that do overlap appear faded in the display. The Jobs / Normalise Object menu option can automatically adjust note lengths to remove any overlaps. Although it is a little crude in operation (it simply truncates the first note of any overlapping pair), it is useful for imported MIDI files where note entry via a keyboard might need to be cleaned up a little.
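The overlap clean-up described above is simple enough to sketch in a few lines of Python. This is my own illustration of the 'truncate the first note of any overlapping pair' behaviour, not Yamaha's code; the Note class, the use of ticks for timing and the decision to drop notes that end up with zero length are all assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    start: int   # position in ticks (hypothetical unit)
    length: int  # duration in ticks
    pitch: int   # MIDI note number


def remove_overlaps(notes):
    """Shorten any note that runs into the start of the next one,
    so the track becomes strictly monophonic."""
    ordered = sorted(notes, key=lambda n: n.start)
    cleaned = []
    for i, note in enumerate(ordered):
        if i + 1 < len(ordered):
            next_start = ordered[i + 1].start
            if note.start + note.length > next_start:
                # Truncate the earlier note at the point the next begins.
                note = replace(note, length=next_start - note.start)
        # Assumption: notes truncated to nothing are discarded.
        if note.length > 0:
            cleaned.append(note)
    return cleaned


phrase = [Note(0, 480, 60), Note(240, 480, 62), Note(960, 240, 64)]
cleaned = remove_overlaps(phrase)
```

In this example the first note originally rings on for 240 ticks after the second has started, so it is cut back to end exactly where the second begins; the other two notes are left alone.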
A default 'ooh' lyric is added to each note as it is entered, and if playback is initiated at this stage, Vocaloid will sing back the melody just using this vowel sound. This can be useful while you are fine-tuning the melodic phrase itself. Clicking on the lyric box then allows the actual words required to be entered, and the Tab key automatically moves to the next note. Some care is needed, however, as the manual suggests that each syllable of a word should be given its own note. Syllables can be joined using the minus (-) sign. Once the lyrics are entered, pressing the Phoneme Transformation button (with the 'æ' icon) triggers Vocaloid to work out which phonetic sounds should be used to construct each syllable of the lyric. These phonemes are then displayed underneath each note.
At this stage, playback can be triggered and Vocaloid will, after a short pause, sing back the lyric following the melody. I say "after a short pause" because the first time a sequence is played back after entry or editing, the synthesis engine has to do its stuff to assemble the appropriate combination of phonemes from the sample database of the chosen singer. If the phrase being constructed is just a few bars in length, then this wait is not too long (a few seconds on the reasonably well-specified test PC), but when I tried to construct a full vocal track over several tens of bars, I was left twiddling my thumbs for a little while.
Somewhat oddly, the synthesis process operates on the entire track, even when you've only made a minor change such as deleting just one note, which can make creating longer tracks a little frustrating. While the synthesis process is obviously very complex, I wonder whether this is something that Yamaha might address via a future software update? Could the engine be forced to reprocess just the area immediately surrounding any edits made since the previous playback or, when in cycle playback, just the bars within the Start and End cycle markers? Having either or both of these options available would certainly speed up the editing process.
If the PC has enough processing grunt, the Settings / Play menu option allows Vocaloid to operate in a 'Play with Synthesis' mode. In this mode, the delay before playback starts is much shorter and Vocaloid attempts to synthesize 'on the fly' while playback is in progress. Unfortunately, on my system at least, this resulted in very glitchy playback of the generated singing, making it difficult to judge the quality of what was being produced.
Up to this point, the note/lyric entry process is both simple and speedy thanks, in the main, to a user interface that will be very familiar to most sequencer users. However, unless a perfectly pitched robotic vocal is the effect you are after, some expression now has to be added and, depending upon the pronunciation produced by the automatic phoneme transformation process, some phoneme editing may be needed.
In terms of basic expression, the floating Icon Palette (called up from the View menu) provides a starting point. Attack and Vibrato icons are simply dragged and dropped from the palette onto the required note. One of each type can be used on an individual note. The note Attack types include accents, pitch-bend up (a common trait of many singers is to 'scoop' their pitch up into the note), trills and legato (smoothing the pitch transition between notes). When an Attack style is added, a small icon then appears beside the start of the note. Vibrato is added in a similar fashion and, by default, the vibrato object extends to cover the second half of the chosen note. The length of this can be changed by clicking and dragging the ends of the vibrato icon (a double-headed arrow appears when the mouse is correctly positioned to change the length of the vibrato icon).
Dynamics objects are dragged and dropped into the sequence in the same fashion but, instead of being attached to individual notes, they apply to all notes until the next Dynamics object is encountered. The relationship between these Dynamics objects and note velocity (which can be edited via the control track) is not made very clear in the manual. My own experimentation suggested that they are different ways of producing the same result (a louder or quieter voice), but neither seems to change the actual style of the vocal delivery. For more gradual changes of volume, the crescendo and diminuendo objects can be placed within the sequence. As with the vibrato objects, the length of these can be adjusted as required. Again, these interact with note velocity data, but they can also be used to produce a change of volume during a note, whereas note velocity just controls the volume at the start of a note.
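The 'applies until the next one is encountered' behaviour of Dynamics objects can be pictured as a simple lookup. The Python sketch below is only my reading of that behaviour; the tick positions, the marking names and the 'mf' default used before the first Dynamics object are assumptions, and it says nothing about how the markings combine with note velocity.

```python
from bisect import bisect_right


def dynamic_at(position, dynamics, default="mf"):
    """Return the marking in force at a given position.

    dynamics is a list of (position, marking) pairs sorted by position.
    The most recent marking at or before the position applies; before
    the first marking, the (assumed) default is used.
    """
    positions = [p for p, _ in dynamics]
    index = bisect_right(positions, position)
    return dynamics[index - 1][1] if index else default


# Hypothetical track: quiet from the start, loud from tick 1920 onwards.
markings = [(0, "p"), (1920, "f")]
quiet = dynamic_at(960, markings)
loud = dynamic_at(1920, markings)
```

A note at tick 960 falls under the opening 'p', while a note starting exactly on tick 1920 picks up the 'f' placed there.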
Double-clicking on any expression controls placed within the sequence allows their properties to be edited in more detail. For example, the screen on the previous page shows a crescendo curve. Here, additional edit points can be added and the curve can be shaped as required, giving considerable control over the volume of phrases. Once edited, right-clicking on any expression objects allows them to be saved as presets for use in other Vocaloid projects.
If the automatic phoneme transformation process has not created quite the pronunciation required, three options are available. First, clicking on the phonemes displayed under each note allows alternative phonetic symbols to be edited manually. Second, having selected the note that requires altering, clicking on the A icon on the toolbar opens the Phoneme Edit window (above). From here, the phonemes used for each note can again be edited manually, with a look-up table provided for easy reference. The 'Protect' column allows any manual edits to survive if Vocaloid performs a subsequent automatic phoneme transformation. The third option is to use the Word Dictionary (left). Here, a user dictionary of words can be compiled and, while Vocaloid can be set to automatically generate phonetic symbols for any words entered, these can also be edited by hand and are then used if the word is entered as part of a lyric — although beware if you enter combinations of sounds that would not naturally go together, as the synthesis engine tends to ignore them.
The final major element of expression editing is provided via the control track. The Pencil Tool can be used to add individual control points (Dots), draw freehand (Free) or add straight line elements (Line). The drop-down menu allows a number of different parameters to be selected for editing. Things like note velocity, pitch-bend and pitch-bend sensitivity are fairly self-explanatory. The four Resonance controls each provide access to frequency, bandwidth and amplitude parameters, and while this provides a good deal of tonal control, it can be a little cumbersome to make full use of them, as only one parameter can be edited at a time. Other parameters include Harmonics, Noise, Brightness, Clearness and Gender Factor. These names give a clue as to their purpose, but the manual is a little unclear as to exactly how each alters the character of the resulting vocal, so some trial-and-error experimentation is required. Needless to say, each produces some variation in the voice characteristics and, with some careful editing, can help add a further sense of realism to the final vocal.
Once all the editing is complete, phrases can be copied to another position on the same track or to a second track. From the Singer Window (below), it is also possible to create copies of the installed singers (a second Lola or Leon for example) which have slightly different default tonal characteristics (such as Gender Factor). These can then be used to add variety to harmony vocal parts spread over several tracks. The simple Mixer window (above) provides a way of adjusting the balance between each track. When the whole vocal arrangement is finished, the File / Export option allows the synthesized vocals to be exported as a WAV file.
I've spent a good deal of time describing the key editing features used in constructing a vocal line with Vocaloid but have, as yet, said very little about what it sounds like. Bear with me here, as an understanding of how the editing process operates is important in appreciating what is possible in terms of Vocaloid's output.
Even an inexperienced Vocaloid user will find it very easy to create 'robotic' special-effect type vocals, and these could work really well in some dance music contexts, although the same results could probably be achieved with a 'real' singer (good or bad) and some over-cooked pitch correction. However, creating a convincing and realistic solo lead vocal is more of a challenge. This is not to say that it cannot be done, but perhaps the best way to describe the process is that once the initial notes and lyrics have been entered, the vocal then has to be 'crafted' using the various expression tools and the control track parameters. This can require some pretty detailed editing work at the level of each note and/or syllable. If all that is required is a short vocal phrase of a few bars, this process is not so bad — but the prospect of doing this through the three minutes or so necessary for an entire song would be quite a daunting challenge.
When creating harmony backing vocal parts based upon short(ish) phrases, some of the editing may only need to be done once. The track can then be copied and some fine-tuning done to the various copies, both in terms of re-pitching notes and varying some details of the expression controls. Again, if this is done without sufficient editing work, the output can be a little mechanical, in a way that's not dissimilar to the results obtained via some of the less sophisticated automatic harmony processors that create harmonies from a live vocal. However, with enough time spent tweaking, the end results can be very good indeed and, sat in a full mix, can give that polished and tight backing vocal sound that is found in a lot of pop and dance music styles. The ability to use a mixture of female (Lola) and male (Leon) vocal parts certainly adds to the overall effect.
A further challenge when first using Vocaloid is getting the phrasing of lyrics to sound natural. When working with vowel-based oohs and ahhs, this is relatively straightforward and, again, in a backing-vocal context these can be made to work really well. Think of the kind of vocal soundscapes that might sit behind an Enya-type track or the solo vocalisations used by Lisa Gerrard in the opening scenes of Ridley Scott's Gladiator. I'm not suggesting here that Vocaloid could replicate the delicate expression that either of these singers possesses, but the comparison provides a sense of the type of thing that is possible.
For proper lyrics, it can take a considerable time to fine-tune the way each syllable is executed, and the process requires careful use of both expression settings and, on occasion, phonetic transformations. All this said, while I found my initial vocal creation efforts to be somewhat frustrating, some persistence and patience eventually started to pay off. Vocaloid is one of those pieces of software that does require serious trial-and-error experimentation before things come together and the workflow improves. New users beware — don't expect instant results straight out of the box.
In this regard, I think Yamaha and Zero-G have missed a small trick, although this would be easily remedied. Both the Lola and Leon CDs include a small number of example MIDI files that can be loaded and used as a basis for auditioning what the software is capable of. While these are useful, they are very few in number and don't really do much to either show off Vocaloid's capabilities or provide a basis from which new users might learn how to improve their own vocal creations. However, on the Zero-G web site, there is a larger collection of MP3 audio examples of Vocaloid in action. These include both backing vocal and lead vocal examples and, for new and prospective users alike, are well worth a listen. While the synthesized nature of the voices is apparent in some of these (deliberately so in one or two tracks), there are also examples where the lead vocals are very effective and frighteningly realistic. The inclusion on the install CD (or on the web site) of the Vocaloid MIDI tracks that were used to generate these vocal parts would provide an excellent illustration to new users of what is possible and how it can be achieved.
Even in its first generation, Vocaloid is a remarkable effort on the part of Yamaha and, in combination with the Lola and Leon virtual vocalists supplied by Zero-G, the results can be very impressive, being suitable for a range of soul, pop or even classical styles. Just be prepared for the long haul if you want to construct a complete lead vocal line.
While Vocaloid Editor can be run as a stand-alone application, it can also be integrated into a sequencer environment via Rewire or as a VST Instrument. I tried both using Cubase SX (v2.0.1) with mixed success. On the test system, the Rewire support worked flawlessly. Having launched SX and then Vocaloid Editor, SX recognised the presence of Vocaloid Editor as a Rewire client and, via SX's Devices menu, Vocaloid's master outputs and up to 16 individual channels could be activated. Transport controls operated perfectly in both applications. A polite 'please wait....' message is generated by Vocaloid if the synthesis engine needs to do its work prior to playback getting underway.
In contrast, I had absolutely no joy with the VST Instrument. This requires Vocaloid MIDI files created in the stand-alone version to be loaded within SX and then played back via the separate Vocaloid VSTi plug-in. The plug-in gives real-time control over the control track parameters. However, on the test system, playback was intermittent and SX became remarkably sluggish. These problems may be system-specific, so a try-before-you-buy approach is probably the best advice I can give. This said, my own choice with SX would be Rewire anyway.
Impressive as Vocaloid is, there is clearly a huge amount of scope for further development. As mentioned earlier, Zero-G have a third vocalist, Miriam, due for release in March, and this should add to the versatility of the product. Details on the Yamaha web site suggest that other developers are also planning vocalist releases. On the operational side, there are a few frustrations, most particularly the need, already noted, to re-synthesize the entire vocal track after any track editing, however minor.
Other minor quibbles include the difficulty of selecting individual note velocity values for editing within the control track and the lack of a mute function for individual controllers (this would allow A/B comparisons to evaluate the influence a controller is having on the vocal sound). A preview function in both the Word Dictionary window and the Phoneme Editor would also be useful to audition pronunciation. The ability to randomise note lengths and start/end points, or levels of controllers, by some variable amount around their current values, would also be useful to speed up the generation of harmony parts. More speculatively, I wonder if down the line somewhere, Yamaha might be able to add an 'auto-expression' function — perhaps using templates based upon real singers — that applies some AI based upon a particular singing style, analyses the vocal track and then attempts to automatically add some expression to get things started?
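To show what the randomise suggestion might amount to in practice, here is a rough Python sketch. It is purely hypothetical (Vocaloid has no such function at present) and the function name, the (start, length, pitch) tuples measured in ticks and the seed parameter are all my own inventions.

```python
import random


def humanise(notes, max_shift, max_stretch, seed=None):
    """Return a copy of notes with start times and lengths nudged by a
    random amount around their current values. Pitches are untouched.

    notes is a list of (start, length, pitch) tuples in ticks.
    """
    rng = random.Random(seed)  # a fixed seed makes the result repeatable
    varied = []
    for start, length, pitch in notes:
        new_start = max(0, start + rng.randint(-max_shift, max_shift))
        new_length = max(1, length + rng.randint(-max_stretch, max_stretch))
        varied.append((new_start, new_length, pitch))
    return varied


lead = [(0, 480, 60), (480, 480, 62), (960, 960, 64)]
harmony_timing = humanise(lead, max_shift=10, max_stretch=20, seed=1)
```

Applied to each copy of a lead line with a different seed, something like this would stop stacked harmony parts from starting and ending in perfect lockstep. As Vocaloid tracks are strictly monophonic, any overlaps the nudging introduces would still need tidying up afterwards.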
So, if you can't sing, can't afford to employ a good session singer or are just plain sick of the temperamental vocal talent you have available, is Vocaloid the answer to your prayers? Well, perhaps. Lola is not going to threaten the Britneys and Christinas of this world and Craig David, as yet, has nothing to fear from Leon but, given that this is the first generation of Vocaloid, and considering the undeniably difficult task Yamaha and their partners are attempting in synthesizing the human singing voice, this is a very impressive debut.
The down side is that creating a vocal part with a convincing degree of realism still requires more than just entering the notes and typing the lyrics — a lot of detailed editing is needed to introduce the subtle variations of pitch, tone, emphasis and phrasing that add life to the synthesized end result. In use, I initially found Vocaloid fascinating, frustrating and downright fun in equal measure, but as I gradually became more familiar with what was required, some of that frustration started to diminish. While I would not, as yet, advocate turning to Vocaloid for all routine pop vocal needs, I can think of a number of creative situations where it would certainly be useful, particularly for backing vocals. I'm very keen to see where Yamaha and Zero-G take this product next, with the release of Miriam being the obvious next highlight. Vocaloid is a brave undertaking, and hats off to both Yamaha and Zero-G for having the ambition to take such a significant first step!