Yamaha's Vocaloid technology has now been upgraded to version 2 and Sweet Ann, from PowerFX, is the first virtual singer based on the new release. So just how much further forward have Yamaha moved their intriguing vocal synthesis technology?
Yamaha's Vocaloid technology caused quite a stir when ZeroG released the first virtual singers, Lola and Leon. These were reviewed in the March 2004 issue of SOS (endpoint29cc8e70.chios.panth.io/sos/mar04/articles/vocaloidlandl.htm), and ZeroG followed the initial releases with Miriam, based on the voice of Miriam Stockley and reviewed in the December 2004 issue (endpoint29cc8e70.chios.panth.io/sos/dec04/articles/miriam.htm). For many songwriters and producers, the possibility of creating complete vocal performances by simply typing in lyrics to go with a MIDI-based melody was, and still is, an appealing prospect.
In its first incarnations, Vocaloid was undoubtedly a remarkable and innovative product and, with experience and patience, was capable of producing results that could be frighteningly realistic. The catch, however, was gaining the experience and having the patience. Although creating backing vocal phrases and harmonies was a feasible proposition, attempting to craft a realistic lead vocal (that is, something not intended to be in the 'special effect' category) represented a significant undertaking. Detailed editing of the phonetic sounds was necessary to get Vocaloid's pronunciation right, and a lot of work on the various expression controls was required to give the vocal some 'life' and dynamics, factors that come built in with most warm-bodied singers!
Of course, creating a virtual vocalist is an ambitious project and, to their credit, Yamaha have persisted with the challenge. Vocaloid 2 is now upon us and PowerFX are the first company to license the technology and release a product based upon it. They describe Sweet Ann as a 'space lounge robovocalist sensation', and also have Big Al — a male singer — in the pipeline. Meanwhile, ZeroG have announced that their classical vocalist Prima is slated for release before the end of 2007. So how much closer have Yamaha moved us to having a singer in a box, and just what does a 'space lounge robovocalist sensation' sound like?
As before, Vocaloid is provided as a standalone editor application with ReWire support and as a VST Instrument plugin. New with version 2 is a 'VSTi real-time' plugin, of which more a little later. Although there are some significant changes in the new version of Vocaloid, the basic principles of its operation remain the same, so if you are new to the product, the previous SOS reviews mentioned above are well worth a quick read.
Sweet Ann, like all Vocaloid-based virtual singers, consists of two elements: the Yamaha synthesis engine and a singer database. The former provides a piano-roll-style editor into which the user can enter notes to create a melody, before assigning lyrics to each of these notes, along with various controls to add expression. The singer database consists of a sample library where the singer has been sampled singing all possible phonetic sounds and transitions between different phonemes. Once the lyrics are entered, Vocaloid extracts the necessary set of phonetic samples, links them together at the required pitch, adds the expression and, as if by magic, sings the vocal part required. The editing required is less of an issue when creating short phrases suitable for backing vocals and, usefully, Vocaloid allows multiple tracks to be created for harmony production.
While there are all sorts of detailed changes in Vocaloid 2, the most significant new features include a new synthesis engine, some improvements to the user interface, and the addition of the real-time instrument plugin version. One outcome of the changes to the synthesis engine is that it does not seem possible to use sample databases created for the version 1 engine with the v2 engine. However, I was able to open files created in Vocaloid 1, and the new editor did a decent job of translating them into Vocaloid 2 format.
The piano-roll editor is essentially the same as before, but the toolbar has been streamlined (with some options moved to the main menus). Perhaps the more significant change, however, is in the control track. A list of control parameters is displayed down the left-hand edge, with the currently selected parameter highlighted in blue. The control track is now semi-transparent, with the previously edited parameter remaining visible when a new parameter is selected. This makes adjusting multiple parameters much easier.
The set of control parameters has also changed. The rather mysterious 'resonance' controls from version 1 have gone and the relationship between Velocity and Dynamics parameters seems better defined, with Velocity influencing the length of consonants at the beginning of notes (useful for adjusting pronunciation), while Dynamics alters loudness, for adding volume changes. The majority of the other parameters — Breathiness, Brightness, Clearness, Opening and Gender — change the character of the voice, although they do need to be used carefully or obvious audio artifacts can be introduced.
One of the comments I made in reviewing Lola, Leon and Miriam was that it would be nice if Vocaloid included some default 'styles' for expression that could be automatically applied to a vocal line to speed up the initial stages of vocal creation. Yamaha made some useful moves in that direction with an updater for Vocaloid 1 and this process has been developed a little further in the new release. The Settings / Singing Style Defaults option offers a selection of templates, and also allows manual setting of a number of pitch and dynamics controls. These settings can then be applied to all notes in the current track, providing a good starting point for further note-by-note editing.
The control track aside, editing of all the details associated with individual notes is now done from within the Note Property dialogue, which is available via a pop-up menu when you right-click on a note. This includes further drop-downs to customise the expression and vibrato settings, and edit either the lyric or its phonetic translation. It is also possible to protect a note once you are happy with its execution, which prevents any subsequent track-based editing from overwriting the note properties. For detailed editing, this dialogue is simple and effective, but it would be a really big help if it included an 'audition' button so the single note could be rendered by the synthesis engine, allowing you to hear the results of any edits without returning to the piano-roll editor. As it stands, you have to close the dialogue and play through the arrangement to hear the changes, which is not great for the workflow when you are fine-tuning a performance.
Yamaha have, undoubtedly, made some excellent improvements to the Vocaloid editing process, but it is a little surprising that the Undo feature still only supports the last action — multiple levels of undo would be considerably more useful. Basic MIDI input into the standalone editor to create your initial melody would also have been nice to see.
For me, the real highlight of Vocaloid 2 is the real-time VSTi plugin. This uses the same engine as the standalone editor but, as far as the user is concerned, operates in a very different fashion. Once added to a project (for example, via the VST Instruments panel in Cubase), like other VSTis the plugin is then available as a possible output destination for a MIDI track. Clicking on the Lyrics Edit button opens a further dialogue into which lyrics can be typed or pasted from another application. When the phrase is complete, Vocaloid breaks it into its various phonetic sounds. The user then presses the 'aA' button and the synthesis engine does its work behind the scenes. The phrase can then be triggered either live from a MIDI keyboard or from a pre-recorded MIDI track, with each MIDI note triggering a single syllable.
The user still has control over a number of expressive options. The Settings button opens a dialogue for customising the vibrato, pitch-bend and portamento properties. The faders, which can be controlled in real time either via a mouse or a hardware controller, allow changes in the voice quality to be made, while note velocity (attack of first consonant), pitch-bend, modulation (to control vibrato) and expression (volume) can also provide real-time control.
The Delay and Decay faders provide some useful control over pronunciation. Delay influences the length of the consonant at the beginning of a note, while Decay adjusts the length of the consonant at the end, with the numeric values in milliseconds. I found it useful to adjust these, using different settings for rapid and slower phrases, but they can be bypassed by pressing the Fixed button, which just applies a preset length.
The other interesting options concern the Mono and Poly buttons. In Mono mode, the plugin creates a single voice. Unlike the standalone editor, however, overlapping notes are allowed, which permits a syllable to be stretched over multiple pitches, which is great for adding movement to a melody. In Poly mode, the plugin provides up to four-part harmony, and the obvious application here is for backing vocals. If you press a single note and hold it and then add a second, third and fourth note, the voices gradually appear singing the same syllable. Only when all notes are released and a new set of notes is triggered does the engine move on to the next syllable. Again, this adds further flexibility, although it does take a bit of practice to get used to 'playing' a set of voices in this fashion.
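For readers who like to see this kind of behaviour spelled out, the syllable-stepping rule described above can be sketched in a few lines of Python. This is only an illustration of the behaviour as I observed it, not Yamaha's implementation: the class name `SyllableSequencer`, its methods, the four-voice limit as a parameter and the wrap-around at the end of the phrase are all assumptions made for the example.

```python
class SyllableSequencer:
    """Hypothetical model of how held notes map to syllables.

    Assumption: the engine steps to the next syllable only when a note
    arrives while no other notes are held; notes added while others are
    still down sing the current syllable.
    """

    def __init__(self, syllables, max_voices=4):
        if not syllables:
            raise ValueError("at least one syllable is required")
        self.syllables = list(syllables)
        self.max_voices = max_voices  # Poly mode: up to four parts
        self.index = -1               # nothing sung yet
        self.held = set()             # MIDI pitches currently held

    def note_on(self, pitch):
        """Return the syllable sung for this note, or None if no voice is free."""
        if not self.held:
            # All keys were up, so this note starts the next syllable.
            # Wrapping back to the start of the phrase is an assumption.
            self.index = (self.index + 1) % len(self.syllables)
        elif pitch not in self.held and len(self.held) >= self.max_voices:
            return None
        self.held.add(pitch)
        return self.syllables[self.index]

    def note_off(self, pitch):
        self.held.discard(pitch)


seq = SyllableSequencer(["twin", "kle", "lit", "tle"])
first = seq.note_on(60)    # first key down starts the first syllable
second = seq.note_on(64)   # added while 60 is held: same syllable
seq.note_off(60)
seq.note_off(64)
third = seq.note_on(62)    # all keys were released: next syllable
```

With `max_voices=1` and overlapping notes permitted, the same rule models the Mono-mode trick of stretching one syllable across several pitches, since an overlapping note never finds the keyboard empty.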
For me, the VSTi real-time interface is a much more intuitive approach to creating a synthetic vocal than the standalone editor. If Yamaha continue to develop Vocaloid, I would imagine this is the direction in which they will take it.
So much for the operational improvements in Vocaloid 2, but do they result in a more convincing virtual vocal, and how does Sweet Ann sound? In brief, the answers to these two questions are 'sometimes' and 'sweet'. I tested both the standalone editor and the real-time plugin within Cubase, and the changes to the user interface made it much easier to get an initial vocal line beyond that 'this is obviously synthetic' stage. That said, going that extra mile to edit the expression and pronunciation in such a way that the vocal sounds convincingly natural is still a challenge. In fairness to both Yamaha and PowerFX we should not lose sight of what is being attempted here: synthesizing the most expressive of musical instruments is not an easy task. Vocaloid 2 is most certainly a step up from the earlier release, both in terms of ease of use and synthesis quality, but the technology still has some way to go before human singers need worry about being routinely replaced.
That said, how appropriate Vocaloid and Sweet Ann are as tools depends upon the job you have in mind. For gentle 'ooh' or 'aah'-style vocal lines or harmonies, it is perfectly possible to create something very convincing, and you can do it much more quickly with this release of Vocaloid. The same is true for short phrases using real words for vocal hooks or backing vocals; these require a little more work, but it can certainly be done. And as with Vocaloid 1, creating vocals that are intended to sound synthetic (as might be used in some dance styles, for example) is a breeze. That leaves the question of whether a convincing lead vocal can be created and, at this stage, I think that's still right at the edge of Vocaloid's talents, although I'm sure a dedicated few will attempt it.
In terms of vocal styles, Sweet Ann is, as her name suggests, quite sweet-sounding and, synthesis quality aside, the voice is perhaps more suited to certain pop styles or the purer side of musical theatre than sexy R&B or raunchy rock. This is perhaps understandable, as I would imagine it would be more difficult to reproduce a variable smoker's rasp via the synthesis process!
So where have Vocaloid 2 and Sweet Ann taken us to? The new synthesis engine does seem to have brought some improvements in the quality of the vocal that can be created, but these improvements should probably be seen as incremental steps rather than a revolutionary leap forward. In practical terms, the biggest strides, at least on the basis of Sweet Ann, appear to have been made in the area of the user interface. This now makes it much easier to get towards the best that the synthesis engine has to offer, and the VSTi real-time plugin suggests the beginnings of an approach that is much more musician-friendly.
I can see many musicians (myself included) putting Sweet Ann to use for a range of supporting vocal tasks, and although Vocaloid has not yet reached the point where it could be regarded as a first-class singer in a box, Yamaha and partners such as PowerFX deserve considerable credit for pushing the envelope a little further. I, for one, hope they can continue to resource further developments, because this is remarkable and fascinating technology.
- Windows XP, 512MB RAM (2GB recommended when using real-time VSTi plugin), 2GHz Pentium 4, Athlon XP200+ or faster CPU, 2GB hard disk space, DVD-ROM drive.