TC‑Helicon's revolutionary VoiceCraft card for their VoicePrism processor allows the user to radically reshape the whole character of a vocal sound, adding tonal and inflectional changes that go way beyond the capabilities of current voice processors. Paul White talks technology with designer Fred Speckeen.
As mentioned in last month's NAMM show report, TC‑Helicon have developed a system of voice modelling that allows the characteristics of the human voice to be changed quite radically while retaining a natural quality. The first product using this technology will be the VoiceCraft expansion card, which plugs into the back of the TC‑Helicon VoicePrism vocal processor. I caught the impressive demos of the VoicePrism and VoiceCraft combination at the NAMM show, and immediately wanted to know more about how it works. Fortunately, I managed to collar TC‑Helicon's Fred Speckeen before he left the show, and he provided some of the answers, while clearly protecting some of the company's more sensitive technical details as the VoiceCraft card nears completion.
Dial E For Elvis?
Having seen the demo of the VoiceCraft card, it seems that the general aim is to give singers the ability to change the character of their voice so that they can sound like somebody else altogether if they want to. Alternatively, they can make minor adjustments to shape their own voice the way they want to. Is this the first step towards a 'dial‑E‑for‑Elvis' type of box that will allow anyone to sound like anyone else?
"The strength of this technology, and the excitement about it amongst those professionals who've seen it seems to be more in the areas of vocal enhancement rather than character transformation. We're not setting out to create models of specific singers. What we're setting out to do is to understand more about the human voice and provide controls that accurately reflect changes in the physiology of the voice. You can get head and body resonances that are somewhat like those of other singers, but you're still not going to be able to create the ultimate emulation. Even if you could completely resynthesize a voice coming into the VoicePrism and then accurately model the target voice — the vocal tract, the excitation signal and all of the resonances — you'd still only be 40 percent of the way there, because of the way individual singers phrase things. You can implement some special effects in VoiceCraft's glottal section, doing things like inflection and vibrato, but you still need the delivery. That's what timing and musicianship is about. A concert pianist friend said to me recently that everyone can use technology to be a musician now, so all that's left is timing."
What kind of control will the VoicePrism and VoiceCraft combination give the singer?
"The VoicePrism itself was designed for the serious project studio owner and for the live performer. The whole user interface is quick and easy with simple icons and meta‑parameters [parameters which themselves control other parameters not accessible by the user — Ed]. That's what we'll be delivering with VoiceCraft. There'll be a number of faders where the user may control things like vibrato. You'll select from a number of styles of vibrato and then adjust the amount, but the software will work out the most natural place to apply it. This will carry through to the other parameters — you pick a style and set an amount.
"Behind that single vibrato control, there are probably 35 to 40 parameters that the user doesn't see. Things that govern how long a note sustains before vibrato is applied, the shape of the vibrato and so on. Furthermore, vibrato comprises both amplitude and pitch modulation, and those two elements are not necessarily synchronous. There's also formant modulation. Most people don't modulate their formants a lot, but the resonances are changing during vibrato."
Do the mechanics of vibrato vary a lot from singer to singer?
"Yes. We recorded a lot of singers and then grouped them according to style. Then we started noticing that there were common elements and identifiable characteristics. Take something as simple as the deviation in pitch above and below the intended note — is that an even split? Take a look at it in an audio editor and you might be surprised at what you find."
Understanding how you analyse and recreate vibrato isn't difficult, even though the process itself may be far from straightforward, but modelling the sound of the voice itself is a different story. Presumably you're using the singer's own voice as an excitation source to drive a physical model that replicates the vocal tract, head resonances, and so on — but don't you have to take away the singer's own formants before you can make that work?
"Yes. The algorithms on the VoiceCraft card analyse the input vocal signal, modify certain parameters and then resynthesize the vocals. During the analysis phase, we separate the vocal signal into its constituent parts, including formant structure. The individual constituent parts are modified or substituted in some way, and then the signal is resynthesized to create a processed vocal signal that is either subtly or dramatically different from the original."
No two people have exactly the same formant structure, so I'd guess that you have to subtract 'Mister Average' from the input signal to get a usable excitation signal?
"Doing things in real time is challenging. It would require a lot of processing just to come up with an approximation of the formant structure in real time. You have to take a more generalised approach."
Last time we spoke, I suggested that a non‑real‑time Learn mode would make this easier — you'd just have to sing a few lines into the unit so it could work out your vocal characteristics (using some kind of correlation technique) and store the necessary parameter values.
"You and I have talked about that before, and we're thinking about it, but not for VoiceCraft."
So, once you've deconstructed your voice and fed it into modelled vocal tracts, nasal cavities and whatever else, what variables will we be able to control in this first incarnation?
"Other than vibrato, which we've already mentioned, there will be some control over how long the vocal tract is, style of resonance and so on, and there may also be some pitch‑shifting built into styles so that you could take a baritone male and turn him into a female or vice versa. At the moment, there are parameters for Growl and Rasp which can emulate some forms of vocal chord damage. That would move you closer to a Rod Stewart or Joe Cocker type of sound. Their sound comes from modulating that damage with the attack of the voice, so we would be foolish not to be considering amplitude‑driven onset of things like rasp and growl. Additionally, there's an effect called Inflection, which adjusts the way singers scoop up or down to a note rather than hitting it straight away. As with the other parameters, we've looked at a lot of different singers and come up with a set of styles the user can choose from. The vocal character will come from combining these parameters. What makes up an Ozarkian twang, for example? [For those who don't know, this is a nasal regional North American accent — Ed.] And how tied up is that with the nasal resonance?
"That leads onto what you can do with the vocal tract model. There's really two things the user can adjust, though again, these are really meta‑parameters that are controlling many others. One is the vocal tract length, and the easiest way of thinking of it is that women tend to have a shorter vocal tract length than men. Even when a man is singing at a high pitch like Michael McDonald, you're putting a high pitch into a big tube. The other user parameter is the actual resonances or formants, and that has to do with the spectral tilt of those areas. There's also a Mix fader, so you'll be able to create a great duet with you and this other person you've made! We're not going to do pitch correction or any form of real‑time pitch control though — the system is purely there to change the vocal sound."
Could you feed LFOs into the vocal tract size, so you could have 'body size vibrato', modulating between male and female...?
"That's a great idea — we should do that! I'll write it down... We're currently spending a lot of time voicing these effects, and will continue to work toward presenting them as style presets, so that the user can select from them and just apply an amount. We'll also try to choose some names or icons that are evocative of the type of sound. The goal is to give the end user the controls they need to get a result very quickly, so they can get on with making music."
You also showed a slider to add breathiness to the voice. Is that just adding noise to the excitation signal before it hits the modelled vocal tract?
"It has to be the right kind of noise. Is breathiness white noise or pink noise? In theory, you should be able to pass noise through a vocoder and emulate breathiness, but that doesn't work. Why it doesn't work is the real question."
The human body is very complex, so I assume that the noise has a spectrum of some kind and that different noise components are added in different places. Your challenge must be to simplify this to the level where it's practical to model it.
"It's no mean feat to physically model instruments, but when you come to the human voice, everyone is an expert in what it should sound like, and anything unnatural really stands out. What's more, you don't know how people will use and abuse this technology. The Cher 'Auto‑Tune' effect has been done to death, but it was interesting at first, and maybe someone will discover a similarly distinctive sound by using VoiceCraft in unexpected ways. That's why we provide controls that 'go to eleven'.
And, as you've already said, designing a unit that models in response to real‑time input is challenging.
"We had a conversation some years ago where we talked about what you could derive from a human voice in real time. There are some things that you can know fairly quickly and accurately, but what happens if you start applying those over a range of parameters that you can adjust in a model? It's impossible to explore all the possibilities. First you have to have the tools, then you start to hypothesise and read a lot about physiology, listen to a lot of singers, then you might come up with a theory about why somebody sounds the way they do. Part of the fun of working in hi‑tech is striking the right balance, so that you're not creating a solution that's in search of a problem. But hopefully we're creative enough to allow a little insanity to creep in!"
How long will it be before we can buy VoiceCraft? And what can we expect next?
"We have the user interface pretty much completed, but you always leave the door open for new developments. The plan is to have the card available by later in the Spring.
"As to what comes next, that's a question that we're asking too. As we work with the technology, we envision products that are dedicated to extending what we're currently doing, and of course we're already refining our techniques; for TC‑Helicon, this is the next stone in the foundation of the company. In the future we'll be looking at both the spoken and the singing voice. When you consider that around 20 percent of the cinema box office last year was science‑fiction, it's logical to target the broadcast and post‑production areas too. Remember, this is just the first incarnation of a new type of modelling technology."