We think nothing of using computers to help us create music via the wonders of MIDI — so why not use the computer to generate the actual sounds, and take entire synthesis systems into the realm of computer software? Paul D. Lehrman puts the case for and against this move, and talks to some software synth developers.
When boat owners gather, one of the debates you'll inevitably hear is 'inboard vs. outboard': which is better, in terms of power, efficiency, and maintenance? Now that same argument is being heard in synthesis circles: do we need external hardware or even soundcards to produce our music — or should we be doing it within our computers?
Now, don't panic: no one's going to make you throw away your precious M1s and K2000s. We're not talking about professional studios here, we're talking about the multimedia world, in which MIDI plays a relatively small, but important role: to deliver high quality music and sound effects behind games, presentations, interactive programs, and other desktop media.
In these circles — and watch out, they're getting bigger all the time — the new, superfast CPUs now available for our favourite desktop platforms have spurred a movement to bring the creation of MIDI‑controlled music right into the computer's main brain. Thanks to chips like the PowerPC and Pentium, as well as the development of cheap digital‑to‑analogue converters (or DACs), companies are looking to move the task of synthesizing audio into the computer itself, as a way of bypassing the notorious difficulties of installing soundcards, particularly on MS‑DOS machines. IRQ conflicts, specialised driver software, competing synthesis technologies, and sound library incompatibilities frequently make setting up a soundcard and getting it to play a nightmarish experience. If the cards could be eliminated, the thinking goes, but the quality of sound preserved, one of the music‑for‑multimedia community's greatest headaches would be removed.
Today's machines are so fast that dedicated sound chips are no longer absolutely necessary for a computer to perform real‑time synthesis. The technology is pretty straightforward: a bunch of audio samples (or 'wavetables') are loaded into the computer's RAM, and when a command is received by the operating system to play a note, the appropriate sample is retrieved, run through a pitch‑shifter or other real‑time processor, and sent to a DAC to turn it into something we can hear. DACs cost about $50 at retail (considerably less at the manufacturing level), and can be found already installed on the motherboards of many multimedia computers, particularly slotless notebooks. From the DAC, the sound then goes out as an ordinary analogue signal to speakers, headphones, or amplifiers. The commands controlling the process are normally in the form of MIDI messages, sent through the computer's operating system using special drivers or extensions, such as Opcode's Open Music System (or OMS; this has been licensed for use in future versions of both Apple's QuickTime and Windows) or Microsoft's Media Control Interface (MCI).
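To make the retrieve, pitch‑shift and output chain concrete, here is a minimal Python sketch of how a stored sample might be replayed at a different pitch. It is an illustration only, not any vendor's engine: the function names, the sine wave standing in for a recorded sample, and the choice of linear interpolation are all assumptions made for the example.

```python
import math

def make_wavetable(length=256):
    """One cycle of a sine wave, standing in for a recorded instrument sample."""
    return [math.sin(2 * math.pi * i / length) for i in range(length)]

def play_note(table, base_note, note, num_samples):
    """Replay `table` (assumed recorded at MIDI note `base_note`) at `note`.

    The pitch shift is done by stepping through the table faster or slower:
    each semitone multiplies the read speed by 2 ** (1 / 12).
    """
    step = 2 ** ((note - base_note) / 12)
    out = []
    pos = 0.0
    for _ in range(num_samples):
        i = int(pos)
        frac = pos - i
        a = table[i % len(table)]          # sample at or before the read position
        b = table[(i + 1) % len(table)]    # next sample (the table loops)
        out.append(a + (b - a) * frac)     # linear interpolation between the two
        pos += step
    return out

def to_16bit(samples):
    """Scale -1.0..1.0 floats to the 16-bit integers a DAC would be fed."""
    return [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]

table = make_wavetable()
same_pitch = play_note(table, base_note=60, note=60, num_samples=256)
octave_up = play_note(table, base_note=60, note=72, num_samples=256)
```

Played at its original pitch the table is read one entry at a time; an octave up, every second entry is read, so the cycle finishes in half the time. A real engine does this for every sounding voice, many thousands of times a second, which is where the CPU load comes from.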
The quality of the sound depends on the usual suspects: sample word length (8‑ or 16‑bit) and sample rate. But there's another important factor: the number of sounds that can be played at one time. Since the CPU is spewing out samples at a fixed rate, the number of voices it can generate at once is a function of its speed. Also affecting the polyphony is how much of the CPU's power is available for this task, and how much needs to be reserved for other things like video, graphics rendering, or digital audio. A slow CPU might be able to handle four real‑time 'voices' at a time, while faster ones can handle commensurately more. If a system is to be considered a true General MIDI synth, it needs to be able to play at least 24 voices; if there are fewer available than that, many GM scores will play incorrectly.
Yet another variable is how big the sound set is. A typical GM synth has between 1 and 4 Mb of sample ROM, and for a software synth to have similar capability, all of that has to be available for loading into the computer's RAM.
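To get a feel for what those figures mean, the short Python sketch below works out how many seconds of uncompressed mono audio fit into a given amount of memory. It assumes that 'Mb' here means megabytes of 1,048,576 bytes and that '22kHz' means 22,050Hz; the function name is made up for the example.

```python
MB = 1024 * 1024  # assumption: 'Mb' in the text means megabytes

def seconds_of_audio(memory_bytes, sample_rate_hz, bits_per_sample, channels=1):
    """Seconds of uncompressed PCM audio that fit in `memory_bytes`."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return memory_bytes / bytes_per_second

for size_mb in (1, 2, 4):
    secs = seconds_of_audio(size_mb * MB, 22050, 16)
    print(f"{size_mb}Mb at 16-bit, 22.05kHz mono: about {secs:.1f} seconds")
```

On those assumptions, a 2Mb set holds a little under 48 seconds of 16‑bit audio, to be shared among all 128 GM instruments plus drums, which is why sound designers rely so heavily on short loops and multisampling.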
There are some tricks that designers of software synthesis systems can use to maximize performance without overwhelming the computer. To keep RAM requirements down, a system can 'cache' sounds, loading into RAM only the sounds needed in a particular sequence just before playing it. Another technique is to compress the sounds, using JPEG or a similar algorithm, and decompress on the fly. This requires a coder/decoder (or 'codec'). Codecs and caching put an extra burden on the CPU, however. To minimise CPU load, sophisticated voice‑stealing algorithms can limit the number of simultaneous voices sounding at one time, without the music suffering unduly.
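Voice stealing is easy to sketch. The Python below caps the number of sounding voices and, when a new note arrives with none free, cuts off the oldest one. Oldest‑first is only one possible policy, chosen here for simplicity; the 'sophisticated' algorithms mentioned above would presumably also weigh factors such as how loud a voice still is, and the class and method names are invented for the example.

```python
class VoiceAllocator:
    """Caps polyphony at `max_voices`, stealing the oldest voice when full."""

    def __init__(self, max_voices):
        self.max_voices = max_voices
        self.active = []  # sounding note numbers, oldest first

    def note_on(self, note):
        """Start `note`; return the note that was stolen, or None."""
        stolen = None
        if len(self.active) >= self.max_voices:
            stolen = self.active.pop(0)  # oldest voice gives way
        self.active.append(note)
        return stolen

    def note_off(self, note):
        """Release `note` if it is still sounding."""
        if note in self.active:
            self.active.remove(note)

synth = VoiceAllocator(max_voices=2)
synth.note_on(60)
synth.note_on(64)
stolen = synth.note_on(67)  # no free voice, so note 60 is cut off
```

The trade‑off is audible only if the stolen voice was still prominent, which is why the choice of victim matters more than the mechanism.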
There are already nearly a dozen companies involved in software‑based synthesis, designing synthesis engines or supplying sound sets, and they are being driven in large part by two very major players: Intel and Apple. But there are many within the music and multimedia communities who think the whole idea is over‑rated, that software synthesis is not ready for prime time, and may never be. Whichever view prevails, the result will have a major effect on the future of music in multimedia.
Let's take a look at some of the companies currently active in this area, starting with Apple.
Apple's QuickTime 2.0 multimedia standard includes a set of samples provided by Roland called QuickTime Musical Instruments, which was drawn from Roland's GM sound set and reduced to about half a megabyte. A sequence track, consisting of a standard MIDI File, can be added to any QuickTime movie. When the movie plays, the sequence track plays the QuickTime Musical Instrument sounds. The quality of the sounds depends on the Mac you're using (8‑bit for 68030‑ or '040‑series machines, 16‑bit for AV and PowerMacs) and the number of voices available ranges from 4‑6 on a Mac LC to as many as 30 on the fastest PowerMacs.
The first incarnation of QuickTime Musical Instruments, released early in 1994, was pretty basic. Only a few sounds from each GM instrument bank were present, and when a program change called for a sound that wasn't there, it defaulted to the first sound in the bank. The problem was, somebody goofed in determining the way program changes were interpreted (that old question of 'do we start counting at 0 or at 1?'), and sometimes the wrong bank would get called up. In addition, no real‑time control over the sounds, except for note velocity, was possible. Finally, there was a significant delay when playing built‑in sounds: the software takes approximately 100 milliseconds to process a MIDI command and play a sound. If you're just playing back files, this is no real problem (all sounds are delayed equally), but synchronising built‑in sounds with external MIDI sound sources is difficult, and playing the Mac sounds from a MIDI keyboard is nearly impossible.
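The counting problem is worth a quick illustration. General MIDI arranges its 128 programs in 16 families of eight (the 'banks' referred to above); on the wire, program changes carry values 0 to 127, while instrument lists usually number them 1 to 128. The Python sketch below shows what happens if the wire value is mistakenly treated as the 1‑based number. The article does not say exactly how Apple's bug arose, so the faulty function is purely hypothetical.

```python
GM_FAMILIES = [
    "Piano", "Chromatic Percussion", "Organ", "Guitar",
    "Bass", "Strings", "Ensemble", "Brass",
    "Reed", "Pipe", "Synth Lead", "Synth Pad",
    "Synth Effects", "Ethnic", "Percussive", "Sound Effects",
]

def family_correct(program_byte):
    """Family for a program change value as sent over MIDI (0-127)."""
    return GM_FAMILIES[program_byte // 8]

def family_off_by_one(program_byte):
    """Hypothetical bug: the 0-based wire value is treated as 1-based."""
    return GM_FAMILIES[max(program_byte - 1, 0) // 8]

# Wire values for which the two interpretations pick different families
wrong = [p for p in range(128) if family_correct(p) != family_off_by_one(p)]
```

With this particular mistake only the first program of each family (wire values 8, 16, 24 and so on) lands in the wrong family, which fits the description of a fault that shows up 'sometimes' rather than always.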
The next version of the Musical Instruments extension, 2.1, which will probably be out by the end of the summer, takes care of some of these problems. The program change bug is gone. More multisamples are being used, and real‑time controllers including volume and modulation are being recognised. "A lot of GM scores will play almost correctly," says a source at Apple. The sound set is the same, but the system now allows third‑party sounds to be included. Apple is reportedly looking closely at Sound Fonts, a standard being developed by Emu Systems, as a way to organise and exchange sounds. There's nothing in the 2.1 version to address the delay problem, but the source says Apple are "going to make an aggressive effort to address it with the next go‑round. We're looking to get it down to 10‑15 milliseconds".
INVISION INTERACTIVE CYBERSOUND GM
Another Macintosh system has been announced by InVision Interactive of Palo Alto, California. CyberSound GM, due in October, is a subset of a high‑end, professional synthesis system that will be available at a later date. The larger system will support multiple synthesis types including analogue, PCM, and physical modelling, and will come complete with a complex modulation matrix, real‑time controllable resonant filters, and programmable time‑based effects such as reverb and chorus. The GM set will incorporate all of these technologies, but they won't be under direct user control. The system will load as a control panel, and be playable using either MIDI Manager, OMS or QuickTime. "There's no delay problem," says InVision spokesman Tim Gehrt. "We've specifically paid close attention to latency problems associated with software synthesis. Using technologies such as dynamic sample downloading and intelligent algorithm scaling, the InVision solution manages the synthesizer's load on the CPU and provides better overall performance".
According to Gehrt, CyberSound will ship with two GM sound sets, one for the 68000 platforms and a high‑level set for the PowerPC environment. The smaller GM set trades off response time for better overall performance, but the high‑level PowerPC set, according to Gehrt, "will perform equal to all the professional synths out there in terms of response time, sound quality, and performance". Both sets use 16‑bit samples, and up to 65 voices will be available on the PowerPC, with a minimum of 24 on a 68040 Mac. In addition, CyberSound will include various sounds programmed using the analogue and physical modelling algorithms. The US price was originally projected to be between $200 and $300, but the product is currently being revised to include a sequencer (although specific details are sketchy at present). However, Gehrt maintains that despite this addition, the current price projections are now lower than they were originally. As usual, the only advice one can give with a product in development like this is "watch this space!"...
US soundcard manufacturer Turtle Beach include V‑Synth, a set of software‑based sounds licensed from Seer Systems (a Silicon Valley company run by Stanley Jungleib, one of the forces behind the seminal synth manufacturer Sequential Circuits) with the cheap Turtle Beach Monte Carlo game soundcard for IBM‑compatible PCs. V‑Synth is a 32‑voice synth engine that works with any 486 or Pentium system, and according to spokesman Roy Smith, "any soundcard with the right drivers". At present, however, the Monte Carlo, a 16‑bit SoundBlaster clone, is the only available card that works. The drivers in Windows 95, Smith expects, will make it possible to run the system through any Windows‑compatible soundcard. However, since the system depends on the virtual device architecture that Windows provides, it doesn't work with DOS games. Smith also hopes to have the company's SampleVision and WavePatch editing programs available for the new platform.
Another PC‑orientated company is Brooktree. This 12‑year‑old San Diego semiconductor company has developed WaveStream, which works with Windows on 486/33 machines and up. An MPU401 emulation mode allows use with DOS games. The sound set, which contains 16‑bit samples, is from Q‑Up Arts, and takes up 8Mb of RAM with the sample rate fixed at 22 kHz. A caching system, which loads the sounds required with any sequence, uses anywhere between 600K and 2.3Mb, depending on the emulation quality required.
Real‑time control is provided over pitchbend, aftertouch, mod wheel, and sustain, as well as dynamic filters, filter envelopes, and LFO. The number of voices is selectable on the fly, using a feature called 'dynamic polyphony', from 2 to 32 voices. Another feature called MultiSynth will, if there's an FM card present, offload some of the instruments to the card, reserving wavetable voices for when they will do the most good. Senior product marketing manager Joe Monastiero comments: "With a 486/33 we recommend using nine wavetable voices, with the rest in FM."
Although the software was developed on a SoundBlaster 16, plans are to ship it initially only with the company's own MediaStream chipset, which, in addition to audio, handles graphics and video on a PCI card. The release is scheduled for October. Ports to other soundcards are expected shortly, and user‑designed sounds will be accommodated through a program called SampleXchange.
Even IBM is getting into the game, with not one, but three software‑based systems. Some models of the company's ThinkPad computers have a fairly simple 16‑note synth running on the machine's 80486 processor. More interestingly, IBM's Power Personal Systems division has announced a software synthesis system for the company's PowerPC‑based computers. It ships with the NT 3.51 operating system, and can play up to 32 notes simultaneously using 16‑bit, 22‑kHz samples. Three different configurations are supported, each making different demands on CPU and memory. The first configuration is called BASIC and utilises the subtractive synthesis technique employed in the first ThinkPad MIDI synth. This mode requires very few CPU cycles and zero sample memory. The second configuration (PREMIER) performs wavetable synthesis using about 2Mb of samples. Finally, the third mode combines the first two, offering a mixture of wavetable and subtractive synthesis. A stereo audio codec is included on the motherboard, along with a joystick/MIDI port that uses the same cable as a SoundBlaster. The system includes a stereo mixer that can combine up to four signals at any standard MME sample rate or word length, in mono or stereo. IBM reports that Sonic Foundry has ported their highly regarded Sound Forge 3.0 audio editing software to run in native mode on the system.
Finally, IBM are also shipping a third software synth with their PCMCIA card. This makes use of the same samples used in the PowerPC synth (16‑bit, at a rate of 22kHz) but does not offer per‑voice configurability.
The Kurzweil Technology Group, at the Young Chang R & D Institute in Massachusetts, is working on a software synthesis engine for a number of large PC manufacturers. The sound set is almost the same as the 2Mb ROM that goes into several of the GM chipsets that Kurzweil market to other equipment manufacturers, and features a maximum 32‑voice polyphony at a maximum 44.1kHz bandwidth. According to marketing manager Fred Lapitino, "the number of voices will be software‑controllable by the user. This will allow you to optimise the synth structure for the varying demands of multimedia applications, and the varying PC hardware on which these are run".
The synth can be loaded with new samples using MIDI Sample Dump Standard, but since a real MIDI cable isn't involved, the operation is quite fast, according to Howard Brown, chief architect of the group. The size of the set is not fixed, but can be expanded up to the RAM available in the host computer. The processing delay for live MIDI control is, according to Brown, "comparable to most MIDI instruments, or under a millisecond, except that the operating system sometimes gets in the way".
And finally, in a surprise entry, Altec Lansing, the Pennsylvania speaker manufacturer, is dipping its corporate toes into software, with a wavetable synthesis system for Windows scheduled for shipment in late Autumn. It uses only 1Mb of RAM, and will run with a 75MHz (or faster) Pentium. The price is minimal too: $29.95. No further details were available at the time of going to press.
Even the sceptics see a role for software synthesis in the future. "Music professionals will never accept the performance," says Turtle Beach's Roy Smith, "but for what the masses need to play back games, it can be quite happening. As time goes on, the proportion of power being taken will diminish by orders of magnitude, until it gets to the point where it's a background task, like a screen‑saver."
Emu's Dave Rossum agrees: "In the very long term, as the machines get an order of magnitude faster, we can go from a situation where the software synth occupies 25% of the CPU to 2.5%, and then synthesis becomes small change. Of course, we continue to discover new things — right now, we're looking at doing 3‑D audio from two channels — that may keep demand on audio performance apace with CPU performance. But people using this in the future, in a room with the disk drives yelling and the kids whirring — they'll be very happy if they can get from the software the equivalent quality of a 1995 General MIDI synth."
As to who is going to win out among the players, perhaps Fred Lapitino has the right philosophy: "Just as with hardware, the people who have been doing it the longest will do it the best. All wavetable software synths basically do the same thing, but it's the craft of the people putting the sounds together that will make the difference."
And as to how all this affects the professional musician, nobody's going to make you record all your music on this stuff. But just as the professional composer for television checks his work on a 2‑inch speaker, the multimedia composer may want to start getting used to hearing what his music sounds like played back on it.
Onboard synthesis is not a new concept. Years ago, you could get a computer to play music by writing a machine‑language program that triggered its internal speaker at speeds in the audio range. When you toggled the speaker on and off 440 times a second, you got something approximating a square wave playing the note 'A'. In 1982, Commodore put a 3‑voice simple‑waveform synthesizer chip known as 'SID' (Sound Interface Device) in its model 64 computer, where it was well utilised by game designers. Then, on the first Macintosh, Apple included a 4‑voice, 8‑bit, wavetable sound chip that could carry out real‑time synthesis. It was supported by a number of products that let you design sounds and play scores, and even use the sounds with a MIDI keyboard. Later, the so‑called 'AV' Macs included a 16‑bit high sample rate sound chip, but manufacturers who supported it used it for recording and playing digital audio files, not synthesis.
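The speaker‑toggling trick translates directly into a few lines of modern code. This Python sketch produces the stream of on/off states that such a program would have sent to the speaker, expressed as one value per output sample. The function names and the 44,100Hz output rate are assumptions for the example, and integer arithmetic is used so that the result is exact.

```python
def square_wave(freq_hz, sample_rate_hz, num_samples):
    """Speaker state (1 = on, 0 = off) per sample; on for the first half of each cycle.

    `freq_hz` and `sample_rate_hz` are assumed to be whole numbers.
    """
    bits = []
    for n in range(num_samples):
        # Position within the current cycle, in units of 1/sample_rate_hz of a cycle
        position = (n * freq_hz) % sample_rate_hz
        bits.append(1 if 2 * position < sample_rate_hz else 0)
    return bits

def count_on_off_cycles(bits):
    """Number of times the speaker goes from on to off."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a == 1 and b == 0)

one_second_of_a = square_wave(440, 44100, 44100)
```

Over one second the speaker switches on and off 440 times, giving the A above middle C, complete with the buzzy harmonics of a square wave.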
With so many companies, including some truly heavy hitters, backing software synthesis, is this the inexorable wave of the future? Is hardware, at least as far as PC‑based music goes, doomed?
Well, it's probably still a little premature to melt down your soundcards. Even those who are working on software synthesis are a bit sceptical. The reason is that the performance 'hit' on a computer trying to generate 32 MIDI voices in real time — even a fast one — can be substantial, and some feel that playing music is not the most efficient use of a CPU's time. According to Kurzweil's Fred Lapitino, "playing 32 high quality voices at 44.1kHz bandwidth uses up almost all of an average 486, and about half the processing power of a 100MHz Pentium". Lapitino's colleague Howard Brown adds: "Software synthesis is a great advertisement for hardware synthesis. The hit is considerable: at 44.1kHz, you're looking at 2 to 5% per voice on a 90MHz Pentium. It's a false economy — it looks like the synth is free, but actually, it's taking up as much as 60% of your CPU. When you add in the cost of the RAM that is dedicated to making music, you're spending $750 to do what a $150 soundcard can do."
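Brown's per‑voice figures are easy to turn into a rough budget. The Python sketch below assumes, simplistically, that CPU load rises in a straight line with the number of voices; the function names are invented for the example.

```python
def cpu_load_percent(voices, percent_per_voice):
    """Total CPU load, assuming each voice costs the same fixed share."""
    return voices * percent_per_voice

def max_voices(cpu_budget_percent, percent_per_voice):
    """Most voices that fit within a given share of the CPU."""
    return int(cpu_budget_percent // percent_per_voice)

for cost in (2, 5):  # Brown's range: 2 to 5% per voice on a 90MHz Pentium
    print(f"At {cost}% per voice: 32 voices need {cpu_load_percent(32, cost)}% of the CPU; "
          f"a 60% budget allows {max_voices(60, cost)} voices")
```

At 2% per voice, 32 voices come to 64% of the processor, in line with Brown's 'as much as 60%'. At 5% they would need 160%, which is to say the machine could not manage all 32 at once, and a 60% budget would stretch to only 12 voices.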
Some would consider Brown's figures to be a bit high, but many in the audio community share his concern. Gordon Currie, audio designer at Microsoft, says software synthesis is "a great idea, but it still has some way to go. It reminds me of a lot of software implementations of stuff that originated in hardware: it's always slower in software, and not as good quality. Everybody and his brother's going towards software, assuming that with the new CPUs there's so much idle time to spare, but if all of your DSP is in software, you end up with a dog‑slow CPU that's trying to do everything at once: audio decompression, video, 3‑D graphics, and a modem in the background. The hardware is finally fast enough that we can start writing code with fast hardware in mind, but when everybody does it, it's going to slow down again. The user will think 'This is ridiculous — I paid all this money and it's all hype, the system is slow as molasses!'"
Others point out that with hardware‑based synthesis as cheap as it is, it makes more sense to use the CPU for tasks which can't be so easily duplicated in hardware. "A Pentium 120 with 16 megs has enough memory and processing power to do high‑quality synthesis," says Microsoft's Geoff Dahl, "but some games would prefer to use the resources for improved 3D rendering, or MPEG video or audio decompression. Maybe you can do one or two of these at once, but it would probably not do a great job with all four at the same time."
Dave Rossum, director of the Emu/Creative Labs technology center, calls software synthesis "a cool thing to use your computer for, but right now the speed of the machines is such that anything that's going to sound interesting is taking a significant hit. If all you're doing is synthesizing music, that seems to work. It's not necessarily the best sound quality and most accurate sequence playback, but I don't retch and vomit when I hear it. But when you get into MIPS‑hungry (millions of instructions per second) applications, everyone's going for as many MIPS as they can get, and reasonable synthesis takes 25% or more of performance, so basically, you'll end up sacrificing something".