Nov. 16, 2003. 07:59 AM
Researcher David Gerhard trains computers to distinguish song from speech by using recordings of identical words spoken and sung by the same people. Here are two examples:

Oh Canada: speaking, singing

Row Row Row: speaking, singing

Teaching computers how we sing
Even babies know how to separate speech from song
So why do the best of our computers find it so difficult?

PETER CALAMAI
SCIENCE WRITER

U2 artist Bono didn't sing his message when he spoke about international development to the Liberal leadership convention in Toronto Friday. The effect of his performance would have been decidedly different if he'd sung "Sunday Bloody Sunday" or any other U2 classic, but his change of medium would have gone undetected by even the most powerful and sophisticated computer. In fact, the computer likely wouldn't even recognize that Bono was singing rather than speaking.

Even babies quickly grasp the difference between speech and song, so why do computers find it difficult, and does it matter? After all, as anyone who has phoned to check a credit card balance or get a directory listing knows, we already use computerized speech recognition for simple daily tasks.

David Gerhard, a professor of computer science at the University of Regina, says the speech-song distinction matters if we expect computers ever to approach the kind of artificial intelligence regularly depicted in science fiction movies, one that interacts smoothly with human beings and their surroundings.

Computers programmed to recognize and analyze the sung voice could have numerous practical applications: speech therapy, transcribing words and musical notes from a song, training singers, even retrieving songs that fit your personal tastes from the immense and growing online music collections.

Gerhard is among a small coterie of researchers tackling the song and music limitations of existing computerized sound-recognition systems. And the researchers are confident their work will not, as some might fear, erase yet one more element of chance and mystery from life and human interaction.

An accomplished guitarist and singer, Gerhard believes properly programmed computers can deepen our appreciation of music's magic and mysteries. He says the inspiration for his current research, described last week at the Acoustical Society of America meeting in Austin, Texas, struck when he was directing a choir.

"I was listening to the voices and how they all adjusted and blended together and I thought: I wonder just what is going on here," he says.

At the core of that question is determining the elements people use to distinguish sung lyrics from spoken words. How important is vibrato or pitch? And how does the human computer, our brain, handle the huge range between the vibrato of an operatic basso profundo and a quivering elderly voice, yet realize that both are examples of singing?

"Computers really have only one powerful processing site," notes Gerhard, "but our brains have millions of simple interconnected sites.

"It's what's called massively parallel computing.

"A single chip has to make a yes-no decision on some sound which can lie on a fuzzy continuum that runs through poetry and rap music.

"The brain can play that back and forth between different sites to realize that things which sound very similar really fall into different categories."

So Gerhard set out to discover precisely which elements people rely on in deciding when words are being sung and not spoken. That alone turned out to be a big job, enough to produce a successful Ph.D. thesis. A key stage in the research was gathering hundreds of examples of people using the same words in speech and song, then having scores of other people listen and note the differences they heard.

"What we're really interested in is the style of the utterance," Gerhard explains.

People identified the biggest distinction between identical words spoken or sung by the same person in areas such as vibrato, the tremulous effect singers produce through minute and rapid pitch changes, and in a feature that Gerhard calls "voicing."

"Some parts of speech have obvious pitch and some don't," he says, "but song has more because singers hold notes. That's voicing."

Gerhard then used these results to write complex instructions (known as algorithms) that tell a computer how to extract information about vibrato and other meaningful features of the sound waves that make up speech or song.

Each set of instructions becomes the computer's detection tool for that particular feature. In effect, the researchers are training the computer to do digitally what the human ear and brain do by innate perception.

The training seems to be working, but slowly. When Gerhard exposed the various computer feature detectors to his library of recorded speech-song pairs, some correctly distinguished four out of five, but other detectors fared no better than chance. He described the algorithms publicly for the first time Thursday at the acoustical society meeting.

Columbia University electrical engineering professor Daniel Ellis, who took part in the Texas session, says the current explosion in legal online music (as opposed to file-sharing that ignores copyright) illustrates one pressing application for this research.

"We're going to get more and more artists contributing their own material, without any of the expensive marketing that exists now," he says.

"The challenge is going to be finding something that you like in these huge databases."

Ellis suggests that people eventually could load into their computer examples of the songs and singers they enjoy. The still-unborn software would then distil those into a "My Music" model that it could use to look for similar music in the profusion of MP3 files available online.

That day may not be all that far off. Rudimentary software already exists, such as a German program called Query By Humming, a computerized version of the old game show Name That Tune.

In the Query system, the user hums a tune that's recorded by the computer and converted into a MIDI file for comparison with entries in the computer's database.

But programs like Query are flummoxed by words accompanying the hummed notes.

Back in Regina, Gerhard is preparing the next step in his quest for a song-savvy computer. He'll soon combine the various individual feature detectors into one unified computer model and check how well it distinguishes speech from song.

After that, who knows? Maybe computers could be trained to write and sing like Bono.
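
For readers curious what a "feature detector" might look like in practice, here is a minimal sketch in Python. It is not Gerhard's algorithm, which the article does not spell out. It assumes the sound has already been reduced to a pitch track: one pitch estimate in hertz for each short slice of audio, or None where no clear pitch was found. The function names, the frame rate, the thresholds (0.7 for voicing, 4 to 8 oscillations a second for vibrato) and the two synthetic examples are all illustrative assumptions.

```python
import math
from typing import Dict, List, Optional

FRAME_RATE = 100  # assumed: pitch estimates per second

# Pitch in Hz for each frame, or None where no clear pitch was found.
PitchTrack = List[Optional[float]]


def voicing_ratio(track: PitchTrack) -> float:
    """Fraction of frames that carry a measurable pitch."""
    if not track:
        return 0.0
    return sum(1 for p in track if p is not None) / len(track)


def vibrato_rate(track: PitchTrack, frame_rate: int = FRAME_RATE) -> float:
    """Crude oscillation rate (Hz) of the pitch around its mean.

    Counts sign changes of (pitch - mean) over the voiced frames only;
    two sign changes make one full oscillation. Unvoiced gaps are simply
    skipped, which a serious detector would handle more carefully.
    """
    voiced = [p for p in track if p is not None]
    if len(voiced) < 2:
        return 0.0
    mean = sum(voiced) / len(voiced)
    deviation = [p - mean for p in voiced]
    crossings = sum(1 for a, b in zip(deviation, deviation[1:]) if a * b < 0)
    duration = len(voiced) / frame_rate  # seconds of voiced sound
    return crossings / (2 * duration)


def detector_votes(track: PitchTrack) -> Dict[str, bool]:
    """One 'sounds sung' vote per detector; both thresholds are hypothetical."""
    return {
        "voicing": voicing_ratio(track) >= 0.7,
        "vibrato": 4.0 <= vibrato_rate(track) <= 8.0,
    }


# Synthetic one-second examples standing in for real recordings.
# "Sung": a held 220 Hz note wobbling six times a second.
sung: PitchTrack = [
    220.0 + 3.0 * math.sin(2 * math.pi * 6.0 * i / FRAME_RATE + 0.5)
    for i in range(FRAME_RATE)
]
# "Spoken": pitch sliding downward, with pitchless stretches in between.
spoken: PitchTrack = [
    None if (i // 10) % 2 else 180.0 - 0.6 * i for i in range(FRAME_RATE)
]

print(detector_votes(sung))
print(detector_votes(spoken))
```

On these made-up tracks, both detectors vote "sung" for the held, wobbling note and "spoken" for the sliding, interrupted one. Real voices are far messier, which may be part of why some of the detectors described in the article did no better than chance.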
To hear examples of David Gerhard's speech-song pairings, go to http://www.thestar.com/calamai and click on this article.
