Re: synthesizer versus voice


Orlando Enrique Fiol
 

At 09:00 PM 9/20/2020, Mark asked:
what's the difference between a synthesizer and a voice?
A synthesizer uses electronic processes to fashion complex timbres from acoustic or electronic sound sources. For example, a triangle wave may be combined with clarinet samples to produce a "synthesized" clarinet.
However, I suspect your question pertains to our text-to-speech engines. There, the distinction between speech synthesizer and voice operates on two levels. The synthesizer is the speech engine as a whole, while individual voices (such as male, female, child, etc.) can be chosen.
On a deeper level, though, the difference between synthesizer and voice rests in the sources for phonemes used by a text-to-speech engine. With purely synthesized speech, human speech is electronically modeled, just as digital FM synthesizers such as the Yamaha DX7 attempted to create acoustic-sounding timbres using electronic sources rather than actual samples. There's a vital difference between trying to make an electronic keyboard sound like a violin or banjo, and actually recording single notes on violin or banjo in order to spread them out across the keyboard.
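For anyone who wants to hear that difference in code, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the function names are hypothetical, the FM tone uses two operators (a real DX7 has six), and the "recording" is just a list of numbers standing in for a sampled note.

```python
import math

SAMPLE_RATE = 44100  # samples per second (an assumed, common default)

def fm_tone(carrier_hz, modulator_hz, index, seconds, sample_rate=SAMPLE_RATE):
    """Purely electronic source: a sine carrier whose phase is
    modulated by a second sine. No recording is involved."""
    n = int(seconds * sample_rate)
    out = []
    for i in range(n):
        t = i / sample_rate
        mod = index * math.sin(2 * math.pi * modulator_hz * t)
        out.append(math.sin(2 * math.pi * carrier_hz * t + mod))
    return out

def spread_sample(recording, semitones):
    """Sampling approach: replay one recorded note at another pitch
    by stepping through it faster or slower (nearest-sample lookup)."""
    ratio = 2 ** (semitones / 12)  # equal-tempered pitch ratio
    length = int(len(recording) / ratio)
    return [recording[int(i * ratio)] for i in range(length)]

# Hypothetical usage: one synthesized tone, one "recorded" note moved up an octave.
tone = fm_tone(440, 440, 2.0, seconds=1, sample_rate=1000)
octave_up = spread_sample(list(range(100)), 12)
```

The first function invents its sound from arithmetic alone; the second can only reshape whatever was recorded, which is why sampled instruments sound more like the real thing.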
The old-fashioned speech synthesizer uses no human speech samples, while most text-to-speech engines today are built from recordings of human speech. That's why today's voices sound more realistic and human; they're fashioned from recordings of human beings speaking different words or parts of words, from which the speech engine constructs its vocabulary libraries.
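The idea of building speech from recorded parts of words can be sketched roughly as follows. This is not how any particular engine is implemented; the unit labels, the `inventory` dictionary and both function names are hypothetical, and the "recordings" are short lists of numbers rather than real audio.

```python
def crossfade_join(units, overlap):
    """Concatenate recorded units, linearly crossfading up to
    `overlap` samples at each joint to soften the seams."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out

def speak(unit_labels, inventory, overlap=2):
    """Look up each unit's recording and join them in order."""
    return crossfade_join([inventory[label] for label in unit_labels], overlap)

# Hypothetical inventory: two stand-in "recordings" of four samples each.
inventory = {"HH": [1.0, 1.0, 1.0, 1.0], "AY": [0.0, 0.0, 0.0, 0.0]}
audio = speak(["HH", "AY"], inventory)
```

A real engine chooses among many recordings of each unit and adjusts pitch and timing, but the principle is the same: look up recorded pieces and splice them together.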
As a side note, this human speech sampling and modeling technology is at the point where one can, in principle, make a speech engine from anyone's voice, which has produced some unintended byproducts. It is now possible to create convincing audio recordings of people allegedly saying things they never actually said. This is done by sampling enough of their recorded speech to formulate a lexicon not only of vocabulary but, more importantly, of their vocal inflections: the rises, falls, breaths and pauses in their speech.
With this modeling technology, we soon will not know for certain whether people have actually said what we've heard them say on audio recordings or videos.
So, there you have it: a little primer on synthesis and sampled sound.


Orlando Enrique Fiol
Ph.D. in Music theory
University of Pennsylvania: November, 2018
Professional Pianist/Keyboardist, Percussionist and Pedagogue
Charlotte, North Carolina
