Re: synthesizer versus voice


JM Casey
 

Cool writeup/analysis. I've no doubt we will get there, but I don't think
we're there yet -- I've heard a few top-of-the-line commercial voice
synthesisers and to me they still haven't quite grasped the inflection and
intonations of the human voice. But they're getting eerily close. So... in
time. And of course, all our ears are different, too, and this "uncanny
valley" aspect is probably already nonexistent for some people.

-----Original Message-----
From: main@jfw.groups.io <main@jfw.groups.io> On Behalf Of Orlando Enrique
Fiol via groups.io
Sent: September 20, 2020 11:10 PM
To: main@jfw.groups.io
Subject: Re: synthesizer versus voice

At 09:00 PM 9/20/2020, Mark asked:
>what's the difference between a synthesizer and a voice?

A synthesizer uses electronic processes to fashion complex timbres from
acoustic or electronic sound sources. For example, a triangle wave may be
combined with clarinet samples to produce a "synthesized" clarinet.
However, I suspect your question pertains to our text-to-speech engines.
There, the distinction between speech synthesizer and voice operates on two
levels. The synthesizer is the speech engine as a whole, while individual
voices (such as male, female, child, etc.) can be chosen.
On a deeper level, though, the difference between synthesizer and voice
rests in the sources for phonemes used by a text-to-speech engine. With
purely synthesized speech, human speech is electronically modeled, just as
digital FM synthesizers such as the Yamaha DX7 attempted to create
acoustic-sounding timbres using electronic sources rather than actual
samples. There's a vital difference between trying to make an electronic
keyboard sound like a violin or banjo, and actually recording single notes
on violin or banjo in order to spread them out across the keyboard.
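
To make that difference concrete, here is a rough Python sketch, not how any particular keyboard is built. The function names fm_tone and repitch, the 8000 Hz rate and all parameter values are made up for illustration. The first function builds a tone purely from math, with one sine wave modulating another (the DX7 used six such operators; this uses two). The second takes an already recorded note and reads through it faster or slower to move it to another key.

```python
import math

def fm_tone(carrier_hz, modulator_hz, index, seconds, rate=8000):
    """Two-operator FM: a sine carrier whose phase is modulated by a second sine."""
    n = int(seconds * rate)
    return [
        math.sin(2 * math.pi * carrier_hz * t / rate
                 + index * math.sin(2 * math.pi * modulator_hz * t / rate))
        for t in range(n)
    ]

def repitch(sample, semitones):
    """Sampler-style playback: step through a recorded note faster or slower."""
    step = 2 ** (semitones / 12)  # equal-tempered frequency ratio
    out = []
    pos = 0.0
    while pos < len(sample) - 1:
        i = int(pos)
        frac = pos - i
        # Linear interpolation between neighbouring samples.
        out.append(sample[i] * (1 - frac) + sample[i + 1] * frac)
        pos += step
    return out

# Purely electronic source: nothing was recorded to make this tone.
tone = fm_tone(440, 440, 2.0, 0.5)

# Sampling approach: treat `tone` as if it were a recording of one note
# and spread it up an octave (12 semitones). The result is half as long.
octave_up = repitch(tone, 12)
```

Note that repitching also shortens or lengthens the note, which is one reason real samplers record many notes rather than stretching a single one across the whole keyboard.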
The old-fashioned speech synthesizer uses no human speech samples, while
most text-to-speech engines today do indeed use exclusively human speech
samples. That's why today's voices sound more realistic and human; they're
fashioned from recordings of human beings speaking different words or parts
of words, from which the speech engine constructs its vocabulary libraries.
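
As a toy illustration of building speech from recorded pieces, the sketch below strings stored units together with a short crossfade at each join. Everything in it is an assumption for the sake of the example: the unit names, the UNITS table and the concatenate function are hypothetical, and the "recordings" are placeholder sine tones so the code runs without audio files. A real engine would load human speech fragments and choose among many candidates per sound.

```python
import math

RATE = 8000  # samples per second (assumed for this sketch)

def placeholder_unit(freq_hz, n_samples):
    """Stand-in for a recorded speech fragment; a real engine would load audio."""
    return [math.sin(2 * math.pi * freq_hz * t / RATE) for t in range(n_samples)]

# Hypothetical unit inventory: one stored fragment per sound.
UNITS = {
    "HH": placeholder_unit(300, 400),
    "EH": placeholder_unit(500, 800),
    "L": placeholder_unit(350, 480),
    "OW": placeholder_unit(450, 960),
}

def concatenate(unit_names, inventory, fade=40):
    """Join stored units end to end, crossfading `fade` samples at each seam."""
    out = []
    for name in unit_names:
        unit = inventory[name]  # raises KeyError if the sound was never recorded
        overlap = min(fade, len(out), len(unit))
        for k in range(overlap):
            w = (k + 1) / (overlap + 1)  # weight ramps toward the incoming unit
            out[-overlap + k] = out[-overlap + k] * (1 - w) + unit[k] * w
        out.extend(unit[overlap:])
    return out

word = concatenate(["HH", "EH", "L", "OW"], UNITS)
```

The crossfade is there because butting two recordings together usually leaves an audible click at the seam.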
As a sidenote, this human speech sampling and modeling technology is at the
point where one can theoretically make a speech engine from anyone's voice,
which has produced some unintended byproducts. It is now possible to create
convincing audio recordings of people allegedly saying things they never
actually said. This is done by sampling enough of their recorded speech to
formulate a lexicon not only of vocabulary, but more important, of their
vocal inflections, the rises, falls, breaths and pauses in their speech.
With this modeling technology, we soon will not know for certain whether
people have actually said what we've heard them say on audio recordings or
videos.
So, there you have it: a little primer on synthesis and sampled sound.


Orlando Enrique Fiol
Ph.D. in Music theory
University of Pennsylvania: November, 2018
Professional Pianist/Keyboardist, Percussionist and Pedagogue
Charlotte, North Carolina
