Re: synthesizer versus voice


David Diamond
 

Funny, because some prefer Eloquence over RealSpeak in JAWS. The person who did the Australian voice for JAWS said she had a huge manuscript the size of a phone book to record. Also, the Texas version of U.S. English had slight variations. For me, the word "motor" sounded like "murder." It could have been my hearing disability, though.

-----Original Message-----
From: main@jfw.groups.io <main@jfw.groups.io> On Behalf Of JM Casey
Sent: September 20, 2020 8:20 PM
To: main@jfw.groups.io
Subject: Re: synthesizer versus voice

Cool writeup/analysis. I've no doubt we will get there, but I don't think we're
there yet -- I've heard a few top-of-the-line commercial voice synthesisers,
and to me they still haven't quite grasped the inflection and intonation of
the human voice. But they're getting eerily close. So... in time. And of course,
all our ears are different, too, and this "uncanny valley" aspect is probably
already nonexistent for some people.



-----Original Message-----
From: main@jfw.groups.io <main@jfw.groups.io> On Behalf Of Orlando
Enrique Fiol via groups.io
Sent: September 20, 2020 11:10 PM
To: main@jfw.groups.io
Subject: Re: synthesizer versus voice

At 09:00 PM 9/20/2020, Mark asked:
>what's the difference between a synthesizer and a voice?

A synthesizer uses electronic processes to fashion complex timbres from
acoustic or electronic sound sources. For example, a triangle wave may be
combined with clarinet samples to produce a "synthesized" clarinet.
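To make that concrete, here is a rough Python sketch of the idea. It is purely illustrative: "clarinet.wav" is a hypothetical mono recording, and the 50/50 mix ratio is an arbitrary choice.

import numpy as np
from scipy.io import wavfile

rate = 44100
t = np.arange(rate) / rate          # one second of sample times
freq = 440.0                        # A4

# Electronic source: a triangle wave, built purely from math
triangle = (2 / np.pi) * np.arcsin(np.sin(2 * np.pi * freq * t))

# Acoustic source: a recorded clarinet note (hypothetical mono file)
_, clarinet = wavfile.read("clarinet.wav")
clarinet = clarinet.astype(np.float64)
clarinet /= np.max(np.abs(clarinet))

# Blend the two sources into one hybrid, "synthesized" clarinet timbre
n = min(len(triangle), len(clarinet))
hybrid = 0.5 * triangle[:n] + 0.5 * clarinet[:n]
wavfile.write("hybrid_clarinet.wav", rate, (hybrid * 32767).astype(np.int16))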
However, I suspect your question pertains to our text-to-speech engines.
There, the distinction between speech synthesizer and voice operates on
two levels. On the surface, the synthesizer is the speech engine as a whole,
while individual voices (male, female, child and so on) are selected within it.
On a deeper level, though, the difference between synthesizer and voice
rests in the sources for phonemes used by a text-to-speech engine. With
purely synthesized speech, human speech is electronically modeled, just as
digital FM synthesizers such as the Yamaha DX7 attempted to create
acoustic-sounding timbres using electronic sources rather than actual
samples. There's a vital difference between trying to make an electronic
keyboard sound like a violin or banjo, and actually recording single notes on
violin or banjo in order to spread them out across the keyboard.
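If it helps, the "purely modeled" approach can be sketched in a few lines of Python: a toy two-operator FM tone in the spirit of the DX7, built entirely from math with no recordings. The frequencies and modulation index below are arbitrary illustrative values, not anything from an actual instrument patch.

import numpy as np
from scipy.io import wavfile

rate = 44100
t = np.arange(rate) / rate
carrier_hz, modulator_hz, index = 440.0, 880.0, 3.0

# The modulator bends the carrier's phase; that bending creates the extra
# harmonics, so the timbre comes from math rather than from a recording.
tone = np.sin(2 * np.pi * carrier_hz * t
              + index * np.sin(2 * np.pi * modulator_hz * t))

wavfile.write("fm_tone.wav", rate, (tone * 32767).astype(np.int16))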
The old-fashioned speech synthesizer uses no human speech samples, while
most text-to-speech engines today do indeed use exclusively human speech
samples. That's why today's voices sound more realistic and human; they're
fashioned from recordings of human beings speaking different words or
parts of words, from which the speech engine constructs its vocabulary
libraries.
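Here is a crude sketch of that concatenative idea, again in Python. The phoneme files and the word they spell are hypothetical placeholders; real engines also smooth the joins and adjust pitch and duration so the result is not choppy.

import numpy as np
from scipy.io import wavfile

RATE = 22050

def load_unit(path):
    # One prerecorded snippet of human speech (a phoneme or diphone),
    # assumed to be mono and recorded at the shared sample rate.
    rate, data = wavfile.read(path)
    assert rate == RATE
    return data.astype(np.float64) / 32768.0

# Hypothetical inventory of recorded units spelling the word "hello"
units = [load_unit(f) for f in ("hh.wav", "eh.wav", "l.wav", "ow.wav")]

# Naive concatenation: the "voice" is nothing but recordings glued together
word = np.concatenate(units)
wavfile.write("hello.wav", RATE, (word * 32767).astype(np.int16))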
As a side note, this human speech sampling and modeling technology is at the
point where one can theoretically build a speech engine from anyone's
voice, which has produced some unintended byproducts. It is now possible
to create convincing audio recordings of people allegedly saying things they
never actually said. This is done by sampling enough of their recorded speech
to formulate a lexicon not only of vocabulary but, more importantly, of their
vocal inflections: the rises, falls, breaths and pauses in their speech.
With this modeling technology, we soon will not know for certain whether
people have actually said what we've heard them say on audio recordings or
videos.
So, there you have it: a little primer on synthesis and sampled sound.


Orlando Enrique Fiol
Ph.D. in Music Theory, University of Pennsylvania, November 2018
Professional Pianist/Keyboardist, Percussionist and Pedagogue
Charlotte, North Carolina
