According to analysts at Emergen Research, the global market for text-to-speech technology will exceed $7 billion by 2028. Let’s look at how speech synthesis works and why it’s more convenient to deploy it in the cloud.
What Is Speech Synthesis, and What Is It Used For?
Automatic speech synthesis is the machine voicing of text: an application receives text in a supported language as input and reads it aloud in an announcer’s voice.
This technology has several applications, for example:
- adapting interfaces and websites for people with impaired vision: speech synthesis reads interface elements aloud;
- voicing critical application functions, such as commands in a navigation app;
- converting text scripts for automated robocalls;
- voicing text exercises and lectures in online education.
Synthesis often works together with speech recognition. For example, voice assistants such as Siri, Cortana, and Alexa combine automatic analysis and synthesis of spoken speech: they convert the speech stream into text, extract the request, and then read the answer aloud.
How A Speech Synthesizer Works
Let’s look at how speech synthesis engines are classified. The main classical approach is concatenative speech synthesis.
Concatenative method: it’s older and more straightforward. Its essence is gluing together a finished phrase from small pieces recorded in advance by a human announcer. Such a speech synthesizer parses the input text into minimal blocks, looks up the corresponding recorded fragments, and assembles them sequentially into a whole phrase.
The main advantage of this method for the end-user is the speed of speech generation. The robot translates text into audio format almost instantly, with minimal delay.
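The gluing step described above can be sketched in a few lines. This is a toy illustration, not a real TTS engine: the unit inventory, the one-word lexicon, and the sample values are all invented for the example, and real systems store waveforms rather than short lists of numbers.

```python
# Toy concatenative synthesizer: look up prerecorded unit "waveforms"
# and glue them end to end. All data here is illustrative.

# Hypothetical inventory: unit name -> prerecorded audio samples.
UNIT_INVENTORY = {
    "HH": [0.1, 0.2],
    "EH": [0.3, 0.4],
    "L":  [0.2, 0.1],
    "OW": [0.5, 0.4],
}

def text_to_units(text):
    """Toy front end: map each word to a unit sequence.
    A real engine would use a pronunciation lexicon or a G2P model."""
    lexicon = {"hello": ["HH", "EH", "L", "OW"]}  # illustrative entry
    units = []
    for word in text.lower().split():
        units.extend(lexicon.get(word, []))
    return units

def synthesize(text):
    """Concatenate the recorded fragments into one sample stream."""
    samples = []
    for unit in text_to_units(text):
        # A missing unit raises KeyError: exactly the weakness noted
        # below for sounds absent from the initial recording set.
        samples.extend(UNIT_INVENTORY[unit])
    return samples

audio = synthesize("hello")
```

Because synthesis is just lookup and concatenation, generation is nearly instant, which is the speed advantage mentioned above.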
The main disadvantage of such a speech synthesis system is an unpleasant, lifeless voice. Natural speech, as a rule, carries intonation, which arises from smooth changes in pitch within a sentence, from speeding up and slowing down the speech tempo, and from several other parameters.
To know what intonation to use when pronouncing a sentence, you first need to parse its meaning correctly. The concatenative engine is poor at this because it simply breaks the text into fragments. Algorithms try to adjust the pitch to produce, say, the rising intonation of an interrogative sentence, but that is usually their limit. As a result, users often dislike the voice produced by such an electronic simulator.
Another disadvantage of the concatenative engine is that it requires a massive initial set of recordings. Moreover, if this set lacks the needed fragment, the missing sound simply cannot be synthesized. This is especially painful with tonal languages like Chinese, where there can be hundreds of thousands of slightly different sounds. But even in Russian, some sound combinations come out unnatural, which can spoil the voice-over.