AI Takes a Leap: AI Model Learns to Speak Naturally, Even for Complex Sentences!

A state-of-the-art text-to-speech (TTS) model developed by Amazon researchers can speak even complex sentences naturally thanks to its "emergent abilities." This advancement may hold the key to compelling, lifelike AI voices that finally move past the "uncanny valley."

The model, aptly named BASE TTS (Big Adaptive Streamable TTS with Emergent abilities), is enormous compared with its predecessors. It has 980 million parameters and was trained on 100,000 hours of speech, including English, German, Dutch, and Spanish. The researchers think this scale is what unlocked its remarkable performance.

But what precisely are these "emergent abilities"? Think of the model handling compound nouns with ease ("The Beckhams' countryside cottage"), expressing enthusiasm ("Oh my gosh! the Maldives?!" read with delight), or coping with foreign words and punctuation ("Mr. Henry's pièce de résistance"). On these challenges, BASE TTS outperformed other models, such as Tortoise and VALL-E.

What makes this intriguing is that more than reading text aloud is involved. BASE TTS captures the subtleties of human speech, such as intonation, stress, and even whispers. Imagine learning resources that speak to students in a way they can understand, or audiobooks narrated with real emotion. The possibilities are countless!

BASE TTS isn't only about expressive speech, though. It is also "streamable," meaning that it generates speech incrementally rather than all at once, so playback can begin with little delay. This makes real-time applications such as interactive games and voice assistants possible. The model also produces a separate stream for speech metadata (such as emotion), which makes the output easier to control and customize.

The research has its limitations. The model is still experimental, and because of concerns about possible misuse, its source code has not been made public. Nonetheless, there is no denying the potential advantages, particularly for accessibility. Imagine language learners getting individualized pronunciation feedback, or visually impaired people enjoying expressive audiobooks.

Although the future remains uncertain, one thing is clear: BASE TTS represents a substantial advance in text-to-speech technology. It is evidence of the capability of large language models and their potential to completely change the way humans interact with computers and data.

By Omal J

Omal J has worked for both print and electronic media as a feature journalist. Writing, traveling, and DIY sum up her life.
