There are paid services that offer this (e.g. resemble.ai), and a few Colab notebooks that I haven't found very helpful, but I wanted to know whether anyone here has had any luck with free text-to-speech (TTS, T2S) tools. Thank you!
Hey! That ships with my voice! Well, a synthetic version of it, anyway. It used to be horribly robotic, but they've improved it quite a bit for Mimic 3. Worth a look.
I actually looked into this a month or so ago. What I was looking for was just reasonable-sounding, simple TTS driven from the CLI (so heavyweight things were out, as were most things that run a local server, though I think I looked at some).
I ended up going with pico-tts[0]. I remember looking at a few other things and left myself the following comment:
# checked out mimic as well. Didn't seem great, espeak is like nails on a chalkboard
# haven't checked out marytts or larynx or anything, but this is good enough™
Perhaps it's such an obvious answer that nobody has commented it, but depending on the use, you might try the Web Speech API's speech synthesis. For example, a Windows user might see a Cortana option whereas a Mac user might see Siri.
I have a ton of fun using the "say" program on macOS to write toy programs with my kids, and have often wanted a version that could run on my eldest's Manjaro laptop. Are any of the above analogously simple to use?
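For what it's worth, something analogous seems doable in a few lines with the pico-tts mentioned above. A minimal sketch, assuming the pico2wave and aplay binaries are installed (not a definitive recipe):

    #!/usr/bin/env python3
    """say.py - a rough macOS-`say`-alike for Linux, assuming the
    pico2wave (pico-tts) and aplay binaries are on the PATH."""
    import subprocess
    import sys
    import tempfile

    def say(text: str) -> None:
        # pico2wave only writes to a file, so synthesize to a temp wav,
        # then hand it to ALSA's aplay for playback.
        with tempfile.NamedTemporaryFile(suffix=".wav") as f:
            subprocess.run(["pico2wave", "-w", f.name, text], check=True)
            subprocess.run(["aplay", "-q", f.name], check=True)

    if __name__ == "__main__":
        say(" ".join(sys.argv[1:]) or "hello world")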
This doesn't answer the question, but I thought it might be relevant to mention here that I've been using ChatGPT + resemble.ai to create what I believe is the first kids' stories podcast created entirely by AI.
Here's how it works:
- Kid requests a story about a, b, c on www.makedupstories.com
- ChatGPT generates the text for a story, a summary, and a title
- we send this to resemble.ai (sounds like Tortoise TTS would work just as well), which has a clone of my voice; the pipeline is sketched below
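A minimal sketch of that pipeline; it assumes the OpenAI chat completions API for the story text, and the Resemble call is a hypothetical placeholder (their real SDK and endpoints differ, so consult their docs):

    import requests
    from openai import OpenAI  # assumes the official openai-python package

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def generate_story(request: str) -> str:
        # Ask the model for a story based on the kid's request.
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You write short, gentle kids' stories."},
                {"role": "user", "content": f"Write a story about {request}."},
            ],
        )
        return resp.choices[0].message.content

    def synthesize(text: str) -> bytes:
        # HYPOTHETICAL endpoint standing in for the Resemble voice-clone
        # API; the real clip-creation calls look different.
        resp = requests.post(
            "https://example.invalid/v2/clips",
            headers={"Authorization": "Token MY_RESEMBLE_KEY"},
            json={"voice": "my-cloned-voice", "body": text},
        )
        resp.raise_for_status()
        return resp.content

    story = generate_story("a dragon who is afraid of marshmallows")
    open("episode.wav", "wb").write(synthesize(story))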
In every HN thread there are tangential subthreads, including plugs.
Personally - despite being neither the target market nor that interested in an answer to the original question - I found the reply you are objecting to interesting and useful, and not at all the crass promo you are claiming.
I'm not exactly trying to grow my kids' stories podcast by plugging it to a developer audience. My comment is relevant because it's a cutting-edge use case of text-to-speech.
This is such a cynical, unhelpful response. Someone asked HN for recommendations for text-to-speech software. I provided a recommendation and explained my use-case. I wasn't trying to grow my audience by posting this comment. I was trying to explain how I combined two new AI technologies to create a new offering. Of course I could be wrong, but based on my decade-plus experience on Hacker News, I think that this is precisely the kind of project PG intended to be discussed here when he created Hacker News.
Well, to state the obvious, OP chose not to google it and to ask HN instead, which is why I provided a technology recommendation and an example of a solution I hacked together using the technology.
I don’t really get why you’re doubling down on your obvious self-promotion. You literally started with “This doesn’t answer the question…” but now you’re claiming that you did answer the question by providing a technology recommendation?
I'm defending myself vigorously because I truly wasn't trying to promote myself, despite what you claim. I took time out of my busy day to explain on this forum my use-case for text to speech, which combines two AI technologies and happens to involve kid's stories. I did not ask people to follow or subscribe to the podcast - that is you putting words in my mouth. I thought this audience would find this use-case interesting because this is Hacker News and I hacked together this solution. I also started my answer with "This doesn't answer the question" because it's not a literal list of TTS technologies; instead, it provides an interesting example of how I used one such technology.
How do you make sure that children don't get inappropriate content? I know ChatGPT is pretty good at filtering already, but to me it seems like a high-risk undertaking - a single lapse can sink your ship.
Is anyone else bothered by "maked up" in the URL? Granted, my grammar is slipping as I get older, but it doesn't sound like correct English to me. If incorrect, it feels a bit odd to be reinforcing something like that in the context of storytelling, which, I would hope, is partly about the grammar being used.
I'm assuming they were trying to imitate a child's manner of speech - i.e. children tend to lack the experience with language to know about a lot of special cases with the English past tense, leading to "eated" and "fighted", etc. The website name is just a play on words based on this.
This is correct. Hopefully whatever improvement children get in their imagination from listening to these stories will offset whatever minor regression in language skills a child may experience due to this semantic humor, which is indeed intended to follow the form of a typical child's mistake.
That's not true. There's joy in making kids happy. Here's actual feedback from kids:
"Hi richardfeynman and Merry Christmas!
Carissa enjoyed her story and decided after that she needs to submit another one
She couldn’t tell a difference. My husband and I could tell it was different, but still pretty impressive how the AI works! We will share with friends this week!"
"Thank you so much for the story of Anders and his goat Gizmo. We love Maked up Stories and listen to one every evening, so it really made Anders’s day to have his own story. We have shared it with family and friends."
Are you kidding? There's tons of joy. I have a backlog of hundreds of kids' story requests, and now instead of being able to satisfy one kid per day I can satisfy as many as I want. Moreover, kids can't tell the difference and love the stories.
They're too young to articulate the difference caused by a lack of emotion in the storytelling, but as kids they are still early enough in their developmental process that I imagine hearing massive amounts of spoken audio that lacks emotional depth will harm them. I'd be cautious.
On the one hand, there's clear empirical evidence from parents that this helps improve kids' imagination and storytelling ability, and on the other hand we have your pure conjecture that "spoken audio which lacks emotional depth" can harm kids. I'm not buying it, but even if it were true, it's pretty clear from the rate of improvement in voice cloning that soon there will be more emotional depth in this form of audio.
From your other comment with the parent's feedback -
> She couldn’t tell a difference. My husband and I could tell it was different [...]
I wonder if the child will eventually be able to tell the difference, when the machine-generated audio is a large fraction of what they hear in their early years. Or if they just learn to consider that 'normal', and maybe model their own speech patterns after it.
Also, anecdotes from parents (is that what you're referring to by "clear empirical evidence"?) are not evidence.
Current literacy theory holds that an education gap forms between preschool-age children who hear a lot of words and those who don't get to hear as many.
There exists an interpretation of that statement that makes it a tautology.
Edit: and/or makes it circular/symmetrical - "what is a kid?" becomes "someone who mislabels in a Turing test." A good AGI was required to be able to convince a panel - assuming an ideal panel, the usual simplification, like the perfect agents "employed" in economic modelling - and there now exist new means to conversely assess the panel.
I like your creativity. If I did not know it was AI, it would be hard for me to tell.
I'm not a kid and have no kids so it's hard for me to appreciate this type of storytelling. Have you been able to gather an audience that regularly tunes in?
If sitting your kids in front of an AI-generated content farm bears little difference to sitting your kids in front of a human-generated content farm, then yes, this is for you. It’s endless, zero-marginal-cost “entertainment” for your kids, an extra moment while you check your phone.
Depends on how you define "good". Espeak-ng, for example, works just fine as such. But the quality of the freely available voices is nothing close to the Siri / Google Assistant / Alexa / whatever standard. Understandable? Yes. Usable? Yes. But "good"? Mmmmm... YMMV.
I have no recommendations, but I'm curious whether someone has tried to train a TTS on data generated by one of the commercial services. Generating data would be very cheap, the labels would be perfect, and there would be less noise than in human-recorded datasets.
Since you would be learning from another AI and not from humans, there would be much less variation in the way words are pronounced, and you would have a lot more data.
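A sketch of the mechanics, saving (audio, transcript) pairs in the LJSpeech metadata.csv layout that most open trainers accept; commercial_tts is a hypothetical stand-in for whichever paid API you'd distill from:

    import csv
    from pathlib import Path

    def commercial_tts(text: str) -> bytes:
        # HYPOTHETICAL: call whatever paid TTS API you are distilling
        # from and return wav bytes. Not a real client.
        raise NotImplementedError

    sentences = Path("sentences.txt").read_text().splitlines()
    wav_dir = Path("dataset/wavs")
    wav_dir.mkdir(parents=True, exist_ok=True)

    with open("dataset/metadata.csv", "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        for i, sentence in enumerate(sentences):
            clip_id = f"gen_{i:06d}"
            (wav_dir / f"{clip_id}.wav").write_bytes(commercial_tts(sentence))
            # LJSpeech format: id|raw transcript|normalized transcript
            writer.writerow([clip_id, sentence, sentence])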
Founder of dubverse.ai here. As someone who has done production-level TTS (deployed on India's largest news network), I can say there is a lot of room for improvement in terms of intelligibility. Most of these open-source toolkits/models offer only a certain quality of TTS, which is IMO good to play around with but damn tough to make sound studio-quality.
I've been quite happy with Mimic3 lately (https://github.com/MycroftAI/mimic3), the engine that powers Mycroft. It also comes with an easy-to-install Docker image.
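Once the container is up, you can drive it over HTTP from anywhere. A minimal sketch, assuming the default port and the /api/tts endpoint as I remember them from the docs (double-check both):

    import requests

    # Assumes the Mimic 3 container is running, e.g.:
    #   docker run -it -p 59125:59125 mycroftai/mimic3
    resp = requests.post(
        "http://localhost:59125/api/tts",
        params={"voice": "en_US/vctk_low"},  # any installed voice key
        data="Hello from Mimic 3!",
    )
    resp.raise_for_status()
    open("out.wav", "wb").write(resp.content)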
Not sure if you’re looking to train your own model or just run inference on pretrained models, but if it’s the former, you can find ESPnet, TensorFlowTTS, and Coqui on GitHub.
The TTS in Google Translate is not exactly great, do they have something better?
Other than that, I've never used the product, but I've seen YouTube ads for Speechelo, which seems to be quite decent (and a bunch of YouTube ads for other things that quite obviously were using Speechelo - same voice).
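On the Google Translate point: for what it's worth, that voice is reachable programmatically through the unofficial gTTS package, which is handy for quick tests even if the quality is as described:

    from gtts import gTTS  # unofficial wrapper around the Google Translate voice

    tts = gTTS("The same voice you hear in Google Translate.", lang="en")
    tts.save("translate_voice.mp3")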
@yacineMTB (Twitter) used Tortoise to DIY his own podcast replicating Joe Rogan (script by ChatGPT), and the results are amazing; worth a quick listen to get the gist [1]
I wrote a script that (roughly sketched below):
- pulled @_akhaliq's last 7 days of tweets
- fished out the arxiv links
- downloaded raw paper .tex
- parsed out intros & conclusions
- automated a podcast dialogue about the papers w/ web automation & GPT
- generated a podcast
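For the curious, the scraping/parsing half looks roughly like this in Python; fetch_recent_tweets is a hypothetical stand-in for whatever Twitter client you'd use, and it glosses over edge cases (some arXiv e-prints are a single gzipped .tex file rather than a tarball):

    import io
    import re
    import tarfile

    import requests

    def fetch_recent_tweets(handle: str) -> list[str]:
        # HYPOTHETICAL: stand-in for whatever Twitter client/scraper you use.
        raise NotImplementedError

    ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})")

    def paper_sections(arxiv_id: str) -> str:
        # arxiv.org/e-print/<id> serves the raw LaTeX source as a tarball.
        src = requests.get(f"https://arxiv.org/e-print/{arxiv_id}").content
        with tarfile.open(fileobj=io.BytesIO(src)) as tar:
            tex = b"".join(
                tar.extractfile(m).read()
                for m in tar.getmembers()
                if m.isfile() and m.name.endswith(".tex")
            ).decode("utf-8", errors="ignore")
        # Crude grab of intro + conclusion by section headers.
        keep = re.findall(
            r"\\section\{(?:Introduction|Conclusion)s?\}(.+?)(?=\\section|\Z)",
            tex, re.S | re.I,
        )
        return "\n".join(keep)

    ids = {m for t in fetch_recent_tweets("_akhaliq") for m in ARXIV_RE.findall(t)}
    for arxiv_id in ids:
        excerpt = paper_sections(arxiv_id)
        # ...then prompt GPT for a two-host dialogue about `excerpt`
        # and feed each line to Tortoise with a different voice.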
As someone who works in generative modeling (vision), there's something here that sparks suspicion: they note that these are hand-picked results. Has anyone used this and can report the actual quality of results? I bring this up because anyone who has used Stable Diffusion or DALL-E will know why. Hand-picked results are good, but median results matter a lot too.
I'm the author of FakeYou.com and can speak to Tortoise and the TTS field.
Tortoise produces quality results with limited training data, but is an extremely slow model that is not suitable for real time use cases. You can't build an app with it. It's good for creatives making one-off deepfake YouTube videos, and that's about it.
You're looking for Tacotron 2 or one of its offshoots that add multi-speaker support, TorchMoji, etc. You'll want to pair it with the HiFi-GAN vocoder to get end-to-end text to speech. (Avoid Griffin-Lim and WaveGlow.)
Your pipeline looks like this at a high level:
Input text => Text pre-processing => Synthesizer => Vocoder => [ Optional transcoding ] => Output audio
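If you'd rather not wire those stages up by hand, Coqui TTS (mentioned elsewhere in this thread) packages that exact pipeline. A minimal sketch; the model name is from their released-models list and, as far as I know, pairs with a HiFi-GAN vocoder by default:

    from TTS.api import TTS  # pip install TTS (Coqui)

    # Tacotron 2 (DDC variant) acoustic model; Coqui loads the matching
    # default vocoder automatically.
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text="Text in, mel spectrogram, vocoder, wav out.",
                    file_path="out.wav")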
TalkNet is also popular when a secondary reference pitch signal is supplied. You can mimic singing and emotion pretty easily.
These three models are faster than real time, and there's a lot of information available and a big community built up around them. FakeYou's Discord has a bunch of people that can show you how to train these models, and there are other Discord communities that offer the same assistance.
If you want to train your own voice using your own collected sample data, you can experiment with it on Google Colab and on FakeYou, then reuse the same model file by hosting it in a cloud GPU instance. We can also do the hosting for you if that's not your desire or forte.
In any case, these models are solid choices for building consumer apps. As long as you have a GPU, you're good to go. If you're not interested in building or maintaining your own, you can use our API! I'd be happy to help.
Thanks for this, I actually appreciate the honesty. It is always difficult for me to parse the actual quality of things I don't have intimate experience with.
Can I ask another question? If I wanted to hack around with STT and TTS (inference only) on a pi (4B+) is there anything that is approximately appropriate and can be done on device? (I could process on my main machine but I'd love to do it on the pi even with a decent delay)
They provide support for running on a Raspberry Pi, and it runs in real time. I have tried the desktop version, and the quality is good enough when the audio is clean.
There are other ML TTS models that are both lightweight and can run on a CPU. Check out Glow-TTS for something that will probably work.
Also swap out the HiFi-GAN vocoder for MelGAN or MB-MelGAN, as these will better support your use case.
I ran this exact setup on cheap Digital Ocean droplets (without GPUs) and it ran faster than real time. It should work on a Pi.
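A sketch of that lighter combination, driving the Coqui CLI from Python; the model keys and flags are from memory, so verify them with `tts --list_models`:

    import subprocess

    # Glow-TTS acoustic model + multi-band MelGAN vocoder: both run
    # comfortably on CPU. Model keys from memory; check `tts --list_models`.
    subprocess.run(
        [
            "tts",
            "--text", "Lightweight CPU-friendly synthesis.",
            "--model_name", "tts_models/en/ljspeech/glow-tts",
            "--vocoder_name", "vocoder_models/en/ljspeech/multiband-melgan",
            "--out_path", "out.wav",
        ],
        check=True,
    )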
Unfortunately I'm not aware of STT models that operate under these same hardware constraints, but you should be good to go for TTS. With a little bit of poking around, I'm sure you can find a solution for STT too.
> Tortoise is a bit tongue in cheek: this model is insanely slow. It leverages both an autoregressive decoder and a diffusion decoder; both known for their low sampling rates. On a NVidia Tesla K80, expect to generate a medium sized sentence every 2 minutes.
I suspect that for a real(-ish) time TTS system, something else is needed. OTOH if you want to record some voice acting for a game or other multimedia product, it still may be more cost-effective than recording a bunch of live humans.
(K80 = NVIDIA Tesla K80, a GPU; $800-900 for a 24GB version right now.)
Would it still require a 3080 to run adequately, that is, with 1-2 seconds of delay? I've no idea what consumer-grade hardware works well for ML loads.
Kepler, Maxwell, Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper: counting microarchitectures, the K80 is about six generations old. It would be about a 10x improvement.
NLP is going to have this problem for a long time. Obviously most original research is done by Americans in English. There are really only valid training sets for languages that NLP researchers or engineers speak.
It also works in German and I'm relatively certain it's not translated outside the model itself. I've asked it to generate puns incorporating certain words and while the English results were subjectively somewhat better, the German ones were still "fine" and definitely wouldn't work in English.
ChatGPT is so crazy good it even works in fluent Thai. That's better than any machine translation I've tried so far. It even takes cultural differences into account: for example, when you ask it to translate "I love you" into Thai, it mentions that you would not normally say this in the same circumstances as you would to your lover in the West, correctly explaining in what circumstances people would really use it, and what to use instead. That's revolutionary for minority languages without a lot of learning material available online.
Also, I am a native Swiss German speaker. For those who don't know: Swiss German is a dialect continuum, very, very different from standard German, to the extent that most untrained German speakers don't understand us. There is no orthography (writing rules), no standardized grammar, etc.; writing it is mostly undocumented/unofficial. It's only spoken, and the varieties are vast. And guess what: I can write in a completely random, informal Swiss German dialect and ChatGPT understands everything, but answers in standard German.