Ask HN: Are there any good open source text-to-speech tools?
242 points by fumblebee on Jan 1, 2023 | 121 comments
There are paid services that offer this (e.g. resemble.ai), and a few colab notebooks that I haven't found very helpful, but I wanted to know whether anyone here has had any luck with free text to speech (tts, t2s) tools. Thank you!



I've been playing with mimic 3 from Mycroft lately. It's pretty usable out of the box and is self hostable. https://mycroft.ai/mimic-3/


Hey! That ships with my voice! Well, a synthetic version of it anyway. It used to be horribly robotic, but they've improved it quite a bit for mimic-3. Worth a look.

I wrote a blog post recently which talks a little about the origin of my voice in Mycroft, in case anyone is interested. https://popey.com/blog/2022/10/blog-to-speech-in-my-voice/


The bonus with mimic is that it runs reasonably well on constrained hardware (such as an RPi).


I actually looked into this a month or so ago. What I was looking for was just reasonable-sounding, simple TTS driven from the CLI (so heavyweight things were out, as were most things with a local server, though I think I looked at some).

I ended up going with pico-tts[0]. I remember looking at a few other things and left myself the following comment:

  # checked out mimic as well. Didn't seem great, espeak is like nails on a chalkboard
  # haven't checked out marytts or larynx or anything, but this is good enough™
[0] https://github.com/Iiridayn/pico-tts
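For anyone curious, here's a minimal sketch of driving pico2wave (the binary behind pico-tts, shipped in the libttspico-utils package on Debian-family distros) from Python. It assumes pico2wave and aplay are on your PATH and quietly skips playback if they aren't:

```python
# Sketch: build a pico2wave command, run it if the binary exists, play result.
# Assumes pico2wave (libttspico-utils) and aplay (alsa-utils) are installed.
import shutil
import subprocess
import tempfile

def speak(text: str, lang: str = "en-GB") -> list[str]:
    """Synthesize `text` to a temp WAV with pico2wave; returns the argv used."""
    wav = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    cmd = ["pico2wave", "-w", wav, "-l", lang, text]
    if shutil.which("pico2wave"):          # only run if the tool is present
        subprocess.run(cmd, check=True)
        if shutil.which("aplay"):
            subprocess.run(["aplay", wav], check=True)
    return cmd

cmd = speak("Hello from the command line")
```

The en-GB voice mentioned downthread is selected with the `-l` flag.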


The British lady voice it does (-l=en-GB option) is pretty decent.


I just had a look at that one, and I agree (I've switched to using that one now, thanks!).


I'm not sure about the licensing of all the models/etc, but Coqui AI's 'TTS' python package is fairly good.

https://github.com/coqui-ai/TTS
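A hedged sketch of what using it looks like from Python (the model name below is one example from Coqui's released catalogue; run `tts --list_models` to see what's actually available, and note the first call downloads the model):

```python
# Sketch of Coqui TTS's Python API; degrades gracefully if TTS isn't installed.
def synth(text, out_path="out.wav",
          model="tts_models/en/ljspeech/tacotron2-DDC"):
    try:
        from TTS.api import TTS  # pip install TTS
    except ImportError:
        return None              # package not installed; nothing to do
    tts = TTS(model_name=model)
    tts.tts_to_file(text=text, file_path=out_path)
    return out_path

result = synth("Coqui says hello")
```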


Perhaps it's such an obvious answer that nobody has commented it: depending on the use, you might try the Web Speech API's speech synthesis. For example, a Windows user might see a Cortana option, whereas a Mac user might see Siri.

Demo Here: https://mdn.github.io/dom-examples/web-speech-api/speak-easy...

Read more here https://github.com/mdn/dom-examples/tree/main/web-speech-api


I've had good luck with https://github.com/espeak-ng/espeak-ng (for very specific non-english purposes, and I was willing to wrangle IPA)


Mimic3 from the Mycroft project https://github.com/MycroftAI/mimic3


I have a ton of fun using the "say" program on MacOS to write toy programs with my kids, have often wanted a version that could run on my eldest's Manjaro laptop. Are any of the above analogously simple to use?


espeak (festival)




I've had good results with larynx: https://github.com/rhasspy/larynx


Yeah me too! Sadly unmet dependencies in Debian Sid; it doesn't work anymore :/


pico2wave with the -l=en-GB flag to get the British lady voice is not too bad for offline free TTS. You can hear it in this video: https://www.youtube.com/watch?v=tfcme7maygw&t=45s


I’m interested in the opposite: I want to transcribe meetings at work because my memory and note taking are inadequate.

I’m familiar with things like otter.ai but I am not risking my job by sharing data with something I don’t control.


OpenAI’s whisper[1] should do the job for you.

[1] - https://github.com/openai/whisper
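For the meeting-notes use case, a hedged sketch (pip install -U openai-whisper; it also needs ffmpeg installed). "base" is a speed/accuracy trade-off; the larger models transcribe better but slower. The timestamp-formatting helper is my own addition for turning segments into notes:

```python
# Sketch: transcribe an audio file with whisper and format timestamped notes.
def format_segments(segments):
    """Turn whisper's segment dicts ({'start', 'text', ...}) into note lines."""
    return ["[{:7.1f}s] {}".format(s["start"], s["text"].strip())
            for s in segments]

def transcribe(path, model_size="base"):
    try:
        import whisper  # pip install -U openai-whisper
    except ImportError:
        return []
    model = whisper.load_model(model_size)
    result = model.transcribe(path)      # runs locally; nothing leaves your box
    return format_segments(result["segments"])

notes = format_segments([{"start": 0.0, "text": " Hello everyone."}])
```

Since it runs entirely locally, it sidesteps the data-sharing concern with services like otter.ai.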



I have heard good things about Mozilla's TTS: https://github.com/mozilla/TTS


It’s dead unfortunately.


Fairly certain the team working on it spun off and made Coqui TTS.


does festival count as good these days? https://www.cstr.ed.ac.uk/projects/festival/

it's the only one I have any experience with


This doesn't answer the question, but I thought it might be relevant to mention here that I've been using ChatGPT + resemble.ai to create what I believe is the first kids' stories podcast created entirely by AI.

Here's how it works:

- Kid requests a story about a, b, c on www.makedupstories.com

- chatGPT generates the text for a story, summary, and title

- we send this to resemble.ai (sounds like Tortoise TTS would work just as well), which has a clone of my voice

- the audio file then gets sent to anchor.fm

you can listen to example episodes here on Spotify: https://open.spotify.com/show/6liL4T3kJf1scHq134s0mJ

And here on Apple podcasts: https://podcasts.apple.com/us/podcast/kidscast-kids-stories-...


This doesn’t answer the question, but let me plug my entirely unrelated product. Remember to like and subscribe and turn on notifications!


In every HN thread there are tangential subthreads, including plugs.

Personally, despite being neither the target market nor that interested in an answer to the original question, I found the reply you are objecting to interesting and useful, and not at all the crass promo you are claiming.


Sure, but IMO this is well over the line.

OP's project is really cool and would make for a great Show HN. This thread is just the wrong place for it.


I'm not exactly trying to grow my kids stories podcast by plugging it to a developer audience. My comment is relevant because it's a cutting edge use case of text-to-speech.


Three of your last four (top-level) comments on this site are talking about your podcast.


I’m not trying to plug my unrelated product, I’m just explaining how my unrelated product is cutting edge!


This is such a cynical, unhelpful response. Someone asked HN for recommendations for text-to-speech software. I provided a recommendation and explained my use-case. I wasn't trying to grow my audience by posting this comment. I was trying to explain how I combined two new AI technologies to create a new offering. Of course I could be wrong, but based on my decade-plus experience on Hacker News, I think that this is precisely the kind of project PG intended to be discussed here when he created Hacker News.


Imagine googling “open source text to speech software” and you get a link to Apple Podcasts for children’s stories. That would be wild!


Well, to state the obvious, OP chose not to google it and to ask HN instead, which is why I provided a technology recommendation and an example of a solution I hacked together using the technology.


I don’t really get why you’re doubling down on your obvious self-promotion. You literally started with “This doesn’t answer the question…” but now you’re claiming that you did answer the question by providing a technology recommendation?


I'm defending myself vigorously because I truly wasn't trying to promote myself, despite what you claim. I took time out of my busy day to explain on this forum my use-case for text to speech, which combines two AI technologies and happens to involve kid's stories. I did not ask people to follow or subscribe to the podcast - that is you putting words in my mouth. I thought this audience would find this use-case interesting because this is Hacker News and I hacked together this solution. I also started my answer with "This doesn't answer the question" because it's not a literal list of TTS technologies; instead, it provides an interesting example of how I used one such technology.


Thank you for taking the time out of your busy day to provide multiple links to your podcast as a public service.


It's still actually helpful for showing other tools and example use cases. You're the one who contributes nothing with your needless sarcasm.

Please abide by the HN rules and guidelines next time.


How do you make sure that children don't get inappropriate content? I know ChatGPT is pretty good at filtering already but to me it seems like a high risk undertaking - a single lapse can sink your ship.


Is anyone else bothered by "maked up" in the URL? Granted, my grammar is slipping as I get older, but it doesn't sound like correct English to me. If incorrect, it feels a bit odd to be reinforcing something like that in the context of storytelling, which, I would hope, is partly about the grammar being used.


I'm assuming they were trying to imitate a child's manner of speech - i.e. children tend to lack the experience with language to know about a lot of special cases with the English past tense, leading to "eated" and "fighted", etc. The website name is just a play on words based on this.


This is correct. Hopefully whatever improvement children get in their imagination by listening to these stories will offset whatever minor regression in language skills a child may experience due to this semantic humor, which indeed is intended to follow the form of a typical child's mistake.


> whatever improvement children get in their imagination

I would have been warier of exposing them to products of unintelligence.


There is no joy in this process.


The joy is where you rake in the money. At least, that's how content farming used to work.


That's not true. There's joy in making kids happy. Here's actual feedback from kids:

"Hi richardfeynman and Merry Christmas! Carissa enjoyed her story and decided after that she needs to submit another one She couldn’t tell a difference. My husband and I could tell it was different, but still pretty impressive how the AI works! We will share with friends this week!" "Thank you so much for the story of Anders and his goat Gizmo. We love Maked up Stories and listen to one every evening, so it really made Anders’s day to have his own story. We have shared it with family and friends."


Are you kidding? There's tons of joy. I have a backlog of hundreds of kids' story requests, and now instead of being able to satisfy one kid per day I can satisfy as many as I want. Moreover, kids can't tell the difference and love the stories.


They're too young to articulate the difference caused by a lack of emotion in the storytelling; but as kids, they are still early enough in their developmental process that I imagine hearing massive amounts of spoken audio that lacks emotional depth will harm them. I'd be cautious.


Another issue is that chatgpt has no humor. It can't tell jokes.


ChatGPT was great at assisting me with coding (I ask about PostgreSQL stuff). But I did notice it sucks at humor too.

There's a saying that the best jokes are somewhat offensive, if not very witty, and ChatGPT might be playing it safe lol.


On the one hand, there's clear empirical evidence from parents that this helps improve kids' imagination and storytelling ability, and on the other hand we have your pure conjecture that "spoken audio which lacks emotional depth" can harm kids. I'm not buying it, but even if it were true it's pretty clear by the rate of improvement in voice cloning that soon there will be more emotional depth in this form of audio.


From your other comment with the parent's feedback -

> She couldn’t tell a difference. My husband and I could tell it was different [...]

I wonder if the child will eventually be able to tell the difference, when the machine-generated audio is a large fraction of what they hear in their early years. Or if they just learn to consider that 'normal', and maybe model their own speech patterns after it.

Also, anecdotes from parents (is that what you're referring to by "clear empirical evidence"?) are not evidence.


Current literacy theory holds that an education gap forms between preschool-age children who hear a lot of words and those who don't get to hear as many.

https://www.greatschools.org/gk/articles/word-gap-speak-more...


ChatGPT passes the kid Turing test!


> passes the kid Turing test

There exists an interpretation of that statement that makes it a tautology.

Edit: and/or makes it circular/symmetrical - "What is a "kid"" (a Turing test mis-labeller). A good AGI was required to be able to convince a panel - assuming an ideal panel, the usual simplification "employing" perfect agents in economic modelling -, and there now exist new means to conversely assess the panel.


Now build all that into a teddy bear.

First, though, read "I Always Do what Teddy Says", by Harry Harrison.


Wow, that's actually exactly what I was thinking. I will check that out! Thanks so much, I wish I could buy you a beer.


Can you rethink your revenue model? Selling ads that'll be played to kids is pretty grim.


While I don't think it's grim to show ads to kids, particularly for relevant products, yes, I can and will rethink the revenue model.


Sponsorship maybe?

Have Tony the Tiger explaining the food pyramid or something.


That’s GRRRRR-REAT!


I like your creativity. If I did not know it was AI, it would be hard for me to tell.

I'm not a kid and have no kids so it's hard for me to appreciate this type of storytelling. Have you been able to gather an audience that regularly tunes in?


If sitting your kids in front of an AI generated content farm bears little difference to sitting your kids in front of a human generated content farm, then yes, this is for you. It’s endless, zero marginal “entertainment” for your kids, an extra moment while you check your


@wheelsatlarge the maked up stories podcast has 6 million downloads and this is growing quickly


Bonus points for models that work well offline on mobile devices.


Depends on how you define "good". Espeak-ng, for example, works just fine as such. But the quality of the freely available voices is nothing close to the Siri / Google Assistant / Alexa / whatever standard. Understandable? Yes. Usable? Yes. But "good"? Mmmmm... YMMV.


I have no recommendations, but I'm curious if someone has tried to train a TTS on the data made by one of the commercial services. Generating data would be very cheap, labels perfect, and there would be less noise than in the human datasets.


Why do that instead of using an existing dataset like Common Voice?


Since you would be learning from another AI and not humans, there would be much less variation in the way words are pronounced and you would have a lot more data.


What's the goal? What would be the benefit of training on a single TTS speaker?


To have your own AWS TTS, which you can run offline and for free.


We just released & open sourced this as a UI & API: https://tts.themetavoice.xyz/

It's free up to $30 of usage & at cost after that. It's exceptionally realistic, but can take a bit of time to synthesise as a result.


Papers-with-code would be the first place to look:

https://paperswithcode.com/task/text-to-speech-synthesis


Founder of dubverse.ai here. As someone who has done production-level TTS (deployed on India's largest news network), I can say there is a lot of room for improvement in terms of intelligibility. Most of these open-source toolkits/models offer only a certain quality of TTS, which is IMO good to play around with but damn tough to make sound studio-quality.


So what alternatives are there?



If your use case allows for a web API, I've had good experience running OpenTTS[0].

It packages several models, including Coqui AI's TTS which I tend to use the most. There's a handy Docker image, too.

[0] https://github.com/synesthesiam/opentts


There's... the web platform. No really, there's a SpeechSynthesis API:

https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynth...


It works, and sounds absolutely terrible on Firefox.


I've been quite happy with Mimic3 lately (https://github.com/MycroftAI/mimic3), the engine that powers Mycroft. It also comes with an easy-to-install Docker image.


Old, but may be of interest:

Speech synthesis in Python with pyttsx

https://jugad2.blogspot.com/2014/03/speech-synthesis-in-pyth...
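That post covers the old pyttsx; the maintained fork these days is pyttsx3 (pip install pyttsx3), which wraps the platform's native engines (SAPI5 on Windows, NSSpeechSynthesizer on macOS, espeak on Linux). A hedged sketch, with a small helper of my own for picking a voice by language tag:

```python
# Sketch: speak text with pyttsx3, plus a pure helper to choose a voice id.
def pick_voice(voices, lang_prefix="en"):
    """voices: iterable of (voice_id, languages) pairs; first matching id."""
    for vid, langs in voices:
        if any(l.startswith(lang_prefix) for l in langs):
            return vid
    return None

def speak(text):
    try:
        import pyttsx3  # pip install pyttsx3
    except ImportError:
        return
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

vid = pick_voice([("v1", ["de"]), ("v2", ["en-GB"])], "en")
```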


Yeah, I have used espeak, flite tts, RH voice, and a couple of others and they work very well.


Coqui is an open-source text-to-speech solution.

I haven't used it in a while, but I've seen a lot of new features listed over the last year or so.

Give it a try

https://github.com/coqui-ai/TTS


Funny how "everybody" is working on the same ChatGPT-adjacent projects right now (speech-to-text, API integration, TTS...).

Somehow it's nice to observe this trend pushing people to start working in other areas.


Not sure if you’re looking to train your own model or just run inference on pretrained models, but if it’s the former, you can find espnet, TensorflowTTS and coqui on GitHub.


I've had pretty good luck with Flowtron after watching an NVIDIA screencast on it. CPU-only inference performance isn't great though.


When I last compared (about a year ago) Google was the best of the commercial solutions. Is that still the case?


The TTS in Google Translate is not exactly great, do they have something better?

Other than that, I've never used the product, but I've seen Youtube ads for Speechelo, which seems to be quite decent (and a bunch of Youtube ads for other things that quite obviously were using Speechelo (same voice))


Coqui AI seems very good from the work I've done with it.


Given a URL, this service returns an audio file / stream (in WAV format) that reads out the main content of the webpage.

https://github.com/tslmy/tts


Are there any that use TensorflowLite?


The best is probably Tortoise, but you have to run it yourself: https://github.com/neonbjb/tortoise-tts

here are some demos https://nonint.com/static/tortoise_v2_examples.html


@yacineMTB (Twitter) used Tortoise to DIY his own podcast replicating Joe Rogan (scripted by ChatGPT), and the results are amazing; worth a quick listen to get the gist [1]

   I wrote a script that 
   - pulled @_akhaliq's last 7 days of tweets
   - fished out the arxiv links
   - downloaded raw paper .tex
   - parsed out intros & conclusions
   - automated a podcast dialogue about the papers w/ web automation & GPT
   - generated a podcast
[1] https://scribepod.substack.com/p/scribepod-1#details


As someone who works in generative modeling (vision), something here sparks my suspicion: they note that these are hand-picked results. Has anyone used this who can report the actual quality of results? I bring this up because anyone who has used Stable Diffusion or DALL-E will know why. Hand-picked results are good, but median results matter a lot too.


I'm the author of FakeYou.com and can speak to Tortoise and the TTS field.

Tortoise produces quality results with limited training data, but is an extremely slow model that is not suitable for real time use cases. You can't build an app with it. It's good for creatives making one-off deepfake YouTube videos, and that's about it.

You're looking for Tacotron 2 or one of its offshoots that add multi-speaker, TorchMoji, etc. You'll want to pair it with the Hifi-Gan vocoder to get end-to-end text to speech. (Avoid Griffin-Lim and WaveGlow.)

Your pipeline looks like this at a high level:

  Input text => Text pre-processing => Synthesizer => Vocoder => [ Optional transcoding ] => Output audio

TalkNet is also popular when a secondary reference pitch signal is supplied. You can mimic singing and emotion pretty easily.

These three models are faster than real time, and there's a lot of information available and a big community built up around them. FakeYou's Discord has a bunch of people that can show you how to train these models, and there are other Discord communities that offer the same assistance.

If you want to train your own voice using your own collected sample data, you can experiment with it on Google Colab and on FakeYou, then reuse the same model file by hosting it in a cloud GPU instance. We can also do the hosting for you if that's not your desire or forte.

In any case, these models are solid choices for building consumer apps. As long as you have a GPU, you're good to go. If you're not interested in building or maintaining your own, you can use our API! I'd be happy to help.
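The pipeline shape described above can be sketched roughly like this; the synthesizer and vocoder are placeholder callables standing in for Tacotron 2 and HiFi-GAN model calls, and the pre-processing step shows the typical front-end chores (abbreviation expansion, whitespace cleanup):

```python
# Hedged sketch of a text -> mel -> waveform TTS pipeline; the model calls
# are stand-ins, only the text pre-processing is concrete.
import re

def preprocess(text: str) -> str:
    """Collapse whitespace, expand a couple of abbreviations, and make sure
    the utterance ends with terminal punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    text = text.replace("Dr.", "Doctor").replace("Mr.", "Mister")
    if text and text[-1] not in ".!?":
        text += "."
    return text

def tts_pipeline(text, synthesizer, vocoder):
    mel = synthesizer(preprocess(text))  # e.g. Tacotron 2: text -> mel spectrogram
    return vocoder(mel)                  # e.g. HiFi-GAN: mel -> waveform samples

clean = preprocess("  Dr. Smith   says hello ")
wave = tts_pipeline("hi there", synthesizer=lambda t: t.upper(),
                    vocoder=lambda m: m)
```

Swapping in TalkNet would mean the synthesizer also takes a reference pitch signal alongside the text.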


> "Tortoise produces quality results with limited training data, but is an extremely slow model that is not suitable for real time use cases"

What would you run if you had a large set of training data (and time and money) but your focus was on quality? Still Tortoise?



Thanks for this, I actually appreciate the honesty. It is always difficult for me to parse the actual quality of things I don't have intimate experience with.

Can I ask another question? If I wanted to hack around with STT and TTS (inference only) on a pi (4B+) is there anything that is approximately appropriate and can be done on device? (I could process on my main machine but I'd love to do it on the pi even with a decent delay)


For STT, take a look at Wenet: https://github.com/wenet-e2e/wenet

They provide support for running on a Raspberry Pi, and it runs in real time. I have tried the desktop version, and the quality is good enough when the audio is clean.


No problem!

There are other ML TTS models that are both lightweight and can run on a CPU. Check out Glow-TTS for something that will probably work.

Also swap out the HifiGan vocoder for Melgan or MB-Melgan as these will also better support your use case.

I ran this exact setup on cheap Digital Ocean droplets (without GPUs) and it ran faster than real time. It should work on a Pi.

Unfortunately I'm not aware of STT models that operate under these same hardware constraints, but you should be good to go for TTS. With a little bit of poking around, I'm sure you can find a solution for STT too.


From the link:

> Tortoise is a bit tongue in cheek: this model is insanely slow. It leverages both an autoregressive decoder and a diffusion decoder; both known for their low sampling rates. On a NVidia Tesla K80, expect to generate a medium sized sentence every 2 minutes.

I suspect that for a real(-ish) time TTS system, something else is needed. OTOH if you want to record some voice acting for a game or other multimedia product, it still may be more cost-effective than recording a bunch of live humans.

(K80 = NVIDIA Tesla K80, a GPU; $800-900 for a 24GB version right now.)


I see 24GB Tesla K80s on ebay for $90...what am I missing?


A K80 is extremely old by now, so I'd expect a modern card to be maybe an order of magnitude faster.


Would it still require a 3080 to run adequately, that is, with 1-2 seconds of delay? I've no idea what consumer-grade hardware works well for ML loads.


I haven't tried it, but the K80 is about 6 years old / 5 generations back. There have been massive leaps since then.


Six years old is nowadays more like 3 generations, and it's definitely not an order of magnitude (10x) of difference.


Kepler, Maxwell, Pascal, Volta, Turing, Ampere, Lovelace, Hopper. It's 6 generations old when you include the microarchitectures. It would be about a 10x improvement.


Oh, if it's Kepler, absolutely. I thought 6 years, thus Ampere.


Does anybody run Tortoise on cloud serverless GPUs? If yes, can you please recommend a setup?


English-only, from a cursory glance.


NLP is going to have this problem for a long time. Obviously most original research is done by Americans in English. There are really only valid training sets for languages that NLP researchers or engineers speak.


Chinese is well-represented among ML researchers.


Because 14% of the world's population speaks Mandarin Chinese. But what about Yoruba, Burmese or even Hakka Chinese?


Speech will have this problem, but text-based NLP can be translated, and we have pretty good translators.


ChatGPT works in Russian, for example; I don't know about other languages.


I suspect it might be translated


It also works in German and I'm relatively certain it's not translated outside the model itself. I've asked it to generate puns incorporating certain words and while the English results were subjectively somewhat better, the German ones were still "fine" and definitely wouldn't work in English.


ChatGPT is so crazy it even works in fluent Thai. That's better than any machine translation I've ever tried so far. It even takes cultural differences into account. For example when you ask it to translate "I love you" into Thai, it mentions, that normally you would not say this in the same circumstances as you would say it to your lover in the West, correctly explaining in what circumstances people would really use it, and what to use instead. That's revolutionary for minority languages without a lot of learning material available online.

Also, I am a native Swiss German speaker. For those who don't know: Swiss German is a dialect continuum, very, very different from standard German, to the extent that most untrained German speakers don't understand us. There is no orthography (writing rules), no grammar rules, etc.; it's a mostly undocumented/unofficial writing system, only spoken, and the varieties are vast. And guess what: I can write in a completely random, informal Swiss German dialect and ChatGPT understands everything, but answers in standard German.


Unless it was trolling me, I saw evidence it was trained on Russian texts; how else could it do convincing style transfer from Russian poets, for example?

But as always, only successful prompts are shared, so I don't know how hit-or-miss it is.


Ooh, that is a nice one!


Examples 4 and 5 sound like George Clooney for some reason.



