Free and Open Source Software

Current SOTA in local FOSS speech to text? (lemmy.ml)

submitted 1 day ago by solrize@lemmy.ml to c/foss@beehaw.org

Are there currently usable FOSS tools for speech-to-text conversion (transcription) under GNU/Linux? The purpose is transcribing things like downloaded podcasts. I don't need or want any kind of GUI tool, just a CLI program that takes an audio file and converts it to text. I know there are various proprietary systems that do this, such as YouTube transcription. One of my questions is whether the free stuff that's out there is anywhere near as good. I'm not too concerned about the input format (I can convert with ffmpeg), or about CPU time within reason (I don't mind letting my server spend all night crunching a 1-hour audio file). I'd prefer not to require a GPU, but if that helps a lot, I can get hold of one as needed.

The question is about speech-to-text (STT). I'm not asking about the opposite, text-to-speech (TTS). For some reason people often confuse the two.

Thanks!

[–] chicken@lemmy.dbzer0.com 5 points 19 hours ago (1 children)

Whisper works pretty well.
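
For readers who want to see what file-in, text-out use looks like, here is a minimal sketch. It assumes the Whisper meant here is the `openai-whisper` Python package with ffmpeg available on the PATH. The `transcript_path` helper and its naming convention are hypothetical, added only for illustration.

```python
from pathlib import Path


def transcribe_file(audio_path: str, model_name: str = "base") -> str:
    """Transcribe one audio file on the CPU and return the text."""
    # Imported lazily so the path helper below works without the package.
    import whisper  # assumption: installed via `pip install openai-whisper`

    model = whisper.load_model(model_name, device="cpu")
    # fp16=False requests full precision, which is what CPU inference uses.
    result = model.transcribe(audio_path, fp16=False)
    return result["text"].strip()


def transcript_path(audio_path: str) -> Path:
    """Hypothetical naming convention: podcast.mp3 -> podcast.txt."""
    return Path(audio_path).with_suffix(".txt")


if __name__ == "__main__":
    import sys

    # Each command-line argument is treated as one audio file.
    for name in sys.argv[1:]:
        text = transcribe_file(name)
        transcript_path(name).write_text(text + "\n", encoding="utf-8")
```

Saved as, say, `stt.py`, it would be run as `python stt.py episode.mp3`. The first run downloads the model weights, so it needs network access once.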

[–] solrize@lemmy.ml 1 points 16 hours ago (2 children)

Thanks, I have heard of it and the configuration does look manageable. Do the more accurate models run at tolerable speed without a GPU? Or even with one, for that matter? I'd hope to get transcriptions as good as YouTube's (they have lots of errors but they're fairly usable). Is that realistic?

[–] TehPers@beehaw.org 3 points 9 hours ago (1 children)

The README lists the VRAM requirements for their different models if you plan to run with a GPU. Without a GPU, you can translate those roughly to system RAM.

Note that ML models pretty much always run faster on a GPU due to the kinds of operations needed to execute them. If you have the option to run on a GPU, you probably should just do that. Even their largest model only requires ~10GB VRAM based on their table, and if you only need English, you can use a smaller one specialized for English (like medium.en).
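
The sizing advice above can be sketched as a small lookup. The memory figures are the approximate VRAM numbers from the Whisper README table as of this writing, so check the current README before relying on them. Treating them as system RAM for CPU runs is only the rough translation suggested above. `pick_model` is a hypothetical helper, not part of Whisper.

```python
# Approximate memory needs in GB per model size (README figures, may change).
APPROX_MEMORY_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(available_gb: float, english_only: bool = False) -> str:
    """Return the largest model whose approximate need fits in available_gb."""
    best = None
    for name, needed_gb in APPROX_MEMORY_GB:
        if needed_gb <= available_gb:
            best = name  # the list is ordered small to large, so keep the last fit
    if best is None:
        raise ValueError("not enough memory for even the smallest model")
    # English-only variants exist for every size except large.
    if english_only and best != "large":
        best += ".en"
    return best


if __name__ == "__main__":
    print(pick_model(8, english_only=True))
```

With 8 GB free and English-only audio, this picks `medium.en`, the model named in the comment above.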

[–] solrize@lemmy.ml 1 points 9 hours ago (1 children)

Thanks, I don't have a GPU, so if the program required one, I'd rent one by the hour (vast.ai has lots of affordable rentals). But it's easier if I can run CPU-only. If I were doing this a lot (I don't expect to), I could see getting a Ryzen APU server, if the GPU in those is supported.

[–] TehPers@beehaw.org 1 points 8 hours ago (1 children)

If you have an integrated GPU in your processor (I'm assuming you do unless you have no graphical output at all), you can also try to run one of their really small models on it. Otherwise, their smaller models are also faster, so I'd recommend trying those on your CPU to start with.

[–] solrize@lemmy.ml 1 points 4 hours ago

Thanks, I'm more interested in quality than speed, but I will try the smaller models to see if the results are usable. My CPU has Intel HD Graphics 4000 (this is in an i7-3770, which is way old by now) and I kind of doubt PyTorch supports that, though it's possible. I can imagine upgrading to a newer processor but probably not a real GPU. Actually it looks like everything is way more expensive than it used to be, no big surprise.

[–] skarn@discuss.tchncs.de 3 points 13 hours ago (1 children)

On my potato-powered laptop (a mid-range ThinkPad from 2018) it does not run in real time on the CPU, particularly if you want to use a decent model, which I need for my foreign accent.

I would say that quality generally exceeds YouTube, even with the worst model.

[–] solrize@lemmy.ml 1 points 9 hours ago

Thanks. My old i5-something server is probably in the same speed range as your laptop. It's good to hear about the transcription quality. If conversion is slower than real time, I can live with it. I can just throw a bunch of files at it and let it run overnight. Faster is always nicer of course.
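
The overnight batch idea could look roughly like the sketch below. It assumes the `whisper` command-line tool from the `openai-whisper` package is on the PATH and that it writes `<audio name>.txt` into the output directory. The directory layout, the suffix list, and the helper names are all hypothetical.

```python
import subprocess
from pathlib import Path

# Hypothetical list of extensions to treat as audio; extend as needed.
AUDIO_SUFFIXES = {".mp3", ".m4a", ".ogg", ".opus", ".wav", ".flac"}


def pending_files(audio_dir: Path, out_dir: Path) -> list[Path]:
    """Audio files in audio_dir that have no transcript in out_dir yet."""
    return sorted(
        p
        for p in audio_dir.iterdir()
        if p.is_file()
        and p.suffix.lower() in AUDIO_SUFFIXES
        and not (out_dir / (p.stem + ".txt")).exists()
    )


def whisper_command(audio: Path, out_dir: Path, model: str = "small") -> list[str]:
    """Build a CPU-only whisper CLI call that writes a plain-text transcript."""
    return [
        "whisper", str(audio),
        "--model", model,
        "--device", "cpu",
        "--fp16", "False",
        "--output_format", "txt",
        "--output_dir", str(out_dir),
    ]


def run_batch(audio_dir: Path, out_dir: Path, model: str = "small") -> None:
    """Transcribe every pending file, one at a time, skipping finished ones."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for audio in pending_files(audio_dir, out_dir):
        result = subprocess.run(whisper_command(audio, out_dir, model))
        if result.returncode != 0:
            # Keep going so one bad file does not stop the whole night's run.
            print(f"failed: {audio}")
```

Calling `run_batch(Path("podcasts"), Path("transcripts"))` processes whatever is still missing a transcript, so an interrupted run can simply be restarted.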

[–] leds@feddit.dk 2 points 19 hours ago* (last edited 19 hours ago) (1 children)

I was just playing around with Notely Voice: https://f-droid.org/packages/com.module.notelycompose.android

[–] solrize@lemmy.ml 1 points 16 hours ago* (last edited 16 hours ago)

Thanks, that's interesting and I might try it, though it's an Android app. I was hoping for a Linux CLI tool that I could run on a server.

Added: the phone app says it wants microphone and video permissions, which is a bit annoying. I don't care at all about live transcribing. I only want to convert files. It also wants network permissions and special protocols, which seems weird. I can understand if it wants to download models, but I wonder what else it wants the network for.