tgn-whisperer

Category: C/C++ Basics
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2023-08-05 05:07:48
Uploader: sh-1993
Description: Automate transcription of an entire podcast using whisper.cpp

File list:
Makefile (931, 2023-11-12)
app/ (0, 2023-11-12)
app/__init__.py (0, 2023-11-12)
app/bitly (1353, 2023-11-12)
app/bitly.json (4176, 2023-11-12)
app/episode.py (7397, 2023-11-12)
app/process.py (13697, 2023-11-12)
app/shownotes.py (925, 2023-11-12)
app/unwrap-bitly.py (1076, 2023-11-12)
archive/ (0, 2023-11-12)
archive/2049759.rss (414012, 2023-11-12)
archive/episode_links (14295, 2023-11-12)
archive/html_links (13366, 2023-11-12)
archive/htmlfetcher.sh (37791, 2023-11-12)
archive/links (24990, 2023-11-12)
archive/old-makefile (1741, 2023-11-12)
archive/wordcloud.png (519395, 2023-11-12)
archive/wordcloud_wcl.png (521171, 2023-11-12)
episode-transcribed.json (1959230, 2023-11-12)
episode_makefile (2185, 2023-11-12)
episodes/ (0, 2023-11-12)
episodes/Makefile (1111, 2023-11-12)
junk-through-chatgpt.json (6714, 2023-11-12)
junk-through-claude.json (2981, 2023-11-12)
junk.json (48314, 2023-11-12)
models/ (0, 2023-11-12)
process_podcast.py (3522, 2023-11-12)
requirements.txt (55, 2023-11-12)
show-notes-export.csv (1151418, 2023-11-12)
sites/ (0, 2023-11-12)
sites/tgn/ (0, 2023-11-12)
sites/tgn/docs/ (0, 2023-11-12)
sites/tgn/docs/img/ (0, 2023-11-12)
sites/tgn/docs/img/favicon.ico (28445, 2023-11-12)
sites/tgn/docs/img/logo.png (82038, 2023-11-12)
sites/tgn/docs/index.md (1644, 2023-11-12)
... ...

## Introduction

With my discovery of the [whisper.cpp project](https://github.com/ggerganov/whisper.cpp) I had the idea of transcribing the podcast of some friends of mine, [The Grey Nato](https://thegreynato.com/) initially, and now also the [40 and 20](https://watchclicker.com/4020-the-watch-clicker-podcast/) podcast that I also enjoy.

It's running on my trusty M1 Mac Mini and the results (static websites) are deployed to

- [The Compleat Grey Nato](https://www.phfactor.net/tgn/)
- [The Compleat 40 & 20](https://www.phfactor.net/wcl/)

Take a look! This code and the sites are provided free of charge as a public service to fellow fans, listeners and those who find the results useful.

After I got whisper.cpp working, an acquaintance on the TGN Slack pinged me to try their [OctoAI paid/hosted version](https://octoml.ai/models/whisper/) with speaker diarization and I've rewritten the code to use that. Diarization works well, the next step is naming each speaker via a combination of heuristics and an LLM.

This repo is the code and some notes for myself and others. As of 10/9/2023, the code handles two podcasts and is working well.

## Goals

1. Simple as possible - use existing tools whenever possible
2. Incremental - be able to add new episodes easily and without reworking previous ones

### Workflow and requirements

1. Download the RSS file (process.py, using Requests)
2. Parse it for the episode MP3 files (xmltodict)
3. Call Whisper on each (command line, pass by reference)
4. Speaker attribution (episode.py, work in progress)
5. Export text into markdown files (to_markdown.py)
6. Generate a site with mkdocs
7. Publish (rsync)

All of these are run and orchestrated by two Makefiles. Robust, portable, deletes outputs if interrupted, working pretty well.

Makefiles are tricky to write and debug. I might need [remake](https://remake.readthedocs.io/en/latest/) at some point.
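
As a rough illustration of workflow steps 1 and 2, here is a minimal sketch of pulling the MP3 URLs out of a feed. It is not the code in process.py: to stay runnable without extra installs it swaps in the standard library's `xml.etree.ElementTree` for xmltodict and parses an inline, made-up feed instead of downloading one with Requests. The function name `episode_audio_urls`, the sample feed and the example.com URLs are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-episode feed standing in for the downloaded RSS file.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 2 - Second Show</title>
      <enclosure url="https://example.com/audio/ep2.mp3" type="audio/mpeg" length="456"/>
    </item>
    <item>
      <title>Episode 1 - First Show</title>
      <enclosure url="https://example.com/audio/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
  </channel>
</rss>
"""


def episode_audio_urls(rss_text):
    """Return (title, mp3_url) pairs for every feed item that has an enclosure."""
    root = ET.fromstring(rss_text)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # entry without attached audio, nothing to transcribe
        title = item.findtext("title", default="")
        episodes.append((title, enclosure.get("url")))
    return episodes


episodes = episode_audio_urls(SAMPLE_RSS)
for title, url in episodes:
    print(title, url)
```

Each URL in the resulting list would then be handed to the transcription step.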

The [makefile tutorial here](https://makefiletutorial.com/) was essential at several points - suffix rewriting, basename built-in, phony, etc. You can do a _lot_ with a Makefile very concisely, and the result is robust, portable and durable. And fast.

Another good tutorial (via Lobste.rs) [https://makefiletutorial.com/#top](https://makefiletutorial.com)

Directory [list from StackOverflow](https://stackoverflow.com/questions/13897945/wildcard-to-obtain-list-of-all-directories) ... as one does.

### The curse of URL shorteners and bit.ly in particular

For a while, the TGN podcast shared episode URLs with bit.ly. There are good reasons for this, but now when I want to sequentially retrieve pages, the bit.ly throws rate limits and I see no reason to risk errors for readers. So I've built a manual process:

- Grep the RSS file for bit.ly URLs
- Save same into a text file called bitly
- Run the unwrap-bitly.py script to build a json dictionary that resolves them
- The process.py will use the lookup dictionary and save the canonical URLs.

### Episode numbers and URLs

For a project like this, you want a primary index / key / way to refer to an episode. The natural choice is "episode number". This is a field in the RSS XML:

`itunes:episode`

however! TGN was bad, and didn't include this. What's more, they had episodes _in between_ episodes. The episode_number function in process.py handles this with a combination of techniques:

1. Try the itunes:episode key
2. Check the list of exceptions, keyed by string title
3. Try to parse an integer from the title
4. Starting at 2100, assign a number

The story is very similar for per-episode URLs. Should be there, often are missing, and can sometimes be parsed out of the description.

40 & 20 has clean metadata, so this was a _ton_ easier for their feed.

### Optional - wordcloud

I was curious as to how this'd look, so I used the Python wordcloud tool.

A bit fussy to work with my [python 3.11 install](https://github.com/amueller/word_cloud/issues/708):

    python -m pip install -e git+https://github.com/amueller/word_cloud#egg=wordcloud
    cat tgn/*.txt > alltext
    wordcloud_cli --text alltext --imagefile wordcloud.png --width 1600 --height 1200

![wordcloud](archive/wordcloud.png "TGN wordcloud")

40 & 20, run Sep 24 2023 - fun to see the overlaps.

![wordcloud_wcl](archive/wordcloud_wcl.png "40 & 20 wordcloud")
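
Going back to the bit.ly section: the last step of that manual process, looking up short links in the saved dictionary, might look roughly like this. It is a sketch only. The layout of the JSON file (short URL mapped to canonical URL), the function name `canonical_url` and the sample entries are assumptions, not taken from process.py or bitly.json.

```python
import json
from urllib.parse import urlparse

# Hypothetical lookup data; the real file is produced by unwrap-bitly.py
# and its exact layout is an assumption here.
LOOKUP_JSON = '{"https://bit.ly/abc123": "https://example.com/episodes/42"}'


def canonical_url(url, lookup):
    """Return the resolved target of a bit.ly link, or the input unchanged."""
    host = urlparse(url).netloc.lower()
    if host not in ("bit.ly", "www.bit.ly"):
        return url  # not a shortened link, keep as-is
    # If the short link was never resolved, fall back to the short link itself.
    return lookup.get(url, url)


lookup = json.loads(LOOKUP_JSON)
resolved = canonical_url("https://bit.ly/abc123", lookup)
untouched = canonical_url("https://example.com/about", lookup)
unknown = canonical_url("https://bit.ly/zzz999", lookup)
print(resolved, untouched, unknown)
```

Doing the lookup offline means the site build never hits bit.ly, so rate limits cannot break a run.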

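The four-step fallback listed under "Episode numbers and URLs" can be sketched as below. This is an illustration of the idea, not the actual episode_number function in process.py: the dictionary shape of a feed entry, the exceptions table, its value and the sample titles are all hypothetical.

```python
import re
from itertools import count

# Hypothetical exceptions table, keyed by exact episode title.
TITLE_EXCEPTIONS = {"A Special Between-Episodes Show": 2000}

# Shared counter for episodes that cannot be numbered any other way.
_fallback_numbers = count(2100)


def episode_number(entry, exceptions=TITLE_EXCEPTIONS, fallback=_fallback_numbers):
    """Pick an episode number for a feed entry (a dict), trying four sources in order."""
    # 1. Explicit metadata from the feed.
    raw = entry.get("itunes:episode")
    if raw is not None:
        return int(raw)
    title = entry.get("title", "")
    # 2. Hand-maintained exceptions, keyed by title.
    if title in exceptions:
        return exceptions[title]
    # 3. First integer that appears in the title.
    match = re.search(r"\d+", title)
    if match:
        return int(match.group())
    # 4. Nothing worked: hand out the next number, starting at 2100.
    return next(fallback)


numbers = count(2100)  # fresh counter so this example is repeatable
print(episode_number({"itunes:episode": "12", "title": "Anything"}, fallback=numbers))
print(episode_number({"title": "Ep 87 - Example Title"}, fallback=numbers))
print(episode_number({"title": "No digits here"}, fallback=numbers))
```

Starting the assigned numbers at 2100 keeps them well away from real episode numbers, so a synthetic key is easy to spot.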