**How to Use This Dataset?**

Please acknowledge use of this OSF project with the DOI attached to the overview page (10.17605/OSF.IO/VM9NT). Transcripts include timestamped information on words spoken, speaker, sentiment, and more. For plain-text transcripts, use the full JSON or Parquet files; for timestamps, use the timestamped JSON files.

**Current Status**

Up to date as of 01/30/2024.

**What is this Project?**

This project used the AssemblyAI Python library to automatically generate transcripts for every single episode of the Stuff You Should Know podcast --- yup, all of them! Together these form a natural-language, plain-text dataset covering a complete podcast corpus. The transcripts for all Stuff You Should Know episodes are made available in Parquet DataFrame and JSON format.

**What is Stuff You Should Know?**

It is a long-running, very popular, wide-ranging podcast. From the Apple Podcasts page: "If you've ever wanted to know about champagne, satanism, the Stonewall Uprising, chaos theory, LSD, El Nino, true crime and Rosa Parks, then look no further. Josh and Chuck have you covered." https://podcasts.apple.com/us/podcast/stuff-you-should-know/id278981407

**Why Transcripts?**

To build a dataset. I wanted to build an app that would use vector similarity to find the best episodes to listen to on a given topic, for example, undersea volcanoes. SYSK is a prolific podcast covering many topics, which makes it an ideal body of text to search for many different things. Once the app is functional, any podcast can be put through the transcription process and indexed to become searchable. In the meantime, transcripts of SYSK will be useful to others looking for a large corpus of text, of conversations, or of URL-linked content.

Generating the transcripts with AssemblyAI took about one working day per 75 to 175 episodes (depending on other factors) on a laptop computer.
As such, providing the output here will save others time. There are websites where transcripts of a random subset of SYSK episodes can be found, but collecting each subset into a format usable in data science requires building a web scraper. To get all episodes, one would need to build many web scrapers and hope that the subsets each site has chosen to make public overlap enough to form a complete set. With this dataset, all episodes are already available in DataFrame and JSON format, no scraping required. In addition, a script will run regularly (probably around major holidays) to pull new transcripts and update the corpus.

**Why Include Selects and Commercials?**

SYSK currently re-airs old episodes on Saturdays as SYSK Selects. While the majority of each Select episode is an exact repeat of the original, each Select also contains a new introduction explaining why it was selected, and some include updated information at the end. Because of this additional information, Selects are included in the dataset.

Similarly, the commercials are included in each episode. Commercials record which advertisers (in the US market) were contracted with the podcast at the time, which can be interesting in its own right. If the end user wishes to remove commercials, it can be done fairly easily by searching for the intro and outro audio segments that SYSK plays before and after commercial breaks. One might also be able to use the entities information to remove text during the time segments assigned to advertising entities.
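For concreteness, here is a small sketch of working with a timestamped transcript. The field names (`words`, `text`, `start`, `end`, `speaker`) follow AssemblyAI's word-level output, but the exact schema of the files in this project may differ, so treat this as illustrative rather than as the dataset's actual layout:

```python
import json

# Hypothetical excerpt of a timestamped transcript, loosely modeled on
# AssemblyAI word-level output (start/end in milliseconds). The real files
# in this dataset may use different field names.
sample = json.loads("""
{
  "words": [
    {"text": "Welcome", "start": 0,    "end": 420,  "speaker": "A"},
    {"text": "to",      "start": 420,  "end": 560,  "speaker": "A"},
    {"text": "Stuff",   "start": 560,  "end": 900,  "speaker": "A"},
    {"text": "You",     "start": 900,  "end": 1040, "speaker": "A"},
    {"text": "Should",  "start": 1040, "end": 1320, "speaker": "A"},
    {"text": "Know",    "start": 1320, "end": 1650, "speaker": "A"}
  ]
}
""")

def plain_text(transcript: dict) -> str:
    """Rebuild the plain-text transcript from word-level records."""
    return " ".join(w["text"] for w in transcript["words"])

def find_word(transcript: dict, term: str) -> list:
    """Return (start, end) millisecond spans where `term` is spoken."""
    return [(w["start"], w["end"])
            for w in transcript["words"]
            if w["text"].lower() == term.lower()]

print(plain_text(sample))          # Welcome to Stuff You Should Know
print(find_word(sample, "know"))   # [(1320, 1650)]
```

The same pattern scales to a full episode: join the word records for plain text, or keep them intact when the downstream task needs timing or speaker turns.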
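The entity-based commercial removal suggested above could look something like the following sketch. The word records, the advertising time span, and the "SodaCo" brand are all invented for illustration; the dataset's real entity fields may be named and structured differently:

```python
# Illustrative word-level records (start/end in milliseconds); the middle
# two fall inside a commercial break.
words = [
    {"text": "Today",     "start": 0,      "end": 400},
    {"text": "on",        "start": 400,    "end": 550},
    {"text": "SYSK",      "start": 550,    "end": 1000},
    {"text": "Buy",       "start": 60000,  "end": 60400},   # inside an ad break
    {"text": "SodaCo",    "start": 60400,  "end": 61000},   # hypothetical brand
    {"text": "volcanoes", "start": 120000, "end": 120800},
]

# Hypothetical time spans attributed to advertising entities, e.g. derived
# from entity-detection results or from locating the ad intro/outro audio.
ad_spans = [(59000, 62000)]

def strip_ads(words, ad_spans):
    """Drop every word whose span overlaps a known advertising segment."""
    def in_ad(w):
        return any(w["start"] < end and w["end"] > start
                   for start, end in ad_spans)
    return [w for w in words if not in_ad(w)]

cleaned = strip_ads(words, ad_spans)
print(" ".join(w["text"] for w in cleaned))  # Today on SYSK volcanoes
```

The overlap test (`start < end and end > start`) also catches words that straddle an ad boundary, which a simple containment check would miss.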