SUBTLEX-PT-BR: A 61 Million Word Corpus of Brazilian Portuguese Film Subtitles as a Resource for Linguistic Research

doi:10.17605/OSF.IO/VB5YP

## Data description

SUBTLEX-PT-BR is a frequency corpus of Brazilian Portuguese compiled from subtitle texts. It contains 61 million words (61,609,241 word tokens and 136,147 word types).

## What are the files

`SUBTLEX_PT-BR_CDAbove2_Alpha_Spellcheck.tsv` is a tab-separated file. This version contains the word entries after a simple filter (see the processing steps below). It is useful if you would like to examine both typical Brazilian Portuguese words and atypical ones that are unlikely to appear in a dictionary. Because it is not restricted to dictionary words, it can also contain non-Brazilian Portuguese words.

`SUBTLEX_PT-BR_CDAbove2_Alpha_SpellcheckTrue.tsv` is a tab-separated file. It contains only the word entries that have been tagged as a word by a Brazilian Portuguese spellchecker. It contains 59 million words (58,665,120 word tokens and 78,908 word types). This version is useful if you would like to model the lexicon of Brazilian Portuguese.

## How to load?

A tab-delimited file can be opened in various editors or office suites, e.g., Excel or LibreOffice Calc. When importing, select tab as the delimiter.

In `R`, you can use

```
# Full word list (after the simple filter)
df = read.table("SUBTLEX_PT-BR_CDAbove2_Alpha_Spellcheck.tsv", sep='\t', header=TRUE, quote='')
# Only the entries accepted by the spellchecker
df_spellchecked = read.table("SUBTLEX_PT-BR_CDAbove2_Alpha_SpellcheckTrue.tsv", sep='\t', header=TRUE, quote='')
```

## The file has the following columns:

* Word: the word form
* FREQcount: the token frequency count
* CDcount: contextual diversity, the number of subtitle files in which a word occurs (Max: 12096, Min: 3). The minimum is three because words with a CD below three were filtered out to remove noisy entries (typos and encoding issues).
* Spellcheck: whether the word was tagged as a word by the spellchecker (TRUE or FALSE).

## How was the corpus processed?

26,627 subtitle files were obtained. Files were removed if they were duplicates or were detected to be non-Portuguese.
The remaining 12,104 subtitle files were compiled to produce a corpus with a list of words, their frequency and their contextual diversity (CD), a measure of the number of subtitle files in which a word occurs.

A few filters were used to further clean the corpus:

* Filter out some web URLs and e-mail addresses.
* Filter out words that do not consist only of the Brazilian Portuguese graphemes “áâãàéêíóôõúçüabcdefghijklmnopqrstuvwxyz”.
* Filter out words with a CD of 2 or below.
* A spellcheck column was added on 2025.04.22. The column `Word` was processed using the Hunspell spellchecker through the R package (hunspell) and the Brazilian Portuguese library (https://github.com/elastic/hunspell/commit/cdec4da839f9c3ac4d88420a10bf358f1d9133dc).

Finally, this yielded a corpus with 61,609,241 tokens and 136,147 types.

For full information, see Tang (2012).

## How to cite this corpus?

Please cite both the paper below and this OSF repository.

BibTeX citation:

```
@Article{Tang2012_SUBTLEXPTBR,
  Title       = {A 61 Million Word Corpus of {Brazilian} {Portuguese} Film Subtitles as a Resource for Linguistic Research},
  Author      = {Tang, Kevin},
  Journal     = {UCL Working Papers in Linguistics},
  Year        = {2012},
  Pages       = {208--214},
  Volume      = {24},
  Address     = {London},
  Institution = {University College London}
}
```

[PDF of the article](https://github.com/tang-kevin/tang-kevin.github.io/blob/master/Files/publications/Tang_2012_SUBTLEXPTBR_UCLWPL2.pdf)

## References:

Ooms J (2025). _hunspell: High-Performance Stemmer, Tokenizer, and Spell Checker_. R package version 3.0.6, <https://CRAN.R-project.org/package=hunspell>.
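
## Example: applying the alphabetic and CD filters

The alphabetic and CD filters described under "How was the corpus processed?" can be sketched in base R as shown here. This is only an illustration under stated assumptions, not the original processing script: the toy data frame `raw` and its counts are invented, the variable names are hypothetical, and a UTF-8 R session is assumed. The URL/e-mail filter and the Hunspell spellcheck step are not reproduced.

```r
# Toy word list standing in for raw counts (all values are made up).
raw <- data.frame(
  Word      = c("você", "não", "obrigado", "www", "c3po", "xyzzy"),
  FREQcount = c(900L, 850L, 120L, 10L, 7L, 2L),
  CDcount   = c(12L, 12L, 9L, 4L, 3L, 2L),
  stringsAsFactors = FALSE
)

# Graphemes listed in the processing notes, turned into a
# "whole word consists only of these characters" pattern.
graphemes <- "áâãàéêíóôõúçüabcdefghijklmnopqrstuvwxyz"
alpha_pattern <- paste0("^[", graphemes, "]+$")

# Alphabetic filter: keep words made only of the listed graphemes.
is_alpha <- grepl(alpha_pattern, raw$Word)

# CD filter: drop words with a CD of 2 or below.
has_cd_above_2 <- raw$CDcount > 2

filtered <- raw[is_alpha & has_cd_above_2, ]
print(filtered$Word)
```

In this toy example, `c3po` is dropped by the alphabetic filter because it contains a digit, and `xyzzy` is dropped by the CD filter. A non-Portuguese string such as `www` passes both filters, which is consistent with the note that the unspellchecked file can contain non-Brazilian Portuguese words.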
