## Data description

SUBTLEX-PT-BR is a frequency corpus of Brazilian Portuguese compiled from subtitle texts. It contains 61 million words (61,609,241 word tokens and 136,147 word types).

## What are the files

`SUBTLEX_PT-BR_CDAbove2_Alpha_Spellcheck.tsv` is a tab-separated file. This version contains the word entries after a simple filter (see processing steps below). It is useful if you would like to check both typical words and potentially atypical Brazilian Portuguese words that are not usually found in a dictionary. As a result, it can also contain non-Brazilian Portuguese words.

`SUBTLEX_PT-BR_CDAbove2_Alpha_SpellcheckTrue.tsv` is a tab-separated file. It contains only the word entries that a Brazilian Portuguese spellchecker has tagged as a word. It contains 59 million words (58,665,120 word tokens and 78,908 word types). This version is useful if you would like to model the lexicon of Brazilian Portuguese.

## How to load?

A tab-delimited file can be opened using various editors or office suites, e.g., Excel or LibreOffice Calc. You need to select tab as the delimiter.

In `R`, you can use

```
# quote='' disables quote handling, so apostrophes and quotation marks inside words are read literally
df = read.table("SUBTLEX_PT-BR_CDAbove2_Alpha_Spellcheck.tsv", sep='\t', header=TRUE, quote='')
df_spellchecked = read.table("SUBTLEX_PT-BR_CDAbove2_Alpha_SpellcheckTrue.tsv", sep='\t', header=TRUE, quote='')
```

## The file has the following columns:

* Word: the word form
* FREQcount: the token frequency count
* CDcount: contextual diversity, the number of subtitle files that a word has occurred in (Max: 12096, Min: 3). The minimum is three because words with a CD below three were filtered out to remove noisy entries (with typos and other encoding issues).
* Spellcheck: whether the spellchecker tagged the entry as a word (TRUE or FALSE).

## How was the corpus processed?

26,627 subtitle files were obtained. Files were removed if they were duplicates or were detected to be non-Portuguese.
The remaining 12,104 subtitle files were compiled to produce a corpus with a list of words, their frequency, and their contextual diversity (CD), a measure of the number of subtitle files that a word has occurred in. A few filters were used to further clean the corpus:

* Filter out some web URLs and e-mail addresses.
* Filter out words that do not consist only of Brazilian Portuguese graphemes “áâãàéêíóôõúçüabcdefghijklmnopqrstuvwxyz”.
* Filter out words with a CD of 2 and below.
* Spellcheck was added on 2025.04.22. The column `Word` was processed using the Hunspell spellchecker through the R package (hunspell) and the Brazilian Portuguese library (https://github.com/elastic/hunspell/commit/cdec4da839f9c3ac4d88420a10bf358f1d9133dc).

Finally, this yielded a corpus with 61,609,241 tokens and 136,147 types.

For full information, see Tang (2012).

## How to cite this corpus?

Please cite both the paper below and this OSF repository.

BibTeX citation:

```
@Article{Tang2012_SUBTLEXPTBR,
  Title = {A 61 Million Word Corpus of {Brazilian} {Portuguese} Film Subtitles as a Resource for Linguistic Research},
  Author = {Tang, Kevin},
  Journal = {UCL Working Papers in Linguistics},
  Year = {2012},
  Pages = {208--214},
  Volume = {24},
  Address = {London},
  Institution = {University College London}
}
```

[PDF of the article](https://github.com/tang-kevin/tang-kevin.github.io/blob/master/Files/publications/Tang_2012_SUBTLEXPTBR_UCLWPL2.pdf)

## References:

Ooms J (2025). _hunspell: High-Performance Stemmer, Tokenizer, and Spell Checker_. R package version 3.0.6, <https://CRAN.R-project.org/package=hunspell>.
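
As an illustration of the grapheme and CD filters described under "How was the corpus processed?", the following is a minimal sketch in `R`. It is not the pipeline used to build the corpus: the toy data frame `raw`, its contents, and the helper `only_allowed` are hypothetical, and the sketch assumes that words are already lowercased and stored as UTF-8 with precomposed accented characters. The dedicated URL/e-mail filter and the spellcheck step are not reproduced.

```r
# Hypothetical toy word list standing in for the unfiltered counts
raw <- data.frame(
  Word      = c("você", "não", "www.example.com", "ok2", "coração", "xyzq"),
  FREQcount = c(100, 250, 4, 7, 30, 2),
  CDcount   = c(50, 120, 3, 5, 20, 1),
  stringsAsFactors = FALSE
)

# Allowed graphemes, taken from the list given in the processing steps
allowed <- strsplit("áâãàéêíóôõúçüabcdefghijklmnopqrstuvwxyz", "")[[1]]

# TRUE if every character of `word` is one of the allowed graphemes
only_allowed <- function(word) {
  all(strsplit(word, "")[[1]] %in% allowed)
}

# Grapheme filter: drops entries containing digits, dots, and other symbols
keep_alpha <- vapply(raw$Word, only_allowed, logical(1), USE.NAMES = FALSE)

# CD filter: drop words with a CD of 2 and below
keep_cd <- raw$CDcount > 2

filtered <- raw[keep_alpha & keep_cd, ]
print(filtered)
```

In this toy example the URL and `ok2` fail the grapheme check, and `xyzq` is dropped because its CD is below three, so only the three remaining rows are kept.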