#### Data can be downloaded below:

The file [**RastrOS_Corpus_Predictability_Norms.tsv**](https://osf.io/pmz7q/) is a tab-separated values file that contains the predictability norms. This file can be used to explore how different factors influence the Cloze task responses.

The file [**RastrOS_Corpus_Eytracking_Data.tsv**](https://osf.io/8ames/) is a tab-separated values file that contains the eye-tracking data.

The file [**Rastros_Corpus_Response_Annotation.tsv**](https://osf.io/2fn5z/) is a tab-separated values file with three columns: Response / Correction / Is_RANDOM. It lists all the responses that were corrected for spelling.

The file [**Rastros_Corpus_Cloze_FULL.tsv**](https://osf.io/7rpnc/) is a tab-separated values file composed of all the responses of all the participants, to make it easier to analyse and study predictability questions.

**Note:** Fifteen paragraphs used in the RastrOS corpus were taken from proprietary websites, so they are not included in the file **RastrOS_Corpus_Predictability_Norms.tsv**. Instead, for each one we give the link where it can be found and a complete description of how to locate it in the respective text:

**Paragraph 4** is at the beginning of the first paragraph of the text in the link https://abmn.com.br/acoes-e-projetos/abmn-news/a-senha-para-a-felicidade/ and is composed of three sentences.

**Paragraph 13** is composed of three sentences: the first two are in the first paragraph of the text in the link https://sciam.com.br/como-o-derretimento-do-gelo-do-artico-esta-elevando-os-niveis-dos-oceanos-em-todo-mundo/ and the third is the first sentence of the second paragraph.

**Paragraph 14** is at the beginning of the first paragraph of the text in the link https://revistagalileu.globo.com/Ciencia/Espaco/noticia/2019/09/milhoes-de-buracosnegros-de-alta-velocidade-estariam-se-aproximando-da-lactea.html and is composed of two sentences.

**Paragraph 24** is in the first paragraph of the text in the link https://veja.abril.com.br/cultura/trecho-mais-antigo-da-odisseia-de-homero-e-descoberto-na-grecia/ and is composed of two sentences.

**Paragraph 26** is in the first paragraph of the text in the link https://www.sonoticiaboa.com.br/2019/08/19/cientistas-criam-protetor-solar-ecologico-de-cascas-de-caju/ and is composed of one sentence.

**Paragraph 27** is in the sixth paragraph of the text in the link https://www.sonoticiaboa.com.br/2019/07/21/sai-livro-avo-aprendeu-ler-por-causa-neto-adotado/ and is composed of two sentences.

**Paragraph 28** is at the beginning of the first paragraph of the text in the link https://g1.globo.com/bemestar/noticia/mais-de-20-milhoes-de-brasileiros-tem-alguma-dificuldade-para-escutar.ghtml and is composed of four sentences.

**Paragraph 29** starts at the fourth sentence of the first paragraph of the text in the link https://www1.folha.uol.com.br/fsp/1994/9/07/informatica/2.html and is composed of four sentences.

**Paragraph 30** starts at the first sentence of the first paragraph of the text in the link https://canaldopet.ig.com.br/curiosidades/2018-03-01/austismo-racas-de-caes.html and is composed of three sentences.

**Paragraph 33** is composed of four sentences: the first is in the first paragraph of the text in the link https://www.sonoticiaboa.com.br/2019/07/16/aberta-temporada-ipes-amarelos-abelhas-agradecem-video/, the second and third are in the second paragraph, and the fourth is in the third paragraph.

**Paragraph 34** is composed of three sentences: the first and second are in the first paragraph of the text in the link https://www.sonoticiaboa.com.br/2019/08/01/homem-fez-2o-grau-prisao-passa-universidade-agradece/ and the third is the first sentence of the second paragraph.

**Paragraph 35** starts at the first sentence of the first paragraph of the text in the link https://www1.folha.uol.com.br/fsp/1994/12/04/mais!/17.html and is composed of two sentences.

**Paragraph 37** starts at the first sentence of the first paragraph of the text in the link https://canaltech.com.br/redes-sociais/twitter-abandona-estrategia-exclusiva-para-streaming-115162/ and is composed of two sentences.

**Paragraph 39** starts at the first sentence of the first paragraph of the text in the link https://www.bbc.com/portuguese/geral-44765295 and is composed of three sentences.

**Paragraph 40** is composed of three sentences: the first and second are in the first paragraph of the text in the link https://www.sonoticiaboa.com.br/2019/08/14/cientistas-recriam-perfume-cleopatra-chanel-5-egito/ and the third is the first sentence of the second paragraph.

----------

#### Frequency Lists:

The file [**brWaC_Normalized_Frequencies.tsv**](https://osf.io/74j9h/) contains the word frequencies from the brWaC corpus with part-of-speech tags, in four tab-separated columns: Word / Tag / Frequency_FPM / Frequency_Log.

The file [**brasileiro_Normalized_Frequencies.tsv**](https://osf.io/8k5jd/) contains the word frequencies from the Corpus Brasileiro, in three tab-separated columns: Word / Frequency_FPM / Frequency_Log.

----------

#### Papers, Thesis and Master's Dissertation that used the RastrOS corpus:

LEAL, S. L.; VIEIRA, J. M. M.; RODRIGUES, E. S.; TEIXEIRA, E. N. and ALUÍSIO, S. M. (2020) **Using Eye-tracking Data to Predict the Readability of Brazilian Portuguese Sentences in Single-task, Multi-task and Sequential Transfer Learning Approaches**. In: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), pp. 5821–5831. DOI: 10.18653/v1/2020.coling-main.512. URL: https://www.aclweb.org/anthology/2020.coling-main.512

VIEIRA, J.; LEAL, S.; RODRIGUES, E. S.; ALUÍSIO, S. M.; DRIEGHE, D. and TEIXEIRA, E. N. (2021) **Lexical and partial prediction in a Brazilian Portuguese eye-tracking corpus**. In: Proceedings of the 34th Annual CUNY Conference on Human Sentence Processing, University of Pennsylvania (Online). URL: https://www.cuny2021.io/wp-content/uploads/2021/02/CUNY_2021_abstract_304.pdf

LEAL, Sidney Evaldo. **Sentence-based readability prediction in written Brazilian Portuguese, using linguistic, psycholinguistic, and eye tracking metrics**. (PhD Thesis). https://doi.org/10.11606/T.55.2021.tde-16072021-115303

VIEIRA, João Marcos Munguba. **The Brazilian Portuguese eye tracking corpus with a predictability study focusing on lexical and partial prediction**. (Master's Degree Dissertation). http://www.repositorio.ufc.br/handle/riufc/55798
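
----------

#### Reading the files (illustrative sketch):

The tab-separated files listed on this page can be read with any TSV reader. The snippet below is a minimal sketch using only Python's standard library, with the four column names given above for **brWaC_Normalized_Frequencies.tsv**. The helper `read_tsv`, the `has_header` flag, and the sample rows are assumptions made for illustration: this page does not say whether the files start with a header row, and the sample values are invented, not taken from the real data.

```python
import csv
import io

# Column names as listed on this page for brWaC_Normalized_Frequencies.tsv.
BRWAC_COLUMNS = ["Word", "Tag", "Frequency_FPM", "Frequency_Log"]


def read_tsv(handle, columns, has_header=True):
    """Read tab-separated rows into a list of dicts keyed by `columns`.

    `has_header` is an assumption: set it to False if the file turns out
    to have no header row.
    """
    # QUOTE_NONE keeps quote characters inside words as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # drop empty lines
    if has_header and rows:
        rows = rows[1:]
    return [dict(zip(columns, row)) for row in rows]


# Invented sample rows, for illustration only (not real corpus values).
sample = io.StringIO(
    "Word\tTag\tFrequency_FPM\tFrequency_Log\n"
    "casa\tNOUN\t12.5\t1.1\n"
    "de\tADP\t300.0\t2.5\n"
)

records = read_tsv(sample, BRWAC_COLUMNS)

# Example filter: words above 100 occurrences per million.
high = [r["Word"] for r in records if float(r["Frequency_FPM"]) > 100]
```

For a downloaded file, the same helper can be called on `open(path, encoding="utf-8", newline="")`; the UTF-8 encoding is also an assumption and should be checked against the actual file.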