RastrOS: A large eye-tracking corpus of reading data of Higher Education students in Brazil including norms of predictability
Description: Eye-tracking corpora are now widely used in studies of the processing costs of language structure, for example to (i) evaluate models and metrics of syntactic difficulty, (ii) improve or evaluate computational models of simplification via sentence compression, and (iii) evaluate the quality of machine translation with objective metrics. However, only a few such corpora exist, and they cover a small number of languages: English (Luke and Christianson, 2018; Cop et al., 2017), English and French (Kennedy et al., 2013), German (Kliegl et al., 2004), Russian (Laurinavichyute et al., 2018), Hindi (Husain et al., 2015) and Chinese (Yan et al., 2010). For Portuguese, there is no large eye-tracking corpus with predictability norms comparable to those listed above. This gap hinders research in Cognitive Psychology, Psycholinguistics and Natural Language Processing (NLP) for Portuguese.
This project has two objectives: (i) to create and make publicly available a large corpus of eye movements recorded while undergraduate students in Brazil silently read short paragraphs in Portuguese, together with predictability norms, obtained via a Cloze test, that estimate the predictability of the orthographic form and of the morphosyntactic and semantic information of each word in the paragraph; and (ii) to promote research that uses eye-movement techniques in Psycholinguistics and NLP. The methodology for developing the RastrOS corpus follows the same steps as the Provo project (Luke and Christianson, 2018), which used (i) short paragraphs of various genres; (ii) the reading of 55 paragraphs for the eye-tracking test and 5 paragraphs for the Cloze test; and (iii) each word of the corpus being read by at least 40 students.
For RastrOS, the 50 paragraphs of the corpus were taken from journalistic, literary and popular science sources, in the proportions 40% newspaper articles, 20% literary texts and 40% popular science communication. The 50 paragraphs were selected from a pool of more than 100 paragraphs so as to cover the greatest diversity of linguistic factors that are relevant for assessing processing cost and are reflected in the reading process: structural complexity of the sentence (simple vs. compound sentences); verbal transitivity; sentence types (active / passive / relative); and mechanisms for constructing correlation relations, among others. RastrOS uses a highly accurate eye tracker, the EyeLink 1000 Desktop. Stimuli were presented with the Experiment Builder software, and the data were processed with Data Viewer. We evaluated four semantic similarity methods: (i) LSA (Landauer and Dumais, 1997) and (ii) BERT (Devlin et al., 2019), trained on the brWaC corpus (Wagner Filho et al., 2018), and (iii) Word2vec (Mikolov et al., 2013) and (iv) FastText (Bojanowski et al., 2017), trained on the PUC-RS corpus, which includes brWaC, BlogSet-BR (Santos et al., 2018) and a Brazilian Portuguese Wikipedia dump from March 2019. The words are annotated with part-of-speech (PoS) tags and inflection information from the PALAVRAS parser (https://visl.sdu.dk/), with human revision.
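As an illustration of what a predictability norm is, the sketch below shows one plausible way to score a single target word from its Cloze responses. It is not the project's actual pipeline: the function names, the toy responses and the two-dimensional toy vectors are all hypothetical, and in practice the vectors would come from a trained model such as LSA, BERT, Word2vec or FastText. Orthographic predictability is taken here as the proportion of responses that exactly match the target, and semantic predictability as the mean cosine similarity between the target and each response.

```python
import math
from collections import Counter


def cloze_predictability(target, responses):
    """Proportion of Cloze responses that match the target word (case-insensitive)."""
    normalized = [r.strip().lower() for r in responses]
    if not normalized:
        return 0.0
    return Counter(normalized)[target.strip().lower()] / len(normalized)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_predictability(target, responses, vectors):
    """Mean cosine similarity between the target and each response that has a vector."""
    key = target.strip().lower()
    if key not in vectors:
        return 0.0
    sims = [
        cosine(vectors[key], vectors[r.strip().lower()])
        for r in responses
        if r.strip().lower() in vectors
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical data: four participants guess the next word; the target is "casa".
responses = ["casa", "casa", "rua", "Casa"]
# Toy vectors chosen by hand for illustration only (assumption, not real embeddings).
toy_vectors = {"casa": [1.0, 0.0], "rua": [1.0, 1.0]}

orthographic = cloze_predictability("casa", responses)
semantic = semantic_predictability("casa", responses, toy_vectors)
print(orthographic, round(semantic, 4))
```

A response that misses the target orthographically ("rua") still contributes a partial score to the semantic measure, which is the reason for reporting both kinds of norm.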
Principal Investigator: Sandra Maria Aluísio (NILC/ICMC/USP)
Associated Researchers:
Elisângela Nogueira Teixeira (UFC)
Erica dos Santos Rodrigues (PUC/RJ)
Gustavo Henrique Paetzold (UTFPR, campus Toledo)
Katerina Lukasova (UFABC, Centro de Matemática, Computação e Cognição)
Maria da Graça Campos Pimentel (Intermidia/ICMC/USP)
Maria Teresa Carthery-Goulart (UFABC, Centro de Matemática, Computação e Cognição)
Renê Alberto Moritz da Silva e Forster (UERJ)
International collaborations: Denis Drieghe (Associate Professor within Psychology at the University of Southampton)
Students:
João Marcos Munguba Vieira (UFC)
Sidney Evaldo Leal (ICMC/USP)
References
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Cop, U., Dirix, N., Drieghe, D., & Duyck, W. (2017). Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading. Behavior Research Methods, 49, 602–615. doi:10.3758/s13428-016-0734-0
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Husain, S., Vasishth, S., & Srinivasan, N. (2015). Integration and prediction difficulty in Hindi sentence comprehension: Evidence from an eye-tracking corpus. Journal of Eye Movement Research, 8(2).
Kennedy, A., Pynte, J., Murray, W. S., & Paul, S.-A. (2013). Frequency and predictability effects in the Dundee Corpus: An eye movement analysis. Quarterly Journal of Experimental Psychology, 66, 601–618.
Kliegl, R., Grabner, E., Rolfs, M., & Engbert, R. (2004). Length, frequency, and predictability effects of words on eye movements in reading. European Journal of Cognitive Psychology, 16, 262–284.
Landauer, T. K., & Dumais, S. T. (1997). A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104(2), 211–240.
Laurinavichyute, A. K., Sekerina, I. A., & Alexeeva, S. (2018). Russian Sentence Corpus: Benchmark measures of eye movements in reading in Russian. Behavior Research Methods, Jun 15, pp. 1–18.
Luke, S. G., & Christianson, K. (2018). The Provo Corpus: A large eye-tracking corpus with predictability norms. Behavior Research Methods, 50(2), 826–833.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Yoshua Bengio and Yann LeCun, editors, 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings.
Santos, H., Woloszyn, V., & Vieira, R. (2018). BlogSet-BR: A Brazilian Portuguese blog corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).
Wagner Filho, J. A., Wilkens, T., Idiart, M., & Villavicencio, A. (2018). The brWaC corpus: A new open resource for Brazilian Portuguese. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).
Yan, M., Kliegl, R., Richter, E. M., Nuthmann, A., & Shu, H. (2010). Flexible saccade-target selection in Chinese reading. The Quarterly Journal of Experimental Psychology, 63(4), 705–725.
Publications:
1. LEAL, S. L.; ALUÍSIO, S. M.; RODRIGUES, E. S.; VIEIRA, J. M. M. and TEIXEIRA, E. N. Métodos de Clusterização para a Criação de Corpus para Rastreamento Ocular durante a Leitura de Parágrafos em Português. Proceedings of the VI Jornada de Descrição do Português. Salvador, BA, Brazil, 2019. (In Portuguese)
2. LEAL, S. L.; VIEIRA, J. M. M.; RODRIGUES, E. S.; TEIXEIRA, E. N. and ALUÍSIO, S. M. Using Eye-tracking Data to Predict the Readability of Brazilian Portuguese Sentences in Single-task, Multi-task and Sequential Transfer Learning Approaches. Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online), 2020.
3. VIEIRA, J.; LEAL, S.; RODRIGUES, E. S.; ALUÍSIO, S. M.; DRIEGHE, D. and TEIXEIRA, E. N. Lexical and partial prediction in a Brazilian Portuguese eye-tracking corpus. Proceedings of the 34th Annual CUNY Conference on Human Sentence Processing. University of Pennsylvania (Online), 2021.
4. LEAL, S. L.; CASANOVA, E.; PAETZOLD, G. and ALUÍSIO, S. M. Evaluating Semantic Similarity Methods to Build Semantic Predictability Norms of Reading Data. Proceedings of the 24th International Conference on Text, Speech and Dialogue (TSD 2021). Olomouc, Czech Republic (Online), 2021.
5. LEAL, S. E.; DURAN, M. S.; SCARTON, C. E.; HARTMANN, N. S. and ALUÍSIO, S. M. NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese. In: Language Resources and Evaluation - Special Issue: Computational approaches to Portuguese, 2021. (under peer review)
6. LEAL, S. E.; LUKASOVA, K.; CARTHERY-GOULART, M. T. et al. RastrOS Project: Natural Language Processing contributions to the development of an eye-tracking corpus with predictability norms for Brazilian Portuguese. Language Resources and Evaluation (2022). https://doi.org/10.1007/s10579-022-09609-0