Zurich Cognitive Language Processing Corpus: A simultaneous EEG and eye-tracking resource for analyzing the human reading process

doi:10.17605/OSF.IO/Q3ZWS


# ZuCo: A simultaneous EEG and eye-tracking resource for analyzing the human reading process

We present the Zurich Cognitive Language Processing Corpus (ZuCo), a dataset combining EEG and eye-tracking recordings from subjects reading natural sentences as a resource for the investigation of the human reading process in adult English native speakers.

This dataset includes simultaneous EEG and eye-tracking signals collected from 12 subjects while reading natural English text. The recordings span both active and passive reading tasks. This resource can be used for neuroscience and psycholinguistics to analyze the human reading and language understanding process, as well as for natural language processing, where the EEG and eye-tracking signals can be used to train improved machine learning models for various tasks. The text material in the presented dataset is particularly suitable for information extraction tasks such as entity and relation extraction and sentiment analysis.

This repository contains the data files and scripts for preprocessing and loading the data. For a list of additional public repositories making use of the ZuCo data please see the end of this document.

The ZuCo 2.0 data, containing recordings for more sentences and more subjects, can be found [here](https://osf.io/2urht/). It is available in the same format.

## Publication

If you use the ZuCo dataset, please reference the following paper:

Hollenstein, N., Rotsztejn, J., Troendle, M., Pedroni, A., Zhang, C., & Langer, N. (2018). [ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading](https://www.nature.com/articles/sdata2018291). _Scientific data_, 5, 180291.

## Files

Each task folder contains one Matlab file for each subject:

- Task 1 - Normal reading (sentiment)
- Task 2 - Normal reading (relations)
- Task 3 - Task-specific reading (relations)

Task Materials - contains the raw text with the sentiment and relations labels for each sentence.
Note: The paragraph IDs in Task 2 and Task 3 relate to the original corpus.

### Matlab Files

The Matlab files contain the eye-tracking and EEG data:

- Word level data
- Sentence level data
- Only for task 1 (sentiment reading): data recorded during the response period of the control questions (movie ratings)

**Structure of the .mat files:**

- Each line within the structure contains information and features of one sentence.
- Each file contains features in the frequency domain. The following abbreviations are used:
  - “_t1”: theta1, “_t2”: theta2, “_a1”: alpha1, “_a2”: alpha2, “_b1”: beta1, “_b2”: beta2, “_g1”: gamma1, “_g2”: gamma2
- For calculation of the difference values (e.g. field “mean_a1_diff”), 48 electrode pairs were used. The differences are always computed as left minus right values of homologous electrode pairs:

  (1) E22-E9, (2) E26-E2, (3) E23-E3, (4) E33-E122, (5) E27-E123, (6) E19-E4, (7) E24-E124, (8) E34-E116, (9) E28-E117, (10) E20-E118, (11) E35-E110, (12) E29-E111, (13) E13-E112, (14) E30-E105, (15) E36-E104, (16) E41-E103, (17) E45-E108, (18) E46-E102, (19) E47-E98, (20) E42-E93, (21) E37-E87, (22) E53-E86, (23) E52-E92, (24) E51-E97, (25) E50-E101, (26) E60-E85, (27) E59-E91, (28) E58-E96, (29) E66-E84, (30) E65-E90, (31) E70-E83, (32) E38-E121, (33) E44-E114, (34) E43-E120, (35) E39-E115, (36) E40-E109, (37) E57-E100, (38) E64-E95, (39) E69-E89, (40) E74-E82, (41) E71-E76, (42) E67-E77, (43) E61-E78, (44) E54-E79, (45) E31-E80, (46) E7-E106, (47) E12-E5, (48) E18-E10

  Each number represents the electrode number of the 128 channel HydroCel Geodesic Sensor Net.
- EEG electrode mapping: Use the field EEG.chanlocs.labels in the “Preprocessed” files with the EEG data to get the original labels of the 105 electrode values.
- For each sentence (one line within the struct), the substructure “word” contains all features on word level.
- On word level, the following eye-tracking features were extracted:
  - nFixations: number of fixations
  - mean pupil size (and pupil size for each of the following as well); pupil size is the pupil area measured in arbitrary units
  - FFD: first fixation duration
  - TRT: total reading time
  - GD: gaze duration
  - SFD: single fixation duration (this field only contains a value if the word was fixated exactly once, so it will often be empty)
  - GPT: go-past time
- The eye-tracking features are measured in samples at a sampling rate of 500 Hz (1 sample = 2 ms). These features have been extracted analogously to the ones in the GECO dataset (Cop et al., 2016).
- The field “allFixations” contains information about all fixations which occurred between onset and offset of the sentence (including fixations outside of wordbounds).
- Word-level EEG features:
  - For example, FFD_g1 is the gamma1 activity of the EEG data during the first fixation duration of that specific word.
  - The rawEEG field contains the EEG data before feature extraction. If a word was fixated 5 times, rawEEG will contain 5 vectors of 105 values.

Sample script to read the preprocessed MATLAB files in Python: scripts/read-mat-files.py

**Note:** A few subjects have incomplete data due to technical issues during the recordings or errors in preprocessing! The missing sentences all have NaN values.
Task 1 (Sentiment Reading):

- ZDN -- sentences 151-250, 400

Task 2 (Normal Reading):

- ZJS -- sentences 1-50
- ZPH -- sentences 51-100

Task 3 (Task-specific Reading):

- ZGW -- sentences 179-225
- ZKB -- sentences 360-407
- ZPH -- sentences 271-314, 363-407

### Raw data

The raw files can be segmented by the following triggers:

**Task 1 and 2:**

- Sentence onset: 10, sentence finished: 11
- Control sentence onset: 12, control sentence finished: 13
- Control question answered / finished: 15 (trigger 13 also indicates onset of control question)

**Task 3:**

- Sentence onset: 10, sentence finished: 11
- Control sentence onset: 12, control sentence finished: 13

→ For simplification, _all_ sentences included a control question on the _same_ screen. The “(control) sentence finished” trigger (11 or 13, respectively) indicates an answer to the control question after reading the sentence.

### Preprocessed

The preprocessed folder contains one folder per subject. In these folders you can find:

1. The EEG data for each reading block, preprocessed with Automagic (XX_EEG.mat). Please see the description of the preprocessing in the [ZuCo paper](https://www.nature.com/articles/sdata2018291) for the details.
2. The wordbounds (wordbounds_XX.mat), which are the coordinates of the word bounds for each presented word.
3. The eye-tracking data (XX_ET.mat).

## Version January 2019

In the January 2019 version, all recording blocks of each subject were merged before the data were preprocessed with Automagic (Pedroni et al., 2019). This was done to extract identical ICA components across all the blocks of one subject (comparability between blocks). Thus the preprocessed EEG data and the extracted features might differ slightly from the previous version of the preprocessed data. In the initial preprocessed data, each individual block for each subject was analyzed separately by Automagic. This new version of the data contains additional ICA (independent component analysis) features for the EEG data.
The eye-tracking data has not changed. The format remains the same. The ICA feature names have the suffix “_sec”.

## Further information

Public code repositories using the ZuCo data:

- https://github.com/DS3Lab/ner-at-first-sight
  Named entity annotations for the sentences in the ZuCo corpus and other eye-tracking corpora.
  Keywords: _NLP, NER, eye-tracking_
- https://github.com/DS3Lab/zuco-nlp
  Improving sentiment analysis, relation extraction and named entity recognition with eye-tracking, EEG and both combined.
  Keywords: _NLP, NER, relation extraction, sentiment analysis, eye-tracking, EEG_
- https://github.com/DS3Lab/cognival
  A framework for evaluating word embeddings with cognitive language processing data.
  Keywords: _NLP, word embeddings, eye-tracking, EEG, fMRI_
- https://github.com/LukasMut/NER-with-EEG-and-ET
  Named Entity Recognition (NER) with cognitive data, analysis between normal reading and task-specific reading.
  Keywords: _NLP, NER, EEG, eye-tracking, normal reading, task-specific reading_
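
As a pointer for working with the raw files, the trigger codes listed under "Raw data" can be paired into sentence and control-sentence segments. The following is only a minimal sketch and not part of the ZuCo scripts: it assumes the triggers have already been extracted into a time-ordered list of `(sample_index, code)` pairs, and the names `Segment`, `segment_by_triggers` and the example `events` list are hypothetical. How to read the triggers from the raw recordings is not shown.

```python
from typing import Iterable, List, NamedTuple, Tuple

# Trigger codes as listed in the "Raw data" section.
SENTENCE_ONSET, SENTENCE_END = 10, 11
CONTROL_ONSET, CONTROL_END = 12, 13


class Segment(NamedTuple):
    kind: str   # "sentence" or "control"
    start: int  # sample index of the onset trigger
    end: int    # sample index of the matching "finished" trigger


def segment_by_triggers(events: Iterable[Tuple[int, int]]) -> List[Segment]:
    """Pair each onset trigger with the next matching 'finished' trigger.

    `events` is an assumed, time-ordered sequence of (sample_index, code)
    pairs; it is a hypothetical input format, not one defined by ZuCo.
    """
    closing = {
        SENTENCE_ONSET: ("sentence", SENTENCE_END),
        CONTROL_ONSET: ("control", CONTROL_END),
    }
    segments: List[Segment] = []
    open_onset = None  # (code, sample_index) of the onset awaiting its end
    for sample, code in events:
        if code in closing:
            # A new onset replaces any onset that was never closed.
            open_onset = (code, sample)
        elif open_onset is not None and code == closing[open_onset[0]][1]:
            kind = closing[open_onset[0]][0]
            segments.append(Segment(kind, open_onset[1], sample))
            open_onset = None
        # Any other code (e.g. 15, control question answered) is ignored.
    return segments


# Made-up trigger sequence, for illustration only.
events = [(100, 10), (900, 11), (1000, 12), (1800, 13),
          (2100, 15), (2200, 10), (3000, 11)]
segments = segment_by_triggers(events)
```

Each returned segment holds the sample indices of an onset and its matching "finished" trigger, which can then be used to slice the EEG and eye-tracking streams. For Task 3, the "finished" trigger marks the answer to the control question, so a segment there spans both reading and answering.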
