# Data
This repository contains the ZuCo 2.0 data as preprocessed for the ZuCo reading task classification benchmark.
There is a directory for each of the two reading tasks: normal reading (NR) and task-specific reading (TSR). We provide both the preprocessed data and the extracted features. For the extracted features, each file corresponds to one participant and contains both eye-tracking and EEG brain activity data.
We only provide the data of the participants included in the benchmark.
As test data, feature files are provided for 10 subjects, including all features as well as preprocessed EEG and ET data. Sentences of the NR and TSR conditions are merged and shuffled within each subject.
# Code
The code to run the baseline models of the benchmark task is available here: [https://github.com/norahollenstein/zuco-benchmark](https://github.com/norahollenstein/zuco-benchmark).
# Features
Structure of the .mat files:
- Each line within the structure contains information and features of one sentence for the current subject.
- There are 739 rows, each row representing the features extracted for one sentence of the current subject. All normal reading and task-specific reading sentences are shuffled.
- If a row only contains NaN values throughout the provided features, either EEG or ET data of this trial could not be used to extract any features (due to technical errors or due to exclusions in the merging process before the preprocessing pipeline).
- The column "rawData" refers to the full (preprocessed) EEG data acquired while the sentence was presented (electrodes x samples). If one electrode (one row) contains NaNs, the channel data was rejected for this sentence due to large artifacts. The "rawET" data contains the corresponding (x and y) gaze data from the eye tracker.
- Each file contains features in the frequency domain; the following abbreviations are used:
- “_t1”: theta1, “_t2”: theta2, “_a1”: alpha1, “_a2”: alpha2, “_b1”: beta1, “_b2”: beta2, “_g1”: gamma1, “_g2”: gamma2
- For each electrode (105) power is averaged across the sentence duration.
- If one electrode has the value NaN, data was omitted due to large artifacts in the EEG data in the corresponding channel during sentence presentation.
- For calculation of the difference values (e.g. field “mean_a1_diff”), 48 electrode pairs were used. The differences are always based on left – right values of homologous electrode pairs:
(1) E22-E9, (2) E26-E2, (3) E23-E3, (4) E33-E122, (5) E27-E123, (6) E19-E4, (7) E24-E124, (8) E34-E116, (9) E28-E117, (10) E20-E118, (11) E35-E110, (12) E29-E111, (13) E13-E112, (14) E30-E105, (15) E36-E104, (16) E41-E103, (17) E45-E108, (18) E46-E102, (19) E47-E98, (20) E42-E93, (21) E37-E87, (22) E53-E86, (23) E52-E92, (24) E51-E97, (25) E50-E101, (26) E60-E85, (27) E59-E91, (28) E58-E96, (29) E66-E84, (30) E65-E90, (31) E70-E83, (32) E38-E121, (33) E44-E114, (34) E43-E120, (35) E39-E115, (36) E40-E109, (37) E57-E100, (38) E64-E95, (39) E69-E89, (40) E74-E82, (41) E71-E76, (42) E67-E77, (43) E61-E78, (44) E54-E79, (45) E31-E80, (46) E7-E106, (47) E12-E5, (48) E18-E10
Each number represents the electrode number of the 128-channel HydroCel Geodesic Sensor Net.
- saccMeanAmp / saccMaxAmp refers to the mean / maximum amplitude of saccades during the current sentence
- saccMeanVel / saccMaxVel refers to the mean / maximum velocity of saccades during the current sentence
- saccMeanDur / saccMaxDur refers to the mean / maximum duration of saccades during the current sentence
- omissionRate refers to the percentage of skipped words in the current sentence
- The field “allFixations” contains information about all fixations which occurred between onset and offset of the sentence (including fixations outside of wordbounds).
- The field “allSaccades” contains information about all saccades which occurred between onset and offset of the sentence (including saccades outside of wordbounds).
- EEG electrode mapping: See the file chanlocs105.mat in "Further Materials" to get the locations and original labels of the 105 electrodes.
- For each sentence (one line within the struct), the substructure “word” contains all features on word level.
- On word level, the following **eye-tracking** features were extracted:
- fixPositions indicates how the fixations chronologically occurred on each word (fixationPosition 3 on word 2 indicates that the 3rd fixation during the sentence presentation was on word 2)
- nFixations: number of fixations
- mean pupil size (and pupil size for each of the following as well); pupil size is the pupil area measured in arbitrary units
- FFD: first fixation duration
- TRT: total reading time
- GD: gaze duration
- SFD: single fixation duration (this field only contains a value if the word was fixated exactly once, which means it will often be empty)
- GPT: go-past time
- For each of the measures described before (FFD,TRT,GD,SFD,GPT) the average pupil size is provided (e.g. FFD_pupilsize)
- inSacc_velocity refers to the velocity of incoming saccades on the current word, outSacc_velocity to the velocity of outgoing saccades, and withinSacc_velocity to the velocity of saccades within a single word
- inSacc_duration refers to the duration of incoming saccades on the current word, outSacc_duration to the duration of outgoing saccades, and withinSacc_duration to the duration of saccades within a single word
- inSacc_amp refers to the amplitude of incoming saccades on the current word, outSacc_amp to the amplitude of outgoing saccades, and withinSaccade_amp to the amplitude of saccades within a single word
- rawET contains raw gaze data (1. row: time, 2. row: x location, 3. row: y location, 4. row: pupil size) for each fixation on the current word
- The eye-tracking features are measured in samples at a sampling rate of 500 Hz (1 sample = 2 ms).
- On word level, the following **EEG** features were extracted:
- For each of the measures described above (FFD, TRT, GD, SFD, GPT), the average power in all frequency bands is provided for each of the 105 electrodes (e.g. FFD_t1 is the average theta1 power during the first fixation). Additionally, differences (left – right) for the 48 electrode pairs described above are extracted for each frequency band (e.g. FFD_t1_diff).
- rawEEG contains preprocessed EEG data during each fixation on the current word (channels x samples)
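
The left – right difference values and the NaN conventions described above can be illustrated with a minimal Python sketch. It is not the code used to produce the provided features. It assumes that the band power of one sentence has already been loaded into a NumPy array with one value per electrode, and that the electrode labels are available as strings of the form "E22" in the same order; the function names `left_right_diff` and `is_unusable` are hypothetical, and only the first three of the 48 pairs are listed.

```python
import numpy as np

# First three of the 48 homologous (left, right) pairs listed above;
# extend this list with the remaining pairs as needed.
PAIRS = [("E22", "E9"), ("E26", "E2"), ("E23", "E3")]


def left_right_diff(power, labels, pairs=PAIRS):
    """Return left-minus-right power for each electrode pair.

    power  : 1-D sequence with one band-power value per electrode
    labels : electrode labels in the same order as `power`
             (assumed here to be strings such as "E22")
    """
    power = np.asarray(power, dtype=float)
    index = {label: i for i, label in enumerate(labels)}
    # A NaN in either electrode propagates, so a rejected channel
    # yields a NaN difference for its pair.
    return np.array([power[index[left]] - power[index[right]]
                     for left, right in pairs])


def is_unusable(power):
    """Return True if every value is NaN (no features could be extracted)."""
    return bool(np.all(np.isnan(np.asarray(power, dtype=float))))


# Toy example with six electrodes instead of 105.
labels = ["E22", "E9", "E26", "E2", "E23", "E3"]
alpha1 = np.array([4.0, 3.0, 2.5, np.nan, 1.0, 1.5])

diffs = left_right_diff(alpha1, labels)
print(diffs)                # pair 2 is NaN because E2 was rejected
print(is_unusable(alpha1))  # False: only one channel is missing
```

When averaging such values across sentences, NaN-aware functions such as `np.nanmean` avoid discarding a whole sentence because of a single rejected channel.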
# Further Materials
- chanlocs105.mat contains the original channel labels and electrode coordinates of the 105 electrodes used here.
- task-materials contains all sentences and control questions presented in the two reading paradigms.
- subject_answers contains all answers the participants gave during TSR and to the control questions in the NR task.
- Further materials (e.g. recording scripts, raw EEG data etc.) can be accessed in the ZUCO2 OSF repository https://osf.io/2urht/