# InteRead Dataset

*Repository under construction*

The InteRead dataset is an eye-tracking dataset for interdisciplinary research in educational science, psycholinguistics, and natural language processing. It is designed to explore the impact of interruptions on reading behavior. We provide data from 50 adults with normal or corrected-to-normal eyesight and proficiency in English (native or C1 level). The dataset encompasses a self-paced reading task of an English fictional text (28 pages, 5247 tokens), with participants encountering interruptions on six pages. It features gaze data, cognitive scores, resumption lag times, and demographic information. This data collection has been approved by the Ethics Committee of the University of Stuttgart.

## Dataset license agreement

This dataset, with all the files it contains, is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). By using this dataset, you agree to the license terms. The major license terms include:

- **Attribution**: You must give appropriate credit to the original creators of the dataset.
- **Non-Commercial**: You may not use the dataset for commercial purposes.
- **Share Alike**: If you remix, transform, or build upon the dataset, you must distribute your contributions under the same license as the original.

For more details, please refer to the [CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).

## Publication

Please refer to the following paper when using or citing the InteRead dataset:

Francesca Zermiani, Prajit Dhar, Ekta Sood, Fabian Kögel, Andreas Bulling, and Maria Wirzberger. 2024. InteRead: An Eye Tracking Dataset of Interrupted Reading. In Proceedings of the 31st Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).

## Project Structure

### code

This folder contains the code used to annotate, analyze, and visualize the data.
- **annotation_tool.py**: Python code to generate the annotation tool. The tool visualizes the gaze data right after an interruption and allows one to manually select the resumption point.
- **Figure3.R**: R code to generate Figure 3 in the paper.
- **Figure4.R**: R code to generate Figure 4 in the paper, as well as the associated analysis in Section 4 under **Individual Differences in Reading and Resumption Time**.
- **Section4_statistics.R**: R code to perform all the analysis done in Section 5.

### data

- **bboxes.csv.zip**: Words in the stimulus text and their RoI bounding box coordinates, dimensions, and sentence membership.
- **fixation_count.csv**: Fixation counts by a **subject** for each **word** on a **page**.
- **fixation_values.csv**: Fixation durations by a **subject** for each **word** on a **page**.
- **grouped_rt.csv**: Reading times by a **subject** for each **page**, and whether they occur pre- or post-interruption.
- **page_per_subject.csv**: The combined datasets of **pages_with_bounding_boxes.csv** for each participant from **processed_fixation_events.parquet.gzip**.
- **pages_spacy_output.csv**: The output of running the spaCy tokenizer on the reading material/stimuli. It has the following columns:
  1. **word**: The tokenized word.
  2. **lemma**: The lemma of **word**.
  3. **pos**: The part-of-speech category of **word**.
  4. **dep**: The dependency relation label of **word**.
  5. **ner**: The Named Entity Recognition (NER) tag of **word** (if any).
  6. **sent_id**: The id of the sentence in the stimuli.

  Note that this dataset does not contain the page_id, as spaCy processed the entire reading material without the page breaks.
- **pages_with_bounding_boxes.csv**: The page information from **pages_spacy_output.csv** along with that from **bboxes.csv**. The columns are:
  1. **page_id**
  2. **line_num**: The line number within each page.
  3. **token_id**: The unique id of the **word** across the entire stimuli.
  4. **sent_id**: The id of the sentence in the stimuli.
  5. **word**: The tokenized word.
  6. **lemma**: The lemma of **word**.
  7. **pos**: The part-of-speech category of **word**.
  8. **dep**: The dependency relation label of **word**.
  9. **ner**: The Named Entity Recognition (NER) tag of **word** (if any).
  10. **x**: The x-coordinate of the bounding box of **word**.
  11. **y**: The y-coordinate of the bounding box of **word**.
  12. **width**: The width of the bounding box of **word**.
  13. **height**: The height of the bounding box of **word**.
- **pre_post_tests_scores.csv**: All the pre- and post-test scores, including demographic information of the participants involved in the data collection. Please refer to the publication's appendix for further details about the questionnaires administered.
- **processed_regression_events.csv.zip**: Regressions per participant and page, their onset, duration, and interruption phase, computed from the saccades.
- **processed_saccade_events.csv.zip**: Valid detected saccades per participant and page, their onset, duration, and interruption phase.
- **resampled_gaze.csv.zip**: Resampled raw gaze data (x, y coordinates) per participant.
- **resumption_fixation.csv**: Fixation values per participant and resumption page, as well as the onset, duration, and interruption phase.
- **resumption_saccades.csv**: Saccades made by each participant on a resumption page, as well as the onset, duration, interruption phase, and whether the given saccade was a regression.
- **resumption_times_per_page.csv**: Manually annotated resumption time for each page with an interruption (on average, six resumption times per subject).
- **resumption_times_per_subject.csv**: Resumption times averaged per subject (one resumption time per subject).
- **target_words.csv**: Words triggering an interruption and their bounding box information.

### stimuli

Here you can find the PNG files of the pages (0-27) shown to the participants during the data collection.

## Data Processing

We observed fluctuations in the sampling rate and thus resampled the raw gaze data from the eye tracker to obtain data at exactly 1200 Hz. The coordinates were linearly interpolated between the closest valid real samples, excluding blinks. We then averaged the coordinates of the left and right eyes in the resampled gaze data and extracted fixation and saccade events using the REMoDNaV toolkit (Dar et al., 2020). Finally, we excluded events from invalid participants (see paper for details) and assigned each event to line and page range bins. Regressions were extracted from the saccades using the following criterion: a saccade is a regression if its end point lies in the same line but before its start point in reading direction.

## Participants

The InteRead dataset showcases data from adult participants with normal or corrected-to-normal eyesight and English proficiency (native speaker, C1, IELTS 6.5+ or equivalent), without any diagnosed attention or reading disorders. For further details, please see the 'demographics' folder.

## Stimuli

The reading material used for this data collection was extracted from The Adventure of the Speckled Band by Arthur Conan Doyle (retrieved [here](https://ocw.mit.edu/courses/21l-430-popular-culture-and-narrative-serial-storytelling-spring-2013/fc1c78ed095eb824c7124468e8a400e0_MIT21L_430S13_Adventure.pdf)). For further details, please see the 'stimuli' folder.

## Citations

Asim H. Dar, Adina S. Wagner, and Michael Hanke. 2020. REMoDNaV: Robust eye-movement classification for dynamic stimulation. bioRxiv, 53:399–414. https://doi.org/10.1101/619254
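To make the processing steps above concrete, here is a minimal Python sketch of the 1200 Hz resampling (linear interpolation between valid samples) and of the regression criterion. It is an illustration only, under the assumptions of left-to-right reading and arbitrary example coordinates; the released fixation and saccade events were extracted with REMoDNaV, not with this code.

```python
import numpy as np
import pandas as pd

def resample_gaze(t, x, y, rate_hz=1200):
    """Resample gaze coordinates onto a uniform time grid by linear
    interpolation between the closest valid samples. Blink samples
    should be dropped from t/x/y before calling, so interpolation
    only spans valid data."""
    t_uniform = np.arange(t[0], t[-1], 1.0 / rate_hz)
    return pd.DataFrame({
        "t": t_uniform,
        "x": np.interp(t_uniform, t, x),  # linear interpolation in x
        "y": np.interp(t_uniform, t, y),  # linear interpolation in y
    })

def is_regression(start_x, end_x, start_line, end_line):
    """Regression criterion from the dataset description: a saccade is
    a regression if its end point lies in the same line but before its
    start point in reading direction (here, left to right)."""
    return start_line == end_line and end_x < start_x
```

For binocular recordings, the left- and right-eye coordinates would each be resampled this way and then averaged, as described in the Data Processing section.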