We present here a description of the ACLEW DAS sampling practices, along with scripts and tools for implementing them yourself. Any corpus-specific practices are noted after the overall description.

<br>

**5-minute Starter Set**

**Part 1: Recordings Participant Information**

The 5-minute starter set clips were hand-selected by each lab. The guiding selection principle was that the clips serve as a good pilot representation across infant age, target child gender, vocalization types, and SES. The selected recordings served as a pilot run over which the ACLEW annotation scheme was developed.

[Starter Set Recording/Participant Information](https://github.com/aclew/ACLEW_OSF_Supplemental/blob/master/Starter_ACLEW_Participant%20Info.csv)

**Part 2: Segment Selection**

Each lab hand-picked the recordings and clips used in the 5-minute starter set.

[5-minute Starter Set Recordings](https://nyu.databrary.org/volume/390)

<br>

**Random Sampling**

Segments for annotation were selected at random: 15 2-minute segments per recording, from 10 children in each corpus.

**Part 1: Recordings Participant Information**

10 recordings were selected from each of the corpora. Recordings were not selected to be representative of the populations in which they were collected. Rather, they were chosen to represent the range of target child ages and mother education levels, and to balance gender. The spreadsheet below provides the metadata for the recordings used in the random sampling dataset.

[Random Sampling Recording/Participant Information](https://github.com/aclew/ACLEW_OSF_Supplemental/blob/master/RandomSampleACLEW-participant-info.csv)

**Part 2: Random Sampling Segments**

The segments that were annotated were selected completely at random. For each recording, we picked 15 2-minute segments. The only restriction on segment selection was that the 5-minute context regions could not overlap. A segment's context region consists of the segment of interest, the 2 minutes preceding it, and the minute following it.
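
The non-overlap restriction described above can be illustrated with a minimal Python sketch. This is not the project's own script. The function names, the rejection-sampling approach, the whole-second onsets, and the requirement that each context region lie fully inside the recording are all assumptions made for illustration.

```python
import random

SEGMENT_S = 120       # 2-minute annotated segment
PRE_CONTEXT_S = 120   # 2 minutes of context before the segment
POST_CONTEXT_S = 60   # 1 minute of context after the segment


def context_window(onset_s):
    """Return (start, end) of the 5-minute context region for a segment onset."""
    return onset_s - PRE_CONTEXT_S, onset_s + SEGMENT_S + POST_CONTEXT_S


def overlaps(a, b):
    """True if the half-open intervals a and b share any time."""
    return a[0] < b[1] and b[0] < a[1]


def sample_segments(recording_len_s, n_segments=15, seed=None, max_tries=100_000):
    """Draw segment onsets (in seconds) whose context regions never overlap.

    Assumption: every context region must fit inside the recording.
    """
    rng = random.Random(seed)
    latest_onset = recording_len_s - SEGMENT_S - POST_CONTEXT_S
    if latest_onset < PRE_CONTEXT_S:
        raise ValueError("recording is too short for one context region")

    onsets = []
    tries = 0
    while len(onsets) < n_segments:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all segments without overlap")
        candidate = rng.randint(PRE_CONTEXT_S, latest_onset)
        window = context_window(candidate)
        # Reject the candidate if its context region touches any accepted one.
        if any(overlaps(window, context_window(o)) for o in onsets):
            continue
        onsets.append(candidate)
    return sorted(onsets)


if __name__ == "__main__":
    # Hypothetical 16-hour recording.
    for onset in sample_segments(16 * 3600, seed=1):
        print(onset, onset + SEGMENT_S)
```

Because every context region is 5 minutes long, two regions are disjoint exactly when their segment onsets are at least 300 seconds apart, which gives a quick way to check a sampled set.
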
The exception to this sampling procedure was the CAS corpora. For the Tseltal data, 9 5-minute segments were randomly selected. For the Yeli data, 9 2.5-minute segments were randomly sampled. In both cases the same context-overlap criterion was applied. A custom script was used to select the regions and can be found in the repository below. Instructions for using the script are available in the README file.

[Random Sampling Script Repository](https://github.com/BergelsonLab/aclewsample)

<br>

**High-volubility Sampling**

This section details the sampling practices used in selecting high-volubility segments. It describes how the recordings were selected from the corpora and the criteria a segment had to meet to qualify as high-volubility.

**Part 1: Recordings Participant Information**

For each corpus with more than 10 new participant recordings available, a custom script identified 10 new recordings matched to the ACLEW Random Sample dataset. For corpora with fewer than 20 total participants, recordings were sampled such that the sample could be evenly split across the random and high-volubility samples (e.g., for a corpus with total N = 12, 6 files were sampled, including the 2 that were not included in the random sample, plus 4 additional recordings). The recordings were matched as closely as possible by the target child's age, mother's education, and target child's gender. Additionally, we excluded any participants living in the same home as a previously sampled participant.

Exceptions to this process were the SOD and CAS corpora. The CAS corpus sampled high-volubility segments from the same 10 participants used in the random sampling. In the SOD corpus, the 3 additional participants not already used in the random sampling dataset were matched to the random sample using the same process described above. The remaining unmatched participants were reused to create the sample of 10 participants.
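
As a rough illustration of matching on age, mother's education, and gender while excluding shared households, here is a minimal Python sketch. It is not the project's matching script. The `Participant` fields, the ordinal education coding, the unweighted mismatch score, and the greedy one-pass strategy are all hypothetical choices made only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    pid: str
    age_months: float
    mother_edu: int   # hypothetical ordinal education level
    gender: str
    household: str    # hypothetical household identifier


def mismatch(target, cand):
    """Unweighted mismatch score (an arbitrary choice for this sketch)."""
    return (abs(target.age_months - cand.age_months)
            + abs(target.mother_edu - cand.mother_edu)
            + (0 if target.gender == cand.gender else 1))


def match_sample(random_sample, candidates):
    """Greedily pair each random-sample participant with the closest candidate.

    Candidates sharing a household with any already sampled participant
    are skipped, so no household appears twice.
    """
    used_households = {p.household for p in random_sample}
    matches = {}
    for target in random_sample:
        pool = [c for c in candidates if c.household not in used_households]
        if not pool:
            break  # no eligible candidates left
        best = min(pool, key=lambda c: mismatch(target, c))
        matches[target.pid] = best
        used_households.add(best.household)
    return matches


if __name__ == "__main__":
    # Made-up participants, for demonstration only.
    sample = [Participant("r1", 6, 2, "F", "h1"),
              Participant("r2", 14, 4, "M", "h2")]
    pool = [Participant("c1", 7, 2, "F", "h3"),
            Participant("c2", 13, 4, "M", "h4"),
            Participant("c3", 6, 2, "F", "h1"),
            Participant("c4", 14, 4, "M", "h4")]
    for pid, match in match_sample(sample, pool).items():
        print(pid, "->", match.pid)
```

A greedy pass depends on the order in which participants are processed. A real implementation might instead optimize the pairing over the whole sample or weight the three variables differently.
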
None of the high-volubility segments were allowed to overlap segments from the random sampling.

The link below is to a spreadsheet containing the recording information and participant metadata for the files used for high-volubility annotation.

[High-volubility Sampling Recording/Participant Information](https://github.com/aclew/ACLEW_OSF_Supplemental/blob/master/HighVolSampleACLEW-participant-info.csv)

The link below leads to a GitHub repository containing the script used to create the matched sample.

[High-volubility Recording/Participant selection script](https://github.com/aclew/ACLEW_highvolubility_matchedsampler)

**Part 2: High-volubility Segment Vetting**

High-volubility segments were identified using a 2-step process. First, a custom automated process identified segments in the recording that potentially contained a high density of speech. Second, trained annotators roughly assessed these regions to determine whether each clip met the selection criteria for child vocalizations and amount of adult speech. The tutorials below provide detailed instructions for how the annotators assessed the segments. You can also find a link to the spreadsheet used for the rapid assessment of the clips.

[Tutorial (EN): High-volubility Segment Vetting](https://docs.google.com/presentation/d/1VOzDQZ8FdGg28NXqQUJ7jjMeKOqw1WctZGRBkfDt3DM/edit?usp=sharing)

[Tutorial (ES): Examinar archivos de elevado volumen de habla](https://docs.google.com/presentation/d/1WncissukyUbXm7fwC6YAmfcl0YNt7jchS5s00qAokDQ/edit#slide=id.p1)

The spreadsheets below are the lab-specific high-volubility vetting tracking forms.

[High-volubility Segment Vetting Spreadsheet BER](https://docs.google.com/spreadsheets/d/1_Gg6j-KDywfYkQeXRCxyEOxdGWmxt0o_2ivGrHHG9HA/edit?usp=sharing)

[High-volubility Segment Vetting Spreadsheet ROS](https://docs.google.com/spreadsheets/d/1WP6AdHt_j_YFmGb0mIR4tbWQ_WFv2homljhn9vfGIlA/edit#gid=1905168764)

[High-volubility Segment Vetting Spreadsheet SOD](https://docs.google.com/spreadsheets/d/13Mgxshm6fmXpK7JMxY3ngYOHXhHRTbIoxh_v0VJ-V8s/edit?usp=sharing)

<br>

**Corpus-Specific Workflow Deviations**

Electronic media has been handled differently across the different corpora and across the different sampling sets. Note that all the files have had child-directed media segmented and tagged with the annotation "O". This information is summarized in the Google sheet [Electronic Media Annotation Across Corpora](https://docs.google.com/spreadsheets/d/1EsLWHxwg0PFLI-RFjmwCfvtgvmSm5TH7bcH3jWhVf-E/edit?usp=sharing).