**What is the DARCLE Annotation Scheme?**

Interoperable annotation formats are fundamental to the utility, expansion, and sustainability of collective data repositories. In the study of language development, shared annotation schemes have been critical to facilitating the transition from raw acoustic data to searchable, structured inventories (e.g., what CHAT does for CHILDES). Current schemes, like CHAT, typically require annotators to comprehensively and manually annotate recordings for utterance boundaries and orthographic speech content, with an additional, optional range of properties that depend on those annotations. These schemes have been enormously successful for datasets on the scale of dozens of recording hours but are untenable for long-format recording corpora, which routinely contain hundreds or even thousands of hours of audio.

Long-format corpora would benefit greatly from (semi-)automated analyses, both at the earliest steps of annotation (voice activity detection, utterance segmentation, and speaker diarization) and at later steps (e.g., classification-based codes such as child- vs. adult-directed speech, and speech recognition to produce phonetic/orthographic representations). The DARCLE annotation scheme (DAS) is an annotation workflow designed specifically for long-format corpora; it can be tailored by individual researchers and interfaces with the current dominant scheme for short-format recordings. The workflow allows semi-automated annotation and analyses at higher linguistic levels. We give one example of how the workflow has been successfully implemented in a large cross-database project.

----------

**What will you find in this repository?**

We store here a collection of DAS annotation templates ranging from the most minimal version of DAS to project-specific adaptations.
This repository is meant to serve as a distribution point for existing templates and annotation documentation, to promote shared annotation schemes within the long-format recording community. Please request contributor access or email melsod@babylanguagelab.org if you wish to include a link to your annotation system or add it as an OSF component.

- **DAS minimal template.** This is a basic structure for formatting long-format audio, and it is highly adaptable to a variety of research goals.
- **ACLEW DAS.** The ACLEW DAS was developed from the DAS minimal template and provides a comprehensive structure for annotating broad speaker class (e.g., male adult, female adult, electronic media) and addressee (child-directed, adult-directed, etc.), as well as a CHAT-based framework for transcribing adult speech and a language-neutral vocal maturity classification system. A comprehensive tutorial-based training system and a Shiny app for testing against a Gold Standard are available for use.
- **ICALIM-plantilla-basica.**
- **LAAC non-native template.** This is a more focused template for annotating files in a language for which you do not have a native informant. It therefore focuses on identifying speech and speaker and includes a vocal maturity classification, but it does not include transcription or any annotation that requires understanding the content of the speech.

----------

*For more on issues around long-format developmental recordings (a.k.a. "daylong" child language recordings), also see:*

- The DARCLE group's webpage: [http://darcle.org/][1]
- The HomeBank webpage: [http://homebank.talkbank.org/][2]

[1]: http://darcle.org/
[2]: http://homebank.talkbank.org/