**A. Format and scope of collections**

We will first assess what material was archived for each of the collections. We will discuss the range of digital formats, the tools needed to read the data, and the extent to which open-source tools and open data formats are used. We will investigate the metadata and check the extent to which it matches up with what was archived. During this time, we will also investigate how much of the audio data is transcribed and collect basic statistics about the types of transcription (IPA, orthographies used, translated, and the like). Most of this checking can be done with basic scripts in Python or R, which allow us to run data sanity checks. For example, we will check how many audio files have a matching transcription file, and which ones (if any) are missing (a sketch of such a check follows the research questions below). We aim to have the audit-a-long meeting in the middle of August, focused on the contents of collections.

Research questions:

- What languages are represented (documentary language(s), languages for meta-analysis)?
- Where individual speaker attribution is appropriate, how easy is it to recover that information?
- How accessible is the collection to researchers and community members?
- How easy is it to retrieve metadata? Can we find out easily what is in the collection?
- How easily can the corpora be searched? (e.g., how easily can I find all the narratives? Is the corpus organized in a way that facilitates retrieval by computational or manual methods?)
- What software was used to create the files in the collection? Is it still maintained?
- Number of speakers, number of hours (and other basic statistics of the corpus, keeping Dobrin et al.'s [2007] caveats in mind);
- Does the metadata match the deposit? If not, in what ways?
- How complete are transcript files?
- Is there a way to cite or reference individual pieces of data in the collection in a stable way?
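As a minimal illustration of this kind of check, the sketch below (in Python, using only the standard library) reports audio files without a matching transcription and vice versa. The directory layout and the .wav/.eaf extensions are assumptions, not properties of any particular collection, and would be adjusted to match the actual deposit.

```python
# Minimal sketch of an audio/transcript matching check, assuming a flat
# directory where recordings are .wav files and transcriptions are ELAN
# .eaf files sharing the recording's base name. Adjust extensions and
# paths to match the actual collection.
from pathlib import Path

collection = Path("collection")  # hypothetical local copy of the deposit

audio = {p.stem for p in collection.glob("*.wav")}
transcripts = {p.stem for p in collection.glob("*.eaf")}

print(f"{len(audio)} audio files, {len(transcripts)} transcription files")
print(f"{len(audio & transcripts)} audio files have a matching transcription")

for stem in sorted(audio - transcripts):
    print(f"no transcription for {stem}.wav")
for stem in sorted(transcripts - audio):
    print(f"no audio for {stem}.eaf")
```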
**B. Working with sound files**

Next, we consider transcription, alignment, and extraction of phonetic information. For example, we will automatically align text at the level of segments and extract material for vowel space and f0 measurements. Currently we do not anticipate working with video collections in this section, except inasmuch as we may extract audio channels from video files to align audio for subtitling; this is for reasons of time.

Before digital transcription programs such as Elan (Wittenburg et al. 2006) became widely used, transcripts were usually handwritten and then typed for interlinear glossing in a program such as Toolbox. There is thus a great need for alignment tools which allow us to add value to transcripts by associating text with the underlying audio and video. For phonetic research goals, such data allows extraction of phonetic measurements for acoustic analysis, e.g. of speaker variation, properties of the language, and comparative data with other languages. It is also crucial for the study of prosody at the word and phrase level. For community-oriented materials, forced alignment facilitates creating subtitled videos and talking books. For corpora with transcriptions that are time-aligned at the utterance level, further segmentation should be straightforward.

Several forced alignment programs exist, including WebMAUS (Strunk, Schiel & Seifart 2014), Praat (Boersma & Weenink 2014), P2FA/FAVE-align (Evanini, Isard & Liberman 2009), and implementations of the Kaldi alignment algorithm (Povey et al. 2011), such as the Montreal Forced Aligner (MFA; McAuliffe et al. 2017). All these algorithms align text to speech by identifying CV transitions. P2FA and the Montreal Forced Aligner use a pronouncing dictionary, while WebMAUS and Praat use generic language models. Babinski et al. (2019) compared these algorithms for the Australian (Pama-Nyungan) language Yidiny and found that the Montreal Forced Aligner and P2FA produced results which were indistinguishable from manually coded data for studying word-level f0 prosodic features, while for duration measurements, results were more variable (see also DiCanio (2012, 2013) for earlier work on these questions). However, while they identified possible interactions between language models, pronouncing dictionary choices, and alignment accuracy, they did not pursue those interactions.

Not all forced aligners can work with all collections. MFA requires short files (10-15 seconds), while P2FA can work with longer files. P2FA uses a pretrained English model, while MFA allows users to train their own models. All these models require a pronunciation dictionary to be constructed using correspondences between the language's graphemes and the CMU Arpabet. The CMU Arpabet in use only has English phonemes; thus choices need to be made about how best to represent speech sounds of other languages which do not occur in English (a sketch of this dictionary-building step follows the research questions below). Moreover, once materials are aligned, they need to be converted back to the standard orthographic conventions if they are to be used for community work.

Note that we do not commit to a full manual evaluation of the accuracy of the forced alignment work here, since the preparation of gold standard texts is extremely time consuming. However, we can do data sanity checks which will give us an indication if something has gone very wrong. We are able to compare the results gained from our automatic measurement extraction with published results in the literature, where they exist (e.g. with respect to segment duration). We are able to look at the range of data, subset data, and use other methods of inferring the accuracy and reliability of the results, at least to some extent. Naturally, we will also listen to a portion of the materials and manually note the types of alignment errors discovered. Note that we do not consider the materials produced by our tests to be the end product of a linguistic analysis. Rather, they are the first step which a linguist would take prior to subsequent analysis.

Since not all documentary linguists know how to run forced alignment and segment extraction algorithms, we intend to have the audit-a-long meetings be in two parts for this section. The first (in mid September) would be the release of a tutorial on how to run these analyses. The algorithms do come with instructions and specifications; however, in discussion with other linguists I have found that many fieldworkers find these instructions very abstract and difficult to apply to their own data. The meeting to discuss the issues arising from alignment is planned for the end of October.

Research questions:

- Does the corpus have digital alignments at the utterance level? [type of program: Elan, Transcriber, CHILDES, Praat, etc.] If not, can we create them (using P2FA)?
- Can we align the corpora (or a substantial proportion of them) at the segment level?
- Can we extract data about the segments (formant data for vowels, f0 measurements, for example) to produce a basic acoustic description of the language?
- If the answer is no for any of the above, what problems were encountered? Why did alignment fail? What features were the result of earlier data processing choices, rather than being inherent to the corpus materials themselves? Is it possible to alter or adapt workflows so that usable data may be obtained?
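To make the dictionary-building step above concrete, here is a minimal Python sketch that applies a grapheme-to-Arpabet mapping to a wordlist and writes a pronouncing dictionary in the simple word-plus-phones format the aligners read. The mapping, the example words, and the output file name are hypothetical illustrations only; the real decisions about how to represent non-English sounds are language-specific and need to be made and documented explicitly.

```python
# Sketch of building an Arpabet-style pronouncing dictionary from a
# hypothetical grapheme-to-Arpabet mapping. Digraphs are matched before
# single graphemes by sorting keys longest-first.
GRAPHEME_TO_ARPABET = {  # illustrative only; not any real language's mapping
    "ng": "NG",
    "rr": "R",   # e.g. a trill mapped to the nearest English category
    "a": "AA1",
    "i": "IY1",
    "u": "UW1",
    "p": "P",
    "t": "T",
    "k": "K",
    "m": "M",
    "n": "N",
    "w": "W",
    "y": "Y",
}

def to_arpabet(word: str) -> str:
    """Greedily convert an orthographic word to a space-separated Arpabet string."""
    graphemes = sorted(GRAPHEME_TO_ARPABET, key=len, reverse=True)
    phones, i = [], 0
    while i < len(word):
        for g in graphemes:
            if word[i:i + len(g)] == g:
                phones.append(GRAPHEME_TO_ARPABET[g])
                i += len(g)
                break
        else:
            raise ValueError(f"no Arpabet mapping for {word[i]!r} in {word!r}")
    return " ".join(phones)

# Write one "WORD<TAB>PHONE PHONE ..." entry per line (hypothetical words).
wordlist = ["nganka", "wati", "puyu"]
with open("lang_dict.txt", "w", encoding="utf-8") as f:
    for word in wordlist:
        f.write(f"{word}\t{to_arpabet(word)}\n")
```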
Preparing corpus data for alignment, segmentation, and analysis involves using Praat (Boersma & Weenink 2014) and R (R Core Team 2017) scripts; these will be adapted from work that has already been substantially completed within the lab, and the scripts will be released on GitHub so that others can use them. These scripts include creating Arpabet dictionaries for MFA or P2FA alignment, resampling audio, segmenting long audio files, batch-running alignments, extracting acoustic measurements from the resulting files, and some data sanity checks.

**C. Working with print materials**

For this stage of the project, we will be working primarily with the text corpus. For example, we will be looking at whether we could make a bilingual or trilingual wordlist from the data, and what further work would be required to make a preliminary dictionary from the corpus materials (a first pass at this is sketched at the end of this section). How many unique lexical items are there? How much of the sentence data is parsed? Are transcripts already linked to audio directly? If not, can we relink the files? Note that some of these questions may have already been addressed in previous months. We aim for an audit-a-long meeting in early December to address these questions.

Depending on the corpus, some of the secondary analyses mentioned here (e.g. wordlists, import to Flex or another program) may have already been completed by the original researcher. If that material was archived, we will use it in our workflows (see questions below about recovering relationships between secondary and primary data); if it was not archived (or does not exist), we will attempt a basic project setup in Flex (Fieldworks Language Explorer).

Research questions:

- Are there analytical materials archived (or accessibly published)? (e.g. parsed transcripts in Flex or Toolbox, a wordlist, a sketch grammar with examples)
- What software programs were used to compile these materials? Are those programs still available? What limits are there on their use (free, open-source, regularly maintained, etc.)? Can we load and use the corpus as it was archived? Are all essential files in the archive?
- If so, are the sources of the analytical materials easy to recover within the corpus? That is, how closely linked are primary and secondary source materials?
- Can the corpus be used to create a preliminary wordlist? (with which language(s)?)
- What steps are needed to create a parsed corpus (if one does not already exist) from the data? Can it be linked to audio and video?

The questions that could be asked here are essentially limitless, since the number of things one can do with a text corpus once it is compiled is so varied. For example, one could evaluate suitability for training a part-of-speech tagger, or import into Bender et al.'s AGGREGATION project for multilingual grammar engineering and machine translation (e.g. Bender 2018). One could use the text-speech aligned data to train automated transcription (Michaud et al. 2018). I leave this question open at this point.
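As a first pass at the wordlist and unique-lexical-item questions, something like the following Python sketch is usually enough to get initial counts from plain-text transcript exports. The transcripts/ directory, the .txt extension, and the crude tokenization are assumptions; real corpora need format-specific cleaning first (e.g. stripping tier labels from Elan or Toolbox exports), and unique word forms are only an upper bound on unique lexical items until the corpus is parsed.

```python
# Sketch of a preliminary wordlist and type/token count from exported
# transcript files. Assumes plain-text exports in transcripts/*.txt;
# real corpora will need format-specific cleaning before counting.
import re
from collections import Counter
from pathlib import Path

counts = Counter()
for path in Path("transcripts").glob("*.txt"):
    text = path.read_text(encoding="utf-8").lower()
    # crude tokenization: runs of word characters (keeps non-ASCII letters)
    counts.update(re.findall(r"\w+", text))

print(f"{sum(counts.values())} tokens, {len(counts)} unique word forms")

# Write a frequency-sorted preliminary wordlist for later dictionary work.
with open("wordlist.tsv", "w", encoding="utf-8") as f:
    for word, n in counts.most_common():
        f.write(f"{word}\t{n}\n")
```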
**D. Community-oriented language materials**

The next part of the project is focused specifically on the community-oriented language materials that can be created from digital corpora. Note, as discussed above, we do not assume that any particular types of materials are necessarily needed, welcome, or appropriate for all Indigenous or endangered language communities whose languages are represented in the corpus audit. However, these are all types of materials which can commonly and easily be produced from documentary corpora, and so they are appropriate places to start in an audit of this type. The audit-a-long meeting would be at the end of March.

Research questions:

- Can we convert the data to a mobile phone dictionary, such as with SIL's Dictionary App Builder?
- Can we make a spell-checker for Microsoft Word or LibreOffice? (One starting point is sketched at the end of this document.)
- Could we make a web display of time-aligned texts (e.g. using LingView (Pride, Tomlin & AnderBois 2020))?
- Could we convert some materials to an online language course?
- Are there other suggestions that came out of the January meeting that we can implement with current project expertise and corpus materials? If not, what is needed?

**E. Sustainability**

We look at re-archiving, the longevity of materials we have created, data storage issues, data pipeline issues, and the reintegration of materials created throughout the year with other materials created for analysis and as part of a documentation pipeline. The audit-a-long group will naturally be focused on such questions as well.

Research questions:

- Can materials created by the project be integrated into the existing archive? If not, why not?
- What aspects of the existing collection and the new materials are likely to be obsolete in a few years? Can we guard against that?
- Can additional language documentation materials be included in a straightforward way into the workflows used here? (For researchers working on pre-archived corpora or using the tools in the field, can they easily incorporate new materials into existing collections?)
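Returning to the spell-checker question in section D: LibreOffice reads Hunspell dictionaries, and a minimal (affix-free) Hunspell dictionary pair can be generated directly from a corpus wordlist such as the one sketched in section C. The file names here are assumptions; a usable checker for a morphologically rich language would need real affix rules, and packaging the result as a LibreOffice extension (or for other word processors) is a further step.

```python
# Sketch of a minimal Hunspell dictionary pair (.dic/.aff) built from a
# wordlist, with no affix rules. Assumes wordlist.tsv from the earlier
# sketch (word<TAB>frequency per line).
from pathlib import Path

words = [line.split("\t")[0]
         for line in Path("wordlist.tsv").read_text(encoding="utf-8").splitlines()
         if line.strip()]
unique = sorted(set(words))

# The .dic file starts with an (approximate) entry count, then one word per line.
with open("lang.dic", "w", encoding="utf-8") as f:
    f.write(f"{len(unique)}\n")
    f.write("\n".join(unique) + "\n")

# A minimal .aff file only declares the character encoding.
with open("lang.aff", "w", encoding="utf-8") as f:
    f.write("SET UTF-8\n")
```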