DATA TREATMENT PIPELINE
This pipeline implements utterance segmentation, speaker diarization, automatic transcription, forced alignment, acoustic analyses, and statistical analyses in R.
1. SPLIT BY UTTERANCE
Original script written by Mietta Lennes (25.1.2002)
input = long .wav sound file
output = textgrid with annotated utterance tier
praat --run ~/utt_seg.praat 0 0 0.150 59 19 0 20 0 ~/outputdir/ ~/inputdir/
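All Praat steps in this pipeline use the same `praat --run script args...` pattern. When they are launched from a script instead of typed into a shell, `~` is not expanded automatically. A minimal sketch of a wrapper, assuming Python 3 and that `praat` is on the PATH; the helper names `praat_command` and `run_praat` are hypothetical and not part of the original scripts:

```python
import os
import subprocess

def praat_command(script, *args):
    """Build the argument list for `praat --run`, expanding a leading ~ in each argument."""
    expanded = [os.path.expanduser(str(a)) for a in args]
    return ["praat", "--run", os.path.expanduser(script), *expanded]

def run_praat(script, *args):
    """Run the script and raise CalledProcessError if Praat exits with an error."""
    subprocess.run(praat_command(script, *args), check=True)

# Same arguments as the step 1 command line; numbers are converted to strings.
step1 = praat_command("~/utt_seg.praat", 0, 0, 0.150, 59, 19, 0, 20, 0,
                      "~/outputdir/", "~/inputdir/")
```

Passing the arguments as a list avoids shell quoting problems, which matters for later steps that pass regular expressions or empty strings.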
2. SPEAKER DIARIZATION
See https://github.com/pyannote/pyannote-audio for original code and more details.
input = long .wav sound file
output = table with diarization time stamps
python ~/pyannote_dia.py ~/inputdir/
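pyannote_dia.py itself is not reproduced here. A minimal sketch of what such a script might look like, assuming the pyannote.audio 3.x API (`Pipeline.from_pretrained` with `use_auth_token`, `itertracks`), the `pyannote/speaker-diarization-3.1` model, and a Hugging Face access token. The function names and the tab-separated column layout are illustrative assumptions, not the original script's:

```python
import csv

def diarize(wav_path, hf_token):
    """Return (start_s, end_s, speaker) tuples for one .wav file."""
    # Imported lazily so the table-writing helper works without pyannote installed.
    from pyannote.audio import Pipeline
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    diarization = pipeline(wav_path)
    return [(turn.start, turn.end, speaker)
            for turn, _, speaker in diarization.itertracks(yield_label=True)]

def write_diarization_table(segments, out_path):
    """Write diarization time stamps as a tab-separated table with a header row."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["start", "end", "speaker"])  # assumed column names
        for start, end, speaker in segments:
            writer.writerow([f"{start:.3f}", f"{end:.3f}", speaker])
```

Other pyannote.audio versions may name the token argument differently, so check the installed version before relying on this call.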
3. ADD DIARIZATION INTERVALS TO TEXTGRID
Original script written by Author1 (1.27.2024)
input = textgrid with annotated utterance tier (output from step 1) + table with diarization time stamps (output from step 2)
output = textgrid with annotated utterance tier and annotated diarization tiers
praat --run ~/dia_to_existing_textgrid.praat ~/inputdir/
4. COMBINE DIARIZATION AND UTTERANCE TIERS
Original script written by Author1 (1.27.2024); uses toNonOverlappingIntervals.proc from https://gitlab.com/cpran/plugin_tgutils
input = textgrid with annotated utterance tier and annotated diarization tiers (output from step 3), toNonOverlappingIntervals.proc
output = textgrid with single annotated utterance/diarization tier
praat --run ~/combine_utt_dia_tiers.praat ~/inputdir/
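The idea behind flattening the utterance tier and a diarization tier into a single tier can be illustrated outside Praat. The sketch below is not the Praat script or toNonOverlappingIntervals.proc. It assumes tiers are lists of (start, end, label) tuples, and the combined label format (utterance label, separator, speaker label) is a hypothetical choice:

```python
def label_at(tier, t):
    """Return the label of the interval in `tier` that contains time t, or ''."""
    for start, end, label in tier:
        if start <= t < end:
            return label
    return ""

def combine_tiers(utt_tier, dia_tier, sep="_"):
    """Flatten two interval tiers into one tier of non-overlapping intervals."""
    # Every boundary from either tier becomes a boundary in the combined tier.
    bounds = sorted({t for start, end, _ in utt_tier + dia_tier for t in (start, end)})
    combined = []
    for start, end in zip(bounds, bounds[1:]):
        mid = (start + end) / 2
        utt, spk = label_at(utt_tier, mid), label_at(dia_tier, mid)
        # Keep a label only where an utterance and a speaker turn overlap.
        label = f"{utt}{sep}{spk}" if utt and spk else ""
        combined.append((start, end, label))
    return combined
```

Adjacent unlabeled intervals are left unmerged here to keep the sketch short.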
5. SPLIT TEXTGRIDS & WAVS BY INTERVALS (ANNOTATED BY UTTERANCE AND DIARIZATION)
Original script written by Mietta Lennes (8.3.2002), modified by Danielle Daidone (4.27.2019), then modified by Author1 (1.27.2024).
input = long .wav sound file + textgrid with single annotated utterance/diarization tier (output from step 4)
output = split textgrids and corresponding .wav sound files
praat --run ~/utterance_split_onetier.praat ~/inputdir/ ~/outputdir/ "_"
6. TRANSCRIPTION
See https://github.com/openai/whisper for original code and more details.
input = split .wav files for target speaker (output from step 5)
output = table with transcripts for each split utterance/speaker turn
python ~/whisper_trans.py ~/inputdir/
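whisper_trans.py is not reproduced here. A minimal sketch of a comparable script, assuming the openai-whisper package (`whisper.load_model`, `model.transcribe`). The model size "base", the function names, and the two-column output table are assumptions made for illustration:

```python
import csv
import os

def whisper_transcriber(model_name="base", language="en"):
    """Build a transcribe(path) -> text function backed by openai-whisper."""
    import whisper  # lazy import: only needed when actually transcribing
    model = whisper.load_model(model_name)
    return lambda path: model.transcribe(path, language=language)["text"]

def transcribe_dir(wav_dir, transcribe):
    """Return (filename, transcript) rows for every .wav file in wav_dir, sorted by name."""
    rows = []
    for name in sorted(os.listdir(wav_dir)):
        if name.lower().endswith(".wav"):
            text = transcribe(os.path.join(wav_dir, name))
            rows.append((name, text.strip()))
    return rows

def write_transcripts(rows, out_path):
    """Write one tab-separated row per split file, with a header row."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["file", "transcript"])  # assumed column names
        writer.writerows(rows)
```

Passing the transcription function in as an argument keeps the directory loop testable without loading a model.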
7. CREATE NEW TEXTGRID WITH ANNOTATION BY UTTERANCE SEGMENTATION AND DIARIZATION WITH TRANSCRIPTION
Original script written by Author1 (2.4.2024)
input = table with transcripts for each split utterance/speaker turn (output from step 6) + long .wav sound file
output = long .wav sound file + long textgrid with target speech annotated (by utterance segmentation and diarization) and transcribed
praat --run ~/dia_trans_to_textgrid_EN.praat ~/inputdir/
8. SPLIT NEW TEXTGRIDS & WAVS BY INTERVAL
Original script written by Mietta Lennes (8.3.2002), modified by Danielle Daidone (4.27.2019), then modified by Author1 (1.27.2024).
input = long .wav sound file + long textgrid with target speech annotated (by utterance segmentation and diarization) and transcribed (output from step 7)
output = final split textgrids and corresponding split .wav sound files
praat --run ~/utterance_split_onetier_2_EN.praat ~/inputdir/ ~/outputdir/ "_"
9. FORCE ALIGN
See https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/index.html for original code and more details.
input = final split textgrids and corresponding .wav sound files (output from step 8)
output = aligned split textgrids and .wav sound files
mfa validate ~/inputdir/ english_us_arpa english_us_arpa
mfa align ~/inputdir/ english_us_arpa english_us_arpa ~/inputdir/aligned/ --no_textgrid_cleanup --clean
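MFA expects each sound file in the corpus directory to have a transcript file with the same base name. Before running the two commands above, it can help to check that step 8 produced complete pairs. A minimal sketch, assuming a flat directory and the `.TextGrid` extension; the function name is hypothetical:

```python
from pathlib import Path

def unpaired_files(corpus_dir, annotation_ext=".TextGrid"):
    """Return (wavs_without_annotation, annotations_without_wav) as sorted lists of stems."""
    corpus = Path(corpus_dir).expanduser()
    wavs = {p.stem for p in corpus.glob("*.wav")}
    grids = {p.stem for p in corpus.glob(f"*{annotation_ext}")}
    return sorted(wavs - grids), sorted(grids - wavs)
```

Two empty lists mean every .wav file has a matching TextGrid and vice versa.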
10. ACOUSTIC ANALYSES: PRAATSAUCE
See https://github.com/kirbyj/praatsauce for original scripts and more details.
input = aligned split .wav sound files and textgrids (output from step 9)
output = acoustic measures for each vowel (spectral_measures.txt)
praat --run ~/shellSauce.praat ~/input.wavdir/ ~/input.textgriddir/ ~/outputdir/ filename_spectral_measures.txt 1 1 2 "^$|^\s+$|-|!|B$|CH$|D$|DH$|DX$|EL$|EM$|EN$|F$|G$|HH$|H$|JH$|K$|L$|M$|N$|NX$|NG$|P$|Q$|R$|S$|SH$|T$|TH$|V$|W$|WH$|Y$|Z$|ZH$|spn$|sp$|sil$" 0 "" "_" 0 "n equidistant points" 9 1 1 1 1 0.05 0.5 6000 0.005 0 20 320 0 0 5 50 1 500 1500 2500 0 0 0
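The long quoted argument in the command above appears to be a regular expression of interval labels to skip, so that only vowel intervals are measured. The sketch below checks which labels the pattern would exclude. It assumes the pattern is applied with search semantics (a match anywhere in the label) and that Python's `re` module behaves like Praat's regex engine for this pattern:

```python
import re

# Pattern copied verbatim from the shellSauce.praat command line.
SKIP_LABELS = (
    r"^$|^\s+$|-|!|B$|CH$|D$|DH$|DX$|EL$|EM$|EN$|F$|G$|HH$|H$|JH$|K$|L$|M$|N$|NX$|NG$|"
    r"P$|Q$|R$|S$|SH$|T$|TH$|V$|W$|WH$|Y$|Z$|ZH$|spn$|sp$|sil$"
)

def is_skipped(label):
    """True if the label matches the skip pattern anywhere (search semantics)."""
    return re.search(SKIP_LABELS, label) is not None

for label in ["AA1", "IY0", "ER0", "B", "NG", "sil", "", "AH"]:
    print(repr(label), "skipped" if is_skipped(label) else "measured")
```

Under these assumptions, vowel labels are kept only because they end in a stress digit. A vowel label without one, such as AH or ER, would match `H$` or `R$` and be skipped.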
11. ACOUSTIC ANALYSES: PITCH TRACKING, JITTER, SHIMMER, UTT POS
Original script written by Author1 (2.29.2024)
input = aligned split .wav sound files and textgrids (output from step 9)
output = acoustic measures for each vowel (pitchtrack_jitter_shimmer.txt)
praat --run ~/extract_pitch_jitter_shimmer_EN.praat ~/inputdir/ ~/outputdir/
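For reference, Praat's local jitter is the mean absolute difference between consecutive glottal periods divided by the mean period, and local shimmer is the same ratio computed on peak amplitudes. The sketch below illustrates that definition only. It is not the Praat script, it omits Praat's period floor, ceiling, and maximum period factor settings, and the sample values are hypothetical:

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

periods = [0.0100, 0.0102, 0.0098, 0.0101]  # hypothetical glottal periods in seconds
amplitudes = [0.50, 0.52, 0.49, 0.51]       # hypothetical peak amplitudes

jitter_local = local_perturbation(periods)      # ratio; multiply by 100 for percent
shimmer_local = local_perturbation(amplitudes)  # ratio; multiply by 100 for percent
```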
12. RUN DATA ANALYSIS R SCRIPTS: PRELIMS
Original script from CITATION, adjusted to creak by Author1 (2023-2024).
input = acoustic measures for each vowel (spectral_measures.txt, pitchtrack_jitter_shimmer.txt) (outputs from steps 10 and 11)
output = data frames with socio and ling variables (PT_full.csv, PT.csv, PT_clean.csv, PS_long.csv, PS.csv)
Rscript creak_script_prelim.R
13. RUN DATA ANALYSIS R SCRIPTS: CLEANING1
Original script from CITATION, adjusted to creak by Author1 (2023-2024).
input = data frames with socio and ling variables (PS.csv) (output from step 12)
output = data frames after round 1 cleaning and vmeans calculated (PS_int.csv, PS_final.csv)
Rscript creak_script_cleaning1.R
14. RUN DATA ANALYSIS R SCRIPTS: PLOTS
Original script from CITATION, adjusted to creak by Author1 (2023-2024).
input = data frames after round 1 cleaning and vmeans calculated (PS_final.csv) (output from step 13)
output = plotted results (.bmp) + data frames after round 2 cleaning (PS_stats.csv, PS_cleanH1H2.csv, PS_cleanCPP.csv, PS_cleanHNR05.csv)
Rscript creak_script_plots.R
15. RUN DATA ANALYSIS R SCRIPTS: STATS1
Original script from CITATION, adjusted to creak by Author1 (2023-2024).
input = data frames after round 2 cleaning (PS_stats.csv, PS_cleanH1H2.csv, PS_cleanCPP.csv, PS_cleanHNR05.csv) (output from step 14)
output = data frames after round 3 cleaning (PS_stats.csv, PS_cleanH1H2.csv, PS_cleanCPP.csv, PS_cleanHNR05.csv) + lmer model summaries and visualizations
Rscript creak_script_stats1.R
16. RUN DATA ANALYSIS R SCRIPTS: STATS2
Original script written by Author1 & Author2 (8.11.2024).
input = data frames after round 3 cleaning (PS_stats.csv, PS_cleanH1H2.csv, PS_cleanCPP.csv, PS_cleanHNR05.csv) (output from step 15)
output = brms model summaries and visualizations
Rscript creak_script_stats2.R
17. RUN DATA ANALYSIS R SCRIPTS: RERUN PLOTS
Original script from CITATION, adjusted to creak by Author1 (2023-2024).
input = data frames after round 3 cleaning (PS_stats.csv, PS_cleanH1H2.csv, PS_cleanCPP.csv, PS_cleanHNR05.csv) (output from step 15)
output = new plotted results from cleaned final data frames (.bmp)
Rscript creak_script_plotsrerun.R
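Steps 12 to 17 depend on each other through the .csv files they read and write, so they must be run in order. The sketch below records the inputs and outputs listed above and checks that every input is produced by an earlier step before anything runs. It assumes `Rscript` is on the PATH and that the scripts live in the working directory; the function names are hypothetical:

```python
import subprocess

ROUND_FILES = ["PS_stats.csv", "PS_cleanH1H2.csv", "PS_cleanCPP.csv", "PS_cleanHNR05.csv"]

# (script, inputs, outputs) as listed in steps 12-17; plots and model summaries are omitted.
STEPS = [
    ("creak_script_prelim.R",
     ["spectral_measures.txt", "pitchtrack_jitter_shimmer.txt"],
     ["PT_full.csv", "PT.csv", "PT_clean.csv", "PS_long.csv", "PS.csv"]),
    ("creak_script_cleaning1.R", ["PS.csv"], ["PS_int.csv", "PS_final.csv"]),
    ("creak_script_plots.R", ["PS_final.csv"], ROUND_FILES),
    ("creak_script_stats1.R", ROUND_FILES, ROUND_FILES),
    ("creak_script_stats2.R", ROUND_FILES, []),
    ("creak_script_plotsrerun.R", ROUND_FILES, []),
]

def missing_inputs(steps, available):
    """Return (script, [missing files]) for each step whose inputs are not yet available."""
    available = set(available)
    problems = []
    for script, inputs, outputs in steps:
        missing = [f for f in inputs if f not in available]
        if missing:
            problems.append((script, missing))
        available.update(outputs)
    return problems

def run_all(steps):
    """Run each R script in order; stop at the first failure."""
    for script, _, _ in steps:
        subprocess.run(["Rscript", script], check=True)
```

Step 15 overwrites the four round 2 files written by step 14, so rerunning step 14 afterwards would replace the round 3 versions that steps 16 and 17 expect.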