@[toc](Materials, data, and scripts for the Lexical Richness project)

# Overview

## Datasets

The `FrenchData`, `GermanData` and `PortugueseData` directories contain nine csv files each. (`*` stands for `French`, `German` or `Portuguese`.)

### Ratings*.csv

All individual lexical richness ratings; unfiltered. The rating data were collected using the `test.php` pages on the rating platform (see project component "Rating platform"). The file `RatingsPortuguese.csv` lacks a row with these column headers; they are added in the analysis scripts.

- `RaterID`: unique identifier for each rater.
- `Batch`: The texts were presented in batches of 50 to 52 texts. This is the batch of texts the raters were assigned to.
- `Text`: Path to the txt file containing the text. The string before the first slash specifies the main directory; the string between the two slashes specifies the `Batch`; the string after the second slash is the unique text ID.
- `Trial`: Order in which the texts were presented.
- `Time`: Unix time stamp of the rating in milliseconds.
- `Rating`: Lexical richness rating (from 1 to 9).

### Questionnaire_*Ratings.csv

Background questionnaire data for all raters; unfiltered. The questionnaire data were collected using the `fb_fr.php`, `fb_de.php` and `fb_pt.php` pages on the rating platform (see project component "Rating platform"). The file `Questionnaire_PortugueseRatings.csv` lacks a row with these column headers; they are added in the analysis scripts.

- `RaterID`
- `Batch`
- `Sex`: self-explanatory.
- `Age`: in years.
- `NativeLanguage`:
    + `mono-French`, `mono-German` and `mono-Portuguese` mean that the rater considers him- or herself a native speaker of French, German or Portuguese only.
    + `bi-French`, `bi-German` and `bi-Portuguese` mean that the rater considers him- or herself a native speaker of French, German or Portuguese *and* another language. This other language is listed in `NativeLanguage2`.
    + `other` means that the rater did not consider him- or herself a native speaker of French, German or Portuguese. Their native language is then listed in `NativeLanguageOther`.
- `NativeLanguage2`: A text-field entry (see above). `NA` means not applicable.
- `NativeLanguageOther`: A text-field entry (see above). `NA` means not applicable.
- `Country`: The country the rater hailed from; usually abbreviated with the country code (e.g., `CH` = Switzerland). `--` means not specified.
- `Education`: Highest degree. `NoDegree` = no degree. `Secondary` = secondary education. `EFZ` = certificate of proficiency/competence ('Eidgenössisches Fähigkeitszeugnis'). `Bachelor` = Bachelor degree. `Master` = Master degree. `PhD` = Ph.D. degree.
- `Profession`:
    + `Linguist`
    + `Student`: Field specified in `StudentFollowUp`.
    + `Teacher`: Level specified in `TeacherFollowUp`.
    + `Other`: Specified in `OtherOccupation`.
- `OtherOccupation`: A text-field entry (see above). `NA` means not applicable.
- `TeacherFollowUp`: Check-box entry (see above): `Kindergarten`, `Primary` (primary school), `SecondaryAdults` (secondary school and/or adults). Multiple selections possible.

### Raters_*.csv

Like `Questionnaire_*Ratings.csv` but filtered. See the technical report for the filtering criteria. The file `Raters_Portuguese.csv` has slightly different headers.

### *_complete.csv

`Ratings*.csv` merged with `Raters_*.csv` and with the training texts thrown out. Some column names are duplicated (e.g., `Batch.x` and `Batch.y`).
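For concreteness, here is a minimal R sketch of such a merge. The use of base R's `merge()` and of `RaterID` as the merge key are assumptions; the actual code lives in the analysis scripts, and the removal of the training texts is omitted here.

```r
# Minimal sketch of how a *_complete.csv file can come about (assumptions:
# base R merge() with RaterID as the key; training-text removal omitted).
ratings <- read.csv("FrenchData/RatingsFrench.csv")
raters  <- read.csv("FrenchData/Raters_French.csv")
complete <- merge(ratings, raters, by = "RaterID")

# Columns present in both inputs get the .x/.y suffixes seen in the files:
grep("\\.(x|y)$", names(complete), value = TRUE)  # e.g. "Batch.x" "Batch.y"
```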
### meanRatingPerText_*.csv

The average rating per text after filtering, plus some additional information about the texts.

* `Text`: Path to the text (including text name).
* `meanRating`: The text's mean rating (1 to 9).
* `nRatings`: The number of ratings `meanRating` is based on.
* `Class`: An identifier specifying the class the pupil who wrote the text was in.
* `ControlGroup`: `FALSE` if the child who wrote the text was a French-Portuguese or German-Portuguese bilingual; `TRUE` otherwise.
* `Learner`: The child's ID within the class.
* `TextType`: `narr`ative or `arg`umentative.
* `Language`: `F` = French, `D` = German (Deutsch), `P` = Portuguese.
* `Time`: Was the text written at wave 1, 2, or 3?
* `Group`: `PT-FR` = Portuguese-French bilingual. `PT-DE` = Portuguese-German bilingual. `FR (comparison)`, `DE (comparison)` and `PT (comparison)`: Child in the French, German or Portuguese comparison group (i.e., not French/German-Portuguese bilinguals).

### LexicalDiversity*.csv

Measures of lexical diversity (i.e., of word variation) computed using the scripts in the project component "Computer code for cleaning, tagging, and analysing the texts". The technical report (Chapter 5) explains each column in greater detail.

### LexicalSophistication*_unlemmatised.csv

Measures of lexical sophistication (i.e., of word rarity) based on unlemmatised corpus frequencies, computed using the scripts in the project component "Computer code for cleaning, tagging, and analysing the texts". The technical report (Chapter 6) explains each column in greater detail.

### LexicalSophistication*_lemmatised.csv

Measures of lexical sophistication (i.e., of word rarity) based on lemmatised corpus frequencies, computed using the scripts in the project component "Computer code for cleaning, tagging, and analysing the texts". The technical report (Chapter 6) explains each column in greater detail.

### EvennessDisparityDispersion*.csv

Measures corresponding to Jarvis' dimensions of word evenness, disparity, and dispersion, computed using the scripts in the project component "Computer code for cleaning, tagging, and analysing the texts". The technical report (Chapter 7) explains each column in greater detail.

## Raw texts

The project component "French, German, and Portuguese texts as shown to the raters" contains 7z files with all texts as shown to the raters.

## Scripts

The scripts `inspect_ratings_*.R` read in and clean the CSV files and compute the mean ratings' reliabilities. The scripts `predict_ratings_*.R` are used to try to predict the mean lexical richness ratings per text using the lexical indices. The script `predict_ratings_summary.R` draws some descriptive graphs and evaluates the models' performance on the hold-out sets. The files `predict_ratings_*.R` and `predict_ratings_summary.R` were updated in May 2018 because we recomputed the R² values as the proportional drop in the residual sum of squares relative to the reference model rather than as the squared correlation between the predicted and observed values.

## Results summary

`ModelPerformance.csv` summarises the predictive models' performance both in cross-validation and on the hold-out sets. This file is created in the `predict_ratings_summary.R` script; please refer to that script for details. This file was updated in May 2018 because we recomputed the R² values as the proportional drop in the residual sum of squares relative to the reference model rather than as the squared correlation between the predicted and observed values.
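To make the R² change concrete, here is an illustrative R sketch of the two definitions. The function and variable names are hypothetical; see `predict_ratings_summary.R` for the actual computation.

```r
# Illustrative sketch of the two R² definitions (hypothetical names).
rss <- function(observed, predicted) sum((observed - predicted)^2)

# Since May 2018: R² as the proportional drop in the residual sum of
# squares relative to the reference model's predictions.
r2_drop <- function(observed, predicted, reference_predicted) {
  1 - rss(observed, predicted) / rss(observed, reference_predicted)
}

# Before May 2018: R² as the squared correlation between the predicted
# and the observed values.
r2_cor <- function(observed, predicted) cor(observed, predicted)^2
```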
* `Language`: French, German or Portuguese.
* `Model`: No predictors (reference), Black-box approach, 6-dimension approach, Guiraud only.
* `Evaluation`: Performance in `Cross-validation` or on the `Test set`.
* `Metric`: `RMSE` (root mean squared error) or `R²` (coefficient of determination).
* `Estimate`
* `SE`: Standard error of the estimate. This is the standard error of the 16 cross-validation estimates (for cross-validation) or the standard deviation of the bootstrapped distribution (for test set performance; see technical report, Chapter 14, for details).
* `CI_lower`: The lower bound of a 95% confidence interval. For cross-validation estimates, this is based on a t(15) distribution. For test set estimates, this is based on the bootstrapped distribution.
* `CI_upper`: The upper bound of a 95% confidence interval.

# Technical report

The technical report documents, in tedious detail, all steps taken in the Lexical Richness project. You'll also find a description of the variables in the `LexicalDiversity*.csv`, `LexicalSophistication*.csv` and `EvennessDisparityDispersion*.csv` files in this technical report.

# Running the analyses yourself

You'll need `R`, which you can download for free from http://r-project.org.

If you only want to reproduce or build on the predictive modelling (using the same predictor data), download the `FrenchData`, `GermanData` and `PortugueseData` folders as well as the `inspect_ratings*.R` and `predict_ratings*.R` scripts. Run the `inspect_ratings*.R` scripts first, then the language-specific `predict_ratings*.R` scripts, then the `predict_ratings_summary.R` script (see the sketch below).

If you also want to reproduce or build on the computation of the predictor data, download all texts ("French, German, and Portuguese texts as shown to the raters") and all files in the "Computer code for cleaning, tagging, and analysing the texts" component. Then run the `MasterScript*.R` files. Note, though, that some steps have to be carried out on the command line (I used Linux). You should probably read the technical report, too.
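As a rough sketch of the first route, the scripts could be sourced from R in the following order. The language-specific file names below are guesses based on the `*` patterns above, so check the actual names after downloading.

```r
# Hypothetical sketch of the script order for reproducing the predictive
# modelling (file names are assumptions based on the naming patterns above).
setwd("~/LexicalRichness")  # wherever you saved the data and scripts

for (lang in c("French", "German", "Portuguese")) {
  source(paste0("inspect_ratings_", lang, ".R"))  # read and clean the CSVs
  source(paste0("predict_ratings_", lang, ".R"))  # fit the predictive models
}
source("predict_ratings_summary.R")  # descriptive graphs, model evaluation
```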