## Introduction

We used the following software:

- [Stata 13][6]
- [R][7] (versions changed over time, but the last we used was 3.2.4) and the packages [tm][2], [wordcloud][3], [ggplot2][4], [MASS][5]
- Adobe Photoshop for post-processing of figures (e.g., adding legends and topic labelling within figures)

The NLP pipeline was implemented in Java.

If you have any issues finding what you are looking for, or if you have any questions, feel free to contact the corresponding author (Julia Rohrer). Contact Martin Brümmer for questions regarding the NLP pipeline.

Note that you will have to exchange the paths for the proper working directories on your PC if you plan to use any of these scripts yourself. Also, please note that the code we wrote is not optimized but rather grew "organically". So if you decide to run your own analyses, just pick the bits and pieces that you consider helpful.

## Data

We are not able to make the data sets used in our study accessible because of third-party restrictions. However, the data are made available to the scientific community after signing a data distribution contract with DIW Berlin. See [DIW Berlin: Data][1] or contact soepmail@diw.de. Please note that the textual data we used are not part of the standard data distribution of the SOEP; they need to be explicitly requested.

## Scripts & Code

**Analysis of selection effects**

Run in Stata. Requires the original data set (long format) that we received from the DIW. See the do-files for more information on the included variables.

- [First do-file: Prepare data for analysis][8]
- [Second do-file: Multilevel models][9]

**Pre-processing of textual data**

The NLP pipeline is implemented in Java. To run it, import the Textual Processing Software folder into the IDE of your choice. Follow the [README.md][23] for further guidelines.
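As a rough illustration of the kind of dictionary-based pre-processing such a pipeline performs, here is a minimal Python sketch. This is an assumption-laden toy, not the actual Java implementation: the dictionary entries, tokenization, and processing order below are all made up for demonstration.

```python
# Illustrative sketch only: the real pipeline is implemented in Java and reads
# its dictionaries from the Textual Processing Resources folder.
# All dictionary contents below are hypothetical examples.

abbreviations = {"z.b.": "zum beispiel", "usw.": "und so weiter"}  # hypothetical
stopwords = {"und", "der", "die", "das"}                           # hypothetical
stemming = {"sorgen": "sorg", "arbeitslosigkeit": "arbeitslos"}    # hypothetical

def preprocess(answer: str) -> list[str]:
    """Lowercase, expand abbreviations, drop stopwords, apply dictionary stemming."""
    tokens = answer.lower().split()
    tokens = [abbreviations.get(t, t) for t in tokens]  # expand abbreviations
    tokens = " ".join(tokens).split()                   # re-tokenize expanded forms
    tokens = [t for t in tokens if t not in stopwords]  # remove stopwords
    return [stemming.get(t, t) for t in tokens]         # dictionary-based stemming

print(preprocess("Sorgen und Arbeitslosigkeit usw."))
# → ['sorg', 'arbeitslos', 'so', 'weiter']
```

The real pipeline additionally spell-checks (with the exception dictionary guarding against systematic errors), which this sketch omits.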
The pipeline depends on wordlists and dictionaries in the Textual Processing Resources folder, namely:

* [Abbreviation dictionary][17]: A list of common abbreviations and the full forms with which they are replaced
* [Exception dictionary][18]: A list of words that should not be spell-checked because doing so results in systematic errors
* [Expansion dictionary][19]: A list of stemmed word forms and the full forms they are to be replaced with for readability and visualization
* [Stemming dictionary][20]: A list of full word forms and their stemmed equivalents for stemming; a heavily modified form of the [Snowball Stemmer German stemming dictionary][21]
* [Stopword list][22]: A list of stopwords to be removed because they are semantically meaningless

**LDA**

Check the [README.md][23] for the MalletClusterer we used for topic modeling. The results were translated to English and plotted in R. The data input used here is one tsv file per topic, containing the words and their respective weights within the topic.

- [Translation file German -> English][10] (note that this file is not identical to the one we provided as Supporting Material; for presentation as Supporting Material, we removed untranslated words and the second column, which had included some word-count information from a former version of the data)
- [R script to create topic wordclouds][11]

**Word-level analyses**

Word-level correlational analyses are run on a rather sparse data matrix with one column for each word tested (0 = word not used in answer, 1 = word used in answer).

- [R script to run a large number of linear models][12] - this is anything but efficient code, so in case you want to run this type of analysis yourself, I recommend writing your own script.
- [R script to plot the results of these analyses][13] - we used Photoshop to add legends and remove the border lines.

**Topic-level analyses**

Topic-level analyses are run on data with additional columns, that is, one for each topic.
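The word-level setup described above (one 0/1 indicator column per word, one simple linear model per word) can be sketched as follows. The original scripts are in R; this Python version with made-up data is only meant to show the shape of the analysis, and every variable name here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_words = 200, 5

# Hypothetical data: 0/1 word-use indicators and some continuous outcome.
word_matrix = rng.integers(0, 2, size=(n_people, n_words))
outcome = rng.normal(size=n_people)

# One linear model per word: outcome ~ intercept + word indicator.
slopes = []
for j in range(n_words):
    X = np.column_stack([np.ones(n_people), word_matrix[:, j]])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    slopes.append(beta[1])  # mean outcome difference: word users vs. non-users

print(slopes)
```

Looping over `lstsq` calls mirrors the admittedly inefficient per-word loop in the original R script; a vectorized formulation would be much faster for thousands of words.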
- [Topics over time][14] - calculates the measure of variability we used in this study (CFB) and plots the results; furthermore, it plots the trajectories of selected topics over time.
- [Topics & close-ended worry items][16] - calculates the relative risk of topic occurrence for people who answered that they were very worried and plots the results. We used Photoshop to label the strongest associations.
- [Topic-level correlational analyses][15] - basically the same as the word-level correlational analyses, but run on topic occurrence instead of the occurrence of single words. The script also includes code to plot the results. We used Photoshop to label the strongest associations.

[1]: http://www.diw.de/en/diw_02.c.222517.en/data.html
[2]: http://cran.r-project.org/web/packages/tm/index.html
[3]: http://cran.r-project.org/web/packages/wordcloud/index.html
[4]: http://cran.r-project.org/web/packages/ggplot2/index.html
[5]: http://cran.r-project.org/web/packages/MASS/index.html
[6]: http://www.stata.com/stata13/
[7]: http://cran.r-project.org/
[8]: http://osf.io/q69h2/
[9]: http://osf.io/dtq9s/
[10]: http://osf.io/rxfnw/
[11]: http://osf.io/jr5zh/
[12]: http://osf.io/fmj5k/
[13]: http://osf.io/x3myt/
[14]: https://osf.io/tx82b/
[15]: https://osf.io/9ykjn/
[16]: https://osf.io/n4m9q/
[17]: https://osf.io/9b4qx/
[18]: https://osf.io/zv569/
[19]: https://osf.io/qjtpm/
[20]: https://osf.io/gazkd/
[21]: http://snowball.tartarus.org/algorithms/german/diffs.txt
[22]: https://osf.io/cewns/
[23]: https://osf.io/vczt3/
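The relative risk used for the close-ended worry items is the standard ratio P(topic mentioned | very worried) / P(topic mentioned | not very worried). A minimal Python sketch with made-up data (the actual script is in R, and the toy arrays here are assumptions):

```python
import numpy as np

# Hypothetical data: one entry per respondent.
topic_mentioned = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # topic occurs in open-ended answer
very_worried    = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # close-ended item: "very worried"

# Probability of mentioning the topic in each group, then their ratio.
p_worried = topic_mentioned[very_worried == 1].mean()  # 0.75
p_not     = topic_mentioned[very_worried == 0].mean()  # 0.25
relative_risk = p_worried / p_not

print(relative_risk)  # → 3.0
```

A relative risk above 1 means the topic is mentioned more often by very worried respondents; in the real analysis this is computed per topic and the strongest associations are labelled in the figure.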