**UPDATE (March 2022):** A paper based on the results reported in the master's thesis, with some additional analyses, has been published in *Computational Statistics*. Please cite this paper from now on:
**Pargent, F., Pfisterer, F., Thomas, J., & Bischl, B. Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features. *Computational Statistics* (2022). https://doi.org/10.1007/s00180-022-01207-6**
For questions and remarks, feel free to contact Florian.Pargent@psy.lmu.de.
----------
CONTENT:
**upload_datasets/**:
- scripts that were used to upload some benchmark datasets to OpenML
**analysis/high_cardinality_benchmark/**:
- *main.R* builds a batchtools registry containing all computational jobs and sources most of the other scripts.
- after the jobs have been run on a compute cluster, *collect_results.R* extracts the results from the registry and saves the preprocessed results in *results.rds*.
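
The registry workflow described above can be illustrated with a minimal, self-contained sketch. It is not the code from *main.R* or *collect_results.R*: the problem name `toy_data`, the algorithm name `count_levels`, the toy data, and the output path are all hypothetical, and the jobs run locally in the current R session instead of on a cluster.

```r
library(batchtools)

# Temporary registry (file.dir = NA) so the sketch leaves no files in the repo.
reg <- makeExperimentRegistry(file.dir = NA, seed = 1)

# Hypothetical problem: toy data with a single categorical feature.
toy <- data.frame(
  x = factor(sample(letters, 100, replace = TRUE)),
  y = rnorm(100)
)
addProblem(name = "toy_data", data = toy, reg = reg)

# Hypothetical algorithm: report how many levels the feature has.
addAlgorithm(
  name = "count_levels",
  fun = function(job, data, instance, ...) list(n_levels = nlevels(data$x)),
  reg = reg
)

# Define the jobs: every problem crossed with every algorithm, 2 replications.
addExperiments(repls = 2, reg = reg)

# Run the jobs (in this session by default) and wait until they finish.
submitJobs(reg = reg)
waitForJobs(reg = reg)

# Collect the results from the registry and save them as an .rds file.
results <- reduceResultsDataTable(reg = reg)
out_file <- file.path(tempdir(), "results.rds")
saveRDS(results, out_file)
```

On a real cluster, the jobs would be submitted through cluster functions configured for the scheduler in use; that configuration is omitted here.
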
**doc/**:
- *high_card_final_datasets.Rmd* documents all datasets used in the benchmark, along with some remarks on why they were included; outputs *high_card_final_datasets.html* as well as *analysis/high_cardinality_benchmark/oml_ids.rds* and *analysis/high_cardinality_benchmark/descr_dat.rds*, which are used in the benchmark and the manuscript
- *paper.Rmd* together with *appendix.Rmd* is a reproducible script to build the paper submitted as a master's thesis in March 2019; outputs *paper.pdf*
- *references.bib* contains all references
- *sessionInfo_220319* is a text file documenting the package versions used to run the benchmark analysis on the Linux Cluster of the Leibniz Supercomputing Centre in Garching