This dataset is designed for comparing algorithms that adapt a language model to the writing of a particular user. It contains the sent email messages of Enron employees, separated by user and ordered chronologically.
We based our dataset on the [Enron Personalization Validation Set][1] released by Google and used in this [CHI 2015][2] paper by Fowler et al. on language model personalization. Unlike the original dataset, ours provides the exact normalized text used in our experiments. We also provide related assets, such as our word list and baseline n-gram models, to facilitate future comparisons.
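Because each user's messages are in chronological order, an adaptation algorithm can be evaluated by scoring each message before the model is allowed to learn from it. The following is a minimal sketch of that loop, not the method used in our experiments: it swaps in a toy add-one-smoothed unigram model interpolated with a per-user word cache. The function name `adaptive_perplexity`, the `weight` parameter, and the in-memory list of strings are all hypothetical; the sketch makes no assumption about how the dataset files are laid out.

```python
import math
from collections import Counter


def adaptive_perplexity(messages, background, vocab_size, weight=0.5):
    """Per-word perplexity of one user's messages under a toy adaptive model.

    messages:   one user's messages as strings, oldest first.
    background: Counter of word counts standing in for a baseline model.
    vocab_size: assumed vocabulary size for add-one smoothing; the sketch
                assumes every scored word belongs to this vocabulary.
    weight:     hypothetical interpolation weight given to the user cache.
    """
    cache = Counter()  # words seen so far in this user's earlier messages
    bg_total = sum(background.values())
    log_prob = 0.0
    n_words = 0
    for message in messages:
        words = message.split()
        cache_total = sum(cache.values())
        for w in words:
            # Add-one smoothed unigram probability from the background counts.
            p_bg = (background[w] + 1) / (bg_total + vocab_size)
            if cache_total > 0:
                p_cache = cache[w] / cache_total
                p = (1 - weight) * p_bg + weight * p_cache
            else:
                p = p_bg  # nothing to adapt on before the first message
            log_prob += math.log(p)
            n_words += 1
        # Adapt only after scoring, so no message is scored on itself.
        cache.update(words)
    if n_words == 0:
        raise ValueError("no words to score")
    return math.exp(-log_prob / n_words)
```

Setting `weight=0.0` disables adaptation, so comparing the two results on the same user gives a rough measure of how much the cache helps.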
If you use this dataset, please cite our Interspeech 2023 paper:
@inproceedings{adhikary_personalization,
author = {Jiban Adhikary and Keith Vertanen},
title = {Language Model Personalization for Improved Touchscreen Typing},
booktitle = {Proceedings of the International Conference on Spoken Language Processing},
location = {Dublin, Ireland},
month = {August},
year = {2023},
}
This material is based upon work supported by the NSF under Grant No. IIS-1750193.
[1]: https://github.com/google-research-datasets/EnronPersonalizationValidation
[2]: https://research.google/pubs/pub43272/