This manuscript was submitted to the peer-reviewed International Journal of Artificial Intelligence in Education on December 2, 2019, and was accepted on May 14, 2020.
## Background ##
The goal of this paper is not only to improve the generalizability of AES but also to better understand what is inside the AES black box. First, this paper sets aside ASAP's D8 because of its very small sample size, its wide range of holistic scores, and its imbalanced score distribution. It opts instead for ASAP's D7, which also includes rubric scores provided by human raters. Second, given the very large number of writing features and the relatively small essay dataset, this paper applies feature selection to keep only the most important features. Feature selection was preferred over feature engineering for interpretability reasons: feature engineering aggregates variables, which makes the contribution of each original feature harder to trace. Third, this paper follows a rigorous process to improve the optimization of hyperparameters. Fourth, it analyzes which features are important for each rubric, using the permutation importance technique. Finally, the paper shows how the proposed AES system exceeds human performance, measured as the level of agreement among human raters.
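To make the permutation importance technique concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the toy linear predictor, the synthetic three-feature data, and the choice of mean squared error as the scoring metric are all assumptions made for illustration. The idea is to shuffle one feature column at a time and measure how much the model's error increases; a feature whose shuffling barely changes the error contributes little to the predictions.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Return, per feature, the mean increase in error after shuffling it.

    `predict` is any function mapping one feature row to a prediction.
    (Hypothetical helper; the metric and repeat count are assumptions.)
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving the other features intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            score = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical stand-in for a trained scorer; ignores the third feature."""
    return 2.0 * row[0] + 0.5 * row[1]


# Synthetic data: three features per "essay", targets produced by the toy model.
data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = [toy_predict(row) for row in X]

importances = permutation_importance(toy_predict, X, y)
for index, value in enumerate(importances):
    print(f"feature {index}: mean error increase = {value:.4f}")
```

In this toy setting the first feature should receive the largest importance, the second a smaller one, and the third exactly zero, because the stand-in model never reads it. Applied per rubric, the same procedure would be run once for each rubric-specific model and its scores.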