22.6K lines (blog posts) in 4 semicolon-separated columns. Any occurrences of semicolons in the texts have been replaced with the string < semic > (without the spaces). The dataset contains only blog texts automatically classified as written in English, collected from WordPress, Blogspot and Tumblr. Note that earlier versions of the dataset are erroneous due to a bug in the previous data-cleaning scripts.
Version 4 (final version, 2018-03-16) of the dataset, as published on the data science community website Kaggle.
Re-published for citation purposes.
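
The format described above (four semicolon-separated columns, with literal semicolons stored as the placeholder string) could be read roughly as follows. This is only a sketch under stated assumptions: the function name `read_blog_rows` is hypothetical, the sample row and its column meanings are invented for illustration, and the sketch assumes one record per line with no quoting. Check these against the actual file before relying on it.

```python
import csv
import io

# Placeholder used in the dataset for literal semicolons (written without spaces).
SEMIC_TOKEN = "<semic>"


def read_blog_rows(handle, expected_columns=4):
    """Yield rows from a semicolon-separated text stream.

    Hypothetical helper: restores literal semicolons in every field and
    rejects rows that do not have the expected number of columns.
    """
    # QUOTE_NONE: treat quote characters in blog text as ordinary characters
    # (an assumption; the description does not mention quoting).
    reader = csv.reader(handle, delimiter=";", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != expected_columns:
            raise ValueError(
                f"line {reader.line_num}: expected {expected_columns} "
                f"columns, got {len(row)}"
            )
        yield [field.replace(SEMIC_TOKEN, ";") for field in row]


# Invented sample row; the real column meanings are not stated in the description.
sample = io.StringIO("a1;wordpress;Some title;First part<semic> second part\n")
rows = list(read_blog_rows(sample))
```

For a real file, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`, where the path and the encoding are likewise assumptions. Very long posts may exceed the `csv` module's default field size, which can be raised with `csv.field_size_limit`.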