This dataset comprises data gathered for, and created in the course of, the paper [Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity][1]. It contains a large legal text corpus, several word2vec embedding models of the words in that corpus, and a set of legal domain gazetteer lists.

1. **Legal Case Corpus:** This corpus contains 39,155 legal cases, including 22,776 taken from the United States Supreme Court. For the convenience of future researchers, we have also included 29,404 cases after some preprocessing. A map (key) for the folder numbering is included in the provided zip file.
2. **Legal Domain Word2Vec models:** Two word2vec models trained on the above corpus are included: one trained on the raw legal text and one trained on the same text after lemmatization.
3. **Legal Domain gazetteer lists:** A number of gazetteer lists, built by a legal professional to indicate domain specific semantic groupings, are included.
4. **Word2Vec results:** Finally, the results obtained in the [above paper][1] using the trained word2vec models are included.

----------

For convenience, the abstract of the paper [Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity][1] is given below.

*Semantic similarity measures are an important part of Natural Language Processing tasks. However, semantic similarity measures built for general use do not perform well within specific domains. Therefore, in this study we introduce a domain specific semantic similarity measure created by the synergistic union of word2vec, a word embedding method used for semantic similarity calculation, and lexicon based (lexical) semantic similarity methods. We prove that this proposed methodology outperforms both word embedding methods trained on a generic corpus and word embedding methods trained on a domain specific corpus, neither of which uses lexical semantic similarity methods to augment the results. Further, we prove that text lemmatization can improve the performance of word embedding methods.*

----------

To load the Word2Vec models, use:

```python
from gensim.models import KeyedVectors

# Load a binary word2vec model shipped with this dataset
model = KeyedVectors.load_word2vec_format(
    "/content/legallemmatextreplacewithnnp.bin",
    binary=True,
    unicode_errors="ignore",
)
```

[1]: https://goo.gl/E1dkbP
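Once loaded, the model is a standard gensim `KeyedVectors` object and can be queried with the usual gensim API. The sketch below is only illustrative: the query words are arbitrary examples, not queries taken from the paper, and since this file is the lemmatized model, query words should themselves be lemmatized.

```python
from gensim.models import KeyedVectors

# Load the lemmatized legal-domain model (same call as above).
model = KeyedVectors.load_word2vec_format(
    "/content/legallemmatextreplacewithnnp.bin",
    binary=True,
    unicode_errors="ignore",
)

# Cosine similarity between two words; this word pair is an
# arbitrary example, not drawn from the paper.
if "court" in model and "judge" in model:
    print(model.similarity("court", "judge"))

# The five nearest neighbours of a word in the embedding space.
if "plaintiff" in model:
    print(model.most_similar("plaintiff", topn=5))
```

The gazetteer lists can be used alongside the embeddings in the spirit of the abstract above. The paper's actual combination scheme is not reproduced here; the following is only a loose sketch under two stated assumptions: a hypothetical gazetteer file `legal_entities.txt` with one term per line (the actual file layout may differ), and a simple fixed bonus when both words fall in the same gazetteer group.

```python
def load_gazetteer(path):
    # Assumed format: one term per line (check the actual files in the dataset).
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def combined_similarity(model, gazetteer, w1, w2, boost=0.1):
    # Loose illustration only: start from the word2vec cosine similarity
    # and add a fixed bonus when both words belong to the same gazetteer
    # group. This is NOT the formula used in the paper.
    score = float(model.similarity(w1, w2))
    if w1 in gazetteer and w2 in gazetteer:
        score = min(1.0, score + boost)
    return score

gazetteer = load_gazetteer("legal_entities.txt")  # hypothetical file name
print(combined_similarity(model, gazetteer, "plaintiff", "defendant"))
```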