# Overview

This repository contains the data accompanying the paper [Characterizing Short-Term Memory in Neural Language Models](https://aclanthology.org/2022.conll-1.28/).

# Code

The accompanying code is available here: https://github.com/KristijanArmeni/verbatim-memory-in-NLMs

# [Files](https://osf.io/5gy7x/files/osfstorage)

The files tab contains the following subfolders.

## /dataset_wikitext-103_retokenized

Contains the .json files with the [Wikitext-103 dataset](https://paperswithcode.com/dataset/wikitext-103) retokenized with a BPE tokenizer that was itself retrained on the Wikitext-103 dataset for the purposes of training a smaller GPT2-like transformer language model (see below). Each file contains the indices that encode the tokens in the train/validation/test splits. There are two train sets: one with a 40M-token subset and one with an 80M-token subset of the full Wikitext-103 dataset.

### /tokenizer_wikitext-103

Contains the merges.txt and vocab.json files needed if you want to load the tokenizer with `.from_pretrained("/path/to/tokenizer/folder")` in HuggingFace. Note that the tokenizer is also available on the HuggingFace hub: https://huggingface.co/Kristijan/wikitext-103-tokenizer

## /input_files

TBA

## /model_checkpoints

### /awd_lstm

- `LSTM-3-layer_adam.pt` | the PyTorch class (binary) and the weights (compiled under PyTorch 0.4!)
- `LSTM-3-layer_adam_statedict.pt` | the PyTorch state dict (containing only the model weights)

### /dictionary

- `wt103_word2idx.json` | a dict to convert a token into an index (for the Wikitext-103 dataset)
- `wt103_idx2word.json` | a list of tokens, can be used to convert an index to a token

These can be read into Python like this:

```python
import json

f = "./wt103_word2idx.json"
with open(f, "r") as fh:
    word2idx = json.load(fh)  # will load a python dict
```

## /gpt2_*

Each gpt2 folder contains the checkpoint folder with the files needed to use `model = GPT2LMHeadModel.from_pretrained("/path/to/checkpoint/folder")` in HuggingFace.

- gpt2_40m_12-768-1024_a_02 (12 layers)
- gpt2_40m_6-768-1024_a_02 (6 layers)
- gpt2_40m_3-768-1024_a_02 (3 layers)
- gpt2_40m_1-768-1024_a_02 (1 layer)

The 12-layer checkpoint is also available from the HuggingFace hub (https://huggingface.co/Kristijan/gpt2_wt103-40m_12-layer):
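As a minimal usage sketch, the hub checkpoint and tokenizer can be loaded with the standard `transformers` API. The model and tokenizer IDs below are the ones linked above; the sample sentence is arbitrary, and the example assumes the tokenizer repo exposes the standard GPT-2 vocab.json/merges.txt files described above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hub IDs linked above (12-layer checkpoint and the retrained Wikitext-103 tokenizer)
tokenizer = GPT2TokenizerFast.from_pretrained("Kristijan/wikitext-103-tokenizer")
model = GPT2LMHeadModel.from_pretrained("Kristijan/gpt2_wt103-40m_12-layer")
model.eval()

# Arbitrary example sentence, just to check the model runs
text = "After the meeting , she wrote down the words from the list ."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, labels=enc["input_ids"])

# out.loss is the mean per-token cross-entropy; exponentiating gives perplexity
print(f"perplexity: {torch.exp(out.loss).item():.2f}")
```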