# Overview

This repository contains the data accompanying the paper [Characterizing Short-Term Memory in Neural Language Models](https://aclanthology.org/2022.conll-1.28/).

# Code

The accompanying code is available here: https://github.com/KristijanArmeni/verbatim-memory-in-NLMs

# [Files](https://osf.io/5gy7x/files/osfstorage)

The files tab contains the following subfolders.

## /dataset_wikitext-103_retokenized

Contains the .json files with the [Wikitext-103 dataset](https://paperswithcode.com/dataset/wikitext-103) retokenized with a BPE tokenizer that was itself retrained on the Wikitext-103 dataset for the purposes of training a smaller GPT2-like transformer language model (see below). Each file contains the indices that encode the tokens in the train/validation/test splits. There are two train sets: one with a 40M-token subset and one with an 80M-token subset of the full Wikitext-103 dataset.

### /tokenizer_wikitext-103

Contains the merges.txt and vocab.json files needed if you want to load the tokenizer with `.from_pretrained("/path/to/tokenizer/folder")` in HuggingFace. Note that the tokenizer is also available on the HuggingFace hub: https://huggingface.co/Kristijan/wikitext-103-tokenizer

## /input_files

TBA

## /model_checkpoints

### /awd_lstm

- `LSTM-3-layer_adam.pt` | the PyTorch class (binary) and the weights (compiled under PyTorch 0.4!)
- `LSTM-3-layer_adam_statedict.pt` | the PyTorch state dict (containing only the model weights)

### /dictionary

- `wt103_word2idx.json` | a dict to convert a token into an index (for the Wikitext-103 dataset)
- `wt103_idx2word.json` | a list of tokens, can be used to convert an index to a token

These can be read into Python like this:

```python
import json

f = "./wt103_word2idx.json"
with open(f, "r") as fh:
    word2idx = json.load(fh)  # will load a python dict
```

## /gpt2_*

Each gpt2 folder contains the checkpoint folder with the files needed to use `model = GPT2LMHeadModel.from_pretrained("/path/to/checkpoint/folder")` in HuggingFace.

- gpt2_40m_12-768-1024_a_02 (12 layers)
- gpt2_40m_6-768-1024_a_02 (6 layers)
- gpt2_40m_3-768-1024_a_02 (3 layers)
- gpt2_40m_1-768-1024_a_02 (1 layer)

The 12-layer checkpoint is also available from the HuggingFace hub (https://huggingface.co/Kristijan/gpt2_wt103-40m_12-layer):
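As a minimal usage sketch, the hub checkpoint and tokenizer can be loaded with the standard `transformers` API. The model and tokenizer IDs below are the ones linked above; the sample sentence is arbitrary, and the example assumes the tokenizer repo exposes the standard GPT-2 vocab.json/merges.txt files described above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hub IDs linked above (12-layer checkpoint and the retrained Wikitext-103 tokenizer)
tokenizer = GPT2TokenizerFast.from_pretrained("Kristijan/wikitext-103-tokenizer")
model = GPT2LMHeadModel.from_pretrained("Kristijan/gpt2_wt103-40m_12-layer")
model.eval()

# Arbitrary example sentence, just to check the model runs
text = "After the meeting , she wrote down the words from the list ."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, labels=enc["input_ids"])

# out.loss is the mean per-token cross-entropy; exponentiating gives perplexity
print(f"perplexity: {torch.exp(out.loss).item():.2f}")
```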