# Conceptual Organization is Revealed by Consumer Activity Patterns

## Data and Code

dunnhumby Research & Love Lab (University College London)

## Description

This repository contains all data and modelling code from the paper "Conceptual Organization is Revealed by Consumer Activity Patterns" by Adam Hornsby, Thomas Evans, Rosie Prior, Peter Riefer and Brad Love. Researchers who want to build on or reproduce this work can use this README to understand the structure of the repository and the data.

## Using this data and code

Please cite the arXiv paper or, once published, the published version of this paper when referencing this code or data.

Please note that this data and code is licensed under the Attribution-NonCommercial 4.0 International license.

## Repository structure (high level)

The repository has the following structure:

* **basket_data** - This folder contains the raw basket data used to develop the LDA model.
    * **raw_basket_data.txt** - The raw basket data used to develop the LDA model. Each row pertains to a basket. Baskets are represented with a comma-separated list of product codes.
    * **product_lookup.csv** - A lookup file containing the associated product `Description` for each `TPNB` (i.e. product code) shown in the raw basket data.
* **code** - This folder contains all modelling code.
    * **lda**
        * **learn_lda_topics.sc** - A Scala script that re-creates the final 25-topic LDA model solution (in Spark 1.6.0).
    * **classifiers**
        * **profile_modelling.py** - The main Python script, which re-creates the demographic modelling results reported in the paper.
        * **config.py** - The configuration file for the `profile_modelling.py` script.
* **consumer_data** - A folder containing raw data from the consumer study.
    * **consumer_resp_data.csv** - Responses from the 'spot the intruder' study conducted with real consumers.
* **expert_data** - A folder containing raw data from the expert study.
    * **survey_results** - Raw response data from the expert 'topic name agreement' study, conducted using participants from dunnhumby.
    * **top_5_lookup.csv** - A lookup describing the "top five" products associated with each surveyed topic, according to their relevancy scores. Also used in the consumer study.
* **model_data** - A folder containing results from the demographic modelling part of the paper.
    * **modelling_data.csv** - A dataset containing the target variables (age, gender and location) and the mean topic probabilities for each customer.

## Code usage

1. **learn_lda_topics.sc** - Execute this Scala code through Spark 1.6.0 using `spark-submit`.
2. **profile_modelling.py** - Run this Python code from the command line with `python profile_modelling.py`, using Python 2.7.

Note that `profile_modelling.py` requires:

```
pandas>=0.23
numpy>=1.15
scikit-learn>=0.20
```

## Data dictionaries

### Raw basket data

Number of rows: 1,253,183

This is a raw text file. It is comma-delimited but does not have a fixed width, and it has no headers. Each line contains a comma-separated list of TPNBs (i.e. product codes) representing a single basket of products. This data contains no customer, basket, time or date identifiers.

### Product lookup

Number of rows:

This data provides a lookup between the TPNB in the raw basket data and the product description:

| Column      | Description                             | Example value                   |
|-------------|-----------------------------------------|---------------------------------|
| TPNB        | The product code of the lookup          | 50006618                        |
| Description | The provided description of the product | T. EDAY VAL BEEF ROASTING JOINT |

### Expert study raw data

Number of rows: 51

Rows 3 and 4 contain a legend for each surveyed topic. The codes in this legend are used in the raw data below.

Rows 9 to 60 contain participants' responses (one row per participant). For this data, we have the following columns:

| Column       | Description | Example value |
|--------------|-------------|---------------|
| Participant  | An anonymised, unique identifier for each participant. This identifier is unique to this study and cannot be matched with identifiers in other datasets within this repository. | 0 |
| Order        | The order in which topics were presented to the participant | 7;4;3;5;1;9;2;6;10;8 |
| TX\_selected | The topic label selected by the participant, where `X` pertains to the topic number shown in the legend | 1 |
| TX\_option   | The alternative topic labels to select from | 1;3;6;8 |
| TX\_rtime    | The time taken (seconds) to make the response | 21.5 |

Rows 62 - 67 show summary statistics of responses for each topic.

### Consumer study raw data

Number of rows: 3,841

This data contains the responses made by consumers in the study, with one row per participant.

| Column         | Description | Example value |
|----------------|-------------|---------------|
| Participant    | An anonymised, unique identifier for each participant. This identifier is unique to this study and cannot be matched with identifiers in other datasets within this repository. | 0 |
| Date collected | The date of the survey | 24-Mar-17 |
| Time collected | The time of the survey | 21-24 |
| TopicShown     | The topic shown to the participant | Low calorie options |
| OddOneOut      | The product that the participant considered to be the 'intruder' product. The filename of the image associated with this product is also given. | Prepared Baby Sprouts%IMG(lmc_product1.jpg:w150)% |

### Demographic modelling data

Number of rows: 28,123

This data contains the features and target variables used for the demographic classifiers.

| Column                      | Description | Example value |
|-----------------------------|-------------|---------------|
| Customer                    | An anonymised, unique identifier for each participant. This identifier is unique to this study and cannot be matched with identifiers in other datasets within this repository. | 0 |
| Gender                      | Male = 0, Female = 1 | 1 |
| Age4Level2014               | Customer's age, discretized into 4 buckets. | 60+ |
| New\_Region\_6\_Level\_2016 | England = 1, Other = 0 | 1 |
| ...                         | The remaining 25 columns contain the mean, customer-level topic probabilities calculated across each basket purchased by the customer. Note that the baskets were taken exclusively from the `raw_basket_data.txt` file, and were not taken from an exhaustive list of all baskets purchased by those customers. | 0.64161 |

## Privacy

1. All consumer and participant identifiers have been pseudonymised in order to protect their privacy.
2. To protect the privacy of all participants and consumers, it is not possible to merge or join datasets from different studies. With the exception of lookup files, each dataset should be used in isolation from every other.
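
As an illustration of the raw basket and product lookup formats described in the data dictionaries, the following is a minimal sketch (not part of the repository's own code) of reading baskets and attaching product descriptions. It uses only the Python 3 standard library and small in-memory samples in place of the real files. It assumes that `product_lookup.csv` is comma-delimited with a header row naming the `TPNB` and `Description` columns; the helper names `read_baskets` and `read_product_lookup`, and all sample values other than the example row from the data dictionary, are hypothetical.

```python
import csv
import io


def read_baskets(lines):
    """Parse basket lines into lists of TPNB strings (one list per basket)."""
    baskets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Baskets have no fixed width, so split each line on commas.
        baskets.append([code.strip() for code in line.split(",") if code.strip()])
    return baskets


def read_product_lookup(handle):
    """Map each TPNB to its Description (assumes a header row is present)."""
    return {
        row["TPNB"].strip(): row["Description"].strip()
        for row in csv.DictReader(handle)
    }


# In-memory stand-ins for raw_basket_data.txt and product_lookup.csv.
# Only the 50006618 row comes from the data dictionary; the rest is made up.
sample_baskets = io.StringIO("50006618,11111111\n22222222\n")
sample_lookup = io.StringIO(
    "TPNB,Description\n"
    "50006618,T. EDAY VAL BEEF ROASTING JOINT\n"
    "11111111,HYPOTHETICAL PRODUCT A\n"
)

baskets = read_baskets(sample_baskets)
lookup = read_product_lookup(sample_lookup)

# Replace each product code with its description; codes missing from the
# lookup are marked rather than dropped.
described = [[lookup.get(code, "UNKNOWN") for code in basket] for basket in baskets]
```

To run this against the real files, the `io.StringIO` objects could be replaced with `open("basket_data/raw_basket_data.txt")` and `open("basket_data/product_lookup.csv", newline="")`, adjusting the paths to the local copy of the repository.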