<h1>Data Manual</h1>

----------

<h2>Data Contents</h2>

<p align="justify">The dataset can be downloaded from the <a href="https://osf.io/fc7ih/files/">data section</a> in the project. The dataset consists of three parts: Annotation, Metadata and Music tracks.</p>

<ul>
<li> **Annotation.zip** (41 KB): this file unzips into a folder called "*Annotation*". The folder holds a single file, *mturk_worker_labels.csv*, which contains the labels assigned by AMT workers to segments of the music tracks, in csv format. For the basic goal of computing consensus given noisy worker labels, this is the most important file.
<li> **Metadata.zip** (591 KB): this file unzips into a folder called "*Metadata*" containing metadata for the music tracks in xml format. Each xml file is described below.
<li> **Music.zip and Music.z(01-18)** (~2.2 GB): these files contain the actual music tracks in mp3 format. Please download all the parts and unzip "Music.zip" to get a folder called "*Music*" containing the mp3 files for all the music tracks in the dataset.
<li> **result_template.csv**: this file contains a template for the results in the submission format.
</ul>

<h2>Annotation (Worker Labels)</h2>

The label assignment of sound segments was carried out on <a href="https://www.mturk.com/mturk/welcome">Amazon Mechanical Turk</a> (AMT), a crowdsourcing platform that lets requesters recruit human workers for a variety of tasks through Human Intelligence Tasks (HITs). For our purposes, we initially conducted an EDM recruitment task to test the ability of workers to detect drops in a selection of labelled sound segments. The recruited workers were then assigned the main task of labelling unlabelled sound segments.

<h3>HIT Description</h3>

Workers must listen to three 15-second music segments. These segments are chosen because a comment containing the word "drop" falls within this time-frame. The 3 segments may come from the same track or from entirely different tracks. The workers then indicate whether they actually hear a drop in the 15-second segment by choosing one of the following options:

<ol>
<li> The start-point <em>timestamp-here</em> is perfectly positioned, as I can hear the drop (including the build-up) within the 15-second window (**Label: 1**)
<li> The start-point <em>timestamp-here</em> is poorly positioned, as I can hear only a part of the drop within the 15-second window (**Label: 2**)
<li> The start-point <em>timestamp-here</em> is wrong, as I cannot hear a drop within the 15-second window (**Label: 3**)
<li> If you think none of the above options is suitable, please write your opinion in this box: <em>text-input</em> (**Label: 4**)
</ol>

<h3>File Description</h3>

The labels may be found in the *Annotation/mturk_worker_labels.csv* file. The file is comma-separated with double quotation marks (") as the text delimiter. Each row corresponds to one worker's assignment of a particular HIT, in which the worker listens to 3 music segments, together with the properties of that HIT and the worker's responses. The columns of the CSV file fall into the following categories.

<h4><b>HIT Properties</b></h4>

Fields that describe the properties of the HIT:

<ul>
<li> <em>HITId</em> : Unique identifier for each HIT; every 3 rows share the same value since each HIT is performed by 3 workers.
<li> <em>HITTypeId</em> : Unique identifier for the type of HIT; identical across all rows.
<li> <em>Title</em> : Title assigned to the HIT; identical across all rows.
<li> <em>Description</em> : Text description of the HIT assigned by the requester; identical across all rows.
<li> <em>Keywords</em> : Keywords for the HIT assigned by the requester; identical across all rows.
<li> <em>CreationTime</em> : Time of deployment of the HIT; nearly identical across rows, varying only by a few seconds.
<li> <em>MaxAssignments</em> : The maximum number of workers per HIT; identical across all rows.
<li> <em>AutoApprovalDelayInSeconds</em> : The delay after which a HIT is automatically approved; identical across all rows.
<li> <em>Expiration</em> : The time of expiration of the HIT; nearly identical across rows, varying only by a few seconds.
</ul>

<h4><b>Worker Properties</b></h4>

Fields that describe the properties of the worker:

<ul>
<li> <em>AssignmentId</em> : Unique identifier linking a worker and a particular HIT; every row has a different value.
<li> <em>WorkerId</em> : Unique identifier for the worker (due to privacy concerns we use anonymous ids); a value may repeat any number of times since a worker is free to perform any number of different HITs.
<li> <em>AssignmentStatus</em> : Indicates whether the requester approved or rejected the Worker-HIT assignment; identical across all rows.
<li> <em>LifetimeApprovalRate</em> : Approval rate of a worker since their first HIT; all values are greater than 88%.
<li> <em>Last30DaysApprovalRate</em> : Approval rate of a worker in the last 30 days; identical across all rows.
<li> <em>Last7DaysApprovalRate</em> : Approval rate of a worker in the last 7 days; identical across all rows.
</ul>

<h4><b>Response Properties</b></h4>

Fields that correspond to the input and responses of the workers for a particular HIT. For the basic goal of computing consensus from noisy worker labels, the properties <em>Input.Track_ID_(1,2,3)</em>, <em>Input.start_time_(1,2,3)</em>, <em>Input.end_time_(1,2,3)</em> and <em>Answer.Q(1,2,3)</em> are used. For example, *Input.Track_ID_1* gives the track id of the first segment, *Input.start_time_1* gives the start time (in seconds) of the 15-second segment in the track, *Input.end_time_1* gives the end time (in seconds) of the segment, and *Answer.Q1* gives the label provided by the worker for the first segment.

<ul>
<li> <em>AcceptTime</em> : The time when the HIT was accepted by the worker.
<li> <em>SubmitTime</em> : The time when the HIT was completed and submitted by the worker.
<li> <em>AutoApprovalTime</em> : The time when the HIT submission was automatically approved by Amazon.
<li> <em>ApprovalTime</em> : The time when the HIT was approved by the requester.
<li> <em>WorkTimeInSeconds</em> : The time between the worker accepting and submitting the HIT, in seconds.
<li> <em>Input.Track_ID_(1,2,3)</em> : Unique identifier assigned by SoundCloud to the track from which the music segment is taken.
<li> <em>Input.start_time_(1,2,3)</em> : The start of the music segment within the track, in seconds.
<li> <em>Input.end_time_(1,2,3)</em> : The end of the music segment within the track, in seconds.
<li> <em>Answer.Q(1,2,3)</em> : The option chosen by the worker for the corresponding question.
</ul>

Note: the suffixes (1,2,3) refer to the three music segments presented sequentially in the HIT.
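As an illustration of the basic goal described above, the sketch below loads the worker labels and computes a naive per-segment majority vote. It assumes the pandas library is available; the column names come from the field list above, and a plain majority vote is only a baseline, not a prescribed consensus method.

<pre>
import pandas as pd

df = pd.read_csv('Annotation/mturk_worker_labels.csv')

# Reshape: each row holds three segments, so stack them into one
# (track, window, label) record per segment heard.
frames = []
for i in (1, 2, 3):
    cols = {'Input.Track_ID_%d' % i: 'track_id',
            'Input.start_time_%d' % i: 'start_time',
            'Input.end_time_%d' % i: 'end_time',
            'Answer.Q%d' % i: 'label'}
    frames.append(df[list(cols)].rename(columns=cols))
segments = pd.concat(frames, ignore_index=True)

# Naive baseline: take the most frequent label among the (up to 3)
# workers who judged the same 15-second window.
consensus = (segments
             .groupby(['track_id', 'start_time', 'end_time'])['label']
             .agg(lambda labels: labels.mode().iloc[0]))
print(consensus.head())
</pre>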
<h2>Metadata</h2>

There are 355 XML files in total, one per music track. The tracks carry 2200 comments in total, of which 591 contain the word **drop**. The name of each XML file is the corresponding Track ID. The structure of the XML files is as follows:

<pre>
<sound>
    <track></track>
    <comment>
        <user></user>
    </comment>
</sound>
</pre>

<h3>Properties</h3>

<ol>
<li> The XML files are encoded in <em>utf-8</em> and must be parsed and pre-processed accordingly.
<li> Only comments which contain English (ASCII) characters are included.
<li> All metadata are represented as string attributes on each XML tag and must be retrieved accordingly, e.g.:
<pre>
<sound>
    <track track_id="4856748" commentable="True" tags="None" release=""></track>
    <comment body="Nice:)" comment_id="86786">
        <user user_id="67863" username="bill"></user>
    </comment>
</sound>
</pre>
Note that <em>track_id</em> is a string, not an integer. A simple python parser that retrieves and prints the <em>created_at</em> attribute of a track and the <em>username</em> of each commenter is shown below:
<pre>
from xml.etree.ElementTree import XML

# Parse one metadata file (the file name is the Track ID)
with open('61852485.xml', encoding='utf-8') as xmlfile:
    e = XML(xmlfile.read())

# All metadata are string attributes on the tags
track = e.find('track')
print(track.get('created_at'))

for comment in e.findall('comment'):
    for user in comment.findall('user'):
        print(user.get('username'))
</pre>
<li> The following attributes are reported to contain Unicode characters outside the ASCII range and must be treated with caution:
<ul>
<li> <b>track</b> : title, description, owner_genre, tags, label_name, release
<li> <b>comment</b> : body
<li> <b>user</b> : username
</ul>
<li> If an attribute does not exist for the original track on SoundCloud, it is still incorporated into the tag but assigned "" to maintain consistency with the other tracks.
<li> Attributes that exist but are empty are assigned the value "None", to distinguish them from non-existent attributes.
<li> More information regarding the metadata can be found <a href="https://developers.soundcloud.com/docs/api/reference">here</a>.
</ol>

<h3>Filters Employed</h3>

<ol>
<li> Comments without a track timestamp were ignored.
<li> Comments must be 2 or more characters long.
<li> Non-English comments, and comments containing characters outside the ASCII range, were ignored.
<li> Comments containing the literal "http" were ignored.
<li> Comments by the track owner were ignored.
<li> Comments containing variations of the string "drop" (such as "dropppp", "droppin") were isolated as vital (see the sketch at the end of this page).
</ol>

<h2>Music</h2>

This folder contains the 355 Creative Commons EDM tracks downloaded from <a href="https://soundcloud.com/stream">SoundCloud</a> in mp3 format. The workers listen to 15-second segments from these tracks to assign the labels.
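To reproduce what a worker heard, one can cut the 15-second window out of the corresponding mp3 using the times from *mturk_worker_labels.csv*. The sketch below assumes the pydub package (which requires ffmpeg) and that the mp3 files in the *Music* folder are named by Track ID in the same way as the XML files; the naming convention is not stated in this manual, so please verify it against your download.

<pre>
from pydub import AudioSegment

# Hypothetical values taken from one row of mturk_worker_labels.csv:
# Input.Track_ID_1, Input.start_time_1 and Input.end_time_1.
track_id = '4856748'
start_time, end_time = 95.0, 110.0

# Assumption: mp3 files in Music/ are named by Track ID, like the XML files.
audio = AudioSegment.from_mp3('Music/%s.mp3' % track_id)

# pydub slices audio in milliseconds.
segment = audio[int(start_time * 1000):int(end_time * 1000)]
segment.export('%s_segment.mp3' % track_id, format='mp3')
</pre>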
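Finally, to make the comment filters in the <em>Filters Employed</em> section concrete, here is a minimal sketch. The exact patterns used to build the dataset are not given in this manual, so the regular expression and helper functions below are only illustrative assumptions; filter 1 depends on the comment's timestamp attribute and is omitted.

<pre>
import re

# Illustrative assumption: catch "drop", "dropppp", "droppin", etc.
DROP_PATTERN = re.compile(r'dro+p+\w*', re.IGNORECASE)

def passes_filters(body, commenter_id, owner_id):
    """Apply filters 2-5 from the list above to a single comment."""
    if len(body) < 2:                    # filter 2: at least 2 characters
        return False
    if any(ord(c) > 127 for c in body):  # filter 3: ASCII only
        return False
    if 'http' in body:                   # filter 4: no links
        return False
    if commenter_id == owner_id:         # filter 5: skip the track owner
        return False
    return True

def is_drop_comment(body):
    """Filter 6: does the comment contain a variation of "drop"?"""
    return bool(DROP_PATTERN.search(body))
</pre>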