<h1>2014 Crowdsourcing for Social Multimedia Task ReadMe</h1> This data set was published by the <a href="">MediaEval Multimedia Benchmark</a> Initiative within the context of the <a href="">MediaEval 2014 Crowdsourcing Task</a>. The task takes place during the month of September 2014. We are using the OSF for the first time in order to test its usefulness for such a task. Note that by using this data, we assume that you agree to the data licensing terms for the MediaEval 2014 Crowdsourcing Task specified in the <a href="">MediaEval 2014 Usage Agreement</a>. <p align="justify">To participate in the task during the month of September 2014, please follow these steps:</p> <ol> <li>Send an email to Karthik Yadati and Martha Larson to let them know that you intend to participate.</li> <li>Create a fork of this project.</li> <li>Develop and implement a system that addresses the task.</li> <li>Run your system on the entire data set. You may submit up to five runs to MediaEval 2014. A run is an experimental condition (i.e., a specific algorithm or a variation of your system).</li> <li>Create a folder in your branch named "Results" and store your (up to five) results files there. Make the folder available to <a href="">Karthik Yadati</a> and <a href="">Martha Larson</a>. You must do this by <strong>22 September 2014</strong>.</li> <li>Receive your evaluation results from the MediaEval organizers (takes ca. 2 days).</li> <li>Write a two-page working notes paper for the MediaEval proceedings using the MediaEval Working Notes format (<a href="">.cls is here</a>). We need to receive your paper by <strong>28 September 2014</strong> for it to be counted as an "official" MediaEval 2014 submission. In your working notes paper, you should cite the overview paper for the task, so you do not have to repeat the data set details.
An early draft of the overview paper is available <a href="">here</a>. For examples, please see the <a href="">MediaEval Workshop Notes Proceedings from 2013</a>.</li> </ol> <h2>Task description</h2> <p align="justify">The basic goal of this task is for participants to develop an algorithm that generates a single accurate label from multiple noisy labels collected using a crowdsourcing platform.</p> <p>The data for this task is a set of Creative Commons music tracks from the genre of Electronic Dance Music (EDM), collected from the social music sharing platform SoundCloud. Participants are asked to develop algorithms to predict a label for each segment. The label (described in more detail below) reflects whether or not the 15-second music segment contains a "<strong>drop</strong>", a characteristic event in EDM tracks. More information on the "<strong>drop</strong>" can be found in the recommended reading [1].</p> <p>As input to the algorithm, we provide a set of noisy labels for 15-second segments within the tracks. These labels can be expected to be noisy, typical of the labeling behavior of crowdsourcing workers. We collected the crowdsourcing labels using Amazon Mechanical Turk. Teams participating in this task are also welcome to tackle it using their own crowdsourcing platform; however, we expect that most teams will use the crowdsourcing labels that we provide with the task. Teams can experiment with very simple algorithms (e.g., majority voting among noisy labels), but are encouraged to apply more advanced techniques (e.g., methods to detect highly competent workers, whose labels can be assigned more weight).</p> <p align="justify">Beyond this basic goal, the advanced goal of this task is to explore the ability of hybrid human/conventional computation to generate accurate labels for the 15-second segments in the tracks.
Teams that pursue this advanced goal are asked to develop algorithms that combine noisy crowdsourcing labels (i.e., human computation) with automatic text processing or multimedia content analysis techniques (i.e., conventional computation). We provide the original music tracks and the start and end points of the 15-second segments in the data set. We also provide the metadata associated with the music tracks in order to allow teams to develop such algorithms.</p> <p align="justify">The algorithms will be evaluated by comparing their output to a ground truth created by trusted annotators. The task data will be released in a single round, and there is no development set. For the consensus calculation task, we suggest using last year's data for algorithm development <a href="">[3]</a>. If you are interested in using audio features, you need to collect your own data; we suggest you contact us for some tips. If there is sufficient interest in this task, we will propose it again to be run at a larger scale at MediaEval 2015.</p> <p align="justify">&nbsp;</p> <h2>Task Instructions</h2> <p align="justify">The data in this task will be released in a single round. The data set consists of 591 15-second segments in 355 SoundCloud tracks. Each segment is associated with three labels contributed by three different crowdworkers who listened to that segment.
The labels are the integers 1-4, and correspond to the following crowdworker judgments:</p> <ol> <li>The 15-second segment contains the entire drop (Label: <em><strong>1</strong></em>)</li> <li>The 15-second segment contains part of the drop (Label: <em><strong>2</strong></em>)</li> <li>The 15-second segment does not contain a drop (Label: <em><strong>3</strong></em>)</li> <li>None of the above (Label: <em><strong>4</strong></em>)</li> </ol> <p align="justify">The label <em><strong>4</strong></em> was given when the track was not available on SoundCloud or when the worker was not able to hear any music in the segment. All the annotations are released along with the dataset, and the segments labelled <em><strong>4</strong></em> can be ignored while developing the algorithms. In other words, the algorithms need to predict one of the three remaining labels for each segment. More details about the dataset are presented in the Data wiki.</p> <p align="justify">As stated above, participants in this task are asked to develop algorithms to predict one of the labels "<em><strong>1</strong></em>", "<em><strong>2</strong></em>", or "<em><strong>3</strong></em>" for each 15-second segment.<br /> Participants submit label predictions for all the segments in the dataset. Participants can submit up to five different runs (specific details are explained below). The runs will be evaluated against a high-fidelity ground truth created by trusted annotators.<br />Note that the task is designed to reflect a real-world crowdsourcing scenario. Specifically, the low-fidelity annotations (which are intended to be used as input to this task) are representative of what can be obtained on a crowdsourcing platform with a "sensible" (basic, but not advanced) quality control mechanism. As such, you should not expect a radical difference between the low-fidelity annotations and the high-fidelity annotations (which are used to evaluate the algorithms).
In other words, the task is quite challenging.</p> <h3>Evaluation metric</h3> <p align="justify">The official evaluation metric for this task is the weighted F1 score. As mentioned earlier, each segment has one of three possible labels, indicating whether the entire drop, part of the drop, or no drop is present in the 15-second segment. For each of these three labels, the F1 score is calculated separately. We then weight each label's F1 score by the proportion of segments carrying that label and compute the weighted mean F1 score.</p> <h3>Run submission</h3> <p align="justify">The submissions of participants are evaluated against the ground truth collected from trusted annotators. The entire data set is used for the evaluation. (In other words, the whole data set of 591 segments is the test set; there is no development set.)</p> <h3>Required runs</h3> <p align="justify">Participants can submit up to five runs, and each run may use a different approach.&nbsp;Each team is required to submit one run using only the crowd annotations. Participants should not use any metadata or content analysis methods to generate the labels for this run. If your team would for some reason like to be released from submitting the required run, please write an e-mail to the task organizers (contact details below) to explain the situation.
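As a concrete illustration of the simple baseline mentioned above, the sketch below computes a majority-vote consensus over the three crowdworker labels of each segment and scores predictions with a weighted F1 of the kind used for evaluation. This is not the official scoring script; the segment IDs and labels are made up, and segments labelled 4 are assumed to have been filtered out already.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties are broken by the smallest label."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

def weighted_f1(predicted, truth, classes=(1, 2, 3)):
    """Per-class F1 scores, averaged with weights equal to class support."""
    total = len(truth)
    score = 0.0
    for c in classes:
        tp = sum(1 for p, t in zip(predicted, truth) if p == c and t == c)
        fp = sum(1 for p, t in zip(predicted, truth) if p == c and t != c)
        fn = sum(1 for p, t in zip(predicted, truth) if p != c and t == c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        support = sum(1 for t in truth if t == c)
        score += f1 * support / total
    return score

# Toy example: three crowdworker labels per segment (label 4 already removed).
crowd = {"seg1": [1, 1, 2], "seg2": [3, 3, 3], "seg3": [2, 3, 3]}
consensus = {seg: majority_vote(labels) for seg, labels in crowd.items()}
```

A crowd-only run of this kind satisfies the required-run condition; more advanced approaches would replace `majority_vote` with, e.g., worker-competence weighting.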
</p> <h3>Submission format</h3> <p align="justify">Participants should submit their results as a CSV file (please use the file "result_template.csv" in the <a href="">data section</a>), in which each line has the following format:</p> <p align="justify">trackid,start,end,label</p> <p align="justify">trackid: unique identifier for each track, which is also the name of the mp3 file in the Music folder</p> <p align="justify">start: start time of the segment</p> <p align="justify">end: end time of the segment</p> <p align="justify">label: one of the three labels <em><strong>1</strong></em>, <em><strong>2</strong></em>, or <em><strong>3</strong></em></p> <h2>Recommended Reading</h2> [1] M. J. Butler. Unlocking the Groove: Rhythm, Meter, and Musical Design in Electronic Dance Music (Profiles in Popular Music). Indiana University Press, February 2006.<br /><br /> [2] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP '10), pages 64-67, 2010.<br /><br /> [3] B. Loni, L. Y. Cheung, M. Riegler, A. Bozzon, L. Gottlieb, and M. Larson. <a href="">Fashion 10000: an enriched social image dataset for fashion and clothing.</a> In Proceedings of the 5th ACM Multimedia Systems Conference (MMSys '14), pages 41-46, 2014.<br /><br /> [4] B. Loni, J. Hare, M. Georgescu, M. Riegler, X. Zhu, M. Morchid, R. Dufour, and M. Larson. Getting by with a little help from the crowd: Practical approaches to social image labeling. To appear in Proceedings of the International ACM Workshop on Crowdsourcing for Multimedia (CrowdMM), 2014.<br /><br /> [5] B. Loni, M. Larson, A. Bozzon, and L. Gottlieb. <a href="">Crowdsourcing for Social Multimedia at MediaEval 2013: Challenges, data set, and evaluation.</a> In MediaEval 2013 Workshop, Barcelona, Spain, 2013.<br /><br /> [6] S. Nowak and S. Rüger.
How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation. In Proceedings of the International Conference on Multimedia Information Retrieval (MIR '10), pages 557-566, 2010.<br /><br /> [7] A. Sheshadri and M. Lease. <a href="">SQUARE: A Benchmark for Research on Computing Crowd Consensus.</a> In Proceedings of the 1st AAAI Conference on Human Computation (HCOMP), pages 156-164, 2013.<br /> <h2>Contact</h2> <p align="justify">Karthik Yadati, Delft University of Technology, Netherlands, email: &lt;;<br />Martha Larson, Delft University of Technology, Netherlands, email: &lt;;</p>