**HyperKvasir: A Comprehensive Multi-Class Image
and Video Dataset for Gastrointestinal Endoscopy**
Artificial intelligence is currently a hot topic in medicine. However, medical data is often sparse and hard to obtain because of legal restrictions and a lack of medical personnel to perform the cumbersome and tedious labeling, which leads to technical limitations. In this respect, we share the Hyper-Kvasir dataset, which is the largest image and video dataset from the gastrointestinal tract available today.
The data was collected during real gastroscopy and colonoscopy examinations at a hospital in Norway and partly labeled by experienced gastrointestinal endoscopists.
The dataset contains 110,079 images and 374 videos capturing anatomical landmarks as well as pathological and normal findings, resulting in around 1 million images and video frames altogether.
**Labeled images**
In total, the dataset contains 10,662 labeled images stored in JPEG format. The images can be found in the images folder. The class each image belongs to corresponds to the folder it is stored in (e.g., the 'polyp' folder contains all polyp images, the 'barretts' folder contains all images of Barrett’s esophagus, etc.). The number of images per class is not balanced, which is a general challenge in the medical field because some findings occur more often than others. This adds a further challenge for researchers, since methods applied to the data should also be able to learn from a small amount of training data. The labeled images represent 23 different classes of findings.
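Because the class label is encoded in the folder name, the class imbalance can be inspected by walking the folder tree. The following is a minimal sketch, not part of the dataset tooling: the function name `count_images_per_class` and the root path passed to it are assumptions, and any subfolder that holds JPEG files (including the unlabeled folder, if it sits under the same root) is counted as its own group.

```python
from collections import Counter
from pathlib import Path


def count_images_per_class(images_root):
    """Count JPEG files per class, using the parent folder name as the class."""
    counts = Counter()
    for path in Path(images_root).rglob("*"):
        # Only count JPEG files; the extension check is case-insensitive.
        if path.is_file() and path.suffix.lower() in {".jpg", ".jpeg"}:
            counts[path.parent.name] += 1
    return counts


if __name__ == "__main__":
    # "images" is an assumed location of the extracted images folder.
    for class_name, n in count_images_per_class("images").most_common():
        print(f"{class_name}: {n}")
```

Sorting with `most_common()` puts the largest classes first, which makes the small classes easy to spot at the bottom of the list.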
**Segmented Images**
We provide the original image, a segmentation mask and a bounding box for 1,000 images from the polyp class. In the mask, the pixels depicting polyp tissue, the region of interest, are represented by the foreground (white mask), while the background (in black) does not contain polyp pixels. The bounding box is defined as the outermost pixels of the found polyp. For this segmentation set, we have two folders, one for images and one for masks, each containing 1,000 JPEG-compressed images. The bounding boxes for the corresponding images are stored in a JavaScript Object Notation (JSON) file. The image and its corresponding mask have the same filename. The images and files are stored in the segmented images folder. It is important to point out that the segmented images are duplicates of images in the polyp class of the images folder, since they were taken from there.
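The relationship between a mask and its bounding box (the outermost foreground pixels) can be illustrated with a short sketch. This is an illustration under stated assumptions, not the procedure used to produce the provided JSON file: the function name `bounding_box`, the `(xmin, ymin, xmax, ymax)` ordering and the threshold of 128 are all hypothetical choices. A threshold is used because JPEG compression is lossy, so mask pixels are not guaranteed to be exactly 0 or 255. The mask is passed as a plain 2D list of grayscale values, and all foreground pixels are treated as a single region.

```python
def bounding_box(mask, threshold=128):
    """Return (xmin, ymin, xmax, ymax) of the foreground pixels in a mask.

    `mask` is a 2D sequence of grayscale values (one inner sequence per row).
    Pixels with a value >= threshold count as foreground (white).
    Returns None when the mask has no foreground pixels.
    """
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value >= threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


# Tiny synthetic mask: a small white blob on a black background.
example_mask = [
    [0, 0, 0, 0],
    [0, 255, 200, 0],
    [0, 0, 250, 0],
    [0, 0, 0, 0],
]
box = bounding_box(example_mask)
```

For real mask files, the pixel values would first have to be read with an image library of your choice and converted to a single grayscale channel.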
**Unlabeled Images**
In total, the dataset contains 99,417 unlabeled images. The unlabeled images can be found in the unlabeled folder, which is a subfolder of the images folder, alongside the labeled image folders. In addition to the unlabeled image files, we also provide the extracted global features and cluster assignments in the Hyper-Kvasir GitHub repository as Attribute-Relation File Format (ARFF) files. ARFF files can be opened and processed using, for example, the WEKA machine learning library, or they can easily be converted into comma-separated values (CSV) files.
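As an illustration of the ARFF-to-CSV conversion, here is a minimal sketch using only the Python standard library. It assumes a dense ARFF file with simple, unquoted attribute names and no sparse `{...}` data rows; the function name `arff_to_csv` and the sample attribute names are hypothetical, and the actual feature files may need a more complete ARFF parser.

```python
import csv
import io


def arff_to_csv(arff_text):
    """Convert dense ARFF text to CSV text, with attribute names as the header."""
    names, rows, in_data = [], [], False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            # Data rows are comma-separated; ARFF commonly quotes with '.
            rows.append(next(csv.reader([line], quotechar="'",
                                        skipinitialspace=True)))
        elif lower.startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name.
            names.append(line.split(None, 2)[1].strip("'\""))
        elif lower.startswith("@data"):
            in_data = True
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample content, not taken from the dataset.
sample = """% comment
@relation features
@attribute f1 numeric
@attribute cluster {0,1}
@data
0.5,1
0.25, 0
"""
converted = arff_to_csv(sample)
```

To convert a file, read it with `open(path, encoding="utf-8").read()` and write the returned string to a `.csv` file.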
**Videos**
In total, 374 videos are provided in the dataset, stored in the folder called videos. The video file format is Audio Video Interleave (AVI). In addition to the video files, a CSV file is provided containing each video's videoID and finding. VideoID contains the corresponding video file name, and finding contains the description of the finding in the video. In total, we have 171 different findings in the videos. Some are related to the labels provided for the images, and some are unique to the videos. Overall, the image and video findings can be seen as two different annotation types, since the video findings are meant to describe the video as a whole.
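The video annotations can be summarized with a short sketch like the one below. It assumes that the CSV header uses exactly the column names `videoID` and `finding` and that the delimiter is a comma; both should be checked against the actual file, and the delimiter can be changed through the `delimiter` parameter. The function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def findings_per_video(csv_text, delimiter=","):
    """Map each videoID to its finding, based on the annotation CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return {row["videoID"]: row["finding"] for row in reader}


# Hypothetical sample rows, not taken from the dataset.
sample_csv = "videoID,finding\nvid-a,polyps\nvid-b,ulcer\nvid-c,polyps\n"
video_findings = findings_per_video(sample_csv)

# Count how many videos share each finding description.
finding_counts = Counter(video_findings.values())
```

Since each finding describes a video as a whole, these counts are per video and are not directly comparable to the per-image class counts.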
**Terms of use**
The data is released fully open under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. All work, documents and papers that use or refer to the dataset, or report experimental results based on Hyper-Kvasir, must include a reference to the related article:
"**Hanna Borgli, Vajira Thambawita, Pia H. Smedsrud, Steven Hicks, Debesh Jha, Sigrun L. Eskeland, Kristin Ranheim Randel, Konstantin Pogorelov, Mathias Lux, Duc Tien Dang Nguyen, Dag Johansen, Carsten Griwodz, Håkon K. Stensland, Enrique Garcia-Ceja, Peter T. Schmidt, Hugo L. Hammer, Michael A. Riegler, Pål Halvorsen, and Thomas de Lange: "HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy", Springer Nature Scientific Data, 2020**"
(PREPRINT available at: https://osf.io/mkzcq/ )
**Ethical approval**
In this study, we used fully anonymized data approved by the Privacy Data Protection Authority. It was exempted from approval by the Regional Committee for Medical and Health Research Ethics - South East Norway. Furthermore, we confirm that all experiments were performed in accordance with the relevant guidelines and regulations of the Regional Committee for Medical and Health Research Ethics - South East Norway, and the GDPR.