Dataset

Patient, slide and region selection

To build the dataset for the mitosis detection challenge, hematoxylin and eosin (H&E) stained slides from 23 invasive breast carcinoma patients were made available. These patients underwent an excision biopsy between July 2009 and January 2010 at the University Medical Center Utrecht. The only patient selection criterion was the availability of the slides in the Pathology Department archive. Please note that we use routinely prepared H&E sections, which capture the day-to-day variability of the tissue preparation and staining processes.

One expert pathologist selected one representative stained slide per patient and marked a large region of the tumor on the glass slides where mitosis annotation was to be performed. For the larger tumors, the marked areas within the digital slides were selected to encompass the most invasive part of the tumor, at the periphery and with the highest cellularity, in line with the standard guidelines for mitosis counting. Smaller tumors were included in their entirety.

The regions of interest range in size from 7 mm² to 58 mm², with a median of 26 mm². The standard in pathology practice is to count mitotic figures in an area of 2 mm² (translating to 8 to 10 high power fields, depending on the microscope) and report that number as the mitotic activity index. However, in order to annotate as many mitotic figures as possible, the counting was not limited to 2 mm² but was extended to the entire marked area.

Digitization

The digitization of the regions marked for annotation was performed with the Aperio ScanScope XT scanner at 40× magnification and with a spatial resolution of 0.25 μm/pixel. This is one of the most widely used digital slide scanners at present. During scanning, the focus points automatically selected by the scanner were manually revised in order to avoid out-of-focus artifacts and to ensure the best possible image quality. At the time of scanning, high-quality JPEG 2000 compression (quality factor of 85) was used in order to reduce the storage requirements.
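The stated resolution fixes the relation between pixel sizes and physical sizes. The sketch below works through that arithmetic for a field of 2000×2000 pixels, the image size used in this dataset. The function names are illustrative and not part of any challenge software.

```python
# Spatial resolution stated for the scans (micrometers per pixel).
MICRONS_PER_PIXEL = 0.25


def pixels_to_mm(n_pixels, microns_per_pixel=MICRONS_PER_PIXEL):
    """Convert a length in pixels to millimeters."""
    return n_pixels * microns_per_pixel / 1000.0


def area_mm2(width_px, height_px, microns_per_pixel=MICRONS_PER_PIXEL):
    """Physical area in square millimeters of a width x height pixel region."""
    return (pixels_to_mm(width_px, microns_per_pixel)
            * pixels_to_mm(height_px, microns_per_pixel))


def fields_in_area(total_area_mm2, field_side_px=2000):
    """Number of square fields of the given side needed to cover an area."""
    return total_area_mm2 / area_mm2(field_side_px, field_side_px)


if __name__ == "__main__":
    print(pixels_to_mm(2000))      # side of one field in mm
    print(area_mm2(2000, 2000))    # area of one field in mm^2
    print(fields_in_area(2.0))     # fields covering the standard 2 mm^2
```

Under these numbers, one 2000×2000 pixel field has a side of 0.5 mm, and eight such fields cover the standard 2 mm² counting area.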

Mitosis annotation

Two expert pathologists independently traversed the selected regions on the digital slides and annotated the locations of mitotic figures. This was done using standard digital slide viewing software on consumer-grade computer monitors. The concordant cases (objects that were annotated as mitotic figures by both observers) were taken as ground truth objects directly. The discordant cases (objects that were annotated as mitotic figures by only one of the observers) were presented to a panel of two additional observers, who made the final decision. Note that the two additional observers did not traverse the slides, but only looked at the discordant cases. With this setup, all objects that are accepted as ground truth mitotic figures have been agreed upon by at least two experts.
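The consensus procedure can be illustrated with a small sketch. It is not the procedure's actual implementation: the text does not say how annotations from the two observers were paired, so the distance tolerance `max_dist` (in pixels), the greedy nearest-neighbor matching, and the `panel_accepts` callback standing in for the panel's decision are all assumptions made for illustration.

```python
import math


def split_by_agreement(obs_a, obs_b, max_dist=30.0):
    """Split two observers' (x, y) annotations into concordant and discordant.

    Two annotations are treated as the same object when they lie within
    max_dist pixels of each other (an assumed tolerance). Matching is
    greedy and one-to-one.
    """
    concordant, discordant = [], []
    unmatched_b = list(obs_b)
    for a in obs_a:
        nearest = min(unmatched_b, key=lambda b: math.dist(a, b), default=None)
        if nearest is not None and math.dist(a, nearest) <= max_dist:
            unmatched_b.remove(nearest)
            # Represent the agreed object by the midpoint of the two marks.
            concordant.append(((a[0] + nearest[0]) / 2, (a[1] + nearest[1]) / 2))
        else:
            discordant.append(a)
    # Anything observer B marked that was never matched is also discordant.
    discordant.extend(unmatched_b)
    return concordant, discordant


def ground_truth(obs_a, obs_b, panel_accepts, max_dist=30.0):
    """Concordant objects plus the discordant ones accepted by the panel."""
    concordant, discordant = split_by_agreement(obs_a, obs_b, max_dist)
    return concordant + [p for p in discordant if panel_accepts(p)]
```

For example, `ground_truth(a, b, panel_accepts=lambda p: False)` keeps only the objects both observers marked, which corresponds to the case where the panel rejects every discordant object.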

Dataset format

The annotated regions were exported into separate images (TIFF format), each image representing one high power field (HPF, defined as 0.5 mm × 0.5 mm, i.e. 2000×2000 pixels). Since for some cases the total number of HPFs is very high (on the order of several hundred), only the HPFs that contain at least one mitotic figure were included in the dataset. For the cases that have fewer than 10 HPFs in which a mitotic figure is present, additional “empty” HPFs were included to extend the total number to 10 (in order to include sufficient “background” information, necessary for good training and evaluation).
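The inclusion rule for HPFs can be written down as a short sketch. The text does not state how the additional “empty” HPFs were chosen, so the seeded random sampling used here is an assumption, as are the function and parameter names.

```python
import random

# Minimum number of HPFs per case, as described in the text.
MIN_HPFS_PER_CASE = 10


def select_hpfs(mitoses_per_hpf, min_hpfs=MIN_HPFS_PER_CASE, seed=0):
    """Return the HPF identifiers to include for one case.

    mitoses_per_hpf maps an HPF identifier to its number of ground truth
    mitotic figures. All HPFs with at least one mitotic figure are kept.
    If fewer than min_hpfs remain, empty HPFs are added to reach min_hpfs
    (chosen at random here, which is an assumption).
    """
    selected = sorted(h for h, n in mitoses_per_hpf.items() if n > 0)
    shortfall = min_hpfs - len(selected)
    if shortfall > 0:
        empty = sorted(h for h, n in mitoses_per_hpf.items() if n == 0)
        rng = random.Random(seed)
        selected += rng.sample(empty, min(shortfall, len(empty)))
    return selected
```

A case with three HPFs containing mitotic figures would thus contribute those three plus seven empty HPFs, while a case with fifteen such HPFs would contribute exactly those fifteen.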

The patients were divided into two groups, one used for training and the other as an independent testing set. The division was done in such a way that the number of mitotic figures in the two groups is balanced.
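The exact procedure used to balance the two groups is not described here. As an illustration only, the sketch below shows one simple way to obtain such a split: a greedy partition that assigns each patient, from the highest mitotic count to the lowest, to the group with the currently smaller total. The function name and the input format are hypothetical.

```python
def balanced_split(mitoses_per_patient):
    """Greedily split patients into two groups with similar mitosis totals.

    mitoses_per_patient maps a patient identifier to the number of ground
    truth mitotic figures for that patient. This is an illustrative
    heuristic, not the procedure used to create the dataset.
    """
    train, test = [], []
    train_total = test_total = 0
    # Largest counts first; ties are broken by patient identifier.
    ordered = sorted(mitoses_per_patient.items(), key=lambda kv: (-kv[1], kv[0]))
    for patient, count in ordered:
        if train_total <= test_total:
            train.append(patient)
            train_total += count
        else:
            test.append(patient)
            test_total += count
    return train, test
```

Note that this heuristic balances only the number of mitotic figures. It does not control the number of patients or HPFs in each group.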

Both training and testing sets are organized into numbered folders, each folder containing HPFs and, if applicable, ground truth data from a single slide (patient). The HPFs are stored as 8-bit RGB TIFF images with lossless PackBits compression.

An alternative version of the dataset with a smaller download size, in which the images are stored with light lossy JPEG compression (quality factor of 95), is also available. Note that this compression is applied on top of the JPEG 2000 compression used at scan time.

The training HPF images are accompanied by a comma-separated values (CSV) file with the same filename but a different extension (.csv), containing the locations of the ground truth mitotic figures. The “empty” HPFs do not have a corresponding CSV file. Each row in the CSV file corresponds to one mitotic figure, and the two columns give the image coordinates of the annotated location.
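A minimal sketch of reading the ground truth for one training HPF follows, using only the Python standard library. It assumes that the coordinates are stored as integers and that the files have no header row. The order of the two columns is not specified above, so the values are returned as stored, without being interpreted as (x, y) or (row, column). The path in the usage note is hypothetical.

```python
import csv
from pathlib import Path


def load_annotations(image_path):
    """Return the annotated locations for one HPF image as a list of pairs.

    The CSV file is expected next to the image, with the same filename and
    a .csv extension. An "empty" HPF has no CSV file, so a missing file
    yields an empty list.
    """
    csv_path = Path(image_path).with_suffix(".csv")
    if not csv_path.exists():
        return []
    with csv_path.open(newline="") as f:
        # One row per mitotic figure; blank rows are skipped.
        return [(int(row[0]), int(row[1])) for row in csv.reader(f) if row]
```

For example, `load_annotations("01/01.tif")` would read `01/01.csv` if it exists and return an empty list otherwise.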