Evaluation

Participants can log in to this website and submit results for evaluation. Each submission must be accompanied by an abstract describing the proposed method or, for repeat submissions, describing how the method differs from previous submissions.

Our main goal is to evaluate automatic methods for mitosis detection. However, methods that require or use some degree of user interaction are acceptable, provided they still offer a benefit over fully manual mitosis counting. The user interaction must be described in the submitted abstract (see below), and, if applicable, the output from the user interaction should be uploaded along with the results. All methods that require user interaction will be designated as semi-automatic when the results of the challenge are presented.

Results format

The results must be submitted as CSV files, one for each HPF, each with the same filename as the HPF it refers to. Each row in the CSV file must correspond to one detected mitotic figure location. The first two columns must contain the image coordinates of the detection.
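For illustration, the first rows of a results file for one HPF might look as follows. The coordinates are invented, and the x-before-y ordering is an assumption; follow the coordinate convention of the provided dataset:

    1503,897
    440,1210
    2032,56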

An optional third column in the CSV file can contain a confidence value for the detection. Participants who submit results in this format must also provide a threshold for the confidence value, such that all objects with confidence above the threshold are considered detected mitotic figures. The threshold value must be identical for all HPFs from all patients. If the third column is not provided, all objects in the CSV file will be considered detected mitotic figures. Although the evaluation of results does not consider the confidence values of the detected mitotic figures (see below), this information might be used in the summary paper to plot free-response ROC curves or other similar graphs.
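As a minimal sketch of how such a file could be parsed, the Python function below reads one HPF's CSV file and applies the confidence threshold when a third column is present. The function name and keyword argument are hypothetical, not part of any official tooling:

    import csv

    def read_detections(csv_path, threshold=None):
        """Return the detected mitotic figure locations in one HPF's CSV file.

        Each row holds: x, y[, confidence]. When a confidence column is
        present and a threshold is given, only rows with confidence above
        the threshold are kept, per the submission rules.
        """
        detections = []
        with open(csv_path, newline="") as f:
            for row in csv.reader(f):
                if not row:
                    continue
                if len(row) >= 3 and threshold is not None and float(row[2]) <= threshold:
                    continue  # confidence not above the threshold
                detections.append((float(row[0]), float(row[1])))
        return detections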

All CSV files must be organized in a directory structure identical to that of the provided dataset. If applicable, the confidence threshold value should be provided as a 'threshold.txt' file at the top level of the directory tree. The abstract (see below) should also be placed at the top level of the directory tree, in PDF format, with the filename 'abstract.pdf'. For submission, the directory tree must be compressed into a single archive with the following filename:

teamUsername_#submission.[zip, tar.gz, …]
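For illustration, a complete submission archive might be organized as follows. The archive name instantiates the pattern above for a hypothetical first submission, and the patient and HPF names are invented; they should mirror the provided dataset:

    teamUsername_1.zip
    ├── abstract.pdf
    ├── threshold.txt
    ├── patient_01/
    │   ├── hpf_01.csv
    │   └── hpf_02.csv
    └── patient_02/
        └── hpf_01.csv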

Abstract format

The abstract should be 500 to 1000 words long and contain Methods and Experiments sections. The Methods section should give a short overview of the proposed method, in sufficient detail to understand how it works. If a commercial system is used, a method description is not necessary, but the exact name and version number of the system must be provided. The Experiments section should describe the steps taken to select the detection model and/or model parameters (the training procedure).

Evaluation measures

A detection will be considered a true positive if its Euclidean distance to a ground truth location is less than 7.5 μm (30 pixels). If multiple detections fall within 7.5 μm of a single ground truth location, they will be counted as one true positive. All detections that are not within 7.5 μm of a ground truth location will be counted as false positives. All ground truth locations that do not have a detection within 7.5 μm will be counted as false negatives.
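A minimal sketch of this counting procedure is given below, assuming detections and ground truth locations are lists of (x, y) pixel coordinates. This is an illustration, not the official evaluation code; in particular, it does not resolve the rare case of one detection lying within 30 pixels of two ground truth locations:

    import math

    def evaluate_hpf(detections, ground_truth, max_dist=30.0):
        """Count TP/FP/FN for one HPF under the 7.5 µm (30 px) criterion.

        A ground truth location with one or more detections within
        max_dist pixels yields exactly one true positive; detections
        near no ground truth location are false positives; unmatched
        ground truth locations are false negatives.
        """
        def near(p, q):
            return math.dist(p, q) < max_dist

        tp = sum(1 for g in ground_truth if any(near(d, g) for d in detections))
        fn = len(ground_truth) - tp
        fp = sum(1 for d in detections if not any(near(d, g) for g in ground_truth))
        return tp, fp, fn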

For comparison of the proposed methods, two different rankings will be produced:

  • Ranking according to the overall F1-score;
  • Ranking according to the F1-score computed for each patient separately.

In the first ranking scheme, all ground truth objects are considered as a single dataset (regardless of which patient they belong to). The proposed methods are simply ranked according to the F1-score, calculated as F1 = 2·precision·recall / (precision + recall).
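In code, the pooled F1-score could be computed as in the following sketch, where tp, fp, and fn are the counts summed over all HPFs of all patients (returning 0.0 for the degenerate empty cases is an assumption, not part of the challenge rules):

    def f1_score(tp, fp, fn):
        """Pooled F1-score; returns 0.0 for the degenerate empty cases."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)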

The first ranking scheme is heavily influenced by the results for cases with a very high number of mitotic figures. The second ranking scheme weights the results from all cases equally, regardless of the number of mitotic figures present in them. In this scheme, the ground truth objects belonging to a single patient are considered a separate dataset. The F1-score is calculated at the patient level, and the proposed methods are ranked for each patient separately. The final placing of the methods is determined by the average rank across all patients.
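A sketch of the second scheme is shown below, assuming a mapping from patient to per-method F1-scores; ties are broken arbitrarily by sort order here, which may differ from the official tie handling:

    from collections import defaultdict

    def average_ranks(per_patient_f1):
        """Average per-patient rank for each method (lower is better).

        per_patient_f1 maps patient id -> {method name: F1-score}.
        """
        rank_sums = defaultdict(float)
        for scores in per_patient_f1.values():
            ordered = sorted(scores, key=scores.get, reverse=True)
            for rank, method in enumerate(ordered, start=1):
                rank_sums[method] += rank
        n = len(per_patient_f1)
        return {method: total / n for method, total in rank_sums.items()}

    # Example: average_ranks({"p1": {"A": 0.8, "B": 0.6},
    #                         "p2": {"A": 0.5, "B": 0.7}})
    # -> {"A": 1.5, "B": 1.5}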

The training dataset contains one case with zero ground truth mitotic figures. If such cases occur in the testing dataset, the ranking for those cases will be done according to the number of false positive detections, as precision and recall are not defined in that situation.

The ranking of the semi-automatic methods will be done separately from the automatic methods.

Upon evaluation, participants will be given the number of true positives, false positives, and false negatives for each HPF.