Using reference data to train and evaluate statistical annotation and analysis methods is a core feature of empirical research. The successful application of such methods depends above all on suitable models underlying the algorithms. In addition to a suitable learning procedure, the existence of Ground Truth is the essential prerequisite for creating such models.
The OCR-D-Ground-Truth-Guidelines document the format of the existing Ground Truth provided by OCR-D and can serve as instructions for compiling further Ground Truth. This standardisation allows Ground Truth to be technically validated. Furthermore, existing transcriptions can be checked against this set of rules and, if necessary, converted into Ground Truth data.
The data format of the OCR-D-Ground-Truth is PAGE-XML. This format was initially developed by the PRImA Research Lab at the University of Salford, Greater Manchester, and substantially extended within the EU project IMPACT. It is currently maintained by the PRImA Research Lab. To ensure the further development and maintenance of this format, a PAGE-XML board was created on the initiative of OCR-D.
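To give a rough impression of what working with PAGE-XML Ground Truth can look like, the following minimal sketch reads the line-level transcriptions from a PAGE-XML document using only the Python standard library. The sample document is invented for illustration, the namespace version shown in it is an assumption (other schema versions exist), and the function name `read_text_lines` is hypothetical. The element names `TextRegion`, `TextLine`, `TextEquiv` and `Unicode` follow the PAGE schema.

```python
import xml.etree.ElementTree as ET

# Invented sample document; the namespace version is an assumption.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="10,10 500,10 500,200 10,200"/>
      <TextLine id="r1_l1">
        <Coords points="10,10 500,10 500,60 10,60"/>
        <TextEquiv><Unicode>Erste Zeile</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1_l2">
        <Coords points="10,70 500,70 500,120 10,120"/>
        <TextEquiv><Unicode>Zweite Zeile</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_text_lines(page_xml: str) -> list[tuple[str, str]]:
    """Return (line id, transcription) pairs for every TextLine.

    Hypothetical helper, not part of any OCR-D tool. The `{*}` wildcard
    (Python 3.8+) matches any namespace, so the function does not depend
    on one particular PAGE schema version.
    """
    root = ET.fromstring(page_xml)
    lines = []
    for line in root.iterfind(".//{*}TextLine"):
        # Only the TextEquiv that is a direct child of the line is used,
        # not the TextEquiv elements of nested words or glyphs.
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        text = ""
        if unicode_el is not None and unicode_el.text:
            text = unicode_el.text
        lines.append((line.get("id"), text))
    return lines


if __name__ == "__main__":
    for line_id, text in read_text_lines(SAMPLE_PAGE_XML):
        print(line_id, text)
```

A check of this kind only extracts text. Technical validation of Ground Truth would additionally involve validating the file against the PAGE XML schema, which this sketch does not attempt.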