Guidelines for the Ground Truth Transcription

Transcription guidelines for full texts to be used as Ground Truth

The Ground Truth corpus contains pages from publications printed between 1500 and 1900. Its content is based on a selection from the holdings of the DFG project "German Text Archive", the Digitized Collections of the Staatsbibliothek zu Berlin, and the Wolfenbüttel Digital Library of the Herzog August Bibliothek. Holdings from the projects and digital collections of other libraries, as well as additional Ground Truth data compiled together with module projects, can be included in the corpus as special extensions in coordination with the OCR-D coordination project. If additional annotations or texts are needed, they can be created in consultation with the OCR-D coordination project.

The Ground Truth data is provided in order to:
  • make templates and data available for training OCR programs,
  • enable the examination and evaluation of OCR results.
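
To illustrate the second aim, the following is a minimal sketch of how an OCR result might be compared against a Ground Truth line. It assumes the character error rate (edit distance divided by the length of the Ground Truth) as the metric; these guidelines do not prescribe a metric, and the function names `levenshtein` and `character_error_rate` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance normalised by the Ground Truth length (assumed metric)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    # Illustrative strings only: the OCR output replaces the long s with a round s.
    print(character_error_rate("Waſſer", "Wasser"))
```

Because the Ground Truth preserves historical characters such as the long s, an OCR result that normalises them counts as erroneous under this measure.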

These transcription guidelines largely follow the guidelines of the German Text Archive. Its basic principles, which these guidelines also adopt, are as follows:

  • The texts are transcribed on the principle of preserving the historical state of the language.
  • In order to achieve this objective, we aim to minimize the number of (unavoidable) interpretations of typographical features.
  • Printing errors are not corrected.
  • The guidelines that follow result from the principle of preserving the original text as accurately as possible while concentrating on its lexical content.

Depending on how individual typographic and graphematic phenomena are interpreted, the corpus is, or can be, transcribed at different levels. The levels are explained in more detail below.