hOCR - OCR Workflow and Output embedded in HTML

Living Standard,

This version:
http://kba.github.io/hocr-spec/1.2/
Previous Versions:
Issue Tracking:
GitHub
Inline In Spec
Editor:
Konstantin Baierer
Former Editor:
Thomas Breuel

Abstract

The purpose of this document is to define an open standard for representing document layout analysis and OCR results as a subset of HTML.

1. Introduction

The purpose of this document is to define an open standard for representing document layout analysis and OCR results as a subset of HTML. The goal is to reuse as much existing technology as possible, and to arrive at a representation that makes it easy to store, share, process and display OCR results.

This specification defines many features that can represent a variety of OCR-related information. However, being built on top of HTML, hOCR is designed to make it easy to start simple and gradually use more complex constructs when necessary.

Suppose you have an HTML document that encodes a book. Wrapping the content of each page in <div class="ocr_page"> tags conveys the page boundaries to hOCR-capable agents and turns the HTML document into an hOCR document.

2. Terminology and Representation

2.1. Reusing HTML

Reusing HTML: Some text is missing in the first paragraph <https://github.com/kba/hocr-spec/issues/96>

This document describes a representation of various aspects of OCR output in an XML-like format. That is, we define a set of tags containing text and other tags, together with attributes of those tags. However, since the content we are representing is formatted text,

However, we are not actually using a new XML for the representation; instead we embed the representation in XHTML (or HTML) because [XHTML1] and XHTML processing already define many aspects of OCR output representation that would otherwise need additional, separate and ad-hoc definitions. These aspects include:

We are embedding this information inside HTML by encoding it within valid tags and attributes inside HTML. We are going to use the terms elements and properties for referring to embedded markup.

2.2. Definitions

2.2.1. "element"

An hOCR element (in this spec simply referred to as an element) is any HTML tag with a class attribute that contains exactly one class name that starts with ocr_ or ocrx_. Non-OCR related HTML content must not use class names that begin with ocr_ or ocrx_.

Note: When referring to an HTML tag with class ocr_page, this spec uses the notation ocr_page

If an HTML tag is an hOCR element, then its title attribute must not be used for any other purpose than to define hOCR properties and adhere to the properties format.

For some elements, the spec recommends specific HTML tags. Following these recommendations is entirely optional: it may not be possible or desirable to choose those tags (e.g., when adding hOCR information to an existing HTML output routine).
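The definition above can be checked mechanically. The following is a minimal Python sketch, not part of the specification; the function name hocr_class is hypothetical, and the sketch assumes that class names are separated by whitespace, as in ordinary HTML.

```python
def hocr_class(class_attr):
    """Return the hOCR class name found in a class attribute value.

    Returns None unless exactly one class name starts with "ocr_" or
    "ocrx_", mirroring the definition of an hOCR element.
    """
    names = [
        name for name in class_attr.split()
        if name.startswith(("ocr_", "ocrx_"))
    ]
    return names[0] if len(names) == 1 else None


print(hocr_class("ocr_page"))             # a plain hOCR element
print(hocr_class("highlight ocrx_word"))  # extra non-hOCR classes are fine
print(hocr_class("ocr_line ocrx_line"))   # two hOCR classes: not an element
```

Non-hOCR class names are ignored by the check, so existing styling classes can stay on the same tag.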

2.2.2. "property"

hOCR Properties are a set of key-value pairs that convey OCR-specific information related to specific elements. They are serialized using a specific format in the title attribute of the element they refer to.

Note: When referring to a property bbox, this spec uses the notation bbox.

The name of a property must only consist of lowercase letters and numbers. Property names must be either from those defined in § 4 The properties of hOCR or begin with x_ to denote implementation-specific extensions.

Properties may define a default value. If such a property is not disallowed for an element but is not explicitly specified on it, the element is assigned the property with its default value.

2.2.3. "capability"

The presence of elements and properties must be explicitly stated as a capability. The rationale is that if an hOCR producer is capable of producing certain elements and properties, it should inform hOCR consumers that they may encounter those elements/properties. If a producer is not capable of producing certain elements/properties, consumers need not look for them.

Note: When referring to a capability ocrp_poly, this spec uses the notation ocrp_poly.

The mechanism for declaring capabilities is described in § 6.2 Capabilities.

2.3. Relationship between elements, properties

2.3.1. element - property

There are four levels of association between any element and any property:

Disallowed Property

The element MUST NOT contain the property

Unless defined otherwise, all properties are disallowed for any element.

Required Property

The element MUST contain the property

Recommended Property

The element SHOULD contain the property

Allowed Property

The element MAY contain the property

2.3.2. property - property

A property present on an element can have one of the following relations to any other property:

Independent Property

The presence of property A has no influence on the presence of property B

Unless otherwise defined, properties are always independent.

Implied Property

If property A is present, property B must also be present

Conflicting Property

If property A is present, property B must not be present

Related Property

Property B is related to property A
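The "Implied Property" relation lends itself to a simple consistency check. The sketch below is illustrative only: the table IMPLIED and the function missing_implied are hypothetical names, and the table lists just the "Implied" entries given in § 4 The properties of hOCR (cuts implies bbox, nlp implies cuts, imagemd5 implies image). The check is deliberately not transitive.

```python
# "Implied" relations as listed in the property definitions of section 4.
IMPLIED = {
    "cuts": {"bbox"},
    "nlp": {"cuts"},
    "imagemd5": {"image"},
}


def missing_implied(present):
    """Map each present property to the implied properties it lacks."""
    present = set(present)
    problems = {}
    for name in sorted(present):
        missing = IMPLIED.get(name, set()) - present
        if missing:
            problems[name] = sorted(missing)
    return problems


print(missing_implied(["bbox", "cuts", "nlp"]))  # nothing missing
print(missing_implied(["nlp"]))                  # nlp requires cuts
```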

2.4. Properties Grammar

The format of the properties is as follows, expressed in the ABNF notation of [RFC5234]:

digit            = %x30-39
uint             = 1*digit
int              = *1"-" uint
nint             = "-" uint
fraction         = "." uint
float            = *uint fraction

whitespace       = 1*%x20 ; one or more spaces ' '
comma            = %x2C   ; comma ','
semicolon        = %x3B   ; semicolon ';'
doublequote      = %x22   ; double quote '"'
lowercase-letter = %x61-7A
alnum-word       = 1*(lowercase-letter / digit)
ascii-word       = 1*(%x21-3A / %x3C-7E)  ; printable w/o space/semicolon
ascii-string     = 1*(%x20-21 / %x23-7E)  ; printable ascii without doublequote
delimited-string = doublequote ascii-string doublequote

properties-format = key-value-pair *(*whitespace semicolon *whitespace key-value-pair)
spec-property-name = ("bbox" / "baseline" / "cflow" / "cuts" / "hardbreak" /
                      "image" / "imagemd5" / "lpageno" / "nlp" / "order" /
                      "poly" / "ppageno" / "scan_res" / "textangle" /
                      "x_bboxes" / "x_confs" / "x_font" / "x_fsize" /
                      "x_scanner" / "x_source" / "x_wconf" )
engine-property-name = "x_" alnum-word
key-value-pair = property-name whitespace property-value
property-name = spec-property-name / engine-property-name
property-value = (ascii-word / delimited-string) *(whitespace (ascii-word / delimited-string) )

This is only the general grammar; each individual property defines its own exact grammar, which overrides property-name and property-value.

<div class="ocr_page" id="page_1">
  <div class="ocr_carea" id="column_2" title="bbox 313 324 733 1922">
    <div class="ocr_par" id="par_7"> ... </div>
    <div class="ocr_par" id="par_19"> ... </div>
  </div>
</div>
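To illustrate how a consumer might read the title attribute, here is a minimal Python sketch of a parser for the general properties format. It is an assumption-laden illustration rather than a reference implementation: the function name parse_properties is hypothetical, values are returned as strings without applying any per-property grammar, and double quotes are always treated as string delimiters.

```python
import re

# A token is a delimited string, a semicolon, or a run of other characters.
TOKEN = re.compile(r'"[^"]*"|;|[^\s;"]+')


def parse_properties(title):
    """Split a title attribute into {property-name: [value, ...]}."""
    props = {}
    current = []
    # A trailing semicolon token flushes the last key-value pair.
    for token in TOKEN.findall(title) + [";"]:
        if token == ";":
            if current:
                name, values = current[0], current[1:]
                props[name] = [value.strip('"') for value in values]
            current = []
        else:
            current.append(token)
    return props


print(parse_properties("bbox 313 324 733 1922"))
print(parse_properties('image "/foo/bar.png"; ppageno 7'))
```

Tokenizing before splitting on semicolons keeps a semicolon inside a delimited string from being mistaken for a pair separator.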

3. The elements of hOCR

The elements in hOCR can be broadly categorized as follows:

Typesetting Elements

Elements that describe those areas of a page that nest but don’t generally overlap

See § 3.1 Typesetting Elements

Float Elements

Elements that describe those areas of a page that are not part of the flow but are positioned

See § 3.2 Float elements

Logical Elements

These elements describe a page and its components in traditional typesetting.

See § 3.3 Logical Elements

Inline elements

These elements describe content beyond the level of text lines

See § 3.4 Inline Elements

Engine-Specific elements

Elements whose semantics are engine-specific

See § 3.5 OCR Engine-Specific elements

3.1. Typesetting Elements

The following typesetting related elements are based on a typesetting model as found in most typesetting systems, including XSL:FO, (La)TeX, LibreOffice, and Microsoft Word.

In those systems, each page is divided into a number of areas. Each area can be a part of the body text (or of multiple body texts, in the case of newspaper layouts). The content of the areas derives from a linear stream of textual content, which flows into the areas, filling them linewise in their preferred directions.

3.1.1. ocr_page

Name
ocr_page
Categories
Typesetting Elements
Properties
Required:
bbox
Recommended:
image, imagemd5, ppageno, lpageno
Allowed:
x_source, x_scanner, scan_res

The ocr_page element must be present in all hOCR documents.

3.1.2. ocr_column

Name
ocr_column (Deprecated)
Categories
Typesetting Elements
OBSOLETE

Please use ocr_carea instead

3.1.3. ocr_carea

Name
ocr_carea
Categories
Typesetting Elements
Properties
Required:
bbox

"ocr content area" or "body area"

Used to be called ocr_column

The ocr_carea elements should appear in reading order unless this is impossible because of some other structuring requirement. If the document contains multiple ocr_linear streams, then each ocr_carea must indicate which stream it belongs to.

Note that for many documents, the actual ground truth careas are well-defined by the document style of the original document before printing and scanning. From a single page, the careas of the original document style cannot be recovered exactly. However, the partition of a document by ocr_carea for an individual page shall be considered correct relative to ground truth if

  1. all the text contained in a ground truth carea is fully contained within a single ocr_carea,

  2. no text outside a ground truth carea is contained within an ocr_carea, and

  3. the ocr_carea elements appear in the same order as the text flow relationships between the ground truth careas.

3.1.4. ocr_line

Name
ocr_line
Categories
Typesetting Elements
Properties
Required:
bbox
Allowed:
baseline, hardbreak, x_font, x_fsize, x_bboxes

In typesetting systems, content areas are filled with “blocks”, but most of those blocks are not recoverable or semantically meaningful. However, one type of block is visible and very important for OCR engines: the line. Lines are typesetting blocks that only contain glyphs (“inlines” in XSL terminology). They are represented by the ocr_line area.

ocr_line should be in a span

3.1.5. ocr_separator

Name
ocr_separator
Categories
Typesetting Elements, Float Elements
Properties
Required:
bbox

Any separator or similar element

3.1.6. ocr_noise

Name
ocr_noise
Categories
Inline Elements

Any noise element that isn’t part of typesetting

3.2. Float elements

Overlaid onto the page is a set of floating elements; floating elements exist outside the normal reading order. Floating elements may be introduced by the textual content, or they may be related to the page itself (anchoring is a logical property). In typesetting systems, floating elements may be anchored to the page, to paragraphs, or to the content stream. Floating elements can overlap content areas and render on top of or under content, or they can force content to flow around them. The default for floating elements in this spec is that their anchor is undefined (it is a logical property, not a typesetting property), and that text flows around them. Note that rectangular content areas combined with rectangular floats already allow a wide variety of non-rectangular text shapes to be realized.

There is currently no way of indicating anchoring or flow-around properties for floating elements; properties need to be defined for this.

Floats should not be nested. The following floats are defined:

3.2.1. ocr_float

Name
ocr_float
Categories
Float Elements
Properties
Required:
bbox

3.2.2. ocr_textfloat and ocr_textimage

Name
ocr_textfloat
Categories
Float Elements
Properties
Required:
bbox
Name
ocr_textimage
Categories
Float Elements
Properties
Required:
bbox

3.2.3. ocr_image, ocr_linedrawing and ocr_photo

Name
ocr_image
Categories
Float Elements
Properties
Required:
bbox
Name
ocr_linedrawing
Categories
Float Elements
Properties
Required:
bbox

Something that could be represented well and naturally in a vector graphics format like SVG (even if it is actually represented as PNG)

Name
ocr_photo
Categories
Float Elements
Properties
Required:
bbox

Something that requires JPEG or PNG to be represented well

3.2.4. ocr_header and ocr_footer

Name
ocr_header
Categories
Float Elements
Properties
Required:
bbox
Name
ocr_footer
Categories
Float Elements
Properties
Required:
bbox

3.2.5. ocr_pageno

Name
ocr_pageno
Categories
Float Elements
Properties
Required:
bbox

3.2.6. ocr_table

Name
ocr_table
Categories
Float Elements
Properties
Required:
bbox

3.3. Logical Elements

Logical Tags/classes

The classes defined in this section for logically structuring a hOCR document have their standard meaning as used in the publishing industry and tools like LaTeX, MS Word, and others.

Tags must be nested as indicated by the following list, but not all tags within the hierarchy need to be present.

For all of these elements except ocr_linear, there exists a natural linear ordering defined by reading order (ocr_linear indicates that the elements contained in it have a linear ordering). At the level of ocr_linear, there may not be a single distinguished order. A common example is a newspaper: a single newspaper may contain many ocr_linear streams, but there is no unique reading order among the different streams. OCR evaluation tools should therefore be sensitive to the order of all elements other than ocr_linear.

Textual information like section numbers and bullets must be represented as text inside the containing element.

Documents whose logical structure does not map naturally onto these logical structuring elements must not use them for other purposes.

3.3.1. ocr_document

Name
ocr_document
Recommended HTML Tags
div
Categories
Logical Elements

3.3.2. ocr_title, ocr_author and ocr_abstract

Name
ocr_title
Recommended HTML Tags
h1
Categories
Logical Elements
Name
ocr_author
Categories
Logical Elements
Name
ocr_abstract
Categories
Logical Elements

3.3.3. ocr_part and ocr_chapter

Name
ocr_part
Recommended HTML Tags
h1
Categories
Logical Elements
Name
ocr_chapter
Recommended HTML Tags
h1
Categories
Logical Elements

3.3.4. ocr_section, ocr_subsection and ocr_subsubsection

Name
ocr_section
Recommended HTML Tags
h2
Categories
Logical Elements
Name
ocr_subsection
Recommended HTML Tags
h3
Categories
Logical Elements
Name
ocr_subsubsection
Recommended HTML Tags
h4
Categories
Logical Elements

3.3.5. ocr_display, ocr_blockquote and ocr_par

Name
ocr_display
Categories
Float Elements
Properties
Required:
bbox
Name
ocr_blockquote
Recommended HTML Tags
blockquote
Categories
Logical Elements
Name
ocr_par
Recommended HTML Tags
p
Categories
Logical Elements

3.3.6. ocr_linear

Name
ocr_linear
Categories
Typesetting Elements

3.3.7. ocr_caption

Name
ocr_caption
Categories
Logical Elements

Image captions may be indicated using the ocr_caption element; such an element refers to the image(s) contained within the same float, or the immediately adjacent image if both the image and the ocr_caption element are in running text.

3.4. Inline Elements

<https://github.com/kba/hocr-spec/issues/51>

There is some content that should behave and flow like text

3.4.1. Unrecognized characters and words: ocr_glyph and ocr_glyphs

Name
ocr_glyph
Categories
Inline Elements
Name
ocr_glyphs
Categories
Inline Elements

3.4.2. ocr_dropcap

Name
ocr_dropcap
Categories
Inline Elements

3.4.3. Mathematical and chemical formulas: ocr_math and ocr_chem

Name
ocr_math
Categories
Float Elements
Properties
Required:
bbox
Name
ocr_chem
Categories
Float Elements
Properties
Required:
bbox

Mathematical and chemical formulas that float must be put into an ocr_float section. Formulas that are “display” mode should be put into an ocr_display section. ocr_math and ocr_chem

ocr_math must either be or contain either a single img tag or [MathML] markup

ocr_chem must either be or contain either a single img tag or [CML] markup

3.4.4. Unspecified inline content: ocr_cinfo

Define ocrx_cinfo

3.5. OCR Engine-Specific elements

A few abstractions are used as intermediate representations in OCR engines, although they have no meaning that can be defined in terms of either typesetting or logical function. Representing them may be useful for capturing existing OCR output, for example in workflow abstractions.

Commonly suggested engine-specific markup includes:

3.5.1. ocrx_block

Name
ocrx_block
Categories
Inline Elements, Engine-Specific Elements

ocr_carea vs ocrx_block

Generators should attempt to ensure the following properties:

3.5.2. ocrx_line

Name
ocrx_line
Categories
Inline Elements, Engine-Specific Elements

ocr_line vs ocrx_line

3.5.3. ocrx_word

Name
ocrx_word
Categories
Inline Elements, Engine-Specific Elements

4. The properties of hOCR

The properties in hOCR can be broadly categorized as follows:

General Properties

These properties can apply to most elements

Non-Recommended Properties

These properties can apply to most elements but should not be used unless there is no alternative:

Inline Properties

These properties apply to content on or below the level of ocr_line / ocrx_line

Layout Properties

These properties relate to placement of elements on the page

Font Properties

These properties convey font information

Character Properties

These properties convey character level information

Page Properties

These properties convey information on the whole page

Content Flow Properties

These properties are related to the reading order and flow of content on the page

Confidence Properties

These properties are related to the confidence of the hOCR producer that the text in the element has been correctly recognized

4.1. The baseline property

Name

baseline

Categories

Inline

Grammar
property-name = "baseline"
property-value = float int

Example
baseline 0.015 -18

This property applies primarily to textlines.

The baseline is described by a polynomial of order n with the coefficients pn ... p0 with n = 1 for a linear (i.e. straight) line.

The polynomial is in the coordinate system of the line, with the bottom left of the bounding box as the origin.

The hOCR output for the first line of eurotext.tif contains the following information:

<span class='ocr_line' id='line_1_1'
    title="bbox 105 66 823 113; baseline 0.015 -18">...</span>

bbox is the bounding box of the line in image coordinates (blue). The two numbers for the baseline are the slope (1st number) and constant term (2nd number) of a linear equation describing the baseline relative to the bottom left corner of the bounding box (red). The baseline crosses the y-axis at -18 and its slope angle is arctan(0.015) = 0.86°.

baseline explained
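The arithmetic of the example can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes image coordinates with y increasing downward, so that a negative constant term places the baseline above the bottom edge of the bounding box; this matches the worked example but is an interpretation, not a normative statement.

```python
import math


def baseline_y(bbox, baseline, x):
    """Image y coordinate of a linear baseline at image x coordinate x.

    bbox is (x0, y0, x1, y1); baseline is (slope, constant), measured from
    the bottom-left corner (x0, y1) of the bounding box.
    """
    x0, _y0, _x1, y1 = bbox
    slope, constant = baseline
    return y1 + slope * (x - x0) + constant


def baseline_angle_degrees(baseline):
    """Slope angle of a linear baseline, in degrees."""
    return math.degrees(math.atan(baseline[0]))


bbox = (105, 66, 823, 113)
baseline = (0.015, -18)
print(baseline_y(bbox, baseline, 105))              # y at the left edge
print(round(baseline_angle_degrees(baseline), 2))   # slope angle
```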

4.2. The bbox property

Name

bbox

Categories

General, Layout

Grammar
property-name = "bbox"
property-value = uint uint uint uint

Example
bbox 0 0 100 200

The bbox - short for "bounding box" - of an element is a rectangular box around this element, which is defined by the upper-left corner (x0, y0) and the lower-right corner (x1, y1).

<span class='ocr_line' id='line_1'
    title="bbox 10 20 160 30">...</span>

The bounding box bbox of this line is shown in blue; it is spanned by the upper-left corner (10, 20) and the lower-right corner (160, 30). All coordinates are measured with reference to the top-left corner of the document image, whose border is drawn in black.

bbox explained
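As a small illustration, the following Python sketch parses a bbox value and derives the width and height of the box. The function names are hypothetical, and the check that the lower-right corner is not above or to the left of the upper-left corner is an assumption drawn from the description, not a rule stated by the grammar.

```python
def parse_bbox(value):
    """Parse a bbox property value such as "10 20 160 30"."""
    parts = value.split()
    if len(parts) != 4 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"bbox needs four unsigned integers: {value!r}")
    x0, y0, x1, y1 = (int(p) for p in parts)
    # Assumed sanity check: (x0, y0) is upper-left, (x1, y1) is lower-right.
    if x1 < x0 or y1 < y0:
        raise ValueError(f"bbox corners are out of order: {value!r}")
    return x0, y0, x1, y1


def bbox_size(bbox):
    """Return (width, height) of a parsed bounding box."""
    x0, y0, x1, y1 = bbox
    return x1 - x0, y1 - y0


print(bbox_size(parse_bbox("10 20 160 30")))
```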

4.3. The cflow property

Name

cflow

Categories

Content Flow

Grammar
property-name = "cflow"
property-value = delimited-string

Example
cflow "article1"

This property relates the flow between multiple ocr_carea elements, and between ocr_carea and ocr_linear elements.

The content flow on the page that this element is a part of

4.4. The cuts property

Name

cuts

Categories

Layout, Character

Related

nlp, x_bboxes

Implied

bbox

Grammar
property-name = "cuts"
property-value = 1*(uint *1(comma uint *1(comma nint)))

Example
cuts 9 11 7,8,-2 15 3

For left-to-right writing directions, cuts are sequences of deltas in the x and y direction; the first delta in each path is an offset in the x direction relative to the last x position of the previous path. The subsequent deltas alternate between up and right moves.

Assume a bounding box of (0,0,300,100); then

cuts("10 11 7 19") =
    [ [(10,0),(10,100)], [(21,0),(21,100)], [(28,0),(28,100)], [(47,0),(47,100)] ]
cuts("10,50,3 11,30,-3") =
    [ [(10,0),(10,50),(13,50),(13,100)], [(21,0),(21,30),(18,30),(18,100)] ]
<span class="ocr_cinfo" title="bbox 0 0 300 100; nlp 1.7 2.3 3.9 2.7; cuts 9 11 7,8,-2 15 3">hello</span>

Cuts are between all codepoints contained within the element, including any whitespace and control characters. Simply use a delta of 0 (zero) for invisible codepoints.

Writing directions other than left-to-right specify cuts as if the bounding box for the element had been rotated by a multiple of 90 degrees such that the writing direction is left to right, then rotated back.

It is undefined what happens when cut paths intersect, with the exception that a delta of 0 always corresponds to an invisible codepoint.
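The worked examples above can be reproduced with the following Python sketch. It is an interpretation, not a normative algorithm: the function name expand_cuts is hypothetical, coordinates are taken relative to the element's bounding box, and the sketch follows the worked examples in measuring each path's first delta from the starting x position of the previous path. It also accepts a positive third delta, as the second example does.

```python
def expand_cuts(value, bbox):
    """Expand a cuts value into a list of paths (lists of (x, y) points)."""
    _x0, y0, _x1, y1 = bbox
    height = y1 - y0
    paths = []
    start_x = 0
    for token in value.split():
        deltas = [int(d) for d in token.split(",")]
        start_x += deltas[0]          # x offset from the previous path
        x, y = start_x, 0
        path = [(x, y)]
        for i, delta in enumerate(deltas[1:]):
            if i % 2 == 0:
                y += delta            # move along the height of the box
            else:
                x += delta            # move sideways
            path.append((x, y))
        if y != height:
            path.append((x, height))  # run out to the far edge of the box
        paths.append(path)
    return paths


print(expand_cuts("10 11 7 19", (0, 0, 300, 100)))
print(expand_cuts("10,50,3 11,30,-3", (0, 0, 300, 100)))
```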

4.5. The hardbreak property

Name

hardbreak

Categories

Inline

Grammar
property-name = "hardbreak"
property-value = "0" / "1"

Default Value

hardbreak 0

Any special characters representing the desired end-of-line processing must be present inside the ocr_line element. Examples of such special characters are a soft hyphen ("&shy;", U+00AD), a hard line break (<br>), or whitespace (" ") for soft line breaks.

4.6. The image property

Name

image

Categories

Page

Related

imagemd5, x_source

Grammar
property-name = "image"
property-value = delimited-string

Example
image "/foo/bar.png"

4.7. The imagemd5 property

Name

imagemd5

Categories

Page

Implied

image

Grammar
property-name = "imagemd5"
property-value = doublequote 32(%x41-46 / digit) doublequote
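As an illustration only, a producer might generate the property value as in the Python sketch below. The function name imagemd5_property is hypothetical. The sketch assumes that the digest is the MD5 checksum of the raw bytes of the file named by the image property, and it emits uppercase hexadecimal digits because those are what the grammar above admits.

```python
import hashlib


def imagemd5_property(image_bytes):
    """Format an imagemd5 property from the raw bytes of an image file."""
    digest = hashlib.md5(image_bytes).hexdigest().upper()
    return f'imagemd5 "{digest}"'


# In practice the bytes would come from the file named by the image property.
print(imagemd5_property(b""))
```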

4.8. The lpageno property

Name

lpageno

Categories

Page

Related

ppageno

Grammar
property-name = "lpageno"
property-value = delimited-string / uint

Example
lpageno "IV."

4.9. The ppageno property

Name

ppageno

Categories

Page

Related

lpageno

Grammar
property-name = "ppageno"
property-value = uint

Example
ppageno 7

4.10. The nlp property

Name

nlp

Categories

Confidence, Character

Related

cuts, x_confs

Implied

cuts

Grammar
property-name = "nlp"
property-value = 1*float

4.11. The order property

Name

order

Categories

Content Flow

Grammar
property-name = "order"
property-value = 1*uint

Example
order 8

The reading order of the element (an integer)

4.12. The poly property

Name

poly

Categories

Layout, Non-recommended

Grammar
property-name = "poly"
property-value = 2uint 2int *(2int)

Example
poly 0 0 0 10 10 10 10 20 0 20

A closed polygon for elements with non-rectangular bounds

4.13. The scan_res property

Name

scan_res

Categories

Page

Related

x_scanner

Grammar
property-name = "scan_res"
property-value = 2(uint)

Example
scan_res 300 300

The scanning resolution in DPI

4.14. The textangle property

Name

textangle

Categories

Layout

Grammar
property-name = "textangle"
property-value = float

Example
textangle 7.32

The angle in degrees by which textual content has been rotated relative to the rest of the page (if not present, the angle is assumed to be zero). Rotations are counter-clockwise, so an angle of 90 degrees is vertical text running from bottom to top in Latin script. Note that this is different from reading order, which should be indicated using standard HTML properties.

4.15. The x_bboxes property

Name

x_bboxes

Categories

Inline, Character

Related

cuts

Grammar
property-name = "x_bboxes"
property-value = 1*(4uint)

Example
x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 b2y0 b2x1 b2y1 ...
x_bboxes 0 0 10 10 0 10 20 20

4.16. The x_font property

Name

x_font

Categories

Font

Related

x_fsize

Grammar
property-name = "x_font"
property-value = delimited-string

Example
x_font "Comic Sans MS"

x_font is an OCR-engine specific font name (a string).

4.17. The x_fsize property

Name

x_fsize

Categories

Font

Related

x_font

Grammar
property-name = "x_fsize"
property-value = uint

Example
x_fsize 12

x_fsize is the OCR-engine specific font size (an unsigned integer).

4.18. The x_confs property

Name

x_confs

Categories

Confidence, Character

Grammar
property-name = "x_confs"
property-value = 1*float

Example
x_confs 37.3 51.23 1 100

4.19. The x_scanner property

Name

x_scanner

Categories

Page

Related

scan_res

Grammar
property-name = "x_scanner"
property-value = delimited-string

Example
x_scanner "Canon Lide 220"

A representation of the scanner

4.20. The x_source property

Name

x_source

Categories

Page

Related

image

Grammar
property-name = "x_source"
property-value = 1*delimited-string

Example
x_source "/gfs/cc/clean/012345678911" "17"
x_source "http://pageserver/012345678911&page=17"

4.21. The x_wconf property

Name

x_wconf

Categories

Confidence, Inline

Grammar
property-name = "x_wconf"
property-value = float

Example
x_wconf 97.23

5. Encoding Guidelines

5.1. Recommendations for Mappings

When possible, any mapping of logical structure onto HTML should try to follow the following rules:

Specifically

If necessary, the markup may use the following non-standard tags:

5.2. Styling hOCR with CSS

OCR information and presentation information can be separated by putting the CSS info related to the OCR in an outer element with an ocr_ or ocrx_ class, and then overriding it for the presentation by nesting another span with the actual presentation information inside that:

<span class="ocr_cinfo" style="ocr style"><span style="presentation style"> ... </span></span>

5.3. Language, Writing Direction

OCR-generated font and text color information is encoded using standard HTML and CSS attributes on elements with a class of ocr_... or ocrx_....

Language and writing direction should be indicated using the HTML standard attributes lang and dir.

Furigana and similar constructs must be represented using their correct Unicode encoding.

The HTML &lrm; and &rlm; entities (indicating writing direction) must not be used; all writing direction changes must be indicated with new tags with an appropriate dir attribute.

The CSS3 text layout attributes can be used when necessary. For example, CSS supports writing-mode, direction, glyph-orientation, [ISO15924]-based script (list of codes), text-indent, etc.

5.4. Superscript and Subscript

Superscripts and subscripts, when not in ocr_math or ocr_chem formulas, must be represented using the HTML sup and sub tags, even if special Unicode characters are available.

5.5. Whitespace

Non-breaking spaces must be represented using the HTML &nbsp; entity.

Different space widths should be indicated using the HTML entities &ensp;, &emsp;, &thinsp;, &zwnj;, and &zwj;.

5.6. Hyphenation

How to handle hyphens? <https://github.com/kba/hocr-spec/issues/7>

Non Linear Hyphens <https://github.com/altoxml/schema/issues/41>

Soft hyphens must be represented using the HTML &shy; entity.

5.7. Alternative Segmentations / Readings

Delete x_cost

Alternative segmentations and readings are indicated by a span with class="alternatives". It must contain ins and del elements. The first contained element should be ins and represent the most probable interpretation, the subsequent ones del. Each ins and del element should have class="alt" and a property of either nlp or x_cost. These span, ins, and del tags can nest arbitrarily.

<span class="alternatives">
<ins class="alt" title="nlp 0.3">hello</ins>
<del class="alt" title="nlp 1.1">hallo</del>
</span>

Whitespace within the span but outside the contained ins/del elements is ignored and should be inserted to improve readability of the HTML when viewed in a browser.
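A consumer could read such a structure along the lines of the following Python sketch, which uses the standard library html.parser module. The class and function names are hypothetical, and the sketch handles only a flat list of alternatives; nested span, ins, and del tags would need a stack, which is omitted here.

```python
from html.parser import HTMLParser


class AlternativesParser(HTMLParser):
    """Collect (tag, title, text) for each ins/del with class "alt"."""

    def __init__(self):
        super().__init__()
        self.readings = []
        self._open = None  # [tag, title, text] of the reading being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag in ("ins", "del") and "alt" in classes:
            self._open = [tag, attrs.get("title") or "", ""]

    def handle_data(self, data):
        # Text outside ins/del (e.g. line breaks) is ignored.
        if self._open is not None:
            self._open[2] += data

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            self.readings.append(tuple(self._open))
            self._open = None


def read_alternatives(html):
    parser = AlternativesParser()
    parser.feed(html)
    parser.close()
    return parser.readings


SAMPLE = """<span class="alternatives">
<ins class="alt" title="nlp 0.3">hello</ins>
<del class="alt" title="nlp 1.1">hallo</del>
</span>"""
print(read_alternatives(SAMPLE))
```

The first entry returned is the ins element, which by the convention above carries the most probable interpretation.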

5.8. Grouped Elements and Multiple Hierarchies

The different levels of layout information (logical, physical, engine-specific) each form hierarchies, but those hierarchies may not be mutually compatible; for example, a single ocr_page may contain information from multiple sections or chapters. To represent both hierarchies within a single document, elements may be grouped together. That is, two elements with the same class may be treated as one element by adding a "groupid identifier" property to them and using the same identifier.

Grouped elements should be logically consistent with the markup they represent; for example, it is probably not sensible to use grouped elements to interleave parts of two different chapters. Therefore, grouped elements should usually be adjacent in the markup.

Applications using hOCR may choose to manipulate grouped elements directly, but the simplest way of dealing with them is to transform a document with grouped elements into one without grouped elements prior to further processing by first removing tags that are not of interest for the subsequent processing step, and then collapsing grouped elements into single elements. For example, output that contains both logical and physical layout information, where the logical layout information uses grouped elements, can be transformed by removing all the physical layout information, and then collapsing all split ocr_chapter elements into single ocr_chapter elements based on the groupid. The result is a simple DOM tree. This transformation can be provided generically as a pre-processor or in JavaScript.

The presence of grouped elements does not need to be indicated in the header; when it affects their operations, hOCR processors should check for the presence of grouped elements in the output and fail with an error message if they cannot correctly process the hOCR information.

6. Metadata

The creator of the hOCR document can indicate the following information using meta tags in the head section.

ocr-system

Indicates software and version that generated the hOCR document

Every hOCR document must have exactly one ocr-system metadata field

ocr-capabilities

Features consumers of the hOCR document can expect

See § 6.2 Capabilities for possible values

Every hOCR document must have exactly one ocr-capabilities metadata field

ocr-number-of-pages

The number of ocr_page elements in the document

ocr-langs

Use ISO 639-1 codes

Value may be unknown

ocr-scripts

Use ISO 15924 letter codes

Value may be unknown
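The following Python sketch shows one way a consumer might collect these fields and check the "exactly one" requirement for ocr-system and ocr-capabilities. It makes several assumptions that are not stated above: each field is written as a meta tag with name and content attributes, as is usual in HTML; the sample values, including the system name "example-ocr 1.0", are invented for illustration; and all class and function names are hypothetical.

```python
from html.parser import HTMLParser

REQUIRED_ONCE = ["ocr-system", "ocr-capabilities"]


class OcrMetaParser(HTMLParser):
    """Collect the content of every meta tag whose name starts with "ocr-"."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if name.startswith("ocr-"):
            self.meta.setdefault(name, []).append(attrs.get("content") or "")


def read_ocr_metadata(html):
    parser = OcrMetaParser()
    parser.feed(html)
    parser.close()
    return parser.meta


def fields_not_exactly_once(meta):
    """Names of required fields that are missing or repeated."""
    return [name for name in REQUIRED_ONCE if len(meta.get(name, [])) != 1]


SAMPLE = """<html><head>
<meta name="ocr-system" content="example-ocr 1.0">
<meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_line ocrp_lang">
<meta name="ocr-number-of-pages" content="1">
</head><body></body></html>"""

meta = read_ocr_metadata(SAMPLE)
print(meta["ocr-system"])
print(fields_not_exactly_once(meta))
```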

6.1. Document metadata

For document meta information, use the Dublin Core Embedding into HTML. See also Citation Guidelines for Dublin Core.

6.2. Capabilities

Any program generating files in this output format must indicate in the document metadata what kind of markup it is capable of generating. This includes listing the exact set of markup sections that the system could have generated, even if it did not actually generate them for the particular document.

If a document lists a certain capability but no element or attribute is found that corresponds to that capability, users of the document may infer that the content is absent in the source document. If a capability is not listed, the corresponding element or attribute must not be present in the document.

The capability to generate specific properties is given by the prefix ocrp_...; the important properties are:

ocrp_lang

Capable of generating lang attributes

ocrp_dir

Capable of generating dir attributes

ocrp_poly

Capable of generating polygonal bounds

ocrp_font

Capable of generating font information (standard font information)

ocrp_nlp

Capable of generating nlp confidences

ocr_embeddedformat_<formatname>

The capability to generate other specific embedded formats is given by the prefix ocr_embeddedformat_<formatname>.

ocr_<tag>_unordered

If an OCR engine represents a particular tag but cannot determine reading order for that tag, it must specify a capability of ocr_<tag>_unordered.

6.3. Profiles - Restricting hOCR markup

hOCR provides standard means of marking up information, but it does not mandate the presence or absence of particular kinds of information. For example, an hOCR file may contain only logical markup, only physical markup, or only engine-specific markup. As a result, merely knowing that OCR output is hOCR compliant doesn’t tell us whether that file is actually useful for subsequent processing.

OCR systems can use hOCR in various ways internally, but we will eventually define some common profiles that mandate what kinds of information need to be present in particular kinds of output.

Of particular importance are:

Other possible profiles might be defined for specific engines or specific document classes:

6.4. Formats: Restricting HTML Markup

The HTML-based markup is orthogonal to the hOCR-based markup; that is, both can be chosen independently of one another. The only thing that needs to be consistent between the two markups is the text contained within the tags. hOCR and other embedded format tags can be put on HTML tags, or they can be put on their own div/span tags.

Many different choices are possible and reasonable for the HTML markup, depending on the use and further processing of the document. Each such choice must be indicated in the metadata for the document.

Many mappings derived from existing tools are quite similar, and most follow the restrictions and recommendations below already without further modifications.

Depending on the particular HTML markup used in the document, the document is suitable for different kinds of processing and use. The formats have the following intents:

html_none (see § 6.4.1 HTML without logical markup)

Straightforward equivalent of Goodoc or [XDOC]

html_simple

Target format for convenient on-line viewing and intermediate format for indexing

html_xytable_absolute, html_xytable_relative

Target format for layout-preserving on-screen document viewing

Formats defined in § 6.4.3 HTML produced by OCR engines

Straightforward recording of commercial OCR system output

Formats defined in § 6.4.4 HTML with absolute positioning

Target format for services like Google’s View as HTML

As long as a format contains the hOCR information, it can be reprocessed by layout analysis software and converted into one of the other formats. In particular, we envision layout analysis tools for converting any hOCR document into html_absolute, html_xytable_absolute, and html_simple. Furthermore, internally, a layout analysis system might use html_xytable_absolute as an intermediate format for converting hOCR into html_simple.

6.4.1. HTML without logical markup

The html_none format contains no logical markup at all; it is simply a collection of div and span elements with associated hOCR information. Note that such documents can still be rendered visually through the use of CSS.

6.4.2. HTML with limited logical elements

The html_simple format follows the restrictions and recommendations above, and only uses the following tags:

6.4.3. HTML produced by OCR engines

HTML markup produced by default by the OCR engine for the given document must be indicated with a format name that follows the template html_ocr_<engine>.

Examples of possible values are:

html_ocr_unknown

The HTML was generated by some OCR engine, but it’s unknown which one

html_ocr_finereader_8
html_ocr_textbridge_11

6.4.4. HTML with absolute positioning

html_absolute

The HTML represents absolute positioning of elements on each page.

Possible subformats are:

html_absolute_cols

absolute positioning of columns

html_absolute_pars

absolute positioning of paragraphs

html_absolute_lines

absolute positioning of lines

html_absolute_words

absolute positioning of words

html_absolute_chars

absolute positioning of characters

The "View as HTML" feature of Google Search for PDF files uses html_absolute_lines; this is probably the most reasonable choice for approximating the appearance of the original document.
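As an illustration of how absolute positioning can be derived from hOCR data, the sketch below converts the bbox property found in a title attribute into an inline CSS style. This is only one possible mapping and is not defined by this specification. The function name bbox_to_css and the scale parameter are hypothetical, and the sketch assumes that one bbox unit corresponds to one CSS pixel before scaling.

```python
import re

# Same bbox pattern as in the Appendix B sample: four non-negative integers.
BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def bbox_to_css(title, scale=1.0):
    """Turn the bbox in an hOCR title attribute into an inline CSS style.

    Returns None when the title carries no bbox property.
    """
    match = BBOX_RE.search(title)
    if match is None:
        return None
    x0, y0, x1, y1 = (int(v) for v in match.groups())
    return (
        "position:absolute;"
        f"left:{x0 * scale:g}px;top:{y0 * scale:g}px;"
        f"width:{(x1 - x0) * scale:g}px;height:{(y1 - y0) * scale:g}px"
    )


print(bbox_to_css("bbox 100 200 500 240; baseline 0 -5"))
# position:absolute;left:100px;top:200px;width:400px;height:40px
```

Applying such a style to each ocr_line element, inside a page container with position:relative, would approximate an html_absolute_lines layout.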

6.4.5. HTML as table

html_xytable

The HTML is a table that gives the XY-cut layout segmentation structure of the page in tabular form.

Note that in this format, text order does not necessarily correspond to reading order.

The format must contain one table of class ocr_xycut representing each page. The markup of the content of the table itself is as in html_simple.

Possible subformats are:

html_xytable_absolute

The table structure must represent the absolute size of the original page element.

html_xytable_relative

Table element sizes are expressed in relative terms (percentages).

6.4.6. HTML from word processors

The HTML represents markup that follows the mappings of the given document processor to HTML.

Note that the document doesn’t actually need to have been constructed in the processor and that the processor doesn’t need to have been used to generate the HTML. For example, the html_latex2html tag merely indicates that, say, a scanned and ocr’ed article uses the same conventions for logical markup tags that an equivalent article actually written in LaTeX and actually converted to HTML would have used.

html_latex2html
html_msword

HTML mapping generated by “Save As HTML”

html_ooffice

HTML mapping generated by “Save As HTML”

html_docbook_xsl

HTML mapping generated by official XSL style sheets

6.5. Example

<html>
  <head>
    <meta name="ocr-system" content="tesseract v3.03"/>
    <meta name="ocr-capabilities" content="ocr_page ocr_line ocrp_lang"/>
    <meta name="ocr-langs" content="aa la zu"/>
    <meta name="ocr-scripts" content="Arab Khmr"/>
    <meta name="ocr-number-of-pages" content="112"/>
    ...
  </head>
  ...
</html>
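Metadata of this kind can be read back with any HTML parser. The following minimal sketch uses Python's standard html.parser module to collect all meta tags whose name starts with ocr-. The names OcrMetaParser and read_ocr_metadata are hypothetical. Splitting ocr-capabilities, ocr-langs and ocr-scripts on whitespace is an assumption based on the space-separated values in the example above.

```python
from html.parser import HTMLParser


class OcrMetaParser(HTMLParser):
    """Collect <meta name="ocr-..."> tags into a dictionary."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if name.startswith("ocr-"):
            self.meta[name] = attrs.get("content") or ""


def read_ocr_metadata(hocr_text):
    """Return the ocr-* metadata of an hOCR document as a dict."""
    parser = OcrMetaParser()
    parser.feed(hocr_text)
    meta = parser.meta
    # Assumption: these fields hold space-separated lists of tokens.
    for key in ("ocr-capabilities", "ocr-langs", "ocr-scripts"):
        if key in meta:
            meta[key] = meta[key].split()
    return meta


head = """<html><head>
<meta name="ocr-system" content="tesseract v3.03"/>
<meta name="ocr-capabilities" content="ocr_page ocr_line ocrp_lang"/>
<meta name="ocr-langs" content="aa la zu"/>
<meta name="ocr-number-of-pages" content="112"/>
</head></html>"""

metadata = read_ocr_metadata(head)
print(metadata["ocr-system"])        # tesseract v3.03
print(metadata["ocr-capabilities"])  # ['ocr_page', 'ocr_line', 'ocrp_lang']
```

Values that are not lists, such as ocr-number-of-pages, are left as strings so that the caller decides how to interpret them.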

Indicate that the work this hOCR file represents:


Appendix A: Revision History

hOCR was originally developed by Thomas Breuel.

See the releases and full commit history for a revision history.

Appendix B: Sample Usage

See also the hocr-tools for more samples.

The HTML format described here may seem fairly complicated and difficult to parse, but because there are many tools for manipulating HTML documents, hOCR files are actually fairly easy to process. The following Python script shows an example:

import os
import re

import libxml2  # Python bindings shipped with libxml2

# convert the HTML to XHTML (if necessary); needs the tidy command-line tool
os.system("tidy -q -asxhtml < page.html > page.xhtml 2> /dev/null")

# parse the XML
doc = libxml2.parseFile('page.xhtml')

# search all nodes whose class attribute contains the token ocr_line
lines = doc.xpathEval(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' ocr_line ')]")

# a function for extracting the text from a node
def get_text(node):
    textnodes = node.xpathEval(".//text()")
    s = ' '.join(t.getContent() for t in textnodes)
    return re.sub(r'\s+', ' ', s)

# a function for extracting the bbox property from a node
# note that the title= attribute on a node with an ocr_ class must
# conform with the hOCR spec
def get_bbox(node):
    data = node.prop('title')
    bboxre = re.compile(r'\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)')
    return [int(x) for x in bboxre.search(data).groups()]

# this extracts all the bounding boxes and the text they contain
# it doesn't matter what other markup the line node may contain
for line in lines:
    print(get_bbox(line), get_text(line))

Note that the OCR markup, basic HTML markup, and semantic markup can co-exist within the same HTML file without interfering with one another.

Appendix C: IANA Considerations

XML namespace for hOCR HTML?

What DOCTYPE for hOCR HTML?

Media Type

In accordance with [RFC4289]

correct MIME type for hOCR?

MIME media type name:

text

MIME subtype name:

vnd.hocr+html

Required parameters:
Optional parameters:
Encoding considerations:

hOCR documents should be encoded as UTF-8

Security considerations:
Interoperability considerations:
Applications which use this media type:
File extension(s):

*.html, *.hocr
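Applications that want to recognize hOCR files by extension could register the mapping locally. The sketch below uses Python's standard mimetypes module and assumes the type and subtype named above, text/vnd.hocr+html. That value is still an open question in this appendix and should not be read as a registered media type. The constant name HOCR_MEDIA_TYPE is hypothetical.

```python
import mimetypes

# Assumption: type "text" and subtype "vnd.hocr+html" as listed in this
# appendix; the media type is not settled by this specification.
HOCR_MEDIA_TYPE = "text/vnd.hocr+html"

# Register the .hocr extension for this process only.
mimetypes.add_type(HOCR_MEDIA_TYPE, ".hocr")

print(mimetypes.guess_type("page-0001.hocr")[0])  # text/vnd.hocr+html
```

The *.html extension is deliberately left untouched, since it is shared with ordinary HTML documents.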

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

Index

Terms defined by this specification

Terms defined by reference

References

Normative References

[HTML]
Anne van Kesteren; et al. HTML Standard. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[HTML401]
Dave Raggett; Arnaud Le Hors; Ian Jacobs. HTML 4.01 Specification. 27 March 2018. REC. URL: https://www.w3.org/TR/html401/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119

Informative References

[CML]
Peter Murray-Rust; Henry Rzepa. Chemical Markup Language - CML. URL: http://www.xml-cml.org/
[ISO15924]
Code for the representation of names of scripts. International Organization for Standardization. 1998. ISO 15924:1998. Draft International Standard
[MathML]
Patrick D F Ion; Robert R Miner. Mathematical Markup Language (MathML) 1.01 Specification. 7 July 1999. REC. URL: https://www.w3.org/TR/REC-MathML/
[RFC4289]
N. Freed; J. Klensin. Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures. December 2005. Best Current Practice. URL: https://tools.ietf.org/html/rfc4289
[RFC5234]
D. Crocker, Ed.; P. Overell. Augmented BNF for Syntax Specifications: ABNF. January 2008. Internet Standard. URL: https://tools.ietf.org/html/rfc5234
[XDOC]
Daniel S. Connelly; Beth Paddock; Rebecca Harvey. XDOC DATA FORMAT. Technical Specification. May 1999. URL: https://web.archive.org/web/20160731161638/http://vividata.com/manuals/core12xdc.pdf
[XHTML1]
Steven Pemberton. XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition). 27 March 2018. REC. URL: https://www.w3.org/TR/xhtml1/

Issues Index

Reusing HTML: Some text is missing in the first paragraph <https://github.com/kba/hocr-spec/issues/96>
There is currently no way of indicating anchoring or flow-around properties for floating elements; properties need to be defined for this.
Logical Tags/classes
<https://github.com/kba/hocr-spec/issues/51>
Define ocrx_cinfo
ocr_carea vs ocrx_block
ocr_line vs ocrx_line
How to handle hyphens? <https://github.com/kba/hocr-spec/issues/7>
Non Linear Hyphens <https://github.com/altoxml/schema/issues/41>
Delete x_cost
XML namespace for hOCR HTML?
What DOCTYPE for hOCR HTML?
correct MIME type for hOCR?