Version: 10.0

Optical character recognition (OCR)

webPDF uses the integrated toolbox Tesseract OCR, version 5.x, for optical character recognition (OCR), i.e. for converting graphic formats into PDF documents with text content. This toolbox is used by the OCR web service:

HTTP REST API: `POST /ocr/{documentId}`

The external toolbox is located in the tesseract/ subdirectory, found in webPDF's installation directory.

The OCR web service is used to convert graphic documents in TIFF, JPEG or PNG format into PDF documents or to export the recognized text as plain text or XML files.

Output formats

If the source file is an image and pdf is selected as the output format (see the outputFormat parameter below), the resulting PDF document shows the image and carries the text extracted via OCR behind it, in a PDF layer. This makes the document searchable, so it can, for example, be indexed.

If xml is selected as the output format, the resulting file contains the recognized text together with the position of each piece of text found. The XML document is created in hOCR format.

In contrast, the text output format only creates a simple, unformatted text file with the found content.

{
  "ocr": {
    "checkResolution": false,
    "failOnWarning": false,
    "forceEachPage": false,
    "imageDpi": 200,
    "jpegQuality": 75,
    "language": "eng",
    "normalizePageRotation": false,
    "ocrMode": "pageSegments",
    "outputFormat": "pdf"
  }
}
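A minimal sketch of how the parameter set above could be turned into a request for `POST /ocr/{documentId}`. The base URL, the document ID and the helper name `build_ocr_request` are assumptions made for illustration; authentication and the upload of the source document are not covered here. The function only builds the request object and does not send it.

```python
import json
import urllib.request

# Parameter set copied from the example above.
DEFAULT_OCR = {
    "checkResolution": False,
    "failOnWarning": False,
    "forceEachPage": False,
    "imageDpi": 200,
    "jpegQuality": 75,
    "language": "eng",
    "normalizePageRotation": False,
    "ocrMode": "pageSegments",
    "outputFormat": "pdf",
}


def build_ocr_request(base_url, document_id, **overrides):
    """Build (but do not send) a POST request for the OCR web service.

    base_url and document_id are hypothetical values supplied by the caller.
    """
    unknown = sorted(set(overrides) - set(DEFAULT_OCR))
    if unknown:
        raise ValueError(f"unknown OCR parameters: {unknown}")
    body = {"ocr": {**DEFAULT_OCR, **overrides}}
    url = f"{base_url.rstrip('/')}/ocr/{document_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

For example, `build_ocr_request(base_url, document_id, outputFormat="text")` would request plain text output instead of a PDF document. Passing the returned object to `urllib.request.urlopen` would send it.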

Existing PDF documents

The OCR web service can also work with existing PDF documents. In this case, all pages with no textual content will be OCRed. These pages must consist of a single graphic. This is often the case, for example, when documents are created by scanners and these documents are not saved as a graphic format but as a PDF document.

As soon as text content is present on a page, that page is not processed using OCR. However, you can use the parameter forceEachPage to force this to happen anyway.

note

If the source file is a PDF document, the output format will always be a PDF document. The outputFormat parameter will be ignored.
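The rules in this section can be sketched as two small functions. This is an illustration of the documented behavior, not the service's implementation: the function names are hypothetical, and the requirement that a page consist of a single graphic is not modeled.

```python
def effective_output_format(source_is_pdf, requested):
    """Return the output format that is actually produced."""
    if requested not in {"pdf", "xml", "text"}:
        raise ValueError(f"unsupported output format: {requested!r}")
    # For PDF sources, the outputFormat parameter is ignored.
    return "pdf" if source_is_pdf else requested


def pages_to_ocr(page_has_text, force_each_page=False):
    """Return the 1-based numbers of the pages that would be OCRed.

    page_has_text holds one boolean per page: True if the page already
    contains text content.
    """
    return [
        number
        for number, has_text in enumerate(page_has_text, start=1)
        if force_each_page or not has_text
    ]
```

With a three-page PDF whose first page already contains text, `pages_to_ocr([True, False, False])` selects pages 2 and 3, and setting `force_each_page=True` selects all three.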

Output quality

The "Tesseract OCR" toolbox is a freely available OCR engine. It delivers good recognition results if the source graphics have a resolution of at least 200 DPI. Nevertheless, bear in mind that recognition is never error-free. Graphics with a resolution of less than 200 DPI often lead to poor results (use checkResolution and imageDpi to enforce a minimum).
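As a rough client-side check before submitting a scan, the resolution can be estimated from the pixel dimensions and the physical page size. This is only a sketch under the assumption that the physical size is known; it does not describe how the checkResolution parameter works internally, and the function names are hypothetical.

```python
MIN_DPI = 200  # minimum resolution recommended above


def estimated_dpi(pixels, inches):
    """Estimate the resolution along one axis."""
    if inches <= 0:
        raise ValueError("physical size must be positive")
    return pixels / inches


def meets_minimum(width_px, height_px, width_in, height_in, min_dpi=MIN_DPI):
    """Return True if both axes reach the minimum resolution."""
    lowest = min(
        estimated_dpi(width_px, width_in),
        estimated_dpi(height_px, height_in),
    )
    return lowest >= min_dpi
```

An A4 page measures about 8.27 x 11.69 inches, so a scan of 1240 x 1754 pixels corresponds to roughly 150 DPI and would fail this check.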

note

The OCR engine does not support handwriting or fonts that imitate handwriting.

OCR needs a certain amount of text on a page to recognize characters reliably. Recognition may therefore fail for individual pages that contain only a few lines of text. You can have such cases reported as errors (see the failOnWarning parameter).

Use normalizePageRotation to work on rotated pages and to “normalize” the page, i.e. to rotate the page so that the text does not appear rotated in the target document.

It is also important to specify the language of the source document when using the web service so that language-specific characters (such as ö, ä and ü in German) are recognized. At present, the following languages are supported (see the language parameter):

English
French
Spanish
German
Italian

Further languages can be set up if they are stored in the tesseract/tessdata folder and a corresponding entry is added in the tesseract/languages.xml file.
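The following sketch checks whether the trained data files for a set of languages are present in a tessdata folder. It assumes Tesseract's usual naming scheme (`<code>.traineddata`) and Tesseract's standard three-letter codes; only `eng` appears in the examples in this document, so the other codes in the mapping are assumptions. The format of languages.xml is not covered here.

```python
from pathlib import Path

# Tesseract's standard codes for the languages listed above (assumed to be
# the values expected by the language parameter; only "eng" is confirmed).
LANGUAGE_CODES = {
    "English": "eng",
    "French": "fra",
    "Spanish": "spa",
    "German": "deu",
    "Italian": "ita",
}


def missing_traineddata(tessdata_dir, codes):
    """Return the language codes that have no .traineddata file."""
    root = Path(tessdata_dir)
    return [
        code for code in codes
        if not (root / f"{code}.traineddata").is_file()
    ]
```

Calling `missing_traineddata("tesseract/tessdata", LANGUAGE_CODES.values())` from the installation directory would list the codes whose data files are absent.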

caution

At present, languages that use a "Multibyte Character Set" (MBCS) are not supported. These include, for example, Asian and Arabic languages.

Image optimization

The quality of the OCR depends very much on the quality of the original graphics. The optimization element allows you to correct the images to optimize them for OCR.

caution

The optimization requires additional computing power, which can significantly slow down the entire OCR process.

Except for the deskew parameter, none of the optimizations are visible in the generated PDF documents. The other optimizations only help to improve text recognition and have no visual impact on the PDF document.

You can find a description of the individual optimization parameters in the API description.

{
  "ocr": {
    "checkResolution": false,
    "failOnWarning": false,
    "forceEachPage": false,
    "imageDpi": 200,
    "jpegQuality": 75,
    "language": "eng",
    "normalizePageRotation": false,
    "ocrMode": "pageSegments",
    "optimization": {
      "deskew": true,
      "despeckle": true,
      "edgeAccentuation": "low",
      "edgeAccentuationValue": 100,
      "gammaCorrection": "off",
      "gammaCorrectionValue": 0,
      "increaseContrast": "off",
      "increaseContrastValue": 0,
      "medianFilter": "medium",
      "medianFilterValue": 1,
      "noiseReduction": "low",
      "noiseReductionValue": 1,
      "reduceDithering": false,
      "sharpen": "low",
      "sharpenValue": 1
    },
    "outputFormat": "pdf"
  }
}
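Because optimization slows down processing, a client might attach the optimization element only when it is needed. The sketch below does that and rejects misspelled option names. The helper names are hypothetical, the key list is taken from the example above, and whether the service accepts a partial optimization element is an assumption.

```python
# Option names taken from the example above.
KNOWN_OPTIMIZATION_KEYS = frozenset({
    "deskew", "despeckle",
    "edgeAccentuation", "edgeAccentuationValue",
    "gammaCorrection", "gammaCorrectionValue",
    "increaseContrast", "increaseContrastValue",
    "medianFilter", "medianFilterValue",
    "noiseReduction", "noiseReductionValue",
    "reduceDithering",
    "sharpen", "sharpenValue",
})


def with_optimization(ocr_params, **optimization):
    """Return an OCR payload, adding "optimization" only if options are given."""
    unknown = sorted(set(optimization) - KNOWN_OPTIMIZATION_KEYS)
    if unknown:
        raise ValueError(f"unknown optimization options: {unknown}")
    merged = dict(ocr_params)  # copy so the caller's dict is not modified
    if optimization:
        merged["optimization"] = dict(optimization)
    return {"ocr": merged}


def changes_page_appearance(optimization):
    """Only deskew is visible in the generated PDF document."""
    return bool(optimization.get("deskew", False))
```

For instance, `with_optimization({"language": "deu"}, deskew=True)` yields a payload whose optimization element contains only deskew, while calling it without options leaves the element out entirely.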
