Improving the quality of the output
Published: 2019-06-23


There are a variety of reasons you might not get good quality output from Tesseract. It's important to note that unless you're using a very unusual font or a new language, retraining Tesseract is unlikely to help.

Image processing

Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.

You can see how Tesseract has processed the image by setting the configuration variable tessedit_write_images to true when running Tesseract. If the resulting tessinput.tif file looks problematic, try some of these image processing operations before passing the image to Tesseract.
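As a rough sketch of how that variable might be set from a script, the Python below builds a command line that passes it with the -c option. The function names and file names are illustrative assumptions, and the tesseract executable is assumed to be on PATH.

```python
import subprocess


def build_debug_command(image_path, output_base):
    """Return a tesseract command that also writes the preprocessed image."""
    return [
        "tesseract", image_path, output_base,
        # -c sets a configuration variable for this run.
        "-c", "tessedit_write_images=true",
    ]


def run_with_debug_image(image_path, output_base):
    """Run tesseract; tessinput.tif should then appear in the working directory."""
    subprocess.run(build_debug_command(image_path, output_base), check=True)
```

Calling run_with_debug_image("page.png", "out") would run the OCR as usual and leave the intermediate image behind for inspection.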

Rescaling

Tesseract works best on images with a resolution of at least 300 dpi, so it may be beneficial to resize images.
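The arithmetic behind such a resize can be sketched as follows. This is only an illustration: the function names are made up for this example, and it assumes you know the resolution at which the image was scanned. It never shrinks an image that already meets the target.

```python
def scale_factor_for_dpi(current_dpi, target_dpi=300):
    """Return the factor needed to bring an image up to target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    # Only upscale; images at or above the target are left alone.
    return max(1.0, target_dpi / current_dpi)


def scaled_size(width, height, current_dpi, target_dpi=300):
    """Return the (width, height) in pixels after rescaling."""
    factor = scale_factor_for_dpi(current_dpi, target_dpi)
    return round(width * factor), round(height * factor)
```

For example, a 1000 x 500 pixel image scanned at 150 dpi would be resized to 2000 x 1000 pixels. The resulting size can then be handed to whatever image library you use for the actual resampling.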

Binarisation

binarisation.png

Binarisation converts an image to black and white. Tesseract does this internally, but the result can be suboptimal, particularly if the page background is of uneven darkness.
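To make the idea concrete, here is a minimal sketch of one common binarisation technique, global Otsu thresholding, written in plain Python over a greyscale image stored as rows of values from 0 to 255. It is an illustration rather than a description of what Tesseract does internally, and the function names are assumptions.

```python
def otsu_threshold(pixels):
    """Return the grey level that best separates dark and light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count, dark_sum = 0, 0
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        light_count = total - dark_count
        if dark_count == 0:
            continue
        if light_count == 0:
            break
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        # Between-class variance (up to a constant factor).
        score = dark_count * light_count * (dark_mean - light_mean) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarise(image):
    """Map every pixel to 0 (black) or 255 (white) using one global threshold."""
    t = otsu_threshold(p for row in image for p in row)
    return [[255 if p > t else 0 for p in row] for row in image]
```

A single global threshold like this suffers from exactly the uneven-background problem described above; computing a threshold per tile or per neighbourhood is the usual remedy.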

Noise Removal

noise.png

Noise is random variation of brightness or colour in an image that can make the text more difficult to read. Certain types of noise cannot be removed by Tesseract in the binarisation step, which can cause accuracy rates to drop.
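One simple way to suppress isolated specks before OCR is a median filter. The sketch below is a plain-Python 3x3 version for a greyscale image stored as rows of values; it is an illustrative assumption about a suitable filter, not something the text above prescribes, and real pipelines would normally use an image library instead.

```python
def median_filter_3x3(image):
    """Replace each pixel by the median of its 3x3 neighbourhood."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            # Neighbourhood is clipped at the image edges.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            window.sort()
            result[y][x] = window[len(window) // 2]
    return result
```

A single dark pixel on a white background disappears after one pass, while strokes several pixels wide are largely preserved.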

Rotation / Deskewing

skew-linedetection.png

A skewed image results when a page has been scanned while not straight. The quality of Tesseract's line segmentation drops significantly if a page is too skewed, which severely impacts the quality of the OCR. To address this, rotate the page image so that the text lines are horizontal.
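As a small illustration of the geometry involved, the sketch below estimates the skew angle from a handful of points sampled along one text baseline by fitting a straight line with least squares. How those points are obtained is outside the scope of this sketch, and the function name is an assumption.

```python
import math


def estimate_skew_degrees(points):
    """Return the angle, in degrees, of the line fitted through (x, y) points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("points must not all share the same x")
    # sxy / sxx is the least-squares slope; atan2 turns it into an angle.
    return math.degrees(math.atan2(sxy, sxx))
```

Rotating the page by the opposite of this angle should bring the fitted line back to horizontal. Note that in image coordinates y usually grows downwards, so check the sign convention of your rotation routine.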

Border Removal

borders.png

Scanned pages often have dark borders around them. These can be erroneously picked up as extra characters, especially if they vary in shape and gradation.
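A naive approach to removing such borders is to strip rows and columns from each edge for as long as they are mostly dark. The sketch below does this for a greyscale image stored as rows of values; the thresholds and the function name are illustrative assumptions, and irregular borders would need something more robust.

```python
def crop_dark_borders(image, dark_below=128, max_dark_fraction=0.5):
    """Strip edge rows and columns in which most pixels are dark."""

    def mostly_dark(values):
        values = list(values)
        dark = sum(1 for v in values if v < dark_below)
        return dark > max_dark_fraction * len(values)

    top, bottom = 0, len(image)
    while top < bottom and mostly_dark(image[top]):
        top += 1
    while bottom > top and mostly_dark(image[bottom - 1]):
        bottom -= 1
    rows = image[top:bottom]
    if not rows:
        return []

    left, right = 0, len(rows[0])
    while left < right and mostly_dark(row[left] for row in rows):
        left += 1
    while right > left and mostly_dark(row[right - 1] for row in rows):
        right -= 1
    return [row[left:right] for row in rows]
```

Because only the outermost rows and columns are examined, dark text inside the page is left untouched as long as no full line of it is more than half dark.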

Tools / Libraries

Examples

If you need an example of how to improve image quality programmatically, have a look at these examples:

  •  C++ example
  •  Bash script for processing a scanned document of text to clean the text background.
  •  Python script for automatic detection of rotation and line spacing of an image of text

Page segmentation method

By default Tesseract expects a page of text when it segments an image. If you only want to OCR a small region, try a different segmentation mode using the -psm argument. Adding a white border to text which is too tightly cropped may also help.
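Adding a white border is straightforward to do by hand. The sketch below pads a greyscale image stored as rows of values; the margin of 10 pixels is an arbitrary assumption, not a recommended value, and the function name is made up for this example.

```python
def add_white_border(image, margin=10, white=255):
    """Return a copy of the image surrounded by a white margin."""
    width = len(image[0]) if image else 0
    blank_row = [white] * (width + 2 * margin)
    padded = [blank_row[:] for _ in range(margin)]
    for row in image:
        padded.append([white] * margin + list(row) + [white] * margin)
    padded.extend(blank_row[:] for _ in range(margin))
    return padded
```

The original image is not modified, so the padded copy can be compared against it when checking whether the border actually improves the result.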

To see a complete list of supported page segmentation modes, use tesseract -h. Here's the list as of 3.04:

 0   Orientation and script detection (OSD) only.
 1   Automatic page segmentation with OSD.
 2   Automatic page segmentation, but no OSD, or OCR.
 3   Fully automatic page segmentation, but no OSD. (Default)
 4   Assume a single column of text of variable sizes.
 5   Assume a single uniform block of vertically aligned text.
 6   Assume a single uniform block of text.
 7   Treat the image as a single text line.
 8   Treat the image as a single word.
 9   Treat the image as a single word in a circle.
10   Treat the image as a single character.
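For instance, a cropped image containing one line of text could be run with mode 7. The sketch below only builds the command line; the helper name is an assumption. Newer Tesseract releases spell the flag --psm rather than -psm, which is why the flag is left as a parameter here.

```python
PSM_SINGLE_LINE = 7  # "Treat the image as a single text line."


def build_psm_command(image_path, output_base, psm, flag="-psm"):
    """Return a tesseract command that uses the given page segmentation mode."""
    if not 0 <= psm <= 10:
        raise ValueError("mode must be between 0 and 10 in the list above")
    return ["tesseract", image_path, output_base, flag, str(psm)]
```

The returned list can be passed to subprocess.run, for example subprocess.run(build_psm_command("line.png", "out", PSM_SINGLE_LINE), check=True).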

Dictionaries, word lists, and patterns

By default Tesseract is optimized to recognize sentences of words. If you're trying to recognize something else, like receipts, price lists, or codes, there are a few things you can do to improve the accuracy of your results, in addition to double-checking that the appropriate page segmentation mode is selected.

Disabling the dictionaries Tesseract uses should improve recognition if most of your text isn't dictionary words. They can be disabled by setting both of the configuration variables load_system_dawg and load_freq_dawg to false.
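One way to set both variables is a small config file passed as the last argument on the command line. The sketch below writes such a file; the file name no_dict and the helper names are assumptions made for this example, and how Tesseract locates config files may depend on your installation.

```python
import os


def no_dictionary_config_text():
    """Return config file contents that switch off both dictionaries."""
    # Config files hold one "variable value" pair per line; F means false.
    return "load_system_dawg F\nload_freq_dawg F\n"


def write_no_dictionary_config(directory, name="no_dict"):
    """Write the config file into directory and return its path."""
    path = os.path.join(directory, name)
    with open(path, "w", encoding="ascii") as handle:
        handle.write(no_dictionary_config_text())
    return path


def build_no_dictionary_command(image_path, output_base, config_path):
    """Return a tesseract command that loads the config file."""
    return ["tesseract", image_path, output_base, config_path]
```

Keeping the settings in a file makes it easy to reuse them across runs and to compare results with and without the dictionaries.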

It is also possible to add words to the word list Tesseract uses to help recognition, or to add common character patterns, which can further improve accuracy if you have a good idea of the sort of input you expect. This is explained in more detail in the Tesseract documentation.

If you know you will only encounter a subset of the characters available in the language, such as only digits, you can use the tessedit_char_whitelist configuration variable.
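A digits-only run might be set up as sketched below, passing the whitelist with the -c option. The helper name is an assumption. Building the command as a list avoids shell quoting problems when the whitelist contains punctuation.

```python
import string


def build_whitelist_command(image_path, output_base, allowed_chars):
    """Return a tesseract command restricted to the given characters."""
    if not allowed_chars:
        raise ValueError("allowed_chars must not be empty")
    return [
        "tesseract", image_path, output_base,
        "-c", "tessedit_char_whitelist=" + allowed_chars,
    ]


# Example: restrict recognition to the ten decimal digits.
DIGITS_ONLY = build_whitelist_command("meter.png", "out", string.digits)
```

It is still worth checking the recognised text afterwards, since a whitelist narrows the choices but does not guarantee that each character is read correctly.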

Still having problems?

If you've tried the above and are still getting low accuracy results, ask for help, ideally posting an example image.

Repost address: http://bcmko.baihongyu.com/
