
Tesseract Source Code Analysis (1): Binarization and Layout Analysis

2017-07-27  RobertY

Main data structures in Tesseract 4.0

  1. Page analysis result: PAGE_RES (ccstruct/pageres.h).
  2. A page analysis result holds a list of block analysis results: BLOCK_RES_LIST.
  3. Block analysis result: BLOCK_RES (ccstruct/pageres.h).
  4. A block analysis result holds a list of row analysis results: ROW_RES_LIST.
  5. Row analysis result: ROW_RES (ccstruct/pageres.h).
  6. A row analysis result holds a list of word analysis results: WERD_RES_LIST.
  7. Word analysis result: WERD_RES (ccstruct/pageres.h), a collection of publicly accessible members that gathers the information about a single word result.
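
The nesting described in the list above (page, then blocks, then rows, then words) can be pictured with a small Python model. This is only an illustrative sketch: the class names, field names, and the `iter_words` helper are hypothetical stand-ins, not the real C++ declarations in ccstruct/pageres.h, which use Tesseract's own list types and carry many more members.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class WordRes:
    """Hypothetical stand-in for WERD_RES: one recognized word."""
    text: str


@dataclass
class RowRes:
    """Hypothetical stand-in for ROW_RES: `words` plays the role of WERD_RES_LIST."""
    words: List[WordRes] = field(default_factory=list)


@dataclass
class BlockRes:
    """Hypothetical stand-in for BLOCK_RES: `rows` plays the role of ROW_RES_LIST."""
    rows: List[RowRes] = field(default_factory=list)


@dataclass
class PageRes:
    """Hypothetical stand-in for PAGE_RES: `blocks` plays the role of BLOCK_RES_LIST."""
    blocks: List[BlockRes] = field(default_factory=list)


def iter_words(page: PageRes) -> Iterator[WordRes]:
    """Walk the page top-down: block by block, row by row, word by word."""
    for block in page.blocks:
        for row in block.rows:
            for word in row.words:
                yield word
```

Walking a result therefore means descending three levels of lists before reaching the per-word information, which is the shape the seven items above describe.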

Source code analysis

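Tesseract's text recognition pipeline consists of four main stages: binarization, segmentation, recognition, and error correction. This article summarizes the first two, binarization and the segmentation (preprocessing) stage.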

Page layout analysis steps

Binarization

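Otsu's method is a global binarization algorithm: it picks a single threshold for the whole image. If the image contains shadows and the shading is uneven, a single global threshold performs poorly. OCRus uses a local binarization algorithm, Wolf-Jolion, which also gives reasonably good results on images with shadows.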
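
To make the "global" part concrete, here is a minimal pure-Python sketch of Otsu thresholding on a flat list of 8-bit gray values. It is not Tesseract's own code and it does not implement the local Wolf-Jolion method; the function names `otsu_threshold` and `binarize` are chosen for illustration only.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the gray level t (0..255) that maximizes the between-class
    variance when pixels <= t form one class and pixels > t the other."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0     # number of pixels with value <= t
    sum0 = 0   # sum of the values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance up to a constant factor 1 / total**2.
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map dark pixels (<= threshold) to 1 (ink) and the rest to 0."""
    return [1 if p <= threshold else 0 for p in pixels]
```

Because the same threshold is applied everywhere, a page whose background is bright on one side and shadowed on the other can end up with the shadowed background classified as ink. A local method instead derives a threshold per pixel from statistics of a surrounding window.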

Segmentation

Remove vertical lines

This step removes vertical and horizontal lines in the image.
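
As a rough illustration of the idea, the sketch below clears long straight runs of foreground pixels from a binary image stored as a list of rows. It is a simplified run-length stand-in written for this article, not Tesseract's actual line finder; `remove_lines` and the `min_len` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # 1 = foreground (ink), 0 = background


def _long_runs(values: Sequence[int], min_len: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges of foreground runs of length >= min_len."""
    runs = []
    start = None
    # A trailing 0 acts as a sentinel that closes a run touching the border.
    for i, v in enumerate(list(values) + [0]):
        if v and start is None:
            start = i
        elif not v and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


def remove_lines(img: Image, min_len: int) -> Image:
    """Return a copy of img with long horizontal and vertical runs erased."""
    height = len(img)
    width = len(img[0]) if height else 0
    out = [row[:] for row in img]

    # Horizontal lines: long runs inside a single row.
    for y in range(height):
        for start, end in _long_runs(img[y], min_len):
            for x in range(start, end):
                out[y][x] = 0

    # Vertical lines: long runs inside a single column.
    for x in range(width):
        column = [img[y][x] for y in range(height)]
        for start, end in _long_runs(column, min_len):
            for y in range(start, end):
                out[y][x] = 0
    return out
```

With `min_len` chosen well above the expected character height and width, rulings and table borders are erased while ordinary glyphs are left in place.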

Remove images

This step removes embedded images (non-text picture regions) from the page.

Filter connected component

This step generates all the connected components and filters out the noise blobs.
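
A minimal sketch of this step, assuming a binary image stored as a list of rows: label 8-connected components with a breadth-first search, then drop the ones that are too small. The pixel-count criterion and the names `connected_components`, `filter_noise`, and `min_pixels` are assumptions made for illustration; Tesseract's real filter uses its own size heuristics.

```python
from collections import deque
from typing import List, Tuple

Image = List[List[int]]            # 1 = foreground (ink), 0 = background
Component = List[Tuple[int, int]]  # list of (y, x) pixel coordinates


def connected_components(img: Image) -> List[Component]:
    """Group foreground pixels into 8-connected components."""
    height = len(img)
    width = len(img[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []

    for y in range(height):
        for x in range(width):
            if not img[y][x] or seen[y][x]:
                continue
            # Flood-fill one component starting from this unvisited pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            component = []
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(component)
    return components


def filter_noise(components: List[Component], min_pixels: int) -> List[Component]:
    """Keep only components with at least min_pixels pixels."""
    return [c for c in components if len(c) >= min_pixels]
```

Isolated specks produce components of one or two pixels and are discarded, while character-sized blobs survive and are passed on to the later layout steps.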

Finding candidate tab-stop components

Finding the column layout

Finding the regions
