Tesseract Source Code Analysis (2): Recognition and Error Correction
Main data structures in Tesseract 4.0
- Page analysis result: PAGE_RES (ccstruct/pageres.h).
- The page analysis result contains a field holding a list of block analysis results: BLOCK_RES_LIST.
- Block analysis result: BLOCK_RES (ccstruct/pageres.h).
- The block analysis result contains a field holding a list of row analysis results: ROW_RES_LIST.
- Row analysis result: ROW_RES (ccstruct/pageres.h).
- The row analysis result contains a field holding a list of word analysis results: WERD_RES_LIST.
- WERD_RES (ccstruct/pageres.h) is a collection of publicly accessible members that gathers information about a word result.
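The nesting described in this list can be pictured with a small sketch. The structs below are simplified, hypothetical stand-ins (every name ending in Sketch, plus WordAt and CollectWords, is invented for illustration). They are not the real definitions from ccstruct/pageres.h, which use Tesseract's own list types and carry many more members.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the result hierarchy. The real
// classes in ccstruct/pageres.h use Tesseract's own list types and carry
// many more members.
struct WordResSketch {
  std::string text;  // recognized text of one word
};

struct RowResSketch {
  std::vector<WordResSketch> words;  // plays the role of WERD_RES_LIST
};

struct BlockResSketch {
  std::vector<RowResSketch> rows;  // plays the role of ROW_RES_LIST
};

struct PageResSketch {
  std::vector<BlockResSketch> blocks;  // plays the role of BLOCK_RES_LIST
};

// Look up one word by block, row, and word index.
const WordResSketch& WordAt(const PageResSketch& page, std::size_t block,
                            std::size_t row, std::size_t word) {
  assert(block < page.blocks.size());
  assert(row < page.blocks[block].rows.size());
  assert(word < page.blocks[block].rows[row].words.size());
  return page.blocks[block].rows[row].words[word];
}

// Walk page -> block -> row -> word and collect the word texts in order.
std::vector<std::string> CollectWords(const PageResSketch& page) {
  std::vector<std::string> out;
  for (const BlockResSketch& block : page.blocks) {
    for (const RowResSketch& row : block.rows) {
      for (const WordResSketch& word : row.words) {
        out.push_back(word.text);
      }
    }
  }
  return out;
}
```

The three nested loops mirror the page, block, row, word order in which the result lists contain each other.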
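Source code analysis
Tesseract's text recognition pipeline consists of the following main steps: binarization, segmentation, recognition, and error correction. The previous article covered binarization and segmentation; this article covers recognition and error correction.
Character recognition
pass 1 recognition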
Classify the blobs in the word and permute the results. Find the worst blob in the word and chop it up. Continue this process until a good answer has been found or all the blobs have been chopped up enough. The results are returned in the WERD_RES.
- Call stack
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- Tesseract::recog_all_words [ccmain/control.cpp] ->
- Tesseract::RecogAllWordsPassN [ccmain/control.cpp] ->
- Tesseract::classify_word_and_language [ccmain/control.cpp] ->
- Tesseract::classify_word_pass1 [ccmain/control.cpp] ->
- Tesseract::match_word_pass_n [ccmain/control.cpp] ->
- Tesseract::tess_segment_pass_n [ccmain/tessbox.cpp] ->
- **Wordrec::set_pass1() [wordrec/tface.cpp]** ->
- Tesseract::recog_word [ccmain/tfacepp.cpp] ->
- Tesseract::recog_word_recursive [ccmain/tfacepp.cpp] ->
- Wordrec::cc_recog [wordrec/tface.cpp] ->
- Wordrec::chop_word_main [wordrec/chopper.cpp]
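The classify-then-chop loop described for pass 1 can be sketched as a toy model. Everything here is hypothetical: BlobSketch, the badness score, the width-based chop limit, and the assumption that each half of a chopped blob is half as bad as the original. The sketch only mirrors the control flow (find the worst blob, chop it, repeat until the word is good or nothing can be chopped further), not the actual chopper in wordrec/chopper.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy blob: all fields are assumptions made for this sketch.
struct BlobSketch {
  double badness;  // higher means the classifier liked this blob less
  int width;       // a blob narrower than 2 units cannot be chopped again
};

// In this toy, a word is "good" when no blob exceeds the threshold.
bool WordIsGood(const std::vector<BlobSketch>& blobs, double threshold) {
  for (const BlobSketch& b : blobs) {
    if (b.badness > threshold) return false;
  }
  return true;
}

// Index of the worst blob that is still too bad and wide enough to chop.
// Returns blobs.size() when there is no such blob.
std::size_t WorstChoppableBlob(const std::vector<BlobSketch>& blobs,
                               double threshold) {
  std::size_t worst = blobs.size();  // sentinel: nothing found yet
  for (std::size_t i = 0; i < blobs.size(); ++i) {
    if (blobs[i].width < 2 || blobs[i].badness <= threshold) continue;
    if (worst == blobs.size() || blobs[i].badness > blobs[worst].badness) {
      worst = i;
    }
  }
  return worst;
}

// Chop the worst blob repeatedly until the word is good or nothing is left
// to chop. Returns the number of chops performed.
int ChopUntilGood(std::vector<BlobSketch>* blobs, double threshold) {
  assert(blobs != nullptr);
  int chops = 0;
  while (!WordIsGood(*blobs, threshold)) {
    const std::size_t worst = WorstChoppableBlob(*blobs, threshold);
    if (worst == blobs->size()) break;  // nothing left to chop
    const BlobSketch original = (*blobs)[worst];
    // Toy assumption: each half is half as bad as the original blob.
    const BlobSketch left = {original.badness / 2, original.width / 2};
    const BlobSketch right = {original.badness / 2,
                              original.width - left.width};
    (*blobs)[worst] = left;
    blobs->insert(blobs->begin() + static_cast<std::ptrdiff_t>(worst) + 1,
                  right);
    ++chops;
  }
  return chops;
}
```

The loop always terminates because every chop replaces one blob with two strictly narrower ones, and blobs narrower than 2 units are never chopped.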
pass 2 recognition
Pass 1 and pass 2 follow the same processing path; the step where they differ is the pass-setup call, shown in bold in the call stacks.
- Call stack
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- Tesseract::recog_all_words [ccmain/control.cpp] ->
- Tesseract::RecogAllWordsPassN [ccmain/control.cpp] ->
- Tesseract::classify_word_and_language [ccmain/control.cpp] ->
- Tesseract::classify_word_pass2 [ccmain/control.cpp] ->
- Tesseract::match_word_pass_n [ccmain/control.cpp] ->
- Tesseract::tess_segment_pass_n [ccmain/tessbox.cpp] ->
- **Wordrec::set_pass2() [wordrec/tface.cpp]** ->
- Tesseract::recog_word [ccmain/tfacepp.cpp] ->
- Tesseract::recog_word_recursive [ccmain/tfacepp.cpp] ->
- Wordrec::cc_recog [wordrec/tface.cpp] ->
- Wordrec::chop_word_main [wordrec/chopper.cpp]
LSTM recognition (part of pass 1 recognition)
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- Tesseract::recog_all_words [ccmain/control.cpp] ->
- Tesseract::RecogAllWordsPassN [ccmain/control.cpp] ->
- Tesseract::classify_word_and_language [ccmain/control.cpp] ->
- Tesseract::classify_word_pass1 [ccmain/control.cpp] ->
- Tesseract::LSTMRecognizeWord [ccmain/linerec.cpp] ->
- LSTMRecognizer::RecognizeLine [lstm/lstmrecognizer.cpp] ->
- LSTMRecognizer::RecognizeLine [lstm/lstmrecognizer.cpp] ->
- Tesseract::SearchWords [ccmain/linerec.cpp]
The remaining passes are only required in Tess-only mode.
pass 3 recognition
Walk over the page finding sequences of words joined by fuzzy spaces. Extract them as a sublist, process the sublist to find the optimal arrangement of spaces then replace the sublist in the ROW_RES.
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- Tesseract::recog_all_words [ccmain/control.cpp] ->
- Tesseract::fix_fuzzy_spaces [ccmain/fixspace.cpp] ->
- Tesseract::fix_sp_fp_word [ccmain/fixspace.cpp] ->
- Tesseract::fix_fuzzy_space_list [ccmain/fixspace.cpp]
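The first half of this pass, finding sequences of words joined by fuzzy spaces, can be sketched as a run-finding scan over one row. FuzzyWordSketch, its fuzzy_gap_after flag, and FindFuzzyRuns are hypothetical names for this illustration; how Tesseract actually marks a space as fuzzy and how it then searches for the optimal arrangement of spaces is not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a word in a row.
struct FuzzyWordSketch {
  std::string text;
  bool fuzzy_gap_after;  // assumption: true if the gap to the next word is uncertain
};

// Returns half-open index ranges [begin, end) of maximal runs of two or more
// words that are joined by fuzzy gaps. Each range is a candidate sublist.
std::vector<std::pair<std::size_t, std::size_t>> FindFuzzyRuns(
    const std::vector<FuzzyWordSketch>& row) {
  std::vector<std::pair<std::size_t, std::size_t>> runs;
  std::size_t i = 0;
  while (i < row.size()) {
    std::size_t last = i;
    // Extend the run while the current word is joined to its successor.
    while (last + 1 < row.size() && row[last].fuzzy_gap_after) ++last;
    assert(last < row.size());
    if (last > i) runs.emplace_back(i, last + 1);
    i = last + 1;
  }
  return runs;
}
```

Each returned range could then be extracted, re-spaced, and written back in place of the original words, which is the replace step the description mentions.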
pass 4 recognition
dictionary_correction_pass
If a word has multiple alternates check if the best choice is in the dictionary. If not, replace it with an alternate that exists in the dictionary.
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- Tesseract::recog_all_words [ccmain/control.cpp] ->
- Tesseract::dictionary_correction_pass [ccmain/control.cpp]
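The rule stated for this pass can be sketched as follows. This is a minimal illustration, not the code in ccmain/control.cpp: the function name, the use of plain strings for choices, and the policy of taking the first alternate (in rank order) that is in the dictionary are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// choices[0] is the current best choice; the rest are alternates in rank
// order. If the best choice is not a dictionary word, promote the first
// alternate that is. Returns true if the best choice was replaced.
bool CorrectWithDictionarySketch(
    std::vector<std::string>* choices,
    const std::unordered_set<std::string>& dictionary) {
  assert(choices != nullptr);
  if (choices->size() < 2) return false;  // no alternates to fall back on
  if (dictionary.count(choices->front()) > 0) return false;  // already valid
  for (std::size_t i = 1; i < choices->size(); ++i) {
    if (dictionary.count((*choices)[i]) > 0) {
      std::swap((*choices)[0], (*choices)[i]);
      return true;
    }
  }
  return false;  // no alternate is in the dictionary; keep the best choice
}
```

A word with no alternates, or whose best choice is already in the dictionary, is left untouched.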
bigram_correction_pass
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- Tesseract::recog_all_words [ccmain/control.cpp] ->
- Tesseract::bigram_correction_pass [ccmain/control.cpp]
pass 5 recognition
Gather statistics on rejects.
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- Tesseract::recog_all_words [ccmain/control.cpp] ->
- Tesseract::rejection_passes [ccmain/control.cpp] ->
- REJMAP::rej_word_bad_quality [ccstruct/rejctmap.cpp]
pass 6 recognition
Run the whole-document or whole-block rejection pass.
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- Tesseract::recog_all_words [ccmain/control.cpp] ->
- Tesseract::rejection_passes [ccmain/control.cpp] ->
- Tesseract::quality_based_rejection [ccmain/docqual.cpp] ->
- Tesseract::doc_and_block_rejection [ccmain/docqual.cpp] ->
- reject_whole_page [ccmain/docqual.cpp] ->
- REJMAP::rej_word_block_rej [ccstruct/rejctmap.cpp]
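One way to picture how the statistics from pass 5 could feed a whole-block rejection in pass 6 is the sketch below. It is an assumption-laden illustration: WordStatSketch, the percentage-based rule, and the max_reject_percent parameter are hypothetical, and the real decision logic in ccmain/docqual.cpp may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-word statistics gathered by an earlier pass.
struct WordStatSketch {
  int chars;           // number of characters in the word
  int rejected_chars;  // how many of them are currently rejected
};

// Percentage of rejected characters across all words (0 for an empty block).
double RejectPercent(const std::vector<WordStatSketch>& words) {
  int chars = 0;
  int rejected = 0;
  for (const WordStatSketch& w : words) {
    chars += w.chars;
    rejected += w.rejected_chars;
  }
  if (chars == 0) return 0.0;
  return 100.0 * rejected / chars;
}

// If too much of the block is already rejected, reject all of it.
// Returns true if the whole block was rejected.
bool RejectBlockIfBad(std::vector<WordStatSketch>* words,
                      double max_reject_percent) {
  assert(words != nullptr);
  if (RejectPercent(*words) <= max_reject_percent) return false;
  for (WordStatSketch& w : *words) w.rejected_chars = w.chars;
  return true;
}
```

The same shape would apply at document level by feeding every word on the page into the check instead of the words of a single block.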
The source code does not appear to contain a pass 7.
pass 8 recognition
Smooth the fonts for the document.
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- Tesseract::recog_all_words [ccmain/control.cpp] ->
- Tesseract::font_recognition_pass [ccmain/control.cpp]
pass 9 recognition
Check the correctness of the final results.
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- Tesseract::recog_all_words [ccmain/control.cpp] ->
- Tesseract::blamer_pass [ccmain/control.cpp] ->
- Tesseract::script_pos_pass [ccmain/control.cpp]
After all recognition passes, Tesseract removes empty words, since they break the result iterators.
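The cleanup step can be sketched with the standard erase-remove idiom. Here a word result is reduced to its text for illustration; RemoveEmptyWordsSketch is a hypothetical helper, not a Tesseract function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Drop every empty word so later iteration never lands on one.
// Returns how many words were removed.
std::size_t RemoveEmptyWordsSketch(std::vector<std::string>* words) {
  assert(words != nullptr);
  const std::size_t before = words->size();
  words->erase(std::remove_if(words->begin(), words->end(),
                              [](const std::string& w) { return w.empty(); }),
               words->end());
  return before - words->size();
}
```

The relative order of the remaining words is preserved, which matters because the iterators walk the results in reading order.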
Paragraph detection
This is called after rows have been identified and words are recognized. Much of this could be implemented before word recognition, but text helps to identify bulleted lists and gives good signals for sentence boundaries.
pass 1 detection
Detect sequences of lines that all contain leader dots (.....). These are likely tables of contents. If there are three text lines in a row with leader dots, it's pretty safe to say the middle one should be a paragraph of its own.
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- TessBaseAPI::DetectParagraphs [api/baseapi.cpp] ->
- DetectParagraphs [ccmain/paragraphs.cpp] ->
- DetectParagraphs [ccmain/paragraphs.cpp] ->
- SeparateSimpleLeaderLines [ccmain/paragraphs.cpp] ->
- LeftoverSegments [ccmain/paragraphs.cpp]
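The three-in-a-row rule for leader dots can be sketched on plain strings. This is a simplified illustration, not the logic of SeparateSimpleLeaderLines: the helper names are hypothetical, and treating "a run of at least four consecutive dots" as leader dots is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: a line has leader dots if it contains at least min_dots
// consecutive '.' characters.
bool HasLeaderDotsSketch(const std::string& line, std::size_t min_dots = 4) {
  assert(min_dots > 0);
  std::size_t run = 0;
  for (char c : line) {
    run = (c == '.') ? run + 1 : 0;
    if (run >= min_dots) return true;
  }
  return false;
}

// Mark line i as a paragraph of its own when lines i-1, i, and i+1 all
// contain leader dots.
std::vector<bool> MarkLeaderParagraphsSketch(
    const std::vector<std::string>& lines) {
  std::vector<bool> standalone(lines.size(), false);
  for (std::size_t i = 1; i + 1 < lines.size(); ++i) {
    if (HasLeaderDotsSketch(lines[i - 1]) && HasLeaderDotsSketch(lines[i]) &&
        HasLeaderDotsSketch(lines[i + 1])) {
      standalone[i] = true;
    }
  }
  return standalone;
}
```

Only the middle line of each triple is marked, so the first and last entries of a table of contents stay undecided for later passes.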
pass 2a detection
Find any strongly evidenced start-of-paragraph lines. If they're followed by two lines that look like body lines, make a paragraph model for that and see if that model applies throughout the text (that is, "smear" it).
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- TessBaseAPI::DetectParagraphs [api/baseapi.cpp] ->
- DetectParagraphs [ccmain/paragraphs.cpp] ->
- DetectParagraphs [ccmain/paragraphs.cpp] ->
- StrongEvidenceClassify [ccmain/paragraphs.cpp]
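The idea of building a paragraph model from a strong start line and then "smearing" it over the text can be sketched as below. This is a toy under stated assumptions, not StrongEvidenceClassify: a line is reduced to an indent and a strong-start flag, a model is just a pair of indents, and body lines are recognized purely by matching indent. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy line: only the features this sketch needs.
struct LineSketch {
  int indent;         // left indent of the line
  bool strong_start;  // assumption: strong start-of-paragraph evidence exists
};

// Toy paragraph model: indent of the first line and of the body lines.
struct ModelSketch {
  int first_indent;
  int body_indent;
};

enum class LineTypeSketch { kUnknown, kStart, kBody };

// Find a strong start line followed by two lines that look like body lines
// (here: no start evidence, equal indent, different from the start indent).
bool FindModelSketch(const std::vector<LineSketch>& lines, ModelSketch* model) {
  assert(model != nullptr);
  for (std::size_t i = 0; i + 2 < lines.size(); ++i) {
    if (!lines[i].strong_start) continue;
    if (lines[i + 1].strong_start || lines[i + 2].strong_start) continue;
    if (lines[i + 1].indent != lines[i + 2].indent) continue;
    if (lines[i + 1].indent == lines[i].indent) continue;
    model->first_indent = lines[i].indent;
    model->body_indent = lines[i + 1].indent;
    return true;
  }
  return false;
}

// "Smear" the model: label every line whose indent fits the model.
std::vector<LineTypeSketch> SmearModelSketch(
    const std::vector<LineSketch>& lines, const ModelSketch& model) {
  std::vector<LineTypeSketch> types(lines.size(), LineTypeSketch::kUnknown);
  for (std::size_t i = 0; i < lines.size(); ++i) {
    if (lines[i].indent == model.first_indent) {
      types[i] = LineTypeSketch::kStart;
    } else if (lines[i].indent == model.body_indent) {
      types[i] = LineTypeSketch::kBody;
    }
  }
  return types;
}
```

Lines that fit neither indent stay unknown, which corresponds to the leftover segments that the later passes reprocess.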
pass 2b detection
If we had any luck in pass 2a, we got part of the page and didn't know how to classify a few runs of rows. Take the segments that didn't find a model and reprocess them individually.
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- TessBaseAPI::DetectParagraphs [api/baseapi.cpp] ->
- DetectParagraphs [ccmain/paragraphs.cpp] ->
- DetectParagraphs [ccmain/paragraphs.cpp] ->
- LeftoverSegments [ccmain/paragraphs.cpp] ->
- StrongEvidenceClassify [ccmain/paragraphs.cpp]
pass 3 detection
These are the dregs for which we didn't have enough strong textual and geometric clues to form matching models. Let's see if the geometric clues are simple enough that we could just use those.
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- TessBaseAPI::DetectParagraphs [api/baseapi.cpp] ->
- DetectParagraphs [ccmain/paragraphs.cpp] ->
- DetectParagraphs [ccmain/paragraphs.cpp] ->
- LeftoverSegments [ccmain/paragraphs.cpp] ->
- GeometricClassify [ccmain/paragraphs.cpp] ->
- DowngradeWeakestToCrowns [ccmain/paragraphs.cpp]
pass 4 detection
Take everything that's still not marked up well and clear all markings.
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- TessBaseAPI::DetectParagraphs [api/baseapi.cpp] ->
- DetectParagraphs [ccmain/paragraphs.cpp] ->
- DetectParagraphs [ccmain/paragraphs.cpp] ->
- LeftoverSegments [ccmain/paragraphs.cpp] ->
- SetUnknown [ccmain/paragraphs_internal.h]
Convert all of the unique hypothesis runs to PARAs.
ConvertHypothesizedModelRunsToParagraphs [ccmain/paragraphs.cpp]
Finally, clean up any dangling NULL row paragraph parents.
CanonicalizeDetectionResults [ccmain/paragraphs.cpp]
Error correction
dictionary error correction
Verify whether the recognized word is in the word_dic (unicharset).
- Call stack
- main [api/tesseractmain.cpp] ->
- TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
- TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
- TessBaseAPI::Recognize [api/baseapi.cpp] ->
- Tesseract::recog_all_words [ccmain/control.cpp] ->
- Tesseract::RecogAllWordsPassN [ccmain/control.cpp] ->
- Tesseract::classify_word_and_language [ccmain/control.cpp] ->
- Tesseract::classify_word_pass1 [ccmain/control.cpp] ->
- Tesseract::tess_segment_pass_n [ccmain/tessbox.cpp] ->
- Tesseract::recog_word [ccmain/tfacepp.cpp] ->
- Wordrec::dict_word [wordrec/tface.cpp] ->
- Dict::valid_word [dict/dict.cpp]