Software:
jTessBoxEditor Version 0.9 (30 April 2013)
Tesseract-OCR win32 v3.02 with Leptonica
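Training steps:
1. In jTessBoxEditor, use Tools -> Merge TIFF to combine the sample images into a single TIFF file (eng.arial.01.tif in the commands below).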
2. Generate the box file:
tesseract.exe eng.arial.01.tif eng.arial.01 batch.nochop makebox
3. Open the TIFF in jTessBoxEditor. Use Insert or Delete to add or remove character boxes, and adjust each box's coordinates through the x, y, w, h fields.
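For reference when adjusting coordinates by hand: each line of a Tesseract box file has the form "<symbol> <left> <bottom> <right> <top> <page>", with the origin at the bottom-left corner of the image. The sketch below converts one such line to x, y, w, h. It assumes (this is not stated in these notes) that the editor's x and y refer to the top-left corner of the box, measured from the top of the image. The names box_to_xywh and image_height are hypothetical.

```python
def box_to_xywh(line, image_height):
    """Convert one box-file line to (symbol, x, y, w, h).

    Box files store left/bottom/right/top with the origin at the
    bottom-left of the image. The returned x, y are ASSUMED to be the
    top-left corner of the box, measured from the top of the image.
    """
    symbol, left, bottom, right, top, _page = line.split()
    left, bottom, right, top = int(left), int(bottom), int(right), int(top)
    return symbol, left, image_height - top, right - left, top - bottom


# Example with made-up coordinates on a 100-pixel-high image.
print(box_to_xywh("A 10 20 30 50 0", image_height=100))
# ('A', 10, 50, 20, 30)
```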
4. Train. If a character cannot be matched (error "couldn't find a matching blob"), try moving the box or adjusting its coordinates.
tesseract.exe eng.arial.01.tif eng.arial.01 nobatch box.train
5. Font preprocessing (extract the character set into the unicharset file):
unicharset_extractor.exe eng.arial.01.box
6. Create font_properties.txt with the following content: arial 0 0 0 0 0
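Each line of font_properties.txt has the form "<fontname> <italic> <bold> <fixed> <serif> <fraktur>", where every flag is 0 or 1. A minimal sketch for generating the file follows. The helper names font_properties_line and write_font_properties are hypothetical and not part of Tesseract.

```python
def font_properties_line(fontname, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Build one font_properties.txt line: name followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    if any(flag not in (0, 1) for flag in flags):
        raise ValueError("each flag must be 0 or 1")
    return " ".join([fontname] + [str(flag) for flag in flags])


def write_font_properties(path, fontnames):
    """Write one all-zero line per font name (hypothetical helper)."""
    with open(path, "w", newline="\n") as fh:
        for name in fontnames:
            fh.write(font_properties_line(name) + "\n")


print(font_properties_line("arial"))
# arial 0 0 0 0 0
```

The font name should match the font part of the training file name ("arial" in eng.arial.01.tr).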
7. Font processing:
mftraining.exe -F font_properties.txt -U unicharset eng.arial.01.tr
8. Character normalization training:
cntraining.exe eng.arial.01.tr
9. Add the prefix "eng.arial.01." to the four files unicharset, inttemp, normproto, and pffmtable (for example, unicharset becomes eng.arial.01.unicharset).
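The renaming in step 9 can also be scripted. The following is a minimal sketch, assuming the four files sit in one directory and no prefixed copies exist there yet. The function name add_prefix is hypothetical.

```python
from pathlib import Path

# The four training outputs named in step 9.
TRAINING_FILES = ("unicharset", "inttemp", "normproto", "pffmtable")


def add_prefix(directory, prefix="eng.arial.01."):
    """Rename each training output to <prefix><name>; return the new names."""
    directory = Path(directory)
    # Check everything first so a missing file does not leave a half-renamed set.
    missing = [name for name in TRAINING_FILES if not (directory / name).exists()]
    if missing:
        raise FileNotFoundError("missing training outputs: " + ", ".join(missing))
    renamed = []
    for name in TRAINING_FILES:
        target = directory / (prefix + name)
        (directory / name).rename(target)
        renamed.append(target.name)
    return renamed
```

Calling add_prefix(".") in the training directory leaves eng.arial.01.unicharset, eng.arial.01.inttemp, eng.arial.01.normproto, and eng.arial.01.pffmtable ready for step 10.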
10. Combine the files:
combine_tessdata.exe eng.arial.01.
Output:
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 1660
Offset for type 4 is 327545
Offset for type 5 is 327781
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
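Make sure the offsets on lines 2, 4, 5, and 6 of this list (types 1, 3, 4, and 5, which correspond to the four files from step 9) are not -1. If they are not, the new language data file has been generated.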
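The check on the combine_tessdata output can be automated. The sketch below parses lines of the form "Offset for type N is V" and reports which of types 1, 3, 4, and 5 are missing or -1. The function name missing_required_types is hypothetical, and the line format is taken from the output shown above.

```python
import re

# Types that must have a real offset, per the check described above.
REQUIRED_TYPES = (1, 3, 4, 5)


def missing_required_types(output, required=REQUIRED_TYPES):
    """Return the required types whose offset is -1 or absent from the output."""
    offsets = {
        int(type_id): int(value)
        for type_id, value in re.findall(r"Offset for type\s+(\d+) is (-?\d+)", output)
    }
    return [t for t in required if offsets.get(t, -1) == -1]


sample = """Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 1660
Offset for type 4 is 327545
Offset for type 5 is 327781
"""
print(missing_required_types(sample))
# []
```

An empty list means all four required entries were combined. Any number in the list points at a file from step 9 that was not found or not prefixed correctly.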
11. Copy the resulting file "eng.arial.01.traineddata" from the current directory into the "tessdata" directory under the Tesseract program directory.
12. Run recognition with the new language data:
tesseract.exe test.jpg result -l eng.arial.01
tesseract.exe a.bmp result2 -l eng.arial.01
To specify the page layout mode:
tesseract.exe 42.png result2 -l eng.arial.01 -psm 7
Layout parameter description:
-psm N
Set Tesseract to only run a subset of layout analysis and assume a certain form of image.
The options for N are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
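When recognition is driven from a script, the command line can be assembled programmatically. The sketch below only builds the argument list for the Tesseract 3.x syntax used in these notes ("-psm N") and rejects values outside 0 to 10. The function name tesseract_command and its defaults are hypothetical. Note that Tesseract 4 and later spell the option "--psm".

```python
VALID_PSM = range(0, 11)  # modes 0..10 as listed above


def tesseract_command(image, output_base, lang="eng.arial.01", psm=None,
                      exe="tesseract.exe"):
    """Build the argument list for one recognition run (3.x option syntax)."""
    cmd = [exe, image, output_base, "-l", lang]
    if psm is not None:
        if psm not in VALID_PSM:
            raise ValueError("psm must be an integer from 0 to 10")
        cmd += ["-psm", str(psm)]
    return cmd


print(" ".join(tesseract_command("42.png", "result2", psm=7)))
# tesseract.exe 42.png result2 -l eng.arial.01 -psm 7
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where tesseract.exe is on the PATH.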