'''
Use Tesseract to OCR exam papers.

Tesseract Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

OCR Engine modes:
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.

Single options:
  -h, --help            Show minimal help message.
  --help-extra          Show extra help for advanced users.
  --help-psm            Show page segmentation modes.
  --help-oem            Show OCR Engine modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters.
'''
import shlex
import subprocess

import pytesseract


# OCR an image file and write the result to a text file.
# Parameters that work well: lang='-l chi_sim+eng', psm='--psm 6'.
def sheetocr(picture, output, lang, psm):
    # Pass the arguments as a list (no shell) so paths containing spaces work.
    # lang and psm are flag strings such as '-l chi_sim+eng' and '--psm 6'.
    cmd = ['tesseract', picture, output] + shlex.split(lang) + shlex.split(psm)
    subprocess.run(cmd)


# OCR an image and return the recognized text.
# Parameters that work well: lang='chi_sim+eng', psm='--psm 6'.
def sheetocr_py(img, lang, psm):
    words = pytesseract.image_to_string(img, lang=lang, config=psm)
    return words
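
The page segmentation and OCR engine modes listed in the docstring can be turned into the config string that `sheetocr_py` expects. The following is a minimal sketch, assuming a hypothetical helper named `build_config` (it is not part of the original script or of pytesseract); the valid ranges 0 to 13 and 0 to 3 are taken from the help text in the docstring.

```python
# Hypothetical helper (not part of the original script or of pytesseract):
# builds a Tesseract config string from numeric mode values.
PSM_RANGE = range(0, 14)  # page segmentation modes 0 to 13, per the help text
OEM_RANGE = range(0, 4)   # OCR engine modes 0 to 3, per the help text


def build_config(psm=6, oem=None):
    """Return a string such as '--psm 6' or '--oem 1 --psm 6'."""
    if not isinstance(psm, int) or psm not in PSM_RANGE:
        raise ValueError(f"psm must be an integer in 0..13, got {psm!r}")
    parts = []
    if oem is not None:
        if not isinstance(oem, int) or oem not in OEM_RANGE:
            raise ValueError(f"oem must be an integer in 0..3, got {oem!r}")
        parts += ["--oem", str(oem)]
    parts += ["--psm", str(psm)]
    return " ".join(parts)


print(build_config(6))
print(build_config(6, oem=1))
```

The returned string could then be passed as the `psm` argument, for example `sheetocr_py(img, 'chi_sim+eng', build_config(6))`. Rejecting out-of-range values early is a design choice of this sketch; it avoids handing Tesseract a mode number that the help text does not list.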