''' 使用Tesseract 对试卷做OCR Tesseract Usage: tesseract --help | --help-extra | --help-psm | --help-oem | --version tesseract --list-langs [--tessdata-dir PATH] tesseract --print-parameters [options...] [configfile...] tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...] OCR options: --tessdata-dir PATH Specify the location of tessdata path. --user-words PATH Specify the location of user words file. --user-patterns PATH Specify the location of user patterns file. -l LANG[+LANG] Specify language(s) used for OCR. -c VAR=VALUE Set value for config variables. Multiple -c arguments are allowed. --psm NUM Specify page segmentation mode. --oem NUM Specify OCR Engine mode. NOTE: These options must occur before any configfile. Page segmentation modes: 0 Orientation and script detection (OSD) only. 1 Automatic page segmentation with OSD. 2 Automatic page segmentation, but no OSD, or OCR. 3 Fully automatic page segmentation, but no OSD. (Default) 4 Assume a single column of text of variable sizes. 5 Assume a single uniform block of vertically aligned text. 6 Assume a single uniform block of text. 7 Treat the image as a single text line. 8 Treat the image as a single word. 9 Treat the image as a single word in a circle. 10 Treat the image as a single character. 11 Sparse text. Find as much text as possible in no particular order. 12 Sparse text with OSD. 13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific. OCR Engine modes: 0 Legacy engine only. 1 Neural nets LSTM engine only. 2 Legacy + LSTM engines. 3 Default, based on what is available. Single options: -h, --help Show minimal help message. --help-extra Show extra help for advanced users. --help-psm Show page segmentation modes. --help-oem Show OCR Engine modes. -v, --version Show version information. --list-langs List available languages for tesseract engine. --print-parameters Print tesseract parameters. ''' import os import pytesseract # ocr图片文件,生成文本文件,较好的参数为 -l chi_sim+eng --psm 6 def sheetocr(picture, output, lang, psm): cmd = 'tesseract' + ' ' + picture + ' ' + output + ' ' + lang + ' ' + psm os.system(cmd) # ocr 图像,生成文本,较好的参数为'chi_sim+eng', '--psm 6' def sheetocr_py(img, lang, psm): words = pytesseract.image_to_string(img, lang=lang, config=psm) return words