sheetocr.py 2.6 KB

'''
Use Tesseract to OCR exam papers.

Tesseract Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
   0    Orientation and script detection (OSD) only.
   1    Automatic page segmentation with OSD.
   2    Automatic page segmentation, but no OSD, or OCR.
   3    Fully automatic page segmentation, but no OSD. (Default)
   4    Assume a single column of text of variable sizes.
   5    Assume a single uniform block of vertically aligned text.
   6    Assume a single uniform block of text.
   7    Treat the image as a single text line.
   8    Treat the image as a single word.
   9    Treat the image as a single word in a circle.
  10    Treat the image as a single character.
  11    Sparse text. Find as much text as possible in no particular order.
  12    Sparse text with OSD.
  13    Raw line. Treat the image as a single text line,
        bypassing hacks that are Tesseract-specific.

OCR Engine modes:
   0    Legacy engine only.
   1    Neural nets LSTM engine only.
   2    Legacy + LSTM engines.
   3    Default, based on what is available.

Single options:
  -h, --help            Show minimal help message.
  --help-extra          Show extra help for advanced users.
  --help-psm            Show page segmentation modes.
  --help-oem            Show OCR Engine modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters.
'''
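import shlex
import subprocess

import pytesseract


# OCR an image file and write the result to a text file (Tesseract appends .txt to output).
# Parameters that work well: lang='-l chi_sim+eng', psm='--psm 6'.
def sheetocr(picture, output, lang, psm):
    # lang and psm are complete option strings, so split each into separate arguments.
    cmd = ['tesseract', picture, output] + shlex.split(lang) + shlex.split(psm)
    # An argument list avoids shell quoting problems with paths that contain spaces.
    subprocess.run(cmd, check=True)


# OCR an in-memory image and return the recognized text.
# Parameters that work well: lang='chi_sim+eng', psm='--psm 6'.
def sheetocr_py(img, lang, psm):
    words = pytesseract.image_to_string(img, lang=lang, config=psm)
    return words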
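

# Usage sketch (not part of the original module). build_tesseract_args is a
# hypothetical helper showing how the command line described in the docstring
# could be assembled: options such as -l and --psm go before any config file,
# as the help text requires. The default values are assumptions taken from the
# comments in this file, and the file names below are placeholders.
def build_tesseract_args(picture, output, lang='chi_sim+eng', psm=6, configfiles=()):
    args = ['tesseract', picture, output, '-l', lang, '--psm', str(psm)]
    # Config files come last, after every option.
    args.extend(configfiles)
    return args


if __name__ == '__main__':
    # Prints the argument list only; pass it to subprocess.run to invoke Tesseract.
    print(build_tesseract_args('sheet.png', 'sheet'))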