Python OCR: Recognizing Chinese Text with Tesseract

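This article shows how to recognize Chinese text in an image using Python and pytesseract, a Python wrapper for the Tesseract OCR engine.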

Code Example

import pytesseract
from PIL import Image

def recognize_chinese_text(img_path):
    try:
        # Load the image
        img = Image.open(img_path)
        # Convert to grayscale
        img = img.convert('L')
        # Binarize: gray levels below the threshold map to 0, the rest to 1
        threshold = 140
        table = [0 if i < threshold else 1 for i in range(256)]
        img = img.point(table, '1')
        # Recognize the Chinese text
        text = pytesseract.image_to_string(img, lang='chi_sim')
        return text
    except Exception as e:
        print(e)
        return None

if __name__ == "__main__":
    img_path = "test.png"
    text = recognize_chinese_text(img_path)
    if text:
        print(text)
    else:
        print('Recognition failed!')

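Code Walkthrough

Import the required libraries:
- pytesseract: a Python wrapper used to call the Tesseract OCR engine.
- PIL (Pillow): used for image processing.
Define the function recognize_chinese_text(img_path):
- Takes the image path img_path as its parameter.
- Opens the image with Image.open(img_path).
- Converts the image to grayscale with img.convert('L').
- Binarizes the image: pixel values below threshold are mapped to 0, and values at or above threshold are mapped to 1.
- Calls pytesseract.image_to_string(img, lang='chi_sim') to recognize the Chinese text in the image; 'chi_sim' selects the Simplified Chinese language model.
- Returns the recognized text. If any step raises an exception, the function prints the error and returns None.
Main block if __name__ == "__main__":
- Defines a test image path img_path.
- Calls recognize_chinese_text(img_path) to run the recognition.
- Prints the recognized text, or a failure message if the result is None or an empty string.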
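
The binarization step can be tried on its own, without Tesseract installed. The following is a minimal sketch, assuming only Pillow is available; the helper names build_threshold_table and binarize are hypothetical and do not appear in the article's code. It writes 255 rather than 1 for white pixels, which is a choice made for this sketch.

```python
from PIL import Image


def build_threshold_table(threshold):
    """Return a 256-entry lookup table: 0 below the threshold, 255 otherwise."""
    return [0 if level < threshold else 255 for level in range(256)]


def binarize(img, threshold=140):
    """Convert a Pillow image to grayscale, then to a 1-bit black/white image."""
    gray = img.convert("L")
    return gray.point(build_threshold_table(threshold), "1")


if __name__ == "__main__":
    # Synthetic 3x1 grayscale image: one dark, one borderline, one light pixel
    sample = Image.new("L", (3, 1))
    for x, level in enumerate([100, 140, 200]):
        sample.putpixel((x, 0), level)

    result = binarize(sample, threshold=140)
    print([result.getpixel((x, 0)) for x in range(3)])
```

Running the sketch with a few different threshold values on a real scan is a quick way to see which setting keeps the characters legible before passing the image to the OCR step.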

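Notes

Make sure the pytesseract and Pillow libraries are installed (for example, with pip install pytesseract Pillow).
Make sure the Tesseract OCR engine is installed and configured, and that it includes the Simplified Chinese language model chi_sim.
Adjust the threshold and lang parameters in the code to suit your images and get the best recognition results.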
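
One way to confirm the second note before running the OCR code is to ask Tesseract which language models it has installed. The sketch below assumes the tesseract executable is on PATH and relies on its --list-langs option; the helper names parse_lang_list and installed_languages are hypothetical, and the header line of the output may differ between Tesseract versions.

```python
import shutil
import subprocess


def parse_lang_list(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header
        if not line or line.startswith("List of"):
            continue
        langs.append(line)
    return langs


def installed_languages():
    """Return the installed language codes, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Fall back to stderr in case the list was printed there
    return parse_lang_list(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = installed_languages()
    if langs is None:
        print("tesseract executable not found on PATH")
    elif "chi_sim" in langs:
        print("chi_sim is installed")
    else:
        print("chi_sim is missing; installed languages:", langs)
```

If chi_sim is missing, the language data file needs to be added to Tesseract's tessdata directory before lang='chi_sim' will work.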

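Summary

This article presented a simple example of recognizing Chinese text with Python and Tesseract. It covers the basic steps of image preprocessing and text recognition in Python, and can serve as a starting point for exploring more of Tesseract's features.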