Preparing to install Tesseract-OCR:
Build environment: gcc, gcc-c++, make (most machines already have these, in which case this step can be skipped)
yum install gcc gcc-c++ make
Dependencies: autoconf, automake, libtool, libjpeg-devel, libpng-devel, libtiff-devel, zlib-devel, leptonica (version above 1.67)
1. autoconf, automake, libtool, libjpeg-devel, libpng-devel, libtiff-devel, and zlib-devel can be installed with yum:
yum install autoconf automake libtool
yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel
2. leptonica has to be compiled and installed from source.
References:
http://paramountideas.com/tesseract-ocr-30-and-leptonica-installation-centos-55-and-opensuse-113
http://www.leptonica.org/source/README.html
Download the leptonica package: http://www.leptonica.org/source/leptonica-1.68.tar.gz
After extracting the archive, change to the leptonica-1.68 root directory and run:
./configure
make
make install
Installing tesseract:
Once the dependencies are installed, download the tesseract-3.01 package: http://tesseract-ocr.googlecode.com/files/tesseract-3.01.tar.gz
Extract it, change to the tesseract-3.01 root directory, and run:
./autogen.sh
./configure
make
make install
ldconfig
(If make fails with an error like strngs.h:1: error: stray '\357' in program, convert tesseract-3.01/ccutil/strngs.h to ANSI encoding, save it, and recompile. The byte \357 is 0xEF, which usually means the file starts with a UTF-8 byte order mark.)
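For the stray '\357' error noted above, the fix can also be scripted. The following is a minimal sketch, assuming the error really is caused by a UTF-8 byte order mark (bytes EF BB BF) at the start of strngs.h; the class and method names are hypothetical and not part of tesseract.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of tesseract: removes a leading UTF-8 BOM.
final class BomStripper {
    // The three bytes of a UTF-8 byte order mark; 0xEF is octal 357.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns a copy without the leading BOM, or the same array if there is none.
    static byte[] stripBom(byte[] data) {
        if (data.length >= 3
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2]) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }

    // Rewrites the file only if it starts with a BOM; returns true if it was changed.
    static boolean stripBomInPlace(Path file) throws IOException {
        byte[] original = Files.readAllBytes(file);
        byte[] stripped = stripBom(original);
        if (stripped.length == original.length) {
            return false;
        }
        Files.write(file, stripped);
        return true;
    }
}
```

Calling BomStripper.stripBomInPlace(Paths.get("tesseract-3.01/ccutil/strngs.h")) before rerunning make would apply it to the file named in the error. If the file contains other non-ASCII bytes, this sketch does not touch them.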
Installing the tesseract English language pack:
Download the tesseract-3.01 English language pack: http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.01.eng.tar.gz
After extracting it, copy all files under tesseract-ocr/tessdata to /usr/local/share/tessdata.
Installation is now complete.
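Test it:
Change to the extracted tesseract-3.01 root directory (it ships with a phototest.tif that can be used for testing).
Command line:
tesseract phototest.tif phototest -l eng
Output:
Tesseract Open Source OCR Engine v3.01 with Leptonica
Page 0
A text file named phototest.txt should now appear in the current directory; it contains the text shown in phototest.tif.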
--------------------------------------------Installation complete-------------------------------------
Java implementation
Method:
// Requires java.io.*, java.util.ArrayList and java.util.List.
// EOL and logger are members of the enclosing class (not shown here).
private static String recognizeText(File imageFile) {
    // Base name of the output file, placed in the same directory as the image
    File outputFile = new File(imageFile.getParentFile(), "output");
    StringBuffer strB = new StringBuffer();

    // Build the command line as a list of arguments
    List<String> cmd = new ArrayList<String>();
    cmd.add("tesseract");
    cmd.add(imageFile.getName());
    cmd.add(outputFile.getName());
    cmd.add("-l");
    cmd.add("eng");

    try {
        // Start the external process in the image's directory
        ProcessBuilder pb = new ProcessBuilder();
        pb.directory(imageFile.getParentFile());
        pb.command(cmd);
        pb.redirectErrorStream(true);
        Process process = pb.start();

        // Wait for the process to finish
        int w = process.waitFor();
        if (w == 0) { // 0 means normal exit
            // tesseract appends ".txt" to the output base name
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(outputFile.getAbsolutePath() + ".txt"), "UTF-8"));
            String str;
            while ((str = in.readLine()) != null) {
                strB.append(str).append(EOL);
            }
            in.close();
        } else {
            String msg;
            switch (w) {
                case 1:
                    msg = "Errors accessing files. There may be spaces in your image's filename.";
                    break;
                case 29:
                    msg = "Cannot recognize the image or its selected region.";
                    break;
                case 31:
                    msg = "Unsupported image format.";
                    break;
                default:
                    msg = "Errors occurred.";
            }
            logger.error(msg);
        }
    } catch (Exception e) {
        logger.error(e.getMessage(), e);
    }

    // Remove the temporary output file
    new File(outputFile.getAbsolutePath() + ".txt").delete();

    // Strip all whitespace, including the line breaks appended above
    return strB.toString().replaceAll("\\s*", "");
}
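The method above removes every whitespace character from the result. That is convenient when the image holds a single token, but it also merges words and lines in ordinary text. As an alternative, here is a minimal sketch of a gentler cleanup that keeps line breaks and single spaces; the class and method names are hypothetical and it assumes Java 8 or later (for the \R line-break matcher and String.join).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing helper for OCR output; not part of the original method.
final class OcrTextCleaner {
    // Same effect as the original return statement: drops all whitespace.
    static String stripAllWhitespace(String raw) {
        return raw.replaceAll("\\s+", "");
    }

    // Trims each line, collapses runs of spaces/tabs to one space, drops blank lines.
    static String normalizeLines(String raw) {
        List<String> kept = new ArrayList<>();
        for (String line : raw.split("\\R")) {
            String collapsed = line.trim().replaceAll("[ \\t]+", " ");
            if (!collapsed.isEmpty()) {
                kept.add(collapsed);
            }
        }
        return String.join("\n", kept);
    }
}
```

To try it, the final return statement of recognizeText could call OcrTextCleaner.normalizeLines(strB.toString()) instead of replaceAll. Which variant is appropriate depends on what the recognized text is used for.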