Recognizing text in an image
First, put the downloaded tessdata files into your project's bin\Debug\tessdata folder.
A download link for tessdata: https://github.com/tesseract-ocr/tessdata
Namespaces:
using System.Drawing;
using System.IO;
using System.Reflection; // needed for Assembly.GetExecutingAssembly()
using Tesseract;
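Required NuGet package: Tesseract
Initialize the TesseractEngine. The two commented-out lines set a whitelist (the only characters that may be recognized) and a blacklist (characters that are never recognized).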
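// Field that holds the engine so it can be reused across calls.
private TesseractEngine tesseractEngine;
// Run the following once, for example in the constructor.
// tessdata is expected next to the executable (Assembly is in System.Reflection).
string baseDirectory = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
string datapath = Path.Combine(baseDirectory, "tessdata");
tesseractEngine = new TesseractEngine(datapath, "eng", EngineMode.Default);
// Optional: restrict recognition to digits only.
//tesseractEngine.SetVariable("tessedit_char_whitelist", "0123456789");
// Optional: never recognize these characters.
//tesseractEngine.SetVariable("tessedit_char_blacklist", "!?@#$%&*()<>_-+=/:;'\"");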
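Engine construction fails if the language data file is missing, so it can help to check the folder first. The following is a minimal sketch, not part of the Tesseract package: the class name TessdataCheck and the method ResolveTessdata are hypothetical helpers introduced here for illustration. It relies only on the fact that Tesseract language files are named <language>.traineddata.

```csharp
using System;
using System.IO;

static class TessdataCheck
{
    // Hypothetical helper: returns the tessdata path under baseDirectory,
    // or throws if the data file for the requested language is missing.
    public static string ResolveTessdata(string baseDirectory, string language)
    {
        string datapath = Path.Combine(baseDirectory, "tessdata");
        string trained = Path.Combine(datapath, language + ".traineddata");
        if (!File.Exists(trained))
        {
            throw new FileNotFoundException("Missing Tesseract language data file.", trained);
        }
        return datapath;
    }
}
```

The returned value could then be passed as the datapath argument when constructing the TesseractEngine, for example ResolveTessdata(baseDirectory, "eng").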
Getting the text
confidence is the mean recognition confidence for the page, a value between 0 and 1.
//Bitmap bitmap = new Bitmap(fileName);
public string GetText(Bitmap bitmap, out float confidence)
{
    // The using block disposes the page even if recognition throws.
    using (var page = tesseractEngine.Process(bitmap))
    {
        var text = page.GetText();
        confidence = page.GetMeanConfidence();
        return text;
    }
}
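The method above takes a System.Drawing.Bitmap. As an alternative sketch, the image can be loaded through the package's own Pix type, which avoids System.Drawing. This assumes a version of the Tesseract NuGet package that exposes Pix.LoadFromFile, that a tessdata folder with eng.traineddata sits in the working directory, and that the image path is passed as the first command-line argument. The class name OcrDemo is made up for this example.

```csharp
using System;
using Tesseract;

class OcrDemo
{
    static void Main(string[] args)
    {
        // Assumption: args[0] is the path to an image file.
        string imagePath = args[0];

        // Each object wraps native resources, so dispose all of them.
        using (var engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(img))
        {
            // GetMeanConfidence returns a value between 0 and 1.
            Console.WriteLine("Mean confidence: {0:P0}", page.GetMeanConfidence());
            Console.WriteLine(page.GetText());
        }
    }
}
```

Creating the engine is comparatively expensive, so when many images are processed it is usually better to keep one engine instance, as the field in the original code does, and to dispose only the page after each image.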
Extracting text from a PDF
Namespaces:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
Required NuGet package: iTextSharp
public string ReadPdfContent(string filePath)
{
    PdfReader pdfReader = new PdfReader(filePath);
    try
    {
        string text = string.Empty;
        // PDF page numbers start at 1.
        for (int i = 1; i <= pdfReader.NumberOfPages; i++)
        {
            // Use a fresh strategy per page so text does not carry over between pages.
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            text += PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy);
        }
        return text;
    }
    finally
    {
        // Close the reader even if extraction throws.
        pdfReader.Close();
    }
}