C#: Extracting Text from Images and PDFs

Recognizing text in an image

First, put the downloaded tessdata files in the bin\Debug\tessdata folder of your project.

tessdata can be downloaded from: https://github.com/tesseract-ocr/tessdata

Namespaces:

using System.Drawing;
using System.IO;
using System.Reflection; // needed for Assembly.GetExecutingAssembly()
using Tesseract;

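Required NuGet package: Tesseract

Initialize the TesseractEngine. The commented-out lines set a whitelist (the only characters that may be recognized) and a blacklist (characters that are never recognized).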

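private TesseractEngine tesseractEngine;

// Run the following once (for example in the constructor) before calling GetText.
// tessdata is expected next to the executable, i.e. bin\Debug\tessdata.
string baseDirectory = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
string datapath = Path.Combine(baseDirectory, "tessdata");
tesseractEngine = new TesseractEngine(datapath, "eng", EngineMode.Default);

// Optional: restrict recognition to digits only.
//tesseractEngine.SetVariable("tessedit_char_whitelist", "0123456789");
// Optional: never output these characters.
//tesseractEngine.SetVariable("tessedit_char_blacklist", "!?@#$%&*()<>_-+=/:;'\"");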

Getting the text

confidence is the mean recognition confidence for the page, a value between 0 and 1.

//Bitmap bitmap = new Bitmap(fileName);

public string GetText(Bitmap bitmap, out float confidence)
{
    // "using" disposes the page even if recognition throws.
    using (var page = tesseractEngine.Process(bitmap))
    {
        var text = page.GetText();
        confidence = page.GetMeanConfidence();
        return text;
    }
}
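As a complement to GetText, here is a minimal end-to-end sketch that loads the image with Pix.LoadFromFile instead of System.Drawing.Bitmap. This may be useful because, depending on the version of the Tesseract package, the Process(Bitmap) overload may not be available. The class name OcrFromFileDemo and the default file name sample.png are hypothetical and only serve the illustration.

```csharp
using System;
using System.IO;
using System.Reflection;
using Tesseract;

// Sketch only: OcrFromFileDemo and "sample.png" are hypothetical names.
internal static class OcrFromFileDemo
{
    private static void Main(string[] args)
    {
        // Image path from the command line, or a placeholder file name.
        string imagePath = args.Length > 0 ? args[0] : "sample.png";

        // Same tessdata location as in the initialization code: next to the executable.
        string baseDirectory = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
        string datapath = Path.Combine(baseDirectory, "tessdata");

        // Engine, image and page all wrap native resources, so dispose each one.
        using (var engine = new TesseractEngine(datapath, "eng", EngineMode.Default))
        using (var image = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(image))
        {
            string text = page.GetText();
            float confidence = page.GetMeanConfidence();

            Console.WriteLine("Mean confidence: {0:P0}", confidence);
            Console.WriteLine(text);
        }
    }
}
```

An engine processes one page at a time, so the page has to be disposed before Process is called again on the same engine.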

Extracting text from a PDF

Namespaces:

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

Required NuGet package: iTextSharp

public string ReadPdfContent(string filePath)
{
    var text = new StringBuilder();
    PdfReader pdfReader = new PdfReader(filePath);

    try
    {
        // PDF page numbers start at 1.
        for (int i = 1; i <= pdfReader.NumberOfPages; i++)
        {
            // Use a fresh strategy for each page so text does not carry over between pages.
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            text.Append(PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy));
        }
    }
    finally
    {
        // Release the file even if extraction throws.
        pdfReader.Close();
    }

    return text.ToString();
}
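If the text is needed page by page rather than as one string, the same loop can collect one entry per page. The sketch below is a variation on ReadPdfContent, not part of the original code. ReadPdfPages, PdfPagesDemo and sample.pdf are hypothetical names, and LocationTextExtractionStrategy (another strategy in iTextSharp.text.pdf.parser) is swapped in on the assumption that position-based ordering suits the document better; SimpleTextExtractionStrategy works the same way here.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Sketch only: PdfPagesDemo, ReadPdfPages and "sample.pdf" are hypothetical names.
internal static class PdfPagesDemo
{
    // Returns the extracted text of each page as a separate list entry.
    public static List<string> ReadPdfPages(string filePath)
    {
        var pages = new List<string>();
        PdfReader pdfReader = new PdfReader(filePath);

        try
        {
            // PDF page numbers start at 1.
            for (int i = 1; i <= pdfReader.NumberOfPages; i++)
            {
                // Assumption: location-based ordering; a new strategy per page.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy));
            }
        }
        finally
        {
            pdfReader.Close();
        }

        return pages;
    }

    private static void Main(string[] args)
    {
        string filePath = args.Length > 0 ? args[0] : "sample.pdf";
        List<string> pages = ReadPdfPages(filePath);

        for (int i = 0; i < pages.Count; i++)
        {
            Console.WriteLine("--- page {0}: {1} characters ---", i + 1, pages[i].Length);
        }
    }
}
```

Pages that contain only scanned images have no text layer, so they will typically come back empty or nearly empty. For those pages the OCR approach from the first half of this post is the one that applies.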

Reposted from blog.csdn.net/qq_41863100/article/details/103144835