大结构化文本文件的信息提取

需要从一些大型文件中(5万到10万行)提取信息,这些文件以空行分隔成组。每组以相同的模式“No.999999999 dd/mm/yyyy ZZZ”开头。以下是一些示例数据。

No.813829461 16/09/1987 270
Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)
C.N.P.J./C.I.C./N INPI : 16404287000155
Procurador: MARCELLO DO NASCIMENTO
No.815326777 28/12/1989 351
Tit.SIGLA SISTEMA GLOBO DE GRAVACOES AUDIO VISUAIS LTDA (BR/RJ)
C.N.P.J./C.I.C./NºINPI : 34162651000108
Apres.: Nominativa ; Nat.: De Produto
Marca: TRIO TROPICAL
Clas.Prod/Serv: 09.40
*DEFERIDO CONFORME RESOLUÇÃO 123 DE 06/01/2006, PUBLICADA NA RPI 1829, DE 24/01/2006.
Procurador: WALDEMAR RODRIGUES PEDRA
No.900148764 11/01/2007 LD3
Tit.TIARA BOLSAS E CALÇADOS LTDA
Procurador: Marcia Ferreira Gomes
*Escritório: Marcas Marcantes e Patentes Ltda
*Exigência Formal não respondida Satisfatoriamente, Pedido de Registro de Marca considerado inexistente, de acordo com Art. 157 da LPI
*Protocolo da Petição de cumprimento de Exigência Formal: 810080140197

已经编写了一些代码来相应地解析它。还有哪些地方可以改进,以提高可读性或性能?

  1. 解决方案
import re, pprint

class Despacho(object):
    """
    Class to parse each line, applying the regexp and storing the results
    for future use
    """
    regexp = {
        re.compile(r'No.([\d]{9})  ([\d]{2}/[\d]{2}/[\d]{4})  (.*)'): lambda self: self._processo,
        re.compile(r'Tit.(.*)'): lambda self: self._titular,
        re.compile(r'Procurador: (.*)'): lambda self: self._procurador,
        re.compile(r'C.N.P.J./C.I.C./N INPI :(.*)'): lambda self: self._documento,
        re.compile(r'Apres.: (.*) ; Nat.: (.*)'): lambda self: self._apresentacao,
        re.compile(r'Marca: (.*)'): lambda self: self._marca,
        re.compile(r'Clas.Prod/Serv: (.*)'): lambda self: self._classe,
        re.compile(r'\*(.*)'): lambda self: self._complemento,
    }

    def __init__(self):
        """
        'complemento' is the only field that can be multiple in a single registry
        """
        self.complemento = []

    def _processo(self, matches):
        self.processo, self.data, self.despacho = matches.groups()

    def _titular(self, matches):
        self.titular = matches.group(1)

    def _procurador(self, matches):
        self.procurador = matches.group(1)

    def _documento(self, matches):
        self.documento = matches.group(1)

    def _apresentacao(self, matches):
        self.apresentacao, self.natureza = matches.groups()

    def _marca(self, matches):
        self.marca = matches.group(1)

    def _classe(self, matches):
        self.classe = matches.group(1)

    def _complemento(self, matches):
        self.complemento.append(matches.group(1))

    def read(self, line):
        for pattern in Despacho.regexp:
            m = pattern.match(line)
            if m:
                Despacho.regexp[pattern](self)(m)


def process(rpi):
    """
    read data and process each group
    """
    rpi = (line for line in rpi)
    group = False

    for line in rpi:
        if line.startswith('No.'):
            group = True
            d = Despacho()        

        if not line.strip() and group: # empty line - end of block
            yield d
            group = False

        d.read(line)


arquivo = open('rm1972.txt') # file to process
for desp in process(arquivo):
    pprint.pprint(desp.__dict__)
    print('--------------')
  • 第一种解决方案使用正则表达式解析每行,并将结果存储在一个类中。然后,该类可以被迭代以获取每个组的数据。
  • 第二种解决方案使用正则表达式来解析每行,并将结果存储在一个字典中。然后,字典可以用pprint模块格式化并打印出来。
  • 两种解决方案都能够提取所需的数据,但第二种解决方案的可读性和可维护性更高。

猜你喜欢

转载自blog.csdn.net/D0126_/article/details/143187602