UCAS-AI Academy-Special Course on Natural Language Processing- Lecture 4-Course Notes

Corpus and Language Knowledge Base

Basic concepts of corpus

  • Language database:
    • Large-scale language data (the basis for estimating model parameters and for evaluation)
    • NLP knowledge bases (lexical-semantic databases, lexical and syntactic rule bases, commonsense knowledge bases)
  • Corpus: a database that stores language data
  • Corpus linguistics: the study of how natural-language texts are collected, stored, retrieved, counted, and annotated with part-of-speech, syntactic, and semantic information, and of how corpora with these capabilities are applied to quantitative language analysis, dictionary compilation, stylistic analysis of written works, and human language technology
    • Corpus-based linguistic research
  • Research topics:
    • Corpus construction and compilation
    • Corpus processing and management
    • Corpus use
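
The "retrieval" and "statistics" parts of corpus use can be illustrated with a minimal sketch. The three sentences, the whitespace tokenizer, and the function names below are assumptions made for illustration; a real corpus would need language-specific tokenization (for Chinese, word segmentation).

```python
from collections import Counter

# Toy corpus: invented sentences, for illustration only.
corpus = [
    "the corpus stores language data",
    "a balanced corpus samples many genres",
    "the treebank annotates syntactic structure",
]

def tokenize(sentence):
    """Whitespace tokenization; real corpora need language-specific tokenizers."""
    return sentence.lower().split()

def word_frequencies(sentences):
    """Statistics: count how often each token occurs in the corpus."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts

def concordance(sentences, keyword, window=2):
    """Retrieval: return keyword-in-context (KWIC) triples for a keyword."""
    hits = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((left, token, right))
    return hits

freqs = word_frequencies(corpus)
kwic = concordance(corpus, "corpus")
```

Here `freqs` gives the raw counts that quantitative analysis starts from, and `kwic` lists each occurrence of the keyword with a few words of context on either side, which is the typical view used in dictionary compilation.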

Corpus technology development

Corpus type

  • Four types
    • Heterogeneous corpus: the simplest way to collect material, with no predefined rules or selection principles
    • Homogeneous corpus: the opposite of heterogeneous
    • Systematic corpus: collected with full attention to whether the corpus is dynamic or static, to its representativeness and balance, and to its size
    • Specialized corpus
  • Language type
    • Monolingual
    • Bilingual or multilingual
  • Annotated or not
    • Part-of-speech annotation
    • Syntactic structure annotation (treebank)
    • Semantic annotation
  • Raw corpus: corpus without any annotation
  • Annotated (processed) corpus: corpus that carries annotation
  • Balanced corpus
    • Representativeness and balance in corpus collection
    • Seven principles
    • Open problems:
      • The scientific basis for choosing how the sampled material is distributed
      • Whether circulation (how widely a text is used) truly reflects real language use
  • Parallel corpus
    • Within one language: samples drawn in parallel with matched time period, subject, proportion, etc.
    • Across languages: parallel sampling and processing of texts in two or more languages
  • Synchronic corpus: a corpus built for studying language at a single point in time
  • Diachronic corpus: a corpus built for studying how language develops over time; criteria for judging one:
    • Whether the corpus is dynamic
    • Whether its texts carry a quantified circulation attribute
    • Whether deep processing relies on dynamic processing methods
    • Whether dynamic processing results are obtained
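
The idea of balance in corpus collection can be sketched as sampling each genre according to a target proportion. The genre names, proportions, document IDs, and the function `build_balanced_sample` are all hypothetical; the notes do not say how any particular balanced corpus was sampled.

```python
import random

def build_balanced_sample(docs_by_genre, proportions, total, seed=0):
    """Draw `total` documents so that each genre fills its target share.

    docs_by_genre: dict mapping genre -> list of document IDs (assumed input)
    proportions:   dict mapping genre -> target share, summing to 1.0
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for genre, share in proportions.items():
        quota = round(total * share)
        pool = docs_by_genre[genre]
        if quota > len(pool):
            raise ValueError(f"not enough documents for genre {genre!r}")
        sample.extend((genre, doc) for doc in rng.sample(pool, quota))
    return sample

# Hypothetical pools of document IDs, ten per genre.
docs_by_genre = {
    "news": [f"news_{i}" for i in range(10)],
    "fiction": [f"fiction_{i}" for i in range(10)],
    "academic": [f"academic_{i}" for i in range(10)],
}
proportions = {"news": 0.5, "fiction": 0.3, "academic": 0.2}

sample = build_balanced_sample(docs_by_genre, proportions, total=10)
genre_counts = {g: sum(1 for genre, _ in sample if genre == g) for g in proportions}
```

The open problem noted above shows up directly in this sketch: the code can enforce any set of proportions, but it says nothing about which proportions are scientifically justified.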

Typical corpus introduction

  • Brown Corpus
    • The world's first standard corpus whose samples were collected according to systematic principles
  • London-Lund Corpus (LLC) of spoken English
    • Spoken material such as dialogues and broadcasts
  • Longman Corpus
    • Respects both native-speaker intuition and the authority of corpus evidence
  • UPenn Treebank (Penn Treebank)
    • Annotation of the grammatical structure of sentences
    • Chinese PropBank and NomBank (the latter focuses on nouns)
    • Discourse Treebank (annotates coherence relations associated with discourse connectives)
  • Chinese Discourse Treebank (CDTB)
    • Designed to account for the many implicit connectives in Chinese
  • Prague Dependency Treebank (PDT)
    • Czech-language data
    • Three layers
      • Morphological layer: morphological information
      • Analytical layer: surface syntactic information
      • Tectogrammatical layer: deep syntactic structure
  • Comprehensive Language Knowledge Base (CLKB)
  • Academia Sinica Balanced Corpus (Taiwan):
    • The world's first Chinese balanced corpus with complete part-of-speech tags
  • Basic Travel Expression Corpus (BTEC), a spoken-language translation corpus
  • TED speech translation corpus
  • A spoken dialogue corpus built by the Institute of Automation of the Chinese Academy of Sciences and the Institute of Linguistics of the Chinese Academy of Social Sciences
  • CASIA multimodal automatic summarization corpus
    • English: Topic - Documents - Videos - Summaries
    • Chinese: Topic - Documents - Videos - Summaries
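
Treebanks such as the UPenn Treebank are commonly distributed as bracketed strings, and the difference between an annotated and a raw corpus becomes concrete once such a string is parsed. The sketch below is a minimal reader written for these notes; the example sentence and the function names are assumptions, and real treebank files contain extra conventions (traces, function tags) that it ignores.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())  # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing material after tree")
    return tree

def tagged_leaves(tree):
    """Collect (word, part-of-speech) pairs from the preterminal nodes."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_leaves(child))
    return pairs

# Illustrative sentence, not taken from any actual treebank.
annotated = "(S (NP (DT the) (NN corpus)) (VP (VBZ stores) (NP (NN data))))"
tree = parse_bracketed(annotated)
pos_tagged = tagged_leaves(tree)                    # part-of-speech level annotation
raw_text = " ".join(word for word, _ in pos_tagged)  # what a raw corpus would hold
```

The same sentence thus exists at three levels: `raw_text` (raw corpus), `pos_tagged` (part-of-speech annotation), and `tree` (syntactic structure annotation).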

Problems and current situation

  • Problems:
    • Dynamic vs. static: the choice depends on the purpose
    • Representativeness and balance
    • Scale
    • Corpus management and maintenance
  • Problems specific to Chinese corpora
    • Standardization
    • Intellectual property protection
  • Current state:
    • Material comes mostly from standard, well-formed text
    • Annotation schemes are inconsistent
    • No clear orientation toward specific NLP tasks

Language Knowledge Base

  • Knowledge abstracted from language, expressed in language

WordNet

  • A semantic dictionary that organizes lexical information by word sense
  • Semantic relations: pointers between synonym sets (synsets)
    • Synonymy
    • Antonymy
    • Hypernymy/hyponymy (superordinate-subordinate)
    • Meronymy (whole-part)
  • Applications: word sense disambiguation, semantic reasoning, understanding
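
The "pointers between synonym sets" idea can be sketched with a tiny hand-built graph. This is not WordNet's actual data or API: the four synsets, the field names, and the helper functions are assumptions chosen to show synonymy, the superordinate-subordinate relation, and the whole-part relation; antonymy is left out for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    name: str
    lemmas: list                                    # words sharing this sense
    hypernyms: list = field(default_factory=list)   # pointers to more general synsets
    part_of: list = field(default_factory=list)     # pointers to wholes (whole-part)

# Toy inventory, for illustration only.
SYNSETS = {
    "entity": Synset("entity", ["entity"]),
    "animal": Synset("animal", ["animal", "beast"], hypernyms=["entity"]),
    "dog": Synset("dog", ["dog", "domestic dog"], hypernyms=["animal"]),
    "tail": Synset("tail", ["tail"], hypernyms=["entity"], part_of=["dog"]),
}

def synonyms(word):
    """Synonymy: other lemmas found in any synset that contains `word`."""
    result = set()
    for synset in SYNSETS.values():
        if word in synset.lemmas:
            result.update(lemma for lemma in synset.lemmas if lemma != word)
    return result

def is_a(specific, general):
    """Hypernymy: follow hypernym pointers upward from `specific`."""
    stack, seen = [specific], set()
    while stack:
        current = stack.pop()
        if current == general:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(SYNSETS[current].hypernyms)
    return False

dog_is_entity = is_a("dog", "entity")
beast_synonyms = synonyms("beast")
```

Walking the hypernym pointers is the basic operation behind the semantic reasoning application: a statement about animals can be applied to dogs because `is_a("dog", "animal")` holds.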

HowNet

  • Four basic views
    • An NLP system needs the support of a strong knowledge base
    • Knowledge is a system
    • A commonsense knowledge base should be established first
    • The framework of the knowledge base should be designed by knowledge engineers

Hierarchical Network of Concepts (HNC)

  • Mapping from natural language space to language concept space

Knowledge graph

  • Describes the relationships between entities and the attributes of entities or concepts
  • DBpedia: based on Wikipedia
  • YAGO
  • BabelNet
  • XLORE
  • Key technologies
    • Entity and concept recognition
    • Relation extraction
    • Attribute extraction
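
A knowledge graph is often stored as (subject, predicate, object) triples, and relation extraction is the step that produces such triples from text. The sketch below uses a single hand-written regular expression as a stand-in for relation extraction; the pattern, the predicate names `capital_of` and `instance_of`, and the helper functions are assumptions for illustration, not how DBpedia, YAGO, BabelNet, or XLORE are actually built.

```python
import re

# One hand-written pattern standing in for a relation extractor.
CAPITAL_PATTERN = re.compile(r"(\w+) is the capital of (\w+)")

def extract_triples(text):
    """Relation extraction (toy): turn matching sentences into triples."""
    return [
        (m.group(1), "capital_of", m.group(2))
        for m in CAPITAL_PATTERN.finditer(text)
    ]

def match(triples, subject=None, predicate=None, obj=None):
    """Query the graph: None acts as a wildcard for that position."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

text = "Beijing is the capital of China. Paris is the capital of France."
graph = extract_triples(text)

# Attributes of entities are stored as triples too (added by hand here).
graph.append(("Beijing", "instance_of", "city"))

capital_of_france = match(graph, predicate="capital_of", obj="France")
facts_about_beijing = match(graph, subject="Beijing")
```

Storing relations and attributes in the same triple form is what lets one query function serve both, at the cost of pushing all typing decisions into the choice of predicate names.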

Origin blog.csdn.net/cary_leo/article/details/105642999