Large-scale language data (basis for model parameter training and evaluation standards)
NLP knowledge bases (lexical semantic base, lexical and syntactic rule base, common-sense knowledge base)
Corpus: files used to store language data
Corpus linguistics: the study of collecting, storing, retrieving, and computing statistics over natural language texts and of annotating them with part-of-speech, syntactic, and semantic information, together with the application of such corpora to quantitative language analysis, dictionary compilation, stylistic analysis of works, and human language technology
Corpus-based linguistic research
Research content:
Corpus construction and compilation
Corpus processing and management
Corpus usage
Corpus technology development
Corpus type
Four types
Heterogeneous corpus: the simplest collection method; texts are gathered without predefined rules or selection principles
Homogeneous corpus: the opposite of a heterogeneous corpus
Systematic corpus: compiled with full consideration of dynamic versus static design, representativeness and balance, and corpus size
Special-purpose corpus
Language type
Monolingual
Bilingual or multilingual
Whether annotated
Part-of-speech annotation
Syntactic structure annotation (treebank)
Semantic information annotation
Raw corpus: corpus without any annotation
Annotated (processed) corpus: corpus carrying detailed annotation
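The difference between a raw corpus and an annotated one can be sketched in a few lines of Python. This is only an illustration: the sentence, the helper names to_slash_format and strip_annotation, and the hand-assigned Penn Treebank style tags are all assumptions, not part of any particular corpus.

```python
# Raw corpus: plain text with no annotation.
raw_sentence = "The dog barks"

# Annotated corpus: the same sentence with one part-of-speech tag per token.
# The tags (DT, NN, VBZ) follow Penn Treebank conventions and are assigned
# by hand here purely for illustration.
annotated_sentence = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")]


def to_slash_format(tagged):
    """Serialize (word, tag) pairs in the common word/TAG text format."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged)


def strip_annotation(tagged):
    """Recover the raw text by discarding the tags."""
    return " ".join(word for word, _ in tagged)


print(to_slash_format(annotated_sentence))
print(strip_annotation(annotated_sentence) == raw_sentence)
```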
Balanced corpus
Representativeness and balance in corpus collection
Seven principles
Problems:
The scientific basis for choosing how samples are distributed across categories
Whether the sampling proportions truly reflect actual language use
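One way to make balance concrete is to fix target proportions per category and allocate samples accordingly. The sketch below assumes hypothetical genre weights and a hypothetical helper allocate_samples; it uses largest-remainder rounding so the counts sum exactly to the requested total. Whether any such weights reflect real language use is exactly the open problem noted above.

```python
def allocate_samples(total, weights):
    """Split `total` samples across categories in proportion to `weights`.

    Largest-remainder rounding keeps the counts summing exactly to `total`.
    """
    weight_sum = sum(weights.values())
    exact = {k: total * w / weight_sum for k, w in weights.items()}
    counts = {k: int(x) for k, x in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining samples to the categories with the largest fractions.
    by_fraction = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_fraction[:leftover]:
        counts[k] += 1
    return counts


# Hypothetical genre weights, chosen only for illustration.
genre_weights = {"news": 4, "fiction": 3, "academic": 2, "spoken": 1}
print(allocate_samples(500, genre_weights))
```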
Parallel corpus
Parallel sampling within a single language (matched by period, subject, proportion, etc.)
Parallel sampling across multiple languages
Synchronic corpus: a corpus built for studying language at a single period in time
Diachronic corpus: a corpus built for studying how language develops over time
Whether it is dynamic
Whether the texts carry a quantified circulation attribute
Whether deep processing is based on dynamic processing methods
Whether a dynamic processing result is obtained
Introduction to typical corpora
Brown Corpus
The world's first standard corpus to collect samples based on systematic principles
LLC (London-Lund Corpus) spoken corpus
Spoken material such as dialogue and broadcasts
Longman Corpus
Respects both native-speaker intuition and the authority of the corpus
UPenn Treebank
Annotation of the grammatical structure of sentences
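Treebank annotation of sentence structure is commonly written as a bracketed string. As a minimal sketch, the Python below reads such a string into nested (label, children) tuples. The function names parse_bracketed and leaves and the example sentence are assumptions made for illustration, and the parser ignores details of the real format such as traces and function tags.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Leaves are plain strings. Assumes well-formed input.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Collect the words at the leaves, left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [word for child in children for word in leaves(child)]


tree = parse_bracketed("(S (NP (DT The) (NN dog)) (VP (VBZ barks)))")
print(tree[0])
print(leaves(tree))
```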
Chinese PropBank and NomBank (the latter focuses on nominal predicates)
Discourse Treebank (annotates coherence relations tied to discourse connectives)
Chinese Text Treebank (CTDB)
Designed to account for the many implicit connections in Chinese
Prague Dependency Treebank
Czech-language corpus
Three levels
Morphological layer: Morphological information
Analytical layer: syntactic information
Tectogrammatical (deep grammar) layer: deep grammatical structure
Comprehensive Language Knowledge Base (CLKB)
Academia Sinica Balanced Corpus (Taiwan):
The world's first Chinese balanced corpus with complete part-of-speech tags
Spoken Translation Corpus (BTEC)
Speech-Translation TED Corpus
A spoken dialogue corpus constructed by the Institute of Automation of the Chinese Academy of Sciences and the Institute of Linguistics of the Chinese Academy of Social Sciences
CASIA multimodal automatic summarization corpus
English: Topic-Documents-Videos-Summaries
Chinese: Topic-Document-Video-Summary
Problems and current situation
Problems:
Dynamic versus static: the choice depends on purpose
Representativeness and balance
Scale
Corpus management and maintenance
Chinese corpus problem
Standardization
Property rights protection
Status quo:
Mostly drawn from standard, well-formed text
Annotation schemes are inconsistent
No clear NLP task orientation
Language Knowledge Base
Knowledge abstracted from language, expressed in language
WordNet
A semantic dictionary that organizes lexical information by word sense
Semantic relations: pointers between synonym sets (synsets)
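The idea of synonym sets linked by pointers can be sketched as a small data structure. The entries below are toy data written by hand, not taken from the WordNet database, and the names Synset, synsets, and hypernym_chain are assumptions made for this illustration rather than any library's API.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous words plus typed pointers to other synsets."""
    name: str
    lemmas: list
    # Maps a relation name (e.g. "hypernym") to a list of target synset names.
    pointers: dict = field(default_factory=dict)


# Toy entries for illustration only.
synsets = {
    "dog.n": Synset("dog.n", ["dog", "domestic dog"], {"hypernym": ["canine.n"]}),
    "canine.n": Synset("canine.n", ["canine", "canid"], {"hypernym": ["animal.n"]}),
    "animal.n": Synset("animal.n", ["animal", "beast"]),
}


def hypernym_chain(name):
    """Follow hypernym pointers upward from a synset, first target only."""
    chain = []
    while True:
        targets = synsets[name].pointers.get("hypernym", [])
        if not targets:
            return chain
        name = targets[0]
        chain.append(name)


print(hypernym_chain("dog.n"))
```

Words are grouped by sense into synsets, and each semantic relation is stored as a pointer from one synset to another, so walking a relation is a lookup followed by a hop.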