Relationship Extraction (1)

        By process, relationship extraction can be divided into two types: pipeline extraction (Pipeline) and joint extraction (Joint Extraction). Pipeline extraction splits the task into two steps: first perform entity recognition, then extract the relationship between pairs of entities. Joint extraction does both in a single step, extracting entities and relationships at the same time. The pipeline approach lets errors propagate and accumulate across stages, while the joint approach is harder to implement.

  From the perspective of the implemented algorithm, relationship extraction is mainly divided into four types:

  1. Hand-Written Patterns;

  2. Supervised Machine Learning;

  3. Semi-Supervised Learning (such as Bootstrapping and Distant Supervision);

  4. Unsupervised Learning.

1.  Hand-written rule templates

  1. Example:

  Consider a relation called hyponym, for example hyponym(France; European countries). This relation can be extracted from both of the following sentences:

  European countries, especially France, England, and Spain...

  European countries, such as France, England, and Spain...

  The words "especially" and "such as" appearing between the two entities can be treated as lexical markers of this relation. By observing more sentences that express it, we can construct rule templates of this kind to extract entity pairs that stand in the hyponym relation, and thereby discover new triples.
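As a toy illustration of such a template, the "such as / especially" pattern above can be sketched as a regular expression. This is an invented minimal example, not an exact production rule:

```python
import re

# Hearst-style pattern: "<hypernym>, such as/especially <hyponym>, <hyponym>, ..."
PATTERN = re.compile(
    r"(?P<hypernym>[A-Z][\w ]+?),\s+(?:such as|especially)\s+(?P<hyponyms>[A-Z][\w, ]+)"
)

def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs matched by the pattern."""
    m = PATTERN.search(sentence)
    if not m:
        return []
    hypernym = m.group("hypernym").strip()
    hyponyms = [h.strip() for h in re.split(r",| and ", m.group("hyponyms")) if h.strip()]
    return [(h, hypernym) for h in hyponyms]

pairs = extract_hyponyms("European countries, such as France, England, and Spain")
# pairs contains ("France", "European countries"), ("England", "European countries"), ...
```

Each extracted pair can then be stored as a hyponym triple; real systems need many such patterns per relation, which is exactly the hand-writing burden discussed below.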

  2. Advantages and disadvantages:

  The advantage is that the extracted triples have high precision, which makes this approach especially suitable for relationship extraction in specific domains. The disadvantage is that recall is very low: extraction is accurate but far from complete, and every relation requires many rules to be written by hand, which is quite tedious.

2.  Supervised learning methods

  Supervised learning labels the entities and relationships in a training corpus, builds training and test sets, and then trains a classifier with traditional machine learning algorithms (logistic regression, SVM, random forest, etc.) or neural networks.

  1. Machine Learning and Deep Learning Methods

  For traditional machine learning methods, the most important step is to construct features. Available features are:

  (1) Word features: the words between entity 1 and entity 2, the words before and after them, and their word vectors; these can be combined as bag-of-words or bigrams.

  (2) Entity label features: the labels (types) of the two entities.

  (3) Dependency-syntax features: parse the dependency structure of the sentence and build features from it. (I don't know how to do this myself.)

  Manually constructing features is quite troublesome, and some features, such as those from dependency parsing, rely on NLP toolkits such as HanLP; errors made by the tools inevitably hurt the accuracy of the features.
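As a concrete (toy) illustration of the word features in (1), here is a minimal extractor for bag-of-words features between and around a candidate entity pair. The feature-name scheme is invented for illustration; the resulting dict could be fed to any standard classifier:

```python
def relation_features(tokens, e1_span, e2_span, window=2):
    """Build simple lexical features for a candidate entity pair.

    tokens  : list of words in the sentence
    e1_span : (start, end) token indices of entity 1 (end exclusive)
    e2_span : (start, end) token indices of entity 2 (end exclusive)
    """
    feats = {}
    # Bag of words between the two entities
    for w in tokens[e1_span[1]:e2_span[0]]:
        feats[f"between={w.lower()}"] = 1
    # Words in a small window before entity 1 and after entity 2
    for w in tokens[max(0, e1_span[0] - window):e1_span[0]]:
        feats[f"before_e1={w.lower()}"] = 1
    for w in tokens[e2_span[1]:e2_span[1] + window]:
        feats[f"after_e2={w.lower()}"] = 1
    return feats

tokens = "Yan Dinggui founded Youwodai in 2011".split()
feats = relation_features(tokens, (0, 2), (3, 4))
# feats includes "between=founded", "after_e2=in", "after_e2=2011"
```

A dict of this shape is exactly what feature-based classifiers (e.g. via scikit-learn's DictVectorizer plus logistic regression) consume; bigrams and entity-type features would be added the same way.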

  An end-to-end deep learning approach is far less laborious. For example, take the word embeddings (Word Embedding) of a sentence as input, use a CNN or BiLSTM as the sentence encoder (feature extractor), and pass the result through a softmax layer to obtain probabilities over the N relation types. This skips the manual feature-construction step, so naturally no errors are introduced there.

  2. Advantages and disadvantages of supervised learning

  The advantage of supervised learning is that if the labeled training corpus is large enough, the classifier performs fairly well; the problem is that the labeling cost is too high.

3.  Semi-supervised learning

  In view of the fact that the cost of supervised learning is too high, using semi-supervised learning for relational extraction is a direction worthy of research.

  There are two main algorithms for semi-supervised learning: Bootstrapping and Distant Supervision. Bootstrapping needs no sentences with labeled entities and relations as a training set and trains no classifier, while Distant Supervision can be regarded as a combination of Bootstrapping and supervised learning, and it does require training a classifier.

3.1 Bootstrapping: a seed-based heuristic method

  The input of the Bootstrapping algorithm is a small number of entity pairs that hold a given relation (the seeds); the output is more entity pairs that hold the same relation. Pay attention here: the goal is not to find more relation types, but to discover more new entity pairs that hold the given relation.

        First, prepare some high-quality entity-relation pairs as initial seeds (Seed). As shown in the figure, the relationship between the author Liu Cixin and "The Three-Body Problem" is authorship, which can be represented by the triple <Liu Cixin, The Three-Body Problem, Creator>. A large-scale corpus is also needed so that patterns can be learned from the seeds. The whole process keeps repeating the following steps:

  • Based on the initial seed, match all relevant sentences in a large-scale corpus;
  • Analyze the context of these sentences and extract some reliable templates;
  • Then use these templates to match the corpus and find more instances to be extracted (new seeds);
  • Then use the newly extracted examples to discover more new templates.

  [Figure: the iterative Bootstrapping loop, from seeds to matched sentences, to templates, to new entity pairs]
        Iteration continues in this way until a preset convergence condition is met. Usually the condition is that no new entities or templates can be found; alternatively, an evaluation metric can be designed so that iteration stops when the quality of newly discovered instances and templates falls below a threshold.
        This method has low construction cost, is suitable for building large-scale data, and may discover new implicit relationships. However, it places high demands on the quality of the initial seeds, its overall accuracy is low, and semantic drift can occur over the iterations. For example, although the fourth sentence in the figure contains both the <Liu Cixin> and <The Three-Body Problem> entities, the relationship between them is not <creator>; more constraints are needed to control the quality of the extracted data.

  Example 2: "Founder" is a relation. Suppose we already have a small knowledge graph with 3 entity pairs expressing this relation: (Yan Dinggui, Youwodai), (Ma Yun, Alibaba), (Lei Jun, Xiaomi).

  • Step 1: In a large corpus, find all sentences that contain any of the three entity pairs. For example: "Yan Dinggui founded Youwodai in 2011"; "Yan Dinggui is the founder of Youwodai"; "Under the leadership of Chairman Yan Dinggui, Jiayin Jinke successfully went public in the United States."
  • Step 2: Summarize the words before, after, or between the entity pair and construct feature templates. For example: "A founded B"; "A is the founder of B"; "under the leadership of A, B".
  • Step 3: Use the feature templates to find more entity pairs in the corpus, then score and rank all the pairs found. Pairs scoring above a threshold are added to the knowledge graph to expand the existing entity pairs.
  • Step 4: Go back to the first step, iterate, get more templates, and find more entity pairs that have the relationship.
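The steps above can be sketched as a toy loop. The corpus sentences are invented, the template induction is deliberately naive (the whole sentence with the pair masked becomes the template), and the scoring step is omitted:

```python
def induce_template(sentence, e1, e2):
    """Mask the entity pair in a matching sentence to obtain a template."""
    if e1 in sentence and e2 in sentence:
        return sentence.replace(e1, "{A}").replace(e2, "{B}")
    return None

def apply_template(template, corpus, entities):
    """Return entity pairs (a, b) whose substitution reproduces a corpus sentence."""
    found = set()
    for a in entities:
        for b in entities:
            if a != b and template.format(A=a, B=b) in corpus:
                found.add((a, b))
    return found

corpus = [
    "Ma Yun founded Alibaba",
    "Lei Jun founded Xiaomi",
    "Yan Dinggui founded Youwodai",
]
entities = {"Ma Yun", "Alibaba", "Lei Jun", "Xiaomi", "Yan Dinggui", "Youwodai"}
seeds = {("Ma Yun", "Alibaba")}

# One Bootstrapping iteration: seeds -> templates -> new entity pairs
templates = set()
for e1, e2 in seeds:
    for s in corpus:
        t = induce_template(s, e1, e2)
        if t:
            templates.add(t)

pairs = set(seeds)
for t in templates:
    pairs |= apply_template(t, corpus, entities)
# pairs now also contains ("Lei Jun", "Xiaomi") and ("Yan Dinggui", "Youwodai")
```

Step 4 would feed the new pairs back in as seeds; a real system must also score templates and pairs, which is exactly what the semantic-drift discussion below is about.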

  Careful readers will notice that not every sentence containing "Yan Dinggui" and "Youwodai" expresses the "founder" relation. For example, "Under the leadership of Chairman Yan Dinggui, Jiayin Jinke successfully went public in the United States" does not. A given entity pair may stand in many different relationships, so how can we be sure a sentence expresses the relationship already in the knowledge graph? Won't we learn wrong templates and then amplify the error iteration after iteration?

  That's right. This problem is called semantic drift (Semantic Drift), and there are generally two solutions:

  • One is manual verification: inspect the sentences selected in each iteration and remove those that do not express the relation.
  • The second is for the Bootstrapping algorithm itself to score newly discovered templates and entity pairs, and then set a threshold to keep only the high-quality ones. The specific formulas can be found in Chapter 17 of "Speech and Language Processing" (3rd edition).
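One common scoring scheme of this kind (reproduced here from memory, so treat the notation as an approximation of the textbook's) rates a pattern p by balancing its precision against its coverage, where hits_p is the number of seed tuples that p matches and finds_p is the total number of tuples p matches; a candidate tuple t is then scored with a noisy-or over the set P' of patterns that matched it:

```latex
\mathrm{Conf}_{R\log F}(p) = \frac{\mathrm{hits}_p}{\mathrm{finds}_p} \cdot \log(\mathrm{finds}_p)

\mathrm{Conf}(t) = 1 - \prod_{p \in P'} \bigl(1 - \mathrm{Conf}(p)\bigr)
```

A tuple matched by several independent high-confidence patterns thus ends up with a confidence close to 1, while a tuple matched only by one weak pattern stays below the threshold.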

  2. Advantages and disadvantages of Bootstrapping

  The first disadvantage of Bootstrapping is the semantic drift problem described above; the second is that precision keeps decreasing while recall stays too low. Because the algorithm is iterative, accuracy inevitably drops with every iteration: 80% → 60% → 40% → 20% ... Therefore, the entity pairs discovered at the end need manual verification.

3.2 Distant Supervision

        Distant supervision is essentially a method for automatically labeling samples, but its underlying assumption is too strong, which leads to mislabeled samples.

        The distant supervision method was first proposed by Mintz et al.; it combines the advantages of supervised learning and seed-based heuristic methods for the relation extraction task. The method rests on a premise assumption:

If two entities participate in a relation, then every sentence that mentions both entities expresses that relation.

        The core idea is to use entities to find potential relations in the corpus, and then use relations to locate and extract entities in reverse. First, a large amount of labeled data is obtained via distant supervision; then a classifier is trained with supervised machine learning or deep learning; then the automatically labeled data is partitioned: high-quality automatically labeled data is added to the training set, while low-quality data is discarded or handed over for manual labeling and review. The whole process is shown in the figure.
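The auto-labeling step can be sketched as follows: align knowledge-base triples with a raw corpus, and label every sentence that mentions both entities with the KB relation. The KB and corpus here are toy examples; note that this is exactly where the strong premise assumption introduces noise:

```python
# Toy knowledge base of (head, relation, tail) triples
kb = [
    ("Liu Cixin", "creator", "The Three-Body Problem"),
    ("Ma Yun", "founder", "Alibaba"),
]

corpus = [
    "Liu Cixin wrote The Three-Body Problem in the 2000s",
    "Liu Cixin attended a reading of The Three-Body Problem",  # noisy match
    "Ma Yun founded Alibaba in Hangzhou",
]

def distant_label(kb, corpus):
    """Label every sentence containing both entities with the KB relation."""
    labeled = []
    for head, rel, tail in kb:
        for sent in corpus:
            if head in sent and tail in sent:
                labeled.append((sent, head, tail, rel))
    return labeled

data = distant_label(kb, corpus)
# The second sentence is labeled "creator" even though it does not express
# the relation -- this is the noise introduced by the premise assumption.
```

The output would then be filtered (or handled with multi-instance learning, as in PCNN below) before training a classifier on it.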

  [Figure: the distant-supervision pipeline, from KB-aligned auto-labeling, to classifier training, to quality-based filtering of the labeled data]

        The distant supervision method can quickly obtain a large amount of labeled data, but it also has two obvious shortcomings:

  • In some cases the premise assumption does not hold, so a lot of noisy data is learned;
  • When features are extracted from the distantly labeled data to train the classifier, upstream NLP tasks such as part-of-speech tagging and syntactic parsing are used to construct the features, so their errors propagate and hurt the classifier. This is also a direction for follow-up research: later researchers replaced manually extracted features with neural network feature extractors and used word embeddings as text features, as in PCNN.

        Overall, though, distant supervision remains a very good method. Most current state-of-the-art approaches build on it, and many works are devoted to solving the two problems above to improve task performance. For example, Riedel et al. proposed a relaxed version of the premise: if a relation holds between two entities, then at least one sentence containing both entities expresses that relation. As another example, during feature construction the traditional pipeline feature extraction can be abandoned in favor of a CNN with an attention mechanism.

        PCNN: its highlights are the combination of multi-instance learning, a convolutional neural network, and piecewise (segmental) max pooling, used to alleviate mislabeled sentences and the errors of hand-designed features, improving relation extraction performance.
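PCNN's piecewise max pooling can be sketched in plain Python: the two entity positions split each convolutional feature map into three segments, and each segment is max-pooled separately, so coarse positional structure survives the pooling. The feature values below are invented for illustration:

```python
def piecewise_max_pool(feature_map, e1_pos, e2_pos):
    """Max-pool a 1-D feature map in three segments split at the entity positions.

    feature_map : list of floats, one activation per token position
    e1_pos, e2_pos : token indices of the two entities (e1_pos < e2_pos)
    """
    segments = [
        feature_map[: e1_pos + 1],             # up to and including entity 1
        feature_map[e1_pos + 1 : e2_pos + 1],  # between the entities
        feature_map[e2_pos + 1 :],             # after entity 2
    ]
    return [max(seg) if seg else 0.0 for seg in segments]

# One filter's activations over an 8-token sentence, entities at positions 1 and 5
pooled = piecewise_max_pool([0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6], 1, 5)
# pooled == [0.9, 0.8, 0.6]
```

Ordinary max pooling would collapse the whole map to a single 0.9; keeping three values per filter is what lets the classifier see where the strong activation occurred relative to the entities.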

4.  Unsupervised methods

  Semi-supervised methods already perform only passably, and unsupervised methods perform even worse, so they are not covered here.


Origin blog.csdn.net/qq_27586341/article/details/128076648