2021 Stanford CS224N Course Notes~2

2 Neural Classifiers

image-20230310172736129

2.1 Coverage of this article

  • word2vec and word vector review
  • Algorithm Optimization Basics
  • Count and co-occurrence matrix
  • GloVe model
  • Word vector evaluation
  • word senses

2.2. Review: the main idea of ​​word2vec

2.2.1. Main steps

For details, see 1.3.2 The specific idea of ​​the Word2Vec algorithm

(1) Initialization: start from random word vectors;

(2) Traversal: traverse each word in the entire corpus;

(3) Prediction: Try to use word vectors to predict surrounding words (see Figure 2.1):

image-20230309213429464

(4) Learning: Update the vectors so that they can better predict the actual words around them.


Note: the algorithm learns word vectors that capture word similarity well and encode meaningful directions in the word space!

The algorithm is limited to word vectors and does not do much else.

2.2.2. Parameter calculation

The parameters and calculation process of Word2vec:

Take the dot product of the center word vector (for example, v_4 for the fourth word) with each outside word vector in U, then apply softmax normalization to obtain the estimated probabilities, as shown in Figure 2.2:

image-20230310181622467
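As a minimal sketch of this step (toy shapes and names of my own choosing; `U` holds the outside-word vectors as rows and `v_c` is the center word vector, as in Figure 2.2), the predicted distribution over the vocabulary is just a softmax over dot products:

```python
import numpy as np

def predict_context_distribution(U, v_c):
    """Softmax over dot products of the center vector with every outside vector.

    U   : (V, d) matrix whose rows are outside ("context") word vectors
    v_c : (d,)   center word vector
    Returns a length-V probability distribution P(o | c).
    """
    scores = U @ v_c                      # dot product u_o . v_c for every word o
    scores -= scores.max()                # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # softmax normalization

# toy example with a 5-word vocabulary and 3-dimensional vectors
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))
v_c = rng.normal(size=3)
print(predict_context_distribution(U, v_c))  # sums to 1
```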

In fact, this is what we want from the model: it makes the same prediction at every position, and it should give a reasonably high probability estimate to all words that (often) appear in the context. This is a **"Bag of Words" model**.

  • Each row represents the word vector of one word. The scores obtained from the dot products are mapped to a probability distribution through softmax; this is the distribution over context words given the center word. It depends only on the center word and not on the position within the context, so the prediction is the same at every position.

  • We want the model to give a reasonably high probability estimate for all words that occur (fairly frequently) in the context

  • Stop words such as the, and, that, of receive high probability for almost any center word, since their vectors have large dot products with most words

    • Removing these words can make the word vectors work better

2.2.3. Maximizing the objective function

Word2vec maximizes the objective function by placing similar words close together in the vector space (see Figure 2.3).

image-20230310184117351

In Figure 2.3, the idea of t-SNE (t-distributed Stochastic Neighbor Embedding) is to represent similarities between high-dimensional points as conditional probabilities. One use of t-SNE is to visually verify that an algorithm works (similar words end up close together in the word vector space), i.e., as a qualitative evaluation. It is worth mentioning that t-SNE tries to preserve local neighborhood structure and, to some extent, the global structure of the data, and it works well on many clustering and visualization problems.

2.2.4. Optimization: Gradient Descent

Suppose we have a cost function J(θ) that we want to minimize; then we can learn good word vectors. The standard way to learn word embeddings is gradient descent, an algorithm that minimizes J(θ) by iteratively changing θ.

The idea of gradient descent: from the current value of θ, compute the gradient of J(θ), then take a small step in the direction of the negative gradient; repeat.

Note: Our actual objective function may not be a convex function like the one shown below

[Learning rate α] If the learning rate α is too small, convergence is slow; if α is too large, the iterates may fail to converge. For example, with a large α, J(θ) may not decrease at every iteration and can even diverge, overshooting the minimum, as shown in Figure 2.5:

image-20230310184833995

Recommended choice of learning rate α:

First, try α values such as 0.001, 0.01, 0.1;

Then, check the convergence behavior; if convergence is too slow, increase α to 0.003, 0.03, 0.3, and so on, until you are satisfied.

The following are gradient descent and stochastic gradient descent algorithms

2.2.4.1. Gradient descent

Gradient descent repeatedly updates the parameters. The update can be written in two equivalent forms (α is the learning rate):

In matrix notation:

$$
\theta^{new} = \theta^{old} - \alpha \, \nabla_\theta J(\theta)
$$

For a single parameter:

$$
\theta_j^{new} = \theta_j^{old} - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)
$$
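As a minimal illustrative sketch (not the course's code, and a toy quadratic objective rather than the word2vec objective), here is the update rule applied in a loop, assuming the gradient can be evaluated directly:

```python
import numpy as np

def gradient_descent(grad_fn, theta0, alpha=0.1, num_steps=100):
    """Plain (batch) gradient descent: theta <- theta - alpha * grad J(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - alpha * grad_fn(theta)
    return theta

# toy objective J(theta) = ||theta - 3||^2, whose gradient is 2 * (theta - 3)
grad = lambda theta: 2.0 * (theta - 3.0)
print(gradient_descent(grad, theta0=np.zeros(2)))  # converges toward [3., 3.]
```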

2.2.4.2. Stochastic Gradient Descent (SGD)

image-20230310191737022

① Problem: Because J(θ) above is a function of all windows in the corpus (possibly billions!), computing its gradient is very expensive, and a single update takes a very long time. Full-batch gradient descent is therefore a very bad idea for essentially all neural networks!

② Solution: stochastic gradient descent (SGD): repeatedly sample a single window (training sample), compute the gradient on that sample, update the parameters, and keep iterating over the samples.

However, updates based on a single sample fluctuate a lot and convergence is not smooth, so in practice mini-batch gradient descent is often used instead (for details, see the Neural Network Optimization Algorithms article in ShowMeAI's Deep Learning Tutorial).

Mini-batches have two advantages: averaging over the batch reduces the noise of the gradient estimate, and the computation parallelizes well on GPUs.

③ However, a new problem arises: sparseness .

image-20230310191129234

image-20230309213429464

image-20230310191328412

Therefore, a new idea: we might only update the word vectors that actually appear in the window! Specifically, choose one of the following:

(1) Either use a sparse matrix update operation that updates only certain rows (not columns!) of the full embedding matrices U and V;

(2) Or keep a hash map from words to their word vectors, updating only the vectors of the specific words involved (the word2vec source code stores word vectors in a hash table), as shown in Figure 2.6(b).

If you have millions of word vectors and do distributed computation, it's important not to have to send huge updates all over the place!
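A minimal sketch of idea (1), with hypothetical names of my own: after computing the gradient for one window, touch only the rows of U and V that belong to words appearing in that window.

```python
import numpy as np

def sparse_sgd_update(U, V, center_id, context_ids, grad_center, grad_context, alpha=0.025):
    """Update only the embedding rows that actually occur in this window.

    U, V         : (vocab_size, d) outside-word and center-word embedding matrices
    center_id    : index of the center word
    context_ids  : indices of the outside words in the window
    grad_center  : (d,) gradient w.r.t. the center word vector
    grad_context : (len(context_ids), d) gradients w.r.t. the outside word vectors
    """
    V[center_id] -= alpha * grad_center            # one row of V
    for row, g in zip(context_ids, grad_context):  # a handful of rows of U
        U[row] -= alpha * g
    # all other rows of U and V are left untouched: no dense update, and nothing
    # huge to ship around in a distributed setting
```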

2.2.5. Word2vec algorithm family: two vectors, two models, two algorithms

Why two vectors per word? Because it makes optimization easier. At the end, the two are averaged to give the final word vector.

Word2vec is a software package that actually contains two models and two algorithms (training methods):

Two models: Continuous Bag-of-Words (CBOW) and Skip-Gram. CBOW predicts the center word from the context words around it; Skip-Gram does the opposite, predicting the probability distribution of the surrounding context words given the center word.

Two algorithms: Negative Sampling and Hierarchical Softmax. Negative Sampling defines the objective by drawing negative samples; Hierarchical Softmax computes probabilities over all words with an efficient tree structure. [In practice, negative sampling is the common way to speed up training.]


The following focuses only on the Skip-Gram model and the negative sampling algorithm (the subject of Homework 2).

Because the normalization term of the naive softmax (i.e., the denominator of the equation below) is too expensive to compute, in standard word2vec you **implement the Skip-Gram model with negative sampling**.

image-20230309213429464

Main idea: train binary logistic regression to distinguish a true pair (the center word paired with a word in its window, a "positive" pair) from several noise pairs (the center word paired with random words, "negative" pairs).

(See: Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al. 2013))

image-20230310200102254

Sampling probability distribution

When we sample negative words, we do not sample uniformly; we start from the unigram distribution of words, i.e., the empirical probability with which each word occurs in the corpus. (For example, with a corpus of one billion tokens in which a particular word occurs 90 times, that word's unigram probability is 90 divided by one billion.)

sigmoid function

image-20230310200520743

We want to maximize the probability that two words co-occur:

  • We want the dot product between the center word vector and the true context word vector to be large, and the dot product between the center word and each random word to be small
  • k is the number of negative samples we draw (a minimal sketch of this objective follows this list)
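A minimal sketch of the skip-gram negative sampling objective for one (center, outside) pair (hypothetical shapes and names, not the word2vec source). It uses the sigmoid σ(x) = 1/(1 + e^{-x}) from above; in word2vec the negative-sampling distribution is commonly the unigram distribution raised to the 3/4 power.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """J = -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)

    v_c   : (d,)   center word vector
    u_o   : (d,)   true outside (context) word vector   -> push its dot product up
    U_neg : (k, d) vectors of k sampled negative words   -> push their dot products down
    """
    pos_term = -np.log(sigmoid(u_o @ v_c))
    neg_term = -np.sum(np.log(sigmoid(-(U_neg @ v_c))))
    return pos_term + neg_term

def sample_negatives(unigram_counts, k, rng, power=0.75):
    """Sample k negative word indices from the unigram distribution**0.75 (common choice)."""
    probs = unigram_counts ** power
    probs = probs / probs.sum()
    return rng.choice(len(unigram_counts), size=k, p=probs)
```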

2.3. Co-occurrence matrix

Another way to construct word vectors in natural language processing is to use a co-occurrence matrix (call it X). There are two options: statistics based on windows or on full documents:

Can we capture the essence of word meaning more effectively by counting? The answer is yes. In other words, we can work directly with co-occurrence counts.

To construct the co-occurrence matrix X there are two options: windows or full documents.

  • Window: as in word2vec, use a window around each word, which captures some syntactic and semantic information;

  • Word-document: the basic assumption is that words appearing in the same document are more likely to be related. If word w_i appears in document d_j, add 1 to the co-occurrence matrix entry X_ij. After processing all documents in the corpus, we obtain a matrix X of size |V|×M, where |V| is the vocabulary size and M is the number of documents. This way of building the co-occurrence matrix is also used by classic Latent Semantic Analysis. The word-document matrix yields general topics (e.g., all sports terms will have similar entries), leading to latent semantic analysis.

Example 2.2: Window-based co-occurrence matrix.

A window-based co-occurrence matrix is built by counting how often pairs of words co-occur within a fixed-length window (typically 5-10 words).

Assume a window length of 1 (in practice 5-10 is more common), a symmetric co-occurrence matrix (left and right context treated the same), and the following three-sentence corpus: I like deep learning. I like NLP. I enjoy flying.

Then the co-occurrence matrix is ​​as follows (see Figure 2.7):

image-20230310202915500
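A minimal sketch of building this window-1 symmetric co-occurrence matrix for the toy corpus (variable names are mine, not from the notes):

```python
corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 1

# vocabulary in order of first appearance
vocab = []
for sent in corpus:
    for tok in sent.split():
        if tok not in vocab:
            vocab.append(tok)
index = {w: i for i, w in enumerate(vocab)}

# symmetric window-based counts
X = [[0] * len(vocab) for _ in range(len(vocab))]
for sent in corpus:
    toks = sent.split()
    for i, tok in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if i != j:
                X[index[tok]][index[toks[j]]] += 1

print(vocab)
for row in X:
    print(row)
```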

2.3.1. Co-occurrence vectors

A co-occurrence vector is one row of the co-occurrence matrix and serves as a word vector. Building word vectors directly from the co-occurrence matrix has some obvious problems:

Problem: simple count-based co-occurrence vectors have the following flaws:

  • Using raw co-occurrence counts to measure word similarity means the vectors grow with the vocabulary size;

  • Very high dimensionality: they require a lot of storage (albeit sparse);

  • Subsequent classification models suffer from sparsity, which makes them less robust.

Solution: build low-dimensional dense vectors [dimensionality reduction].

Idea: store "most" of the important information in a fixed, small number of dimensions (e.g., 25-1000), building dense vectors similar to word2vec.

So, how do we reduce the dimensionality?

2.3.2. Dimensionality reduction method: singular value decomposition (SVD)

Here are two important concepts for dimensionality reduction: singular value decomposition and the rank-k approximation.

The first concept is singular value decomposition

Definition of singular value decomposition: a matrix X of dimension m×n can be decomposed into the product of three matrices:

$$
X = U \Sigma V^T \tag{2.7}
$$
where:

U is a unitary matrix of dimension m×m;

V^T is the transpose of the unitary matrix V, whose dimension is n×n (its columns, like those of U, are orthonormal);

Σ is an m×n diagonal matrix whose diagonal entries are the non-negative singular values, arranged in descending order.

Example 2.3: singular value decomposition, see Figure 2.8:

image-20230310203507345

Figure 2.8(a) is the full form of the singular value decomposition: the matrix X is factored into three matrices U, Σ, V^T, with three singular values on the diagonal of Σ.

Figure 2.8(b) is the reduced form: because the fourth row and the fourth and fifth columns of Σ are all zero, they can be deleted, and the corresponding column of U and rows of V^T removed as well, so all three matrices shrink.

Figure 2.8(c) shows the effect of dimensionality reduction at large scale: suppose m = 1,000,000 and n = 500,000, so X has 1,000,000 × 500,000 = 500 billion entries; printed in 5-point font it would cover an area the size of two West Lakes! SVD factors this huge X into the product of three much smaller matrices: a 1,000,000 × 100 matrix U, a 100 × 100 matrix Σ, and a 100 × 500,000 matrix V^T. Together these three matrices have only about 150 million entries, less than one-thousandth of the original ("of three thousand rivers, take but one ladle to drink"), so storage and computation shrink by more than three orders of magnitude; the dimensionality reduction is dramatic.

To reduce the size while preserving as much useful information as possible, keep only the largest k values on the diagonal of Σ, along with the corresponding columns of U and rows of V^T.

  • This is a classic linear algebra algorithm, but it is computationally expensive for large matrices.

The second concept is the rank-k approximation.

**The rank of a matrix (rank)** is the maximum number of linearly independent row (or column) vectors in the matrix.

If a vector r cannot be expressed as a linear combination of r_1 and r_2, i.e. r ≠ a·r_1 + b·r_2 for any scalars a and b, then r is said to be linearly independent of r_1 and r_2.

Consider the following three matrices:

image-20230310204148048

In matrix A, the row vector r_2 is 2 times r_1 (r_2 = 2·r_1), so Rank(A) = 1.

In matrix B, the row vector r_3 is the sum of r_1 and r_2 (r_3 = r_1 + r_2), while r_1 and r_2 are linearly independent of each other, so Rank(B) = 2.

In matrix C, all three rows are linearly independent, so Rank(C) = 3.

The rank of a matrix can be considered as a representative of the amount of unique information represented by the matrix . The higher the rank, the more informative.
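These ranks are easy to check numerically. The matrices below are hypothetical stand-ins satisfying the same relations as A, B, and C in the figure (r_2 = 2·r_1, r_3 = r_1 + r_2, and three independent rows, respectively):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6]])          # second row = 2 * first row
B = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 2]])          # third row = first row + second row
C = np.eye(3)                      # three independent rows

print(np.linalg.matrix_rank(A))    # 1
print(np.linalg.matrix_rank(B))    # 2
print(np.linalg.matrix_rank(C))    # 3
```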

In large applications, SVD is used to obtain a rank-k approximation of the X matrix: among the non-zero values arranged in descending order on the diagonal of Σ, a small number (k) at the front are large, while most of the remaining singular values are very small and can be dropped. So we keep only the first k singular values and truncate the three matrices accordingly (see Figure 2.9); the product of the truncated matrices is the rank-k approximation of X, denoted X_k.

image-20230310204515568

From Figure 2.9 we can see that the rank-k approximation of X involves two rounds of reduction:

The first removes the zeros: all zero entries on the diagonal of Σ are dropped, together with the corresponding columns of U and rows of V^T (the green part of the figure);

The second removes the small values: only the k largest non-zero values on the diagonal of Σ are kept as singular values, the smaller ones are discarded, and again the corresponding columns of U and rows of V^T are removed (the gray part of the figure).

Example 2.4: with the corpus I like deep learning. I like NLP. I enjoy flying., a simple Python script for SVD word vectors is shown in Figure 2.10(a); plotting the first two columns of U (corresponding to the two largest singular values) gives Figure 2.10(b).

image-20230310205013243
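A minimal reconstruction of the kind of script shown in Figure 2.10(a) (the exact code there is not reproduced; this is a sketch under the same toy corpus and window of size 1):

```python
import numpy as np
import matplotlib.pyplot as plt

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = sorted({tok for sent in corpus for tok in sent.split()})
index = {w: i for i, w in enumerate(vocab)}

# symmetric co-occurrence matrix with a window of size 1
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    toks = sent.split()
    for i, tok in enumerate(toks):
        for j in (i - 1, i + 1):
            if 0 <= j < len(toks):
                X[index[tok], index[toks[j]]] += 1

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# rank-2 word vectors: the first two columns of U (two largest singular values)
for i, word in enumerate(vocab):
    plt.scatter(U[i, 0], U[i, 1])
    plt.annotate(word, (U[i, 0], U[i, 1]))
plt.show()
```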

2.3.3. Tips for finding X matrix

Here are a few tricks Rohde et al. used in COALS in 2005

(Rohde et al. ms., 2005. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence):

(1) Running SVD on raw counts does not work well;

(2) Scaling the counts in the cells would help a lot;

Problem: function words (the, he, has) are too frequent, so syntax has too much impact. Some fixes:

① Take the log of the frequencies;

② Clamp the counts: min(X, t) with t ≈ 100 (i.e., cap high-frequency words, counting at most 100 occurrences);

③ Ignore the function words entirely;

④ Use log scaling.

(3) Ramped (slanted) windows: in window-based counting, give more weight to closer words (i.e., decay the count with distance from the center word), e.g., count 1 for an adjacent word and 0.5 for a word 5 positions away from the center word;

(4) Use Pearson correlations instead of raw counts, then set negative values to 0; and so on (a short sketch of some of these transforms follows).
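A small sketch applying two of these transforms (clamping and log scaling) to a raw count matrix; the function and threshold names are mine, not COALS's:

```python
import numpy as np

def scale_counts(X, t=100.0):
    """Clamp very large counts, then log-scale: two of the COALS-style fixes."""
    X = np.minimum(X, t)        # min(X, t): cap the influence of very frequent words
    return np.log(1.0 + X)      # log scaling (log1p keeps zero counts at zero)

raw = np.array([[0., 3., 250.],
                [3., 0., 12.],
                [250., 12., 0.]])
print(scale_counts(raw))
```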

​ The word vectors trained with the COALS model also exhibit many interesting properties (similar to Word2Vec), see Figure 2.11.

image-20230310205348314

In the syntactic patterns of Figure 2.11(a), the word vectors mostly form clusters: different tenses of the same verb are grouped together.

In the semantic patterns of Figure 2.11(b), the word vectors show a roughly linear structure: the vector from a verb to the doer of that action points in a consistent direction (e.g., drive → driver).

However, SVD also has drawbacks: it is hard to run on very large datasets and computationally expensive; and if new vocabulary or documents are added, the SVD must be recomputed from scratch, which is costly.

2.4. GloVe model

Let's summarize the characteristics of the two families of methods for obtaining word vectors: count-based methods (co-occurrence matrices) and prediction-based models:

image-20230312122100167

Count-based: direct estimation using global statistics of the entire matrix

Direct prediction: define a probability distribution and try to predict words

In particular, let's compare the advantages and disadvantages of SVD and word2vec:

image-20230312122116248

These trade-offs inspired a group of researchers at Stanford, and the GloVe word vectors grew out of that motivation.

GloVe (Global Vectors) combines the advantages of both families: it makes full use of global statistics, so training is fast and it scales to huge corpora, while still performing well on smaller corpora and with smaller word vectors.

2.4.1. Encoding meaning components

Key idea: ratios of co-occurrence probabilities can encode meaning components.

What matters is not a single probability but the ratio between probabilities, which carries the meaning component.

Key insight from GloVe: the ratio of co-occurrence probabilities can encode meaning components (Pennington, Socher, and Manning, EMNLP 2014).

image-20230312123441528

We start with a simple example that shows how meaning can be extracted directly from co-occurrence probabilities. Consider two words i and j that express a particular aspect of interest; for concreteness, suppose we are interested in concepts related to thermodynamics, with i = ice and j = steam. We examine their relationship by studying the ratio of their co-occurrence probabilities with various probe words x.

In Figure 2.12, the left side is the qualitative reasoning and the right side is the quantitative calculation; the numbers are then used to check the reasoning. There are three tests:

① solid is strongly associated with ice (solid ice) but we rarely talk about solid steam, so qualitatively P(x|ice) should be much larger than P(x|steam) (the "large" on the left), and the quantitative ratio on the right is indeed large (8.9).

② gas is usually not used to describe ice but does co-occur with steam, so qualitatively P(x|ice) should be much smaller than P(x|steam) (the "small" on the left), and the quantitative ratio on the right is indeed small (8.5×10⁻²).

③ water is related to both ice and steam, while fashion is related to neither, so in both cases P(x|ice) should be close to P(x|steam) (the two ≈1 entries on the left), and the quantitative ratios on the right are indeed close to 1 (1.36 and 0.96).

All three tests confirm that the reasoning is correct.

Compared with the raw probabilities P(x|ice) and P(x|steam), the ratio P(x|ice)/P(x|steam) is much better at separating relevant words (solid, gas) from irrelevant ones (water, fashion), and it also better discriminates between the two relevant words.


The question then is: how do we capture these ratios of co-occurrence probabilities as linear meaning components in the word vector space?

Answer: use a log-bilinear model:
$$
\text{log-bilinear model: } w_i \cdot w_j = \log P(i \mid j)
$$

$$
\text{with vector differences: } w_x \cdot (w_a - w_b) = \log \frac{P(x \mid a)}{P(x \mid b)}
$$
and derive the objective function from it.

  • If the dot product of two word vectors equals the log of their co-occurrence probability (log-bilinear model), then vector differences correspond to log ratios of co-occurrence probabilities.

image-20230312132912536

  • Use squared error to push the dot product as close as possible to the logarithm of the co-occurrence probability

  • Use a weighting function f(x) to cap the influence of very frequent words [giving less weight to noisy or very common co-occurrences]; the resulting objective is sketched after this list

    image-20230312132955626
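Putting these pieces together, the GloVe objective from Pennington et al. (2014) is the weighted least-squares loss below; b_i and b̃_j are bias terms, and the hyperparameter values usually cited from the paper are x_max = 100 and α = 3/4:

$$
J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
$$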

2.4.2. Objective function derivation

The above arguments suggest that a suitable starting point for word vector learning should be the ratio of co-occurrence probabilities (Pik/Pjk), rather than the probabilities themselves (Pik and Pjk). Noting that the ratio Pik /Pjk depends on the three terms i, j and k, the most general model takes the form:

$$
F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \tag{2.8}
$$

where w ∈ ℝ^d are word vectors and w̃ ∈ ℝ^d are separate context word vectors. In equation (2.8), the right-hand side is extracted from the corpus, while F on the left may depend on parameters not yet specified. There are many possibilities for F, but by imposing a few requirements we can force a unique choice. First, we want F to encode the information of the ratio P_ik/P_jk in the word vector space. Since vector spaces have an inherently linear structure, the most natural way to do this is with vector differences. We therefore restrict ourselves to functions F that depend only on the difference of the two target word vectors, so equation (2.8) becomes:

$$
F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \tag{2.9}
$$

Next, note that the arguments of F on the left side of equation (2.9) are vectors, while the right side is a scalar. F could be a complicated function, e.g. parameterized by a neural network, but that would obscure the linear structure we are trying to capture. To avoid this, we first take the dot product of the arguments:

$$
F\left((w_i - w_j)^{\top} \tilde{w}_k\right) = \frac{P_{ik}}{P_{jk}} \tag{2.10}
$$

This prevents F from mixing vector dimensions in undesirable ways. Next, note that for a word-word co-occurrence matrix, the distinction between a word and a context word is arbitrary and the two roles can be swapped freely. To do this consistently we must swap not only w ↔ w̃ but also X ↔ X^T. The final model should be invariant under this relabeling, but equation (2.10) is not. However, symmetry can be restored in two steps:

image-20230312133745062

2.4.3. Model strengths and results

1. Advantages of the model

• Training is very fast;

• Scales to huge corpora;

• Good performance even with small corpora and small vectors.

2. Model results

For example, using the word vectors produced by GloVe we can find the words closest to frog, and we see that they are very similar animals (mostly other frog species):

1. frogs

2. toad

3. litoria

4. leptodactylidae

5. rana

6. lizard

7. eleutherodactylus

2.5. Evaluating word vectors

How to evaluate word vectors? There are two methods of evaluation in NLP: intrinsic and extrinsic.

  • Intrinsic evaluation

    • Evaluate on a specific/intermediate subtask
    • Fast to compute
    • Helps to understand the system
    • Not clear whether it is really helpful unless a correlation with a real task is established
  • Extrinsic evaluation

    • Evaluate on a real task (e.g., a downstream NLP task)
    • Computing accuracy can take a long time
    • Hard to diagnose whether a problem lies in one subsystem, in the interaction between subsystems, or elsewhere
    • If replacing exactly one subsystem with another improves accuracy, that is a win

2.5.1. Intrinsic Evaluation

2.5.1.1. Word vector analogy

The word vector analogy task consists of questions of the form "a is to b as c is to ___?", i.e., a : b :: c : ?. The dataset contains 19,544 such questions, divided into semantic and syntactic subsets. Semantic questions are typically analogies about people or places, such as "Athens is to Greece as Berlin is to ___?". Syntactic questions are typically analogies about verb tenses or adjective forms, such as "dance is to dancing as fly is to ___?". To answer correctly, the model should uniquely identify the missing term, and only an exact match counts as correct. We answer the question by using cosine similarity to find the word d whose vector w_d is closest to w_b − w_a + w_c.

Specifically, the word vector analogy is solved by maximizing the cosine similarity between word vectors.

​ For a word pair a, b with a certain relationship , given the word c , find words with a similar relationship as shown in (2.17):

$$
d = \arg\max_i \frac{(x_b - x_a + x_c)^{\top} x_i}{\lVert x_b - x_a + x_c \rVert} \tag{2.17}
$$

The intuition behind equation (2.17): since we expect x_b − x_a ≈ x_d − x_c (for example, queen − king ≈ actress − actor), we have x_d ≈ x_b − x_a + x_c; we then maximize the dot product between x_b − x_a + x_c and each candidate x_i, normalized to give the cosine similarity.
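A minimal sketch of this procedure (assuming an embedding matrix E with unit-normalized rows and a word-to-index dict vocab; both names are mine). Note that, as discussed further below, the three input words are excluded from the search:

```python
import numpy as np

def analogy(E, vocab, a, b, c):
    """Return the word d maximizing cosine similarity to (x_b - x_a + x_c).

    E     : (V, d) embedding matrix with L2-normalized rows
    vocab : dict mapping word -> row index
    """
    inv_vocab = {i: w for w, i in vocab.items()}
    target = E[vocab[b]] - E[vocab[a]] + E[vocab[c]]
    target = target / np.linalg.norm(target)

    sims = E @ target                   # cosine similarities (rows have unit norm)
    for w in (a, b, c):                 # exclude the input words themselves
        sims[vocab[w]] = -np.inf
    return inv_vocab[int(np.argmax(sims))]

# usage sketch: analogy(E, vocab, "man", "woman", "king")  ->  hopefully "queen"
```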

In fact, the word vector analogy problem is a parallelogram model

For example, following the parallelogram model, given the three known word vectors in man : woman :: king : ?, the output should be the vector closest to queen, see Figure 2.14.

image-20230312140543119

In summary, word vector analogy evaluates word vectors by how well their additive cosine distances capture intuitive semantic and syntactic analogy issues.

Note: sometimes the nearest word returned by the parallelogram algorithm for a : b :: c : ? in the word2vec or GloVe embedding space is not actually d, but one of the three input words or a morphological variant of them (e.g., cherry : red :: potato : x may return potato or potatoes instead of brown), so these must be explicitly excluded, i.e., any result identical to an input word is discarded from the search output!

Question: what if the relevant information is present in the vectors but the relationship is not linear?

2.5.1.2. GloVe visualization

GloVe can capture the relational attributes of the word vector space , such as semantic attributes and syntactic attributes, see Figure 2.15

image-20230312140949361

The figure shows the spatial distribution of word vectors obtained by GloVe. Subtracting the vectors within analogous word pairs yields roughly the same offset, e.g.:

brother – sister, man – woman, king - queen

image-20230312141225856

2.5.1.3. Analogy Evaluation and Hyperparameters

Definition of hyperparameters: In the context of machine learning, a hyperparameter is a parameter whose value is set before starting the learning process , rather than the parameter data obtained through training. Usually, it is necessary to optimize the hyperparameters and select a set of optimal hyperparameters for the learning machine to improve the performance and effect of learning.

An example of hyperparameter settings for the GloVe analogy evaluation: the word vector dimension (Dim) is set to 300 and the number of corpus tokens (Size) to 6 billion, and six models from SVD to GloVe are evaluated on semantic (Sem), syntactic (Syn), and overall (Tot) accuracy; the results are shown in Figure 2.16:

image-20230312141449200

​ GloVe's research results are shown in Figure 2.17:

image-20230312141520459

​ As shown in Figure 2.17(a), more data is more accurate, and Wikipedia data outperforms news text data. This is because the larger the corpus for model training, the better the performance of the model . For example, word analogies can produce incorrect results if the test word was not included during training.

  • Because Wikipedia is all about explaining concepts and how they are related to each other , more explanatory text showing all the connections between things
  • And the news doesn't explain, it just explains some events

As shown in Figure 2.17(b), higher dimensionality generally gives higher accuracy, and around 300 dimensions is a good choice. Word embeddings of very low or very high dimensionality perform poorly. On the one hand, low-dimensional embeddings cannot capture the meanings of the different words in the corpus: this is a high-bias problem caused by too little model capacity. For example, consider the words "king", "queen", "man", "woman": intuitively we need at least two dimensions, such as "gender" and "leadership", to encode them as two-dimensional word vectors, and vectors with fewer dimensions cannot capture the semantic differences between the four words. On the other hand, too many dimensions may capture noise in the corpus that does not help generalization: the so-called high-variance problem.

The figure below summarizes some experiments and practical experience with analogy evaluation and hyperparameters:

  • 300 is a good word vector dimension

  • Asymmetric contexts (only using words on one side) are not very good, but this may not be completely true in downstream tasks

  • A window size of 8 works well for GloVe vectors

    • A window size of 2 also works and is even better for syntactic tasks, because syntactic effects are very local

2.5.1.4. Another Intrinsic Evaluation

Another simple way to evaluate word vector quality is to ask humans to score the similarity of two words on a fixed scale (e.g., 0-10) and then compare those scores with the cosine similarity of the corresponding word vectors. This has been tried on several datasets with human judgments. Word vector distance (cosine similarity) can thus be tested for its correlation with human similarity judgments, see Figure 2.18(a); we can also inspect the words closest to, e.g., "Sweden", see Figure 2.18(b).

image-20230312142511379

2.5.1.5. Correlation evaluation

Some ideas from the GloVe paper have been shown to improve the skip-gram (SG) model as well (e.g., averaging the two sets of vectors), see Figure 2.19:

image-20230312142724392

Figure 2.19 shows the results for five different word similarity datasets. Similarity scores are obtained from word vectors by first normalizing each feature in the vocabulary and then computing cosine similarity. We compute the Spearman rank correlation coefficient between this score and human judgment. CBOW∗ represents the vectors available on the word2vec website, which are trained with word and phrase vectors on the 100B word news data. GloVe outperforms it when using a corpus less than half the size.
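A minimal sketch of this evaluation protocol (the numbers and word pairs below are hypothetical placeholders, not the actual datasets): correlate the model's cosine similarities with human judgments via the Spearman rank correlation coefficient.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# model_sims would be computed as cosine(E[w1], E[w2]) for each pair in the dataset
human_scores = [9.2, 8.5, 3.1, 1.0]       # human similarity judgments for 4 word pairs
model_sims   = [0.81, 0.74, 0.22, 0.05]   # corresponding cosine similarities

rho, _ = spearmanr(human_scores, model_sims)
print(f"Spearman correlation: {rho:.3f}")
```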

2.5.2. External Evaluation

Extrinsic evaluation of word embeddings targets the real downstream NLP tasks of this course. One example where good word embeddings should help directly is named entity recognition. Figure 2.20 shows results of the GloVe model on the NER task with a CRF-based model. The Discrete baseline uses the comprehensive set of discrete features that comes with the standard distribution of the Stanford NER model, but no word vector features. Besides the previously discussed HPCA and SVD models, comparisons are made with the models of Huang et al. (2012) (HSMN) and Collobert and Weston (2008) (CW); the CBOW model was trained with the word2vec tool. The GloVe model outperforms all other methods on all evaluation metrics except the CoNLL test set, on which HPCA does slightly better. The results show that GloVe vectors are useful in downstream NLP tasks, as was first demonstrated for neural word vectors by Turian et al. (2010).

image-20230312143325485

For example, in named entity recognition (finding names of people, organizations, and locations), word vectors are very helpful.

2.6. Word meaning and ambiguity

2.6.1. Polysemy and clustering

Section 1.2.2 introduced sense and meaning: a sense is a discrete representation of one aspect of a word's meaning.

Words are **ambiguous**: the same word can be used to mean different things. For example, the word "mouse" has (at least) two meanings: (1) a small rodent, (2) a device for manually controlling a cursor; the word "bank" means: (1) a financial institution , (2) A sloping embankment.

Most words are polysemous:

  • especially common words
  • especially long standing words

So, can a vector contain all these meanings?

The answer is yes, word representations can be improved by global context and multiple word prototypes (Huang et al. 2012).

The main idea: cluster the word windows around each word, assign each occurrence of the word to one of several clusters, and retrain with the word split into bank1, bank2, etc., see Figure 2.21.

That is, cluster all contexts of a common word into a few clear clusters, thereby splitting that word into multiple pseudo-words such as bank1, bank2, bank3.

Additional note: this is fairly crude; the boundaries between senses are not always clear and the clusters may even overlap.

image-20230312144055100

2.6.2. Polysemy and linear algebra

Word embeddings are ubiquitous in NLP and information retrieval, but it is unclear what a single vector represents when the word is polysemous. Arora, ..., Ma, ... (TACL 2018) show that the multiple senses of a word reside in a linear superposition within standard word embeddings, and that simple sparse coding can approximately recover vectors that capture the individual senses. [A novelty of the technique is that each extracted sense is accompanied by one of about 2,000 "discourse atoms" that give a succinct description of the other words that co-occur with that sense. Discourse atoms can be of independent interest, which makes the method potentially more useful.]

The linear algebra of word meaning is briefly described as follows:

The different senses of a word w exist as a linear superposition (weighted sum) in standard word embeddings like word2vec, i.e.:

image-20230312150420816

Surprising result:

  • Just a weighted average is already good enough
  • Thanks to concepts derived from sparse coding, you can actually separate out meanings (provided they are relatively common)

Supplementary explanation: intuitively, because words live in a high-dimensional vector space, different directions (dimensions) can carry different senses, so the weighted average does not destroy the information that the individual senses store along their respective directions.

Figure 2.22 shows an example. Each word (e.g., tie) is represented using up to five discourse atoms, which typically capture its different senses (with some noise/error). The content of each atom can be discerned by looking at the words with the highest cosine similarity to it. The algorithm often goes wrong in the last one or two atoms, as happens here. Similar results are obtained with other word embeddings such as word2vec and GloVe.

image-20230312150822572

2.7. Discussion and representation of classification

For classification problems we have a training dataset consisting of samples:

$$
\{x_i, y_i\}_{i=1}^{N}
$$

  • x_i is the input, e.g., words (indices or vectors), sentences, documents, etc. (dimension d)

  • y_i is the label (one of C classes) we are trying to predict, for example:

    • classes: sentiment, named entities, buy/sell decisions
    • other words
    • multi-word sequences (covered later)

    Based on this dataset, a classifier can be trained

2.7.1. Intuitive interpretation of classification

The intuition of classification: given training data $\{x_i, y_i\}_{i=1}^{N}$, learn a decision boundary.

Simple case: Visualization by Andrej Karpathy using ConvNetJS with fixed 2D word vector input for classification, softmax/logistic regression and linear decision boundary.

(Refer to the website http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)

Traditional machine learning/statistics (ML/Stats) approach: assume the inputs x_i are fixed, and train (i.e., fit) softmax/logistic regression weights W ∈ ℝ^{C×d} to determine the decision boundary (hyperplane), as shown in Figure 2.23. Specifically, for each fixed x, predict:

image-20230312160342310

image-20230312160441173

2.7.2. Softmax Normalization Function: Linear Classifier

image-20230312160856093

2.7.3. Cross entropy loss function

The cross-entropy loss is the most common loss for a softmax classifier; for a one-hot target it reduces to the negative log probability of the correct class.

image-20230312161146199
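A minimal numpy sketch of the softmax classifier and its cross-entropy loss (toy shapes; W and b are the classifier parameters, names of my own choosing):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(W, b, x, y):
    """Negative log probability of the correct class y for input x.

    W : (C, d) weight matrix, b : (C,) bias, x : (d,) input, y : int class label
    """
    p = softmax(W @ x + b)
    return -np.log(p[y])

# toy example: 3 classes, 4-dimensional input
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
x, y = rng.normal(size=4), 2
print(cross_entropy_loss(W, b, x, y))
```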

2.7.4. Traditional machine learning optimization

image-20230312161258332

2.8. Neural Network Classifier

Softmax/logistic regression alone only gives a linear decision boundary, which is not very powerful by itself; its expressiveness can be quite limited, especially when the problem is complex. For example, as shown in Figure 2.24(a), the softmax classifier draws a straight boundary and incorrectly classifies the two points indicated by the arrows as negative. With a nonlinear classifier (e.g., a network using the sigmoid activation introduced below), these points can be correctly classified as positive, see Figure 2.24(b).

image-20230312161519673

As Figure 2.24 shows, the true boundary in the original space is nonlinear. The neural network wins because it can learn more complex functions with nonlinear decision boundaries that classify correctly!

  • Linear classifier Softmax (≈ logistic regression) alone is not very powerful

  • As shown in the figure above, Softmax gets a linear decision boundary

    • For complex problems, its expressive power is limited
    • Some points are misclassified and require stronger nonlinear expressive power to separate -> sigmoid
  • Neural networks can learn more complex functions and nonlinear decision boundaries

  • Tip: more advanced classification requires

    • word vectors
    • deeper neural networks

2.8.1. Classification of word vectors

  • Generally in NLP deep learning:

    • We learn both the regular parameter matrix W and the word vectors x.
    • That is, we learn both the traditional parameters and the representations.
    • The word vectors re-represent the one-hot vectors, moving them around in an intermediate vector space so that a (linear) softmax classifier can classify better; conceptually this happens through an embedding layer: x = Le, where L is the embedding matrix and e a one-hot vector (i.e., an embedding lookup), see Equation 2.26.
    • In other words, the word-vector lookup can be viewed as one layer of the neural network: the one-hot vector of a word goes in, its word vector comes out, and this layer is also updated during training.

image-20230312162256621

Now let's count the parameters that must be updated when training the model weights W and the word vectors x simultaneously. A simple linear decision model takes a d-dimensional word vector as input and produces a distribution over C classes, so updating the model weights requires C·d parameters. If we also update the word vector of every word in the vocabulary V, then |V| word vectors of dimension d must be updated. So for a simple linear classification model, the total number of parameters is C·d + |V|·d.

2.8.2. Neural Units

The physiological structure of a neural unit includes (see Figure 2.25(a)):

dendrites: neuronal fibers that receive incoming information from other neurons;

Cell body: accepts external information, summarizes various information, and performs threshold processing to generate nerve impulses;

Axon: the fiber that carries the neuron's output to the dendrites and cell bodies of other neurons, completing the transmission of information between neurons.

Synapse: transmits information in one direction, with variable strength, which enables learning.

image-20230312162827589

image-20230312162955299

2.8.3. Neural Networks

A neural network: multiple logistic regressions combined

If you run multiple logistic regression units simultaneously, you can get a neural network.

If we feed an input vector through a set of logistic regression functions, we obtain an output vector, but we do not need to decide in advance what these logistic regressions should predict, as in Figure 2.26(a); we can feed that output into yet another logistic regression layer, and the loss function will decide what the intermediate hidden variables should be so as to better predict the targets of the next layer, as in Figure 2.26(b); continuing this way, we get a multi-layer neural network, Figure 2.26(c).

image-20230312163200847

There are many representation methods for multi-layer neural networks, see Figure 2.27:

image-20230312163339198

2.8.4. Sigmoid activation function: nonlinear classifier1

Why do we need a nonlinear activation function f such as the sigmoid? Because with nonlinearities a network can approximate arbitrarily complex functions. For example:

Without a nonlinearity, a deep neural network can only perform linear transformations, and stacking extra layers just composes into a single linear transformation: W1W2x = Wx;

With nonlinearities and more layers, much more complex functions can be approximated, see Figure 2.28 (a small numerical sketch of this point follows the list below).

image-20230312163604455

  • For example: function approximation such as regression or classification

    • Without nonlinearity , deep neural networks can only do linear transformations
    • Multiple linear transformations still form a linear transformation W1W2x=Wx
  • Because linear transformation rotates and stretches space in a certain way , multiple rotations and stretches can be combined into one linear transformation [All in one]

  • For nonlinear functions, using more layers, they can approximate more complex functions
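A minimal numerical sketch of the point above (toy sizes of my own choosing): two stacked linear layers are exactly one linear layer, while inserting a sigmoid breaks that equivalence:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(3, 4))
x  = rng.normal(size=5)

# two linear layers collapse into a single linear map W = W2 @ W1
W = W2 @ W1
print(np.allclose(W2 @ (W1 @ x), W @ x))          # True: no extra expressive power

# with a nonlinearity in between, the composition is no longer linear
print(np.allclose(W2 @ sigmoid(W1 @ x), W @ x))   # False in general
```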

1 Compiler's note: the difference between the softmax function and the sigmoid function:

Softmax is a normalization function that can be used as a linear classifier;

Sigmoid is an activation function that can be used as a nonlinear classifier.

Commonly used activation functions: sigmoid, linear, ReLU.


Origin blog.csdn.net/mwcxz/article/details/129477695