What are Embeddings in Machine Learning

Embeddings have permeated the data scientist's toolkit and dramatically changed how NLP, computer vision, and recommender systems work. Yet many data scientists find them arcane and confusing, and even more use them blindly without understanding what they are. In this article, we'll take a deep dive into what embeddings are, how they work, and how they are typically operationalized in real-world systems.

What are Embeddings?

To understand embeddings, we must first understand the basic requirements of machine learning models. Specifically, most machine learning algorithms can only take low-dimensional numerical data as input.

In a neural network, every input feature must be a number. This means that in domains such as recommender systems, we must convert non-numeric variables (such as items and users) into numbers and vectors. We could try to represent items with their product IDs; however, a neural network treats numeric inputs as continuous variables, meaning that higher numbers are "greater than" lower numbers and that nearby numbers represent similar items. This makes perfect sense for a field like "age," but is meaningless when the numbers represent categorical variables. Before embeddings, one of the most common approaches was one-hot encoding.


One-Hot Encoding

One-hot encoding is a common way to represent categorical variables. This unsupervised technique maps each category to a vector and produces a binary representation. The process is simple: we create a vector whose size equals the number of categories, set all of its values to 0, and then set the position(s) corresponding to the given category or categories to 1.
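To make this concrete, here is a tiny sketch of one-hot encoding (NumPy is used for convenience, and the category list is made up for illustration):

```python
import numpy as np

# A made-up list of categories; in practice this would be your item catalogue.
categories = ["hot dog", "hamburger", "pepsi"]

def one_hot(category):
    # A vector of zeros with a single 1 at the category's index.
    vec = np.zeros(len(categories))
    vec[categories.index(category)] = 1.0
    return vec

print(one_hot("hamburger"))  # [0. 1. 0.]
```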

Technically, this turns a category into a set of continuous variables, but what we actually end up with is a huge vector of 0s with one or a few 1s in it. This simplicity comes with drawbacks. For variables with many unique categories, it creates unmanageably high dimensionality. And because every item is technically equidistant in the vector space, it ignores any context around similarity: categories that are closely related end up no closer to each other than categories that have nothing in common.

This means that "hot dog" and "hamburger" are no closer together than "hot dog" and "Pepsi," so we cannot evaluate the relationship between two entities. We could create more one-to-one mappings, or try to group items and look for similarities by hand, but this requires a great deal of manual labeling and is often not feasible.

Intuitively, we would like a denser representation of categories that preserves some of the implicit relationship information between items. We need a way to reduce the dimensionality of our categorical variables so that similar items end up close together. This is exactly what embeddings do.

Embeddings Solve the Encoding Problem

Embeddings are dense numerical representations of real-world objects and relationships, represented as vectors. Vector spaces quantify semantic similarity between categories. Embedding vectors that are close to each other are considered similar. Sometimes they're used directly in the "Similar to this" section of an eCommerce store. Other times, embeddings are passed to other models. In these cases, the model can share learnings from similar items, rather than treating them as two completely distinct categories, as is the case with one-hot encoding. Thus, embeddings can be used to accurately represent sparse data such as clickstreams, text, and e-commerce purchases as features for downstream models. On the other hand, embeddings are much more computationally expensive and less interpretable than one-hot encodings.

How are embeddings created?

Common approaches to creating embeddings start by setting up a supervised machine learning problem; training the model encodes the categories into embedding vectors as a side effect. For example, we could build a model that predicts the next movie a user will watch based on the one they are watching right now. The model maps its input movie to a vector and uses that vector to predict the next movie, so movies that are frequently watched after similar movies end up with similar vectors. This works remarkably well for personalization. So even though we are solving a supervised problem (often called the proxy problem), the actual creation of the embeddings is an unsupervised by-product.
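Below is a minimal sketch of this idea, assuming PyTorch; the catalogue size, the (current movie, next movie) pairs, and the training loop are all made up for illustration. The proxy task is "predict the next movie," and the learned embedding table is the by-product we keep.

```python
import torch
import torch.nn as nn

NUM_MOVIES = 1000  # hypothetical catalogue size
EMBED_DIM = 32     # size of each embedding vector

class NextMovieModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(NUM_MOVIES, EMBED_DIM)
        self.classifier = nn.Linear(EMBED_DIM, NUM_MOVIES)

    def forward(self, current_movie_ids):
        vec = self.embedding(current_movie_ids)   # look up the current movie's vector
        return self.classifier(vec)               # predict a distribution over the next movie

model = NextMovieModel()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake (current movie, next movie) pairs standing in for real watch history.
current = torch.randint(0, NUM_MOVIES, (256,))
next_movie = torch.randint(0, NUM_MOVIES, (256,))

for _ in range(10):  # a few training steps
    loss = loss_fn(model(current), next_movie)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The by-product we actually wanted: one embedding vector per movie.
movie_embeddings = model.embedding.weight.detach()
print(movie_embeddings.shape)  # torch.Size([1000, 32])
```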

Defining the proxy problem is an art, and it greatly affects the behavior of the embeddings. For example, YouTube's recommendation team found that using "predict the next video a user will click on" as the proxy problem caused clickbait to be widely recommended. Switching to "predict the next video and how long the user will watch it" produced far better results.

Common Embedding Models

Principal Component Analysis (PCA)

One method of generating embeddings is Principal Component Analysis (PCA). PCA reduces the dimensionality of an entity by compressing its variables into a smaller subset. This lets models run more efficiently, but makes the variables harder to interpret and generally leads to some loss of information. A closely related and commonly used technique is singular value decomposition (SVD).
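As a quick illustration of the idea (scikit-learn is assumed, and the feature matrix is random data standing in for real item features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 100))  # 500 items, 100 raw features (made up)

pca = PCA(n_components=8)               # keep an 8-dimensional embedding
item_embeddings = pca.fit_transform(features)

print(item_embeddings.shape)                 # (500, 8)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```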

Singular Value Decomposition (SVD)

Singular value decomposition, or SVD, is a dimensionality reduction technique that reduces a dataset's features from N dimensions to K via matrix factorization. For example, we can represent users' video ratings as a matrix of size (number of users) × (number of items), where each cell holds the rating a user gave an item. We first choose a number k, our embedding vector size, and use SVD to split the matrix into two: one of size (number of users) × k and one of size k × (number of items).

In the resulting matrices, multiplying a user vector by an item vector gives a predicted rating, and multiplying the two matrices together reconstructs the original matrix, now densely populated with predicted ratings. It follows that two items with similar vectors will receive similar ratings from the same user. In this way, we end up with both user and item embeddings.
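Here is a small sketch of that factorization using plain NumPy on a tiny, made-up ratings matrix; a production system would use a sparse, implicit-feedback-aware variant, but the mechanics are the same.

```python
import numpy as np

ratings = np.array([   # rows = users, columns = items (made-up ratings)
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

k = 2  # embedding size
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
user_embeddings = U[:, :k] * np.sqrt(s[:k])        # (num_users, k)
item_embeddings = Vt[:k, :].T * np.sqrt(s[:k])     # (num_items, k)

# A predicted rating is just the dot product of a user vector and an item vector.
print(user_embeddings[0] @ item_embeddings[2])

# Multiplying the two matrices reconstructs a densely filled ratings matrix.
print(user_embeddings @ item_embeddings.T)
```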


Word2Vec

Word2vec generates embeddings for words. Words are encoded as one-hot vectors and fed into a hidden layer that produces hidden weights, which are then used to predict nearby words. Although these hidden weights make the predictions during training, word2vec never uses the trained model for that task. Instead, the hidden weights are returned as the embeddings, and the rest of the model is discarded.

Words that appear in similar contexts end up with similar embeddings. Beyond that, embeddings can be used to form analogies: for example, the vector from king to man is very similar to the vector from queen to woman.
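A toy sketch of this, assuming the gensim library; the corpus below is far too small to produce meaningful vectors, so the analogy only holds approximately and on real corpora.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "to", "work"],
    ["the", "woman", "walks", "to", "work"],
]

model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["king"].shape)  # (16,) -- one embedding vector per word
# king - man + woman should land near queen (only roughly, and only at scale).
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```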

One problem with word2vec is that each word gets a single vector, so all the semantic uses of a word are collapsed into one representation. For example, the word "play" in "I'm going to watch a play" and in "I want to play" has the same embedding, even though the contexts are clearly different.


BERT

Bidirectional Encoder Representations from Transformers, better known as BERT, is a pre-trained transformer model that addresses word2vec's context problem. BERT is trained in two steps. First, it is pre-trained on huge corpora such as Wikipedia to produce embeddings in a fashion similar to word2vec. The end user performs the second step, fine-tuning the model on a dataset suited to their own context, such as medical literature, so that BERT is adapted to the specific use case. Crucially, when creating word embeddings, BERT takes the surrounding context into account, so the word "play" in "I'm going to watch a play" and in "I want to play" correctly gets different embeddings. BERT has become the go-to transformer model for generating text embeddings.
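Here is a small sketch, assuming the Hugging Face transformers library, of pulling contextual embeddings out of a pre-trained BERT model; note how the same word "play" gets a different vector in each sentence.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]                  # vector for that occurrence of `word`

a = embedding_of("i am going to watch a play", "play")
b = embedding_of("i want to play", "play")
print(torch.cosine_similarity(a, b, dim=0))  # related, but not identical
```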

Embeddings in the Real World

Embeddings started out in research labs and quickly became state of the art. Since then, they have appeared in production machine learning systems across a variety of domains, including NLP, recommender systems, and computer vision.

Recommender Systems

Recommender systems predict user preferences and ratings for various entities or products. The two most common approaches are collaborative filtering and content-based filtering. Collaborative filtering uses user actions to train models and form recommendations, and nearly all modern collaborative filtering systems rely on embeddings. For example, we can use the SVD method described above to build a recommender in which multiplying a user embedding by an item embedding yields a rating prediction. This gives a clear relationship between users and products: similar users give similar items similar ratings. The embeddings can also be used in downstream models; YouTube's recommender, for instance, uses them as input to a neural network that predicts watch time.
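A minimal sketch of this "embeddings as downstream features" pattern, loosely in the spirit of YouTube's watch-time model; PyTorch is assumed, and the shapes, random vectors, and two-layer network are illustrative rather than the actual architecture.

```python
import torch
import torch.nn as nn

EMBED_DIM = 32
user_vec = torch.randn(EMBED_DIM)            # a pre-trained user embedding (random stand-in)
recent_videos = torch.randn(10, EMBED_DIM)   # embeddings of the last 10 watched videos

# Concatenate the user embedding with the average of the recent video embeddings.
features = torch.cat([user_vec, recent_videos.mean(dim=0)])  # shape (64,)

watch_time_model = nn.Sequential(
    nn.Linear(2 * EMBED_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),  # predicted watch time
)
print(watch_time_model(features))
```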


Semantic Search

Users expect search bars to be smarter than a regular expression. Whether it's a customer support page, a blog, or Google, a search bar should understand the intent and context of a query, not just match words. Search engines used to be built around TF-IDF, which also creates an embedding from text; this kind of semantic search works by using nearest neighbors to find the document embedding closest to the query embedding.

Today, semantic search leverages more complex embeddings (e.g. BERT) and may use them in downstream models.
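The core lookup step can be sketched in a few lines; the document and query vectors below are random stand-ins for embeddings a real text model (such as BERT) would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 128))  # one vector per document (stand-ins)
query_embedding = rng.normal(size=128)         # vector for the user's query (stand-in)

# Cosine similarity between the query and every document.
doc_norms = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query_norm = query_embedding / np.linalg.norm(query_embedding)
scores = doc_norms @ query_norm

print(np.argsort(-scores)[:5])  # indices of the 5 closest documents
```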

Computer Vision

In computer vision, embeddings are often used as a way to translate between different contexts. For example, when training a self-driving car, we can convert an image from the car into an embedding and then decide what to do based on that embedded context. Doing so also enables transfer learning: we can take synthetic images from a game like Grand Theft Auto, embed them in the same vector space, and train a driving model without feeding it large numbers of expensive real-world images. Tesla is practicing this today.

Another interesting example is the AI art machine: https://colab.research.google.com/drive/1n_xrgKDlGQcCF6O-eL3NOd_x4NSqAUjK#scrollTo=TnMw4FrN6JeB. It generates an image based on text entered by the user; for example, entering "Nostalgia" produces a matching image.

It works by converting the user's text and images into embeddings in the same latent space. It is made up of four transformers: Image -> Embedding, Text -> Embedding, Embedding -> Text, and Image -> Text. With these transformations, we can convert text to images and vice versa, using embeddings as the intermediate representation.


Embedding Operations

In the examples above, we saw some common operations applied to embeddings. Any production system that uses embeddings should be able to implement some or all of the following.

Averaging

Using something like word2vec, we end up with an embedding for each word, but we often need an embedding for a whole sentence. Similarly, in a recommender system we may know which items a user clicked on recently even though the user's embedding hasn't been retrained for days. In these cases, we can average embeddings to create higher-level embeddings: averaging the word embeddings gives a sentence embedding, and averaging the embeddings of the last N items a user clicked gives an up-to-date user embedding.
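A quick sketch of averaging, with random vectors standing in for real word2vec outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
word_embeddings = {  # pretend these came from a trained word2vec model
    "i": rng.normal(size=16),
    "like": rng.normal(size=16),
    "hot": rng.normal(size=16),
    "dogs": rng.normal(size=16),
}

sentence = ["i", "like", "hot", "dogs"]
# Averaging the word vectors gives a rough sentence embedding; averaging a user's
# recently clicked item vectors gives an up-to-date user embedding the same way.
sentence_embedding = np.mean([word_embeddings[w] for w in sentence], axis=0)
print(sentence_embedding.shape)  # (16,)
```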


Subtraction/Addition

We mentioned earlier how word embeddings encode analogies through vector differences, and vector addition and subtraction can be used for a variety of tasks. For example, we can find the average difference between coats from cheap brands and coats from luxury brands, store that delta, and apply it whenever we want to recommend a luxury item similar to the one the user is currently viewing. We can find the difference between Coke and Diet Coke and apply it to other drinks, even ones with no diet version, to find the closest thing to one.
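A sketch of the delta trick, with random vectors standing in for embeddings a real model would learn:

```python
import numpy as np

rng = np.random.default_rng(0)
cheap_coats = rng.normal(size=(20, 32))    # embeddings of budget-brand coats (stand-ins)
luxury_coats = rng.normal(size=(20, 32))   # embeddings of luxury-brand coats (stand-ins)

# Average "budget -> luxury" direction in the embedding space.
luxury_delta = luxury_coats.mean(axis=0) - cheap_coats.mean(axis=0)

current_item = rng.normal(size=32)           # the item the user is viewing (stand-in)
luxury_target = current_item + luxury_delta  # run nearest neighbors on this vector
print(luxury_target[:5])
```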


Nearest Neighbors

Nearest neighbor (NN) lookup is often the most useful embedding operation: it finds the things most similar to a given embedding. In a recommender system, we can create a user embedding and find the items most relevant to that user. In a search engine, we can find the documents most similar to a search query. Nearest neighbor is a computationally expensive operation, however: done naively it is O(N*K), where N is the number of items and K is the size of each embedding. In most cases an approximation is good enough; if we recommend five items to a user and one of them is technically only the sixth closest, the user probably won't care. Approximate nearest neighbor (ANN) algorithms typically bring lookup complexity down to O(log(N)).
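As one possible sketch, here is an approximate index built with the Annoy library (assumed here purely as an example; Faiss, ScaNN, and HNSW-based indexes are common alternatives):

```python
import numpy as np
from annoy import AnnoyIndex

EMBED_DIM = 64
rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(10_000, EMBED_DIM))  # random stand-in vectors

index = AnnoyIndex(EMBED_DIM, "angular")  # angular distance ~ cosine similarity
for i, vec in enumerate(item_embeddings):
    index.add_item(i, vec.tolist())
index.build(10)  # 10 trees; more trees trade build time for better recall

query = item_embeddings[0].tolist()
print(index.get_nns_by_vector(query, 5))  # 5 approximate nearest items
```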

Conclusion

Embeddings are an essential part of the data science toolkit and continue to gain in popularity. They have enabled teams to push past the state of the art in multiple disciplines, from NLP to recommender systems. As they become more popular, more attention will be paid to how they are operationalized in real-world systems. We think the embedding store will be a critical part of machine learning infrastructure, which is why we open-sourced ours. Give it a try, and if this sounds like a great full-time project, we're hiring!
