Summary of 8 time series classification methods

Classifying time series is a common task for machine and deep learning models. This post covers 8 families of time series classification methods, ranging from simple distance- and interval-based methods to deep neural networks, and is intended as a reference for time series classification algorithms.

Time series definition

Before covering the various types of time series (TS) classification methods, we first unify the concept of a time series. A TS can be either univariate or multivariate:

  • A univariate TS is an ordered set of (usually) real values.
  • A multivariate TS is a set of univariate TSs. Each timestamp is a vector or array of real values.

Datasets of uni- or multivariate TSs usually contain an ordered set of uni- or multivariate TSs. Furthermore, each TS in such a dataset is usually paired with a class label, often one-hot encoded as a vector whose length equals the number of classes.

The goal of TS classification is to train a classification model on such a dataset so that it learns a probability distribution over the classes, i.e., the model learns to assign the correct class label to a given TS.

Distance-based methods

Distance-based or nearest-neighbor TS classification methods use distance metrics to classify the given data. It is a supervised learning technique in which the prediction for a new TS depends on the labels of the known time series it is most similar to.

A distance metric is a deterministic function that quantifies the distance between two (or more) time series. Typical distance metrics are:

  • p-norm (such as Manhattan distance, Euclidean distance, etc.)
  • Dynamic Time Warping (DTW)

After deciding on the metric, a k-nearest neighbor (KNN) algorithm is usually applied: it measures the distance between a new TS and all TSs in the training dataset, selects the k closest ones, and finally assigns the new TS to the class to which the majority of those k nearest neighbors belong.

While the most popular measures are certainly the p-norms, especially the Euclidean distance, they have two major drawbacks that make them less suitable for the TS classification task. A norm is only defined for two TSs of the same length, but in practice sequences of equal length are not always available. Moreover, a norm compares the two TS values at each time point independently, while most TS values are correlated with each other across time.

DTW, on the other hand, addresses both limitations of the p-norm. Classical DTW aligns the points of two time series so that points with different timestamps can be matched, minimizing the total distance between them. This means that slightly shifted or distorted TSs are still considered similar. The figure below visualizes the difference between p-norm-based measures and how DTW works.

Combined with KNN, DTW is used as a benchmark algorithm in various benchmark evaluations of TS classification.
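To make the procedure concrete, below is a minimal sketch of DTW combined with a k-nearest-neighbor vote, written in plain NumPy. The function and variable names are illustrative and not taken from any particular library.

```python
# Minimal DTW + k-NN sketch (illustrative, not a library implementation).
import numpy as np

def dtw_distance(a, b):
    """Classical DTW distance between two 1-D series (O(len(a) * len(b)))."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # best predecessor: insertion, deletion, or match
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def knn_dtw_predict(X_train, y_train, x_new, k=1):
    """Assign x_new to the majority class among its k DTW-nearest neighbors."""
    dists = np.array([dtw_distance(x_new, x) for x in X_train])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.array(y_train)[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: note that the series may have different lengths.
X_train = [np.sin(np.linspace(0, 6, 50)), np.sin(np.linspace(0.5, 6.5, 50)),
           np.zeros(50), np.zeros(55)]
y_train = ["sine", "sine", "flat", "flat"]
# the two sine neighbors outvote the flat one -> "sine"
print(knn_dtw_predict(X_train, y_train, np.sin(np.linspace(0.2, 6.2, 52)), k=3))
```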

The nearest-neighbor idea can also be embedded in decision trees. For example, the Proximity Forest algorithm builds a forest of trees that use a distance metric to partition the TS data.

Interval- and frequency-based methods

Interval-based methods usually segment the TS into multiple distinct intervals. Each subsequence is then used to train a separate machine learning (ML) classifier, resulting in an ensemble of classifiers, each acting on its own interval. Taking the most common class among the individually classified subsequences returns the final label for the entire time series.

Time Series Forest

The most famous representative of interval-based models is the Time Series Forest (TSF). TSF is an ensemble of decision trees built on random subsequences of the initial TS, where each tree is responsible for assigning a class based on its interval.

This is done by computing summary features (usually the mean, standard deviation, and slope) of each interval to create a feature vector. Decision trees are then trained on the computed features, and predictions are obtained by a majority vote of all trees. The voting process is necessary because each tree evaluates only a certain subsequence of the initial TS.
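As a rough illustration of this interval-feature idea (not the sktime implementation; names and parameters are made up for the sketch, and a 2-D array of equal-length series is assumed):

```python
# Time Series Forest style sketch: one decision tree per random interval.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def interval_features(X, start, end):
    """Mean, standard deviation, and slope of every series on [start, end)."""
    seg = X[:, start:end]
    slope = np.polyfit(np.arange(end - start), seg.T, 1)[0]   # per-series linear slope
    return np.column_stack([seg.mean(axis=1), seg.std(axis=1), slope])

def fit_tsf(X, y, n_trees=25, min_len=5):
    """Train one decision tree per randomly chosen interval."""
    trees = []
    for _ in range(n_trees):
        start = int(rng.integers(0, X.shape[1] - min_len))
        end = int(rng.integers(start + min_len, X.shape[1] + 1))
        clf = DecisionTreeClassifier().fit(interval_features(X, start, end), y)
        trees.append((start, end, clf))
    return trees

def predict_tsf(trees, X):
    """Majority vote over the per-interval trees."""
    votes = np.stack([clf.predict(interval_features(X, s, e)) for s, e, clf in trees])
    preds = []
    for col in votes.T:                          # one column of votes per series
        labels, counts = np.unique(col, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```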

In addition to TSF, there are other interval-based models. Variants of TSF use additional features such as the median, interquartile range, minimum, and maximum of the subsequence. A rather more complex algorithm compared with the classic TSF is the Random Interval Spectral Ensemble (RISE).

RISE

The RISE algorithm differs from the classic TS forest in two respects.

  • Use a single TS interval per tree
  • It is trained on spectral features extracted from the TS (instead of summary statistics such as mean and slope)

In the RISE technique, each decision tree is built on a different set of Fourier, autocorrelation, autoregressive, and partial autocorrelation features. The algorithm works as follows:

For each base classifier, a random interval of the TS is chosen and the above features are computed on it. A new training set is then created from the extracted features, and a decision tree classifier is trained on it. Finally, these steps are repeated with different configurations to create the ensemble model, a random-forest-like collection of individual decision tree classifiers.
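A hedged sketch of the spectral features computed per interval might look as follows (plain NumPy; the helper name and the exact feature choices are illustrative, not sktime's RISE implementation):

```python
# RISE-style spectral features for one interval (illustrative sketch).
import numpy as np

def spectral_features(X, start, end, n_coeffs=8, n_lags=8):
    """Fourier magnitudes and autocorrelations of each series on [start, end)."""
    seg = X[:, start:end]
    seg = seg - seg.mean(axis=1, keepdims=True)
    fourier = np.abs(np.fft.rfft(seg, axis=1))[:, :n_coeffs]   # first spectral magnitudes
    acf = np.stack([                                           # autocorrelation per lag
        (seg[:, :-lag] * seg[:, lag:]).mean(axis=1)
        for lag in range(1, n_lags + 1)
    ], axis=1)
    return np.hstack([fourier, acf])

# Each tree of the ensemble would be trained on spectral_features() of its own
# random interval, and the forest's predictions combined by majority vote.
```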

Dictionary-based approaches

Dictionary-based algorithms are another class of TS classifiers, based on the idea of building a dictionary of symbolic words. They cover a large number of different classifiers and can sometimes be combined with the classifiers above.

Here is a list of dictionary-based methods covered:

  • Bag-of-Patterns (BOP)
  • Symbolic Fourier Approximation (SFA)
  • Individual BOSS
  • BOSS Ensemble
  • BOSS in Vector Space
  • Contractable BOSS
  • Randomized BOSS
  • WEASEL

Such methods usually first convert the TS into a sequence of symbols, from which "words" are extracted through a sliding window. The final classification is then based on the distribution of these words, usually obtained by counting and sorting them. The theory behind this approach is that time series containing similar words are similar, i.e., they likely belong to the same class. The main process is generally the same for all dictionary-based classifiers:

  • Run a sliding window of a specific length over the TS
  • Convert each subsequence into a "word" (with a certain length and a fixed set of letters)
  • Build a histogram of these words

Below is a list of the most popular dictionary-based classifiers:

Bag-of-Patterns Algorithm

The Bag-of-Patterns (BOP) algorithm works similarly to the bag-of-words approach used for text classification: it counts the number of times each word occurs.

The most common technique for creating words from numbers (here, the raw TS) is called Symbolic Aggregate Approximation (SAX). The TS is first divided into blocks, and each block is then z-normalized, which means it has a mean of 0 and a standard deviation of 1.

Usually a block contains more real values than the desired word length, so binning is further applied within each block: the mean value of each bin is computed and mapped to a letter. For example, the letter "a" is assigned to all mean values below -1, "b" to all values between -1 and 1, and "c" to all values above 1. The figure below visualizes this process.

Here each block contains 30 values, which are grouped into 5 bins of 6 values each; each bin is mapped to one of three possible letters, forming a five-letter word. Finally, the number of occurrences of each word is counted, and the resulting histograms are used for classification, e.g., by plugging them into a nearest-neighbor algorithm.
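A minimal SAX-style sketch matching this example (the names and thresholds are illustrative, not a library implementation) could look like this:

```python
# SAX-style discretization of one 30-value block into a five-letter word.
import numpy as np

def sax_word(block, n_segments=5, alphabet=("a", "b", "c"), edges=(-1.0, 1.0)):
    x = (block - block.mean()) / block.std()        # z-normalize the block
    segments = np.array_split(x, n_segments)        # 5 groups of 6 values each
    means = np.array([s.mean() for s in segments])  # piecewise aggregate approximation
    letters = np.digitize(means, edges)             # <-1 -> 'a', [-1, 1) -> 'b', >=1 -> 'c'
    return "".join(alphabet[i] for i in letters)

word = sax_word(np.sin(np.linspace(0, 3, 30)))
print(word)   # prints a five-letter word over the alphabet {'a', 'b', 'c'}
```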

Symbolic Fourier Approximation

In contrast to the BOP algorithm above, where the original TS values are discretized into letters and then words, a similar procedure can be applied to the Fourier coefficients of the TS.

The most famous algorithm is Symbolic Fourier Approximation (SFA), which can be divided into two parts.

First, compute the discrete Fourier transform of the TS and keep only a subset of the computed coefficients:

  • Supervised: Univariate feature selection is used to select higher-ranked coefficients based on statistics such as F-statistics or χ2-statistics
  • Unsupervised: usually take a subset of the first coefficients, representing the trend of the TS

Second, each column of the resulting coefficient matrix is discretized independently, converting the TS (or its subsequences) into individual words:

  • Supervised: bin edges are computed such that an impurity criterion (e.g., the entropy of the instances) is minimized
  • Unsupervised: bin edges are computed either from the extrema of the Fourier coefficients (uniform bins) or from their quantiles (the same number of coefficients in each bin)

Based on the above preprocessing, various algorithms can be used to further process the information to obtain a prediction of TS.
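As a rough, unsupervised illustration of the SFA idea (keep the first Fourier coefficients, then discretize each coefficient column by its quantiles; this is a simplified sketch, not the pyts or sktime implementation):

```python
# Unsupervised SFA-style word extraction (illustrative sketch).
import numpy as np

def sfa_words(X, n_coeffs=4, n_bins=4, alphabet="abcd"):
    coeffs = np.fft.rfft(X, axis=1)
    # real and imaginary parts of the first coefficients as the feature matrix
    feats = np.hstack([coeffs[:, :n_coeffs].real, coeffs[:, :n_coeffs].imag])
    words = np.empty(feats.shape, dtype="<U1")
    for j in range(feats.shape[1]):              # discretize each column independently
        edges = np.quantile(feats[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        words[:, j] = [alphabet[i] for i in np.digitize(feats[:, j], edges)]
    return ["".join(row) for row in words]       # one word per series
```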

BOSS

The Bag-of-SFA-Symbols (BOSS) algorithm works as follows:

  • Extract the subsequence of TS through the sliding window mechanism
  • Apply the SFA transformation to each fragment, returning an ordered set of words
  • Calculate the frequency of each word, which produces a histogram of TS words
  • Classification is performed by applying algorithms such as KNN combined with a custom BOSS metric (small variation of Euclidean distance).

The BOSS algorithm has many variants:

BOSS Ensemble

The BOSS Ensemble algorithm builds multiple individual BOSS models that vary in their parameters: word length, alphabet size, and window size. Patterns of various lengths are captured by these configurations. A large number of models is obtained by grid searching over the parameters, keeping only the best classifiers.

BOSS in Vector Space

The BOSS in Vector Space (BOSSVS) algorithm is a variant of the individual BOSS method that uses a vector space model: it computes one histogram per class and turns these into a term frequency-inverse document frequency (TF-IDF) matrix. Classification is then performed by finding the class whose TF-IDF vector has the highest cosine similarity with the histogram of the TS itself.

Contractable BOSS

The Contractable BOSS (cBOSS) algorithm is computationally much faster than the classic BOSS method.

Acceleration is achieved by grid searching not the entire parameter space but a randomly selected sample from it. cBOSS uses a subsample of the data for each base classifier. cBOSS improves memory efficiency by only considering a fixed number of best base classifiers rather than all classifiers above a certain performance threshold.

Randomized BOSS

The next variant of the BOSS algorithm is Randomized BOSS (RBOSS). The method adds a stochastic process for selecting the sliding-window length and cleverly aggregates the predictions of the individual BOSS classifiers. Similar to the cBOSS variant, this reduces computation time while still maintaining baseline performance.

WEASEL

The Word Extraction for Time Series Classification (WEASEL) algorithm improves the performance of the standard BOSS method by using sliding windows of different lengths in the SFA transformation. Similar to other BOSS variants, it uses these various window sizes to convert the TS into feature vectors, which are then evaluated by a classifier (logistic regression in the original paper).

WEASEL uses a specific feature-derivation method based on non-overlapping subsequences of each sliding window, and keeps only the most relevant features by filtering them with a χ² test.

WEASEL can be combined with Multivariate Unsupervised Symbols and Derivatives (WEASEL+MUSE) to extract and filter multivariate features from the TS by encoding contextual information into each feature.

Shapelet-based approaches

Shapelet-based methods use the idea of subsequences (i.e., shapelets) of the initial time series. The shapelets are chosen so that they act as class representatives, which means a shapelet contains the main characteristics of a class that can be used to distinguish it from other classes. In the optimal case, they detect local similarities between TSs within the same class.

The figure below gives an example of a shapelet. It's just a subsequence of the whole TS.

Using shapelet-based algorithms involves the problem of determining which shapelets to use. One option is to hand-craft a set of shapelets, but this can be very difficult; shapelets can also be selected automatically using various algorithms.

Algorithm Based on Shapelet Extraction

Shapelet Transform, proposed by Lines et al., is an algorithm based on shapelet extraction and one of the most commonly used today. Given a TS of n real-valued observations, a shapelet is defined as a contiguous subsequence of the TS of length l.

The minimum distance between a shapelet and an entire TS is obtained by computing the Euclidean distance (or any other distance measure) between the shapelet and every subsequence of length l of that TS, and taking the minimum.

The algorithm then selects the k best shapelets whose lengths lie in a certain range. This step can be viewed as a kind of univariate feature extraction, where each feature is the distance between one shapelet and each TS in the given dataset. The shapelets are ranked by a statistic, usually an F-statistic or χ²-statistic, that orders them according to their ability to distinguish the classes.

After completing the above steps, any type of ML algorithm can be applied to classify the new dataset, for example KNN-based classifiers, support vector machines, or random forests.
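For illustration, here is a minimal sketch of the shapelet-distance feature used by the shapelet transform (plain NumPy; the names are made up for the example):

```python
# Shapelet-transform feature: minimum distance between a shapelet and a TS.
import numpy as np

def shapelet_distance(ts, shapelet):
    """Minimum Euclidean distance between `shapelet` and every window of `ts`."""
    l = len(shapelet)
    windows = np.lib.stride_tricks.sliding_window_view(ts, l)   # all length-l subsequences
    return np.min(np.linalg.norm(windows - shapelet, axis=1))

def shapelet_transform(X, shapelets):
    """One feature per (series, shapelet) pair; feed the result to any classifier."""
    return np.array([[shapelet_distance(ts, s) for s in shapelets] for ts in X])
```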

Another problem with finding ideal shapelets is the time complexity of the search, which grows very steeply with both the number and the length of the training samples.

Algorithms Based on Shapelet Learning

Algorithms based on shapelet learning try to address the limitations of algorithms based on shapelet extraction. The idea is to learn a set of shapelets that are able to distinguish classes, rather than extracting them directly from a given dataset.

There are two main advantages to doing this:

  • It can obtain shapelets that are not included in the training set but are strongly discriminative for categories.
  • No need to run the algorithm on the entire dataset, which can significantly reduce training time

But this approach also has some disadvantages, caused by the need for a differentiable loss function and the choice of classifier.

Instead of the Euclidean distance we must rely on a differentiable function, so that the shapelets can be learned by gradient descent (backpropagation). The most common choice relies on the LogSumExp function, which smoothly approximates the maximum by taking the logarithm of the sum of the exponentials of its arguments. Since the LogSumExp function is not strictly convex, the optimization may not converge properly and can end up in poor local minima.
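For reference, the LogSumExp soft maximum of arguments x_1, ..., x_n, together with the standard bound that makes it a smooth approximation of the maximum, is:

```latex
\mathrm{LSE}(x_1, \dots, x_n) = \log \sum_{i=1}^{n} e^{x_i},
\qquad
\max_i x_i \;\le\; \mathrm{LSE}(x_1, \dots, x_n) \;\le\; \max_i x_i + \log n .
```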

And since the optimization process itself is the main component of the algorithm, multiple hyperparameters need to be added for tuning.

But the method is very useful in practice and can generate some new insights into the data.

Kernel-based methods

A slight variation of shapelet-based algorithms are kernel-based algorithms: they learn and apply random convolutional kernels (the workhorse of computer vision models) to extract features from a given TS.

The RandOm Convolutional KErnel Transform (ROCKET) algorithm is specifically designed for this purpose. It uses a large number of kernels that vary in length, weights, bias, dilation, and padding, and that are created randomly from fixed distributions.

After applying the kernels, a classifier is needed that can select the most relevant features to distinguish the classes. The original paper uses ridge regression (an L2-regularized variant of linear regression) to perform predictions. Using it has two benefits: first, its computational efficiency, even for multi-class classification problems, and second, the ease of tuning its single regularization hyperparameter with cross-validation.
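A hedged, minimal ROCKET-style sketch is shown below: far fewer kernels than the real algorithm, no dilation or padding randomization, and illustrative names throughout. Each kernel contributes two pooled features (maximum and proportion of positive values), and a ridge classifier is trained on the result.

```python
# ROCKET-style sketch: random 1-D kernels + pooled features + ridge classifier.
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(0)

def random_kernels(n_kernels=100):
    kernels = []
    for _ in range(n_kernels):
        length = int(rng.choice([7, 9, 11]))
        weights = rng.normal(size=length)
        bias = rng.uniform(-1, 1)
        kernels.append((weights - weights.mean(), bias))   # mean-centered weights
    return kernels

def transform(X, kernels):
    feats = []
    for ts in X:
        row = []
        for weights, bias in kernels:
            conv = np.convolve(ts, weights, mode="valid") + bias
            row.extend([conv.max(), (conv > 0).mean()])    # max pooling and PPV
        feats.append(row)
    return np.array(feats)

# Usage sketch (X_train / X_test: 2-D arrays of equal-length series):
# kernels = random_kernels()
# clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
# clf.fit(transform(X_train, kernels), y_train)
# y_pred = clf.predict(transform(X_test, kernels))
```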

One of the core advantages of kernel-based or ROCKET algorithms is that they are relatively cheap to compute.

Feature-based approaches

Feature-based methods can generally cover most algorithms that use some kind of feature extraction for a given time series, followed by classification algorithms to perform predictions.

The features range from simple statistics to more complex Fourier-based features. A large number of such features can be found in hctsa (https://github.com/benfulcher/hctsa), but trying and comparing every feature can be an impossible task, especially for larger datasets. This is why the canonical time-series characteristics (catch22) feature set was proposed.

catch22 algorithm

This method aims at a small TS feature set that offers strong classification performance while minimizing redundancy. catch22 selects a total of 22 features from the hctsa library (which provides more than 4000 features).

The developers of this method obtained the 22 features by training different models on 93 datasets and evaluating which TS features performed best, resulting in a small subset that still maintains excellent performance. The classifier on top of these features can be chosen freely, which makes it another hyperparameter to tune.
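As a hedged sketch of the overall feature-based pipeline, the snippet below computes a handful of simple summary features per series and trains an ordinary classifier on them. The features shown are simple stand-ins, not the actual catch22 set (see the hctsa and pycatch22 projects for those).

```python
# Generic feature-based pipeline sketch (features are illustrative, NOT catch22).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def simple_features(ts):
    diffs = np.diff(ts)
    return [ts.mean(), ts.std(), ts.min(), ts.max(),
            np.abs(np.fft.rfft(ts))[1:4].sum(),   # coarse low-frequency energy
            (diffs > 0).mean()]                   # fraction of increasing steps

def featurize(X):
    return np.array([simple_features(ts) for ts in X])

# Usage sketch:
# clf = RandomForestClassifier().fit(featurize(X_train), y_train)
# y_pred = clf.predict(featurize(X_test))
```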

Matrix Profile Classifier

Another feature-based approach is the Matrix Profile (MP) classifier, an MP-based TS classifier that provides interpretable results while maintaining baseline performance.

The designers derived the approach from the shapelet-based family: the Matrix Profile is a structure that stores, for each subsequence of a TS, the distance to its nearest-neighbor subsequence. In this way, the MP can efficiently extract features of the TS such as motifs and discords: motifs are subsequences that are very similar to each other, whereas discords are subsequences that are very different from the rest.

In principle, any classification model can be used on top of these features; the developers of this method chose a decision tree classifier.

In addition to these two mentioned methods, sktime also provides some more feature-based TS classifiers.

Model ensembles

Model ensembling is not a stand-alone algorithm per se, but a technique for combining various TS classifiers to create better combined predictions. Model ensembles reduce variance by combining multiple individual models, similar to how random forests use large numbers of decision trees. Using different types of learning algorithms leads to a wider and more diverse set of learned features, which in turn improves class discrimination.

The most popular model ensemble is the Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE). It exists in many similar versions, but what they all have in common is that they combine the information from different classifiers, i.e., their predictions, using a weighted average per classifier.

Sktime provides two different HIVE-COTE algorithms. The first combines the probabilities of its estimators: a shapelet transform classifier (STC), a TS forest, RISE, and cBOSS. The second is defined as a combination of STC, the Diverse Representation Canonical Interval Forest classifier (DrCIF, a variant of TS forest), Arsenal (an ensemble of ROCKET models), and TDE (a variant of the BOSS algorithm).

The final predictions are obtained with the CAWPE algorithm, which assigns each classifier a weight derived from its estimated quality on the training dataset.
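A hedged sketch of this weighted-probability combination idea follows (the exponent and weighting are illustrative; the actual CAWPE scheme differs in detail). The members are assumed to be sklearn-style classifiers with predict_proba, applied to already-extracted feature matrices.

```python
# CAWPE-style weighted probability averaging (illustrative sketch).
import numpy as np
from sklearn.model_selection import cross_val_score

def ensemble_predict(members, X_train, y_train, X_test, alpha=4):
    classes = np.unique(y_train)           # sklearn orders predict_proba columns this way
    combined = np.zeros((len(X_test), len(classes)))
    for clf in members:
        # weight each member by its (cross-validated) training accuracy
        weight = cross_val_score(clf, X_train, y_train, cv=3).mean() ** alpha
        clf.fit(X_train, y_train)
        combined += weight * clf.predict_proba(X_test)
    return classes[np.argmax(combined, axis=1)]
```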

The following figure is a common diagram used to visualize the working structure of the HIVE-COTE algorithm:

Deep Learning-Based Approaches

Regarding deep learning-based algorithms, one could write a long article on each architecture by itself. Here we only mention some commonly used TS classification benchmark models and techniques.

Although deep learning-based algorithms are very popular and widely studied in fields such as computer vision and NLP, they are less common in TS classification. Fawaz et al. conducted an exhaustive study of the current state of the art in their review of deep learning for TS classification: more than 60 neural network (NN) models across six architectures are studied:

  • Multi-Layer Perceptron
  • Fully Convolutional NN (CNN)
  • Echo-State Networks (based on Recurrent NNs)
  • Encoder
  • Multi-Scale Deep CNN
  • Time CNN

Most of the above models were originally developed for different use cases, so they need to be evaluated for the use case at hand.

Also released in 2020 is the InceptionTime network. InceptionTime is an ensemble of five deep learning models, each built from Inception modules first proposed by Szegedy et al. These Inception modules apply multiple filters of different lengths to the TS simultaneously, extracting relevant features and information from both shorter and longer subsequences of the TS. The figure below shows the InceptionTime block.

It consists of multiple Inception modules stacked in a feed-forward manner and connected with residual connections. Finally, global average pooling and a fully connected layer generate the prediction.

The diagram below shows a single inception module in action.
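As a hedged sketch of what such a module looks like in code, here is a simplified PyTorch version with illustrative layer sizes, not the official InceptionTime implementation: a bottleneck 1x1 convolution, parallel convolutions of different kernel sizes, and a max-pooling branch, concatenated along the channel axis.

```python
# Simplified Inception-style module for time series (illustrative sketch).
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_channels, n_filters=32, kernel_sizes=(9, 19, 39)):
        super().__init__()
        self.bottleneck = nn.Conv1d(in_channels, n_filters, kernel_size=1, bias=False)
        self.convs = nn.ModuleList([
            nn.Conv1d(n_filters, n_filters, k, padding="same", bias=False)
            for k in kernel_sizes
        ])
        self.pool_branch = nn.Sequential(
            nn.MaxPool1d(kernel_size=3, stride=1, padding=1),
            nn.Conv1d(in_channels, n_filters, kernel_size=1, bias=False),
        )
        self.bn = nn.BatchNorm1d(n_filters * (len(kernel_sizes) + 1))
        self.relu = nn.ReLU()

    def forward(self, x):                    # x: (batch, channels, time)
        z = self.bottleneck(x)
        branches = [conv(z) for conv in self.convs] + [self.pool_branch(x)]
        return self.relu(self.bn(torch.cat(branches, dim=1)))

# y = InceptionModule(in_channels=1)(torch.randn(8, 1, 128))   # -> (8, 128, 128)
```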

Summary

The long list of algorithms, models, and techniques summarized in this article should help you navigate the vast field of time series classification methods. I hope you find it useful.

References

[1] Johann Faouzi. "Time Series Classification: A Review of Algorithms and Implementations." In Ketan Kotecha (ed.), Machine Learning (Emerging Trends and Applications), Proud Pen, in press. ISBN 978-1-8381524-1-3. hal-03558165.

[2] Fawaz, Hassan Ismail, et al. “Deep learning for time series classification: a review”

[3] Dinger, Timothy R., et al. “What is time series classification?”

[4] Edin, Frederik, Time series classification — an overview

[5] Amidon, Alexandra, "A brief introduction to time series classification algorithms"

https://avoid.overfit.cn/post/3183d68076724a3db2654ecd22bd20c4

Author: Jan Marcel Kezmann


Reprinted from: blog.csdn.net/m0_46510245/article/details/128745768