[Stanford Ph.D. Thesis] Accelerating Machine Learning Algorithms with Adaptive Sampling

Source: Zhuanzhi (专知)
This post is an introduction to the thesis; suggested reading time: 5 minutes. The thesis's results build on techniques from the adaptive sampling literature.

In the era of massive data, efficient machine learning algorithms have become crucial. However, many common machine learning algorithms rely on subroutines that are computationally prohibitive on large datasets. Existing techniques typically subsample the data or use other approximations to improve efficiency, at the cost of introducing some approximation error. This thesis shows that comparable results can often be obtained, with little loss of quality, simply by replacing computationally intensive subroutines with carefully designed randomized ones. The results are based on techniques from the adaptive sampling literature.

Chapter 1 opens with a specific adaptive sampling problem: best-arm identification in multi-armed bandits. We first describe the setting and give a formal statement of the best-arm identification problem, and then present a general algorithm for it called "successive elimination." Chapters 2, 3, and 4 apply the techniques developed in Chapter 1 to different problems.

In Chapter 2, we show how the k-medoids clustering problem can be reduced to a sequence of best-arm identification problems. We leverage this reduction to propose a new successive-elimination-based algorithm that matches the clustering quality of the previous state of the art but reaches the same solution much faster. Under mild assumptions on the data-generating distribution, our algorithm reduces the sample complexity from O(n²) to O(n log n), where n is the size of the dataset.
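Since Chapters 2 through 4 all reduce their core subroutine to this same primitive, a minimal sketch of successive elimination helps fix ideas. The sketch below is an illustration, not the thesis's implementation: the function name, the batch size, and the Hoeffding-style confidence radius (which assumes roughly unit-scale sub-Gaussian rewards) are all assumptions made here.

```python
import numpy as np

def successive_elimination(arms, delta=0.05, batch_size=100, max_pulls=100_000):
    """Sketch of successive elimination for best-arm identification.

    `arms` is a list of zero-argument callables, each returning one noisy
    reward sample. An arm is eliminated once its upper confidence bound
    falls below the best lower confidence bound among surviving arms.
    """
    n_arms = len(arms)
    active = list(range(n_arms))
    sums = np.zeros(n_arms)    # running reward totals
    pulls = np.zeros(n_arms)   # number of samples drawn per arm

    while len(active) > 1 and pulls[active[0]] < max_pulls:
        for i in active:
            sums[i] += sum(arms[i]() for _ in range(batch_size))
            pulls[i] += batch_size
        means = sums[active] / pulls[active]
        # Hoeffding-style radius; the constant assumes sub-Gaussian rewards.
        radius = np.sqrt(np.log(4 * n_arms * pulls[active] ** 2 / delta)
                         / (2 * pulls[active]))
        best_lower = (means - radius).max()
        active = [i for i, m, r in zip(active, means, radius)
                  if m + r >= best_lower]
    return max(active, key=lambda i: sums[i] / pulls[i])

# Toy usage: three Gaussian arms with means 0.2, 0.5, and 0.8.
rng = np.random.default_rng(0)
arms = [lambda mu=mu: rng.normal(mu, 1.0) for mu in (0.2, 0.5, 0.8)]
print(successive_elimination(arms))  # 2, with high probability
```

The property exploited throughout the thesis is that the number of samples this procedure needs depends on the gaps between arm means, not on how expensive each arm would be to evaluate exactly.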

In Chapter 3, we analyze the problem of training tree-based models. Most of the training time for such models is spent splitting the nodes of the tree, i.e., determining the feature and corresponding threshold at which to split each node. We show that the node-splitting subroutine can also be reduced to a best-arm identification problem, and we introduce a state-of-the-art algorithm for training trees. Our algorithm depends only on the relative quality of each candidate split, rather than explicitly on the size of the training dataset, and reduces the explicit dependence on the dataset size n from O(n) to O(1). It is broadly applicable to many tree-based models, such as Random Forests and XGBoost.

In Chapter 4, we study the maximum inner product search problem. We observe that, like the k-medoids and node-splitting problems, maximum inner product search can be reduced to a best-arm identification problem. Based on this observation, we propose a novel algorithm for maximum inner product search on high-dimensional datasets. Under reasonable assumptions on the data, our algorithm reduces the explicit scaling with the dataset dimensionality d from O(√d) to O(1). The algorithm has several advantages: it requires no preprocessing of the data, it naturally handles adding or removing data points, and it exposes a hyperparameter for trading off accuracy against efficiency.

Chapter 5 concludes with a summary of the contributions of the thesis and possible directions for future work.
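To make the Chapter 4 reduction concrete, here is a small illustrative sketch (again, not the thesis's algorithm): each candidate vector becomes an arm, and a single pull samples one random coordinate to produce an unbiased estimate proportional to that vector's inner product with the query. The `mips_arm` helper and all constants are hypothetical, and the sketch reuses the `successive_elimination` function above.

```python
import numpy as np

def mips_arm(query, atom, rng):
    """One arm per candidate vector: a pull samples a random coordinate j
    and returns query[j] * atom[j], an unbiased estimate of <query, atom> / d."""
    d = len(query)
    def pull():
        j = rng.integers(d)
        # Dividing the inner product by d leaves the argmax unchanged and
        # keeps rewards on a unit scale, matching the confidence radius above.
        return query[j] * atom[j]
    return pull

rng = np.random.default_rng(0)
query = rng.normal(size=1_000)
atoms = rng.normal(size=(100, 1_000))       # 100 candidate vectors

arms = [mips_arm(query, atom, rng) for atom in atoms]
approx_best = successive_elimination(arms)  # bandit-based search
exact_best = int(np.argmax(atoms @ query))  # brute-force O(n·d) check
```

Note how the sketch mirrors the advantages listed above: no preprocessing of the candidate vectors is required, and inserting or deleting a data point simply adds or removes an arm.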

https://searchworks.stanford.edu/view/14783548
