[Paper Express] CSET - Big AI Potential of Small Data

Original title: Small Data's Big AI Potential

Authors: Husanjot Chahal, Helen Toner, Ilya Rahkovsky

Link: https://cset.georgetown.edu/publication/small-datas-big-ai-potential/

Blogger keywords: small data, application analysis

Recommended related papers:

- None

Overview:

This issue brief provides an introduction to and overview of "small data" AI approaches: methods that help in situations where little or no labeled data is available, and that reduce reliance on massive datasets collected from the real world. On the conventional understanding of artificial intelligence, data is a key strategic resource, and any meaningful progress in cutting-edge AI requires large amounts of data. This overemphasis on "big data" ignores the existence, and obscures the potential, of the methods described in this paper, which do not require large datasets for training.

We conduct the analysis in two parts. Part I introduces and categorizes the major small data approaches, which we broadly group into five categories—transfer learning, data labeling, artificial data, Bayesian methods, and reinforcement learning—and explains why they matter. In doing so, we aim not only to point out the potential benefits of small data methods, but also to deepen the non-technical reader's understanding of when and how data is useful to AI. Part II draws on original CSET datasets to present some exploratory findings: it assesses current and projected progress in scientific research on small data methods, and gives an overview of which countries are leading the way and of the main sources of funding for this research. Based on our findings, we draw the following four key takeaways:

a) AI is not synonymous with big data; several alternative approaches can be used in different small data settings.

b) Research on transfer learning is growing rapidly (even faster than the larger and better-known field of reinforcement learning), suggesting that this approach may work better, and be more widely used, in the future than it does today.

c) The United States and China compete fiercely in small data methods: the United States leads in the two largest categories, reinforcement learning and Bayesian methods, while China holds a smaller but growing lead in transfer learning, the fastest-growing category.

d) For the time being, transfer learning may be a promising target for additional U.S. government funding, since government funding makes up a relatively small share of investment in small data methods compared with its share of investment in the AI field overall.

Introduction:

Conventional wisdom holds that cutting-edge AI relies on vast amounts of data. On this conception of AI, data is a key strategic resource, and how much data a country (or company) can access is seen as a key indicator of its AI progress. This understanding of the role of data in AI is not entirely inaccurate—many current AI systems do use a lot of data. But **if policymakers treat this as an eternal truth for all AI systems, they go astray**. Overemphasizing data ignores the existence, and underestimates the potential, of several AI approaches that do not require large labeled datasets or data collected from real-world interactions. In this paper, we refer to these methods as "small data" methods.

**What we mean by "small data" is not a precisely defined category, so there is no single, formal, consistent definition.** Academic articles discuss small data in relation to the application domain under consideration, often tying it to sample size, e.g., kilobytes or megabytes versus terabytes of data. Popular media articles attempt to describe small data in terms of various other factors, such as its availability and human comprehensibility, or the amount and format of data that makes it accessible, informative, and actionable, especially for business decisions. Many discussions of data end up treating it as a generic resource. But data is not fungible: AI systems in different domains require different types of data and different types of methods, depending on the problem at hand.

This study describes small data from a policymaker's perspective. Government actors are often considered potentially powerful players in AI because of the nature of their real-world interactions and their ability to collect vast amounts of data—e.g., climate monitoring data, geological surveys, border control, social security, voter registration, vehicle and driver records, and more. Most national comparisons of AI competitiveness hold that China has a unique advantage because it has access to more data, citing its large population, strong data collection capabilities, and weak privacy protections. Part of our motivation for writing this paper is to describe a range of techniques that make this picture less accurate than is commonly assumed.

Finally, it is sometimes argued that government agencies will only benefit from the AI revolution if they can digitize, clean, and label vast amounts of data. While this suggestion has merit, it would be inaccurate to claim that all advances in AI depend on these conditions. **The future of AI may not be only about big data: even without massive investments in big data infrastructure, AI innovation in government (and beyond) can still happen.**

In the sections that follow, our goal is not only to point out the potential benefits of using small data approaches, but also to deepen the non-technical reader's understanding of when and how data is useful. This introduction can be considered a primer on small data methods, i.e., methods that minimize reliance on "big data". The analysis is divided into two parts. The first part explains, in technical terms, what "small data" methods are, which categories they comprise, and why they matter; it provides the conceptual basis for the data analysis in Part II. The second part draws on original CSET datasets, in particular our merged corpus of scholarly literature covering more than 90% of the world's academic output, to present our findings on small data methods along three pillars: research progress, national competitiveness, and funding. Through these we examine current and projected scientific progress and identify which countries are leading the way, as well as the main sources of funding for the research under study. Based on our findings, we summarize four key takeaways.

Key takeaways:

This article introduces and outlines a range of "small data" approaches to artificial intelligence. Based on our findings, we make the following main points:

**AI is not synonymous with big data, especially not with large pre-labeled datasets.** Big data's role in the AI boom of the past decade is undeniable, but making large-scale data collection and labeling a prerequisite for AI advancement would lead policymakers astray. There are various approaches to choose from, suited to different situations: if the problem at hand is data-scarce but related problems are data-rich, transfer learning may be useful; if the problem can be framed as an environment in which an agent learns by trial and error rather than from pre-collected data, reinforcement learning may be the right tool; and so on.
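To make the trial-and-error setting concrete, here is a minimal sketch of the simplest reinforcement learning problem, a two-armed bandit. This is an illustrative toy, not an example from the brief; the arm probabilities, the epsilon-greedy exploration rate, and the step count are all assumed values. Note that the agent generates its own experience by interacting with the environment, rather than training on a pre-collected labeled dataset:

```python
import random

random.seed(0)

# Two slot machines ("arms") with unknown success probabilities.
TRUE_PROBS = [0.3, 0.7]  # hidden from the agent; illustrative values

def pull(arm):
    """Environment step: returns a reward of 1 or 0 for pulling an arm."""
    return 1 if random.random() < TRUE_PROBS[arm] else 0

counts = [0, 0]      # how many times each arm has been pulled
values = [0.0, 0.0]  # running estimate of each arm's expected reward

for t in range(2000):
    # Epsilon-greedy: mostly exploit the best-looking arm, sometimes explore.
    if random.random() < 0.1:
        arm = random.randrange(2)
    else:
        arm = values.index(max(values))
    reward = pull(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
```

After enough pulls, the running estimates converge and the agent spends most of its time on the better arm; the same explore/exploit loop, scaled up with function approximation, underlies modern deep reinforcement learning systems.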

Research on transfer learning is growing especially rapidly—even faster than the larger and better-known field of reinforcement learning. The implication is that this method may work better and be more widely used in the future than it does now. Therefore, if policymakers are faced with a lack of data for a problem of interest, it would be helpful to seek to identify relevant datasets that might serve as a starting point for transfer learning-based approaches.
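As a minimal illustration of the idea (a toy linear model, not an example from the brief; the weights, dataset sizes, and noise levels are all assumed), the sketch below "pretrains" on a data-rich source task and then fits the data-scarce target task with a ridge regression that shrinks toward the source weights instead of toward zero. Fine-tuning a pretrained neural network follows the same logic at a much larger scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data-rich "source" task: 1,000 labeled examples of a related problem.
w_true_src = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
X_src = rng.normal(size=(1000, 5))
y_src = X_src @ w_true_src + rng.normal(scale=0.1, size=1000)

# Data-scarce "target" task: only 10 labels, with slightly shifted weights.
w_true_tgt = w_true_src + 0.2
X_tgt = rng.normal(size=(10, 5))
y_tgt = X_tgt @ w_true_tgt + rng.normal(scale=0.1, size=10)

def fit_ridge(X, y, w0=None, lam=1.0):
    """Ridge regression whose solution is shrunk toward w0 (zero if None).

    Minimizes ||X w - y||^2 + lam * ||w - w0||^2, so w0 acts as the
    starting point that the small dataset then adjusts.
    """
    if w0 is None:
        w0 = np.zeros(X.shape[1])
    A = X.T @ X + lam * np.eye(X.shape[1])
    b = X.T @ y + lam * w0
    return np.linalg.solve(A, b)

w_src = fit_ridge(X_src, y_src)                 # "pretrain" on the rich task
w_scratch = fit_ridge(X_tgt, y_tgt)             # small-data baseline
w_transfer = fit_ridge(X_tgt, y_tgt, w0=w_src)  # start from source weights
```

Because the transfer fit starts from the source solution, ten examples only need to correct a small shift in the weights, whereas the from-scratch fit must estimate all of them from those same ten points.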

According to our cluster-based research approach, the United States and China are highly competitive in small data approaches: they are the top two countries (by number of research papers) in all five categories we consider. While the U.S. has a large lead in the two largest categories (reinforcement learning and Bayesian methods), China has a smaller but growing lead in transfer learning (the fastest-growing category).

For the time being, transfer learning may be a promising target for more U.S. government funding. U.S. government funding represents a relatively small share of funding for small data approaches compared with the investment pattern in the AI field as a whole. This may be because the U.S. government does not prioritize research in these areas, or because U.S. private sector actors tend to allocate a higher proportion of their funding to these methods. Either way, given that transfer learning is a rapidly emerging field, it may represent a promising opportunity for increased funding from U.S. government sources.


Origin blog.csdn.net/qq_36396104/article/details/129693559