From Data Mining to Knowledge Discovery in Databases

This is a classic review paper in the field of data mining, so I translated it into Chinese to deepen my understanding and make it easier to review later. A considerable part of it is machine translation and may not read smoothly. From Section 3 (the KDD process) onward, the translation is a mix of machine and manual work, with the worst passages corrected by hand. There are still some terms and concepts I do not fully understand, so parts may be rough. I am posting it mainly to keep a record that is faster for me to reread than the English original, and I plan to summarize and refine the key points when I have time. The full text follows:


Data mining and knowledge discovery in databases has attracted a lot of research, industry and media attention. Why the excitement? This article provides an overview of this emerging field, illuminating how data mining and knowledge discovery relate to each other and to related fields such as machine learning, statistics, and databases. This paper mentions specific real-world applications, specific data mining techniques, challenges for real-world applications of knowledge discovery, and current and future research directions.

In a wide variety of fields, data is being collected and accumulated at an alarming rate. We urgently need a new generation of computational theory and tools to help humans extract useful information (knowledge) from rapidly growing digital data. These theories and tools are the subject of the emerging field of knowledge discovery from databases (KDD).

At an abstract level, the KDD field is concerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is mapping low-level data (which are typically too voluminous to understand and digest easily) into other forms that might be more compact (e.g., a short report), more abstract (e.g., a descriptive approximation or a model of the process that generated the data), or more useful (e.g., a predictive model for estimating the value of future cases). At the core of this process is the application of specific data mining methods for pattern discovery and extraction.

This paper first discusses the historical background of KDD and data mining, and their intersection with other related fields. This paper provides a brief overview of recent practical applications of KDD. The definitions of knowledge discovery and data mining are given, and a general multi-step knowledge discovery process is given. This multi-step process includes the application of data mining algorithms as a specific step in the process. The data mining steps are discussed in more detail in the context of specific data mining algorithms and their applications. The paper also outlines the problems in practical applications. Finally, the article lists the challenges for future research and development, and specifically discusses the potential opportunities of artificial intelligence technology in KDD systems.

1. Why is KDD needed?

Traditional methods of turning data into knowledge rely on manual analysis and interpretation. For example, in the health-care industry, specialists typically analyze current trends and changes in health-care data on a regular basis, say quarterly. The specialists then submit a detailed analysis report to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In an entirely different application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging geologic objects of interest, such as impact craters. Whether it is science, marketing, finance, health care, retail, or any other field, the classic approach to data analysis relies fundamentally on one or more analysts becoming intimately familiar with the data and acting as the interface between the data, the users, and the products.

The need to scale up human analytical capabilities to handle the vast numbers of bytes we can collect is both economic and scientific. Businesses use data to gain a competitive advantage, improve efficiency, and provide more valuable services to customers. The data we collect about our environment are the fundamental evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to collect far more data than we can digest, it is natural to turn to computational techniques to help us mine meaningful patterns and structures from vast amounts of data. Therefore, KDD attempts to address a real problem that the digital information age brings to all of us: data overload.


2. Data Mining and Knowledge Discovery in the Real World

The current interest in KDD is largely the result of media attention to successful KDD applications, for example, the focus articles within the last two years in BusinessWeek, Newsweek, Byte, PCWeek, and other large-circulation periodicals. Unfortunately, it is not easy to separate fact from media hype. Nonetheless, several well-documented, successful systems can rightly be called KDD applications and have been deployed in operational use on large-scale real-world problems in science and business.

A fundamental application area in science is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky survey images (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT outperformed both humans and traditional computational techniques in classifying faint sky objects.

On the business side, major KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.

Marketing: In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimated that more than half of all retailers are using or planning to use database marketing, and those who do use it have had good results; for example, American Express reported a 10 to 15 percent increase in credit card use. Another notable marketing application is market-basket analysis systems (Agrawal et al. 1996), which find patterns such as "if a customer bought X, he or she is also likely to buy Y and Z"; such information is very valuable to retailers.

Investing: Many companies use data mining to invest, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural networks, and genetic algorithms to manage a portfolio totaling $600 million; since its inception in 1993, the system has outperformed the broader market (Hall, Mani, and Barr 1996).

Fraud Detection: The HNC Falcon and Nestor PRISM systems are used for monitoring credit card fraud, watching over millions of accounts. The FAIS system (Senator et al. 1995), from the U.S. Department of the Treasury's Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money-laundering activity.

Manufacturing: The CASSIOPEE troubleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major European airlines to diagnose and predict problems on the Boeing 737. Clustering methods are used to derive families of faults.

Telecommunications: The telecommunications alarm-sequence analyzer (TASA) was built in cooperation with a manufacturer of telecommunications equipment and three telephone networks (Mannila, Toivonen, and Verkamo 1995). The system uses a novel framework for locating frequently occurring alarm episodes from the alarm stream and presenting them as rules. Large sets of discovered rules can be explored with flexible information-retrieval tools supporting interactivity and iteration. In this way, TASA offers pruning, grouping, and ordering tools to refine the results of the basic brute-force search.

Data Cleaning: The MERGE-PURGE system was applied to identify duplicate benefit claims (Hernandez and Stolfo 1995). It has been successfully applied to data from the Washington State Department of Welfare.

In other areas, a widely publicized system is IBM's ADVANCED SCOUT, a specialized data mining system that helps NBA coaches organize and interpret data from NBA games (US News, 1995).

Finally, a novel and increasingly important type of discovery is one based on the use of intelligent agents to navigate through an information-rich environment. Although the idea of active triggers has long been analyzed in the database field, truly successful applications of this idea appeared only with the advent of the Internet. These systems ask the user to specify a profile of interest and then search for related information among a wide variety of public-domain and proprietary sources.

These are just a few of the many systems that use KDD techniques to automatically generate useful information from large amounts of raw data. See Piatetsky-Shapiro et al. (1996) for an overview of the problems in developing industrial KDD applications.

3. Data mining and KDD

Historically, the concept of discovering useful patterns in data has been known by various names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining is primarily used by statisticians, data analysts, and the management information systems (MIS) community. It is also widely used in the database field. The term knowledge discovery in databases was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro 1991) to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the artificial intelligence and machine learning fields.

In our view, KDD refers to the whole process of discovering useful knowledge from data, and data mining is a specific step in this process. Data mining is the application of specific algorithms to extract patterns from data. The distinction between the KDD process and the data mining step (in the process) is central to this article. Other steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of mining results are crucial to ensure useful knowledge is obtained from the data. Blindly applying data mining methods can be a dangerous activity that can easily lead to the discovery of meaningless and invalid patterns.

3.1 The interdisciplinary nature of KDD

KDD evolved from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unified goal is to extract high-level knowledge from low-level data in the context of large datasets.

The data mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns from data in the data mining step of the KDD process. A natural question is, how is KDD different from pattern recognition or machine learning (and related fields)? The answer is that these fields provide some of the data mining methods used in the data mining step of the KDD process. KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets and still run efficiently, how results can be interpreted and visualized, and how the overall human-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any single discipline such as machine learning. Against this backdrop, there are clear opportunities for other fields of AI (besides machine learning) to contribute to KDD. KDD places particular emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. For example, although a neural network is a powerful modeling tool, it is relatively difficult to understand compared with a decision tree. KDD also emphasizes scaling and robustness properties of modeling algorithms for large noisy data sets.


Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational-efficiency limitations, produce a particular enumeration of patterns (or models) over the data.

Related areas of AI research include machine discovery, which targets the discovery of empirical laws from observation and experimentation, and causal modeling, which concerns the inference of causal models from data. Statistics in particular has much in common with KDD. Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s, when computer-based data analysis techniques were first introduced. The concern arises because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but, in fact, are not. Clearly, this issue is of fundamental importance to KDD. In recent years, substantial progress has been made in statistics toward understanding these kinds of issues, and much of this work is of direct relevance to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree possible) the entire process of data analysis and the statistician's "art" of hypothesis selection.

The driving force behind KDD is the database field (the second D in KDD). Indeed, the problem of performing data manipulation efficiently when the data cannot fit in main memory is of fundamental importance to KDD. Database techniques for efficient data access, grouping and ordering operations when accessing data, and optimizing queries constitute the basis for scaling algorithms to larger data sets. Most data mining algorithms from statistics, pattern recognition, and machine learning assume the data are in main memory and pay no attention to how the algorithm breaks down if only limited views of the data are possible.

A related field that has evolved from databases is data warehousing, which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehouses provide the foundation for KDD in two important ways: (1) data cleansing and (2) data access.

Data cleansing: As organizations consider a unified logical view of the wide variety of data and databases they possess, they must address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors where possible.

Data access : Uniform, well-defined methods must be created to access data and provide access paths for data that has been difficult to access in the past, such as data stored offline.

Once organizations and individuals figure out how to store and access their data, the natural next question is, what else can we do with all that data? This is where opportunities for KDD naturally arise.

A popular approach to the analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles proposed by Codd (1993). OLAP tools focus on providing multidimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are aimed at simplifying and supporting interactive data analysis, whereas the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.

3.2 Basic Definitions

KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, and Smyth 1996)

Here, data are a set of facts (e.g., cases in a database), and a pattern is an expression in some language describing a subset of the data or a model applicable to that subset. Thus, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term "process" implies that KDD comprises many steps, including data preparation, pattern search, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial, we mean that some search or inference is involved; that is, it is not simply the computation of a predefined quantity like the average of a set of numbers.

The discovered patterns should be valid to some extent for new data. We also want the pattern to be novel (at least to the system, preferably to the user) and potentially useful, that is, to bring some benefit to the user or task. In the end, the patterns should be understandable, if not immediately then after some post-processing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, a measure of certainty (e.g., estimated prediction accuracy on new data) or utility (e.g., gain, perhaps dollars saved because of better predictions or faster response times of the system) can be defined. Notions such as novelty and understandability are much more subjective. In some cases, understandability can be estimated by simplicity (e.g., the number of bits needed to describe a pattern). An important notion, called interestingness (see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be expressed implicitly through the ordering the KDD system places on discovered patterns or models.

With these concepts in place, we can consider a pattern to be knowledge if it exceeds some interestingness threshold; this is by no means an attempt to define knowledge in the philosophical or even the popular sense. Rather, knowledge in this definition is purely user oriented and domain specific, determined by whatever functions and thresholds the user chooses.
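As a toy illustration (not from the original paper), one way such a user-chosen interestingness function and threshold might be encoded is sketched below; the weights, the 0-1 scaling of each component score, and the threshold value are all assumptions for illustration:

```python
# Hypothetical sketch: interestingness as a weighted combination of the
# component measures named in the text (validity, novelty, usefulness,
# simplicity). The weights and the 0-1 scaling are assumptions, not part
# of the original paper.
def interestingness(validity, novelty, usefulness, simplicity,
                    weights=(0.4, 0.2, 0.2, 0.2)):
    """Combine per-pattern scores (each assumed to lie in [0, 1])."""
    scores = (validity, novelty, usefulness, simplicity)
    return sum(w * s for w, s in zip(weights, scores))

# A pattern is treated as "knowledge" once it crosses a user-chosen threshold.
THRESHOLD = 0.6
pattern_score = interestingness(validity=0.9, novelty=0.5,
                                usefulness=0.7, simplicity=0.8)
print(pattern_score, pattern_score > THRESHOLD)
```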

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational-efficiency limitations, produce a particular enumeration of patterns (or models) over the data. Note that the space of patterns is often infinite, and the enumeration of patterns involves some form of search in this space. Practical computational constraints place severe limits on the subspace that a data mining algorithm can explore.

The KDD process involves using the database along with any required selection, preprocessing, subsampling, and transformations of it; applying data mining methods (algorithms) to enumerate patterns from it; and evaluating the products of data mining to identify the subset of the enumerated patterns deemed knowledge. The data mining component of the KDD process is concerned with the algorithmic means by which patterns are extracted and enumerated from data. The overall KDD process (Figure 1) includes the evaluation and possible interpretation of the mined patterns to determine which patterns can be considered new knowledge. The KDD process also includes all the additional steps described in the next section. The notion of an overall user-driven process is not unique to KDD: analogous proposals have been made in statistics (Hand 1994) and machine learning (Brodley and Smyth 1996).

3.3 The KDD Process

The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user. Brachman and Anand (1996) give a practical view of the KDD process, emphasizing its interactive nature. Here, we broadly outline the basic steps of the process:

First, gain an understanding of the application domain and related prior knowledge, and identify the goals of the KDD process from the customer's perspective.

Second, construct a target dataset: select a dataset, or focus on a subset of variables or data samples, and perform knowledge discovery on this basis.

Third, data cleaning and preprocessing. Basic operations include removing noise where appropriate, gathering necessary information to model or explain noise, deciding on strategies for handling missing data fields, and accounting for time series information and known changes.

Fourth, data dimensionality reduction and projection: find useful features to represent the data depending on the goal of the task. Using dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.

Fifth, match the goals of the KDD process (step 1) to a particular data mining method, for example, summarization, classification, regression, clustering, and so on; these methods are described later and in Fayyad, Piatetsky-Shapiro, and Smyth (1996).

Sixth, exploratory analysis and model and hypothesis selection: choose the data mining algorithm(s) and select the method(s) to be used for searching for data patterns. This process includes deciding which models and parameters might be appropriate (e.g., models for categorical data are different from models for vectors of real numbers) and matching a particular data mining method with the overall criteria of the KDD process (e.g., the end user might be more interested in understanding the model than in its predictive capabilities).

Seventh, data mining: search for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data mining method by correctly performing the preceding steps.

Eighth, interpret the mined patterns, possibly returning to any of steps 1 to 7 for further iterations. This step can also involve visualization of the extracted patterns and models, or of the data given the model.

Ninth, act on discovered knowledge: use the knowledge directly, incorporate the knowledge into another system for further manipulation, or simply record and report the knowledge to interested parties. This process also includes examining and resolving potential conflicts with previously believed (or extracted) knowledge.

A KDD process can involve significant iteration and can contain loops between any two steps. The basic flow of steps (although not the potentially large number of iterations and loops) is shown in Figure 1. Most previous work on KDD has focused on step 7, data mining. In practice, however, the other steps are just as important (and probably more so) for the successful application of KDD. Having defined the basic notions and introduced the KDD process, we now focus on the data mining component.
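Before moving on, here is a minimal end-to-end sketch of the steps just outlined (selection, preprocessing/transformation, data mining, evaluation), using scikit-learn on synthetic data. The data set, the pipeline stages, and the choice of a decision tree as the mining algorithm are illustrative assumptions, not part of the original paper:

```python
# A minimal sketch of the KDD process steps using scikit-learn. The dataset,
# feature meanings, and model choice are illustrative assumptions only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Step 2: target data set -- two features (think income, debt) and a label.
X = rng.normal(size=(500, 2))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Steps 3-4: cleaning/preprocessing and transformation folded into a pipeline.
model = Pipeline([
    ("scale", StandardScaler()),                     # simple transformation step
    ("mine", DecisionTreeClassifier(max_depth=3)),   # step 7: data mining
])

# Step 8: evaluate the mined model on held-out data before acting on it (step 9).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```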

4. Data Mining Steps in the KDD Process

The data mining step of the KDD process usually involves repeated iterative applications of specific data mining methods. This section presents an overview of the main goals of data mining, a description of the methods used to achieve these goals, and a brief description of the data mining algorithms that incorporate these methods.

The goals of knowledge discovery are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification, the system is limited to verifying the user's hypothesis. With discovery, the system autonomously finds new patterns. The discovery goal is further subdivided into prediction, where the system finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presentation to the user in a human-understandable form. In this article, we are primarily concerned with discovery-oriented data mining.

Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the role of inferred knowledge: whether a model reflects useful or interesting knowledge is part of the overall, interactive KDD process, where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data mining applications, given the typical presence of uncertainty in real-world data-generating processes.

Most data mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, regression, clustering, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst. It should be emphasized that of the many data mining methods advertised in the literature, there are really only a few fundamental techniques. The actual underlying model representation used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernels and basis functions, threshold-Boolean functions, and so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to evaluate the model or in the search method used to find a good fit.

In our brief overview of data mining methods, we specifically try to convey the notion that most, if not all, methods can be viewed as extensions or hybrids of some fundamental techniques and principles. We first discuss the main methods of data mining and then show that data mining methods can be viewed as consisting of three main algorithmic components. (1) Model representation, (2) Model evaluation, (3) Search.

In the discussion of KDD and data mining methods, we use a simple example to make some concepts more concrete. Figure 2 shows a simple 2D artificial dataset with 23 samples. Each point on the graph represents a person who received a loan from a certain bank at some time in the past. The horizontal axis represents personal income; the vertical axis represents the person's total personal debt (mortgage, car loan, etc.). The data is broken down into two categories: (1) "x" for people who have defaulted on their loans, and (2) "o" for people whose loans are in good standing with the bank. Thus, this simple artificial dataset can represent a historical dataset that can contain useful knowledge from the perspective of the lending bank. Note that in practical KDD applications, there are usually more dimensions (up to hundreds) and more data points (thousands or even millions).

The purpose here is to illustrate basic ideas on a small problem in two-dimensional space.

4.1 Data Mining Methods

In practice, the two main goals of data mining are prediction and description. As mentioned earlier, prediction involves using some variables or fields in a database to predict unknown or future values of other variables of interest, while description focuses on finding human-interpretable patterns that describe the data. Although the line between predictive and descriptive is not sharp (some predictive models can be descriptive to the extent they are understandable, and vice versa), this distinction is useful for understanding the overall discovery goal. The relative importance of prediction and description can vary widely for a particular data mining application. The goals of prediction and description can be achieved by using various specific data mining methods.

4.1.1 Classification

Classification is learning a function that maps (classifies) a data item into one of several predefined classes. Examples of classification methods used as part of knowledge discovery applications include the classification of trends in financial markets (Apte and Hong 1996) and the automated identification of objects of interest in large image databases (Fayyad, Djorgovski, and Weir 1996). Figure 3 shows a simple partitioning of the loan data into two class regions; note that it is not possible to separate the classes perfectly using a linear decision boundary. The bank might want to use the classification regions to automatically decide whether to extend a loan to a prospective applicant.
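A small hedged sketch of this idea: fitting a linear classifier (logistic regression) to synthetic income/debt data. The data, labels, and model choice are assumptions for illustration; as noted above, a linear boundary will generally not separate the classes perfectly:

```python
# Illustrative only: a linear classifier on synthetic "income vs. debt" data,
# mirroring the loan example in the text. Feature values and labels are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
income = rng.uniform(20, 120, size=200)                 # thousands of dollars
debt = rng.uniform(0, 60, size=200)
defaulted = (debt - 0.4 * income + rng.normal(scale=5, size=200) > 0).astype(int)

X = np.column_stack([income, debt])
clf = LogisticRegression(max_iter=1000).fit(X, defaulted)

# The fitted coefficients define a linear decision boundary in the 2-D plane;
# as the text notes, a linear boundary rarely separates the classes perfectly.
print(clf.coef_, clf.intercept_)
print("training accuracy:", clf.score(X, defaulted))
```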

4.1.2 Regression

Regression is learning a function that maps a data item to a real-valued prediction variable. Regression has many applications, such as predicting the amount of biomass present in a forest from remotely sensed microwave measurements, estimating the probability that a patient will survive given the results of a set of diagnostic tests, predicting consumer demand for a new product as a function of advertising expenditure, and predicting time series where the input variables can be time-lagged versions of the prediction variable. Figure 4 shows the result of a simple linear regression where total debt is fitted as a linear function of income: the fit is poor because there is only a weak correlation between the two variables.
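A minimal sketch of that linear-regression example on synthetic data (the income/debt values and the noise level are assumptions); the low R^2 mirrors the poor fit described above:

```python
# Sketch of the linear-regression example described above: total debt fitted
# as a linear function of income. The synthetic data are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
income = rng.uniform(20, 120, size=(200, 1))
# Weak relationship with lots of noise, so the fit is expected to be poor.
debt = 0.1 * income[:, 0] + rng.normal(scale=15, size=200)

reg = LinearRegression().fit(income, debt)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("R^2:", reg.score(income, debt))   # low R^2 reflects the weak correlation
```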

4.1.3 Clustering

Clustering is a common descriptive task in which one seeks to identify a finite set of categories or clusters to describe the data. The categories can be mutually exclusive and exhaustive, or they can consist of a richer representation, such as hierarchical or overlapping categories. Examples of clustering applications in knowledge discovery include finding homogeneous subgroups of consumers in marketing databases and identifying subcategories of spectra from infrared sky survey measurements. Figure 5 shows a possible clustering of the loan data into three clusters; note that the clusters overlap, allowing data points to belong to more than one cluster. The original class labels (denoted by x and o in the previous figures) have been replaced by a +, indicating that class membership is no longer assumed to be known.
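As a sketch of this task, the following uses a Gaussian mixture model on synthetic two-dimensional data, since mixture models give the kind of overlapping, probabilistic cluster memberships described above; the data and the choice of three components are assumptions:

```python
# Sketch of clustering 2-D loan-style data. A Gaussian mixture is used because,
# like the overlapping clusters in the figure, it assigns soft (probabilistic)
# memberships; the data and the number of clusters are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal([30, 40], 5, size=(50, 2)),
    rng.normal([70, 20], 5, size=(50, 2)),
    rng.normal([100, 45], 5, size=(50, 2)),
])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
memberships = gmm.predict_proba(X)   # each row may give weight to several clusters
print(memberships[:3].round(2))
```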

4.1.4 Summarization

Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviation of all fields. More sophisticated methods involve the derivation of summary rules, multivariate visualization techniques, and the discovery of functional relationships between variables. Summarization techniques are often applied to interactive exploratory data analysis and automated report generation.
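A minimal example of the simplest case described above, tabulating the mean and standard deviation of all fields with pandas (the fields themselves are made up):

```python
# Simple summarization sketch: tabulating the mean and standard deviation
# of every field, as the text suggests. Column names are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "income": rng.uniform(20, 120, size=200),
    "debt": rng.uniform(0, 60, size=200),
    "age": rng.integers(21, 70, size=200),
})

summary = df.agg(["mean", "std"]).T   # one compact description per field
print(summary)
```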

4.1.5 Dependency Modeling

Dependency modeling involves finding a model that describes significant dependencies between variables. Dependency models exist at two levels: (1) the structural level of the model specifies (often in graphic form) which variables are locally dependent on each other, and (2) the quantitative level of the model specifies the strengths of the dependencies using some numeric scale. For example, probabilistic dependency networks use conditional independence to specify the structural aspect of the model and probabilities or correlations to specify the strengths of the dependencies (Glymour et al. 1987; Heckerman 1996). Probabilistic dependency networks are increasingly finding applications in areas as diverse as the development of probabilistic medical expert systems from databases, information retrieval, and modeling of the human genome.

4.1.6 Variation and Deviation Analysis

Change and deviation detection focuses on finding the most significant changes in previously measured or normative data.

4.2 Components of Data Mining Algorithms

The next step is to construct concrete algorithms to implement the general approach we outline. We can identify three main components in any data mining algorithm: (1) model representation, (2) model evaluation, and (3) search.

This simplified view is not necessarily complete or all-encompassing; rather, it is a convenient way of expressing key concepts of data mining algorithms in a relatively unified and compact manner.

4.2.1 Model representation

Model representation is the language used to describe discoverable patterns. If the representation is too limited, then no amount of training time or examples can produce an accurate model for the data. It is important that a data analyst fully comprehend the representational assumptions that might be inherent in a particular method. It is equally important that an algorithm designer clearly state which representational assumptions are being made by a particular algorithm. Note that increased representational power for models comes with increased danger of overfitting the training data, resulting in reduced prediction accuracy on unseen data.

4.2.2 Model evaluation criteria

Model evaluation criteria are quantitative statements of how well a particular model (a model and its parameters) meets the goals of the KDD process. For example, predictive models are often judged by their empirical prediction accuracy on some test set. Descriptive models can be evaluated along the dimensions of predictive accuracy, novelty, utility, and understandability of the fitted model.

4.2.3 Search method

The search method consists of two parts: (1) parameter search; (2) model search.
Once the model representation (or family of representations) and the model evaluation criteria are fixed, the data mining problem reduces to a pure optimization task: find the parameters and models from the selected family that optimize the evaluation criteria. In parameter search, the algorithm must search for the parameters that optimize the model evaluation criteria given observed data and a fixed model representation. Model search occurs as a loop over the parameter-search method: the model representation is changed so that a family of models is considered.
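The following sketch illustrates the two levels of search using scikit-learn's grid search: the outer loop varies the model representation (tree depth), and fitting each candidate tree is the inner parameter search. The data, the model family, and the evaluation criterion are illustrative assumptions:

```python
# Sketch of the two search levels described above: an inner parameter search
# (fitting each candidate) wrapped in an outer model search over a family of
# models (here, decision trees of different depths).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

# Model search: the representation is varied (tree depth); for each candidate,
# fitting the tree is the parameter search that optimizes the evaluation criterion.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 5, 8]},
    cv=5,
)
search.fit(X, y)
print("best model:", search.best_params_, "cv score:", search.best_score_)
```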

5. Some Data Mining Methods

A wide variety of data mining methods exist, but here, we only focus on a subset of popular techniques. Each approach is discussed in the framework of model representation, model evaluation, and search.

5.1 Decision tree and rules

Decision trees and rules using univariate splitting have simple representations that make inferring models relatively easy for users to understand. However, restrictions on specific tree or rule representations can greatly limit the functional form. For example, Figure 6 illustrates the effect of thresholding the income variable on the loan dataset: it is clear that using this simple thresholding (parallel to the feature axis) severely limits the types of classification boundaries that can be generalized. If the model space is enlarged to allow more general expressions (such as multivariate hyperplanes of arbitrary angles), then the model will be more powerful in terms of predictions, but may be harder to understand. A large number of decision tree and rule induction algorithms are described in the machine learning and applied statistics literature.

To a large extent, they depend on likelihood-based model evaluation methods, with varying degrees of sophistication in terms of penalizing model complexity. Greedy search methods, which involve growing and pruning rule and tree structures, are typically used to search over the space of possible rules and trees. Trees and rules are primarily used for predictive modeling, both for classification (Apte and Hong 1996; Fayyad, Djorgovski, and Weir 1996) and regression, although they can also be applied to summary descriptive modeling.
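A brief sketch of univariate, axis-parallel splits on synthetic loan-style data, with the fitted tree printed back as human-readable rules (the data and depth limit are assumptions):

```python
# Illustrative decision-tree sketch showing the univariate (axis-parallel)
# thresholds discussed above and how the fitted tree reads back as simple rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(6)
income = rng.uniform(20, 120, size=200)
debt = rng.uniform(0, 60, size=200)
good_loan = ((income > 60) & (debt < 35)).astype(int)

X = np.column_stack([income, debt])
tree = DecisionTreeClassifier(max_depth=2).fit(X, good_loan)

# Each internal node thresholds a single variable, parallel to a feature axis.
print(export_text(tree, feature_names=["income", "debt"]))
```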

5.2 Nonlinear regression and classification methods

These methods include a family of prediction techniques that fit linear and nonlinear combinations of basis functions (sigmoids, splines, polynomials) to combinations of the input variables. Examples include feedforward neural networks, adaptive spline methods, and projection pursuit regression. Using a neural network as an example, Figure 7 shows the type of nonlinear decision boundary a neural network might find for the loan data set. In terms of model evaluation, although networks of the appropriate size can universally approximate any smooth function to any desired degree of accuracy, relatively little is known about the representation properties of fixed-size networks estimated from finite data sets. The standard squared-error and cross-entropy loss functions used to train neural networks can be viewed as log-likelihood functions for regression and classification, respectively. Backpropagation is a parameter-search method that performs gradient descent in the parameter (weight) space, starting from random initial conditions, to find a local maximum of the likelihood function. Although powerful in representational terms, nonlinear regression methods can be difficult to interpret. For example, although the classification boundary of Figure 7 may be more accurate than the simple threshold boundary of Figure 6, the threshold boundary has the advantage that the model can be expressed, with some degree of certainty, as a simple rule of the form "if income is greater than the threshold, then the loan will have a good status."

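A minimal sketch of this trade-off, using scikit-learn's MLPClassifier on made-up income/debt data; the architecture, data, and curved class boundary are all illustrative assumptions:

```python
# Sketch of a small feedforward neural network on loan-style data. As the text
# notes, the nonlinear boundary it learns may be more accurate than a single
# threshold but is harder to read back as a rule.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(7)
X = rng.uniform([20, 0], [120, 60], size=(300, 2))              # income, debt
y = (X[:, 1] < 0.01 * (X[:, 0] - 20) ** 2 + 10).astype(int)     # curved boundary

net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))
# Unlike a threshold rule, the learned weights do not translate directly into
# an "if income > t then good loan" style statement.
```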

5.3 Case-based approach

In the case-based approach, the model is approximated with representative examples from the database; that is, predictions on new examples are derived from the properties of similar examples in the database whose predictions are known. Techniques include nearest-neighbor classification and regression algorithms (Dasarathy 1991) and case-based reasoning systems (Kolodner 1993). Figure 8 illustrates the use of a nearest-neighbor classifier on the loan data set: the class at any new point in the two-dimensional space is the same as the class of the closest point in the original training data set.

A potential disadvantage of case-based methods (compared with tree-based methods) is that a well-defined distance metric is required to evaluate the distance between data points. For the loan data in Figure 8, this would not be a problem because income and debt are measured in the same units. However, if one wished to include variables such as the duration of the loan, sex, and profession, more effort would be required to define a sensible metric between these variables. Model evaluation is typically based on cross-validated estimates of prediction error (Weiss and Kulikowski 1991): the parameters of the model to be estimated can include the number of neighbors used for prediction and the distance metric itself. Like nonlinear regression methods, case-based methods often have asymptotically powerful approximation properties but, conversely, can be difficult to interpret because the model is implicit in the data rather than explicitly formulated. Related techniques include kernel density estimation and mixture modeling.
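A minimal nearest-neighbor sketch on synthetic loan-style data; standardizing the features first is one simple stand-in for the "well-defined distance metric" discussed above, and the data and the choice of k = 1 are assumptions:

```python
# Nearest-neighbor sketch for the case-based approach: a new applicant is
# classified by the label of the most similar stored case.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
X = rng.uniform([20, 0], [120, 60], size=(200, 2))   # income, debt
y = (X[:, 0] - X[:, 1] > 30).astype(int)

# Scaling the features gives a simple, well-defined distance metric.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
knn.fit(X, y)

new_applicant = np.array([[65.0, 20.0]])
print("predicted class:", knn.predict(new_applicant)[0])
```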

5.4 Probabilistic graphical dependency models

Graphical models specify probabilistic dependencies using a graph structure. The model specifies which variables are directly dependent on each other. Typically, these models are used with categorical or discrete-valued variables, but extensions to special cases, such as Gaussian densities, can also be applied to real-valued variables. In the artificial intelligence and statistical communities, these models were initially developed within the framework of probabilistic expert systems; the structure of the model and its parameters (the conditional probabilities attached to the links of the graph) were elicited from experts. Recently, there has been significant work in both AI and statistics on methods whereby both the structure and the parameters of graphical models can be learned directly from databases. Model evaluation criteria are typically Bayesian in form, and parameter estimation can be a mixture of closed-form estimates and iterative methods, depending on whether a variable is directly observed or hidden. Model search can consist of greedy hill-climbing methods over various graph structures. Prior knowledge, such as a partial ordering of the variables based on causal relations, can be useful in reducing the model search space. Although still primarily in the research phase, graphical model induction methods are of particular interest because the graphic form of the model lends itself easily to human interpretation.

5.5 Relational Learning Model

Although the representation used by decision trees and rules is limited to propositional logic, relational learning (also known as inductive logic programming) uses the more flexible pattern language of first-order logic. A relational learner can easily find formulas such as X = Y. To date, most research on model evaluation methods for relational learning is logical in nature. The extra representational power of relational models comes at the price of significant computational demands on the search side.

6. Discussion

Given the broad spectrum of data mining methods and algorithms, our overview is inevitably limited in scope; many data mining techniques, particularly specialized methods for particular types of data and domains, were not mentioned specifically. We believe the general discussion of data mining tasks and components has general relevance to a variety of methods. For example, consider time-series prediction, which has traditionally been cast as a predictive regression task (autoregressive models and so on). Recently, more general models have been developed for time-series applications, such as nonlinear basis functions, example-based methods, and kernel methods. In addition, there has been a general shift of emphasis toward descriptive graphical and local data modeling of time series rather than purely predictive modeling. Thus, although different algorithms and applications might appear different on the surface, it is not uncommon for them to share many common components. Understanding data mining and model induction at this component level clarifies the behavior of any data mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

An important point is that each technique typically suits some problems better than others. For example, decision tree classifiers can be useful for finding structure in high-dimensional spaces and in problems with mixed continuous and categorical data (because tree methods do not require distance metrics). However, classification trees might not be suitable for problems where the true decision boundaries between classes are described by, say, a second-order polynomial. Thus, there is no universal data mining method, and choosing a particular algorithm for a particular application is something of an art. In practice, a large portion of the application effort goes into properly formulating the problem (asking the right question) rather than into optimizing the algorithmic details of a particular data mining method.

Because our discussion and overview of data mining methods is brief, we want to make two points clear:

First, our overview focused mainly on automated methods for extracting patterns or models from data. Although this approach is consistent with the definition we gave earlier, it does not necessarily represent what other communities might refer to as data mining. For example, some use the term to designate any manual search of the data, or search assisted by queries to a database management system, or to refer to humans visualizing patterns in data. In other communities, it refers to the automated correlation of data from transactions or the automated generation of transaction reports. We choose to focus only on methods that contain some degree of search autonomy.

Second, beware of the hype: the state of the art in automated methods for data mining is still in a fairly early stage of development. There are no established criteria for deciding which methods to use in which circumstances, and many of the approaches are based on crude heuristic approximations to avoid the expensive search required to find optimal, or even good, solutions. Hence, the reader should be wary of exaggerated claims about a system's ability to mine useful information from large (or even small) databases.

7. Application issues

Here we examine the criteria for selecting potential applications, which can be divided into practical and technical categories. The practical criteria for the KDD project are similar to those for other advanced technology applications, including the potential impact of the application, lack of simpler alternative solutions, and strong organizational support for using the technology. Privacy and legal issues should also be considered when processing personal data applications.

Technical criteria include considerations such as the availability of sufficient data (cases). In general, the more fields there are and the more complex the patterns being sought, the more data are needed. However, strong prior knowledge (discussed later) can significantly reduce the number of cases required. Another consideration is the relevance of attributes. It is important to have data attributes that are relevant to the discovery task; low noise levels (few data errors) are another consideration. A large amount of noise makes it hard to identify patterns unless there is a large sample of data that mitigates the random noise and helps clarify the aggregate patterns. Changing and time-oriented data, although making application development more difficult, can be more useful because it is easier to retrain a system than a human. Finally, and perhaps one of the most important considerations, is prior knowledge. It is useful to know something about the domain: what are the important fields, what are the likely relationships, what is the user utility function, what patterns are already known, and so on.

8. Research and Application Challenges

We outline some of the current primary research and application challenges for KDD. This list is by no means exhaustive; it is intended to give the reader a feel for the types of problems that KDD practitioners wrestle with.

(1) Larger databases

Databases with hundreds of fields and tables, millions of records, and multigigabyte sizes are commonplace, and terabyte (10^12 bytes) databases are beginning to appear as well. Methods for handling large data volumes include more efficient algorithms, sampling, approximation, and massively parallel processing.

(2) High dimensionality

Not only is there often a large number of records in a database, but there can also be a large number of fields (attributes, variables); thus, the dimensionality of the problem is high. A high-dimensional data set increases the size of the search space for model induction in a combinatorially explosive manner. It also increases the chances that a data mining algorithm will find spurious patterns that are not valid in general. Approaches to this problem include reducing the effective dimensionality of the problem and using prior knowledge to identify irrelevant variables.

(3) Overfitting

When an algorithm searches for the best parameters for a particular model using a limited data set, it can model not only the general patterns in the data but also any noise specific to that data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.
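A small sketch of how cross-validation exposes overfitting: an unrestricted decision tree fit to pure noise scores almost perfectly on its training data but near chance under cross-validation (the data and model are assumptions for illustration):

```python
# Cross-validation sketch: the gap between training accuracy and cross-validated
# accuracy is one standard way to detect the overfitting described above.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)        # pure noise: no real pattern to find

tree = DecisionTreeClassifier()          # unrestricted depth -> memorizes noise
tree.fit(X, y)
print("training accuracy:", tree.score(X, y))                                  # near 1.0
print("cross-validated accuracy:", cross_val_score(tree, X, y, cv=5).mean())   # near 0.5
```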

(4) Assessing statistical significance

A problem (related to overfitting) occurs when a system searches over many possible models. For example, if a system tests N models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant. This point is frequently overlooked in initial attempts at KDD. One way to deal with this problem is to use methods that adjust the test statistic as a function of the search, for example, Bonferroni adjustments for independent tests, or randomization testing.
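A hedged numerical sketch of this effect: running many significance tests on pure noise yields roughly N/1000 spurious "discoveries" at the 0.001 level, and a Bonferroni-style adjustment removes essentially all of them (the choice of test statistic and sample sizes are assumptions):

```python
# Sketch of the multiple-comparison problem and a Bonferroni-style correction.
# With purely random data, testing many candidate patterns at alpha = 0.001
# still yields roughly n_tests / 1000 "significant" findings by chance alone;
# dividing alpha by the number of tests is the simple adjustment mentioned above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n_tests, alpha = 10_000, 0.001

# Each "pattern" is a t-test on pure noise, so every rejection is spurious.
p_values = np.array([
    stats.ttest_1samp(rng.normal(size=30), 0.0).pvalue for _ in range(n_tests)
])

print("spurious discoveries at alpha:", (p_values < alpha).sum())          # ~ n_tests/1000
print("after Bonferroni correction:", (p_values < alpha / n_tests).sum())  # ~ 0
```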

(5) Changing data and knowledge

Rapidly changing (nonstationary) data can make previously discovered patterns invalid. In addition, the variables measured in a given application database can be modified, deleted, or augmented with new measurements over time. Possible solutions include incremental methods for updating the patterns and treating change as an opportunity for discovery by using it to cue the search for patterns of change.

(6) Missing and noisy data

This problem is especially acute in business databases. U.S. census data reportedly have error rates as high as 20 percent in some fields. Important attributes can be missing if the database was not designed with discovery in mind. Possible solutions include more sophisticated statistical strategies to identify hidden variables and dependencies.

(7) Complex relationship between fields

Hierarchical attributes or values, relationships between attributes, and more complex ways of representing knowledge of database content require algorithms that can use this information efficiently. Historically, data mining algorithms have been developed for simple attribute value records, although new techniques for deriving relationships between variables are being developed.

(8) Comprehensibility of patterns

In many applications, it is important to make discoveries easier to understand. Possible solutions include graphical representations (Buntine 1996; Heckerman 1996), rule structures, natural language generation, data and knowledge visualization techniques. Rule refinement strategies (e.g. Major and Mangano [1995]) can be used to address related problems: the discovered knowledge may be implicitly or explicitly redundant.

(9) User interaction and prior knowledge

Many current KDD methods and tools are not truly interactive and cannot easily incorporate prior knowledge about a problem except in simple ways. The use of domain knowledge is important in all steps of the KDD process. Bayesian approaches (e.g., Cheeseman [1990]) use prior probabilities over data and distributions as one form of encoding prior knowledge. Others employ deductive database capabilities to discover knowledge that is then used to guide the data mining search.

(10) Integration with other systems

A standalone discovery system might not be very useful. Typical integration issues include integration with a database management system (e.g., through a query interface), integration with spreadsheets and visualization tools, and accommodating real-time sensor readings.


Conclusion: The potential role of AI in KDD

In addition to machine learning, other AI fields are likely to make significant contributions to various aspects of the KDD process. We give a few examples here:

Natural language processing offers significant opportunities for mining free-form text, especially for the automated annotation and indexing of text corpora prior to classification. Even limited parsing capabilities can go a long way toward determining what an article refers to, so useful contributions can range from simple natural language processing all the way to language understanding. In addition, natural language processing can serve as an effective interface for stating hints to mining algorithms and for visualizing and explaining the knowledge derived by a KDD system.

Planning considers a complicated data analysis process. It involves performing complicated data access and data transformation operations; applying preprocessing routines; and, in some cases, paying attention to resource and data access constraints. Typically, data processing steps are expressed in terms of the postconditions and preconditions for the application of particular routines, which makes the steps easy to formulate as a planning problem. In addition, planning ability can play an important role in automated agents (see the next item) for collecting data samples or conducting a search to obtain needed data sets.

Intelligent agents can gather necessary information from various sources. In addition, information agents can be activated remotely via the network, or triggered when an event occurs and start analysis operations. Finally, agents can help navigate and model the World-Wide Web (Etzioni 1996), another area of ​​growing importance.

Uncertainty in AI includes issues of managing uncertainty, proper inference mechanisms in the presence of uncertainty, and reasoning about causality, all of which are fundamental to KDD theory and practice. In fact, the KDD-96 conference held a joint session with the UAI-96 conference this year (Horvitz and Jensen 1996).

Knowledge representation includes ontologies, which are new concepts for representing, storing, and accessing knowledge. Also included are schemes for representing knowledge and allowing KDD systems to use prior human knowledge about the underlying process.

These potential contributions of AI are only a few; many others, including research on human-computer interaction, knowledge acquisition techniques, and reasoning mechanisms, have opportunities to contribute to KDD.

In this article, we gave definitions of the basic notions in the KDD field. Our primary aim was to clarify the relationship between knowledge discovery and data mining. We outlined the KDD process and basic data mining methods. Given the broad spectrum of data mining methods and algorithms, our overview is inevitably limited: there are many data mining techniques, particularly specialized methods for particular types of data and domains. Although various algorithms and applications might appear quite different on the surface, it is not uncommon for them to share many common components. Understanding data mining and model induction at this component level clarifies the task of any data mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

This paper represents a step towards a common framework that we hope will eventually provide a unified view of common overall goals and methods used. We hope this will eventually lead to a better understanding of the various approaches in this multidisciplinary field and how they fit together.


Source: https://blog.csdn.net/codelady_g/article/details/122725851