"Data Mining Concepts and Techniques" Secretary 3

Data preprocessing

Welcome to the real world!

Data preprocessing techniques:

  • Data cleaning: removes noise from the data and corrects inconsistencies.
  • Data integration: combines data from multiple sources into a coherent data store, such as a data warehouse.
  • Data reduction: reduces the size of the data by, for example, aggregating, removing redundant features, or clustering.
  • Data transformation: for example, normalization, which scales the data into a smaller range such as [0.0, 1.0].

These techniques are not mutually exclusive and can be used together. Data cleaning may involve transformations that correct erroneous data.

Data quality

Data quality includes accuracy, completeness, consistency, timeliness, trustworthiness, and interpretability.

The reality is that the data you want to analyze with data mining techniques is often incomplete (missing attribute values or attributes of interest, or containing only aggregated data), inaccurate or noisy (containing errors, or values that deviate from what is expected), and inconsistent (e.g., containing discrepancies in the codes used to categorize products).


Next, let's analyze the causes of each of these quality problems:

| Data quality dimension | Causes | Description |
| --- | --- | --- |
| Inaccurate | Malfunction of the device collecting the data; users unwilling to disclose personal information deliberately entering wrong values into mandatory fields (e.g. a birthday of January 1); inconsistent naming conventions or input formats | The data contains incorrect attribute values |
| Incomplete | Values ignored at entry; relevant data not recorded, possibly due to equipment failure; historical or modified data ignored | The data has missing values |
| Inconsistent | Attribute definitions differ (e.g. different market evaluation standards); the same numeric attribute goes by different names across sources | The data contains discrepancies and redundancy |
| Untimely | Data not updated promptly; e.g. month-end data arriving too late to influence an assessment | The data is not kept up to date |
| Untrustworthy | Data artificially altered to engineer a desired result | Reflects how much the data is trusted by users |
| Uninterpretable | Data encodings are department-specific; e.g. accounting codes that the sales department cannot understand | Reflects whether the data is easy to understand |

The main tasks of data preprocessing

Data cleaning

"Clean" data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Works to avoid the function being modeled overfitting the data.

Data integration

Attributes representing the same concept may have different names in different databases, leading to inconsistencies and redundancy. Typically, when preparing data for a data warehouse, data cleaning and integration are performed as preprocessing steps. Data cleaning can also be performed again to detect and remove redundancies that may be caused by integration.

Data reduction

Faced with huge amounts of data, how can we reduce the size of the dataset without compromising the data mining results?

  • Dimensionality reduction: using wavelet transforms, PCA, attribute subset selection, and attribute construction.
  • Numerosity reduction: replacing the data with smaller representations using regression and log-linear models, or histograms, clustering, sampling, and data cube aggregation (one technique from each bullet is sketched below).
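
To make this concrete, here is a minimal sketch of one technique from each bullet: PCA (via an SVD) for dimensionality reduction and simple random sampling for numerosity reduction. The data is synthetic, and the choices of 2 components and a 10% sample are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))   # synthetic data: 1000 tuples, 10 attributes

# Dimensionality reduction: project the centered data onto its
# top-2 principal components, computed with an SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T         # shape (1000, 2)

# Numerosity reduction: keep a 10% simple random sample of the tuples.
sample = X[rng.choice(len(X), size=100, replace=False)]

print(X_reduced.shape, sample.shape)   # (1000, 2) (100, 10)
```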

Data transformation

  • Discretization and concept hierarchy generation
  • Normalization (both are sketched below)
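
A minimal sketch of both bullets, assuming made-up age values: min-max normalization into [0, 1], z-score normalization, and a crude equal-width discretization into three interval labels (one level of a concept hierarchy such as young / middle-aged / senior).

```python
import numpy as np

ages = np.array([13, 15, 16, 19, 20, 21, 35, 40, 52, 70], dtype=float)

# Min-max normalization: scale values into the smaller range [0, 1].
minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Z-score normalization: center on the mean, scale by the std deviation.
zscore = (ages - ages.mean()) / ages.std()

# Equal-width discretization into 3 bins: each age is replaced by the
# index (0, 1, or 2) of the interval it falls into.
edges = np.linspace(ages.min(), ages.max(), 4)[1:-1]   # interior cut points
labels = np.digitize(ages, edges)

print(minmax.round(2), zscore.round(2), labels, sep="\n")
```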

In conclusion, real-world data is generally dirty, incomplete, and inconsistent. Data preprocessing techniques can improve data quality, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Since high-quality decisions necessarily depend on high-quality data, data preprocessing is an important step in the knowledge discovery process. Detecting data anomalies, correcting them as early as possible, and reducing the amount of data to be analyzed pay high dividends at decision time.

Data cleaning

Missing values

  1. Ignore the tuple.
  2. Fill in the missing value manually.
  3. Fill in the missing value with a global constant, such as "Unknown".
  4. Fill in the missing value with a measure of central tendency of the attribute: the mean for symmetric data distributions, the median for skewed ones.
  5. Fill in the missing value with the attribute mean or median of all samples belonging to the same class as the given tuple.
  6. Fill in the missing value with the most likely value, predicted using regression, Bayesian inference, or a decision tree (the most principled approach; methods 3-5 are sketched below).
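
Here is a minimal pandas sketch of methods 3-5; the DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: 'income' has missing values, 'class' is a class label.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [3000.0, None, 5200.0, None, 4800.0],
})

# 3. Fill with a global constant.
by_constant = df["income"].fillna("Unknown")

# 4. Fill with a central tendency measure of the attribute:
#    the mean for symmetric data, the median for skewed data.
by_mean   = df["income"].fillna(df["income"].mean())
by_median = df["income"].fillna(df["income"].median())

# 5. Fill with the mean of samples in the same class as the given tuple.
by_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(by_class_mean.tolist())   # [3000.0, 3000.0, 5200.0, 5000.0, 4800.0]
```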

Noisy data

Noise: the random error or variance in a measured variable.
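
One common way to smooth noise is binning. A minimal sketch of smoothing by bin means, assuming nine sorted price values and equal-frequency bins of depth 3:

```python
import numpy as np

# Sorted (hypothetical) price data containing random fluctuations.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Partition into equal-frequency (equal-depth) bins of 3 values each,
# then replace every value with its bin mean to smooth out the noise.
bins = prices.reshape(-1, 3)                 # 3 bins x 3 values
smoothed = np.repeat(bins.mean(axis=1), 3)

print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```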

Data integration

Entity identification problem

When matching the attributes of one database to those of another, special attention must be paid to the structure of the data, to ensure that functional dependencies and referential constraints in the source system match those in the target system.

For example, in one system a discount may be applied to the whole order, whereas in another it is applied to each individual item within the order.
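
A sketch of reconciling such a naming mismatch before integration, with hypothetical tables and column names (cust_id vs. customer_id for the same entity key):

```python
import pandas as pd

# Two hypothetical sources that name the same attribute differently.
orders = pd.DataFrame({"cust_id": [1, 2, 2], "amount": [9.5, 20.0, 5.0]})
crm    = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bo"]})

# Entity identification step: map customer_id onto cust_id,
# then integrate the two sources on the shared key.
merged = orders.merge(
    crm.rename(columns={"customer_id": "cust_id"}),
    on="cust_id", how="left",
)
print(merged)
```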

Redundancy and correlation analysis

Redundancy: an attribute is redundant if it can be "derived" from another attribute or set of attributes.

Correlation analysis:

  1. Chi-square (χ²) test for nominal data

For two nominal attributes A and B, the statistic is χ² = Σᵢ Σⱼ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ, where oᵢⱼ is the observed frequency of the joint event (Aᵢ, Bⱼ) and eᵢⱼ is its expected frequency under independence; the larger χ² is, the more likely the attributes are correlated.

  2. Correlation coefficient for numeric data
  3. Covariance for numeric data (all three measures are sketched below)
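
A minimal sketch of all three measures with numpy and scipy; the 2x2 contingency table and the two numeric series are made up.

```python
import numpy as np
from scipy.stats import chi2_contingency

# 1. Chi-square test on a contingency table of two nominal attributes.
observed = np.array([[250,  200],
                     [ 50, 1000]])
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")   # tiny p -> correlated attributes

# 2. Pearson correlation coefficient of two numeric attributes (in [-1, 1]).
a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])
print("correlation:", np.corrcoef(a, b)[0, 1])

# 3. Covariance: its sign shows whether the attributes rise and fall together.
print("covariance :", np.cov(a, b)[0, 1])
```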
