Examples to explain database data deduplication

This article is shared from the Huawei Cloud Community " GaussDB Database SQL Series - Data Deduplication " by Gauss Squirrel Club Assistant 2.

1. Introduction

Data deduplication is a common operation in databases. Complex business scenarios and data sourced from multiple business lines often lead to duplicate data being stored. This article uses the GaussDB database as the experimental platform and explains in detail how to remove duplicates.

2. Data deduplication application scenarios

Database management (including backup): Deduplicating data in the database avoids storing and backing up the same data repeatedly, improves storage efficiency, and reduces backup storage costs.

Data integration: In the process of data integration, data from multiple data sources needs to be merged. Deduplication can avoid the impact of duplicate data on the merged results. 

Data analysis (or mining): When performing data analysis or data mining, deduplication avoids the interference of duplicate data on the results and improves accuracy.

E-commerce platform: Deduplicating products on the e-commerce platform can avoid listing the same products repeatedly and improve the user experience of the platform. 

Financial risk control: In the field of financial risk control, deduplication can avoid the impact of duplicate data on the risk control model and improve the accuracy of risk control. 

3. Data deduplication case (GaussDB)

Actual business scenario + GaussDB database

1. Example scene description

Take the deduplication of customer information in the insurance industry as an example. To prevent agents from repeatedly contacting the same customer (which easily leads to complaints), each customer needs to be uniquely identified. In the following two scenarios, the same person must be identified uniquely, so data deduplication is required.

• Scenario 1: The same customer comes from different source channels: the customer purchased both life insurance and property insurance (two different source systems);

• Scenario 2: The same customer returns multiple times: the customer purchases repeatedly through the same channel (renewal, or purchase of different products of the same insurance type).

2. Define duplicate data

A person is uniquely identified through "name + ID type + ID number". That is, as long as these three fields match, the rows are considered duplicates. (There are of course more complex keys, such as "name + ID type + ID number + mobile phone number + license plate number", which are not covered in detail this time.)
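Before formulating rules, it helps to see which rows actually collide on this key. A minimal sketch, written against the customer table created later in this article:

```sql
--Sketch: list the deduplication keys that appear more than once
SELECT name, ID_type, ID_number, COUNT(*) AS cnt
FROM customer
GROUP BY name, ID_type, ID_number
HAVING COUNT(*) > 1;
```

Each returned row is one group of duplicates; cnt tells you how many copies the group contains.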

3. Formulate deduplication rules

1) Keep one of many

• Random: according to the deduplication rules, one record is retained at random.

• Priority: according to the deduplication rules plus business logic, the record the business needs most is retained first, for example prioritizing records where "has a house" or "has a car" is populated.

2) Merge into one

• Duplicate records are merged into a single record; the merging rules are determined by business logic.

4. Create test data (GaussDB)

The customer information fields mainly include "name, gender, date of birth, ID type, ID number, source, car ownership, house ownership, marital status, mobile phone number, ...".

--Create customer information table
CREATE TABLE customer(
     name            VARCHAR(20)
    ,sex             INT
    ,birthday        VARCHAR(10)
    ,ID_type         INT
    ,ID_number       VARCHAR(20)
    ,source          VARCHAR(10)
    ,IS_car          INT
    ,IS_house        INT
    ,marital_status  INT
    ,tel_number      VARCHAR(15)
);
--Insert test data
INSERT INTO customer VALUES('Zhang San','1','1988-01-01','1','61010019880101****','Life Insurance','1','1','1','');
INSERT INTO customer VALUES('Zhang San','1','1988-01-01','1','61010019880101****','Auto Insurance','1','0','1','');
INSERT INTO customer VALUES('Zhang San','1','1988-01-01','1','61010019880101****','','','','','186****0701');
INSERT INTO customer VALUES('Li Si','1','1989-01-02','1','61010019890102****','Life Insurance','1','1','1','');
INSERT INTO customer VALUES('Li Si','1','1989-01-02','1','61010019890102****','Auto Insurance','1','0','1','');
INSERT INTO customer VALUES('Li Si','1','1989-01-02','1','61010019890102****','','','','','186****0702');

--View results
SELECT * FROM customer;

Tip: Some INT-type field values are codes taken from a dictionary table, which is omitted here.

5. Write a deduplication method (GaussDB)

The following examples do not cover data cleaning, data masking, business-logic processing, and the like; those steps are best handled as pre-processing. This example focuses on describing the deduplication process itself.

1) Random retention: according to business logic, one record is retained at random.

SELECT t.*
FROM (SELECT *
            ,ROW_NUMBER() OVER (PARTITION BY name,id_type,id_number) AS row_num
        FROM customer) t
WHERE t.row_num = 1;

Description:

• ROW_NUMBER(): assigns each row a unique, consecutive number within its partition, starting from 1. Without an ORDER BY in the window, which row receives number 1 is arbitrary, which is what makes this retention "random".

• PARTITION BY col1[, col2...]: specifies the partition columns, here the deduplication key "name, ID type, ID number".

• WHERE row_num = 1: keeps only the row numbered 1 by ROW_NUMBER().

2) Retain by priority: according to business logic, the record containing a mobile phone number is retained first. If several records all contain (or all lack) a mobile phone number, one of them is retained at random on that basis.

--Keep the record row containing the mobile phone number
--DESC: missing numbers are stored as '', which sorts before digits, so descending order puts populated numbers first
SELECT t.*
FROM (SELECT *
            ,ROW_NUMBER() OVER (PARTITION BY name,id_type,id_number ORDER BY tel_number DESC) AS row_num
        FROM customer) t
WHERE t.row_num = 1;

Description:

• ROW_NUMBER(): assigns each row a unique, consecutive number within its partition, starting from 1.

• PARTITION BY col1[, col2...]: specifies the partition columns, here the deduplication key "name, ID type, ID number".

• ORDER BY col [ASC|DESC]: specifies the sort column and direction within each partition; with row_num = 1, the row that sorts first is kept (the smallest value under ASC, the largest under DESC). Because missing phone numbers are stored as empty strings, which sort before digits, descending order keeps the row that has a phone number.

• WHERE row_num = 1: keeps only the row numbered 1 by ROW_NUMBER().

3) Merge retention: based on business logic, merge the most complete and accurate field values into a single record. For example, the record containing the mobile phone number is retained and then completed: the fields "car ownership, house ownership, marital status" are filled in from the corresponding record whose source is "Auto Insurance".

--Merge and retain
SELECT t1.name
      ,t1.sex
      ,t1.birthday
      ,t1.id_type
      ,t1.id_number
      ,t1.source
      ,t2.is_car
      ,t2.is_house
      ,t2.marital_status
      ,t1.tel_number
FROM
    (SELECT t.*
       FROM (SELECT *
                   ,ROW_NUMBER() OVER (PARTITION BY name,id_type,id_number ORDER BY tel_number DESC) AS row_num
               FROM customer) t
      WHERE t.row_num = 1) t1
LEFT JOIN
    (SELECT *
       FROM customer
      WHERE source = 'Auto Insurance' AND is_car IS NOT NULL AND is_house IS NOT NULL AND marital_status IS NOT NULL) t2
ON  t1.name = t2.name
AND t1.id_type = t2.id_type
AND t1.id_number = t2.id_number;

Description:

Table t1 retains the record rows containing a mobile phone number (after deduplication) and serves as the main table; table t2 supplies the fields to be completed. The two tables are joined on "name + ID type + ID number", and the required information is merged. (If t2 can contain more than one auto-insurance record per customer, deduplicate t2 in the same way first, otherwise the join reintroduces duplicates.)

6. Attachment: deduplication when all fields are identical

In database applications, all fields of a row may be duplicated, for example because of repeated misoperations or accidental data doubling; such rows must also be deduplicated. In addition to the three methods introduced above, the keywords DISTINCT and UNION can also remove duplicates, but pay attention to data volume and SQL performance. (Test for yourself.)

1) DISTINCT (assuming the table contains only the three deduplication fields)
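A minimal sketch: DISTINCT keeps one copy of each fully duplicated row.

```sql
--Keep one copy of each fully duplicated row
SELECT DISTINCT name, id_type, id_number
FROM customer;
```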

2) UNION (assuming the table contains only the three deduplication fields)
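UNION (unlike UNION ALL) removes duplicates from its combined result set, so it can deduplicate while merging two sources. A sketch, where customer_new is a hypothetical staging table with the same columns:

```sql
--UNION removes duplicates across both inputs; UNION ALL would keep them
SELECT name, id_type, id_number FROM customer
UNION
SELECT name, id_type, id_number FROM customer_new;  --customer_new is hypothetical
```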

4. Suggestions for improving data deduplication efficiency

The best deduplication is to "intercept" duplicate data at the source. That cannot be avoided entirely, given how business data flows, but deduplication efficiency can be improved:

Choose an appropriate deduplication algorithm 

According to the characteristics and scale of the data set, selecting a suitable deduplication algorithm can greatly improve the deduplication efficiency.

Optimize data storage structure 

Using appropriate data storage structures, such as hash tables, B+ trees, etc., can speed up data search and comparison, thereby improving the efficiency of deduplication.

Parallel processing 

Parallel processing is used to divide the data set into multiple subsets, perform deduplication processing separately, and finally merge the results, which can greatly speed up deduplication.

Use indexes to speed up searches 

Indexing key fields in the data set can speed up searches and comparisons, thereby improving the efficiency of deduplication.
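For example, a composite index on the deduplication key could look like this (the index name is hypothetical):

```sql
--Hypothetical composite index on the deduplication key
CREATE INDEX idx_customer_dedup ON customer(name, ID_type, ID_number);
```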

Pre-filtering 

First perform simple filtering and cleanup on the data set, such as removing null values and invalid characters; this reduces the number of comparisons and thereby improves deduplication efficiency.
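A sketch of such a pre-filter on the example table, dropping rows whose key fields are missing before any deduplication runs:

```sql
--Pre-filter: drop rows whose deduplication key is null or empty
SELECT *
FROM customer
WHERE name IS NOT NULL AND name <> ''
  AND ID_number IS NOT NULL AND ID_number <> '';
```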

Deduplication result cache (temporary table) 

Caching deduplication results can avoid repeated calculations, thereby improving deduplication efficiency.
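One way to cache a deduplicated result is a temporary table built once and then reused (the table name customer_dedup is hypothetical):

```sql
--Cache the deduplicated rows once, then query customer_dedup instead of recomputing
CREATE TEMPORARY TABLE customer_dedup AS
SELECT t.*
FROM (SELECT *
            ,ROW_NUMBER() OVER (PARTITION BY name,id_type,id_number) AS row_num
        FROM customer) t
WHERE t.row_num = 1;
```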

Direct rewriting not recommended (back up first)

For partitioned tables in particular, do not rewrite the deduplicated result set directly into the production table; replace it through a temporary table instead, or back the table up before writing.

5. Summary

Data deduplication touches many aspects: discovering duplicate data, defining deduplication rules, choosing methods, efficiency, and the associated difficulties and challenges. There is only one guiding principle: be business-oriented. Define what counts as duplicate data, and formulate deduplication rules and plans, according to business needs. Duplicate-data scenarios also arise when using the GaussDB database; this article has introduced the application background, cases, and deduplication solutions. You are welcome to test and discuss.
