This article is shared from the Huawei Cloud Community " GaussDB Database SQL Series - Data Deduplication " by Gauss Squirrel Club Assistant 2.
1. Introduction
Data deduplication is a common operation in databases. Complex business scenarios and data arriving from multiple business lines often lead to duplicate records being stored. This article uses the GaussDB database as the experimental platform and explains in detail how to remove duplicates.
2. Data deduplication application scenarios
• Database management (including backup): deduplicating data avoids storing and backing up the same records repeatedly, improves storage efficiency, and reduces backup costs.
• Data integration: when data from multiple sources is merged, deduplication prevents duplicate records from distorting the merged result.
• Data analysis (or mining): deduplication removes the interference of repeated data from analysis or mining results and improves their accuracy.
• E-commerce platforms: deduplicating products avoids listing the same product repeatedly and improves the user experience of the platform.
• Financial risk control: deduplication prevents duplicate records from skewing risk-control models and improves their accuracy.
3. Data deduplication case (GaussDB)
Actual business scenario + GaussDB database
1. Example scenario description
Take the deduplication of customer information in the insurance industry as an example. To prevent agents from contacting the same customer repeatedly (which easily causes complaints), each customer must be uniquely identified. In the following two situations the same person appears more than once, so data deduplication is required.
• Scenario 1: the same customer arrives through different channels, e.g. the customer purchased both life insurance and property insurance (two different source systems).
• Scenario 2: the same customer returns multiple times, e.g. the customer purchases repeatedly through the same channel (a renewal, or different products of the same insurance type).
2. Define duplicate data
A person is identified by "name + ID type + ID number": whenever these three fields match, the records are considered duplicates. (More complex keys exist, such as "name + ID type + ID number + mobile phone number + license plate number", but they are not covered here.)
3. Formulate deduplication rules
1) Keep one of many
• Random: according to the deduplication key, one record is kept arbitrarily.
• Priority: according to the deduplication key plus business logic, the preferred record is kept first, for example the record that indicates car or house ownership.
2) Merge into one
• Duplicate records are merged into a single record; the merging rules are determined by business logic.
4. Create test data (GaussDB)
The customer information fields mainly include "name, gender, date of birth, ID type, ID number, source, car ownership, house ownership, marital status, mobile phone number, ...".
--Create the customer information table
--(source is widened to VARCHAR(20) so the English channel names fit)
CREATE TABLE customer(
     name           VARCHAR(20)
    ,sex            INT
    ,birthday       VARCHAR(10)
    ,id_type        INT
    ,id_number      VARCHAR(20)
    ,source         VARCHAR(20)
    ,is_car         INT
    ,is_house       INT
    ,marital_status INT
    ,tel_number     VARCHAR(15)
);
--Insert test data
INSERT INTO customer VALUES('Zhang San','1','1988-01-01','1','61010019880101****','Life insurance','1','1','1','');
INSERT INTO customer VALUES('Zhang San','1','1988-01-01','1','61010019880101****','Auto insurance','1','0','1','');
INSERT INTO customer VALUES('Zhang San','1','1988-01-01','1','61010019880101****','','','','','186****0701');
INSERT INTO customer VALUES('Li Si','1','1989-01-02','1','61010019890102****','Life insurance','1','1','1','');
INSERT INTO customer VALUES('Li Si','1','1989-01-02','1','61010019890102****','Auto insurance','1','0','1','');
INSERT INTO customer VALUES('Li Si','1','1989-01-02','1','61010019890102****','','','','','186****0702');
--View the result
SELECT * FROM customer;
Tip: some INT-typed fields take their values from a dictionary (code) table, which is omitted here.
5. Write a deduplication method (GaussDB)
The following examples do not include data cleaning, data masking, or other business-specific processing; those steps are best handled as pre-processing. This example focuses on the deduplication itself.
1) Random retention: one record per deduplication key is kept, with no preference among the duplicates.
SELECT *
FROM (SELECT *
            ,ROW_NUMBER() OVER (PARTITION BY name, id_type, id_number) AS row_num
      FROM customer) t
WHERE row_num = 1;
Description:
• ROW_NUMBER(): assigns a unique, consecutive number to the rows of each partition, starting from 1. With no ORDER BY, which row receives number 1 is arbitrary, hence "random" retention. (The subquery also needs an alias, t, in GaussDB.)
• PARTITION BY col1[, col2...]: specifies the partition columns, here the deduplication key "name, ID type, ID number".
• WHERE row_num = 1: keeps only the row numbered 1 in each partition.
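The query above only returns a deduplicated result set. If the goal is to physically delete the duplicate rows, one Postgres-style sketch may work (GaussDB is PostgreSQL-derived, but ctid availability depends on the storage engine, so verify this on a test table before touching production data):

```sql
-- Sketch: physically delete duplicates, keeping one arbitrary row per key.
-- Assumes a row-store table that exposes the ctid system column.
DELETE FROM customer c1
USING customer c2
WHERE c1.name      = c2.name
  AND c1.id_type   = c2.id_type
  AND c1.id_number = c2.id_number
  AND c1.ctid      > c2.ctid;  -- keep the row with the smallest ctid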
2) Retain by priority: according to business logic, the record that contains a mobile phone number is kept in preference. If every record in a group has (or every record lacks) a mobile phone number, one of them is kept arbitrarily on that basis.
--Keep the record row containing the mobile phone number
SELECT t.*
FROM (SELECT *
            ,ROW_NUMBER() OVER (PARTITION BY name, id_type, id_number
                                ORDER BY tel_number DESC NULLS LAST) AS row_num
      FROM customer) t
WHERE t.row_num = 1;
Description:
• ROW_NUMBER(): assigns a unique, consecutive number to the rows of each partition, starting from 1.
• PARTITION BY col1[, col2...]: specifies the partition columns, here the deduplication key "name, ID type, ID number".
• ORDER BY col [ASC|DESC]: ranks the rows within each partition; together with WHERE row_num = 1, the row that sorts first is kept. DESC NULLS LAST puts rows with a mobile number first whether the missing numbers are stored as empty strings or as NULLs (GaussDB's Oracle-compatible mode stores '' as NULL), whereas a plain ASC sort would keep the wrong row when '' sorts before real numbers.
• WHERE row_num = 1: keeps only the row numbered 1 in each partition.
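Relying on how empty strings or NULLs happen to sort is fragile. A more explicit variant, sketched against the same table, states the business priority directly in a CASE expression:

```sql
-- Rows with a non-empty mobile number get rank 0 and sort first;
-- ties within the same rank are broken arbitrarily.
SELECT t.*
FROM (SELECT *
            ,ROW_NUMBER() OVER (
                 PARTITION BY name, id_type, id_number
                 ORDER BY CASE WHEN tel_number IS NOT NULL
                                AND tel_number <> '' THEN 0
                               ELSE 1
                          END
             ) AS row_num
      FROM customer) t
WHERE t.row_num = 1;
```

Additional priorities (e.g. prefer records with a house or a car) can be appended as further ORDER BY expressions.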
3) Merge and retain: based on business logic, merge the most complete and accurate field values into a single record. Here the record containing the mobile phone number is kept and then completed: the fields "car ownership, house ownership, marital status" are filled from the corresponding record whose source is "Auto insurance".
--Merge and retain
SELECT t1.name
      ,t1.sex
      ,t1.birthday
      ,t1.id_type
      ,t1.id_number
      ,t1.source
      ,t2.is_car
      ,t2.is_house
      ,t2.marital_status
      ,t1.tel_number
FROM (SELECT t.*
      FROM (SELECT *
                  ,ROW_NUMBER() OVER (PARTITION BY name, id_type, id_number
                                      ORDER BY tel_number DESC NULLS LAST) AS row_num
            FROM customer) t
      WHERE t.row_num = 1) t1
LEFT JOIN (SELECT *
           FROM customer
           WHERE source = 'Auto insurance'
             AND is_car IS NOT NULL
             AND is_house IS NOT NULL
             AND marital_status IS NOT NULL) t2
    ON  t1.name      = t2.name
    AND t1.id_type   = t2.id_type
    AND t1.id_number = t2.id_number;
Description :
Table t1 keeps one record per customer (deduplicated, preferring the row with a mobile number) and serves as the main table. Table t2 supplies the fields to be completed. The two are joined on "name + ID type + ID number", and the required fields are merged into one record.
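An alternative merge, sketched here on the assumption that missing values are stored as NULL (which MAX() ignores), collapses each key group with aggregates instead of a join. Note that MAX resolves conflicting non-missing values by taking the largest, which may differ from the join-based business rule above, so treat this as illustrative:

```sql
-- Sketch: aggregate merge. MAX() skips NULLs, so each field keeps the
-- non-missing value whenever only one record in the group provides it.
SELECT name
      ,MAX(sex)            AS sex
      ,MAX(birthday)       AS birthday
      ,id_type
      ,id_number
      ,MAX(tel_number)     AS tel_number
      ,MAX(is_car)         AS is_car
      ,MAX(is_house)       AS is_house
      ,MAX(marital_status) AS marital_status
FROM customer
GROUP BY name, id_type, id_number;
```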
6. Appendix: deduplication on all fields
In practice, records can also be duplicated across every field, for example after a repeated misoperation or a data load that doubles the rows; such records must be deduplicated as well. Besides the three methods introduced above, the keywords DISTINCT and UNION also remove duplicates, but pay attention to data volume and SQL performance (test for yourself).
1) DISTINCT (assuming the table holds only the three key fields)
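The original code for this step is not shown; a minimal sketch, assuming an illustrative table holding just the three deduplication-key fields:

```sql
-- DISTINCT removes rows whose selected columns are all identical
SELECT DISTINCT name, id_type, id_number
FROM customer;
```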
2) UNION (assuming the table holds only the three key fields)
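Similarly, a sketch using UNION, which (unlike UNION ALL) removes duplicate rows from its combined result, and is therefore handy when merging two source tables; life_customer and auto_customer are hypothetical table names:

```sql
-- UNION deduplicates across the combined result of both queries
SELECT name, id_type, id_number FROM life_customer
UNION
SELECT name, id_type, id_number FROM auto_customer;
```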
4. Suggestions for improving data deduplication efficiency
The best deduplication is to intercept duplicates at the source. Business realities make that impossible to achieve completely, but deduplication efficiency can still be improved:
• Choose an appropriate deduplication algorithm
According to the characteristics and scale of the data set, selecting a suitable deduplication algorithm can greatly improve the deduplication efficiency.
• Optimize data storage structure
Using appropriate data storage structures, such as hash tables, B+ trees, etc., can speed up data search and comparison, thereby improving the efficiency of deduplication.
• Parallel processing
Parallel processing is used to divide the data set into multiple subsets, perform deduplication processing separately, and finally merge the results, which can greatly speed up deduplication.
• Use indexes to speed up searches
Indexing key fields in the data set can speed up searches and comparisons, thereby improving the efficiency of deduplication.
• Pre-filtering
Pre-filter the data set with simple cleaning steps first, such as removing null values and invalid characters; this reduces the number of comparisons and thereby improves deduplication efficiency.
• Deduplication result cache (temporary table)
Caching deduplication results can avoid repeated calculations, thereby improving deduplication efficiency.
• In-place rewriting not recommended (back up first)
Especially for partitioned tables, avoid writing the deduplicated result set straight back to the production table; create a temporary table and swap it in, or back the table up before the operation.
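The "result cache" suggestion above can be sketched as follows; customer_dedup is a hypothetical name, and a temporary table is dropped automatically at the end of the session:

```sql
-- Cache the deduplicated result once, then reuse it in later steps
-- instead of recomputing the window function each time.
CREATE TEMPORARY TABLE customer_dedup AS
SELECT t.*
FROM (SELECT *
            ,ROW_NUMBER() OVER (PARTITION BY name, id_type, id_number
                                ORDER BY tel_number DESC NULLS LAST) AS row_num
      FROM customer) t
WHERE t.row_num = 1;
```

Validate the cached result (row counts, spot checks) before using it to replace or update the production table.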
5. Summary
Data deduplication touches many aspects: discovering duplicate data, defining deduplication rules, choosing methods, efficiency, and the associated difficulties and challenges. There is only one guiding principle: be business-oriented. Define what counts as duplicate data, and formulate the deduplication rules and plan from business needs. Duplicate-data scenarios also arise when using the GaussDB database; this article has introduced the application background, a case study, and the deduplication solutions. You are welcome to test them and discuss.