This article is shared from the Huawei Cloud Community post "Introduction to the Data Desensitization Principles and Usage Methods of GaussDB (DWS) Security Management" by VV Yixiao.
1. Introduction
- Applicable versions: 8.2.0 and above
The data desensitization function of GaussDB (DWS) is an important technological breakthrough that builds data security capabilities into the database product itself. It desensitizes column-level sensitive data for a designated range of users. It is flexible, efficient, transparent and user-friendly, greatly strengthens the product's data security capabilities, and provides reliable protection for sensitive data.
The arrival of the big data era has upended the operating model of traditional businesses and unlocked new production potential. Data has become an important factor of production and a carrier of information, and the flow of data carries further, higher-order value. For data controllers and data processors, maximizing the value of data in circulation is the original purpose of data mining. However, a series of information leakage incidents has drawn growing attention to data security. Countries and regions are gradually establishing and improving laws and regulations on data security and privacy protection to give user privacy a legal safeguard. Strengthening data security and privacy protection at the technical level places more functional requirements on the data warehouse product itself, and meeting them inside the product is also the most effective way to build data security.
GaussDB (DWS) released the data desensitization feature in version 8.1.1. It desensitizes column-level sensitive data for a specified range of users.
2. The concept of data desensitization
Data desensitization (data masking), as the name suggests, shields sensitive data. Any data that could cause serious harm to society or individuals if leaked is sensitive data. Common examples include personally identifiable information such as name, ID number, address, mobile phone number, and email address; information that enterprises should not disclose, such as business license number, tax registration certificate, and employee salary; device information such as IP address and MAC address; and bank card numbers, protected health information, intellectual property, and so on. Desensitization rules transform this sensitive information to achieve reliable protection of private data. Common desensitization rules in the industry include replacement, rearrangement, encryption, truncation, and masking. Users can also define custom desensitization rules based on the desired desensitization algorithm.
Generally, a good data desensitization implementation follows two principles. First, retain as much meaningful information from the original data as possible for the applications that consume the desensitized data; second, make it as hard as possible for attackers to recover the original data.
Data desensitization is divided into static data desensitization and dynamic data desensitization. Static data desensitization is the "relocation and simulated replacement" of data: the data is extracted, desensitized, and delivered to downstream stages, where it can be freely accessed, read and written. The desensitized data is isolated from the production environment, which meets business needs while keeping the production database secure. Dynamic data desensitization desensitizes sensitive data in real time as it is accessed. It can apply different desensitization schemes for different roles, permissions, and data types, ensuring that the returned data is both usable and safe.
2.1 DWS dynamic data desensitization
The data desensitization function of GaussDB (DWS) avoids the pain points of desensitizing at the business application layer, namely high dependence and high cost, and builds data desensitization into the security capabilities of the database product itself. It provides a complete, secure, flexible, transparent and friendly data desensitization solution, and it is a form of dynamic data desensitization. After identifying the sensitive fields, a user can create a desensitization policy (redaction policy) on the target fields by binding built-in masking functions to them. Redaction policies have a many-to-one relationship with the table object. A policy contains three key elements: the table object, the effective condition, and the masked column-masking function pairs. It is a collection of all masked columns on the table object. Different fields can use different masking functions based on their data characteristics, and different policies can be set on the same table depending on the effective conditions and the masked column-masking function pairs. A query statement triggers the desensitization of sensitive data if and only if the effective condition is true. The desensitization process is built into the SQL engine and is transparent to users of the production environment.
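The three elements of a policy can be pictured with a small Python model. This is only an illustrative sketch, not the product's implementation: the class `RedactionPolicy`, its fields, and the lambda standing in for a masking function are hypothetical, and the table and user names are borrowed from the walkthrough in section 3. Leaving NULL values untouched is an assumption inferred from the sample query output in that section.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

Row = Dict[str, Any]


@dataclass
class RedactionPolicy:
    """Toy policy: table object, effective condition, column-function pairs."""
    table: str
    condition: Callable[[str], bool]  # receives the current user name
    columns: Dict[str, Callable[[Any], Any]] = field(default_factory=dict)

    def apply(self, row: Row, current_user: str) -> Row:
        # Masking is triggered if and only if the effective condition is true.
        if not self.condition(current_user):
            return dict(row)
        return {
            name: (self.columns[name](value)
                   if name in self.columns and value is not None else value)
            for name, value in row.items()
        }


# Hypothetical policy: everyone except alice sees card_no as the fixed value 0.
policy = RedactionPolicy(
    table="emp",
    condition=lambda user: user != "alice",
    columns={"card_no": lambda value: 0},
)

row = {"id": 1, "name": "anny", "card_no": 1234123412341234}
print(policy.apply(row, "alice"))  # condition false: original row
print(policy.apply(row, "matu"))   # condition true: card_no masked
```

The effective condition is evaluated per query and per user, so the stored data never changes; only the returned row differs.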
Third-party agent desensitization tool vs. data warehouse DWS desensitization engine
- The third-party proxy tool is a relay between users and the data warehouse cluster. It is an external add-on outside the data warehouse base, cannot participate directly in the production environment, and has difficulty handling complex scenarios.
- The DWS desensitization engine is built on the data warehouse base and interacts directly with the storage engine and SQL engine. It desensitizes data in real time during query execution and returns the desensitized results directly to the user.
- The agent desensitization tool is static desensitization, and the DWS desensitization engine is dynamic desensitization.
Advantages of DWS dynamic desensitization engine
- Good synergy with the base. The desensitization engine is woven into many parts of the data warehouse base and takes part in parsing, rewriting, optimization and execution in the SQL engine according to the preset desensitization policies. The desensitization process is invisible to the user.
- Policies are configurable. Customers can identify sensitive data based on their own business scenarios and flexibly preset desensitization policies for designated columns of business tables.
- Policies are extensible. The built-in masking functions cover most common desensitization effects, and user-defined masking functions are supported.
- Data remains usable. The original sensitive data participates in computation inside the database and is desensitized only when it leaves the database (when results are returned).
- Data access is controlled. The original sensitive data is not visible to users for whom the desensitization policy takes effect.
- No leakage in any scenario. Interacting directly with the base reduces the risk of leakage along the transmission path of sensitive data, which is more secure and reliable. The engine also identifies various potential malicious attempts to extract sensitive data and protects against them.
3. How to use data desensitization
Dynamic data desensitization decides in real time, during query execution, whether to desensitize data based on whether the effective condition is met. Effective conditions are usually based on the current user role: the visible range of sensitive data is preset for different users. The system administrator has the highest authority and can see any field in any table at any time. Identifying the restricted user roles is the first step in creating a desensitization policy.
Which information is sensitive depends on the actual business scenario and security dimension. Taking natural persons as an example, the sensitive fields of an individual user include name, ID number, mobile phone number, email address, and so on. In a banking system, a customer's sensitive fields may also include bank card number, expiration date, and payment password; in a company system, an employee's may also include salary and educational background; in a medical system, a patient's may also include medical treatment information. Identifying and sorting out the sensitive fields of the specific business scenario is therefore the second step in creating a desensitization policy.
The product has a set of common masking function interfaces built in. Parameters can be specified for different data types and data characteristics to achieve different desensitization effects. The following three built-in interfaces are available, and custom masking functions are also supported. Because the three built-in functions cover the desensitization effects needed in most scenarios, custom masking functions are not recommended.
- MASK_NONE: No desensitization, only for internal testing.
- MASK_FULL: Full desensitization to a fixed value.
- MASK_PARTIAL: Use the specified masking characters to partially mask the content within the masking range.
Different desensitization columns can use different desensitization functions. For example, mobile phone numbers usually display the last four digits and replace the front numbers with "*"; the amount is uniformly displayed as a fixed value of 0, etc. Determining the desensitization function that needs to be bound to the desensitization column is the third step in creating a desensitization strategy.
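To make the effect of the built-in functions concrete, the following Python sketch imitates MASK_FULL and MASK_PARTIAL. It is an approximation inferred from the sample outputs in this article, not the product's implementation: the function names, the 1-based inclusive position arguments, the rule that numeric positions count digits only, and the email value are assumptions. The fixed values for MASK_FULL are limited to the two shown in the article (0 for numbers, 1970-01-01 for timestamps).

```python
from datetime import datetime
from decimal import Decimal


def mask_full(value):
    """Replace the whole value with a fixed value chosen by type."""
    if value is None:
        return None
    if isinstance(value, (int, float, Decimal)):
        return 0
    if isinstance(value, datetime):
        return datetime(1970, 1, 1)
    # The article does not show the fixed value used for other types.
    raise NotImplementedError(type(value).__name__)


def mask_partial_text(value, mask_char, start, end=None):
    """Mask characters at 1-based positions start..end (to the end if omitted)."""
    if value is None:
        return None
    end = len(value) if end is None else min(end, len(value))
    chars = list(value)
    for i in range(start - 1, end):
        chars[i] = mask_char
    return "".join(chars)


def mask_partial_number(text, mask_digit, start, end):
    """Mask digits start..end of a numeric string; non-digits are not counted."""
    if text is None:
        return None
    out, pos = [], 0
    for c in text:
        if c.isdigit():
            pos += 1
            out.append(mask_digit if start <= pos <= end else c)
        else:
            out.append(c)
    return "".join(out)


def mask_partial_format(value, in_fmt, out_fmt, mask_char, start, end):
    """'V' marks maskable characters, 'F' marks characters to drop on input."""
    if value is None:
        return None
    payload = [c for c, f in zip(value, in_fmt) if f == "V"]
    masked = [mask_char if start <= i <= end else c
              for i, c in enumerate(payload, start=1)]
    it = iter(masked)
    return "".join(next(it) if f == "V" else f for f in out_fmt)


print(mask_partial_text("13420002340", "*", 4))
print(mask_partial_format("1234-1234-1234-1234", "VVVVFVVVVFVVVVFVVVV",
                          "VVVV-VVVV-VVVV-VVVV", "#", 1, 12))
print(mask_partial_number("10000.0000", "9", 1, len("10000.0000") - 2))
email = "user@example.com"  # hypothetical address
print(mask_partial_text(email, "*", 1, email.index("@") + 1))
```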
Taking a company's employee table emp, the table owner alice, and the users matu and july as an example, this section briefly walks through how data desensitization is used. Table emp contains private employee data such as names, mobile phone numbers, email addresses, payroll card numbers, and salaries. User alice is the human resources manager; users matu and july are ordinary employees.
It is assumed that the table, the users, and the users' query permissions on table emp are already in place.
1. Create a desensitization policy mask_emp that allows only alice to view all employee information; payroll card numbers and salaries are not visible to matu and july. The field card_no is a numeric type and uses MASK_FULL to fully desensitize it to the fixed value 0. The field card_string is a character type and uses MASK_PARTIAL to partially desensitize the original data according to the specified input and output formats. The field salary is a numeric type and uses MASK_PARTIAL with the digit 9 to mask the leading digits, from position 1 through position length(salary) - 2.
postgres=# CREATE REDACTION POLICY mask_emp ON emp WHEN (current_user != 'alice')
    ADD COLUMN card_no WITH mask_full(card_no),
    ADD COLUMN card_string WITH mask_partial(card_string, 'VVVVFVVVVFVVVVFVVVV','VVVV-VVVV-VVVV-VVVV','#',1,12),
    ADD COLUMN salary WITH mask_partial(salary, '9', 1, length(salary) - 2);
Switch to matu and july and view the employee table emp.
postgres=> SET ROLE matu PASSWORD 'Gauss@123';
postgres=> SELECT * FROM emp;
 id | name |  phone_no   | card_no |     card_string     |        email         |   salary   |      birthday
----+------+-------------+---------+---------------------+----------------------+------------+---------------------
  1 | anny | 13420002340 |       0 | ####-####-####-1234 | [email protected]    | 99999.9990 | 1999-10-02 00:00:00
  2 | bob  | 18299023211 |       0 | ####-####-####-3456 | [email protected]    |  9999.9990 | 1989-12-12 00:00:00
  3 | cici | 15512231233 |         |                     | [email protected]    |            | 1992-11-06 00:00:00
(3 rows)
postgres=> SET ROLE july PASSWORD 'Gauss@123';
postgres=> SELECT * FROM emp;
 id | name |  phone_no   | card_no |     card_string     |        email         |   salary   |      birthday
----+------+-------------+---------+---------------------+----------------------+------------+---------------------
  1 | anny | 13420002340 |       0 | ####-####-####-1234 | [email protected]    | 99999.9990 | 1999-10-02 00:00:00
  2 | bob  | 18299023211 |       0 | ####-####-####-3456 | [email protected]    |  9999.9990 | 1989-12-12 00:00:00
  3 | cici | 15512231233 |         |                     | [email protected]    |            | 1992-11-06 00:00:00
(3 rows)
2. Due to a job change, matu has moved to the human resources department to take part in recruitment and should also be able to see all employee information. Modify the effective condition of the policy accordingly.
postgres=> ALTER REDACTION POLICY mask_emp ON emp WHEN(current_user NOT IN ('alice', 'matu'));
Switch to users matu and july, and view the employee table emp again.
postgres=> SET ROLE matu PASSWORD 'Gauss@123';
postgres=> SELECT * FROM emp;
 id | name |  phone_no   |     card_no      |     card_string     |        email         |   salary   |      birthday
----+------+-------------+------------------+---------------------+----------------------+------------+---------------------
  1 | anny | 13420002340 | 1234123412341234 | 1234-1234-1234-1234 | [email protected]    | 10000.0000 | 1999-10-02 00:00:00
  2 | bob  | 18299023211 | 3456345634563456 | 3456-3456-3456-3456 | [email protected]    |  9999.9900 | 1989-12-12 00:00:00
  3 | cici | 15512231233 |                  |                     | [email protected]    |            | 1992-11-06 00:00:00
(3 rows)
postgres=> SET ROLE july PASSWORD 'Gauss@123';
postgres=> SELECT * FROM emp;
 id | name |  phone_no   | card_no |     card_string     |        email         |   salary   |      birthday
----+------+-------------+---------+---------------------+----------------------+------------+---------------------
  1 | anny | 13420002340 |       0 | ####-####-####-1234 | [email protected]    | 99999.9990 | 1999-10-02 00:00:00
  2 | bob  | 18299023211 |       0 | ####-####-####-3456 | [email protected]    |  9999.9990 | 1989-12-12 00:00:00
  3 | cici | 15512231233 |         |                     | [email protected]    |            | 1992-11-06 00:00:00
(3 rows)
3. The employee fields phone_no, email and birthday are also private data. Update the mask_emp policy to add these three masked columns.
postgres=> ALTER REDACTION POLICY mask_emp ON emp ADD COLUMN phone_no WITH mask_partial(phone_no, '*', 4);
postgres=> ALTER REDACTION POLICY mask_emp ON emp ADD COLUMN email WITH mask_partial(email, '*', 1, position('@' in email));
postgres=> ALTER REDACTION POLICY mask_emp ON emp ADD COLUMN birthday WITH mask_full(birthday);
Switch to user july and view the employee table emp.
postgres=> SET ROLE july PASSWORD 'Gauss@123';
postgres=> SELECT * FROM emp;
 id | name |  phone_no   | card_no |     card_string     |        email         |   salary   |      birthday
----+------+-------------+---------+---------------------+----------------------+------------+---------------------
  1 | anny | 134******** |       0 | ####-####-####-1234 | ********163.com      | 99999.9990 | 1970-01-01 00:00:00
  2 | bob  | 182******** |       0 | ####-####-####-3456 | ***********qq.com    |  9999.9990 | 1970-01-01 00:00:00
  3 | cici | 155******** |         |                     | ************sina.com |            | 1970-01-01 00:00:00
(3 rows)
4. For ease of use, GaussDB (DWS) provides the system views redaction_policies and redaction_columns so that users can view desensitization information directly.
postgres=> SELECT * FROM redaction_policies;
 object_schema | object_owner | object_name | policy_name |            expression             | enable | policy_description
---------------+--------------+-------------+-------------+-----------------------------------+--------+--------------------
 public        | alice        | emp         | mask_emp    | ("current_user"() = 'july'::name) | t      |
(1 row)
postgres=> SELECT object_name, column_name, function_info FROM redaction_columns;
 object_name | column_name |                                             function_info
-------------+-------------+-------------------------------------------------------------------------------------------------------
 emp         | card_no     | mask_full(card_no)
 emp         | card_string | mask_partial(card_string, 'VVVVFVVVVFVVVVFVVVV'::text, 'VVVV-VVVV-VVVV-VVVV'::text, '#'::text, 1, 12)
 emp         | email       | mask_partial(email, '*'::text, 1, "position"(email, '@'::text))
 emp         | salary      | mask_partial(salary, '9'::text, 1, (length((salary)::text) - 2))
 emp         | birthday    | mask_full(birthday)
 emp         | phone_no    | mask_partial(phone_no, '*'::text, 4)
(6 rows)
5. If at some point employee information can be shared freely within the company, simply drop the mask_emp policy on table emp.
postgres=> DROP REDACTION POLICY mask_emp ON emp;
For more usage details, please refer to the GaussDB (DWS) product documentation.
4. Computable but invisible data desensitization
When the data desensitization function is used, sensitive data may be processed and computed before it is output. If desensitized data were used for in-database computation, the desensitized values would distort query results that involve aggregate functions, filter conditions, and the like. To address this, data desensitization introduces a "computable but invisible" capability: the original sensitive data participates in processing and computation inside the database and is desensitized only when it leaves the database. To use this capability, set the switch enable_redactcol_computable=on.
Currently, the scenarios that support the direct participation of sensitive data in processing calculations are as follows:
- Projection column expression nullif: SELECT nullif(salary, 1) FROM emp;
- Projection column LIKE expression: SELECT email LIKE '%.com' FROM emp;
- Projection column function to_days: SELECT to_days(birth) FROM david.emp;
- Aggregate function: SELECT count(*) FROM emp;
- Filter condition: SELECT * FROM emp WHERE cardid IS NOT NULL;
- Grouping condition (name is a desensitized field): SELECT name, avg(salary) * 12 FROM emp GROUP BY name;
- Projection column expression inside a subquery: SELECT (SELECT salary+10 FROM emp ORDER BY id LIMIT 1);
- Two tables joined with sensitive fields as the join condition
- Projection columns of CTE expressions
The scenarios that trigger data desensitization when data leaves the database are as follows:
- Table query
- View query
- DML RETURNING clause
- COPY export
- GDS table export
- CURSOR ... FETCH
- Calling a stored procedure whose definition queries a masked table
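The difference between computing on original values and computing on desensitized values is easiest to see with a join on a sensitive field. The Python sketch below is a toy illustration under stated assumptions: the `pay` table, its `month` column, and all function names are hypothetical, and `mask_full_number` merely stands in for MASK_FULL on a numeric column.

```python
emp = [
    {"name": "anny", "card_no": 1234123412341234},
    {"name": "bob", "card_no": 3456345634563456},
]
# Hypothetical second table that shares the sensitive column card_no.
pay = [
    {"card_no": 1234123412341234, "month": "2024-01"},
    {"card_no": 3456345634563456, "month": "2024-01"},
]


def mask_full_number(value):
    """Stand-in for MASK_FULL on a numeric column: fixed value 0, NULL kept."""
    return None if value is None else 0


def join_then_mask(left, right):
    """Computable but invisible: join on original values, mask at output."""
    result = []
    for e in left:
        for p in right:
            if e["card_no"] == p["card_no"]:  # comparison uses original data
                result.append({
                    "name": e["name"],
                    "card_no": mask_full_number(e["card_no"]),  # masked on exit
                    "month": p["month"],
                })
    return result


def mask_then_join(left, right):
    """Counter-example: masking first makes every card_no equal to 0."""
    masked_left = [{**e, "card_no": mask_full_number(e["card_no"])} for e in left]
    masked_right = [{**p, "card_no": mask_full_number(p["card_no"])} for p in right]
    return [
        {"name": e["name"], "card_no": e["card_no"], "month": p["month"]}
        for e in masked_left for p in masked_right
        if e["card_no"] == p["card_no"]
    ]


print(len(join_then_mask(emp, pay)))  # correct matches only
print(len(mask_then_join(emp, pay)))  # every row matches every row
```

Joining first keeps the result set correct while the returned card numbers are still masked; masking first degenerates the join into a cross product.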
4.1 Inheritance of desensitization strategy
For INSERT/UPDATE/MERGE INTO/CREATE TABLE AS statements, when the subquery projects a sensitive field, desensitization policy inheritance is triggered so that the new table containing the sensitive information carries the same desensitization policy as the source table. This prevents sensitive data from being leaked by inserting it from the source table into a new table and then querying the new table. In addition, policy inheritance is a table-level activity: it does not consider whether the policy takes effect for the current session or role in the subquery.
The first step in policy inheritance is sensitive lineage analysis. When any user executes a DML statement, the source tables of the subquery and their target projection columns are traversed. If a source table has a desensitization policy and a target projection column is a masked field, then inserting into or updating the target table from that source table is considered to risk exposing the source table's sensitive data.
When a policy is inherited, the policy information and masked-field information that apply to the target table are first generated from the source-table sensitive information marked during the traversal. The system then internally generates a policy creation statement and writes it to the system tables pg_redaction_policy and pg_redaction_column. The inherited policy is named "inherited_rp". Finally, the inherit flag field in the system table metadata is set to true.
Note that if the session or user executing the INSERT meets the policy's trigger condition and the inserted rows are output with the RETURNING clause, the returned result is desensitized, and the log message "Parent redaction policies/columns" records the source of the inherited policy.
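The lineage step can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the product's algorithm: the function `inherit_policy`, its dictionary layout, and the target table `emp_copy` are hypothetical; only the policy name "inherited_rp" and the inherit flag come from the description above. In line with the text, the sketch takes no user or session argument, because inheritance ignores whether the source policy currently takes effect.

```python
def inherit_policy(source_columns, projection):
    """Derive an inherited policy for the target table.

    source_columns: masked columns of the source table -> masking function name
    projection:     target column -> source column it is populated from
    """
    inherited = {
        target: source_columns[source]
        for target, source in projection.items()
        if source in source_columns
    }
    if not inherited:
        return None  # no sensitive lineage, nothing to inherit
    return {"policy_name": "inherited_rp", "inherit": True, "columns": inherited}


emp_masked = {"card_no": "mask_full", "phone_no": "mask_partial"}

# Models: INSERT INTO emp_copy(id, card) SELECT id, card_no FROM emp;
policy = inherit_policy(emp_masked, {"id": "id", "card": "card_no"})
print(policy)
```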
Policy inheritance introduces some conflict scenarios. For example, the target column of a SELECT statement may not be the original sensitive field but a complex expression over it, which is computed first and desensitized afterwards; how should desensitization be defined in that case? The two branches of a SETOP set operation may use different desensitization effects for the same target column; how should the desensitization result of the outer statement's target column be defined? In the sensitive lineage analysis of multiple INSERT/UPDATE operations, the source projection columns of the same target column may use different desensitization effects, and the effective conditions of the source table policies may also differ; how should the policy be inherited then?
For these conflict scenarios, the guiding principle is that preventing any leakage of the user's sensitive data takes precedence over preserving the characteristics of the original data in the desensitized result. When desensitization effects conflict, the effect is therefore upgraded to the general-purpose mask_full. mask_full is a full masking function that can handle any data type; it only looks at the return type of the expression, which guarantees that the masked data is not leaked. The trade-off is that the masked result no longer reflects the characteristics of the original data and is less readable. In addition, for function expressions such as length and count, whose results do not expose characteristics or information of the original data, the ALTER FUNCTION ... [NOT] MASKED syntax is provided so that users can manually configure a whitelist of functions that need no desensitization.
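The conflict rule can be summarized as a small decision function. The Python sketch below is only a reading of the rule described above, not the product's code: `effective_mask`, its parameters, and the whitelist contents are hypothetical, with the whitelist standing in for functions a user has marked NOT MASKED.

```python
def effective_mask(candidates, outer_function=None, unmasked_functions=frozenset()):
    """Choose the masking effect for one target column.

    candidates:         masking functions bound to the sensitive source columns
    outer_function:     function or operator wrapped around the column, if any
    unmasked_functions: user-configured whitelist (functions marked NOT MASKED)
    """
    distinct = set(candidates)
    if not distinct:
        return None  # no sensitive column involved
    if outer_function in unmasked_functions:
        return None  # result is considered safe to return as-is
    if outer_function is None and len(distinct) == 1:
        return next(iter(distinct))  # plain column, single agreed effect
    return "mask_full"  # expression or conflicting effects: upgrade


whitelist = frozenset({"length", "count"})  # hypothetical user configuration

print(effective_mask(["mask_partial"]))
print(effective_mask(["mask_partial"], outer_function="upper"))
print(effective_mask(["mask_partial", "mask_full"]))
print(effective_mask(["mask_partial"], "length", whitelist))
```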
4.2 Protection against malicious extraction
When some sensitive information is already known, an attacker can run multiple trial matches to confirm it against the visible user information and thereby steal the user's private data. Such trial matching uses filter conditions, or projections with helper expressions, in the form of equality tests. Extracting users' private data through known constant values and equality or equality-like expressions is treated as a malicious extraction attempt.
postgres=> SELECT name FROM emp WHERE name in('Zhang San');
INFO:  The result of target column {"name"} is masked.
  name
--------
 Zhang*
(1 row)
As the example above shows, although the name has been desensitized, the query condition targets the user 'Zhang San'. Even though the result is desensitized to 'Zhang*', it is easy to conclude that the underlying value is 'Zhang San', so Zhang San's information is at risk of leaking.
To address this, a "pessimistic" model is adopted: any equality test against a constant may be a malicious extraction attempt and is prohibited. For example:
postgres=> SELECT name FROM emp WHERE name in('Zhang San');
ERROR:  Redaction column "name" cannot be referenced in equivalence conditions with const value.
HINT:  Please use EXPLAIN command to see more details.
The scenarios in which malicious extraction using constants is prohibited are summarized as follows:
1. The prohibited forms are equality tests between a desensitized field and a constant, compound expressions built from such tests, and equivalent expressions.
2. Assuming that name is a desensitized field and the current session meets the policy's trigger condition, statements with the following characteristics (among others) carry a risk of malicious extraction and are not allowed to execute:
• name = 'Zhang San'
• name = 'Zhang San' OR name = 'Li Si'
• name in ('Zhang San', 'Li Si')
• CASE name WHEN 'Zhang San' THEN true …
• CASE WHEN name in ('Zhang San', 'Li Si') THEN …
• Advanced package dbms_output.put_line
3. Executing such a statement reports an error: ERROR: Redaction column "name" cannot be referenced in equivalence conditions with const value.
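The pessimistic check can be pictured as a walk over the condition tree. The Python sketch below works on a toy expression format (nested tuples) that is purely hypothetical, and it covers only the `=`, `IN`, `AND`, `OR` and `NOT` forms; it is not how the database represents or analyzes expressions. Only the error text is taken from the article.

```python
ERROR = ('Redaction column "{col}" cannot be referenced in '
         'equivalence conditions with const value.')


def find_const_equality(expr, masked):
    """Return the masked column that is compared with a constant, or None."""
    op = expr[0]
    if op in ("and", "or", "not"):
        for child in expr[1:]:
            hit = find_const_equality(child, masked)
            if hit is not None:
                return hit
        return None
    if op == "eq":
        # Check both orders: name = 'x' and 'x' = name.
        for a, b in ((expr[1], expr[2]), (expr[2], expr[1])):
            if a[0] == "col" and a[1] in masked and b[0] == "const":
                return a[1]
        return None
    if op == "in":
        col, values = expr[1], expr[2]
        if col[0] == "col" and col[1] in masked and any(v[0] == "const" for v in values):
            return col[1]
    return None


def check(expr, masked):
    """Reject the condition if it tests a masked column against a constant."""
    col = find_const_equality(expr, masked)
    if col is not None:
        raise PermissionError(ERROR.format(col=col))


masked_columns = {"name"}
condition = ("or",
             ("eq", ("col", "name"), ("const", "Zhang San")),
             ("eq", ("col", "name"), ("const", "Li Si")))
try:
    check(condition, masked_columns)
except PermissionError as exc:
    print("ERROR:", exc)
```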
5. Implementation principle of data desensitization
The GaussDB (DWS) data desensitization function is built on the existing framework of the SQL engine and performs real-time desensitization, imperceptible from the outside, while restricted users' query statements execute. Its internal implementation is shown in the figure above. A redaction policy is treated as a rule bound to the table object. During the query rewrite phase of the optimizer, each TargetEntry in the TargetList of the Query Tree is traversed. If a TargetEntry involves a masked column of a base table and the desensitization rule is currently in effect (that is, the policy's effective condition is met and the policy is enabled), the TargetEntry is determined to contain a Var object that must be desensitized. The masked-column system table pg_redaction_column is then searched for the masking function bound to that column, and the Var is replaced with the corresponding FuncExpr. After the Query Tree has been rewritten in this way, the optimizer automatically generates a new execution plan, the executor runs that plan, and the sensitive data in the query result is desensitized.
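The rewrite step described above can be mimicked on a toy target list. The Python sketch below borrows the node names Var, FuncExpr and TargetEntry from the text, but the class definitions, the `REDACTION_COLUMNS` dictionary standing in for pg_redaction_column, and the function `rewrite_target_list` are simplified assumptions rather than the engine's real data structures.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:
    table: str
    column: str


@dataclass(frozen=True)
class FuncExpr:
    name: str
    args: Tuple


@dataclass(frozen=True)
class TargetEntry:
    expr: Union[Var, FuncExpr]
    resname: str


# Stand-in for pg_redaction_column: (table, column) -> (function, extra args)
REDACTION_COLUMNS = {
    ("emp", "card_no"): ("mask_full", ()),
    ("emp", "phone_no"): ("mask_partial", ("*", 4)),
}


def rewrite_target_list(target_list, policy_active):
    """Wrap each masked Var in its bound masking function when the policy is active."""
    if not policy_active:
        return list(target_list)
    rewritten = []
    for entry in target_list:
        expr = entry.expr
        if isinstance(expr, Var) and (expr.table, expr.column) in REDACTION_COLUMNS:
            func_name, extra = REDACTION_COLUMNS[(expr.table, expr.column)]
            expr = FuncExpr(func_name, (expr,) + extra)  # Var replaced by FuncExpr
        rewritten.append(TargetEntry(expr, entry.resname))
    return rewritten


target_list = [
    TargetEntry(Var("emp", "card_no"), "card_no"),
    TargetEntry(Var("emp", "name"), "name"),
]
for entry in rewrite_target_list(target_list, policy_active=True):
    print(entry)
```

Because only the target list is rewritten, the output column names stay the same and the caller sees an ordinary result set.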
Compared with the original statement, statement execution with data desensitization adds logical processing of data desensitization, which will inevitably bring additional overhead to the query. This cost is mainly affected by three factors: the data size of the table, the number of masked columns involved in querying the target column, and the masking function used in the masked column.
For a simple query statement, take the tpch table customer as an example to test the above factors, as shown in the following figure.
In Figures (a) and (b), the columns of the base table customer use either the MASK_FULL or the MASK_PARTIAL masking function, depending on field type and characteristics. MASK_FULL desensitizes original data of any length and type to a fixed value, so its output differs greatly from the original data. Figure (a) shows the execution time of a simple query in the desensitized and non-desensitized scenarios at different data scales. Solid markers represent the non-desensitized scenario, and hollow markers represent restricted users, that is, the desensitized scenario. The larger the data scale, the greater the difference in execution time between the query with desensitization and the original statement. Figure (b) shows, at the 10x data scale, how the number of masked columns involved in the query affects execution performance. With one masked column, the query with masking is slower than the original statement. That column uses the MASK_PARTIAL partial masking function, which changes only the format of the result and not its length, so this matches the expectation that statement execution with desensitization incurs some performance degradation. As the number of masked columns in the query increased, something unexpected appeared: the desensitized scenario executed faster than the original statement. Tracing the masking functions bound to the masked columns in the multi-column scenarios showed that the columns using MASK_FULL were responsible: outputting their fixed-value result set takes much less time than outputting the original data, so simple multi-column queries with masking ran faster.
To test this explanation, we adjusted the masking functions so that all masked columns use MASK_PARTIAL to partially mask the original data, which preserves some readability of the original data in the masked result. As shown in Figure (c), when all masked columns are bound to partial masking functions, the statement with data masking is about 10% slower than the original statement. This degradation is within an acceptable range. The test above covers only simple query statements; when a statement is complex enough to include aggregate functions or complex expression operations, the degradation may be more pronounced.
6. Summary
The data desensitization function of GaussDB (DWS) is an important technological breakthrough that builds data security capabilities into the database product itself. It mainly covers the following three aspects:
- A simple and easy-to-use syntax for data desensitization policies;
- A series of flexibly configurable built-in masking functions that cover the common desensitization effects for private data;
- A complete and convenient way to apply desensitization policies, which desensitizes original statements in real time, transparently and efficiently during execution.
All in all, this data desensitization function can fully meet the data desensitization requirements of customers' business scenarios, support the desensitization effect of common private data, and achieve reliable protection of sensitive data.
[Warm reminder] If you have any questions during use, please feel free to communicate and give feedback at any time.