Data Security Innovation in the New Digital Era: Tokenization

0 Preface

Technological innovation is leading a wave of digitalization that is sweeping the world, and data has become the core factor of production for enterprise development. High-tech companies such as Google and Facebook achieved great commercial success by offering free, excellent software and services, attracting huge numbers of users, and being driven by data resources. Amid this rapid growth, however, companies neglected data governance, leading to a flood of data leaks, algorithm abuse, and privacy problems. The crisis reached a climax with landmark events such as Facebook's "Cambridge Analytica" scandal and the 2020 US election. Out of concern for data security and privacy, the European Union's GDPR led the way into modern privacy compliance, which then swept the world and became another irreversible trend.

There are two roads ahead for enterprises: one is to secure survival and growth through data technology innovation; the other is to guarantee the security of user data. In choosing and balancing between these two roads, some companies have fallen, while others have survived and burst out with new vitality.

It is clear that only by changing our thinking and daring to innovate can we turn crisis into opportunity and achieve long-term development. We need to recognize the turning point: the shift from the extensive, inefficient, "flood-irrigation" high-carbon growth of the first half of the digital era to high-quality, high-efficiency "data carbon neutrality" built on strong data management and governance capabilities. To survive and stand out in this transformation, technological innovation is an important lever, and the key is to grasp two core ideas:

  1. Recognize the productivity that powerful data applications bring, actively pursue technological transformation, make full use of advanced data management techniques, and improve both data-use efficiency and governance.
  2. Study and understand the purpose and essence of privacy compliance in depth, follow the core principle of "available but invisible", and unify efficiency with governance.

1. The challenge of data technology to security

In the digital application environment, data has the following characteristics:

  1. Data mobility and openness: according to digital economics, for data to create commercial value it must be supplied at low cost and large scale and flow efficiently. If the traditional network security approach of least privilege, layer-by-layer approval, and layer-by-layer fortification is applied, the vitality of data production is severely constrained. Moreover, the cost of bringing every node that data flows through up to an advanced protection baseline is beyond what organizations can bear.
  2. Data reproducibility and loss of control: data is a fluid asset; once accessed, control of it is transferred and the provider loses control. Traditional trust boundaries become increasingly blurred in data applications, which makes centralized security policies under the new data architecture costly to implement and of limited effect.
  3. Changeable data forms and complex applications: data is transmitted, stored, and processed in almost every IT system, and the complexity is beyond imagination. With AI, machine learning, and all kinds of innovative data applications layered on top, data-usage logic becomes even harder to trace, and it is nearly impossible to grasp the full picture of the data.
  4. Complex and ever-changing data threats: the enormous commercial value of data attracts black- and gray-market supply chains, internal and external actors, and even commercial and state espionage. Attack techniques and motives emerge endlessly and are extremely difficult to defend against.

Figure 1 Data exposure in the conventional system-centric protection mode

In the traditional mode, data circulates through systems in plaintext, and the exposed data surface is huge. Attackers can obtain large amounts of data through many channels: applications, storage, host system entry points, and compromised authorized accounts.

Figure 2 Horizontal data exposure in conventional mode

In digital scenarios, data moves among tens of thousands of applications and jobs. Each application has its own logic, and making every application compliant is enormously expensive. To protect data in such a broad and complex environment with the traditional system-centric defense model, the defense line becomes far too long, and the asymmetry of strong attack versus weak defense puts data security governance at a long-term disadvantage. We need to change our way of thinking and build a data-endogenous security mechanism, so that as data services expand rapidly, security protection capability grows with them. This is a data-centric innovation in security defense.

2. Tokenization - A Banking System for the Digital World

The tokenization scheme borrows from the real-world banking system. Before banks emerged, market economic activity relied mainly on cash transactions. The heavy exposure of cash led to large numbers of thefts and robberies; bodyguard services flourished, but only a few wealthy people could afford them, so society lost a great deal of assets. The banking system arose to meet this need: after receiving cash, users immediately deposit it at a bank, exchanging it for deposits (an equivalent substitute), and this substitute, electronic money, then circulates throughout society and is converted back into cash only in very rare scenarios. With the spread of banking and the popularity of online payment applications, such cash scenarios have become fewer and fewer. Anyone who wants to steal money must now rob the bank, and the bank is heavily guarded.

Similarly, for data as a core asset, when personal sensitive data (PII) first enters the organization's business systems, the plaintext (P) can be replaced by a one-to-one pseudonym, the Token. Within the organization's application environment, the Token then circulates efficiently. Because Token and plaintext correspond one to one, the Token can stand in for the plaintext in transmission, exchange, storage, and use across most of the life cycle, and it can be converted back into plaintext only through a secure and reliable Tokenization service. Even if hackers or malicious insiders or outsiders obtain the Token, it is useless to them (invisible). Thanks to these security properties, as long as the organization controls the main data sources and data hubs so that only Tokens circulate, and new plaintext data is proactively replaced with Tokens, data becomes secure by default, which fundamentally solves the governance problem of personal sensitive data.

Figure 3 Tokenized data exposure

As shown in Figure 3 above, by rolling out Tokenization we can shrink the set of services that can actually access plaintext to a double-digit number, reducing the exposed share of data services to less than 1%.

Figure 4 Longitudinal exposure of data after tokenization

As shown in Figure 4 above, after the Token transformation sensitive plaintext no longer appears in storage, caches, interfaces, or the data warehouse; only a small amount of host memory and the UIs with decryption authority can access plaintext. The UIs are controlled through fine-grained access control and audit-based risk controls, and the small amount of in-memory plaintext, being limited in volume, can be hardened with targeted reinforcement and risk-control measures. If tokenization is implemented comprehensively, the overall ability to control the risk of sensitive data is greatly enhanced.

3. Introduction of Tokenization Solution

3.1 What is Tokenization

Tokenization is a solution that replaces personal sensitive data with a Token, a non-sensitive equivalent substitute, which then circulates through business systems to reduce data risk and meet privacy compliance requirements. Tokenization is a form of de-identification. It first appeared in the payment card industry (PCI) to replace bank card numbers (PANs), and it is increasingly used to replace personal sensitive information (PII) in general digital scenarios.

  1. Personally identifiable information (PII): any identifier that can be linked, directly or indirectly, to a specific natural person, such as an ID card number, mobile phone number, bank card number, email address, WeChat account, or home address. PII alone, or combined with other public information, can be used to locate a natural person. Once PII is leaked, it can cause harm to individuals' lives and property, such as identity theft and fraud. Laws and regulations in China and abroad therefore explicitly require enterprises to protect PII across its entire life cycle.
  2. De-identification: replacing or transforming sensitive data by technical means so that the link between personal sensitive data and the natural person is temporarily or permanently severed. Specific techniques include pseudonymization, anonymization, and data encryption.
  3. Pseudonymization: replacing sensitive data with an artificial ID or pseudonym, such that no one can use the pseudonym to re-establish the link to the original natural person without authorization. Tokenization is an implementation of pseudonymization, and in a broad sense the two terms are interchangeable. Pseudonymization is a de-identification approach recognized by regulations including the GDPR. Note that the pseudonym corresponds one to one with the PII, and the original data can be restored in specific scenarios.
  4. Anonymization: masking or replacing part or all of the sensitive data so that it completely loses its connection with the original data or the natural person. Anonymization is irreversible; commonly used techniques include data masking.
  5. Data encryption: using encryption algorithms, such as the Chinese national standard symmetric algorithm SM4 or the widely used AES, to encrypt sensitive data into ciphertext, which cannot be decrypted back into plaintext without key authorization from the key management system (KMS). Note that unlike a pseudonymized Token, ciphertext has no direct usability: it must first be decrypted into plaintext before use. Ciphertext is therefore suitable only for storage and transmission, which greatly limits its use in data analysis scenarios such as search, association, query, and matching (see the sketch after this list).
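
To make the ciphertext/Token distinction concrete, here is a minimal Python sketch, assuming the third-party `cryptography` package and an illustrative HMAC-based pseudonym (not the exact scheme described later): encrypting the same value twice yields different ciphertexts, while tokenizing it twice yields the same Token, which is why only Tokens support search, join, and matching.

```python
import hashlib
import hmac
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

phone = b"13800138000"  # illustrative PII value

# Encryption: a fresh random nonce makes every ciphertext different, so
# ciphertext supports storage/transmission but not search, join, or dedup.
key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)
ct1 = aesgcm.encrypt(os.urandom(12), phone, None)
ct2 = aesgcm.encrypt(os.urandom(12), phone, None)
print(ct1 == ct2)  # False

# Pseudonymization (illustrative HMAC-based Token): the same input always maps
# to the same Token, so Tokens can replace plaintext in matching and analytics.
salt = os.urandom(32)  # in practice protected like an encryption key (via KMS)
tok1 = hmac.new(salt, phone, hashlib.sha256).hexdigest()
tok2 = hmac.new(salt, phone, hashlib.sha256).hexdigest()
print(tok1 == tok2)  # True
```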

3.2 Basic Design of Tokenization

3.2.1 Available, Invisible

1. Usability implementation

a) Big data analytics: the uniqueness of the Token supports deduplication, statistics, and association in data mining, processing, and analysis (see the sketch below).

b) Information transmission: in all other scenarios, the Token's uniqueness allows it to fully replace plaintext as it circulates through the system, covering links such as exchange, association, query, and matching.

c) Use of sensitive functions: in scenarios where plaintext must be used, the Tokenization service can exchange the Token back for plaintext, so full usability is achieved.
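
A small, self-contained sketch of point a): downstream analytics can deduplicate and join datasets on the Token alone, without ever seeing plaintext. The record values and Token strings here are purely illustrative.

```python
# Two illustrative datasets that share only Tokens, never plaintext PII.
orders = [
    {"order_id": 1, "user_token": "tok_a1"},
    {"order_id": 2, "user_token": "tok_b2"},
    {"order_id": 3, "user_token": "tok_a1"},
]
coupons = [
    {"coupon": "NEWUSER", "user_token": "tok_a1"},
]

# Deduplication / statistics: count distinct users by Token.
distinct_users = {o["user_token"] for o in orders}
print(len(distinct_users))  # 2

# Association: join orders to coupons on the Token, exactly as one would on plaintext.
coupon_by_token = {c["user_token"]: c["coupon"] for c in coupons}
joined = [{**o, "coupon": coupon_by_token.get(o["user_token"])} for o in orders]
print(joined[0])  # {'order_id': 1, 'user_token': 'tok_a1', 'coupon': 'NEWUSER'}
```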

2. Implementation of invisibility

The security of Tokenization itself is the security foundation of the whole solution. Tokenization must therefore be secure from design through implementation, to prevent unauthorized parties from using a Token to obtain the corresponding original plaintext and causing data leakage. For details, see Section 4, Tokenized security implementation.

3.2.2 Basic Architecture Requirements

To provide data protection in complex scenarios, the tokenization solution must satisfy several key architectural requirements:

  1. Business adaptability: Tokenization must support the data exchange needs of all data application scenarios, including online transactions, real-time and offline data applications, and AI and machine learning workloads.
  2. Security: the de-identified nature of the Token must be preserved by protecting its association with the plaintext. This depends on the security of the algorithms and the Tokenization service itself, as well as layered security in downstream applications.
  3. Usability and efficiency: introducing tokenization must not degrade the efficiency or stability of business systems.

3.3 Token generation logic

The core of tokenization is to generate a globally unique ID, the Token, for each piece of sensitive data within the enterprise. There are usually three schemes for generating this ID.

Figure 5 Logical diagram of Tokenization

1. Randomization: the Token is generated completely at random and stored in a one-to-one mapping table (this is tokenization in the narrow sense). Because there is no algorithmic relationship between Token and plaintext, forward and reverse lookups can only go through the Tokenization service, which makes this the most secure scheme. Its drawback is that, to keep Tokens strictly consistent, new-Token generation cannot run concurrently, otherwise one plaintext may end up with multiple Tokens. Guaranteeing consistency therefore sacrifices some distribution capability and performance, which quietly increases availability risk, especially in cross-region deployments.

Figure 6 Tokenization generation method 1
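
A minimal sketch of method 1, assuming a single-writer service that owns the mapping table: the Token is drawn from a cryptographically secure random source and has no algorithmic relationship to the plaintext, and consistency is kept by serializing new-Token creation (here with a lock). For brevity the map stores plaintext directly; Section 3.4 stores ciphertext instead.

```python
import secrets
import threading


class RandomTokenVault:
    """Narrow-sense tokenization: random Token plus a one-to-one mapping table."""

    def __init__(self):
        self._token_by_plain = {}
        self._plain_by_token = {}
        self._lock = threading.Lock()  # new-Token creation must not race

    def tokenize(self, plaintext):
        with self._lock:
            token = self._token_by_plain.get(plaintext)
            if token is None:
                token = "tok_" + secrets.token_hex(16)  # CSPRNG, unrelated to plaintext
                self._token_by_plain[plaintext] = token
                self._plain_by_token[token] = plaintext
            return token

    def detokenize(self, token):
        # Only the vault can reverse the mapping; a leaked Token reveals nothing by itself.
        return self._plain_by_token[token]


vault = RandomTokenVault()
t = vault.tokenize("13800138000")
assert vault.tokenize("13800138000") == t  # the same plaintext always yields the same Token
```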

2. MAC method: using a unified salted hash (HMAC), any process in any location can generate the same Token, which guarantees consistency. The mapping between the generated Token and the plaintext is still persisted to support de-tokenization. The advantage of this method is that it can be distributed across regions; the disadvantage is that some security is sacrificed: once an attacker obtains the salt, the algorithm can be used to compute Tokens in bulk. A balance between security and availability can be reached by protecting the salt appropriately (with the same protection strategy as an encryption key).

Figure 7 Tokenization generation method 2
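
A minimal sketch of method 2, assuming the salt is provisioned by a KMS-like secret store (stubbed here with `os.urandom`): any process holding the same salt computes the same Token, which gives cross-region consistency without coordination, while the Token-to-ciphertext mapping is still written separately to support de-tokenization.

```python
import hashlib
import hmac
import os

# Assumption: in production the salt is created, distributed, and stored by KMS
# and protected like an encryption key; os.urandom merely stands in for that here.
SALT = os.urandom(32)


def hmac_token(plaintext, salt=SALT):
    """Deterministic Token: same salt + same plaintext -> same Token anywhere."""
    digest = hmac.new(salt, plaintext.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:32]  # truncation is only an illustrative formatting choice


# Two independent workers (even in different regions) derive the same Token.
assert hmac_token("13800138000") == hmac_token("13800138000")

# The Token -> ciphertext mapping is still persisted in the tokenization store
# so that authorized de-tokenization (reverse lookup) remains possible.
```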

3. Deterministic encryption: using a deterministic encryption algorithm such as AES-SIV, or format-preserving encryption (FPE), to encrypt the plaintext and produce a reversible Token. Determinism sacrifices a key ingredient of encryption security, randomness, and current algorithms of this type generally have weaknesses, so this approach is not recommended. It also has an inherent limitation: the key cannot be rotated without invalidating every existing Token.

Figure 8 Tokenization generation method 3
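
For completeness, a sketch of method 3 using AES-SIV from a recent version of the third-party `cryptography` package: with no random nonce, the same key and plaintext always produce the same output, which is what makes the Token consistent, and also why the key cannot be rotated without re-tokenizing everything. As noted above, this approach is not recommended.

```python
import base64

from cryptography.hazmat.primitives.ciphers.aead import AESSIV

# AES-SIV without a nonce is deterministic: identical inputs yield identical
# outputs, so the ciphertext itself can serve as a reversible Token.
key = AESSIV.generate_key(bit_length=512)
siv = AESSIV(key)

phone = b"13800138000"
token1 = base64.urlsafe_b64encode(siv.encrypt(phone, None)).decode()
token2 = base64.urlsafe_b64encode(siv.encrypt(phone, None)).decode()
assert token1 == token2  # determinism gives consistency...

# ...but it also removes the randomness ordinary encryption relies on, and every
# Token is bound to this key: rotating the key would change all existing Tokens.
plain = siv.decrypt(base64.urlsafe_b64decode(token1), None)
assert plain == phone
```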

3.4 Logical Architecture of Tokenization Solution

Figure 9 Tokenized logical architecture diagram

The Tokenization service must provide compatibility, security, and availability for all business scenarios, mainly through multiple access and integration options, and it must integrate the necessary security measures. Logically, the service is divided into an access layer, a service layer, and a storage layer.

  1. Access layer: connects business applications and human users and performs the conversion between Token and plaintext, i.e., it serves tokenization and de-tokenization requests. It provides a human interface (Portal), a service interface (API), and big data job access. Because of the security requirements of tokenization, the access layer must enforce reliable security measures such as fine-grained access control, IAM, service authentication, and big data job authentication.
  2. Service layer: the actual implementation of tokenization and de-tokenization; it handles Token generation, storage, and lookup.
  3. Storage layer: mainly online storage and the data warehouse. For security reasons, the tokenization mapping table stores ciphertext rather than plaintext. An HMAC-based index establishes the association HASH > Token > ciphertext, supporting both exchanging plaintext for a Token (forward lookup) and a Token for plaintext (reverse lookup). Note that applications never obtain plaintext directly from the Tokenization service; they receive ciphertext and must obtain decryption permission through KMS to decrypt it locally (see the sketch after this list).
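
A minimal sketch of the storage-layer logic described above, under simplifying assumptions: an in-memory dict stands in for online storage and Hive, and a locally generated AES-GCM data key stands in for a KMS-managed key. Each record links an HMAC index, the Token, and the ciphertext; a forward lookup exchanges plaintext for a Token, and a reverse lookup returns only ciphertext.

```python
import hashlib
import hmac
import os
import secrets

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Assumptions: the HMAC salt and the data key would come from KMS in production;
# the mapping "table" is an in-memory dict standing in for online storage / Hive.
INDEX_SALT = os.urandom(32)
DATA_KEY = AESGCM.generate_key(bit_length=256)
_aesgcm = AESGCM(DATA_KEY)

_records = {}    # hmac_index -> {"token": ..., "ciphertext": ...}; no plaintext is stored
_by_token = {}   # token -> hmac_index, for reverse lookup


def _index(plaintext):
    return hmac.new(INDEX_SALT, plaintext.encode(), hashlib.sha256).hexdigest()


def tokenize(plaintext):
    """Forward lookup (plaintext -> Token), creating the record if it does not exist."""
    idx = _index(plaintext)
    if idx not in _records:
        nonce = os.urandom(12)
        _records[idx] = {
            "token": "tok_" + secrets.token_hex(16),
            "ciphertext": nonce + _aesgcm.encrypt(nonce, plaintext.encode(), None),
        }
        _by_token[_records[idx]["token"]] = idx
    return _records[idx]["token"]


def detokenize(token):
    """Reverse lookup (Token -> ciphertext only); the caller decrypts via KMS authority."""
    return _records[_by_token[token]]["ciphertext"]


t = tokenize("13800138000")
assert tokenize("13800138000") == t          # forward lookup is consistent
assert detokenize(t) != b"13800138000"       # reverse lookup never returns plaintext
```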

3.5 Panorama of Tokenization Application

Figure 10 Tokenized data circulation panorama

Component Description:

1. Online Data Sources

The main source of sensitive data. As soon as sensitive data enters the company, it must be converted into a Token through the Tokenization service API before being stored. In certain scenarios, the data also flows into the data warehouse. The data source's other role is to provide and share sensitive data downstream, via APIs, MQ, or shared storage such as S3.

2. Data warehouse data source

Sensitive data enters the data warehouse either by direct ingestion or from online systems; a tokenization task must be enabled to convert the plaintext into Tokens before the data is provided to downstream big data applications.

3. Tokenization service

a) The online Tokenization service exposes APIs that exchange plaintext for Tokens for online transactions and real-time tasks.

b) The offline Hive tokenization component provides offline data-cleansing services for big data tasks, converting plaintext into Tokens (a sketch follows).
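
A sketch of what the offline path could look like, assuming a Spark/Hive environment and reusing the illustrative HMAC Token from Section 3.3; the table and column names (`ods.user_profile`, `phone`, `dw.user_profile_tokenized`) are hypothetical, and in production the salt would be fetched from KMS at job start.

```python
import hashlib
import hmac
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("offline-tokenization").enableHiveSupport().getOrCreate()

# Assumption: the salt comes from KMS in production; os.urandom stands in here.
SALT = os.urandom(32)


@udf(returnType=StringType())
def tokenize(plaintext):
    # Convert one plaintext cell into its Token; keep NULLs as NULLs.
    if plaintext is None:
        return None
    return "tok_" + hmac.new(SALT, plaintext.encode("utf-8"), hashlib.sha256).hexdigest()[:32]


# Replace the plaintext phone column with a Token before the data is exposed
# to downstream big data applications (hypothetical table/column names).
df = spark.table("ods.user_profile")
cleaned = df.withColumn("phone_token", tokenize(df["phone"])).drop("phone")
cleaned.write.mode("overwrite").saveAsTable("dw.user_profile_tokenized")
```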

4. KMS and encryption/decryption

a) Distributes the encryption key used by tokenization to encrypt plaintext into the ciphertext field.

b) Distributes decryption keys to all applications with decryption authority.

5. Data application

a) Conventional intermediate applications: services that can complete their business functions with the Token alone; they obtain the Token from the data source and pass it downstream.

b) Decryption applications: where business requirements demand it and the security baseline is met, they exchange the Token for ciphertext and call the encryption/decryption module to decrypt it into plaintext (see the sketch below).
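
A sketch of the decryption application's flow under stated assumptions: `detokenize_api` and `kms_decrypt` are illustrative stand-ins for the Tokenization service API and the KMS/crypto SDK, simulated locally so the example runs. The point is that the application receives ciphertext, never plaintext, from the Tokenization service.

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Local simulation of what the Tokenization service and KMS would provide.
_DATA_KEY = AESGCM.generate_key(bit_length=256)
_nonce = os.urandom(12)
_FAKE_SERVICE = {"tok_demo": _nonce + AESGCM(_DATA_KEY).encrypt(_nonce, b"13800138000", None)}


def detokenize_api(token):
    """Simulated Tokenization API: an authorized call returns ciphertext, never plaintext."""
    return _FAKE_SERVICE[token]


def kms_decrypt(ciphertext):
    """Simulated KMS-authorized local decryption (nonce is prepended to the ciphertext)."""
    nonce, body = ciphertext[:12], ciphertext[12:]
    return AESGCM(_DATA_KEY).decrypt(nonce, body, None).decode()


def read_plaintext(token):
    ciphertext = detokenize_api(token)   # step 1: Token -> ciphertext
    return kms_decrypt(ciphertext)       # step 2: local decryption under central KMS authority


print(read_plaintext("tok_demo"))  # 13800138000
```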

4. Tokenized security implementation

4.1 The Security Essentials of Tokenization

The security premise of Tokenization is that the Token reveals nothing about the plaintext. If any person or system illegally builds, or gains the ability to build, a lookup dictionary between Tokens and plaintext, the security mechanism of Tokenization is completely broken. The core of tokenization security is therefore to prevent such a dictionary from ever being created.

4.2 Security Risk and Security Design

1. Security risks and controls of the Tokenization service itself

a) Token generation security: a randomly generated unique ID is the most secure method and requires a trusted random number generator. Where conditions permit, use a hardware cryptographic machine to generate the random numbers; if implemented in software, use a cryptographically secure pseudo-random number generator. If the HMAC method is used, the salt must be protected:

① the salt may only be created, distributed, and stored through trusted mechanisms such as KMS;

② it may only be used while the Tokenization service is running;

③ rotate the salt regularly; it is recommended that retired salts be securely deleted on a daily or weekly basis;

④ use a secure hash algorithm, such as SHA-256 or SM3.

b) Runtime security: the Tokenization service runs on a dedicated, specially hardened system.

c) Storage security: given big data scenarios and diverse storage needs, the tokenization store itself must hold no sensitive information, only indexes, Tokens, and ciphertext, and it must be under strict access control.

d) Access security:

① the API requires reliable service authentication; mTLS plus OAuth2 tokens are recommended, with access log auditing enabled;

② the Token-for-plaintext exchange returns only ciphertext; the requesting service decrypts locally via KMS, and decryption authority is centrally controlled;

③ the UI gives users the ability to manually convert between Token and plaintext; such requests must go through IAM and support ABAC-based fine-grained, risk-based access control (a sketch follows).
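
A toy sketch of what an ABAC-style, risk-based check for UI decryption could look like; the attribute names, roles, and thresholds are entirely hypothetical and only illustrate combining subject, resource, and environment attributes in one decision.

```python
def allow_ui_decrypt(subject, resource, context):
    """Hypothetical ABAC policy: combine subject, resource, and environment attributes."""
    return (
        subject.get("role") in {"customer_service", "risk_ops"}                 # who
        and resource.get("data_class") == "PII"                                  # what
        and subject.get("purpose") in resource.get("allowed_purposes", set())    # why
        and context.get("risk_score", 1.0) < 0.7                                 # real-time risk signal
        and context.get("mfa_passed", False)                                     # step-up authentication
    )


# Example: a customer-service agent decrypting one phone number for a refund case.
print(allow_ui_decrypt(
    {"role": "customer_service", "purpose": "refund_support"},
    {"data_class": "PII", "allowed_purposes": {"refund_support"}},
    {"risk_score": 0.2, "mfa_passed": True},
))  # True
```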

2. Secondary risks from upstream and downstream services and applications

Whether a data source or a downstream plaintext consumer, any party authorized to call the Tokenization interface could technically invoke it remotely and enumerate the full mapping between Tokens and plaintext. Security measures must therefore be extended to these systems and users, to ensure that no data leakage results from such misbehavior or from program vulnerabilities.

a) Establish a data application security baseline to constrain upstream and downstream data usage;

b) Strictly prohibit any unauthorized retention of plaintext, and especially the forwarding or dumping of Token-plaintext mapping data;

c) Forbid proxies; the consuming service must connect directly to the Tokenization service;

d) Require a full security review of every ecosystem party, including subsequent changes, to ensure baseline compliance;

e) Bring all upstream and downstream services into the monitoring system, including their storage, data interfaces, application code logic, and data lineage;

f) Perform global monitoring and scanning so that all non-compliant processing is discovered and handled promptly.

5. Engineering practical experience

The Tokenization service is not complicated to design, but once implemented it completely changes how the organization uses data and fundamentally resolves the tension between data-use efficiency and security compliance.

However, its strong protection rests on transforming the logic of data usage and breaking the old habit of handling plaintext, so implementation faces huge challenges: poorly maintained application code, redundant and messy historical data, and complex, tangled access logic all stand in the way of the transformation. Every business that touches sensitive data must take part, and a project of this scale must be coordinated through process planning, organizational support, and technical enablement. Meituan has accumulated considerable experience while driving this transformation across the company, offered here for reference.

  1. Consistent policy and communication: as a company-level data security governance strategy, tokenization requires a shared understanding across the organization. Strategic requirements, implementation guidelines, and tools and methods must be clear and consistent, and be communicated to all stakeholders through concise documents to reduce communication overhead and errors. In particular, the security baselines for the various data application scenarios, such as decryption policy, access control policy, API policy, big data, and AI, must be complete. Effective channels, including training, product manuals, and interface documentation, should reach every user, developer, and business stakeholder.
  2. Divide and conquer, advance flexibly: sensitive data flows through long and tangled access paths, and limited governance maturity adds to the difficulty of the transformation and to the psychological and actual cost. Segment the transformation into logical units, horizontally and vertically, down to the individual service, so that piecemeal, grayscale migration keeps the transformation agile.
  3. DevOps enablement: because tokenization changes how data is used, every business and R&D team must invest in it, and the labor cost is huge. Encapsulating the transformation logic in a simple, easy-to-use SDK reduces both the difficulty of migration and the risk of human error. Automated scanning, data cleansing, acceptance, and detection tools let business teams complete testing and acceptance in a self-service closed loop.
  4. Tokenization service capability: after the transformation, tokenization becomes a hard dependency of the related applications, and its faults or performance problems directly affect the business. The performance, availability, and stability of the Tokenization service are therefore critical; it must be carefully designed by a dedicated team and continuously tested, verified, and optimized to avoid failures, while also providing a degree of degradation and fault tolerance without compromising security.
  5. Operation and governance: as the project advances, Tokens overtake plaintext and become the mainstream. By controlling the main data entry points and the main data suppliers, Tokens become the default within the organization, achieving security by default. Cold data, static data, and relatively isolated data islands can still leave gaps and risks, so all data-bearing systems need multi-dimensional detection capabilities such as scanning and monitoring, with anomalous data mapped back to the responsible business to ensure full Token coverage.
  6. Learning, improvement, and iteration: as digital innovation evolves, new data forms and data applications keep appearing. The project must keep pace with this change and continuously improve its tools and processes so that the long-term strategy remains guaranteed.

6. Matters not covered

In the future, data security governance will continue to be extended.

At the data level, tokenization does not address unstructured data such as images and video, which may need to be protected directly with encryption. It also does not solve data exchange across enterprise trust boundaries, which calls for newer technologies such as privacy-preserving computation and secure multi-party computation. The main targets of tokenization are structured PII fields stored in databases and Hive; semi-structured data hidden in JSON, and unstructured PII in logs and files, are not covered and must be handled with strong data discovery and data governance tools.

Within the overall data security landscape, PII is just a drop in the ocean, and tokenization really only addresses data usage inside the enterprise, but it sets a precedent for security by default and security by design. Because PII is the core of personal sensitive data, tokenization can be traced back upstream to data collection and extended downstream to third-party data exchange. In addition, capabilities such as lossless data deletion can be realized by severing the Token association.

Data security is a vast topic. Under the strong demands of digital transformation and in the face of complex data applications, security needs more technological innovation. We hope Tokenization can inspire more secure paths to data innovation.

7. Author of this article

Zhigang, security architect at Meituan; expert in cryptography, cloud-native and DevOps security, data security, and privacy compliance.

Original article: Data security innovation under the new digital format - Tokenization - Meituan Technical Team


Origin blog.csdn.net/philip502/article/details/127279111