AI Redefines Web Security

About the Author:


Cong Lei

 

Partner and Vice President of Engineering at Baishan.

Cong Lei joined Baishan in 2016 and is mainly responsible for R&D management of the cloud aggregation product line and for building the cloud-chain product system.

Cong Lei worked at Sina from 2006 to 2015, where he founded SAE (Sina App Engine) and served as its general manager and chief architect. Since 2010 he has led Sina's cloud computing team in research and development in cloud-related fields. (Note: SAE is the largest public-cloud PaaS platform in China, with 700,000 users.)

Cong Lei holds 10 invention patents and currently serves as a judge for the Ministry of Industry and Information Technology's Trusted Cloud Service certification.

 

 

The impact of cloud on security

 

 

It has been 11 years since Amazon released the EC2 service in 2006. In that time, AWS revenue has grown from hundreds of thousands of dollars to more than 10 billion dollars, and, more importantly, cloud computing has entered every enterprise. According to the "2016 Cloud Computing White Paper" released by the China Academy of Information and Communications Technology, nearly 90% of enterprises have begun to use cloud computing (public clouds, private clouds, and so on). Large-scale cloud adoption is not only a trend for enterprises; it is an established fact.

 

The popularity of cloud adoption also brings many security challenges, including the following:

 

Cloud adoption makes traditional, hardware-appliance-based security methods ineffective. When I talk with enterprises, more than one has raised this concern: in moving to the public cloud, the hardware protection they purchased cannot move with them, so they worry about business security. Interestingly, they are not worried about traffic-layer attacks after migrating, because they believe products such as anti-DDoS ("high-defense") IPs on the cloud can handle most of those problems. What cloud adoption really creates is a security gap at the business layer, and this happens not only in public cloud environments but also in private ones. Taking the OpenStack Icehouse release as an example, it still lacks a web security component that can scale out effectively.

 

Cloud adoption has greatly reduced the cost of launching attacks. The cloud is a further upgrade of the "sharing economy" in IT: it has evolved from the earliest IDC rental to renting Linux kernel namespaces. But the cost reduction and convenience this "sharing economy" brings to enterprises, it brings to attackers in equal measure. At current market prices, renting an elastic public IP can cost as little as 1 yuan per day, renting a hypervisor-level computing environment on an IaaS platform costs only a few yuan per day, and a container-level environment costs even less. At such low cost, attackers no longer need to spend effort compromising and cultivating zombie machines as they did in the past; they can instantly obtain computing and network resources for an attack. Taking a well-known customer of Baishan's in the online recruitment field as an example, an attacker used up to tens of thousands of IPs in a single day to crawl core user resumes at a very low per-IP frequency.

 

Cloud adoption reduces the controllability of the business and greatly increases the risk of being attacked. The cloud objectively makes the business more complex and less controllable: a large number of one's own and one's partners' businesses run on the same cloud, and an attack on any one of them may affect the others. Admittedly, hypervisor isolation technology is very mature. Taking the CPU as an example, by allocating time slices and inserting spin locks between execution instructions, the CPU share of each guest can be precisely controlled, and other resources, including memory and IO, can also be properly controlled. But among all resources, the hardest to isolate is the network, especially the public network; NAT egress points, domain names, and the like are difficult to isolate. So we have to face reality: while enjoying the dividends of the cloud computing era, we face increasingly serious security problems at the business layer.

 

 

Security Products Need Change

 

 

Unfortunately, many traditional security products have not kept up with this era. The most obvious example: fifteen years ago, firewalls relied on setting policies on the command line; fifteen years later, the only thing that has changed is that the policies are set through a graphical interface instead of the command line. That can only be called a tragedy!



For traditional security products, configuring policies is a pain

I once listened to an evangelist from a well-known security vendor give a talk: "Buying our products does not mean your business is safe; you must learn how to configure them!" This sounds reasonable, but unfortunately, most companies' security staff are not the company's business developers. They do not know which referer a business page should come from, which user-agents should be rejected, what parameters an interface should accept, or even what a reasonable access-frequency range looks like for a single user of the business. What is even more regrettable is that these traditional security products are very expensive. After you spend millions, they may well be useless, and the saddest part is that "you think they work!"

 

Traditional security products must be deployed in-line, in the middle of the business path, which brings great instability. Although some advanced hardware mechanisms can reduce this risk, in-line deployment inevitably introduces latency and bandwidth bottlenecks. Some enterprises initially purchase hardware security products with 100 Mbps of throughput, and when the business suddenly grows, the hardware cannot be scaled out freely. What is more troublesome is that once the in-line analysis becomes complex (for example, when there are many policies), it inevitably adds latency to business access; and once the analysis is simplified, for example degenerating into merely limiting access frequency within a fixed time window, the error rate rises. This is a standing contradiction that traditional security products cannot resolve.

 

Unfortunately, although traditional security products have many problems, many users still suffer in silence and have even grown used to configuring policies every day. But that does not make it reasonable.

 

Amid the pain, there has always been an opportunity for technological innovation, and here it comes: machine learning!

 

Machine learning is the golden key to solving security problems

 



[Figure: the history of machine learning]

Machine learning has already arrived. As the figure above shows, the neural network, the origin of today's deep learning, was proposed as early as the 1970s. From the 1980s to this century, machine learning itself went through several flat periods and several explosive periods. With the growth of data and some high-profile events (such as AlphaGo's victory over Lee Sedol), machine learning has once again entered an explosive period.

 

So what is the relationship between big data and machine learning? It is closely tied to deep learning. In essence, deep learning uses multi-layer neural network computation to replace the manual feature selection of traditional feature engineering, achieving classification performance comparable to, or even surpassing, that of traditional feature engineering. Following this logic, when there are enough labeled samples (the so-called "big data"), a very powerful classifier can be built through deep learning, for example one that judges which side of a Go game has the advantage.

 

With the current popularity of deep learning, AI seems very powerful, but frankly, its maturity is still far from replacing or even approaching the human brain. In terms of the Turing test, the problems AI has to solve are essentially three: recognition, understanding, and feedback.

These three problems build on one another. A truly intelligent machine would eventually give feedback like a human brain, making it impossible in a Turing test to tell whether it is a human or a machine.

 

Judging by the current state of AI, "recognition" has made the most progress: whether for images, speech, or video, many vendors can achieve high recognition rates. "Understanding" is less satisfactory; anyone who has used Apple's Siri knows it has not yet reached the level of a real conversation with a person. "Feedback" is harder still, because it requires continuous adaptation on top of understanding: the same question may call for a different response depending on the other party's identity, mood, and the occasion.

 

Therefore, the applications where machine learning currently works well are almost all recognition problems in a specific domain rather than a general one, such as face recognition or game playing. (A human-machine game is, in essence, also a recognition problem in a specific game domain: after learning from thousands of positions, the machine can automatically recognize which side a given position favors.)

 

Fortunately, most problems in the security field are recognition problems in specific scenarios, not general ones, and they involve neither understanding nor feedback. You only need to hand the relevant data to a machine learning system and let it make a recognition judgment: safe or unsafe, and the reason why it is unsafe.

 

Precisely because the security problem is, in essence, a recognition problem in a specific domain, machine learning is theoretically well suited to the security field and is the golden key to solving security problems.

 

 

The Difficulties of Combining Machine Learning with Security

 

 

Although machine learning has existed for a long time, for years it did not change the security market; products based on the old approach of manually setting policies still dominate. The main reasons are as follows:

 

1. Unlike other, more general fields, the cost of labeling samples in the security field is high. For machine learning, massive, complete, objective, and accurate labeled samples are extremely important: the more numerous and comprehensive the labeled samples, the more accurate the trained classifier can be. Obtaining labeled samples is not easy in any industry, but it is especially hard in security. For example, labeling for face recognition can be done by junior high or even elementary school students, while labeling a security threat event requires highly experienced security personnel; the cost gap between the two is huge.



[Figure: an injection attack]

As shown in the figure above, this injection attack has been encoded multiple times in complicated ways, and it is difficult for non-professionals to label such samples. This is the main reason deep learning is not yet widely used in the security field: massive labeled data is hard to obtain.

 

2. Unlike general fields, the security field has pronounced scenario-specific characteristics, and the criteria for judging an attack vary with the characteristics of the business. Taking the simplest CC attack as an example, 600 requests per minute may mean a destructive attack for some enterprises but perfectly normal access for others. Therefore, even with a large number of labeled samples, one company's labels may be useless to another, which is another important reason machine learning is hard to apply in security.

 

3. For traditional text-based attacks, conventional wisdom holds that simple feature engineering, or even direct regular-expression matching, is more effective.

We divide web attacks into two types: behavioral attacks and text-based attacks:

- Behavioral attacks: each individual request looks normal, but problems appear once the requests are linked into a behavior graph; examples include crawlers, credential stuffing, fake orders, and fake-follower boosting. Take follower boosting as an example: every request looks normal, but an attacker may use a large number of IPs to register a large number of accounts in a short period and have them all follow the same user. The problem can only be discovered when these behaviors are linked together and analyzed.

- Text-based attacks: traditional vulnerability attacks such as SQL injection, command injection, and XSS. A request is simply treated as a piece of text, and whether it is an attack is identified from the characteristics of that text.

When the feature space has low dimensionality and some dimensions are highly discriminative, a simple linear classifier can achieve good precision. For example, a few simple regular-expression rules for SQL injection can indeed cover many scenarios. However, this traditional thinking ignores recall; in fact, few people know what recall rate regular-expression rules for SQL injection actually achieve. At the same time, in some scenarios, if a legitimate business interface transmits SQL statements inside JSON, such regex-based classifiers produce an extremely high rate of misjudgments.
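As a concrete illustration of this trade-off, here is a minimal, self-contained sketch; the rules and requests below are invented for illustration and are not taken from any real product. A tiny regex rule set of the kind described above misses a multiply-encoded injection, yet flags a legitimate API that carries SQL inside a JSON body.

```python
import re

# A deliberately naive rule set of the kind rule-based WAFs rely on.
# Real products ship far larger rule bases; this only illustrates the trade-off.
SQLI_PATTERNS = [
    re.compile(r"(?i)\bunion\b.+\bselect\b"),
    re.compile(r"(?i)\bselect\b.+\bfrom\b"),
    re.compile(r"(?i)('|%27)\s*or\s*('|%27)?\d"),
]

def looks_like_sqli(payload: str) -> bool:
    """Return True if any rule matches the raw request payload."""
    return any(p.search(payload) for p in SQLI_PATTERNS)

# A double-URL-encoded attack slips past the rules (low recall) ...
encoded_attack = "id=1%2527%2520UNION%2F**%2FSELECT%2520password%2520FROM%2520users"
print(looks_like_sqli(encoded_attack))    # False: the encoding defeats the word-boundary regexes

# ... while a legitimate reporting API that ships SQL inside JSON is flagged (false positive).
legit_report_api = '{"report": "SELECT region, SUM(amount) FROM sales GROUP BY region"}'
print(looks_like_sqli(legit_report_api))  # True: a normal request misjudged as an attack
```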

Traditional security vendors, however, have not yet recognized these problems.

 

4. Traditional security people do not understand machine learning. This is an indisputable fact. Many security engineers from traditional security companies are good at building all kinds of vulnerability detections, at mining boundary conditions to bypass defenses, and at writing patch policy after patch policy, but they are not good at AI and machine learning. This also shows how scarce and important such cross-disciplinary talent is.

 

It is precisely for the above reasons that AI-driven security products were slow to appear. But no one can deny that users are tired of the policy-driven rule model; they look forward to a web security product that adapts to most scenarios, analyzes behavior and text in depth, and achieves high precision and recall without complex configuration.

 

So we set out to redefine web security with AI, because we firmly believe that abnormal behavior and normal behavior can be distinguished through feature recognition.

 

 

Redefining Web Security with AI

 

So how do we solve the sample-labeling problem in the security field? Machine learning falls into two broad categories: supervised learning and unsupervised learning. Supervised learning requires accurately labeled samples, while unsupervised learning can perform clustering over the feature space without labeled samples. In a field like security, where labeling is difficult, unsupervised learning is clearly a powerful tool.

 

Applied Unsupervised Learning

 

Unsupervised learning does not require preparing a large number of labeled samples in advance; it can separate normal users from abnormal users by clustering their features, thereby avoiding large-scale labeling. There are many clustering methods, such as distance-based and density-based clustering, but the core is still computing the distance between two feature vectors. In web security, the data we obtain is usually users' HTTP traffic or HTTP logs, and a problem arises when computing distances: each dimension has a different scale. For example, in a user's feature vector, the ratio of HTTP 200 response codes is a float while the request length is an integer, so the dimensions must be normalized to a uniform scale. There are many techniques for this; for example, the Mahalanobis distance can be used instead of the ordinary Euclidean distance. The essence of the Mahalanobis distance is to rescale each value by its standard deviation: when the standard deviation is large, the samples vary a lot and the value's weight is reduced; conversely, when the standard deviation is small, the samples are quite regular and the value's weight is increased.
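A minimal sketch of that idea, assuming numpy is available and using a few hypothetical per-user dimensions (ratio of 200 responses, mean request length, requests per minute): the Mahalanobis distance rescales each dimension by the observed covariance, so a noisy dimension counts for less than a tightly clustered one.

```python
import numpy as np

# Hypothetical per-user feature vectors: [ratio of 200 responses, mean request length, requests/min]
users = np.array([
    [0.97,  512.0,  12.0],
    [0.95,  480.0,  10.0],
    [0.96,  530.0,  11.0],
    [0.10, 4096.0, 600.0],   # a suspicious outlier
])

mean = users.mean(axis=0)
cov = np.cov(users, rowvar=False)
cov_inv = np.linalg.pinv(cov)        # pseudo-inverse guards against a singular covariance matrix

def mahalanobis(x: np.ndarray, y: np.ndarray) -> float:
    """Distance that down-weights dimensions with large spread and up-weights regular ones."""
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

for u in users:
    print(mahalanobis(u, mean))      # the outlier stands out even though the raw scales differ wildly
```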

 

Unsupervised clustering can use the EM computational model, treating the cluster assignment as the hidden variable and selecting the number of clusters with criteria such as the silhouette coefficient, then iterating toward the best result. Ideally, normal users and abnormal users end up in different clusters, and subsequent processing can take over. Of course, that is only the ideal case; more often, normal and abnormal behaviors are spread over many clusters, and some clusters even mix the two. Additional techniques are needed at that point.
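One way to sketch this, assuming scikit-learn is available (its GaussianMixture runs EM internally) and using purely synthetic, illustrative data: fit mixtures with different numbers of clusters, let an information criterion pick the count, and treat the small resulting cluster as the candidate anomaly group.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical behaviour vectors: most users in one dense region, a few far away.
normal = rng.normal(loc=[0.95, 500, 10], scale=[0.02, 30, 2], size=(500, 3))
abnormal = rng.normal(loc=[0.30, 3000, 400], scale=[0.05, 200, 50], size=(20, 3))
X = np.vstack([normal, abnormal])

# EM runs inside GaussianMixture.fit; choose the number of clusters by BIC.
best_model, best_bic = None, np.inf
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    bic = gmm.bic(X)
    if bic < best_bic:
        best_model, best_bic = gmm, bic

labels = best_model.predict(X)
print("chosen clusters:", best_model.n_components)
print("cluster sizes:", np.bincount(labels))   # the small cluster is the candidate anomaly group
```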

 

Learning Rules

 

The premise of unsupervised clustering is a vector space constructed from the user's access behavior. A vector looks something like:

[key1:value1,key2:value2,key3:value3...]

 

Two issues are involved here: how to find the keys and how to determine the values.

 

Finding the right keys is essentially a feature selection problem: how to pick the most discriminative and representative dimensions from a large number of candidate features. Why not simply compute over all features, as some deep learning approaches do? Mainly because of computational complexity. Note that feature selection is not the same as feature dimensionality reduction: the commonly used PCA (principal component analysis) and SVD decomposition only reduce dimensionality, and in a sense the first few layers of a deep network also perform a kind of dimensionality reduction.

 

Feature selection can be carried out according to the actual situation. Experience shows that random forests are a good choice when both positive and negative labeled samples are available; when labeled samples are scarce or unreliable, Pearson correlation can also be used to select features.
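A rough sketch of both options, assuming scikit-learn and scipy are available and using invented feature names and toy data; it is only meant to show the shape of each approach, not a production feature-selection pipeline.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
feature_names = ["req_per_min", "ratio_200", "avg_url_depth", "ua_entropy"]   # hypothetical

# Toy data: the first feature is genuinely discriminative, the rest are noise.
y = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 4))
X[:, 0] += 3 * y        # "attackers" request far more often in this toy example

# Option 1: with labeled samples, rank features by random forest importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, score in sorted(zip(feature_names, forest.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:15s} importance={score:.3f}")

# Option 2: with few or unreliable labels, rank by absolute Pearson correlation.
for i, name in enumerate(feature_names):
    r, _ = pearsonr(X[:, i], y)
    print(f"{name:15s} |pearson r|={abs(r):.3f}")
```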

 

Ultimately, a user's access behavior becomes a set of features. How do we determine the value of each feature? Take the most important one, access frequency, as an example: how high an access frequency deserves our attention? This has to be learned per business scenario in order to determine the values of these keys.

 

There are two main types of learned rules:

1. Behavioral rules: automatically find the key points of the site's paths. Using the state transition probability matrix and the power-iteration principle behind PageRank, the eigenvector associated with the largest eigenvalue of the path transition matrix highlights the site's key paths (key convergence and divergence points); the user's path-access rules can then be learned around those key points. (See the first sketch after this list.)

2. Text rules: for an API, its input and output patterns can be learned, for example the number of input parameters, the type of each parameter (string, number, email address, and so on), and the distribution of parameter lengths. For any dimension, a probability distribution can be learned and then used to compute how an observation ranks within the population. Chebyshev's inequality tells us which values are abnormal even for the most irregular distributions. For example, if statistics on the username parameter in GET /login.php?username= show an average length of 10 and a standard deviation of 2, then a user who submits a username of length 20 exhibits minority behavior that accounts for less than 5% of the population. (See the second sketch after this list.)
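To make item 1 concrete, here is a minimal sketch, not taken from the original system: it assumes numpy and a small, hypothetical set of pages whose transition probabilities have already been estimated from access logs, and it uses power iteration, the same principle PageRank relies on, to find the stationary distribution of the path graph. The highest-scoring pages are the key convergence points around which path rules can then be learned.

```python
import numpy as np

# Hypothetical pages and a row-stochastic transition matrix estimated from logs:
# P[i, j] = probability that a session on page i moves next to page j.
pages = ["/index", "/search", "/job/detail", "/resume/view", "/logout"]
P = np.array([
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.05, 0.10, 0.70, 0.10, 0.05],
    [0.05, 0.20, 0.10, 0.60, 0.05],
    [0.10, 0.20, 0.20, 0.30, 0.20],
    [0.80, 0.05, 0.05, 0.05, 0.05],
])

# Power iteration: push a distribution through P until it stops changing.
# The fixed point (the dominant left eigenvector) scores how central each page
# is to normal navigation, i.e. the "key points" of the path graph.
v = np.full(len(pages), 1.0 / len(pages))
for _ in range(200):
    nxt = v @ P                       # P is row-stochastic, so the total probability stays 1
    if np.allclose(nxt, v, atol=1e-12):
        break
    v = nxt

for page, score in sorted(zip(pages, v), key=lambda t: -t[1]):
    print(f"{page:14s} {score:.3f}")
```

And for item 2, a sketch of a learned text rule, reusing the username-length numbers from the example above (mean 10, standard deviation 2). The Chebyshev bound holds for any distribution, which is what makes it useful when nothing is known about the parameter's true distribution.

```python
from dataclasses import dataclass

@dataclass
class ParamLengthRule:
    """Learned rule for one API parameter; the numbers follow the example above."""
    mean_len: float = 10.0
    std_len: float = 2.0

    def population_bound(self, observed_len: int) -> float:
        """Chebyshev bound on the fraction of the population that deviates from the
        mean at least this much, valid for any distribution however irregular."""
        k = abs(observed_len - self.mean_len) / self.std_len
        return 1.0 if k <= 1 else 1.0 / (k * k)

rule = ParamLengthRule()
print(rule.population_bound(20))   # 0.04 -> at most 4% of users deviate this far: worth flagging
print(rule.population_bound(11))   # 1.0  -> within one standard deviation, nothing unusual
```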

Through feature selection plus learned behavioral and text rules, we can construct a complete and accurate feature space, vectorize user access, and then run unsupervised learning on it.

 

Make the system smarter

 

If a system has no human participation, it cannot keep getting smarter; AlphaGo, for example, needed to keep strengthening itself through games against human masters. In the security field, complete sample labeling is impossible, but we can use the idea of semi-supervised learning: select representative behaviors and hand them to professional security personnel for judgment. After their evaluation and correction, the whole system becomes smarter. The analysts' corrections can be combined with reinforcement learning and ensemble learning: when the algorithm's judgment turns out to be accurate, the corresponding parameter weights are increased, and vice versa.
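A minimal sketch of how such representative behaviors might be picked out for human review, assuming the cluster membership probabilities come from a model like the Gaussian mixture sketched earlier; the numbers below are invented. The sessions the model is least certain about are exactly the ones worth an analyst's time.

```python
import numpy as np

def select_for_review(posteriors: np.ndarray, budget: int = 10) -> np.ndarray:
    """Pick the samples the clustering model is least certain about.

    posteriors: (n_samples, n_clusters) membership probabilities, e.g. from
    GaussianMixture.predict_proba(X). A low maximum posterior means an ambiguous
    sample, which is the kind a security analyst should label first.
    """
    confidence = posteriors.max(axis=1)
    return np.argsort(confidence)[:budget]

# Hypothetical posteriors for five sessions over two clusters.
posteriors = np.array([
    [0.99, 0.01],
    [0.55, 0.45],   # ambiguous -> send to analyst
    [0.95, 0.05],
    [0.51, 0.49],   # ambiguous -> send to analyst
    [0.90, 0.10],
])
print(select_for_review(posteriors, budget=2))   # indices of the two least certain sessions
```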

A similar idea appeared in the 2016 paper "AI2: Training a big data machine to defend", in which a team from MIT proposed the semi-supervised AI2 system, bringing human analysts into the loop to make the security system safer and smarter.

 

Redefining Web Security

 

Based on the points above, we can outline the basic elements of AI-based web security:



[Figure: AI web security technology stack]

As the figure shows, all the algorithms live inside a real-time computing framework. Real-time computing requires that data ingestion, computation, and output all happen in real time, so that the system can respond quickly when a threat occurs. But the real-time requirement also adds many challenges: problems that are trivial in traditional offline processing suddenly become hard in a streaming setting. Take the simplest example, computing a median: designing a median algorithm that stays accurate under real-time streaming input is not easy. T-digest is a good choice, bounding memory usage to O(K), and there are also algorithms that achieve a reasonably accurate median with O(1) memory.
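T-digest itself is too involved for a short example, so here is a much cruder sketch of the same constraint, using only the Python standard library: a reservoir sample keeps O(K) memory while the stream flows past and still yields a usable approximate median. It only illustrates why streaming quantiles need dedicated algorithms; a production system would use t-digest or a similar sketch structure.

```python
import random
import statistics

class StreamingMedian:
    """Approximate median over an unbounded stream using O(K) memory.

    Plain reservoir sampling: far simpler (and less accurate) than t-digest,
    but it shows the shape of the problem. An exact median needs the whole
    stream; a streaming one must settle for a bounded summary.
    """

    def __init__(self, reservoir_size: int = 1024, seed: int = 0):
        self.k = reservoir_size
        self.n = 0
        self.reservoir: list[float] = []
        self.rng = random.Random(seed)

    def update(self, value: float) -> None:
        self.n += 1
        if len(self.reservoir) < self.k:
            self.reservoir.append(value)
        else:
            j = self.rng.randrange(self.n)   # keep each stream item with probability k/n
            if j < self.k:
                self.reservoir[j] = value

    def median(self) -> float:
        return statistics.median(self.reservoir)

sketch = StreamingMedian(reservoir_size=256)
for v in (random.gauss(200, 50) for _ in range(100_000)):
    sketch.update(v)
print(round(sketch.median()))   # close to 200, despite never storing the full stream
```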

 

To sum up, using AI for web security is an inevitable trend: it can displace traditional security products built around policy configuration and achieve accurate, comprehensive threat identification. However, building AI-based security products is itself a complex undertaking, involving feature engineering, algorithm design and validation, and a stable, reliable engineering implementation.

 

[Figure: the ATD deep threat identification system]

Baishan has been exploring AI-based web security and officially launched its ATD (Advanced Threat Detection) product in July 2017. ATD can accurately identify and block all kinds of behavioral and text-based attacks, including crawlers, malicious registration, and credential stuffing. In just half a year it accumulated more than 30 large and medium-sized enterprise customers. Practice has shown that machine learning is indeed effective for web security. For example:

 

A top-3 recruitment website in China had long suffered from resume crawling. The malicious crawlers are very sophisticated: they fully mimic normal users in fields such as User-Agent and Referer, and they embed PhantomJS so they can execute JavaScript, rendering traditional JS-redirect defenses completely ineffective. These crawlers use a large number of elastic IPs to crawl at a very low frequency; statistics show a single client can make fewer than ten requests per day, against which traditional security products lose all defensive capability. ATD, based on machine learning, models feature vectors to accurately distinguish low-frequency crawlers from normal user behavior; in verification, its precision reached 99.98%.

 

A top-3 domestic live-streaming platform suffered from large-scale malicious score and ranking manipulation. This behavior undermines the fairness of the platform and ultimately damages its interests. The fraud gangs register large numbers of sock-puppet accounts in batches ahead of time and mobilize them to push rankings when needed. Traditional security products are clearly helpless against such behavior; some newer security products can handle it, but they require a large number of customized rules and generalize poorly. Machine learning makes up for exactly these shortcomings: through behavior analysis, key paths and rules can be computed, then algorithms such as subgraph detection are used to identify the malicious gangs and finally output the offending account IDs. After customer verification, ATD's precision was as high as 99%, and its recall was more than 10 times that of traditional security products.

 

In short, AI-based web security is an emerging technical field. Although it is still developing, it will eventually replace traditional, policy-driven security products and become the cornerstone of enterprise web security.
