Security devices generate many false alarms, whether they rely on rule-matching policies or on complex security analysis models; this is a very common problem. One important reason is that every customer's application scenarios and data differ to some degree. Fixed judgment rules applied rigidly to data with statistical fluctuations are prone to misjudgment.
Without continuous human intervention and manual tuning, the false alarm rate of policies and models will not improve as data accumulates. However, by tagging alarms, security analysts can transfer their professional experience to an intelligent algorithm, which automatically feeds that experience back into the policies and models so that they judge security events more accurately. This paper introduces a method of continuously optimizing a machine learning model with expert experience: the alarm data is analyzed and learned from a second time, significantly reducing the false alarm rate of security threat alerts.
There are basically two technical approaches to reducing the false alarm rate:
Modify the policies and models for each customer's specific situation, to improve the adaptability of the policy or model;
Periodically (e.g., once a month) send alarms for secondary manual analysis, and adjust the policy and model parameters according to the analysis results.
Both methods can reduce the false alarm rate. The first, however, has no adaptive ability: whether it works depends on the situation at hand. The second works better, but it is time- and labor-intensive, and manually intervening on site to adjust policies and models carries a high probability of error.
MIT researchers [1] introduced a method that uses alarm logs labeled by security analysts as the training data set, letting a machine learning algorithm learn from expert experience, continuously optimize the analysis, automatically recognize false alarms, and thereby reduce the false alarm rate (hereafter the "label transfer experience method"). This process of transforming the professional intelligence of security analysts into algorithmic analysis capability makes the analysis more accurate as data accumulates, gradually removing the need for manual intervention and improving operations efficiency. As shown in the figure below:
Next, we introduce the implementation mechanism using simulated data for a "frequent access security threat alert" scenario.
What is the frequent access model? The logic is relatively simple: within a period of time (e.g., one minute), an attacker's number of visits to the system is significantly higher than that of ordinary visitors. The alert rule can be based on a simple threshold or on the statistical outlier probability of the visit distribution. On this basis, we first simulate some alarm data that has already been labeled by security analysts. Drawing on practical experience, we try to make the simulated data close to real scenarios, as shown in the figure below:
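As an illustration, a threshold-based version of this rule can be sketched as follows (the one-minute window and the threshold of 100 visits are arbitrary assumptions, not values from the original implementation):

```python
from collections import Counter

def frequent_access_alerts(access_log, visits_threshold=100):
    """Flag source IPs whose visit count within one time window
    (e.g. one minute) exceeds a fixed threshold."""
    counts = Counter(access_log)
    return {ip for ip, n in counts.items() if n > visits_threshold}

# One simulated 1-minute window of access events
window = ["10.0.0.1"] * 150 + ["10.0.0.2"] * 12 + ["10.0.0.3"] * 5
print(frequent_access_alerts(window))  # {'10.0.0.1'}
```

A rule this rigid is exactly what produces false alarms when legitimate traffic fluctuates, which is the problem the labeled data below is meant to address.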
Introduction to simulation data:
A total of 20 days of alarm data were simulated, from 2017-01-01 to 2017-01-20. The first 10 days of data are used to train the model, and the last 10 days to measure its performance;
Each alarm is labeled as a false alarm or not. Red indicates a false alarm and blue an accurate alarm.
Assumptions about simulation data:
False alarms cluster in a certain period of time; in the simulated data this range is assumed to be 18:00-19:00. In security operations practice there are indeed periods when, for business-logic or system reasons, false alarms increase. The assumption is therefore reasonable, and alarm time can serve as an effective feature. However, not all false alarms fall in this period, and not all alarms in this period are false;
Most false alarms come from a small group of distinct IPs, so the source IP of the access is also a useful feature;
No data is perfect, so about 9% label noise is added to the simulated data. In other words, no matter how good the model is, the false alarm rate will not drop below 9%.
These assumptions are also reasonably realistic. If false alarms were completely random, no intelligent model could capture their signal. These assumptions therefore help us simulate realistic data and validate our machine learning model.
The code for generating the simulated data is briefly described as follows:
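A minimal sketch of how such labeled data might be generated (the IP ranges, rates, function and column names here are our own assumptions; the repository linked at the end contains the actual implementation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def simulate_alarms(start="2017-01-01", days=20, alarms_per_day=100):
    """Simulate labeled 'frequent access' alarms for `days` days.
    False alarms cluster in the 18:00-19:00 window and come from a
    small pool of source IPs; ~9% label noise is injected at the end."""
    false_ips = [f"10.1.1.{i}" for i in range(1, 6)]      # IPs prone to false alarms
    normal_ips = [f"192.168.0.{i}" for i in range(1, 51)]
    rows = []
    for day in pd.date_range(start, periods=days):
        for _ in range(alarms_per_day):
            is_false = rng.random() < 0.3                 # ~30% raw false-alarm rate
            hour = 18 if is_false else int(rng.integers(0, 24))
            rows.append({
                "date": day,
                "hour": hour,
                "srcip": str(rng.choice(false_ips if is_false else normal_ips)),
                "destip": f"172.16.0.{rng.integers(1, 4)}",
                "visits": int(rng.integers(100, 300)),
                "false_alarm": int(is_false),
            })
    df = pd.DataFrame(rows)
    noise = rng.random(len(df)) < 0.09                    # ~9% label noise
    df.loc[noise, "false_alarm"] = 1 - df.loc[noise, "false_alarm"]
    return df

alarms = simulate_alarms()
print(alarms.shape)  # (2000, 6)
```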
The following figure shows the result of dimensionality reduction with PCA; a clear separation is visible:
Red indicates false alarms and blue correct alarms. Dimensionality reduction on the chosen features yields two clusters, i.e., a clear separation between false and genuine alarms. In other words, the false alarms are regular rather than completely random, so machine learning can capture them.
Brief code implementation:
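One possible sketch of the PCA projection, using scikit-learn (the tiny stand-in table, function name, and encoding choices are our own assumptions):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_project(df):
    """One-hot encode the categorical IP columns, standardize, and
    project the alarm features onto two principal components."""
    X = pd.get_dummies(df[["srcip", "destip", "hour", "visits"]],
                       columns=["srcip", "destip"])
    X_std = StandardScaler().fit_transform(X.astype(float))
    return PCA(n_components=2).fit_transform(X_std)

# Tiny stand-in for the labeled alarm table
df = pd.DataFrame({
    "srcip":  ["10.1.1.1", "10.1.1.2", "192.168.0.7", "192.168.0.8"],
    "destip": ["172.16.0.1"] * 4,
    "hour":   [18, 18, 3, 14],
    "visits": [220, 180, 130, 150],
    "false_alarm": [1, 1, 0, 0],
})
coords = pca_project(df)
print(coords.shape)  # (4, 2)

# To reproduce the red/blue scatter plot described above:
# import matplotlib.pyplot as plt
# plt.scatter(coords[:, 0], coords[:, 1],
#             c=df["false_alarm"].map({1: "red", 0: "blue"}))
# plt.show()
```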
Based on the simulated data, our goal is to reduce the false alarm rate through continuously reinforced machine learning. Our strategy is:
Train on one day of data (2017-01-01), test on ten days (2017-01-11 to 2017-01-20);
Train on two days of data (2017-01-01 to 2017-01-02), test on the same ten days (2017-01-11 to 2017-01-20);
And so on, so that we can see whether the false alarm rate on the test data keeps improving as more and more data is learned.
The brief code is as follows:
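Since the code is only summarized in the original, here is one possible sketch of the incremental train/test loop using scikit-learn's RandomForestClassifier. The demo data generator, column names, and the definition of the residual false-alarm rate (false alarms left among the alarms the model does not filter out) are our own assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def encode(df):
    """One-hot encode the IP columns; random forests need numeric input."""
    return pd.get_dummies(df[["srcip", "destip", "hour", "visits"]],
                          columns=["srcip", "destip"])

def incremental_evaluation(df):
    """Train on 1..10 days of data, always testing on the last 10 days,
    and report the residual false-alarm rate of each round."""
    X, y = encode(df).astype(float), df["false_alarm"]
    test_mask = df["date"] >= "2017-01-11"
    rates = []
    for n_days in range(1, 11):
        cutoff = pd.Timestamp("2017-01-01") + pd.Timedelta(days=n_days)
        train_mask = df["date"] < cutoff
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(X[train_mask], y[train_mask])
        pred = clf.predict(X[test_mask])
        # Alarms the model does not flag as false are still raised;
        # the residual rate is: false alarms among the raised alarms
        raised = pred == 0
        rate = float((y[test_mask][raised] == 1).mean()) if raised.any() else 0.0
        rates.append(rate)
    return rates

# Quick synthetic table as a stand-in for the simulated alarm data
rng = np.random.default_rng(0)
n = 2000
is_false = rng.random(n) < 0.3
demo = pd.DataFrame({
    "date": pd.to_datetime("2017-01-01") + pd.to_timedelta(rng.integers(0, 20, n), unit="D"),
    "hour": np.where(is_false, 18, rng.integers(0, 24, n)),
    "srcip": [f"10.1.1.{rng.integers(1, 6)}" if f else f"192.168.0.{rng.integers(1, 51)}"
              for f in is_false],
    "destip": "172.16.0.1",
    "visits": rng.integers(100, 300, n),
    "false_alarm": is_false.astype(int),
})
rates = incremental_evaluation(demo)
print([round(r, 3) for r in rates])
```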
This security threat scenario is relatively simple; we do not need many features or massive data, so we chose a random forest as the machine learning model. We also tried other, more complex models, with little difference in results. The test results are as follows:
As the training data grows, the false alarm rate on the test data drops from over 20% to about 10%. Through continuous self-learning on the alarm data and labels, many false alarms can be eliminated. As noted earlier, 9% noise was introduced into the data, so the false alarm rate cannot keep decreasing below that floor.
Our machine learning model uses four main features:
srcip, the source IP of the access
time of day of the access
visits, the number of accesses in the time window
destip, the destination IP of the access
The following figure shows the importance of eigenvalues in the model:
As expected, srcip and time of day are the features that best distinguish false alarms.
In addition, because random forests and most other machine learning models do not accept categorical variables directly, we binarize (one-hot encode) srcip and destip. The brief code is as follows:
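A minimal sketch of the binarization with pandas `get_dummies` (the sample IP values are our own stand-ins):

```python
import pandas as pd

# Stand-in alarm records; the categorical IP columns are expanded into
# one 0/1 indicator column per distinct value (one-hot encoding)
df = pd.DataFrame({
    "srcip":  ["10.1.1.1", "10.1.1.2", "10.1.1.1"],
    "destip": ["172.16.0.1", "172.16.0.2", "172.16.0.1"],
})
encoded = pd.get_dummies(df, columns=["srcip", "destip"])
print(list(encoded.columns))
# ['srcip_10.1.1.1', 'srcip_10.1.1.2', 'destip_172.16.0.1', 'destip_172.16.0.2']
```

Each row then contains a 1 in exactly one srcip column and one destip column, which the random forest can consume directly.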
Summary
In this paper, simulated experimental data and the random forest algorithm are used to verify the effectiveness of the "label transfer experience method". By having security analysis experts mark alarm logs as genuine or false, the experts' knowledge and skills are transformed into the analysis capability of a machine learning model. Compared with other methods, this one needs no manual intervention once automatic learning is in place, and it becomes ever more accurate at eliminating false alarms as data accumulates.
For details, please refer to our GitHub source code: https://github.com/ailpha/ml-reduce-false-alerts
Reference:
[1] Kalyan Veeramachaneni, Ignacio Arnaldo, et al., "AI2: Training a big data machine to defend"
*Author: AiLPHA Big Data, Anheng. Please indicate that the article is from freebuf.com when reprinting.