Aquaboutic | Focus Security Research | Vulnerability Exploit | POC


Distributed webshell detection system based on machine learning

Posted by muschett at 2020-02-29

0x01 why write this tool?

For a long time, small and medium-sized enterprises, colleges and universities, and non-Internet companies with weak technical capabilities have inevitably suffered from webshell attacks. Driven by the huge profits of the black market, attackers have planted webshells on most websites at one time or another, and as webshells become more heavily obfuscated and better concealed, traditional detection methods struggle to find and remove them accurately. It is the familiar arms race in which every advance in defense is met by a further advance in attack. On one side, detection technology keeps evolving, from static detection based on feature library matching to today's mainstream dynamic techniques: detection based on log analysis, traffic analysis, behavior analysis, statistical analysis, and so on. On the other side, WAF bypass techniques, encryption and obfuscation, and stealthy webshells that blend into business code have brought anti-detection technology to a considerable level. There is an open-source webshell collection project that gathers thousands of different kinds of webshells.

However, after carefully searching for and testing some small webshell detection tools, I found that there is almost no good open-source webshell detection tool on the Internet. Of course, some Internet companies with strong technical capabilities, such as Tencent and Alibaba Cloud, do this kind of detection well: they have massive amounts of data, and security threat intelligence analysis based on big data and machine learning makes their detection reasonably effective. However, they have not open-sourced their products. While searching online earlier, I found several papers on webshell detection technology, so, building on the theoretical analysis in these scattered sources, I plan to build a distributed webshell detection platform to help small and medium-sized enterprises detect and verify webshells as promptly as possible. Because my skills and experience are limited, there will inevitably be problems. I hope everyone passing by will share their ideas and suggestions.

0x02 discussion on webshell detection technology

After analyzing the functional modules in the source code of most webshells, the author summarizes their characteristics as follows:

(1) They call command or code execution functions, such as eval, system, cmd_shell, and assert;

(2) They call file operation functions, such as fopen, fwrite, and readdir;

(3) They contain database operations, calling the system's own stored procedures to connect to and operate on the database;

(4) They are well concealed and disguised, and can hide in web source code for a long time;

(5) They have many variants, which can bypass detection by defining custom encryption and decryption functions or by using XOR, string reversal, compression, and truncation and reassembly;

(6) They are accessed by few IP addresses and only a few times, exist as isolated pages, cannot be intercepted by traditional firewalls, and leave no system operation log;

(7) Their payload traffic is generated and recorded in the web log.

Given these characteristics, webshells can be detected in several ways. After analyzing the current mainstream detection technologies, the author roughly summarizes the following commonly used methods:

(1) Detection based on static features

(2) Detection based on traffic analysis

(3) Detection based on log analysis

(4) Detection based on behavior analysis

(5) Detection based on statistical analysis

The following is a brief analysis of each of these common detection techniques, which are also the core detection techniques the author uses in this tool.

(1) Detection technology based on static features

This method checks source files for keywords, high-risk functions, file modification time, file permissions, file owner, and relationships with other files. Its prerequisite is that a library of malicious string features is built first and then maintained.

For example, the keywords to detect include "group-specific full-featured webshell ('big horse'), Trojan horse, PHP\s, reverse shell, and command execution", and the high-risk functions include "WScript.Shell, Shell.Application, eval(), execute(), Set Server, Run(), Exec(), ShellExecute()". At the same time, check the modification time, permissions, and owner of each web file. To catch an attacker who writes a backdoor directly into the site's own source files after an intrusion, it is also necessary to compute hashes of all source files and add them to the feature library for later scanning and comparison.
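As a rough illustration of keyword and high-risk function matching, the following Python sketch scans source text line by line against a small set of regular expressions. The pattern names and the patterns themselves are assumptions chosen for illustration, not the author's actual feature library, and a real library would be far larger.

```python
import re

# Hypothetical feature library: these patterns are illustrative only,
# not the author's actual rules.
SIGNATURES = {
    "eval_call": re.compile(r"\beval\s*\(", re.IGNORECASE),
    "exec_call": re.compile(r"\b(?:exec|system|shell_exec|passthru)\s*\(", re.IGNORECASE),
    "wscript_shell": re.compile(r"wscript\.shell", re.IGNORECASE),
    "shell_application": re.compile(r"shell\.application", re.IGNORECASE),
}


def scan_source(text):
    """Return (line_number, signature_name, line) for every signature hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.append((lineno, name, line.strip()))
    return hits


# A one-line PHP backdoor used as sample input.
sample = "<?php\n$x = $_POST['c'];\n@eval($x);\n?>"
for hit in scan_source(sample):
    print(hit)
```

Reporting the line number along with the matched rule is what makes a hit easy to audit by hand.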

This method can quickly flag source files that do not match the feature library and can locate the abnormal code segments in the source, which is convenient for manual audit. The disadvantages are also obvious: it is prone to false positives, it cannot detect encrypted or specially processed webshell files, and the feature library has to be maintained. It also cannot accurately detect stealthy webshells, because they usually look very similar to normal web script files.
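The hash comparison idea mentioned above can be sketched as follows: record a digest for every script under the web root while the site is known to be clean, then compare later scans against that baseline. The function names, the suffix list, and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib
from pathlib import Path


def hash_tree(root, suffixes=(".php", ".asp", ".aspx", ".jsp")):
    """Map each script path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_to_baseline(baseline, current):
    """Return the files added, removed, or modified since the baseline."""
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    modified = sorted(
        p for p in set(baseline) & set(current) if baseline[p] != current[p]
    )
    return {"added": added, "removed": removed, "modified": modified}
```

An agent could store the result of `hash_tree` after a clean deployment and treat every added or modified script in later scans as a candidate for closer inspection.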

(2) Detection technology based on traffic analysis

In traffic (gateway) based detection, the traffic must first be made "visible" so that the payload traffic generated when a webshell is accessed can be detected. After enough payloads have been accumulated and the relevant rules customized, and in combination with other detection processes, a webshell detection engine based on traffic analysis is formed. It is embedded in an existing gateway device or cloud device to analyze and remove webshells in depth.

This method is built on a big data processing model. Typically, HTTP traffic is mirrored out of band at the core router or switch, and the massive data is then processed on a cloud computing platform such as Hadoop to detect abnormal traffic and reconstruct attack scenarios. Intrusion detection built on a Hadoop cloud computing platform is therefore essential here. There is a lot of related research material in China; the article "Large-scale network traffic analysis based on Hadoop" gives a simple introduction. In addition, a machine learning analysis model must be established. After a period of traffic accumulation and model building, the webshell analysis engine can identify abnormal traffic on its own. Because this approach involves some uncertainty, this part of the detection has to run offline.

In terms of usability, this detection method can detect in real time, quickly locate the affected hosts and the intruders, and reconstruct attack scenarios. For example, it can be deployed as a traffic gateway or combined with an IDS and similar devices for real-time prevention. As for disadvantages, the first is the deployment cost of traffic mirroring, and the second is the complexity (R&D cost) of building the Hadoop cloud computing platform and the machine learning model. Some encrypted payloads may also go undetected for a while (although machine learning may eventually catch them).

(3) Detection technology based on log analysis

Log analysis is a method of forensics and prediction. It tells the complete story (when, where, who, what, and why) of an attack that has happened, is happening, or will happen. Webshell detection through log analysis traces back from a confirmed attack event and then uses the characteristics of that event to prevent the same attack for some period in the future. The approach is to find abnormal log entries first and then the attack log entries. The whole process has two steps: webshell extraction and webshell confirmation. In large environments, Hadoop is also used to analyze the logs.

Webshell extraction is based on the following features: access features (the main feature), path features (auxiliary), time features (auxiliary), payload features (auxiliary), and behavior features.

Webshell confirmation is mainly based on the features of the page itself (content, structure, and visual appearance). After confirmation, extract the visitor characteristics (IP / UA / cookie), payload characteristics, and time characteristics of the webshell, run an associated search on them, sort the search results by time, and reconstruct the events.

For the detailed analysis approach and model, please refer to the article "Webshell detection log analysis".
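To make the access feature more concrete, here is a minimal sketch that assumes the web server writes the common or combined access log format. It flags script URLs that were requested successfully by only a handful of client IPs and only a few times, which matches the "few access IPs, few access times" characteristic listed earlier. The thresholds `max_ips` and `max_hits`, the suffix list, and the function name are hypothetical, and a hit is only a candidate for the confirmation step, not a verdict.

```python
import re
from collections import defaultdict

# Start of a common/combined access-log line:
# client IP, timestamp, request line, status code.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def suspicious_paths(lines, max_ips=2, max_hits=20,
                     suffixes=(".php", ".asp", ".aspx", ".jsp")):
    """Flag script URLs requested successfully by very few clients."""
    ips = defaultdict(set)
    hits = defaultdict(int)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m.group("status") != "200":
            continue  # skip unparsable lines and unsuccessful requests
        path = m.group("path").split("?", 1)[0]  # drop the query string
        if not path.lower().endswith(suffixes):
            continue  # only script files are of interest
        ips[path].add(m.group("ip"))
        hits[path] += 1
    return sorted(
        p for p in hits if len(ips[p]) <= max_ips and hits[p] <= max_hits
    )
```

On a real site the thresholds would need tuning against normal traffic, since rarely visited but legitimate pages (admin panels, for instance) also match this pattern.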

(4) Detection technology based on behavior analysis

Behavior-based detection looks at what a source file does when it is parsed and executed in the system environment. This includes file system behavior, such as reading, writing, creating, and deleting files; network behavior, such as listening on sockets (for example, acting as a SOCKS5 proxy) and sending TCP / UDP / HTTP requests (DDoS attacks); database behavior, such as querying, modifying, and backing up the whole database (dumping the database); and changes to important system configuration (such as the Windows registry and startup configuration files).

Taking PHP as an example, to analyze behavior during the execution of a PHP script, you can write a PHP extension module that hooks into PHP's execution mechanism and filters and blocks abnormal operations. Another way is to use honeypot technology: put the site source code into a honeypot and analyze its behavior to detect abnormal operations.

Behavior-based detection is technically difficult to implement, and centralized detection and monitoring is hard to achieve when the business runs on a large cluster. However, it is still worth studying in an experimental environment.

(5) Detection technology based on statistical analysis

Statistical analysis of webshells relies mainly on the differences between webshell scripts and normal source code. Based on the webshell characteristics analyzed above, it can be approached from the following aspects.

(1) Index of coincidence (IC) of a file: a measure used to judge whether a file is encrypted.

(2) Information entropy: an abstract mathematical concept that can be understood as a measure of the uncertainty of information, derived from the probabilities with which discrete random events occur. The more ordered a system is, the lower its information entropy; conversely, the more chaotic a system is, the higher its information entropy. Information entropy can therefore be regarded as a measure of how ordered a system is.

(3) Longest word in the file: encoded or encrypted webshells (for example, Base64-encoded ones) usually contain one very long unbroken string, which is not the case for normal source code.

(4) File compression ratio: the essence of compression is to remove the imbalance in the distribution of characters, shortening the data by assigning short codes to high-frequency characters and long codes to low-frequency characters. Base64 encoding eliminates non-ASCII characters, so the characters of a Base64-encoded file show a smaller distribution imbalance and therefore a larger compression ratio.

(5) Feature matching: first match characteristic functions and code, then match the feature values of specific known webshells.
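The first four measures are straightforward to compute. The sketch below is one possible Python formulation, not the author's implementation: the function names are assumptions, the compression ratio is defined here as compressed size divided by original size using zlib, and a "word" is taken to mean a run of non-whitespace bytes.

```python
import math
import zlib
from collections import Counter


def index_of_coincidence(data: bytes) -> float:
    """Probability that two bytes drawn without replacement are equal."""
    n = len(data)
    if n < 2:
        return 0.0
    counts = Counter(data)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (between 0.0 and 8.0)."""
    n = len(data)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def longest_word(data: bytes) -> int:
    """Length of the longest run of non-whitespace bytes."""
    return max((len(w) for w in data.split()), default=0)


def compression_ratio(data: bytes) -> float:
    """Compressed size / original size; a higher value means less compressible."""
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)
```

Text whose byte distribution is close to uniform, as Base64 or encrypted data tends to be, gives a low index of coincidence and a high entropy, while ordinary source code gives the opposite. No single measure is decisive, so they are probably best combined and used to rank files for review.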

0x03 distributed webshell detection technology

Having analyzed and studied the webshell detection technologies above, the author believes that in today's distributed, clustered network environments, using big data technology for large-scale analysis and processing, combined with intelligent machine learning models, will be the future direction of the core technology in this area.

Of course, this is the author's first attempt at this, so a lack of experience is inevitable, and I would like to exchange ideas on this technology with you.

First of all, we must consider the distributed deployment environment and adopt an agent-based deployment mode. In this early, exploratory stage, webshell detection can only start from static feature detection, log analysis, and statistical analysis. The design framework of the whole tool was initially conceived as follows:

Next, following my original design, I will describe the functional modules and composition of the distributed webshell detection platform.

In general, the functions the author considered at the beginning of the design are as follows:

(1) The agent sends its scan results (static detection results, statistical analysis results, and log filtering and analysis results) to the server. Note that these are preliminary scan results; the server must analyze and compare them further before they are finally presented visually in the web user interface.

(2) Configuration files on the agent are updated and configured by distribution from the server, which makes centralized configuration management and later maintenance easier.

(3) Before each static feature scan, the agent updates its local feature matching library by pulling it from the server. This way, only the feature library on the server (web) side needs to be maintained later.

(4) File upload function. When a script file cannot be classified with certainty, it needs to be examined manually. In that case, clicking the download link on the web side lets the administrator download the suspicious script from the agent to the local machine through the browser.

(5) Control instruction execution. Some common handling instructions are built into the system and can be executed in batches, which makes centralized, unified management directly through the web convenient.

(6) Remote shell. This is a common additional module in centralized management software. Its main purpose is to handle unforeseen situations (for example, if normal login and other system functions are damaged after a webshell intrusion, the remote shell is still of some use). To prevent this function from being abused, permissions should be assigned so that only the top-level administrator can use it.

(7) Email notification function. After the relevant rules are defined on the web side, a confirmed webshell detection is sent promptly to the administrator's mailbox. In addition, a daily report on the overall monitored status should be generated and emailed to the administrator.

(8) Scheduled task function. The business server may be busy during particular periods, and scanning at those times could affect the service. Likewise, running all scanning tasks within a single period may put pressure on the server. Scheduled tasks are therefore needed so that scanning and detection run at specified times, with the results sent to the administrator by email.

(9) To be discovered...
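A few of these functions can be sketched in code. The example below assumes a JSON report format for function (1), an integer version number for the feature library in function (3), and a daily scan window for function (8). None of these details come from the original design, and all names are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import time


@dataclass
class ScanReport:
    """Preliminary results an agent sends to the server (hypothetical schema)."""
    agent_id: str
    feature_version: int
    static_hits: list = field(default_factory=list)
    statistical_hits: list = field(default_factory=list)
    log_hits: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the report for transmission to the server."""
        return json.dumps(asdict(self), sort_keys=True)


def needs_feature_update(local_version: int, server_version: int) -> bool:
    """Pull the feature library only when the server holds a newer version."""
    return server_version > local_version


def in_scan_window(now: time, start: time, end: time) -> bool:
    """True if now is inside [start, end); the window may wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end
```

For instance, an agent configured with a window from 22:00 to 04:00 would call `in_scan_window` before starting a scan and defer the work if the call returns `False`.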

Next time, we will analyze the design of the static feature library, log analysis, and statistical analysis. In this article, I have only written down some design ideas for the distributed webshell detection platform to discuss with you; it is also an attempt at, and practice in, sharing. You are welcome to follow along and exchange ideas.

Attachment: Lecture sharing distributed webshell detection.pdf