Proxy pool is a small, integrated tool for proxy IP collection, evaluation, storage, and presentation: it automatically collects and tests available proxies, scores them, and provides a web presentation page and a web API.
Install
1. Pull the code from GitHub and place it in the web server's web directory.
On Unix/Linux the web environment can be set up quickly with https://github.com/teddysun/lamp; on Windows, phpStudy can be used for rapid deployment.
2. Create a new database named proxy in MySQL, import the proxy.sql file, and set the database password in include/config.inc.php.
3. At this point, visiting http://ip:port should show the proxy web presentation page.
4. Install the Python 2 dependency libraries.
5. Configure the database connection information and other parameters in the config.py file under the Python proxy-task directory; a sketch of the kind of settings it holds is shown below.
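The exact contents of config.py are not reproduced here; the following is only a minimal sketch of the kind of settings it holds. The database variable names are chosen for illustration and may not match the real file; the evaluation parameters are the ones described under "Parameter setting" below.

```python
# Illustrative sketch of config.py -- the real file may use different
# variable names for the database connection.

# Database connection (must match the database created from proxy.sql
# and the password set in include/config.inc.php)
DB_HOST = "127.0.0.1"
DB_PORT = 3306
DB_USER = "root"
DB_PASS = "your_password"
DB_NAME = "proxy"

# Evaluation parameters (see "Parameter setting" below)
USELESS_TIME = 4          # cumulative timeouts allowed before removal
SUCCESS_RATE = 0.8        # minimum acceptable success rate
TIME_OUT_PENALTY = 10     # seconds used as the response_time penalty on timeout
CHECK_TIME_INTERVAL = 12  # hours between availability checks
```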
Usage
The Python proxy-task directory contains two programs, proxy_get.py and proxy_check.py. The former fetches proxy IPs every day and stores them in the database; the latter cleans and evaluates the IPs already in the database.
proxy_get.py
proxy_check.py
With the default configuration, the two programs then carry out the fetching and evaluation work every day and can be left running on the server long term; a rough sketch of such a loop is shown below.
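As a rough illustration (not the project's actual code), a long-running task such as proxy_check.py can be pictured as a loop of this shape, waking up once per CHECK_TIME_INTERVAL:

```python
# Illustrative sketch of a long-running check loop; the real
# proxy_check.py may be structured differently.
import time

CHECK_TIME_INTERVAL = 12  # hours, as configured in config.py

def check_all_proxies():
    """Re-test every proxy in the database and update its score."""
    pass  # placeholder for the real checking logic

while True:
    check_all_proxies()
    time.sleep(CHECK_TIME_INTERVAL * 3600)
```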
Brief introduction
The original code is here: https://github.com/chungminglu/Proxy
I modified part of the code, improved the parsing of several proxy sources, and added the web presentation page and web API so that other programs can call it.
The web page was adapted from another scanner, https://github.com/TideSec/WDScanner/, so some leftover code may not have been removed.
The main functions of the program:
1. Every day, the latest high-anonymity proxy IPs are fetched from multiple proxy IP websites.
2. The IPs that pass screening are stored in the database.
3. The IPs in the database are re-tested every day, with an elimination and scoring mechanism: IPs that fail repeatedly are deleted and every IP is scored, so stable, low-latency, high-quality IPs can be obtained by ranking on the score.
The web presentation page looks like this:
The web API looks like this:
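Since the web API is intended to be called by other programs, a call might look roughly like the snippet below. The endpoint path and parameter name are placeholders, not the project's documented API; check the PHP code under the web directory for the real interface.

```python
# Illustrative only: "api.php" and the "num" parameter are placeholders,
# not the actual interface exposed by this project.
import requests

base = "http://your-server-ip"  # where the web directory is deployed

resp = requests.get(base + "/api.php", params={"num": 10}, timeout=10)
print(resp.text)  # e.g. a list of the highest-scoring proxies
```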
Parameter setting
1. The proxy evaluation parameters can be set in the config.py file under the Python proxy-task directory.
2. Besides the database configuration, the main parameters are the following:
USELESS_TIME and SUCCESS_RATE are used together: an IP is removed when its cumulative number of timeouts exceeds USELESS_TIME (4 here) and its success rate falls below SUCCESS_RATE (0.8 here), which takes both the short-term and long-term detection performance of the IP into account.
TIME_OUT_PENALTY: when an IP times out on a check but does not yet meet the removal condition above (for example, its first timeout after 100 checks), its response_time is given a penalty value, set to 10 seconds here. A sketch of how these parameters interact is shown after this list.
CHECK_TIME_INTERVAL: the interval between availability checks; with this setting, every IP in the database is re-checked every 12 hours.
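A minimal sketch of how these parameters might interact during a single availability check (illustrative only; the field and function names are placeholders, not the project's actual code):

```python
# Illustrative sketch of updating one proxy record after a check;
# the real proxy_check.py may differ.
USELESS_TIME = 4
SUCCESS_RATE = 0.8
TIME_OUT_PENALTY = 10  # seconds

def update_after_check(proxy, timed_out, elapsed_seconds):
    proxy["test_times"] += 1
    if timed_out:
        proxy["timeout_times"] += 1
        # treat the timed-out request as a very slow one
        proxy["response_time"] = TIME_OUT_PENALTY
    else:
        proxy["success_times"] += 1
        proxy["response_time"] = elapsed_seconds
    success_rate = float(proxy["success_times"]) / proxy["test_times"]
    # removal condition: too many timeouts AND a low success rate
    if proxy["timeout_times"] > USELESS_TIME and success_rate < SUCCESS_RATE:
        proxy["deleted"] = True
```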
Strategy
1. Every day, the highest-anonymity proxy IPs are fetched from the following five proxy IP websites:
Mimi
66ip
Xici
Cn-proxy
Kuaidaili
2. N rounds of screening
The collected IPs go through N rounds of connectivity tests at an interval of T; an IP is written to the database only after it passes all N rounds. If too few IPs enter the database on a given day, the system pauses for a period of time (one day) before crawling again. A sketch of this is given at the end of this section.
3. Evaluation criteria for IPs in the database
During detection, an IP whose cumulative number of timeouts exceeds USELESS_TIME and whose success rate falls below SUCCESS_RATE is eliminated. Each remaining IP is scored as score = (success_rate + test_times / 500) / avg_response_time. The original idea was simply score = success_rate / avg_response_time, i.e. success rate divided by average response time, but since an old IP that has survived 100 checks is more valuable than a new one, the number of checks is also factored into the score.
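As an illustration of points 2 and 3 above, the screening and scoring could be sketched as follows; the helper names, the test URL, and the round parameters are placeholders, not the project's actual code:

```python
# Illustrative sketch of N-round screening and the scoring formula;
# names and values below are placeholders.
import time
import requests

N_ROUNDS = 3            # number of screening rounds (placeholder)
ROUND_INTERVAL = 600    # seconds between rounds, "T" (placeholder)
TEST_URL = "http://www.baidu.com"  # connectivity-test target (placeholder)

def passes_all_rounds(ip, port):
    """An IP enters the database only if every round succeeds."""
    proxies = {"http": "http://%s:%s" % (ip, port)}
    for _ in range(N_ROUNDS):
        try:
            requests.get(TEST_URL, proxies=proxies, timeout=5)
        except requests.RequestException:
            return False
        time.sleep(ROUND_INTERVAL)
    return True

def score(success_rate, test_times, avg_response_time):
    """score = (success_rate + test_times / 500) / avg_response_time"""
    return (success_rate + test_times / 500.0) / avg_response_time
```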