Blocking Aggressive Chinese Mobile Browser Bots

Posted on November 17, 2019

More spam from China… this time, it’s mobile… or is it? Recently, I’ve been seeing some domains on our system getting as many as 500 visits per 15 minutes from unique Chinese IP addresses. The common theme, besides geographic location? All of them appear to be masquerading as mobile devices or mobile browsers:

Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 MicroMessenger/ NetType/WIFI Language/zh_CN

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0

Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/ Mobile Safari/537.36

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3

A typical request from any of these “mobile” user agents looks like this in my nginx and Apache access logs. Many of the requested URLs were fairly random as well:

- - [17/Nov/2019:16:33:24 -0800] "domain.com" "GET /index.php?s=22f97a8b556c4cc64429201399e5b7db&showuser=19 HTTP/1.1" 200 "-" "Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0" "-"

This is a major nuisance for any session tables that track online lists, and it is an overall waste of resources. I started with IP addresses and firewalled those first. The problem with that approach, as I soon realized, is that there are thousands of IP addresses involved in whatever this Chinese scraping / crawling / botting is. A couple of hours of log collection yielded over 2,000 China-based IP addresses spread across completely different ISPs and subnets. That is when I began to look toward the user agent for a solution. Luckily, the user agents do have some “unique” components… I narrowed them down to:

MicroMessenger/ - This is the built-in web browser of WeChat, a Chinese chat program.

Mb2345Browser/9.0 - Mb2345Browser identifies itself as the mobile browser for the Chinese web directory 2345.com.

LieBaoFast/4.51.3 - I was not able to find any information on what LieBaoFast is, but the name certainly sounds Chinese.

MQQBrowser/6.2 - The mobile version of QQ Browser (hence the “M”), a web browser used in mainland China.

zh_CN - Chinese (China) language tag, written with an underscore.

zh-CN - The same Chinese (China) language tag, written with a hyphen.

This proved to be useful. You can learn a lot from studying traffic like this and seeing which user agents are delivering it. For example, having used WeChat myself before, there is simply no way that the chat application’s built-in web browser is sending so many thousands of requests out to random pages on a site. This is clearly fake traffic, and I have learned to identify it from years of combing through logs and watching how “normal people” use my web applications.

So I crafted the following simple nginx if conditionals to block requests when any of the first four elements (MicroMessenger, Mb2345Browser, LieBaoFast, MQQBrowser) is matched in the user agent. They have not shown a shred of valid traffic on our systems… so the possibility of false positives in our case is going to be very low, as those access methods are either bogus or extremely obscure. This approach also won’t block the very small amount of valid traffic from Chinese IP address ranges, nor popular web browsers.

The conditionals for blocking the user agent elements that the mobile bots and crawlers are masquerading with are as follows:
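A minimal sketch of what such conditionals can look like, assuming they are placed inside a server block. The 403 response and the case-insensitive ~* match are illustrative assumptions here; adjust both to suit your setup:

```nginx
# Place inside a server { } block (assumed location).
# Each conditional rejects the request when the named token appears
# anywhere in the User-Agent header. ~* is a case-insensitive regex match.
# The 403 status is an assumption; any rejection status could be used.

if ($http_user_agent ~* "MicroMessenger/") {
    return 403;
}

if ($http_user_agent ~* "Mb2345Browser") {
    return 403;
}

if ($http_user_agent ~* "LieBaoFast") {
    return 403;
}

if ($http_user_agent ~* "MQQBrowser") {
    return 403;
}
```

Matching on the browser name alone, without the version number, means a bot that bumps its version string is still caught. Run nginx -t before reloading to confirm the configuration parses.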

You can also use | (regex alternation) to block multiple strings in a single conditional instead of breaking them up into individual conditionals. For example, if you wanted to block zh-cn and zh_cn, here is an example that will accomplish this.
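A sketch of the combined form, under the same assumptions as before (inside a server block, 403 as the response):

```nginx
# One conditional covering both spellings of the language tag.
# ~* makes the match case-insensitive, so zh_CN and zh-CN also match.
# The 403 status is an assumption, as above.
if ($http_user_agent ~* "zh-cn|zh_cn") {
    return 403;
}
```

Keep in mind that this pattern is broader than the four browser tokens: any client that includes a Chinese language tag in its user agent would be rejected, so it is worth checking your own logs for legitimate matches before enabling it.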

Blocking the unique, obscure user agents of these bots / crawlers / scrapers is the most efficient approach at this time and will catch most of the traffic. I’m unsure why there are multiple thousands of IP addresses involved in this, and I’ll update this blog post if further research yields new information.