How to Detect and Manage Web Bots?

What is Bot Traffic?

Bot traffic refers to any traffic to a website or platform that does not originate from human users. Although the term ‘bot traffic’ might sound harmful at first, it’s important to remember that there are also good bots that serve beneficial purposes.

In fact, some bots are essential to the success of a website, like Googlebot or Bingbot. However, there are also bots that are malicious in nature, used for purposes ranging from launching DDoS attacks and scraping content or data to outright data theft.

Around 40-50% of total internet traffic is bot traffic, and a large share of it comes from bad bots. This is why detecting traffic from bad bots, differentiating it from good bots and legitimate users, and managing that traffic is a concern for many businesses.

Current Challenges in Detecting Malicious Bot Traffic

As briefly discussed above, the challenge of detecting malicious bot traffic has two layers: differentiating between bot traffic and legitimate human traffic, and distinguishing good bots from bad bots.

Distinguishing bots from human users alone has become a complex task. Bad bots in particular are evolving rapidly, with malicious developers adopting the latest technologies faster than ever before, and these bots are purposely designed to evade traditional bot detection systems. Telling them apart from good bots is even more difficult.

With that being said, internet bots have evolved dramatically in recent years, and we can classify these bots (especially bad bots) into four ‘generations’:

  • First-generation bots or gen-1: built with basic scripting tools, mainly performing simple automated tasks like scraping, form spam, and carding. Mitigating them used to be simple, since they often use inconsistent UAs (user agents) and typically make thousands of requests from just one or two IP addresses (a volume-based detection sketch follows this list).
  • Second-generation bots or gen-2: operate through website development and testing tools known as ‘headless browsers’. They are still relatively easy to detect due to their characteristic JavaScript firing and iframe tampering.
  • Third-generation bots or gen-3: enable what are known as ‘low-and-slow’ DDoS attacks, but can also be used for identity theft, API abuse, and other applications. They are fairly difficult to detect from device and browser characteristics alone and require proper behavioral and interaction-based analysis to identify.
  • Fourth-generation bots or gen-4: currently the newest iteration of bots; they can perform human-like interactions such as non-linear mouse movements and can also rotate their IP addresses. Advanced detection methods, often involving AI and machine learning technologies, are required to detect these bots.
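
To make the gen-1 pattern concrete, here is a minimal sketch of that volume-based check, assuming access logs already parsed into (IP, user agent) pairs. The record format and both thresholds are illustrative assumptions to be tuned against your own traffic, not values from any particular tool.

```python
from collections import defaultdict

# Hypothetical parsed access-log records as (client_ip, user_agent) pairs.
requests = [
    ("203.0.113.7", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
    ("203.0.113.7", "curl/7.88.1"),
    ("198.51.100.23", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"),
]

REQUEST_THRESHOLD = 1000  # illustrative cutoffs, tune to your own traffic
UA_VARIETY_THRESHOLD = 5

hits = defaultdict(int)
agents = defaultdict(set)
for ip, ua in requests:
    hits[ip] += 1
    agents[ip].add(ua)

# Flag IPs with gen-1 signatures: huge request volume or many different UAs.
suspects = [
    ip
    for ip in hits
    if hits[ip] > REQUEST_THRESHOLD or len(agents[ip]) > UA_VARIETY_THRESHOLD
]
print(suspects)
```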

The latest generation of bots (gen-4) is very hard to differentiate from legitimate human users, and basic bot detection technologies are no longer sufficient. So, how can we detect and manage these bots properly? Let us discuss that in the next section.

How to Detect Bot Traffic? Differentiating Between Bots and Human Visitors

Here we will tackle the first layer of the challenge: how to detect bot traffic and distinguish it from human traffic.


Fortunately, any analytics tool that can analyze your website traffic will help here. Google Analytics, for example, is a good place to start. Then, we can check the following metrics:

  • Traffic spikes

If you see a surge in traffic lasting anywhere from a single day to a week, it can be a sign of bot traffic. Typically, your traffic should grow steadily over time in line with your marketing performance. If, for example, you’ve seen an improvement in SERP ranking, you can (and should) expect an increase in traffic; the same can be said when there’s a new product launch. So, if there’s a spike without any correlation to your activities, you should take a closer look at that time period.
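
A minimal sketch of how such a spike check might look, assuming daily session counts exported from an analytics tool; the sample numbers and the 3-standard-deviation threshold are illustrative assumptions, not a standard.

```python
import statistics

# Hypothetical daily session counts exported from an analytics tool;
# the last value represents a sudden, unexplained surge.
daily_sessions = [1200, 1250, 1190, 1300, 1280, 1320, 4800]

baseline = daily_sessions[:-1]
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

# Flag the latest day if it sits far outside the recent baseline.
if daily_sessions[-1] > mean + 3 * stdev:
    print(f"Possible bot-driven spike: {daily_sessions[-1]} vs baseline ~{mean:.0f}")
```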

  • Traffic sources

This is another important metric to look at. ‘Healthy’ traffic can come from a variety of channels according to your marketing and promotional activities: organic search, direct traffic, social media, paid campaigns, and referrals. Bot traffic, however, commonly shows up as direct traffic consisting of new unique users and sessions.

Also, suspicious hits from a single IP address are the most basic form of bot activity and should be the easiest to detect (and manage). You should also take note when there is increased activity on your site from locations you don’t cater to, or when you see hits in languages other than your primary website language.
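
As a sketch of that location/language check, assuming session records enriched with a GeoIP lookup and the browser’s Accept-Language header; the field names and the lists of served countries and expected languages are illustrative assumptions.

```python
# Hypothetical session records, e.g. access-log entries enriched with a
# GeoIP lookup and the Accept-Language header. Field names are illustrative.
sessions = [
    {"ip": "198.51.100.4", "country": "US", "language": "en-US"},
    {"ip": "192.0.2.99", "country": "CN", "language": "zh-CN"},
]

SERVED_COUNTRIES = {"US", "CA", "GB"}  # markets the site actually caters to
EXPECTED_LANGUAGES = {"en"}            # primary site language

for s in sessions:
    wrong_region = s["country"] not in SERVED_COUNTRIES
    wrong_language = s["language"].split("-")[0] not in EXPECTED_LANGUAGES
    if wrong_region or wrong_language:
        print(f"Review traffic from {s['ip']} ({s['country']}, {s['language']})")
```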

  • Bounce rate

An abnormally high bounce rate over a period of time, as well as a surge in new sessions, can be a major sign of bot traffic. Likewise, a sudden and unnatural drop in bounce rate (to below 25%, or below your usual percentage) might be a sign of malicious bot activity.
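
A tiny sketch of that bounce-rate sanity check; the sub-25% floor comes from the text above, while the 90% ceiling is an illustrative assumption.

```python
# Bounce rate = single-page sessions / total sessions.
def bounce_rate(single_page_sessions: int, total_sessions: int) -> float:
    return single_page_sessions / total_sessions

rate = bounce_rate(single_page_sessions=130, total_sessions=1000)  # 13%
if rate < 0.25 or rate > 0.90:
    print(f"Unusual bounce rate ({rate:.0%}); inspect this period for bot traffic")
```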

  • Website/server performance

A significant slowdown in your website’s performance might be a sign that your servers are strained by bot traffic.

Regularly monitoring these metrics can be an effective way of detecting bot traffic and activity. However, this is mainly a manual process that is not only time-consuming and labor-intensive but also generally ineffective at differentiating between good and bad bots and at mitigating only the malicious activity. For that, we need a different approach, as discussed below.

Bot Detection Techniques

There are several types of bot detection techniques for distinguishing bots from humans and bad bots from good ones:

Behavioral Detection

This approach focuses on analyzing behaviors commonly exhibited by human users: for example, non-linear mouse movements, certain typing habits, browsing speed, and so on. By analyzing these behaviors, the detection system predicts whether the traffic comes from a human or a bot.

We mentioned above that gen-4 bots are very good at mimicking human behavior, but advanced behavioral detection tools can still spot the difference. Here are some common activities tracked in a behavioral detection approach (a scoring sketch follows this list):

  • Mouse movements (non-linear and randomized vs. linear and patterned)
  • Mouse clicks (bots may click with a uniform rhythm)
  • Scrolling behavior
  • Keys pressed
  • Total number of requests during a session
  • Number of pages viewed during a session
  • Order of pages viewed and the presence of a pattern
  • Average time between pages
  • Whether certain resources are blocked (some bots block resources not useful for their mission to save bandwidth)
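
To give a taste of interaction-based analysis, here is a minimal sketch that scores how ‘straight’ a recorded mouse path is; a near-perfectly linear path is one weak bot signal among the many listed above. The coordinate samples and the idea of thresholding on correlation are illustrative assumptions.

```python
# Minimal sketch: how "straight" is a recorded mouse path? Scripted cursor
# movement is often close to perfectly linear; human movement rarely is.
def path_linearity(points: list[tuple[float, float]]) -> float:
    """Absolute Pearson correlation between x and y along a mouse path.
    A value near 1.0 means a near-perfectly straight path."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 1.0  # no variation on one axis: a perfectly straight line
    return abs(cov / (sx * sy))

# Hypothetical captured cursor samples from two sessions.
bot_like = [(i, 2 * i) for i in range(50)]  # perfectly linear sweep
human_like = [(0, 0), (3, 1), (5, 6), (9, 4), (12, 11), (15, 9)]

print(path_linearity(bot_like))    # ~1.0 -> one suspicious signal
print(path_linearity(human_like))  # noticeably lower
```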

While behavioral detection is mainly used to differentiate between bot and human traffic, it is also effective at recognizing bad bots, since malicious bots tend to exhibit certain behaviors (e.g. when a visitor is scraping data, we can be fairly sure it’s a bad bot).

Fingerprinting Detection Technique

In fingerprinting-based detection, the system gathers information about the browser and device used to access the website in order to detect common signatures carried by bad bots. A fingerprinting system usually collects multiple attributes and analyzes whether they are consistent with each other, to check for spoofing or modification.

Here are the common approaches to fingerprinting bot traffic (a consistency-check sketch follows this list):

  • Browser fingerprinting: the main approach is to check for attributes introduced by headless or automated browsers like PhantomJS, Nightmare, Puppeteer (headless Chrome), Selenium, and others. However, advanced bot developers can remove these attributes.
  • Checking browser consistency: checking for features that should or should not be present in the browser the client claims to be. This can also be done by executing certain JavaScript challenges.
  • Checking OS consistency: similar to the above, but here we check the consistency of the OS claimed in the UA (user agent).
  • Inconsistent behavior: comparing a browser’s observed features against those of a headless browser, specifically to check whether the browser is running in headless or modified mode.
  • Red pills: checking whether the browser is running inside a virtual machine, which is a strong telltale sign of a bot.
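
Below is a minimal sketch of the consistency-checking idea, assuming a client-side script has reported a small fingerprint payload to the server. The HeadlessChrome UA token, navigator.webdriver, and an empty plugin list are well-known public headless-Chrome indicators, but the payload field names here are assumptions for illustration; real systems collect dozens of attributes.

```python
# Server-side consistency check over a hypothetical fingerprint payload.
def headless_signals(fp: dict) -> list[str]:
    signals = []
    ua = fp.get("ua", "")
    if "HeadlessChrome" in ua:
        signals.append("headless Chrome token in the user agent")
    if fp.get("webdriver"):
        # navigator.webdriver is typically true under browser automation
        signals.append("navigator.webdriver is set")
    if "Chrome" in ua and fp.get("plugins_count", 0) == 0:
        signals.append("Chrome UA but no plugins (common in headless mode)")
    if not fp.get("languages"):
        signals.append("empty navigator.languages")
    return signals

sample = {
    "ua": "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/120.0.0.0",
    "webdriver": True,
    "plugins_count": 0,
    "languages": [],
}
print(headless_signals(sample))  # all four signals fire for this payload
```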

Using CAPTCHA

Most of us are familiar with the concept of a CAPTCHA. The idea is that the challenge presented should be (very) easy for human users but very difficult for bots and automated programs to complete.

Image and audio recognition tasks are popular in CAPTCHA applications. However, recent advancements in image and audio recognition have made these approaches far less effective than they used to be.
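
Purely to illustrate the challenge/verify flow in its simplest form, here is a toy sketch; real CAPTCHAs rely on harder perception tasks precisely because a bot can solve arithmetic like this instantly.

```python
import random

# Toy CAPTCHA flow: issue a challenge, then verify the answer before
# serving the protected resource.
def issue_challenge() -> tuple[str, int]:
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"What is {a} + {b}?", a + b

def verify(answer: int, expected: int) -> bool:
    return answer == expected

question, expected = issue_challenge()
print(question)
print(verify(expected, expected))  # a correct answer passes: True
```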

End Words

There is no one-size-fits-all approach to bot detection; each technique has its own benefits and drawbacks depending on the specific use case. Some work better at discerning a certain type of bot, while others are better at differentiating between bots and legitimate traffic.

However, today’s gen-4 bots can imitate human behavior very well, and they are distributed using various sophisticated methods, so IP-based detection alone is no longer sufficient. Advanced bot detection and protection software is now necessary if you want to safeguard your system from the cybersecurity threats tied to malicious bot activity, such as DDoS attacks, identity theft, and data/content scraping.


About Author: Mike Khorev is an SEO expert and marketing consultant.
