How to tell if scrapers are eating your bandwidth

Posted by:

Dino Kukic

June 12, 2026

Bunny Shield makes identifying and blocking bots trivial. However, if you aren’t using it and have noticed anomalies in your bandwidth, we’ve prepped this guide to help you determine whether the reason is a pop star tweeting about your website or, unfortunately, a surge in bot traffic that's now costing you real money.

Not all bots are the bad guys

Before you start blocking things, it's worth understanding what's actually out there. The term "bot" covers everything from the crawler that helps get your content in Google to the script copying your entire product catalog to sell to a competitor. Treating them all the same is how you accidentally remove your site from search results.

Here's the roughly sorted landscape.

Search engine crawlers

Googlebot, Bingbot, and, yeah, mostly Googlebot. They crawl and index your content and then serve it for relevant queries in the search engine results page (SERP). For the most part, you want these, and blocking them might harm your site unless you don’t need traffic coming from search engines at all. One caveat here is that Bingbot increasingly feeds Copilot, Applebot feeds Apple Intelligence, and Google's crawl feeds Gemini. So they are being used for other things as well, and if those use cases aren't an issue, this is actually a good thing: your site is crawled once (periodically) for multiple purposes. It becomes an issue only when you want one of those uses but not the other.

AI search bots

OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-SearchBot, and others. These fetch your page in real time to answer a user's current question, often with a citation back to you. They're more like search crawlers than training crawlers because they can send real visitors your way.

AI training crawlers

GPTBot, ClaudeBot, Meta-ExternalAgent, and the rest. They collect content to train large language models. They can use real bandwidth and generally don't send traffic back the way a search crawler does, though that may be changing as AI assistants increasingly cite and link their sources. Whether that's worth the bandwidth depends on how you value having your content represented in the models.

Agentic and assistant bots

These are bots completing a task on behalf of a specific user, such as booking, buying, comparing, summarizing, and more. These often represent a real person with real intent, so blocking them can mean blocking a customer. The category is immature and hard to identify cleanly.

Marketing intelligence crawlers

AhrefsBot, SemrushBot, and similar. They crawl your site to build the commercial SEO datasets behind tools marketers use to track rankings, find backlinks, and size up competitors. The value to you is indirect and slightly circular. These crawlers help build the same SEO datasets you might use to analyze competitors. Whether that’s worth the crawl is a judgment call.

Social / link-preview bots

FacebookExternalHit, Twitterbot, LinkedInBot, Slackbot, Discordbot, etc. They fetch a page to build the preview card when someone shares your link. These are triggered by real people sharing your content. If you block them, your links may render as ugly bare URLs everywhere. They're generally low volume but potentially high value.

Commercial / vertical scrapers

Price-comparison engines, job aggregators, review aggregators. Whether you want them depends entirely on your business. A price comparison bot is great if you want to be compared and a problem if a competitor is using it to undercut you.

Research / archival crawlers

The Internet Archive's crawler, academic datasets, and preservation projects. Usually benign and often a public good. Common Crawl is the one that complicates the picture a little bit because it's a long-running open dataset used widely in research, but it's also been a common source of training data for language models. This means that site owners who want to limit AI training sometimes block its crawler (CCBot) too, even though its purpose is broader than that.

Monitoring bots

Uptime checkers (Pingdom, UptimeRobot), performance monitors (Catchpoint, Datadog), and your own health checks. You want the ones you recognize, especially your own. They are usually low volume and mostly harmless.

Malicious bots

Content thieves, hostile price and data scrapers, credential-stuffing and brute-force bots, vulnerability scanners, spam bots, and inventory-hoarding scalpers. This is usually what you want to identify, and the only hard part is detecting them because they actively try to look like everything else on this list.

So the most important thing is that any of these can be faked. A request claiming to be Googlebot might be a scraper in sheep’s clothing.

How to diagnose it from your logs

Everything below runs against standard access logs. The commands assume the combined log format, so adjust field positions for your own format. Also, all of these are artificially generated examples, so real logs will have a little more nuance.

Growth (or lack of) in analytics sessions and conversions

Before you dig into the logs, you can just look at your bandwidth compared to analytics sessions. For example:

Metric	Last month	This month	Change
Bandwidth	2.1 TB	3.8 TB	+81%
Analytics sessions	48,200	49,100	+2%
Conversions	1,840	1,810	-2%

Bandwidth is up 81% while sessions remain roughly flat. Most analytics tools run via JavaScript, and most bots don't execute JavaScript, so they're invisible to your analytics but still consume bandwidth.
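
The comparison is simple arithmetic, and it can be sketched as below. This is only an illustration using the figures from the table; the helper names and the choice of 1 TB = 1,000,000 MB are assumptions. Bandwidth per session is a derived figure that is not in the table, but it makes the gap easy to see.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def mb_per_session(bandwidth_tb: float, sessions: int) -> float:
    """Average bandwidth per analytics session, assuming 1 TB = 1,000,000 MB."""
    return bandwidth_tb * 1_000_000 / sessions


# Figures from the table above.
last_month = {"bandwidth_tb": 2.1, "sessions": 48_200}
this_month = {"bandwidth_tb": 3.8, "sessions": 49_100}

if __name__ == "__main__":
    bw = pct_change(last_month["bandwidth_tb"], this_month["bandwidth_tb"])
    ses = pct_change(last_month["sessions"], this_month["sessions"])
    print(f"bandwidth {bw:+.0f}%, sessions {ses:+.0f}%")
    print(
        f"MB per session: {mb_per_session(**last_month):.1f} -> "
        f"{mb_per_session(**this_month):.1f}"
    )
```

If bandwidth per session nearly doubles while sessions stay flat, something that your analytics cannot see is doing the downloading.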

Request rate per client

Find out who's making the most requests:

awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10

  84213 47.128.44.19
  61887 47.128.44.20
  58122 47.128.44.21
   9043 66.249.66.1
   2287 81.150.12.4
    412 81.150.12.9
    389 92.40.177.22
    301 213.205.241.9
    288 51.171.38.4
    274 78.149.203.11

The top three IPs sit in one tight block, 47.128.44.x, and each is making roughly 58,000 to 84,000 requests, while ordinary visitors trail off into the hundreds. The 9,043 requests from 66.249.66.1 claim to be Googlebot (we’ll check that claim later). The three at the top look like a coordinated scrape from a single subnet.

Before eyeballing this list, calculate the median number of requests per IP. That gives you a baseline for what traffic normally looks like on your website, and the clients worth investigating are the ones making 10 or 100 times the median.
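
A minimal sketch of that baseline check, assuming the client IP is the first whitespace-separated field on each log line. The function names and the default factor of 10 are illustrative choices, not a recommendation.

```python
from collections import Counter
from statistics import median


def requests_per_ip(lines):
    """Count requests per client, taking the first field of each line as the IP."""
    return Counter(line.split(maxsplit=1)[0] for line in lines if line.strip())


def heavy_clients(counts, factor=10):
    """Return (median, flagged) where flagged lists the (ip, count) pairs
    at or above `factor` times the median, busiest first."""
    if not counts:
        return 0, []
    baseline = median(counts.values())
    flagged = [(ip, n) for ip, n in counts.most_common() if n >= factor * baseline]
    return baseline, flagged
```

To use it, pass an open log file to `requests_per_ip` and feed the result to `heavy_clients`. The median is used rather than the mean because a handful of very heavy clients would otherwise drag the baseline up and hide themselves.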

The asset-loading signature

This is one of the cleanest tells you have. When a real browser loads a page, it fetches the HTML and then everything the page references, such as CSS, JavaScript, fonts, and images. A scraper usually grabs only the HTML.

Pull everything requested by a suspect IP:

grep "^47\.128\.44\.19 " access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -10

For the scraper, it's all pages, no assets:

   2104 /products/12841
   2103 /products/12842
   2101 /products/12843
   2099 /products/12844

Now run the same command against a real visitor, and the shape is completely different:

     22 /images/product-12841-thumb.webp
     14 /products/12841
      9 /css/main.a3f9.css
      9 /js/app.8c21.js
      7 /fonts/inter.woff2
      6 /api/cart/count

The real browser pulls the page plus its stylesheet, scripts, fonts, images, and a cart API call, so practically everything needed to actually render and use the page. The scraper pulls pages and nothing else. That ratio, HTML-only versus HTML-plus-assets, is hard to fake convincingly because faking it means doing real rendering work, which defeats the point of scraping cheaply.
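
The same ratio can be computed for every client at once. The sketch below assumes the combined log format, where the request path is the seventh whitespace-separated field, and it treats a fixed list of file extensions as "assets". Both the extension list and the function name are assumptions to adapt to your own site.

```python
from collections import defaultdict

# Illustrative list of static-asset extensions; extend it for your own site.
ASSET_EXTENSIONS = (
    ".css", ".js", ".woff", ".woff2", ".png", ".jpg",
    ".jpeg", ".webp", ".gif", ".svg", ".ico",
)


def asset_share(lines):
    """Map each client IP to the fraction of its requests that were for assets."""
    totals = defaultdict(int)
    assets = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip lines that do not look like access-log entries
        ip = fields[0]
        path = fields[6].split("?", 1)[0]  # drop any query string
        totals[ip] += 1
        if path.lower().endswith(ASSET_EXTENSIONS):
            assets[ip] += 1
    return {ip: assets[ip] / totals[ip] for ip in totals}
```

A busy client with a share at or near zero is pulling pages without ever rendering them. Clients served mostly from browser cache will also show a lower share, so treat this as one signal among several.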

No JavaScript execution

This is closely related, but it's worth confirming separately. Most scrapers don't run JavaScript at all, so they never hit the endpoints your frontend fires after the page loads, such as analytics scripts, lazy-loaded API calls, and tracking pixels.

Count the client-side instrumentation hits from your suspect:

grep "^47\.128\.44\.19 " access.log | grep -E "/api/|/track|/analytics|/beacon" | wc -l

For a scraper, this comes back as 0. A real browser session fires these constantly, so anything above zero, often well into the dozens per session, is a strong signal of real browser activity.

Sequential / systematic URL patterns

Humans browse associatively. They follow what interests them, jump around, double back, and so on. Scrapers walk in straight lines.

Look at a suspect's requests in time order:

grep "^47\.128\.44\.19 " access.log | awk '{print $4, $7}' | head -12

The scraper sweeps through IDs in perfect order, about two per second, with no pauses:

[10:42:01] /products/12841
[10:42:01] /products/12842
[10:42:02] /products/12843
[10:42:02] /products/12844
[10:42:03] /products/12845
[10:42:03] /products/12846
[10:42:04] /products/12847
[10:42:04] /products/12848
[10:42:05] /products/12849
[10:42:05] /products/12850
[10:42:06] /products/12851
[10:42:06] /products/12852

A real visitor's path looks nothing like that:

[10:41:55] /
[10:42:03] /products/winter-jacket
[10:42:31] /products/winter-jacket?color=navy
[10:42:58] /cart
[10:43:12] /products/wool-scarf
[10:44:40] /products/winter-jacket

Homepage, a product, a variant of that product, the cart, a different product, and then back to the first one, with irregular gaps of eight to nearly ninety seconds where a person was actually reading.
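
One rough way to put a number on "walks in straight lines" is to measure how often a client's next request is for the next numeric ID. The sketch below assumes URLs that end in a numeric ID, as in the example above; the function name and the regular expression are illustrative.

```python
import re

# Matches a numeric ID as the last path segment, e.g. /products/12841
_TRAILING_ID = re.compile(r"/(\d+)/?$")


def sequential_share(paths):
    """Fraction of consecutive request pairs whose trailing ID rises by exactly 1.

    `paths` must be in the order the client requested them.
    """
    ids = []
    for path in paths:
        match = _TRAILING_ID.search(path.split("?", 1)[0])
        ids.append(int(match.group(1)) if match else None)
    pairs = list(zip(ids, ids[1:]))
    if not pairs:
        return 0.0
    hits = sum(
        1 for a, b in pairs
        if a is not None and b is not None and b - a == 1
    )
    return hits / len(pairs)
```

For the scraper's twelve requests above, every step is the next ID, so the share is 1.0. For the visitor's path, it is 0.0. Real traffic will sit somewhere in between, and a scraper that shuffles its queue will not be caught by this check alone.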

User-agent distribution

Now look at what everything is claiming to be:

awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn | head -12

 204417 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
 184992 python-requests/2.31.0
  61003 Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)
  44781 Mozilla/5.0 (compatible; ClaudeBot/1.0; +mailto:claudebot@anthropic.com)
  29550 Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
  18204 Go-http-client/2.0
  12876 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
   9043 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
   3001 Scrapy/2.11 (+https://scrapy.org)
   2422 curl/8.4.0
    880 (empty)

Reading down the list:

The 204k "Chrome" entries look human, but this line almost certainly hides spoofed bots because impersonating Chrome is the easiest disguise there is.
184k python-requests: that's a scripting library, and a Python HTTP client as the second-busiest "user" is a strong sign of automation.
GPTBot, ClaudeBot, and AhrefsBot: bots that honestly declare themselves. You can decide what to do with each by name.
Go-http-client, Scrapy, curl, and the blank entry are almost all automation, although some of the curl requests could be individual people pulling something from a terminal.

The honest bots label themselves, and the lazy scrapers don't bother to hide. The tricky ones might be hiding inside that first "Chrome" line, which is why the User-Agent header is just one of the things to check.
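
To turn that reading into totals, the user agents can be sorted into coarse buckets. This is a sketch only: the token lists below are taken from the example output and are assumptions, not a complete registry of bots or HTTP libraries.

```python
from collections import Counter

# Tokens drawn from the example output above; extend them for your own logs.
DECLARED_BOTS = ("googlebot", "bingbot", "gptbot", "claudebot", "ahrefsbot")
HTTP_LIBRARIES = ("python-requests", "go-http-client", "scrapy", "curl/")


def classify_user_agent(ua):
    """Place a User-Agent string into one of four coarse buckets."""
    s = ua.strip().lower()
    if s in ("", "-"):
        return "empty"
    if any(token in s for token in DECLARED_BOTS):
        return "declared bot"
    if any(token in s for token in HTTP_LIBRARIES):
        return "http library"
    return "browser-like"


def bucket_totals(ua_counts):
    """Sum request counts per bucket from a {user_agent: count} mapping."""
    totals = Counter()
    for ua, n in ua_counts.items():
        totals[classify_user_agent(ua)] += n
    return totals
```

The "browser-like" bucket is only a claim. It is where spoofed traffic hides, so it needs the other checks in this guide before you trust it. The "declared bot" bucket is also only a claim until the source IP has been verified.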

Verifying identity with a DNS lookup

When a request claims to be a known crawler, you can verify its identity. The standard method is a reverse-then-forward DNS check.

For a request claiming to be Googlebot:

$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

The IP reverse-resolves to a googlebot.com hostname, and that hostname forward-resolves back to the same IP. The round trip matches, and the domain is correct, so it’s a verified Googlebot.

Now here's a request that also claimed to be Googlebot, from one of our scraper IPs:

$ host 47.128.44.19
19.44.128.47.in-addr.arpa domain name pointer ec2-47-128-44-19.ap-southeast-1.compute.amazonaws.com.

It claimed to be Googlebot, but the IP reverse-resolves to an AWS EC2 instance in Singapore, not googlebot.com. Google does not crawl from EC2 instances. The user-agent string was identical to the real Googlebot's, but the request did not come from Google.
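
The reverse-then-forward check can be automated along these lines. This is a sketch under stated assumptions: the function name is made up, the list of allowed hostname suffixes is something you supply per crawler from the operator's own documentation, and the default resolvers use Python's standard `socket` module, which covers IPv4 only in this form. The `reverse` and `forward` parameters exist so that the logic can be tried without touching the network.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check.

    True only if `ip` reverse-resolves to a hostname under one of
    `allowed_suffixes` AND that hostname forward-resolves back to `ip`.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record, or the lookup failed
        return False

    # Require a whole-label match so "evilgooglebot.com" does not pass.
    if not any(
        hostname == suffix or hostname.endswith("." + suffix)
        for suffix in allowed_suffixes
    ):
        return False

    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

A call such as `verify_crawler_ip("66.249.66.1", ("googlebot.com",))` would perform the two lookups shown above. The forward step matters because whoever controls an IP block also controls its reverse DNS records and can make them say anything; only the forward lookup is answered by the real owner of the domain.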

Published IP ranges

Most major crawlers publish the IP ranges they operate from, which gives you a faster check than a DNS round trip at scale. OpenAI publishes theirs as a JSON file:

$ curl -s https://openai.com/gptbot.json | head
{
  "creationTime": "2025-10-30T11:00:00.000000",
  "prefixes": [
    {
      "ipv4Prefix": "132.196.86.0/24"
    },
    {
      "ipv4Prefix": "172.182.202.0/25"
    },
    {
      "ipv4Prefix": "172.182.204.0/24"
    ...
  ]
}

If a request claims to be GPTBot but its source IP isn't in OpenAI's published ranges, then it isn't GPTBot. The same approach works for Anthropic, Perplexity, Microsoft, and Google. The only maintenance cost is keeping the lists current, since they change.
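
Checking an address against a list like this needs nothing beyond Python's standard `ipaddress` module. The sketch below assumes a document shaped like the excerpt above, with a `prefixes` array of objects that carry an `ipv4Prefix` key. Other operators use different layouts, so the parsing step will need adapting, and the function names are illustrative.

```python
import ipaddress
import json


def prefixes_from_json(text):
    """Extract CIDR strings from a document shaped like the excerpt above."""
    doc = json.loads(text)
    return [p["ipv4Prefix"] for p in doc.get("prefixes", []) if "ipv4Prefix" in p]


def in_published_ranges(ip, prefixes):
    """True if `ip` falls inside any of the given CIDR prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(p, strict=False) for p in prefixes)
```

In practice you would fetch the file on a schedule, cache the parsed prefixes, and run `in_published_ranges` on the source IP of any request whose user agent claims to be that crawler.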

ASN lookup

Look at where traffic originates by network, not just by individual IP:

awk '{print $1}' access.log | asn-lookup | sort | uniq -c | sort -rn | head

  189442  AS16509  Amazon-02
   72103  AS14061  DigitalOcean
   41996  AS24940  Hetzner
   38201  AS5089   Virgin Media
   31774  AS2856   BT
   22018  AS5607   Sky UK

The top three sources by volume are all data centers, while real consumer audiences come from residential and mobile ISPs like Virgin, BT, and Sky. So if most of your requests come from AWS and Hetzner, they are likely not real users. One important caveat is that legitimate crawlers like Googlebot also run from data centers, so this signal isn't conclusive on its own. You can confirm legitimate crawlers with the reverse DNS and published IP range checks described above.

Geographic and timing anomalies

Start with requests by country. This one depends entirely on your situation, but the logic is simple: if you're a UK e-commerce store and your single largest source of traffic is Singapore, by a wide margin, you're almost certainly looking at bots rather than a sudden surge of overseas customers.

Timing tells the same kind of story. Here's a suspect's requests per hour across a day:

00:00  ████████████████████  5,012
01:00  ████████████████████  4,998
02:00  ████████████████████  5,031
03:00  ████████████████████  4,987
...
13:00  ████████████████████  5,004
...
23:00  ████████████████████  5,019

An almost flat line with roughly 5,000 requests an hour at 3 a.m. and 1 p.m. alike. Human traffic has a daily rhythm: it is quieter overnight, builds through the morning, and peaks in the afternoon and evening. A perfectly flat 24-hour line is a machine crawling at a constant rate. Real human traffic would look more like this:

00:00  ████  890
03:00  ██  410
08:00  ████████████  3,100
13:00  ████████████████████  5,200
20:00  ██████████████████  4,600
23:00  ████████  2,000
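
Flatness can be reduced to a single number with the coefficient of variation, which is the standard deviation of the hourly counts divided by their mean. This is one possible measure rather than an established threshold, and the function name is illustrative.

```python
from statistics import mean, pstdev


def hourly_variation(hourly_counts):
    """Coefficient of variation of requests per hour (0 means perfectly flat)."""
    avg = mean(hourly_counts)
    return pstdev(hourly_counts) / avg if avg else 0.0


# The hours shown in the two charts above.
suspect = [5012, 4998, 5031, 4987, 5004, 5019]
human = [890, 410, 3100, 5200, 4600, 2000]

if __name__ == "__main__":
    print(f"suspect: {hourly_variation(suspect):.3f}")
    print(f"human:   {hourly_variation(human):.3f}")
```

On the sample hours above, the suspect comes out well under 0.01 and the human pattern above 0.5. Where you draw the line in between depends on your audience, since a site with visitors spread across many time zones will look flatter than a single-country store.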

The issue with this approach

Every check above happened after the fact, on traffic you've already paid for. The scraper in these examples may have run 200,000+ requests before you grepped your way to it. IPs rotate, published ranges change, new crawlers appear, and the ones trying to hide adapt to whatever you're filtering on. So it’s a lot of work, and most of it is reactive.

So the real goal here is to understand your traffic well enough to make deliberate calls: which bots you want, which you don't, and where to draw the line. Once you can see clearly what's hitting your site, you're in a position to do something about it.
