Robots.txt and Web crawlers
Web crawlers are programs that navigate and parse the contents of websites. Companies that run web-search engines need to crawl the web to obtain data to build their search index; this is what search engines like Google, Bing, or DuckDuckGo continuously do.
Even if you do not run a web-search engine but would simply like to crawl a site, you had better follow its rules, or you risk getting banned from it.
The Robots exclusion standard or protocol, or **robots.txt** for short, is a standard that defines how websites communicate to web crawlers (also called spiders or bots) which resources may be crawled and which may not.
The main reason for setting such rules is that blindly crawling entire websites has undesirable consequences: it increases the load on the servers and on the network, in some cases to the point that it hampers the experience of other legitimate users. So bots may want to follow robots.txt to avoid getting banned.
Websites, on the other hand, may want to use robots.txt to communicate how they wish to be crawled and to tell which resources are crawlable and which are not.
Some resources on a site might be private or irrelevant to crawlers, and the site may want to exclude them from crawling. This, too, can be done with robots.txt.
However, one must be aware that such rules are merely a non-enforceable suggestion. So while a well-behaved crawler will honor robots.txt, a badly behaving one may ignore the rules or even do exactly the opposite.
The syntax of robots.txt
The file robots.txt is a plain-text file hosted on the webroot of a website. For instance, the robots file for Bunny.net is hosted at a predictable location: https://bunny.net/robots.txt.
The contents of that file are, at the time of writing, the following.
User-agent: *
Allow: /
Sitemap: https://bunny.net/sitemap.xml
Host: https://bunny.net
Every subdomain should have its own robots.txt. For instance, https://w3.org has one for the root domain, https://www.w3.org/robots.txt, and another for the lists subdomain, https://lists.w3.org/robots.txt.
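Before looking at the individual directives, here is a minimal sketch of how a crawler might consult a robots.txt file programmatically, using Python's standard urllib.robotparser module. The user-agent name MyCrawler is a placeholder, not a real bot.

from urllib.robotparser import RobotFileParser

# Fetch the robots.txt from its predictable location on the webroot
# and ask whether a given URL may be crawled.
parser = RobotFileParser()
parser.set_url("https://bunny.net/robots.txt")
parser.read()

# bunny.net currently allows everything, so this prints True.
print(parser.can_fetch("MyCrawler", "https://bunny.net/stream/"))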
Next, we cover the directives of the robots exclusion standard.
User-Agent directive
The User-Agent directive is used to specify instructions for specific bots.
Generally, the term user-agent denotes a piece of software that acts on behalf of a user. In this case, the User-Agent denotes the name of the crawler, which also identifies its owner. For instance, the following bot names belong to the best-known search engines:
Googlebot
Bingbot
DuckDuckBot
If you want to address all bots, set the User-Agent to a wildcard value, denoted by an asterisk: User-Agent: *.
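To see how such groups are applied, here is a small sketch using Python's standard urllib.robotparser. The robots.txt contents and the bot name SomeOtherBot are made up for illustration.

from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt with one group for DuckDuckBot and a
# wildcard group for every other bot.
rules = [
    "User-Agent: DuckDuckBot",
    "Disallow: /private/",
    "",
    "User-Agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("DuckDuckBot", "/index.html"))   # True: only /private/ is off limits for it
print(parser.can_fetch("SomeOtherBot", "/index.html"))  # False: the wildcard group disallows everything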
Disallow directive
The Disallow directive specifies which resources are not to be crawled. It can be used in several ways, as the following examples show; a small matching sketch follows them.
To disallow crawling a particular resource.
Disallow: /a-particular-resource.html
To disallow crawling a whole directory, including its subdirectories.
Disallow: /directory-name/
To disallow crawling entirely.
Disallow: /
To allow access to the entire site, set the directive to an empty string.
Disallow:
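The sketch below, referenced above, illustrates in Python the prefix matching that these Disallow rules boil down to. It is a simplified illustration, not a complete implementation of the standard.

# Simplified illustration of how Disallow rules are matched: each rule is
# treated as a path prefix, and an empty rule matches nothing.
def is_disallowed(path, disallow_rules):
    return any(rule and path.startswith(rule) for rule in disallow_rules)

print(is_disallowed("/directory-name/page.html", ["/directory-name/"]))  # True
print(is_disallowed("/other.html", ["/directory-name/"]))                # False
print(is_disallowed("/anything.html", [""]))                             # False: empty value allows everything
print(is_disallowed("/anything.html", ["/"]))                            # True: "/" disallows the whole site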
Comments
The last official Robots exclusion protocol directive is the single-line comment. A comment is started with the pound sign #. For instance:
User-Agent: DuckDuckBot # when the crawler is from DuckDuckGo search engine
Disallow: # allow access to the entire site
User-Agent: ABotNonGrata # tell a search bot from an engine we do not like
Disallow: / # it is not allowed to crawl
Needless to say, nothing prevents ABotNonGrata from actually crawling the site.
Additional unofficial directives
The directives User-Agent and Disallow, together with comments, are the only official Robots exclusion standard directives.
However, there are a few other directives that are not officially recognized but are still acknowledged by most bots.
Allow directive
The Allow directive specifies a resource that is allowed to be crawled. It is commonly used together with a Disallow directive: the Disallow directive typically disallows a larger set of resources from crawling, while the Allow directive exempts particular resources from being disallowed.
In the following, crawling /folder/allowed_resource.html is allowed, but crawling anything else from /folder/ is not.
Allow: /folder/allowed_resource.html
Disallow: /folder/
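The following sketch checks this reading of the example above with Python's urllib.robotparser, which also understands Allow lines; the paths are the ones from the example.

from urllib.robotparser import RobotFileParser

# Sketch: the Allow rule exempts one resource from the broader Disallow rule.
parser = RobotFileParser()
parser.parse([
    "User-Agent: *",
    "Allow: /folder/allowed_resource.html",
    "Disallow: /folder/",
])

print(parser.can_fetch("*", "/folder/allowed_resource.html"))  # True
print(parser.can_fetch("*", "/folder/other.html"))             # False

Note that implementations resolve conflicts between Allow and Disallow differently: urllib.robotparser applies rules in the order they appear, while some crawlers, such as Google's, pick the most specific matching rule.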
Crawl-delay directive
The directive Crawl-delay: value is used to rate-limit the crawler.
The interpretation of value varies between bots. Some (e.g., Yandex) regard it as the number of milliseconds the crawler should wait between sending subsequent requests, others regard it as the number of seconds, and some, like Googlebot, do not recognize it at all.
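A crawler that wants to honor Crawl-delay can read the value and pause between requests. The sketch below uses urllib.robotparser, whose crawl_delay() method returns the raw value; interpreting it as seconds is an assumption here, and the user-agent name and URLs are placeholders.

import time
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a delay of 2 (interpreted here as seconds) for every bot.
parser = RobotFileParser()
parser.parse([
    "User-Agent: *",
    "Crawl-delay: 2",
    "Disallow:",
])

delay = parser.crawl_delay("MyCrawler") or 0  # None when no Crawl-delay is set
for url in ["/page-1.html", "/page-2.html"]:
    if parser.can_fetch("MyCrawler", url):
        pass  # fetch the page here (placeholder)
    time.sleep(delay)  # wait before sending the next request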
Host directive
Some crawlers support the Host: domain directive, which allows websites with multiple mirrors to list their preferred domain for crawling.
Sitemap directive
The directive Sitemap: URL specifies a URL to a website's sitemap in XML.
The sitemap contains all resources that are available for crawling, together with their metadata. Here is again the robots.txt from Bunny.net.
User-agent: *
Allow: /
Sitemap: https://bunny.net/sitemap.xml
Host: https://bunny.net
On https://bunny.net/sitemap.xml we find a set of URLs, each with some additional metadata, such as the modification date and priority. Here we list only a small excerpt.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>https://bunny.net/</loc>
    <lastmod>2022-08-02T20:28:29+00:00</lastmod>
    <priority>1.00</priority>
  </url>
  <url>
    <loc>https://bunny.net/stream/</loc>
    <lastmod>2022-08-02T20:29:38+00:00</lastmod>
    <priority>0.90</priority>
  </url>
</urlset>
A well-behaved crawler can simply look at these URLs and process them directly, without having to parse HTML pages and look for links. Consequently, such crawling inflicts the minimum amount of load on the website.
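For illustration, such a sitemap can be read with nothing more than Python's standard library; the sketch below parses the structure shown in the excerpt above (the <urlset>/<url> elements under the standard sitemap namespace) and prints each URL with its modification date. Recent Python versions also expose the Sitemap lines of a parsed robots.txt via RobotFileParser.site_maps().

import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by the sitemap schema shown above.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen("https://bunny.net/sitemap.xml") as response:
    tree = ET.parse(response)

for url in tree.getroot().findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    print(loc, lastmod)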
Conclusion
The Robots exclusion standard, or robots.txt, is a set of rules that should be followed when crawling a website with a bot or a spider.
While a website has no guarantee that a crawler will honor such rules, a robot that crawls the site in accordance with robots.txt will inflict a tolerable load and is unlikely to get banned.