
What's the proper way to handle Allow and Disallow in robots.txt?

I run a fairly large-scale Web crawler. We try very hard to operate the crawler within accepted community standards, and that includes respecting robots.txt. We get very few complaints about the crawler, but when we do, the majority are about our handling of robots.txt. Most often the Webmaster made a mistake in his robots.txt and we kindly point out the error. But periodically we run into grey areas that involve the handling of Allow and Disallow.

The original robots.txt specification doesn't cover Allow. I've seen other pages, some of which say that crawlers use a "first matching" rule, and others that don't specify. That leads to some confusion. For example, Google's page about robots.txt used to have this example:

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

Obviously, a “first matching” rule here wouldn’t work because the crawler would see the Disallow and go away, never crawling the file that was specifically allowed.
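To make that concrete, here is a minimal sketch (in Python, using a made-up rule-list representation and plain prefix matching, not any real crawler's code) of how a "first matching" evaluator behaves on that example:

# Rules in file order for the Googlebot group above.
rules = [("disallow", "/folder1/"), ("allow", "/folder1/myfile.html")]

def first_match_allowed(rules, path):
    """Return the verdict of the first rule whose prefix matches the path."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True  # no rule matched: crawling is allowed by default

print(first_match_allowed(rules, "/folder1/myfile.html"))  # False: the Disallow line matches first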

We’re in the clear if we ignore all Allow lines, but then we might not crawl something that we’re allowed to crawl. We’ll miss things.

We’ve had great success by checking Allow first, and then checking Disallow, the idea being that Allow was intended to be more specific than Disallow. That’s because, by default (i.e. in the absence of instructions to the contrary), all access is allowed. But then we run across something like this:

User-agent: *
Disallow: /norobots/
Allow: /

The intent here is obvious, but that Allow: / will cause a bot that checks Allow first to think it can crawl anything on the site.
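A similar sketch (same caveats: hypothetical Python, prefix matching only) of the "Allow first" approach shows the problem; with Allow: / in the file, every path looks crawlable:

rules = [("disallow", "/norobots/"), ("allow", "/")]

def allow_first_allowed(rules, path):
    """Check every Allow rule before any Disallow rule is consulted."""
    for kind, prefix in rules:
        if kind == "allow" and path.startswith(prefix):
            return True
    for kind, prefix in rules:
        if kind == "disallow" and path.startswith(prefix):
            return False
    return True  # nothing matched: crawling is allowed by default

print(allow_first_allowed(rules, "/norobots/secret.html"))  # True, even though the intent is clearly to block it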

Even that can be worked around in this case. We can compare the matching Allow with the matching Disallow and determine that we’re not allowed to crawl anything in /norobots/. But that breaks down in the face of wildcards:

User-agent: *
Disallow: /norobots/
Allow: /*.html$

The question, then: is the bot allowed to crawl /norobots/index.html?

The “first matching” rule eliminates all ambiguity, but I often see sites that show something like the old Google example, putting the more specific Allow after the Disallow. That syntax requires more processing by the bot and leads to ambiguities that can’t be resolved.

My question, then, is what’s the right way to do things? What do Webmasters expect from a well-behaved bot when it comes to robots.txt handling?

Asked by Jim Mischel on February 23, 2021

3 Answers

One very important note: the Allow statement should come before the Disallow statement, no matter how specific your statements are. So in your third example, no, the bots won't crawl /norobots/index.html.

Generally, as a personal rule, I put allow statements first and then I list the disallowed pages and folders.

Correct answer by Vergil Penkov on February 23, 2021

Google has published expanded robots.txt documentation for user agents that support the Allow directive. The rule that Googlebot uses (and what Google is trying to make standard) is that the longest matching rule wins.

So when you have:

Disallow: /norobots/
Disallow: /nobot/
Allow:    /*.html$
Allow:    /****.gif$
  • /norobots/index.html is blocked because it matches two rules and /norobots/ is longer (10 characters) than /*.html$ (8 characters).
  • /nobot/index.html is allowed because it matches two rules and /nobot/ is shorter (7 characters) than /*.html$ (8 characters).
  • /norobots/pic.gif is allowed because it matches two rules and /norobots/ is equal in length (10 characters) to /****.gif$ (10 characters). Google's spec says that the "less restrictive" rule should be used for rules of equal length, i.e. the one that allows crawling.
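For illustration, here is a minimal sketch of that longest-match rule (Python; the rule representation and the wildcard-to-regex translation are my own simplification, not Google's actual parser):

import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern (* wildcard, $ end anchor) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "$":
            parts.append("$")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def is_allowed(rules, path):
    """Longest matching pattern wins; ties go to Allow; no match at all means allowed."""
    best_len = -1
    allowed = True  # default: crawling is allowed
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            if len(pattern) > best_len or (len(pattern) == best_len and kind == "allow"):
                best_len = len(pattern)
                allowed = (kind == "allow")
    return allowed

rules = [
    ("disallow", "/norobots/"),
    ("disallow", "/nobot/"),
    ("allow", "/*.html$"),
    ("allow", "/****.gif$"),
]
print(is_allowed(rules, "/norobots/index.html"))  # False: /norobots/ (10) beats /*.html$ (8)
print(is_allowed(rules, "/nobot/index.html"))     # True:  /*.html$ (8) beats /nobot/ (7)
print(is_allowed(rules, "/norobots/pic.gif"))     # True:  tie at 10 characters, Allow wins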

Answered by Stephen Ostermiller on February 23, 2021

Here's my take on what I see in those three examples.

Example 1
I would ignore the entire /folder1/ directory except the myfile.html file. Since they explicitly allow that file, I assume it was simply easier to block the entire directory and allow that one file than to list every file they wanted blocked. If that directory contained a lot of files and subdirectories, the robots.txt file could get unwieldy fast.

Example 2
I would assume the /norobots/ directory is off limits and everything else is available to be crawled. I read this as "crawl everything except the /norobots/ directory".

Example 3
Similar to example 2, I would assume the /norobots/ directory is off limits and all .html files not in that directory are available to be crawled. I read this as "crawl all .html files but do not crawl any content in the /norobots/ directory".

Hopefully your bot's user-agent string contains a URL where webmasters can find out more about your crawling habits, make removal requests, or give you feedback about how they want their robots.txt interpreted.

Answered by John Conde on February 23, 2021
