Thread: Google Robots.txt Specifications


    Hi, spotted this in my travels recently, might be of interest to some of you guys.

    Controlling Crawling and Indexing
    Controlling Crawling and Indexing - Google Code

    Robots.txt Specifications

    This document details how Google handles the robots.txt file that allows you to control how Google's website crawlers crawl and index publicly accessible websites.
    Robots.txt Specifications - Controlling Crawling and Indexing - Google Code


    Little discussion:
    Google's Current Specifications for Robots Directives Sitemaps, Meta Data, and robots.txt

    Just keep in mind that Google supports robots.txt extensions that many other spiders do not honor, such as the Allow directive and the asterisk wildcard (*) in a directive's path. These were not part of the original robots.txt standard, so a spider that only does plain prefix matching will read a path like /*.php literally.
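
    To make the wildcard point concrete, here is a rough Python sketch comparing a wildcard-aware match with plain prefix matching. The function names and sample paths are made up for illustration, and this is only an approximation of the documented wildcard behaviour (* matches any run of characters, a trailing $ anchors the end), not Google's actual matcher.

```python
import re

def google_style_match(pattern: str, path: str) -> bool:
    """Hypothetical helper: wildcard-aware match.
    `*` matches any run of characters, a trailing `$` anchors the end,
    otherwise the pattern only has to match a prefix of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with `.*` for every `*`.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

def prefix_only_match(pattern: str, path: str) -> bool:
    """Hypothetical helper: what a spider without wildcard support would do."""
    return path.startswith(pattern)

pattern = "/*.php"
paths = ["/index.php", "/forum/view.php?id=3", "/index.html"]

wildcard_hits = [google_style_match(pattern, p) for p in paths]
prefix_hits = [prefix_only_match(pattern, p) for p in paths]

for p, w, x in zip(paths, wildcard_hits, prefix_hits):
    print(f"{p}: wildcard-aware={w}, prefix-only={x}")
```

    With prefix-only matching the rule never fires, because no real path starts with the literal characters /*.php.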

    You should set up a Googlebot section in robots.txt if you want to use the Google extensions. If you set up a Googlebot section, Google will follow only that section and ignore the directives written for other spiders (including the User-agent: * section), so you need to repeat in it every area of your site that you do not want Google to crawl.
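
    The need to repeat rules can be checked with Python's standard urllib.robotparser. This is only a sketch: the paths and the "OtherBot" name are invented, and Python's parser is not Google's, but it picks one section per crawler in the same way.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /private/ is blocked for everyone in the
# generic section, but was not repeated in the Googlebot section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: Googlebot
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler with no section of its own falls back to the `*` section.
other_private = parser.can_fetch("OtherBot", "/private/page.html")

# Googlebot has its own section, so the `*` rules are not consulted.
google_private = parser.can_fetch("Googlebot", "/private/page.html")
google_tmp = parser.can_fetch("Googlebot", "/tmp/cache.html")

print("OtherBot  /private/:", other_private)
print("Googlebot /private/:", google_private)
print("Googlebot /tmp/:    ", google_tmp)
```

    Here /private/ stays open to Googlebot until a Disallow: /private/ line is added to the Googlebot section as well.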

    An example. The first example on the Google page is actually incorrect as a general-purpose rule set, because it sits under User-agent: * and not all spiders recognize the Allow directive. Yahoo recognizes it, but it looks like Bing does not. This is a good example of Google viewing the Internet through a mirror.

    User-agent: *
    Allow: /
    Disallow: /*.php

    I wondered why, all of a sudden, I am getting a lot of requests for robots.txt on some of my sites.
    I have no robots.txt, though.

    Should I really have one?

