fluidthoughts developers' guild

How to use robots.txt

We don't serve their kind here.

What?

Your droids. They'll have to wait outside. We don't want them here.

Background: What is it?

A robots.txt file lists the files and directories on your site that web-indexing spiders should ignore.

Perhaps you don't want a section of your site indexed, or a utility on your site can generate infinite pages based upon GET variables, or perhaps an input form isn't appropriate for search engines to reference.

Unfortunately, this is only a de facto standard: compliance is voluntary, and there is no guarantee that all current and future robots will honor it. Consider it a common facility that the majority of robot authors offer the web community to protect their servers against unwanted accesses by their robots.

Examples:

You can implement this by adding a file named "robots.txt" to the top level of your document root. An example file might look like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
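
To see how a well-behaved robot would read the example above, here is a minimal sketch using Python's standard urllib.robotparser module. The host example.com, the robot name "ExampleBot", and the sample paths are made up for illustration. A real robot would fetch the file from your server rather than parse a string.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, held in a string instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""


def allowed(path, agent="ExampleBot"):
    """Return True if a compliant robot may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    # example.com is a placeholder host; only the path part is matched.
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    for path in ("/index.html", "/cgi-bin/search.cgi", "/tmp/scratch.txt"):
        print(path, "allowed" if allowed(path) else "disallowed")
```

Anything under /cgi-bin/ or /tmp/ is reported as disallowed, and everything else is allowed. This only models a robot that chooses to obey the file.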

You can also include a robots META tag in the head of an individual page.

<meta name="robots" content="noindex, nofollow">

<quote>
The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links. No server administrator action is required.

Note that currently only a few robots implement this.

In this simple example:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

a robot should neither index this document, nor analyse it for links.
</quote>
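
For the curious, a robot that honors the META tag might read it roughly like this. This is only a sketch built on Python's standard html.parser module; the class name RobotsMetaParser and the helper robots_directives are hypothetical names, not part of any real crawler.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for item in content.split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
    found = robots_directives(page)
    print("may index:", "noindex" not in found)
    print("may follow links:", "nofollow" not in found)
```

The comparison is case-insensitive, so the lowercase and uppercase forms of the tag shown above are treated the same.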

Caveats

Don't confuse this with your security model. Listing a path in your robots.txt file may cause Google to ignore parts of your site, but it won't prevent your server from getting hacked.

The following caveat explains why you shouldn't name the exact files you don't want users to know about. (Of course, this can be dumb either way: security through obscurity is an idiot's trick.)

<quote>
...since the robots.txt file is accessible to everyone it should not be used to hide specific files or directories on your server.

For example, if you're trying to stop search engines from indexing a file named "list_of_my_passwords.txt" and a folder with sensitive information named "secrets_folder", adding their full names as follows should be avoided whenever possible.

Instead, move your sensitive files and directories into a sub directory and exclude that sub directory by itself. As in the following example, excluding a non-specific directory name such as "folder_a" is a better solution.

If you're unable to reorganize your directory structure, yet have a strong need to exclude certain directories from indexes, use only partial names in the robots.txt file. Although this may not be the best solution, it will at least make it almost impossible to guess full directory names. For example, to exclude "secrets_folder" and "list_of_my_passwords.txt" use following names (given that there aren't any other files or directories in the web root starting with those characters).
</quote>
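
The partial-name trick in the quote relies on Disallow rules matching by prefix. The quote's own example listings aren't reproduced here, so the rules "/se" and "/li" below are hypothetical stand-ins, as are the host example.com and the robot name "ExampleBot". A minimal sketch with Python's standard urllib.robotparser:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical partial-name rules; not taken from the quoted source.
PARTIAL_RULES = """\
User-agent: *
Disallow: /se
Disallow: /li
"""


def blocked(path, rules=PARTIAL_RULES):
    """Return True if a compliant robot would skip the given path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return not parser.can_fetch("ExampleBot", "http://example.com" + path)


if __name__ == "__main__":
    for path in ("/secrets_folder/notes.html",
                 "/list_of_my_passwords.txt",
                 "/links.html",
                 "/index.html"):
        print(path, "blocked" if blocked(path) else "allowed")
```

Note that /links.html gets blocked too, which is why the quote adds the condition that nothing else in the web root starts with the same characters.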

References:

The latest version of "The Web Robots Pages" can be found at http://www.robotstxt.org/wc/robots.html

A robots.txt syntax checker: http://tool.motoricerca.info/robots-checker.phtml

 

$Id: robots-txt.html,v 1.6 2006/06/06 17:53:57 willn Exp $