Some people believe
that they should create different pages for different search engines, each
page optimized for one keyword and for one search engine. Now, while I
don't recommend that people create different pages for different search
engines, if you do decide to create such pages, there is one issue that
you need to be aware of.
These pages, although optimized for different search engines, often turn
out to be pretty similar to each other. The search engines now have the
ability to detect when a site has created such similar looking pages and
are penalizing or even banning such sites. In order to prevent your site
from being penalized for spamming, you need to prevent each search
engine's spider from indexing pages which are not meant for it, i.e. you
need to prevent AltaVista from indexing pages meant for Google and
vice versa. The best way to do that is to use a robots.txt file.
You should create a
robots.txt file using a text editor like Windows Notepad. Don't use your
word processor to create such a file.
Here is the basic
syntax of the robots.txt file:
User-Agent: [Spider Name]
Disallow: [File Name]
For instance, to tell
AltaVista's
spider, Scooter, not to spider the file named myfile1.html residing in the
root directory of the server, you would write
User-Agent: Scooter
Disallow: /myfile1.html
To tell
Google's
spider, called Googlebot, not to spider the files myfile2.html and
myfile3.html, you would write
User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
You can, of course, put multiple User-Agent statements in the same
robots.txt file, separating each group of statements from the next with a
blank line. Hence, to tell AltaVista not to spider the file named
myfile1.html, and to tell Google not to spider the files myfile2.html and
myfile3.html, you would write
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
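To see how a spider would read such a file, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed is purely illustrative, and Python's parser is only one interpretation of the rules; a real search engine spider may differ in details.

```python
from urllib.robotparser import RobotFileParser

# The two-record example from the text, with a blank line between records.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
"""


def is_allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("Scooter", "Googlebot"):
        for path in ("/myfile1.html", "/myfile2.html", "/myfile3.html"):
            print(agent, path, is_allowed(ROBOTS_TXT, agent, path))
```

Each spider is blocked only from the files listed under its own User-Agent line; files listed for the other spider remain open to it.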
If you want to prevent
all robots from spidering the file named myfile4.html, you can use the *
wildcard character in the User-Agent line, i.e. you would write
User-Agent: *
Disallow: /myfile4.html
However, you cannot use the wildcard character in the Disallow line. The
value in the Disallow line is simply matched against the beginning of the
file's path.
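The following sketch illustrates the difference between the two lines, again using Python's standard urllib.robotparser as a stand-in for a spider. The spider name AnySpider and the helper blocked are made up for illustration. It shows that * in the User-Agent line covers every spider, that a Disallow value acts as a path prefix, and that this particular parser does not expand a * placed inside a Disallow value.

```python
from urllib.robotparser import RobotFileParser


def blocked(robots_txt, agent, path):
    """Return True if `agent` is told not to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


# "*" in the User-Agent line applies the record to every spider.
# The Disallow value is a prefix, so it covers more than one file.
PREFIX_RULES = """\
User-Agent: *
Disallow: /myfile4
"""

# A "*" inside the Disallow value is treated as a literal character
# by this parser, so it does not match myfile4.html.
WILDCARD_RULES = """\
User-Agent: *
Disallow: /myfile*.html
"""

if __name__ == "__main__":
    print(blocked(PREFIX_RULES, "AnySpider", "/myfile4.html"))
    print(blocked(PREFIX_RULES, "AnySpider", "/myfile4/index.html"))
    print(blocked(PREFIX_RULES, "AnySpider", "/other.html"))
    print(blocked(WILDCARD_RULES, "AnySpider", "/myfile4.html"))
```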
Once you have created
the robots.txt file, you should upload it to the root directory of your
domain. Uploading it to any sub-directory won't work - the robots.txt file
needs to be in the root directory.
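As a quick illustration of where a spider looks for the file, this small sketch derives the robots.txt address from the address of any page on the site. The domain www.example.com and the function name robots_url are placeholders, not part of any real setup.

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Return the robots.txt address for the site that serves `page_url`."""
    # A leading slash in the second argument replaces the whole path,
    # so the result always points at the root of the same host.
    return urljoin(page_url, "/robots.txt")


if __name__ == "__main__":
    print(robots_url("http://www.example.com/pages/tourism-in-australia-go.html"))
```

Whatever sub-directory the page lives in, the result is the same root-level address, which is why a robots.txt file uploaded anywhere else is never read.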
I won't discuss the
syntax and structure of the robots.txt file any further - you can get the
complete specifications from
here.
Now we come to how the
robots.txt file can be used to prevent your site from being penalized for
spamming in case you are creating different pages for different search
engines. What you need to do is to prevent each search engine from
spidering pages which are not meant for it.
For simplicity, let's
assume that you are targeting only two keywords: "tourism in Australia"
and "travel to Australia". Also, let's assume that you are targeting only
three of the
major search engines:
AltaVista,
HotBot
and
Google.
Now, suppose you name the files using the following convention: each
page's name consists of the individual words of the keyword for which the
page is being optimized, separated by hyphens, followed by a hyphen and
the first two letters of the name of the search engine for which the page
is being optimized.
Hence, the files for
AltaVista
are
tourism-in-australia-al.html
travel-to-australia-al.html
The files for
HotBot
are
tourism-in-australia-ho.html
travel-to-australia-ho.html
The files for
Google
are
tourism-in-australia-go.html
travel-to-australia-go.html
As I noted earlier,
AltaVista's
spider is called Scooter and
Google's
spider is called Googlebot.
A list of spiders for
the major search engines can be found
here.
Now, we know that HotBot uses Inktomi, and from this list we find that
Inktomi's spider is called Slurp.
Using this knowledge,
here's what the robots.txt file should contain:
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
When you put the above
lines in the robots.txt file, you instruct each search engine not to
spider the files meant for the other search engines.
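With more keywords or more search engines, typing these lines by hand becomes error-prone. The sketch below generates the same file from the naming convention described earlier. The names SPIDERS, KEYWORDS, page_name and build_robots_txt are illustrative choices, and the script assumes every page follows the convention exactly.

```python
# Spider name -> search engine it crawls for (taken from the text).
SPIDERS = {"Scooter": "AltaVista", "Slurp": "HotBot", "Googlebot": "Google"}
KEYWORDS = ["tourism in Australia", "travel to Australia"]


def page_name(keyword, engine):
    """Hyphenate the keyword and append the engine's first two letters."""
    slug = "-".join(keyword.lower().split())
    return f"{slug}-{engine[:2].lower()}.html"


def build_robots_txt(spiders, keywords):
    """Build one record per spider, disallowing the other engines' pages."""
    records = []
    for spider, own_engine in spiders.items():
        lines = [f"User-Agent: {spider}"]
        for engine in spiders.values():
            if engine == own_engine:
                continue  # a spider may read the pages written for it
            for keyword in keywords:
                lines.append(f"Disallow: /{page_name(keyword, engine)}")
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(SPIDERS, KEYWORDS), end="")
```

Adding a keyword or a search engine then only means adding one entry to the lists at the top, rather than editing every record.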
When you have finished
creating the robots.txt file, double-check to ensure that you have not
made any errors anywhere in it. A small error can have disastrous
consequences - a search engine may spider files which are not meant for
it, in which case it can penalize your site for spamming, or, it may not
spider any files at all, in which case you won't get top rankings in that
search engine.
A useful tool to check
the syntax of your robots.txt file can be found
here. While it will help you
correct syntactical errors in the robots.txt file, it won't help you
correct any logical errors, for which you will still need to go through
the robots.txt thoroughly, as mentioned above.
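Part of that logical check can be automated. The sketch below, which assumes the three spiders and the file naming convention used in this article, reads a robots.txt with Python's standard urllib.robotparser and reports every case where a spider may fetch a page meant for another engine, or is blocked from a page meant for itself. The names SUFFIX_FOR_SPIDER, PAGES and find_logical_errors are hypothetical, and Python's parser may not behave exactly like each real spider, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Spider name -> file name suffix of the pages written for its engine.
SUFFIX_FOR_SPIDER = {"Scooter": "al", "Slurp": "ho", "Googlebot": "go"}
PAGES = ["tourism-in-australia", "travel-to-australia"]


def find_logical_errors(robots_txt):
    """Return (spider, path) pairs where access differs from what is intended."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    errors = []
    for spider, own_suffix in SUFFIX_FOR_SPIDER.items():
        for suffix in SUFFIX_FOR_SPIDER.values():
            for page in PAGES:
                path = f"/{page}-{suffix}.html"
                # A spider should fetch only the pages carrying its own suffix.
                should_fetch = suffix == own_suffix
                if parser.can_fetch(spider, path) != should_fetch:
                    errors.append((spider, path))
    return errors
```

An empty list means every spider sees exactly the pages intended for it; any pair in the list points at a missing or misplaced Disallow line.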