The Robots Text File Or How To Get Your Site Properly Spidered, Crawled, Indexed By Bots

Perhaps you heard someone stress the importance of the robots.txt file, or noticed in your website’s logs that requests for robots.txt are causing errors, or found it at the very top of your most visited pages. Perhaps you read an article about the death of the robots.txt file and how you should never bother with it again. Or maybe you have never heard of the robots.txt file but are intrigued by all that talk about spiders, robots and crawlers. In this article, I will try to make some sense out of all of the above.

There are many folks out there who vehemently insist on the uselessness of the robots.txt file, proclaiming it obsolete, a thing of the past, plain dead. I disagree. The robots.txt file is probably not among the top ten methods for promoting your get-rich-fast affiliate website in 24 hours or less, but it still plays a major role in the long run.

First of all, the robots.txt file is still a very important factor in promoting and maintaining a site, and I will show you why. Second, the robots.txt file is one of the simplest means by which you can protect your privacy and/or intellectual property. I will show you how.

Let’s try to figure out some of the lingo.

What is this robots.txt file?

The robots.txt file is just a very plain text file (or an ASCII file, as some like to say), with a very simple set of instructions that we give to a web robot, so the robot knows which pages we need scanned (or crawled, or spidered, or indexed – all terms refer to the same thing in this context) and which pages we would like to keep out of search engines.
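To make the idea of "a very simple set of instructions" concrete, here is a minimal sketch that uses Python's standard `urllib.robotparser` module to read a small robots.txt and ask whether a robot may visit a page. The file contents, the `/private/` and `/drafts/` folders and the `www.example.com` addresses are made up for illustration; they do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: every robot ("*") is asked to stay out of
# two folders. Anything not listed is open to crawling.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# The home page is not covered by any Disallow line, the private page is.
home_allowed = parser.can_fetch("Googlebot", "http://www.example.com/index.html")
private_allowed = parser.can_fetch("Googlebot", "http://www.example.com/private/notes.html")
print(home_allowed, private_allowed)
```

Keep in mind that this only shows how a well-behaved robot interprets the file. The instructions are a request, and it is up to each robot to honor them.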

What is a www robot?

A robot is a computer program that automatically reads web pages and goes through every link that it finds. The purpose of robots is to gather information. Some of the most famous robots mentioned in this article work for the search engines, indexing all the information available on the web.
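The "goes through every link that it finds" part can be sketched in a few lines of Python using only the standard library. This is a toy illustration, not how any particular search engine robot is built: the page content and the `www.example.com` address are invented, and a real robot would also download each page, remember what it has already visited and check robots.txt first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the target of every <a href="..."> tag on a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into full addresses.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# A made-up page with one relative and one absolute link.
SAMPLE_PAGE = (
    '<p><a href="/about.html">About</a> '
    '<a href="http://www.example.org/">Elsewhere</a></p>'
)
links = extract_links(SAMPLE_PAGE, "http://www.example.com/index.html")
print(links)
```

A robot would put each of these addresses in a queue, fetch them one by one, and repeat the same extraction on every new page.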

The first robot was developed at MIT and launched in 1993. It was named the World Wide Web Wanderer, and its initial purpose was purely scientific: its mission was to measure the growth of the web. The index generated from the experiment’s results proved to be an awesome tool and effectively became the first search engine. Much of what we consider today to be indispensable online tools was born as a side effect of some scientific experiment.

What is a search engine?

Generically, a search engine is a program that searches through a database. In the popular sense, as applied to the web, a search engine is a system with a user search form that searches through a repository of web pages gathered by a robot.
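As a rough sketch of "a search form on top of a repository of pages", the Python below builds a tiny word index over three invented pages and looks up a query in it. The page texts, the addresses and the function names `build_index` and `search` are assumptions made up for this example; real search engines are vastly more elaborate.

```python
from collections import defaultdict


def build_index(pages: dict) -> dict:
    """Map each word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index: dict, query: str) -> set:
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# A made-up repository, standing in for pages gathered by a robot.
PAGES = {
    "http://www.example.com/a.html": "robots crawl the web",
    "http://www.example.com/b.html": "spiders crawl too",
    "http://www.example.com/c.html": "the web keeps growing",
}
index = build_index(PAGES)
print(sorted(search(index, "crawl web")))
```

The robot fills the repository, the index makes it searchable, and the search form is simply the part the user sees.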

What are spiders and crawlers?

Spiders and crawlers are robots; the names just sound cooler in the press and within metro-geek circles.

What are the most popular robots? Is there a list?

Some of the best-known robots are Google’s Googlebot, MSN’s MSNBot, Ask Jeeves’s Teoma and Yahoo!’s Slurp (funny). One of the most popular places to search for active robot info is the list maintained at

Why do I need this robots.txt file anyway?

A great reason to use a robots.txt file is that many search engines, including Google, publicly recommend this tool. Why is it such a big deal that Google teaches people about the robots.txt? Because nowadays search engines are no longer a playground for scientists and geeks; they are large corporate enterprises. Google is one of the most secretive search engines out there. Very little is known to the public about how it operates, how it indexes, how it searches, how it creates its rankings, etc.

In fact, if you search carefully through specialized forums, or wherever else these issues are discussed, you will find that nobody really agrees on which elements Google emphasizes when it creates its rankings. And when people don’t agree on something as precise as a ranking algorithm, it means two things: that Google constantly changes its methods, and that it does not make them very clear or very public.

There’s only one thing that I believe to be crystal clear. If they recommend that you use a robots.txt (“Make use of the robots.txt file on your web server”, from Google’s Technical Guidelines), then do it. It might not help your ranking, but it will definitely not hurt you.

There are other reasons to use the robots.txt file. If you use your error logs to tweak your site and keep it free of errors, you will notice that many of the errors come from someone or something not finding the robots.txt file. All you have to do is create a basic blank file (use Notepad in Windows, or the simplest text editor in Linux or on a Mac), name it robots.txt and upload it to the root of your server (that’s where your home page is).
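If you would rather script that step than open a text editor, here is a minimal Python sketch that creates the blank file and then checks, with the standard `urllib.robotparser` module, that a blank robots.txt leaves every page open to robots. The helper names and the temporary folder standing in for your site's root are assumptions for this example; uploading the file to your server is still up to you.

```python
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser


def write_robots_txt(site_root, content: str = "") -> Path:
    """Create robots.txt in the given folder; blank by default."""
    path = Path(site_root) / "robots.txt"
    path.write_text(content, encoding="ascii")
    return path


def allows_everything(path: Path,
                      sample_url: str = "http://www.example.com/any/page.html") -> bool:
    """Check that the file does not block a sample (made-up) page."""
    parser = RobotFileParser()
    parser.parse(path.read_text(encoding="ascii").splitlines())
    return parser.can_fetch("*", sample_url)


# A temporary folder plays the role of the site root in this sketch.
with tempfile.TemporaryDirectory() as site_root:
    robots_path = write_robots_txt(site_root)
    blank_is_open = allows_everything(robots_path)
    print(robots_path.name, blank_is_open)
```

A blank file restricts nothing; its only job here is to exist, so that requests for robots.txt stop showing up as errors in your logs.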

 
