Some webmasters do not want their sites to show up in search engines, while others do. Both can be arranged with the help of robots.txt.
robots.txt is a plain text file that spiders/crawlers read and interpret in order to follow the rules given in it. Inside robots.txt you address a particular spider/crawler by its UA (User Agent) and tell it where on the site it may crawl and where it may not. The file must be named robots.txt and saved in the root directory of the site, and a well-behaved spider visiting your site will first look for http://yoursite.com/robots.txt before accessing anything else. Keep in mind that the rules are a request, not an enforcement: a badly behaved spider can simply ignore them.
Spiders, also called crawlers, are programs sent out by search engines to index pages and carry what they find back to the search engine. Spiders are also used by hackers and spammers to harvest email addresses for spamming. Every browser and every spider/crawler identifies itself with a User Agent when surfing the net.
A User Agent is like an identity card that browsers and spiders present while surfing the net so that they can be recognised. An example of a Firefox browser User Agent is shown below:
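As a rough sketch of that lookup, here is how a crawler might work out where robots.txt lives for any page it is about to visit. It is written in Python using only the standard library, and the function name robots_url is my own, made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://yoursite.com/blog/post.html?id=7"))
# prints http://yoursite.com/robots.txt
```

Whatever page the spider starts from, the file it asks for is always at the root of the site.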
Mozilla/5.0 (Windows NT 6.2; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0
And the User Agent name that Google's spider goes by in robots.txt is shown below:
Googlebot
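To make the identity card idea concrete, here is a small sketch of how a program can set its own User Agent on a request. It uses Python's standard urllib, and the name ExampleBot/1.0 is purely hypothetical. Nothing is actually sent over the network; the code only prepares the request and reads the header back.

```python
from urllib.request import Request

# "ExampleBot/1.0" is a made-up name, used here only for illustration.
request = Request(
    "http://yoursite.com/",
    headers={"User-Agent": "ExampleBot/1.0"},
)

# No request is sent here; we only inspect the prepared headers.
# urllib stores the header name internally as "User-agent".
print(request.get_header("User-agent"))
# prints ExampleBot/1.0
```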
Now, with the use of the User Agent, we can manage the crawling of our site by addressing either a specific spider or all of them.
User-agent: Googlebot
Disallow:
With the above code, we are telling Google that it may crawl the whole site. User-agent refers to Googlebot (which is the Google spider's UA), and an empty Disallow: means nothing is off limits, so it is saying yes to crawling your site.
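If you want to check this for yourself, Python ships a robots.txt parser in its standard library (urllib.robotparser). The sketch below feeds it the two lines above and asks whether Googlebot may fetch a page. The page URL is just an example I picked.

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: Googlebot",
    "Disallow:",
]

parser = RobotFileParser()
parser.parse(rules)  # parse the lines directly, no network access needed

# An empty Disallow: blocks nothing, so any path is allowed.
allowed = parser.can_fetch("Googlebot", "http://yoursite.com/any/page.html")
print(allowed)
# prints True
```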
User-agent: *
Disallow: /
The above code refers to all spiders, which is the meaning of the “*”, and Disallow: / tells the spider not to crawl any content of the site.
You can specify a directory you do not want crawled by using Disallow: /dir_name
All of this is from my own understanding, experience and knowledge. If I have made any mistake, you can put me right by using the comments below.
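Here is a small sketch that tests both of these rules with Python's built-in urllib.robotparser. The helper can_crawl and the spider name ExampleBot are my own inventions for the example, not part of any standard.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(rules, agent, url):
    """Parse robots.txt lines and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(agent, url)


block_all = ["User-agent: *", "Disallow: /"]
block_dir = ["User-agent: *", "Disallow: /dir_name"]

# Disallow: / shuts every spider out of the whole site.
print(can_crawl(block_all, "ExampleBot", "http://yoursite.com/index.html"))
# prints False

# Disallow: /dir_name only blocks paths that start with /dir_name.
print(can_crawl(block_dir, "ExampleBot", "http://yoursite.com/dir_name/page.html"))
# prints False
print(can_crawl(block_dir, "ExampleBot", "http://yoursite.com/index.html"))
# prints True
```

Note that the rule is matched as a prefix, so in this parser /dir_name also covers a path such as /dir_name_old. Writing Disallow: /dir_name/ with a trailing slash narrows it to the directory itself.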
Hope it helps!!