How Robots.txt and Meta Tags Affect Search Engine Crawling


If you are concerned about the privacy of your website and do not want search engine crawlers, or bots, to access certain pages, then "robots.txt" is the one-stop solution that keeps the crawlers away from the 'No Entry' zone.

Webmaster’s Note: This is a guest post by Sarah Bruce

Confused? You are probably wondering why anyone would keep search engine bots away from their pages when everyone wants their website indexed. Fair question.

Why stop the bots from entering certain pages of a website?


If yours is an e-commerce website and you store your database on it, would you like to disclose your clients' information to the entire world? Definitely not! But if you take no precautions to tell the crawlers to stay away from pages containing that vital information, search engine spiders will eventually crawl them and index them in the search results. From there, anybody could view your clients' details and misuse them, putting you and your clients in a legal nightmare.

To avoid such a disaster, you should use robots.txt. 'Robots.txt' plays a role similar to a bouncer at a club: just as bouncers keep guests out of the club's private sections, robots.txt keeps crawlers out of yours. Think of it as a file listing the directories that specific crawlers, or all crawlers, should not enter.

Now, this question arises: Are your pages safe with robots.txt?

Before visiting any page of a website, well-behaved search engine crawlers look for the existence of a robots.txt file, where they can see which pages they are asked not to access.

You generally don't need to worry about the major search engine bots violating your robots.txt file. Note, however, that robots.txt is a convention rather than an enforcement mechanism: compliance is voluntary, and the big search engines honor it to protect their reputation, not because of any legal obligation.

The bad news is that malicious scrapers and spam bots simply ignore robots.txt to reach a website's 'private' pages, and robots.txt itself can do nothing about that. So it is highly recommended to use firewalls, encryption, password protection and other security measures in addition to robots.txt.

In and out of ‘robots.txt’!

Not everyone needs robots.txt. Unless your website contains content that you do not want crawlers to reach, there is no need to upload a robots.txt file, not even an empty one.

A robots.txt file contains a set of instructions for search engine crawlers: the files and directories that are not supposed to be crawled. A noteworthy point is that the file must be placed in the top-level directory of your site, because crawlers look for robots.txt at the root of the host, not in a subdirectory. (Each subdomain needs its own robots.txt file.)

For example, http://www.abc.com/robots.txt is a valid location, but http://www.abc.com/mysite/robots.txt is invalid.
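The root-only rule can be made concrete with a few lines of Python: given any page URL, the robots.txt that a compliant crawler consults lives at the root of that page's host. This is a minimal sketch; the `robots_url` helper name is my own, not a standard API.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL a crawler would check for this page."""
    parts = urlsplit(page_url)
    # robots.txt always sits at the root of the host, never in a subdirectory.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.abc.com/mysite/page.html"))
# http://www.abc.com/robots.txt  (never .../mysite/robots.txt)
```

Note that a subdomain counts as a different host, so `http://blog.abc.com/` would be served by its own `http://blog.abc.com/robots.txt`.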

How to create a robots.txt file?

There are two important parts of a robots.txt file:

User-agent: This identifies a search engine bot. You can target either all search engine bots or a specific one.

Disallow: This field tells the named search engines which files or directories they must not crawl.

If you want all search engines to skip a directory, use a * in the User-agent field and wrap the directory name in forward slashes:

User-agent: *

Disallow: /directoryA/

If you want a particular bot, say Bingbot, not to crawl a directory, name it in the User-agent field:

User-agent: Bingbot

Disallow: /directoryA/

If you want all search engines not to crawl the entire website, then:

User-agent: *

Disallow: /

If you want to restrict the search engine bots from crawling a page, then:

User-agent: *

Disallow: /abc_file.html

Google uses many bots, such as Googlebot-Image and Googlebot-Mobile. Conditions applied to Googlebot also apply to these specialized bots, but not vice versa. You can set rules for a specific bot as well.

To block an image from Google Images, use the following:

User-agent: Googlebot-Image

Disallow: /images/watch.jpg

To remove all images from Google Images, use:

User-agent: Googlebot-Image

Disallow: /

If you want to block a specific file type, for example .png, then use pattern matching (the * and $ wildcards are extensions supported by Googlebot):

User-agent: Googlebot

Disallow: /*.png$
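Rules like the ones above can be sanity-checked without waiting for a crawler, using the robots.txt parser in Python's standard library. This is a minimal sketch; note that `urllib.robotparser` matches paths literally and does not understand the `*`/`$` wildcard extensions, so only literal paths are used here:

```python
from urllib.robotparser import RobotFileParser

# Rules corresponding to the examples above: block one directory and one file.
rules = """\
User-agent: *
Disallow: /directoryA/
Disallow: /abc_file.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Blocked paths:
print(rp.can_fetch("Googlebot", "http://www.abc.com/directoryA/page.html"))  # False
print(rp.can_fetch("Googlebot", "http://www.abc.com/abc_file.html"))         # False
# Everything else is allowed:
print(rp.can_fetch("Googlebot", "http://www.abc.com/index.html"))            # True
```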

You can be certain that compliant bots will not crawl the pages listed in your robots.txt. However, if the URLs of those pages appear as links on other pages, there is still a chance that the URLs themselves get indexed, even though their content is never fetched.

To avoid this kind of trouble, it is recommended that you use the 'robots meta tag' to keep a specific page out of the index entirely. Let us dig a little deeper into the robots meta tag to understand it better.

Robots Meta Tag: In Depth

'Index' and 'noindex' are the two major directives of the robots meta tag, which gives you page-by-page control over indexing. If you do not want search engine bots to index a specific page, put the following meta tag in the head section of the page:

<meta name="robots" content="noindex">

If you do not want a specific bot, for example Googlebot, to index a page, then:

<meta name="Googlebot" content="noindex">
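Checking your own pages for this tag can be scripted. Below is a minimal sketch using Python's standard `html.parser`; the `RobotsMetaParser` class name is my own, and the sample page is hypothetical:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if a.get("name", "").lower() == "robots":
                # content may hold several comma-separated directives
                self.directives += [d.strip() for d in a.get("content", "").lower().split(",")]

page = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'
p = RobotsMetaParser()
p.feed(page)
print(p.directives)  # ['noindex', 'nofollow']
```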

Search engine crawlers will only crawl the pages they are allowed to. But if they find links to a blocked page elsewhere, they may still index that URL. Marking a page 'index' does not guarantee that bots will index it; 'noindex', on the other hand, is reliable: search engine bots will drop a page marked 'noindex' from the index, even if other pages link to it.

Remember that a 'noindex' meta tag only works if the page is not blocked in robots.txt: the bot must be able to crawl the page in order to see the tag, at which point it drops the page from the index. If robots.txt blocks the page, the crawler never sees the 'noindex' tag at all.

It is possible that, despite the 'noindex' meta tag, the page still appears in the search results. Don't panic; the likely reason is that the crawlers have not revisited your page since you added the tag. It will be removed the next time the crawler recrawls the page.

To speed up the index removal process, you can also make use of Google’s URL removal tool.

Final Touch: Test your robots.txt file through Google Webmaster Tools

Run this test in the 'Test robots.txt' tool before you upload the robots.txt file to your website's root. The test gives you an accurate result, as it reads the file the same way Googlebot does.

Performing this test is a plus, as it tells you whether your robots.txt file is accidentally blocking or permitting a page, so you can fix any problems it finds. Here is how to use the tool:

• On the Webmaster Tools home page, click the website you want to check.

• Under the 'Health' section, click 'Blocked URLs'.

• The 'Test robots.txt' tab should be selected by default; if it is not, click on the tab.

• Copy the content of your robots.txt file and paste it into the first box.

• Copy and paste the URLs you want to test into the 'URLs' box.

• List the user-agents in the 'User-agents' box.

Do remember that you cannot make any changes from within the tool; you need to edit the robots.txt file itself.
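If you prefer to test before touching Webmaster Tools, the same check can be approximated offline: paste your robots.txt content and loop over the URLs and user-agents you care about, much like the tool's input boxes. A minimal sketch with a hypothetical file; it also illustrates that a bot-specific group (here Bingbot's) replaces the * group for that bot rather than adding to it:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file to verify: Bingbot gets its own group.
robots_txt = """\
User-agent: Bingbot
Disallow: /directoryA/

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Mirror the tool's boxes: a list of URLs and a list of user-agents.
for agent in ["Googlebot", "Bingbot"]:
    for url in ["http://www.abc.com/directoryA/", "http://www.abc.com/private/"]:
        verdict = "allowed" if rp.can_fetch(agent, url) else "blocked"
        print(f"{agent:9s} {url}: {verdict}")
```

Because Bingbot matches its own group, only /directoryA/ is blocked for it; the * group (blocking /private/) applies to every other bot, such as Googlebot.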

  • http://shroomz81.wix.com/seomaniac/ SEO Maniac

    It means that we should use our site’s robot.txt properly. We must also abide every search engine’s webmaster’s set of rules and guidelines because they have different types of search engine crawlers.

  • http://www.obiteljskizivot.com Daniel

    So should I basically disallow robots of going thru the folder of my wp instalation altogether, as there I have all user information for our websites forum? Or should I only protects the DB file alone? tnx

  • http://www.madmadrasi.net mad.madrasi

    Is there any code to prevent crawlers from going through the archive pages, but only the posts (for Blogger)?

    • http://h3sean.com Sean

      I assume that archive pages are mostly nofollowed or noindexed by default especially if it’s a CMS platform like Blogger :)

      • http://www.madmadrasi.net mad.madrasi

        Actually on Blogger it doesn’t – I mean archive pages *are* indexed by Google crawler by default. After my original comment, fingering around with blogger settings ended up with customs robots txt settings in Blogger – it should solve the problem, I think.
        (And a post for my blog too …)
        :-)
        Thanx.

  • http://www.sharpefit.com/ Bob

    Great information, the robot.txt file has always scared the heck out of me! I adjusted a few of my settings because it looked like it was crawling the admin pages and it really seemed to slow down crawl rates. Possibly that will be a win/win after all thanks to the robot adjustment.

  • http://www.kerrickscollectibles.blogspot.com/ Laurence Kerrick

    Definitely something I should know. I run an small online store and my customer’s information is stored in a sql database on site. I have written a robots.txt file to make sure that information does not get crawled. Thank you for the advice.

  • http://forums.techiesay.com/ Fred

    Yup, robots.txt is a useful file to instruct search engine crawlers, and tell them what to crawl and skip. But, this is something one should deal carefully with. I have even seen newbies who simply put useragent: * disallow: / and end up making their whole root accessible to crawlers. The above line of code will result in no crawling / indexing, and hence no traffic from search engines.

    Its good, but should be handled carefully. Overall, a nice read. Cheers.

  • http://pinoyfused.com Dangem

    I really thought that simple meta tags are enough. Thanks to this I know now the essentials of robot.txt and how vital it is on improving my site. Thank you so much sir! I hope you can share more about things like this very rare and unique.

  • http://www.infiniteskills.com Colin Boyd

    Setting the Robots TXT file is my first job after I create a website. A couple of years ago I was having real issues with Google, it would stop indexing my site each visit, after investigation and a lot of time I found this was down to a JPEG that was corrupt, I replaced the image and blocked the images folder and the problem was fix. Telling Google what not to index is also very important

  • http://www.fingerspot.com Mesin Absensi

    Sarah, i was studying robots.txt to blogspot. Do you have this information?
    Thank’s for share. :)