How Robots.txt and Meta Tags Affect Search Engine Crawling

If you do not want search engine crawlers, or bots, to crawl certain pages of your website, then robots.txt is the first tool to reach for: it tells the crawlers which parts of your site are a ‘No Entry’ zone.

Webmaster’s Note: This is a guest post by Sarah Bruce

Confused? You are probably wondering why anyone would keep search engine bots away from their pages when everyone wants their website to be indexed by the search engines. Here is why.

Reason for stopping the bots from entering certain pages of a website

If yours is an e-commerce website and your clients’ information is reachable through its pages, would you like to disclose that information to the entire world? Definitely not! But if you take no precautions to tell the crawlers not to crawl the pages holding that information, search engine spiders will eventually crawl them and index them in the search results. From there, anybody can view your clients’ details and misuse them, putting you and your clients in a legal nightmare.

To avoid such a disaster, you should use robots.txt. It plays a role similar to that of a bouncer in a club. Just as bouncers do not allow certain guests to enter the private sections of the club, robots.txt keeps crawlers out of certain sections of your website. Think of it as a file listing the directories that specific crawlers, or all crawlers, should not enter.

Now, this question arises: Are your pages safe with robots.txt?

Search engine crawlers are automated programs. Before visiting any page of a website, these bots look for a robots.txt file, which lists the pages they are asked not to access.
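The check a well-behaved crawler performs can be sketched with Python’s standard urllib.robotparser module. This is only an illustration: the bot name ExampleBot, the /private/ directory and the helper may_crawl are made up for the example, and real crawlers such as Googlebot use their own parsers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler asks before every fetch.
print(may_crawl(ROBOTS_TXT, "ExampleBot", "http://www.abc.com/private/clients.html"))  # False
print(may_crawl(ROBOTS_TXT, "ExampleBot", "http://www.abc.com/index.html"))  # True
```

Note that the crawler itself decides whether to run this check; nothing in the file can force it to.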

Don’t worry about the major search engine bots violating the robots.txt file of your website. Reputable crawlers such as Googlebot and Bingbot respect it. Keep in mind, though, that robots.txt is a voluntary convention and not an enforcement mechanism: nothing technically prevents a crawler from ignoring it.

The bad news is that malicious spammers also use robots to crawl a website’s ‘private’ pages, and those robots simply ignore robots.txt. So, it is highly recommended to use firewalls, encryption, password protection and other security measures in addition to robots.txt.

In and out of ‘robots.txt’!

Not everyone needs robots.txt. Unless your website has content that you do not want crawled, there is no need to upload a robots.txt file, not even an empty one.

A robots.txt file contains a set of instructions for the search engine crawlers, namely the files and directories that are not supposed to be crawled. A noteworthy point here is that this file should be placed in the top-level directory of your website, because crawlers look for the robots.txt file at the root of your domain and not in any subdirectory.

For example, http://www.abc.com/robots.txt is a valid location, but http://www.abc.com/mysite/robots.txt is invalid.
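As a minimal sketch of that rule, the location a crawler would check can be derived from any page URL by keeping the scheme and host and replacing the path. The function name robots_txt_url is an assumption made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the root-level robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, discard the page's own path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.abc.com/mysite/page.html"))
# http://www.abc.com/robots.txt
```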

How to create a robots.txt file?

There are two important parts of a robots.txt file:

User-agent: It identifies a search engine bot. You can address either all the search engine bots or a specific bot.

Disallow: This field lists the files or directories that the bots named in User-agent should not crawl.

If you want all search engines not to crawl a directory, use a * in the User-agent field, then write the directory name between forward slashes in the Disallow field:

User-agent: *
Disallow: /directoryA/

If you want Bingbot in particular not to crawl a directory, name it in the User-agent field:

User-agent: Bingbot
Disallow: /directoryA/

If you want all search engines not to crawl the complete website, then:

User-agent: *
Disallow: /

If you want to restrict the search engine bots from crawling a page, then:

User-agent: *
Disallow: /abc_file.html
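If you maintain many such rules, the file can also be generated rather than typed by hand. The sketch below is one possible approach, not a standard tool: the helper build_robots_txt and its input format (a mapping from user-agent to disallowed paths) are assumptions made for this example.

```python
from __future__ import annotations


def build_robots_txt(rules: dict[str, list[str]]) -> str:
    """Render a mapping of user-agent -> disallowed paths as robots.txt text."""
    blocks = []
    for user_agent, paths in rules.items():
        lines = [f"User-agent: {user_agent}"]
        lines += [f"Disallow: {path}" for path in paths]
        blocks.append("\n".join(lines))
    # Groups are separated by a blank line, one group per user-agent.
    return "\n\n".join(blocks) + "\n"


print(build_robots_txt({
    "Bingbot": ["/directoryA/"],
    "*": ["/abc_file.html"],
}))
```

Generating the file this way avoids stray spaces inside a path, which would change what the rule matches.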

Google uses many bots, such as Googlebot-Image and Googlebot-Mobile. Rules set for Googlebot also apply to these more specific bots, but not the other way around. You can set rules for a specific bot, as well.

To block an image from Google Images, use the following:

User-agent: Googlebot-Image
Disallow: /images/watch.jpg

To remove all images from Google Images, use:

User-agent: Googlebot-Image
Disallow: /

If you want to block a specific file type, for example .png, use the * wildcard to match any path and $ to mark the end of the URL:

User-agent: Googlebot
Disallow: /*.png$
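Crawlers such as Googlebot treat * in a Disallow path as "any sequence of characters" and a trailing $ as "end of the URL". The sketch below imitates that matching with a regular expression to show why /*.png$ blocks PNG files anywhere on the site, while a plain /.png would only match paths that begin with /.png. The helper names are made up for this example, and the real matching logic inside any search engine may differ in detail.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern ('*' and trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each '*' match any characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern: str, path: str) -> bool:
    """Return True if the path matches the pattern from its first character."""
    return pattern_to_regex(pattern).match(path) is not None


print(is_blocked("/*.png$", "/images/watch.png"))  # True
print(is_blocked("/*.png$", "/images/watch.jpg"))  # False
print(is_blocked("/.png", "/images/watch.png"))    # False
```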

Well-behaved search engine bots will not crawl the pages that you have listed in your robots.txt. However, if the URLs of those pages are found on other pages, there is still a chance that the URLs will be indexed, even though their content was never crawled.

To avoid this kind of trouble, it is recommended that you use the ‘robots meta tag’, which controls the indexing of a specific page. Let us look at the robots meta tag more closely, to understand it better.

Robots Meta Tag: In Depth

‘Index’ and ‘noindex’ are the two major instructions of the robots meta tag; they let you control indexing page by page. If you do not want search engine bots to index a specific page, put the following meta tag in the head section of that page:

<meta name="robots" content="noindex">

If you do not want a specific bot, for example Googlebot, to index a page, then:

<meta name="Googlebot" content="noindex">
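To confirm that a page really carries the tag, you can scan its HTML. The following is a minimal sketch using Python’s standard html.parser module; the class RobotsMetaParser and the function has_noindex are names invented for this example, and the check only covers meta tags, not any other way of sending the instruction.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a page asks 'robots' or one named bot for noindex."""

    def __init__(self, bot_name: str = "googlebot"):
        super().__init__()
        self.names = {"robots", bot_name.lower()}
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        directives = {item.strip() for item in content.split(",")}
        if name in self.names and "noindex" in directives:
            self.noindex = True


def has_noindex(html: str, bot_name: str = "googlebot") -> bool:
    """Return True if the HTML contains a noindex meta tag for the bot."""
    parser = RobotsMetaParser(bot_name)
    parser.feed(html)
    return parser.noindex


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(has_noindex(page))  # True
```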

Search engine crawlers will only crawl the pages that they are allowed to. But if they find links to a blocked page on other pages, they may still end up indexing its URL. An ‘index’ instruction does not guarantee that the bots will index a page. A ‘noindex’ instruction, however, is reliable: search engine bots will drop the pages that ask for ‘noindex’, even if other pages link to them.

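Remember that a bot can only obey a ‘noindex’ meta tag if it is allowed to crawl the page. If a page carries a ‘noindex’ meta tag and is not blocked in robots.txt, search engine bots will crawl that page and, the moment they come across the ‘noindex’ tag, drop it from the index. If the page is also blocked in robots.txt, the bots never fetch it and so never see the tag.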
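The way the two mechanisms interact can be summarized as a small decision function. This is a deliberately simplified model of the behaviour described in this post, not a description of any search engine’s actual algorithm; the function name and the outcome labels are invented for the example.

```python
def indexing_outcome(blocked_in_robots_txt: bool,
                     has_noindex_tag: bool,
                     linked_from_other_pages: bool) -> str:
    """Simplified model of what happens to a page in the search index."""
    if not blocked_in_robots_txt:
        # The bot can fetch the page, so it sees and obeys the meta tag.
        return "dropped from index" if has_noindex_tag else "may be indexed"
    # Blocked pages are never fetched, so a noindex tag on them goes unseen.
    if linked_from_other_pages:
        return "URL may still be indexed"
    return "not crawled"


print(indexing_outcome(False, True, True))   # dropped from index
print(indexing_outcome(True, True, True))    # URL may still be indexed
print(indexing_outcome(True, False, False))  # not crawled
```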

It is possible that, despite the ‘noindex’ meta tag, the page still appears in the search results. Don’t panic. The likely reason is that the crawlers have not come back to crawl your page since you added the meta tag. The page will be removed the next time the crawler crawls it.

To speed up the index removal process, you can also make use of Google’s URL removal tool.

Final Touch: Test your robots.txt file through Google Webmaster Tools

It is advisable to run this test with the ‘Test robots.txt’ tool before you upload the robots.txt file to your website’s root directory. The test gives you a realistic result, because the tool reads the file the way Googlebot does.

Running this test is a plus: you will find out whether the robots.txt file is accidentally blocking or permitting a page, and you can fix any problems found. Here is how to use the tool:

• On the Webmaster Tools home page, click on the website that you want to check.

• Under the ‘Health’ section, click ‘Blocked URLs’.

• The ‘Test robots.txt’ tab should be selected by default. If it is not, click on the tab.

• Copy the content of your robots.txt file and paste it in the first box.

• Copy and paste the URLs that need to be tested in the ‘URLs’ box.

• List the user-agents in the ‘User-agents’ box.

Do remember that changes made within the tool are not saved to your website; to apply a fix, you need to edit the robots.txt file itself.
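A rough offline version of the same test, pairing a draft file with a list of URLs and a list of user-agents, can be sketched with Python’s standard urllib.robotparser module. The function check_rules and the draft rules are assumptions made for this example. The standard-library parser uses simple prefix matching and does not understand the * and $ wildcards, so its verdicts can differ from Google’s own tool.

```python
from urllib.robotparser import RobotFileParser


def check_rules(robots_txt: str, urls: list, user_agents: list) -> dict:
    """Map every (user-agent, URL) pair to True (allowed) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, url): parser.can_fetch(agent, url)
        for agent in user_agents
        for url in urls
    }


# Hypothetical draft file to be checked before uploading.
DRAFT = """\
User-agent: Bingbot
Disallow: /directoryA/

User-agent: *
Disallow: /abc_file.html
"""

URLS = [
    "http://www.abc.com/directoryA/page.html",
    "http://www.abc.com/abc_file.html",
]

results = check_rules(DRAFT, URLS, ["Googlebot", "Bingbot"])
for (agent, url), allowed in results.items():
    print(agent, "allowed" if allowed else "blocked", url)
```

In this draft, Bingbot is blocked only from /directoryA/ because a bot that has its own group follows that group instead of the * group.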
