Robots.txt Update: Google Updated Robots.txt Testing Tool
Do you have content on your website that you don’t want Google’s bots to crawl and index? Content that is not crawled and indexed won’t appear on the results page. Let’s admit it: not all pages on our websites are crawl-worthy. The answer is simple: robots.txt!
Wondering what a robots.txt file does on a website? I have seen a lot of confusion about the robots.txt file, and this in turn creates SEO issues on your website. In this article, I will share everything you need to know about the robots.txt file, along with some links that will help you dive deeper into this topic.
Brief about robots.txt
Having spiders, bots or crawlers (they are all the same thing) frequently crawl and index your site’s content is good for your website. You might ask: why publish content that you don’t want Google to crawl in the first place? Good question. There are cases where a page is useful to visitors but should not be indexed. For example, your website may have two versions of a page: a browser-friendly version and a printer-friendly version. You don’t want a bot to crawl the latter, because it would be counted as duplicate content, and duplicate content can hurt how your pages perform in Google’s results.
Google’s robots.txt testing tool
Robots.txt files are critical for your site’s search engine optimization (SEO). Unfortunately, a badly written file can do more harm than good: it may block Google’s spiders from crawling and indexing important pages. In June 2014, Google launched an updated testing tool that makes these kinds of errors easier to detect and makes the robots.txt file easier to test and maintain.
With the tool, you can see your current robots.txt file and test URLs to determine whether they are allowed or disallowed for crawling. You can also edit the rules inside the tool and test each change as you go. Once you are happy with the rules (for example, after changing a URL from allow to disallow or vice versa), upload the new version of the file to your server for the changes to take effect.
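The same kind of allow/disallow check can be approximated offline. The following is a minimal sketch using Python’s standard `urllib.robotparser`; the rules, the `example.com` URLs, and the `check_urls` helper are made up for illustration. Note that this parser applies rules in file order and does not implement every matching feature Google supports, so treat its answers as a rough check, not a replacement for Google’s tool.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /print/
"""

def check_urls(robots_txt, user_agent, urls):
    """Return {url: True/False}, where True means crawling is allowed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}

results = check_urls(ROBOTS_TXT, "Googlebot", [
    "https://www.example.com/blog/my-post/",
    "https://www.example.com/print/my-post/",
    "https://www.example.com/wp-admin/options.php",
])
for url, allowed in results.items():
    print("allowed" if allowed else "blocked", url)
```

Running a check like this before uploading a new robots.txt is a cheap way to catch a rule that blocks an important page by accident.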
According to John Mueller, “I recommend you try it out, even if you’re sure that your robots.txt file is fine. Some of these issues can be subtle and easy to miss. While you’re at it, also double-check how the important pages of your site render with Googlebot, and if you’re accidentally blocking any JS or CSS files from crawling.”
What is the use of Robots.txt file on a Website?
Let me start with the basics: all search engines have bots that crawl websites. Crawling and indexing are two different things, and if you wish to read about the difference in depth, see: Google Crawling and indexing. When a search engine bot (Googlebot, Bingbot, or a third-party search engine crawler) comes to your site, by following a link or a sitemap submitted in the webmaster dashboard, it follows all the links on your blog to crawl and index your site.
Two files, Sitemap.xml and Robots.txt, reside at the root of your domain. Bots follow the rules in robots.txt to determine how to crawl your website. Here is what the robots.txt file is used for:
When search engine bots come to your blog, they have limited resources to crawl your site. If they can’t crawl all the pages on your website with those resources, they will stop crawling, and this will hamper your indexing. At the same time, there are many parts of your website that you don’t want search engine bots to crawl, for example your wp-admin folder, your admin dashboard, or other pages that are not useful to search engines. Using robots.txt, you direct search engine crawlers (bots) not to crawl those areas of your website. This not only speeds up crawling of your blog, but also helps with deep crawling of your inner pages.
One of the biggest misconceptions about the robots.txt file is that it is used for noindexing. Remember, the robots.txt file is not for index or noindex; it only directs search engine bots to stop crawling certain parts of your blog. For example, if you look at the ShoutMeLoud robots.txt file (WordPress platform), you will clearly see which parts of my blog I don’t want search engine bots to crawl.
How to check your Robots.txt file?
As I mentioned, the robots.txt file resides at the root of your domain. You can check your domain’s robots.txt file at www.domain.com/robots.txt. In most cases (especially on the WordPress platform), you will see a blank robots.txt file. You can also check your domain’s robots.txt file in GWT by going to Google Webmaster Tools > Site configuration > Crawler access.
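Because the file always sits at the root, its address can be derived from any page URL on the site. Here is a small sketch using Python’s standard `urllib.parse`; the helper name `robots_txt_url` and the sample page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://www.domain.com/blog/some-post/?page=2"))
# prints https://www.domain.com/robots.txt
```

Keep in mind that each subdomain has its own robots.txt, so a page on blog.domain.com would resolve to a different file than one on www.domain.com.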
The basic structure of your robots.txt to avoid duplicate content should be something like this
This will prevent robots from crawling your admin folder, followed by feeds, trackbacks, comment feeds, pages and comments. Remember, the robots file only stops crawling; it doesn’t prevent indexing. Google honors the noindex tag for keeping a post or page of your blog out of its index, and you can use the Meta Robots plugin or WordPress SEO by Yoast to add noindex to individual posts or to a part of your blog. For effective SEO of your domain, website, or blog, I suggest you keep your category and tag pages as noindex but dofollow (the actual directive values are “noindex, follow”). You can check the ShoutMeLoud robots file here.
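To make the description above concrete, here is a minimal sketch that generates a robots.txt covering those areas. The paths are assumptions based on common WordPress URL patterns, not the actual ShoutMeLoud file, and `build_robots_txt` is a hypothetical helper; adjust the paths to match your own permalink structure.

```python
def build_robots_txt(disallowed_paths, user_agent="*"):
    """Return robots.txt text with one Disallow rule per path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"

# Assumed WordPress-style paths: admin folder, feeds, trackbacks,
# comment feeds, paginated pages and comments.
WORDPRESS_PATHS = [
    "/wp-admin/",
    "/feed/",
    "/trackback/",
    "/comments/feed/",
    "/page/",
    "/comments/",
]

robots_txt = build_robots_txt(WORDPRESS_PATHS)
print(robots_txt)
```

Save the printed text as robots.txt at the root of your domain. Each Disallow line is a path prefix, so `/page/` also covers `/page/2/`, `/page/3/` and so on.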
The robots.txt file is only used to stop crawling of parts of your blog.
The robots.txt file should not be used for noindexing; use the noindex meta tag instead.
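If you want to confirm that a page actually carries the noindex meta tag, a quick check is possible with Python’s standard `html.parser`. This is a minimal sketch: the class and function names are made up for illustration, and it only looks at `<meta name="robots">` tags in the HTML you pass in, not at HTTP headers or bot-specific meta tags.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )

def is_noindex(html):
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(is_noindex(page))
```

A crawler can only see this tag if it is allowed to fetch the page, so a page you want noindexed should not also be disallowed in robots.txt.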
Some caveats on the limitations of robots.txt
Google warns that there are risks in using robots.txt as a URL-blocking method. Consider other mechanisms to make sure that the URLs you don’t want crawled and indexed cannot be found on the web.
Make sure that private information is safe and secure
Commands in the robots.txt file are not hard rules. Some crawlers, other than Google’s bots, may not obey them. For example, you might find a disallowed page in Yahoo but not in Google. Blocking a URL in robots.txt therefore does not guarantee that it stays hidden everywhere.
To be safe, use other methods alongside robots.txt, such as password-protecting private files on the server itself. Also, make sure that there are no backlinks to the content at the disallowed URL. Google may still index the URL after discovering it through other websites that link to that page, so the disallowed URL can still appear on the results page.
Finally, use the right syntax for each spider and crawler. Some crawlers may disobey the commands because they interpret the instructions differently. Determine the proper syntax first so that each crawler understands the commands.
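One place where per-crawler differences show up is in rule groups addressed to specific user agents. The sketch below uses Python’s standard `urllib.robotparser` with made-up rules, paths, and the bot name `SomeOtherBot`; it shows how one file can give different answers to different crawlers. It illustrates how this particular parser reads the file, and other crawlers may interpret the same file differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a group for Googlebot and a fallback group for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("Googlebot", "SomeOtherBot"):
    for path in ("/private/report.html", "/drafts/post.html"):
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent:13} {path:22} {verdict}")
```

A crawler that matches a named group uses only that group, which is why Googlebot is not bound by the `/drafts/` rule here. Repeat any rule you want every crawler to follow in each group.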