Block Spiders from Crawling and Indexing Duplicate Content
If you have a large site, it may automatically generate duplicate content on different pages without your knowledge. For example, on a blog, when a post exceeds the maximum number of comments per page that you set in the admin control panel, the software automatically creates another page with the same post content to make room for new comments. Similarly, on a forum, the single-post links generate duplicate content for threads.
Let me clarify the last example. When spiders crawl a thread, they also crawl the single posts. When they crawl the single posts through the single-post links found in the thread, they detect duplicate content, because every one of those posts already appears in the thread itself.
Other on-page options (such as Print This Page, Post Listing Order, and Mobile Version) also generate duplicate content for a page.
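To make this concrete, here is a minimal Python sketch of how you might spot such variants in a list of crawled URLs. It assumes the variants differ only by query parameters; the parameter names (print, order, mobile) and the example.com addresses are made up for illustration, so adjust them to whatever your own script uses.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that change presentation, not content.
VARIANT_PARAMS = {"print", "order", "mobile"}


def strip_variant_params(url):
    """Return the URL without presentation-only query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in VARIANT_PARAMS
    ]
    # Rebuild the URL with the remaining parameters and no fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Group URLs that point to the same content once variants are stripped."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_variant_params(url)].append(url)
    # Keep only the groups that contain more than one address.
    return {base: found for base, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    crawled = [
        "http://example.com/thread?id=7",
        "http://example.com/thread?id=7&print=1",
        "http://example.com/thread?id=7&order=desc",
        "http://example.com/about",
    ]
    for base, found in group_duplicates(crawled).items():
        print(base, "is reachable at", len(found), "addresses")
```

Every group it reports is one page that spiders can reach at several addresses.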

Canonical URL
It's known that search engines dislike duplicate content and may penalize a page that has it. If several cases are found on a site, the whole domain may be penalized. In effect, you may be penalized for your own content.
This is where the canonical URL comes in. It's a tag introduced in 2009 that tells search engines which URL is the preferred version of a page when the same content is reachable at several addresses. The tag is not visible to normal visitors because it sits in the head section of a page (between <head> and </head>). A typical canonical tag looks like <link rel="canonical" href="URL" />, where URL is the address of the preferred page.
Adding it to pages manually takes a lot of time if you have many pages, but there are ways to add it automatically. I'll write a follow-up post on how to do that for some popular scripts.
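Until then, here is a rough Python sketch of the general idea: take a page's HTML and insert the canonical tag just before the closing head tag. This is not how any particular blog or forum script does it; the function name add_canonical and the example.com URL are my own placeholders, and a real site would normally do this in its template.

```python
import re
from html import escape


def add_canonical(page_html, canonical_url):
    """Insert a canonical link tag before </head> unless one already exists."""
    # Leave the page alone if it already declares a canonical URL.
    if re.search(r'<link[^>]*rel=["\']canonical["\']', page_html, re.IGNORECASE):
        return page_html
    # Escape the URL so characters such as & are valid inside the attribute.
    tag = '<link rel="canonical" href="{}" />'.format(escape(canonical_url, quote=True))
    # Replace only the first </head>; pages without one come back unchanged.
    return re.sub(
        r"</head>",
        lambda match: tag + "\n" + match.group(0),
        page_html,
        count=1,
        flags=re.IGNORECASE,
    )


if __name__ == "__main__":
    page = "<html><head><title>Thread 7</title></head><body>...</body></html>"
    print(add_canonical(page, "http://example.com/thread?id=7"))
```

Running it twice on the same page does not add a second tag, which matters if the function sits in a shared template.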
You can also keep unwanted pages out of the index with a robots meta tag. It's effective too, but not as effective as the canonical URL. Just put <meta name="robots" content="noindex,nofollow" /> in the head section of any page that you do not want indexed. Here noindex asks spiders not to index the page, and nofollow asks them not to follow the links on it.
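As a small illustration, the Python sketch below decides whether a page should carry the robots meta tag, assuming the unwanted pages can be recognized by a query parameter. The print parameter, the function name robots_meta_for, and the example.com URLs are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical parameters that mark pages you do not want indexed.
NOINDEX_PARAMS = {"print"}

ROBOTS_TAG = '<meta name="robots" content="noindex,nofollow" />'


def robots_meta_for(url):
    """Return the robots meta tag for unwanted pages, or an empty string."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # Any overlap between the URL's parameters and the blocklist triggers the tag.
    if NOINDEX_PARAMS & set(query):
        return ROBOTS_TAG
    return ""


if __name__ == "__main__":
    print(robots_meta_for("http://example.com/post?id=3&print=1"))
    print(repr(robots_meta_for("http://example.com/post?id=3")))
```

The returned string would go into the head section of the page; the normal version of the post gets nothing and stays indexable.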
Here is a video from Matt Cutts on how Google detects and treats duplicate content. It seems they are using far more advanced technology on their result pages to filter out duplicate content.