Avoid duplicate content

Search engines detect duplicate content and may drop the copied pages from their results, so make sure your content is unique.

Search engines try to provide diversity in their results, so they omit results that are broadly the same. They realize that users don’t want the results page spammed with identical entries, nor do the search engines want to waste resources on crawling content that exists elsewhere. The other reason why content duplication is policed aggressively is that many spam sites use duplication to funnel more traffic to the same page: several hundred identical sites are set up under different domain names, each optimized for a keyword variant.

Duplicate content is detected when the same material is available via multiple URLs, and there’s no guarantee which URL will be indexed and which will be ignored. Incidentally, the content doesn’t have to be identical to be considered a duplicate: search engines have algorithms that measure the similarity of two sets of data, so copying a reasonable quantity of information from one page to another may be treated as duplication. The same is true for common resources, such as images, PDFs and video, which are covered in Google’s 2006 patent on detecting duplication.
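To give a feel for how the similarity of two pages can be measured, here is a minimal Python sketch that breaks each text into overlapping three-word "shingles" and compares the two sets with the Jaccard index (shared shingles divided by total distinct shingles). This is an illustration only: the function names, the shingle size and the sample sentences are assumptions made for this example, and it is not the algorithm any particular search engine is known to use.

```python
from __future__ import annotations

import re


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word sequences of length `size`."""
    # Lowercase and keep only word-like tokens so punctuation and
    # capitalization do not hide an otherwise identical passage.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a: str, b: str, size: int = 3) -> float:
    """Score two texts from 0.0 (nothing shared) to 1.0 (same shingles)."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    if not set_a and not set_b:
        return 1.0  # two empty texts are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical page snippets that differ by a single word.
page_a = "Search engines omit results that are broadly the same"
page_b = "Search engines omit results that are mostly the same"
score = jaccard_similarity(page_a, page_b)
```

Changing one word out of nine still leaves a noticeable overlap between the two snippets, which is why a lightly edited copy can be flagged as a duplicate. Where to draw the line between "similar" and "duplicate" is a judgment call, and real systems are likely to use more scalable techniques than comparing every pair of pages directly.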

In the book, we discuss a range of ways that content can be duplicated either by accident or on purpose, and how to address these.