Query Level Duplicate Content – The Solution
Unique content and a good link profile are the two main essentials for getting decent search engine rankings.
But what happens if Google interprets your content as not totally unique? There are three ways of looking at duplicate content:
A: Identical content on different websites
B: Identical content on different websites, with the same HTML structure on each one
C: Elements of copied content mixed in with unique content
If we look at A to begin with, this is a very common occurrence online. For example, when you write a press release and get it syndicated, the same release appears on lots of different websites, but they are all just copies of the one original.
Google does a pretty good job of handling this and tends to just serve up one or two copies from the websites it deems should actually get the benefit of the press release.
In case B, this can happen if, for example, a company has more than one website for different locations; they are all the same in terms of content and structure and maybe just the domain is different. Google does a pretty good job, generally, of sorting this out in the results too.
But case C is a little more complicated. Using the example of the press release, say I read it, got all excited about it and decided to blog about it; I would write about it and maybe include chunks of text from the press release. My input is unique, but I’ve also copied content from the other website.
Is my blog post therefore unique or copied? What if I only copied a sentence from it?
Something Changed
Up until two or three weeks ago I felt that Google was also doing a pretty decent job of working out how to handle this sort of situation. That is, until an important page on a client’s website vanished into obscurity.
It came out of the blue: the page had been fine for months, but all of a sudden it went. The page hadn’t changed, so I can only assume it was Google tweaking its algorithm, as it seems to do more and more every day.
Fortunately I managed to find it again. Google had deemed it to be too similar to another result on page 1 for the given search term and hidden it behind the dreaded message right at the end of all the search results:
In order to show you the most relevant results, we have omitted some entries very similar to the 503 already displayed.
If you like, you can repeat the search with the omitted results included.
Hmmm, the page was unique and not copied from anywhere, so why on Earth did Google think it wasn’t worth including in its search results any more?
The Search Term
The first thing I noticed was that the page was ranking fine for other, similar key phrases. It was only for the all-important two-word phrase that it was being considered “too similar” to other pages already chosen to rank on page 1.
It was obvious at this point that Google wasn’t considering the page to be a duplicate of anything, as it was ranking just fine for other terms, but it had some similarity issues at the point when the query for the two-word search term was made.
This is a query level duplicate content issue.
Patent Comes to the Rescue
So I did a bit of digging and discovered this Google patent: Detecting query-specific duplicate documents.
Noticing it was rather long and heavy, I made myself a cup of tea, got some biscuits out and started to digest it (the patent, not the biscuits).
Ninety minutes later I’d made a list of notes and determined the following:
- Google is checking for duplication at the query level, which is why the page still ranks for other stuff.
- It finds references to the key phrase in the text and selects the sentence or paragraph that it’s in.
- It then compares those “chunks” of text from the page with similar “chunks” it takes from other pages on other websites that it’s going to rank for that key phrase to determine similarities.
There is much more to it than that, but with just those three points the solution is simple if you have been affected by this part of the algorithm.
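To make those three points a little more concrete, here’s a rough Python sketch of the idea as I understand it. It is purely illustrative: the function names, the sentence-based chunking, the word-overlap score and the 0.6 threshold are all my own inventions rather than anything lifted from the patent, but the shape is the same — find the chunks that contain the key phrase, then compare them with the equivalent chunks from the other pages Google wants to rank.

```python
import re

def query_chunks(page_text, query):
    """Pull out the sentences that contain the query phrase -- a stand-in
    for the 'chunks' of text the patent describes selecting."""
    sentences = re.split(r'(?<=[.!?])\s+', page_text)
    return [s for s in sentences if query.lower() in s.lower()]

def chunk_similarity(chunk_a, chunk_b):
    """Very crude word-overlap (Jaccard) score between two chunks."""
    words_a = set(chunk_a.lower().split())
    words_b = set(chunk_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

def looks_like_query_duplicate(my_page, other_page, query, threshold=0.6):
    """Flag the pair as query-level duplicates if any pair of
    query-bearing chunks overlaps too heavily. The 0.6 threshold
    is an invented number, not a value from the patent."""
    for mine in query_chunks(my_page, query):
        for theirs in query_chunks(other_page, query):
            if chunk_similarity(mine, theirs) >= threshold:
                return True
    return False
```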
The Solution
Just remove all instances of the key phrase from the body of the text! In doing so, the algorithm can no longer select the portions of the page associated with the key phrase and therefore has nothing to compare with the chunks it takes from other pages.
What I did was change the way the key phrase appeared on the page. I used alternative ways to describe it instead.
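Carrying on with the hypothetical query_chunks helper from the sketch above, this is roughly why the fix works: once the exact phrase no longer appears in the body text, there are no query-bearing chunks left to hand over for comparison. The key phrase and page text here (“blue widgets”) are made-up examples.

```python
original = "We sell blue widgets. Our blue widgets are hand made in the UK."
reworded = "We sell hand-crafted gadgets in a rich shade of blue, made in the UK."

print(query_chunks(original, "blue widgets"))  # two chunks for the algorithm to compare
print(query_chunks(reworded, "blue widgets"))  # [] -- nothing left to compare
```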
I then used the “Fetch as Googlebot” function in Google Webmaster Tools to ask Google to re-crawl the page. If my theory was correct, the page would rank again as soon as it had been re-crawled, because Google checks for duplication at query time.
And sure enough it worked beautifully 🙂
It was late at night when I made the changes to the page (it took just five minutes). I went to bed, and in the morning the page had been re-crawled and was once again back on page 1, ranking for its two-word key phrase in the same position it had previously enjoyed.
Some Extra Notes
Note that I was focusing on an important two-word key phrase here. If you read the patent you’ll begin to understand that it doesn’t just look for that exact key phrase; it also looks for the words of the query in close proximity to each other on the page. It can’t look for each word on its own either, as otherwise the measured duplication would be very high when looking at individual occurrences of each word on a page. Different factors are applied depending on the number of words in the query. You may have to juggle the words around, or remove them, until you’ve made the text different enough that it survives the comparison.
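If you want to picture that proximity idea, here’s a small variation on the earlier sketch. As before, it is only an illustration of the concept: the eight-word window and the crude word matching are my own assumptions, not values taken from the patent.

```python
import re

def proximity_chunks(page_text, query, window=8):
    """Select sentences in which all the query words appear within a few
    words of each other, not only as the exact phrase. The window of 8
    words is a guess for illustration, not a value from the patent."""
    query_words = query.lower().split()
    chunks = []
    for sentence in re.split(r'(?<=[.!?])\s+', page_text):
        words = [w.strip('.,;:!?"') for w in sentence.lower().split()]
        # first position of each query word in the sentence, if present
        firsts = [words.index(qw) for qw in query_words if qw in words]
        if len(firsts) == len(query_words) and max(firsts) - min(firsts) <= window:
            chunks.append(sentence)
    return chunks
```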
I took the drastic approach of removing all instances of the exact key phrase from the page (apart from in the title tag, metas and main heading) as I wanted to see if the theory worked. If I’d wanted to sit it out and remove or change one instance at a time, that would have helped to find out exactly which occurrence(s) of the key phrase was the actual cause of the problem. You can choose to do it the fast way or the slow way; both lead to the same outcome.
Some people may say that getting into this problem in the first place is brought about by not having enough unique content on a page, but that’s not strictly true. It’s important to understand that although Google’s algorithm does a great job in the complex field of duplicate content, sometimes perfectly legitimate pages and websites get affected for the wrong reasons. This happens all the time with many aspects of the algorithm and its frequent major updates.
In my example here of a two-word key phrase in a very competitive niche, it’s likely that there will be some similarities in some of the text; that’s inevitable, as there are only so many ways you can write about the same subject. If you had a page about the new Lamborghini Aventador, there are only so many ways you could write about the fact that its name comes from a Spanish fighting bull and that it has a V12 engine. There are always going to be similarities and, inevitably, pages with good content can sometimes disappear from the results as a consequence.
So there you have it: how to get your site back into the search results if Google thinks some of your content is too similar to other pages.
Author: Justin Aldridge