What Is Duplicate Content And Should You Worry About It?

Sep 27, 2007, 09:12

Donald Saunders

Webmasters have been debating the topic of duplicate content for a long time, but exactly how do we define duplicate content, and does having it really matter?

The argument over exactly what duplicate content is, and whether it is a problem, continues to rage, and there is little sign that it is going to go away.

It is generally felt that duplicate content matters. One well known and highly respected SEO expert recently expressed the opposite view, but even a cursory look at the mass of material written on this subject in recent months clearly shows that this is a minority view.

However, if we agree that duplicate content does matter, then just how should we define it? For example, if I compose an article for an article directory and then alter that same article for submission to a second article directory, how are the search engines going to evaluate my two articles and decide whether they contain duplicate content? The truth is that we don't know, but here are this webmaster's thoughts.

When the major search engines first started checking for duplicate content, it was a simple case of comparing one web page as a whole against another, with no attempt to dissect the pages and compare individual page elements. Back then it was possible to use identical content and just add an introductory and concluding paragraph to one of the pages, and that would be enough to escape any duplicate content penalty. Sadly for many webmasters, those days are now a distant memory.

Today, the major search engines divide up the two pages so that they can examine individual elements, and it is here that we find the core of the present argument. Most people agree that attention is now largely restricted to the main content of a page rather than the structure of the page. A large number of webmasters build their pages from templates, which set the structure of each page, including such things as headers, footers and menus. This is generally felt to be acceptable, and the major search engines do not count it as duplicate content. What the major search engines are concerned about is the main content contained in the body of the page. But exactly how do they examine this page content?

Some people believe that this checking is undertaken at 'block' level (examining individual sentences or paragraphs), while others contend that filters search for phrases or even individual words. None of us really knows, of course, although it seems reasonable to assume that the most likely basis of examination is either sentence or phrase matching.

Sentence matching is reasonably clear-cut and simply involves breaking both pages down into chunks defined by the punctuation on the page. For example, look at this sentence:

It is reasonably easy to get a good deal on a camera, providing you know how to haggle.

This would be viewed as either a single sentence or two sentences, depending on whether you use the traditional definition of a full stop as the end of a sentence or adopt a more elastic approach and also break on other punctuation marks, such as commas.
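
The two readings of that example can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how any search engine actually splits a page; the function name `split_chunks` and the exact set of "elastic" punctuation marks are assumptions made for the example.

```python
import re

def split_chunks(text, elastic=False):
    """Split text into chunks at punctuation marks.

    Strict mode breaks only on sentence-ending marks (. ! ?).
    Elastic mode (an assumption for this sketch) also breaks on , ; and :
    """
    pattern = r"[.!?,;:]+" if elastic else r"[.!?]+"
    # Drop empty pieces left over after the final punctuation mark.
    return [chunk.strip() for chunk in re.split(pattern, text) if chunk.strip()]

sentence = ("It is reasonably easy to get a good deal on a camera, "
            "providing you know how to haggle.")

strict_chunks = split_chunks(sentence)
elastic_chunks = split_chunks(sentence, elastic=True)

print(len(strict_chunks))   # strict reading: one chunk
print(len(elastic_chunks))  # elastic reading: two chunks
```

Under the strict reading the whole sentence has to match before it counts as a duplicate; under the elastic reading each half could be matched separately.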

Phrase matching is a little bit more complex. What is the definition of a phrase? Should it have 2 or 3 or 4 or 20 words?

Just for the moment let's assume that we are going to define a phrase as 3 words. In this case the following phrases would all be seen as duplicate content if they appeared on two pages which were being compared:

You can get
Did you know
Take a look
In those days
One way to
Day to day
The answer is
At that time
In the end

All of these are ordinary, everyday phrases which could be used on pages about dog training, learning to play bridge, making money online or any other subject you care to mention. Now there are a few people who contend that the major search engines do examine pages at this level. For example, when I questioned the staff of one particular duplicate checker (DupeCop) about how their system examined duplicate content, they replied saying:

"DupeCop compares both individual words and 3-word phrases. It also ignores all punctuation and scans across sentences"

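To see why matching at this level catches so many everyday phrases, here is a minimal Python sketch of 3-word phrase comparison that ignores punctuation and runs across sentence boundaries. It is not DupeCop's code, nor any search engine's; the function name `three_word_phrases` and the two sample pages are invented for the illustration.

```python
import re

def three_word_phrases(text):
    """Return the set of 3-word phrases in text, ignoring case and punctuation."""
    # Keep only the words; punctuation and sentence boundaries are discarded.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

# Two hypothetical pages on unrelated subjects.
page_a = "Did you know? One way to train a dog is to reward it."
page_b = "Did you know that one way to win at bridge is to count cards?"

shared = three_word_phrases(page_a) & three_word_phrases(page_b)
print(sorted(shared))  # ['did you know', 'one way to']
```

The two pages have nothing in common as far as subject goes, yet they share two 3-word phrases purely because of ordinary turns of speech.
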
It was no surprise therefore that when I [...] Your guess would be as good as mine.

Over the years I have written and published literally hundreds of articles and have closely watched the results in terms of duplicate content penalties, as far as any of us can do so. On the basis of my own experience I am satisfied that filtering is not conducted down to the level of 3 or 4 word phrases but stops at the sentence level. Consequently, provided you alter your articles down to sentence level, you should have no problem in avoiding the content filters. In fact, even if a couple of your sentences are duplicated you will still be fine.
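
If you want a rough check on your own rewrites before submitting them, a sentence-level comparison along these lines is easy to run. This is a sketch only: the function names and the sample articles are made up for the example, sentences are split on full stops, question marks and exclamation marks, and nobody outside the search engines knows what share of duplicated sentences, if any, would trigger a filter.

```python
import re

def sentences(text):
    """Split on . ! ? and normalise case and spacing so trivial differences are ignored."""
    parts = re.split(r"[.!?]+", text.lower())
    return {" ".join(part.split()) for part in parts if part.strip()}

def duplicated_share(original, rewrite):
    """Fraction of the rewrite's sentences that also appear verbatim in the original."""
    rewritten = sentences(rewrite)
    if not rewritten:
        return 0.0
    return len(rewritten & sentences(original)) / len(rewritten)

# Hypothetical original article and its rewrite for a second directory.
original = ("Cameras are expensive. Haggling can lower the price. "
            "Always ask for a discount.")
rewrite = ("A good camera costs a lot. Haggling can lower the price. "
           "Never pay the sticker price. Ask about discounts.")

print(duplicated_share(original, rewrite))  # 0.25
```

Here one of the four sentences in the rewrite is carried over unchanged, which is the "couple of duplicated sentences" situation described above.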
