Why you should stop worrying about avoiding the duplicate content penalty
Posted on September 21, 2007 at 8:47 am
Ok, so it seems like everyone starting a blog or "optimizing" their blog is concerned about duplicate content penalties from Google, and so they have devised an entire slew of remedies, from adding all kinds of disallow statements to their robots.txt files to installing SEO-optimized, duplicate-content-curing plugins for WordPress.
And I’m no exception: I’ve got over 30 lines in my robots.txt file to block Google from my WP- folders, my archive pages, my tag pages, and lots more! I also have the SEO WordPress plugin installed, which helps prevent "supplemental results" by adding the NOINDEX meta tag to my category and archive pages. Basically, the only pages that I allow Google to access are the actual permalink URLs for my posts and my static pages.
That’s it! Nothing else! If you perform a site:www.online-tech-tips.com search in Google, you’ll see it’s just my articles and nothing else.
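If you want to check whether a given page really carries the NOINDEX meta tag described above, something like the following rough Python sketch could do it. The class and function names are made up for illustration, and it only looks at `<meta name="robots">` tags in the HTML you hand it; it does not fetch pages or look at HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <meta ... /> tags via handle_startendtag.
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

Feeding it the source of a category or archive page should return True if the plugin is doing its job, and False for a normal post page.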
Now when I first implemented this, I thought I was doing something that would help my rankings in Google, since it would keep my pages from getting thrown into the supplemental results. However, over the last few months, I’ve been asking other bloggers like Lorelle and Amit what steps they have taken to prevent duplicate content, and I was shocked by the responses.
Here was Lorelle’s response to my question:
Do I? Or does WordPress.com? This is a WordPress.com blog. You’ll have to talk to them about their robots.txt.
The duplicate content issue is one that bloggers have taken WAY out of control. Duplicate content is natural on blogs. Don’t stress over it. The issue is related specifically to evil doers who use duplicate content for their splogs, and stealing content from other blogs or copying content from their splogs across to their other splogs. It’s to tackle the evil, not the normal blogger.
For some reason I was thinking that such big bloggers would have been all over these "issues". So I decided to perform a site: search on a couple of big name blogs like ProBlogger.net, CopyBlogger.com, Lifehacker.com, and SEOMoz.com. Well it was pretty interesting what I came across. All of these sites get thousands of visitors a day from the search engines and yet just about everything is indexed by Google including archive pages, category pages, tag pages, and comments!
So after doing this, I became even more curious as to whether my 30 line robots.txt is really necessary! What kind of robots.txt file are these guys using? So here’s what mine looks like as of right now:
User-agent: Googlebot
Disallow: */feed*
Disallow: */rss*
Disallow: */trackback*
Disallow: */wp-admin
Disallow: */wp-content
Disallow: */wp-includes
Disallow: *wp-login.php
Disallow: */20*
Disallow: */comments*
Allow: */category/*/page/*
Disallow: /page*
Disallow: */search*
Disallow: */?s*
Disallow: */?p*
Disallow: */index.php?p*
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.cgi$
Disallow: /*.wmv$
Disallow: /*.cgi$
Disallow: /*.xhtml$
Disallow: /*.php*
Disallow: */trackback*
Disallow: /*?*
Disallow: /z/
Disallow: /wp-*
Disallow: */tag/
Disallow: */stats*
Disallow: */cgi-bin*
Allow: /wp-content/uploads/
User-agent: Googlebot-Image
Allow: /*
Sitemap: http://www.online-tech-tips.com/sitemap.xml
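With this many wildcard rules it gets hard to tell which URLs actually end up blocked. Here is a rough Python sketch for testing a path against a handful of rules. It assumes Google-style matching as I understand it: `*` matches any run of characters, a trailing `$` anchors the end of the URL, the longest matching pattern wins, and an Allow wins a tie. The function names, the `SAMPLE_RULES` subset, and the test paths are all just for illustration, and how a real crawler treats patterns that start with `*` instead of `/` is an assumption here.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end
    of the URL. Everything else is matched literally from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of ("allow" | "disallow", pattern) pairs. The longest
    matching pattern wins, and an allow rule wins a tie (assumed precedence).
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:  # an empty Disallow blocks nothing
            continue
        if pattern_to_regex(pattern).match(path):
            is_allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed


# A small subset of the rules from the file above, for illustration only.
SAMPLE_RULES = [
    ("disallow", "/*?*"),
    ("disallow", "/wp-*"),
    ("allow", "/wp-content/uploads/"),
    ("disallow", "/*.php$"),
    ("disallow", "*/tag/"),
]
```

Under these assumptions, `is_allowed("/some-post/", SAMPLE_RULES)` and `is_allowed("/wp-content/uploads/logo.png", SAMPLE_RULES)` come back True, while `/?s=robots`, `/wp-login.php`, and `/tag/seo/` come back False.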
Now let’s take a look at a few from the big bloggers! So here’s what the robots.txt file looks like for the following sites:
Problogger.net
User-agent: *
Disallow:
LifeHacker.com
User-Agent: Googlebot
Disallow: /index.xml$
Disallow: /excerpts.xml$
Allow: /sitemap.xml$
Disallow: /*view=rss$
Disallow: /*?view=rss$
Disallow: /*format=rss$
Disallow: /*?format=rss$
Sitemap: http://lifehacker.com/sitemap.xml
CopyBlogger.com
User-agent: *
Disallow: /*/feed/
Disallow: /*/trackback/
TechCrunch.com
User-agent: *
Disallow: /*/feed/
Disallow: /*/trackback/
Mashable.com
User-agent: *
Disallow: /feed
Disallow: /*.xml$
Disallow: /*/feed/
Disallow: /*/trackback/
Ok, so as you can see from the above list, EVERYONE’s list is a hell of a lot shorter than mine, and my list was created by reading through all kinds of posts talking about how everything must be blocked or disallowed. Well, obviously if the top bloggers are not worrying about duplicate content, then why should I be? Actually, it seems like having everything indexed may even be helping them in some way.
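To put a number on "a lot shorter", you could count how many Disallow lines in a robots.txt file actually block something. This is just a quick Python sketch with a made-up function name; it takes the file contents as a string (paste them in or download them however you like) and ignores empty `Disallow:` lines, since those block nothing.

```python
def count_blocking_rules(robots_txt: str) -> int:
    """Count the Disallow lines in a robots.txt body that have a non-empty path."""
    count = 0
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            count += 1
    return count
```

Run against the files listed above, the ProBlogger file counts as zero blocking rules and the CopyBlogger file as two.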
So before you go installing lots of plugins that prevent Google from indexing your whole site, remember two things:
1. It doesn’t seem like any of the really popular blogs are doing anything about it, and
2. Google no longer labels pages as "supplemental results" anyway!
My next step is to remove all of the disallow statements from my robots.txt file and see what happens! Has anyone else tried this yet?
Also, another observation that may be obvious, but warrants a mention is the fact that all of these people write GREAT content and a LOT of it. So you can do all the optimizing you want, but unless you have really good content that people will link to, bookmark, and visit again, it’s not really going to matter!
Tell me what you think in the comments!
One question regarding duplicate content, please. I also write for a few other sites, especially techtoday, which belongs to a really good friend of mine. I copy and paste my posts directly from my site to his. So will it penalize me or him? Thanks.
Well, it depends. If you write the content on your site and immediately post it on his site, the site that will be penalized is the one that Google indexes LAST. So if the Google bot indexes your Page1.html first, let’s say, and then goes to his site and sees the same content, his site will be penalized. But if it’s the other way around, you will be penalized.
Basically, the content should only be on one person’s site because no matter how you do it, only one will be in the main index.
Hmm, I post on his site immediately. So what if I change the article a bit and then post it?
Your changes should be significant; minor changes won’t really help. Actually, it would be much smarter to write the article, post it on ONE site, and then have the other site link back to that article with good keywords in the link. That way both sites will be getting high-quality backlinks, which is one of the most important factors in Google’s ranking algorithm. Don’t bother putting the content on both sites.
hi, andar here, i just read your post. i like very much. agree to you, sir.