Duplicate Content & The Canonical Tag
By Stephan Spencer, President & CEO, Netconcepts
© 2009 Stephan M Spencer Netconcepts

The Canonical Tag
Influences your sitelinks in Google

Duplicate Content Mitigation
Is not just about removing competing duplicate pages
It’s about recovering the leaked PageRank too

PageRank Leakage
Noindexed or disallowed pages (via robots.txt directives or robots meta tags) still accumulate PageRank
If the page is allowed (via robots.txt) but meta robots noindexed, it also passes PageRank
Thankfully, when obeyed, the canonical tag aggregates PageRank

Tools for Collapsing Duplicates
The Canonical Tag
– Great new addition to the SEO's arsenal, but not your best weapon
– Works best when used in concert with other signals
301 Redirect
– A much more absolute/automatically obeyed signal
– Use instead of (or in addition to) the canonical tag

Tools for Collapsing Duplicates
XML Sitemaps
– Include only your canonical versions in your feed
– Used as a canonicalization signal by Google
Rel=Nofollow
– On links pointing to the noncanonical versions
– Nofollowed links aren’t even used for discovery by Google
Meta Robots Nofollow
– Blocks the flow of PageRank

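One way to keep noncanonical URLs out of the sitemap feed, as a minimal Python sketch. The `build_sitemap` helper and the example.com URLs are hypothetical and not from the deck; the sketch assumes you already know the canonical URL for every page on the site.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build a sitemap from (url, canonical_url) pairs.

    Only URLs that are their own canonical version are listed,
    so duplicates never appear in the feed.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    seen = set()
    for url, canonical in pages:
        if url != canonical or url in seen:
            continue  # skip duplicates and repeated entries
        seen.add(url)
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical inventory: one canonical page and one re-sorted duplicate
pages = [
    ("http://www.example.com/widgets/", "http://www.example.com/widgets/"),
    ("http://www.example.com/widgets/?sort=price", "http://www.example.com/widgets/"),
]
print(build_sitemap(pages))
```

The feed is still only a hint, so it works best alongside the other signals listed above.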
PageRank Leakage Scenarios
Robots.txt disallow the dup page = PageRank is leaked to the duplicate, & it can show up in the SERPs
Meta robots noindex (or robots.txt noindex) the dup = PageRank is leaked, won’t show up in the SERPs
Rel=nofollow on links to the dup = PageRank can still accumulate through other links & it can still be indexed
Meta robots nofollow the dup = PageRank that accumulates on the dup cannot be passed on

PageRank Leakage Scenarios
XML Sitemaps file only includes the canonical version = only used as a hint, dups may still be indexed
Canonical tag pointing to canonical version on all dups = only used as a hint, dups may still be indexed
301 all dups to the canonical version = removes dups, may have unintended side effects (e.g. breaking your site’s sorting capability)
Conditional 301 = removes dups, high risk

Canonical Tag Has Serious Limitations
It doesn't work cross-domain
– Only within the domain. Cross-subdomain is supported though
– This is by design, to thwart the element's use by spammers
– Thus you can't use it to reduce dup content to typo domains that you own
It's only a hint, not an absolute directive
– Google sometimes chooses not to follow it even though it clearly should
– So it's not nearly as strong of a signal as a 301

Canonical Tag Misfires
NorthernSafety.com
Wikipedia

An Example in the Wild
Many thousands of non-canonical URLs of northernsafety.com are indexed, despite use of the canonical tag
For example, click on the listings on m/products/+inurl:protective-clothing and compare those URLs to what's listed as the canonical URL in the link tag in the HTML source of these pages
Canonical tags have been in place for several months

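The comparison described above can be automated. The following is a rough Python sketch using only the standard library; the class and function names and the sample page are hypothetical, and it assumes you have already fetched the HTML source of an indexed URL.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def declared_canonical(html_source):
    """Return the canonical URL a page declares, or None if it has none."""
    finder = CanonicalFinder()
    finder.feed(html_source)
    return finder.canonical


def is_noncanonical(indexed_url, html_source):
    """True when the page declares a canonical URL other than the indexed one."""
    canonical = declared_canonical(html_source)
    return canonical is not None and canonical != indexed_url


# Hypothetical page source for illustration
page = ('<html><head><link rel="canonical" '
        'href="http://www.example.com/products/gloves" /></head></html>')
print(is_noncanonical("http://www.example.com/products/gloves?sort=price", page))
```

Run over a list of indexed URLs, this gives a quick count of how often the hint is being ignored.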
What To Do?
So if the Canonical Tag can’t (yet) be trusted to work, what to do in addition / instead?
Some scenarios to consider...
– Pagination
– Faceted navigation
– Affiliate or Click-tracked URLs
– Near-duplicates
– Country-specific versions on the same domain
– Manufacturer-supplied product copy

Pagination
Excessive pagination dilutes “crawl equity”, causing numerous pages of product listings to not get crawled
Reduce the number of pages in the pagination system to improve crawlability & indexation
Next/Previous vs. page number list vs. Show All
Consider disallowing “View All” links and forcing spiders through subcat pages (the keyword-rich path)
Display as many products per page as possible (max 120) within 150K file size
Fewer products per subcat = fewer pagination pages to crawl at subcat level for max product indexation
1-3 pages pagination = useful for sending different keyword signals?

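To see how much the products-per-page setting matters, here is a small arithmetic sketch in Python. The function name and the 600-product category are made up for illustration; only the 120-per-page ceiling comes from the slide.

```python
def pagination_pages(product_count, per_page):
    """Number of listing pages a spider must crawl to reach every product."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    # Ceiling division; an empty category still has one listing page
    return max(1, (product_count + per_page - 1) // per_page)


# Hypothetical category of 600 products
print(pagination_pages(600, 20))   # small pages
print(pagination_pages(600, 120))  # the suggested maximum per page
```

In this made-up case, moving from 20 to 120 products per page cuts the listing pages from 30 to 5.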
Faceted Navigation
Faceted navigation, a.k.a. guided navigation, provides clickable product inventory breakdowns by brand, color, price range, etc.
By doing so it creates a huge number of permutations for the spiders to follow
The problem is exacerbated by clickable, resortable column headings
Nofollow all links leading to low (SEO) value facets, e.g. facets that do price range breakdown, re-sorting and re-pagination
Or collapse near-dup facets (canonical tags or revise link URLs)
Optimize URLs, title tags, etc. of high-value facets in an automated, scalable fashion (e.g. using GravityStream)

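A back-of-the-envelope Python sketch of how quickly facet permutations multiply. The facet names and counts are invented, and the model is a simplifying assumption: each facet is either unset or set to exactly one value, and every combination can be re-sorted and paginated.

```python
import math


def crawlable_facet_urls(facet_value_counts, sort_orders=1, pages_per_listing=1):
    """Estimate how many distinct listing URLs a facet system exposes.

    Each facet contributes (number of values + 1) choices, the extra
    one being "facet not applied".
    """
    combos = math.prod(n + 1 for n in facet_value_counts.values())
    return combos * sort_orders * pages_per_listing


# Hypothetical category with three facets
facets = {"brand": 20, "color": 10, "price_range": 5}
print(crawlable_facet_urls(facets))
print(crawlable_facet_urls(facets, sort_orders=8))
```

Under these assumptions, three modest facets give 1,386 URLs, and eight sort orders push that to 11,088 for a single category, which is why the low-value facets are worth nofollowing or collapsing.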
Affiliate URLs
They rarely help your SEO, because they redirect with a 302, not a 301
Run the affiliate program in-house; use 301s and/or “canonical” tags. Don't 301 conditionally. The canonical tag isn't necessary if you are doing a 301
Third-party affiliate solutions (like Commission Junction) have a vested interest in not “playing ball”
– Canonical tag won't help. PageRank is lost at the 302.
Examples of affiliate networks that pass the PageRank to the merchant: LinkConnector, DirectTrack

Click-Tracked URLs
Here’s how to 301 a static URL with a tracking param appended to its canonical equivalent (minus the param)
– RewriteCond %{QUERY_STRING} ^source=[a-z0-9]*$
– RewriteRule ^(.*)$ /$1? [L,R=301]
And for dynamic URLs...
– RewriteCond %{QUERY_STRING} ^(.+)&source=[a-z0-9]+(&?.*)$
– RewriteRule ^(.*)$ /$1?%1%2 [L,R=301]

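For anyone who handles redirects in application code rather than in mod_rewrite, the same idea can be sketched in Python. This is an illustration only: the `strip_tracking` helper and the example.com URLs are hypothetical, and `source` is the tracking param name used in the rules above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_tracking(url, param="source"):
    """Return the canonical URL to 301 to, or None if no redirect is needed."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if k != param]
    if len(kept) == len(pairs):
        return None  # tracking param absent, already canonical
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_tracking("http://www.example.com/widgets.html?source=email"))
print(strip_tracking("http://www.example.com/product.php?id=1001&source=email&color=red"))
```

Unlike the dynamic-URL rule above, this sketch also catches the tracking param when it comes first in the query string.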
Click-Tracked URLs
Need to do some fancy stuff with cookies before 301ing?
Invoke a script that cookies the user then 301s them to the canonical URL.
– RewriteCond %{QUERY_STRING} ^source=([a-z0-9]*)$
– RewriteRule ^(.*)$ /cookiefirst.php?source=%1&dest=$1 [L]
Note the lack of a R=301 flag above. That’s on purpose. No need to expose this script to the user. Use a rewrite and let the script send the 301 after it’s done its work.

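The deck does not show the contents of cookiefirst.php. As a rough idea of what such a script has to do, here is a sketch written as a Python WSGI callable instead of PHP. Everything in it is an assumption: the cookie name, the validation patterns, and the choice to fall back to the home page when `dest` looks unsafe.

```python
import re
from urllib.parse import parse_qs

SAFE_SOURCE = re.compile(r"[a-z0-9]+")        # same character set as the RewriteCond
SAFE_PATH = re.compile(r"[A-Za-z0-9/_.-]*")   # assumed whitelist for local paths


def cookie_first_app(environ, start_response):
    """Set the tracking cookie, then 301 to the canonical URL."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    source = params.get("source", [""])[0]
    dest = params.get("dest", [""])[0]

    # Redirect only to a path on this site, never to another host
    if SAFE_PATH.fullmatch(dest):
        location = "/" + dest.lstrip("/")
    else:
        location = "/"

    headers = [("Location", location), ("Content-Length", "0")]
    if SAFE_SOURCE.fullmatch(source):
        headers.append(("Set-Cookie", "source=" + source + "; Path=/"))
    start_response("301 Moved Permanently", headers)
    return [b""]
```

Validating `dest` matters because the script would otherwise act as an open redirect for any URL passed to it.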
Legacy URLs
Got legacy dynamic URLs you’re trying to phase out after switching to static URLs? 301 them...
– RewriteCond %{QUERY_STRING} id=([0-9]+)
– RewriteRule ^get_product\.php$ /products/%1.html? [L,R=301]
Switching to keyword URLs and the script can’t do anything with the keywords if passed as params? Use RewriteMap and have a lookup table as a text file.
– RewriteMap prodmap txt:/home/someusername/prodmap.txt
– RewriteRule ^/product/([0-9]+)$ ${prodmap:$1} [L,R=301]

Legacy URLs
What would the lookup table for the above rule look like?
– 1001 /products/canon-g10-digital-camera
– 1002 /products/128-gig-ipod-classic
DBM files are supported too. Faster than a text file.
You could use a script that takes the requested input and delivers back its corresponding output.
– RewriteMap prodmap prg:/home/someusername/mapscript.pl
– RewriteRule ^/product/([0-9]+)$ ${prodmap:$1} [L,R=301]

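The slide names a Perl script (mapscript.pl) but does not show it. As an illustration of what a `prg:` map program does, here is a Python sketch: it reads one key per line on standard input and writes one answer per line, unbuffered, replying `NULL` when there is no match. The hard-coded dictionary reuses the two lookup-table rows above; a real script would more likely query a database.

```python
import sys

# The two rows from the lookup table above; a real map would query a data store
PRODUCT_URLS = {
    "1001": "/products/canon-g10-digital-camera",
    "1002": "/products/128-gig-ipod-classic",
}


def lookup(key):
    """Return the keyword URL for a product ID, or NULL when there is no match."""
    return PRODUCT_URLS.get(key.strip(), "NULL")


def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Answer lookups one line at a time until the input is closed."""
    for line in iter(stdin.readline, ""):
        stdout.write(lookup(line) + "\n")
        stdout.flush()  # the server waits on each answer, so never buffer

# When deployed as the map program, end the file with a call to serve()
```

The flush after every line is the important part: a buffered reply leaves the web server waiting on the lookup.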
Other Common Issues
Non-www and typo domains
– (The example mentioned earlier...)
– RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
– RewriteRule ^(.*)$ [L,R=301]
HTTPS
– (If you have a separate secure server, you can skip this first line)
– RewriteCond %{HTTPS} on
– RewriteRule ^catalog/(.*) [L,R=301]

Other Common Issues
If trailing slash is missing, add it
– RewriteRule ^(.*[^/])$ /$1/ [L,R=301]
– WordPress handles this by default. Yay WordPress!

Conditional 301s?
Risky territory! Read Redirects: Good, Bad & Conditional
To selectively redirect bots that request URLs with session IDs to the URL sans session ID:
– RewriteCond %{QUERY_STRING} PHPSESSID
  RewriteCond %{HTTP_USER_AGENT} Googlebot.* [OR]
  RewriteCond %{HTTP_USER_AGENT} ^msnbot.* [OR]
  RewriteCond %{HTTP_USER_AGENT} Slurp [OR]
  RewriteCond %{HTTP_USER_AGENT} Ask\ Jeeves
  RewriteRule ^(.*)$ /$1? [R=301,L]
The trailing ? drops the whole query string, session ID included; without it the query string is carried over and the redirect loops
browscap.ini provides spiders’ user agents

Conditional 301s?
Not necessary. There is almost always another way (w/o using user agent or IP)
In the above example, simply 301 everybody – bots and humans alike – and stop appending PHPSESSID
– See for more on this.
– If you have to keep session IDs for functionality reasons, you could use a script to detect whether the session has expired, and 301 the URL to the canonical equivalent if it has.

Near Duplicates, But Not Quite?
What if you can only optimize one version but not all versions? For example...
– Let's say you have implemented a new URL structure and moved content over to the new URLs. The old URLs still pull up the content too, but the templates are different. The new version has better SEO (title tags are more keyword-rich, there are H1 headings, a couple sentences of intro copy, etc.), but it's the same product information.
According to Matt Cutts, using the canonical tag to canonicalize the non-optimized version to the optimized version is high risk.

Country-specific Versions
Country-specific versions on the same domain?
Create separate "sites" within Google Webmaster Central for each country-specific directory. Then set the Geographical Targeting within each one.
Google doesn't view country-specific versions as duplicate content; Google's smarter than that.

Manufacturer-Supplied Product Copy
Distance yourself from the “thin affiliates”. Augment with substantial amount of unique, valuable content
– Customer reviews - trapped/hidden within JavaScript in third-party reviews services like BazaarVoice and PowerReviews
– Not “mashups” with Wikipedia, Twitter, & the usual suspects
"Uniquify" content. Not sufficient to shuffle the page's content around! Think about overlapping “shingles”
– Scaling? Mechanical Turk, yes. Markov chains, no.
A nail in the coffin: same titles & meta descriptions

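To make the “shingles” point concrete, here is a minimal Python sketch of word-level shingling with a Jaccard overlap score. It illustrates the general technique, not how any search engine actually scores pages; the shingle size and the sample text are arbitrary choices.

```python
def shingles(text, size=4):
    """Return the set of overlapping word sequences of the given length."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def resemblance(a, b, size=4):
    """Jaccard overlap of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Swapping the two halves of a text leaves most shingles intact
original = "one two three four five six seven eight nine ten"
shuffled = "six seven eight nine ten one two three four five"
print(resemblance(original, shuffled, size=3))
```

Only the shingles that straddle the cut point change, so reordered copy still overlaps heavily with the original. Genuinely new sentences are what lower the score.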
Supplemental Hell?
The Supplemental Index still very much exists, and these dups are probably in there.
Does Google leave clues about what it considers to be non-canonical / not favored?
– If only the Supplemental Result label were still supported! *sigh*
– How about spidering activity? PageRank score? Omitted results? Cached date? Cached link missing?

Thanks!
For a free faceted navigation audit, drop me your business card or your request to
To contact me: