Tag Archives: search engines

Google and Canonical

I’ve found some more Google weirdness … this time related to how it handles declared ‘canonical’ URL’s.

A canonical URL is a meta tag that you put in a web page that says “This is the correct URL to this particular page”.

Here’s an example of the canonical tag …

<head>
   <link rel="canonical" href="https://example.com/page.php" />
</head> 

However, even when you declare a canonical URL, Google sometimes decides that there is a better URL.

In some cases it’s to another page on your site…

In this case, both pages are pretty similar (actually, they are identical) … but they are two distinct pages and both have their own canonical URL declared.

I noticed at least one case where the canonical URL that Google selected wasn’t even on my site.

Granted, this is the same content … but it’s should not be considered the canonical version of a page on my site.

Unfortunately I really don’t know how to resolve this issue … as Google doesn’t respond to webmaster raised issues related to their search engine functionality.

whitehouse.gov/robots.txt

Jason Kottke pointed out that the whitehouse.gov robots.txt file was changed almost immediately after the inauguration …

It went from this …

User-agent: *
Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /omb/search
Disallow: /omb/query.html
Disallow: /expectmore/search
Disallow: /expectmore/query.html
Disallow: /results/search
Disallow: /results/query.html
Disallow: /earmarks/search
Disallow: /earmarks/query.html
Disallow: /help
Disallow: /360pics/text
Disallow: /911/911day/text
Disallow: /911/heroes/text

… to this …

User-agent: *
Disallow: /includes/

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

One small step towards the new attitude of openness and transparency.