CHAD DICKERSON: CTO CONNECTION


October 01, 2004

The essential difficulty of canonical URLs

Earlier this week, Jon Udell posted "Canonical URLs and network effects." Jon referred to a discussion we had when he brought up a URL bifurcation in some InfoWorld content:

The InfoWorld review originally published at this URL now also appears in the new product guide at this URL. When I mentioned this to Chad Dickerson, he pointed out that even if InfoWorld.com were to enforce a canonical-URL policy internally, our stuff is syndicated out to places we don't control. So for example, the InfoWorld column at this URL also shows up as the Computerworld story at this URL.

And:

When a piece of content is syndicated into a new context, there's no reason why that new context shouldn't have its own canonical URL -- particularly if it adds value in the form of direct or indirect commentary.

As I thought about this issue over the past few days, I realized that a key point was being left out of the equation and it's this: not all original content sources are URL-addressable! In fact, I would go so far as to say there is A LOT of content that falls into this category: think content originating from AP, Reuters, and their ilk. It doesn't take long to find examples. This story on CNN.com is in fact an AP story, but you can find it on Yahoo! Asia as well (just find any CNN.com story with ".ap." in the URL, and search for a key phrase from that story at Yahoo! News, and you'll find endless examples). If you're a blogger and want to point to either of these stories, which one is canonical? Who is "closest to the source"? The answer is "neither."

CNN and Yahoo! presumably have licensing deals with the AP to reproduce these stories in whatever form they come over the wire. This is a very deliberate way of doing business for the AP (which is actually a not-for-profit cooperative). In fact, if you go to the AP home page and click on any story under the "AP News on Media Sites Worldwide" heading, the site will randomly redirect you to a page with member newspapers' branding to get your story (in three tries, I got the Tullahoma News, the San Angelo Standard Times, and fredericksburg.com). The AP presumably doesn't want to serve as the canonical home of the content it produces, because that's not how the AP works as a co-op, and despite the occasional breathlessness of folks like myself claiming that the Internet changes everything, it seems unlikely that the AP will ever become the canonical web site for the news it produces.

Before I go further, I should provide a little background on my experience with news wire content. When I first started working at a big news site several years ago, one of the first development tasks I tackled was writing mountains of Perl code to parse incoming news wire feeds. I eventually wrote most of the feed parsing code for the big sports site that I moved on to. The feeds were literally coming in over a specialized 9600 baud modem hooked into the serial port of a Sun workstation (they probably still are) via a leased line from New Jersey to Atlanta. My job was to convert this content into some reasonable URL-addressable HTML-ized content for the site. This still being the early days of the web, naming conventions were wide open, so my development tasks included coming up with a reasonable URL structure that would match how the feeds came in. For example, if an update to an existing wire story comes in, do I overwrite the existing story at the same URL, or is it a separate story? The environment I was working in was moving quickly enough that I sometimes had to make editorial judgments with my code. My spec for parsing these news feeds was the ANPA (American Newspaper Publishers Association) Wire Service Transmission Guidelines, which were last updated in 1989. I was working with stuff like this:

Please note that the Associated Press text delimits paragraphs with CR (Carriage Return), LF (Line Feed), HT (Horizontal Tab), followed by 3 SPs (space codes). UPI uses QL (Quad Left), CR, LF, HT, followed by 3 SPs.

Drawing from the wire-parsing experience, I'll illustrate examples of non-URL-addressable content by pointing to the content produced by Sportsticker. If you're a sports fan, you have probably consumed way more Sportsticker content than you realize. Back when I was working in that world, Sportsticker data fed every scoreboard you could imagine: the "crawler" at the bottom of the screen when you watch ESPN, the scoreboards in baseball parks, the scores on CNN Headline News, the scores you hear from radio announcers, and the scores on most web sites. The reach of Sportsticker became clear to me one day when there was a temporary problem with our feed and I noticed that every sports information outlet I checked manifested the problem I was seeing (and yes, the thought that there was essentially a single point of failure for something as critical as sports scores was indeed chilling! Since then, STATS Inc. seems to have grown more popular, but I know of no sports data services other than those two.)
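To make the ANPA paragraph rule quoted above concrete, here is a minimal Python sketch of splitting AP body text into paragraphs and wrapping them in HTML. It is not the Perl code described in this post; the function names and the sample text are made up for illustration, and it assumes the body has already been decoded into a string. The UPI variant is left out because the byte value of the QL code is not given here.

```python
import html

# Assumption: the AP body is already decoded into a str, and every paragraph
# is introduced by CR, LF, HT and three spaces, as in the guideline quoted above.
AP_PARAGRAPH_DELIMITER = "\r\n\t   "


def split_ap_paragraphs(body: str) -> list[str]:
    """Split AP wire text into paragraphs on the CR LF HT + 3 SP delimiter."""
    pieces = body.split(AP_PARAGRAPH_DELIMITER)
    # Text before the first delimiter may be empty, so drop blank pieces.
    return [piece.strip() for piece in pieces if piece.strip()]


def paragraphs_to_html(paragraphs: list[str]) -> str:
    """Wrap each paragraph in <p> tags, escaping characters HTML treats specially."""
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)


# Hypothetical sample text, not real wire copy.
sample = "\r\n\t   Eagles sign a kicker.\r\n\t   Terms were not disclosed."
print(paragraphs_to_html(split_ap_paragraphs(sample)))
```

Splitting on the exact delimiter, rather than on any run of whitespace, keeps ordinary line breaks inside a paragraph from being mistaken for paragraph boundaries.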

OK, finally to my point about non-URL-addressable content. The Sportsticker service (like all the wire services I dealt with) relied on what is known as a "slug." The Sportsticker spec defines it as such:

The SLUG provides for a maximum of 35 upper and lower case characters to identify the contents of a message. There are three different types of SLUG coding corresponding to text messages, non-text messages, and system messages. Text SLUG’s are in uppercase, while non-text and system SLUG’s are in lower case. The "bc-" in the first three positions (1-3) identify the publishing cycle as both AM and PM. The remaining positions describe the contents or the message, and differ for text and non-text messages.

The slug (combined with the date of transmission if I remember correctly) is the unique identifier for a particular story coming over the wire. In that sense, it functions as a URL of sorts, but in an entirely different namespace -- the namespace of the stream of data coming over that 9600 baud modem. Rather than point to the spec, it's easier to explain what the slug format looks like using an example from the NFL. Note that we would receive "advisories" for sports like the NFL with lists of slugs for particular sports just before the season started.

BC-FBP-STAT-AFCINTS-R
(FBP means "pro football/NFL", STAT means "statistics" -- which meant tabular data to be wrapped in <PRE> tags at the time, and AFCINTS meant "current AFC interception leaders," something that was communicated to us in the NFL advisory)

BC-FBP-LGNS-PHILADELPHROS-R
(similar to above, but LGNS meant "league news," and "PHILADELPHROS" meant "Philadelphia Eagles roster")
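As a rough illustration of how the two example slugs break down, the Python sketch below splits a text slug into its hyphen-separated fields. The five-field layout is an assumption inferred only from the two examples here, not taken from the Sportsticker spec, and the field names (`cycle`, `sport`, `category`, `keyword`, `suffix`) are invented. The meaning of the trailing "R" is not explained in this post, so the code simply carries it along.

```python
from dataclasses import dataclass

MAX_SLUG_LENGTH = 35  # per the spec excerpt quoted above

# Only the codes mentioned in this post; a real table would come from the advisories.
SPORT_CODES = {"FBP": "pro football/NFL"}
CATEGORY_CODES = {"STAT": "statistics", "LGNS": "league news"}


@dataclass(frozen=True)
class Slug:
    cycle: str     # "BC" marks the publishing cycle as both AM and PM
    sport: str     # e.g. "FBP"
    category: str  # e.g. "STAT" or "LGNS"
    keyword: str   # e.g. "AFCINTS", defined in the pre-season advisory
    suffix: str    # trailing field such as "R"; meaning not given in this post


def parse_slug(raw: str) -> Slug:
    """Split a text slug into fields, assuming exactly five hyphen-separated parts."""
    if len(raw) > MAX_SLUG_LENGTH:
        raise ValueError(f"slug longer than {MAX_SLUG_LENGTH} characters: {raw!r}")
    parts = raw.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 hyphen-separated fields, got {len(parts)}: {raw!r}")
    return Slug(*parts)


def describe(slug: Slug) -> str:
    """Render a human-readable label, falling back to the raw code when unknown."""
    sport = SPORT_CODES.get(slug.sport, slug.sport)
    category = CATEGORY_CODES.get(slug.category, slug.category)
    return f"{sport} / {category} / {slug.keyword}"


print(describe(parse_slug("BC-FBP-STAT-AFCINTS-R")))
```

Keeping the keyword as an opaque string reflects how the post describes it: its meaning came from the advisory, not from anything a parser could derive.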

When I was immersed in wire parsing, we made the editorial decision to produce URLs that were easy for our users to comprehend, so in the Philadelphia Eagles roster example above, we would produce a URL that looked something like this: /football/nfl/teams/rosters/philadelphia.html. That way, a user who was on this particular page could easily change "philadelphia.html" to "chicago.html" if that user was interested in the Bears. This created a certain amount of overhead in writing our code. Some of our competitors didn't have time for such URL pleasantries, so they pumped out the same content with URLs that were closer to this: /football/nfl/bc-fbp-lgns-philadelphros.html. Not as pretty, but it was the minimum required to prevent namespace collisions when converting the wire slugs to URL-addressable content, and it worked.
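The two URL strategies described above might be sketched in Python as follows. This is a hypothetical reconstruction, not the code either site actually ran: the lookup tables, function names, and the rule of dropping the final slug field are assumptions chosen so that the output matches the two example URLs in the previous paragraph.

```python
# Assumed mapping from sport code to site directory (only FBP appears in this post).
SPORT_DIRECTORIES = {"FBP": "/football/nfl"}

# Hypothetical hand-maintained table: roster keywords from the advisory -> friendly names.
ROSTER_KEYWORDS = {"PHILADELPHROS": "philadelphia"}


def slug_url(slug: str) -> str:
    """Quick approach: lowercase the slug, drop its last field, use it as the file name."""
    parts = slug.split("-")
    base = SPORT_DIRECTORIES[parts[1]]
    stem = "-".join(parts[:-1]).lower()
    return f"{base}/{stem}.html"


def friendly_url(slug: str) -> str:
    """Editorial approach: map known slugs to readable paths, else fall back to slug_url."""
    parts = slug.split("-")
    base = SPORT_DIRECTORIES[parts[1]]
    category, keyword = parts[2], parts[3]
    if category == "LGNS" and keyword in ROSTER_KEYWORDS:
        return f"{base}/teams/rosters/{ROSTER_KEYWORDS[keyword]}.html"
    # Unknown slugs still get a collision-free URL instead of being dropped.
    return slug_url(slug)


print(slug_url("BC-FBP-LGNS-PHILADELPHROS-R"))
print(friendly_url("BC-FBP-LGNS-PHILADELPHROS-R"))
```

The overhead mentioned above shows up in the lookup table: every friendly path needs an entry someone has to maintain, while the slug-based URL needs none because the slug is already unique within the feed.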

So, given the reality of how the wire services work with their customers, what can we do about the problem of canonical URLs that Jon Udell discussed in his blog post? With intentionally non-URL-addressable content still bubbling up to the Internet over "old school" wire services with their own protocols and conventions and with hundreds and thousands of news organizations making separate decisions on how to present that content on their sites, I'm not sure it's an easy task.

Posted by Chad Dickerson at October 1, 2004 10:53 AM


