InfoWorld
  Tuesday, January 07, 2003 

Where angels fear to tread

Roger Costello's excellent XML Schema Tutorial includes a detailed breakdown of the ISBN. I've excerpted the documentation (along with Roger's GPL) here. The example also includes a complete ISBN schema, which involves a huge pile of regular expressions. The hyphens, which most book-related Web services ignore, are meant to carve up the address space in a very TCP/IP-like way:


The format of an ISBN is:
1 -- it is always 10 characters long
2 -- it's broken into 4 parts, and these four parts   
     always appear separated with hyphens or spaces.
3 -- the four parts are:
     - group/country identifier
     - publisher identifier
     - number assigned to a specific title in one format 
       (formally called the title identifier)
     - a check digit
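
To make the four-part layout concrete, here is a minimal Python sketch that splits a hyphenated or space-separated ISBN into its parts and verifies the check digit. The function names are my own, not anything from Roger's tutorial. The check-digit rule (weights 10 down to 1, sum divisible by 11, with 'X' standing for 10) is the standard ISBN-10 rule and is not spelled out in the excerpt above.

```python
def split_isbn10(isbn):
    """Split a hyphen- or space-separated ISBN-10 into its four parts.

    Hypothetical helper: returns (group, publisher, title, check).
    """
    parts = isbn.replace(" ", "-").split("-")
    if len(parts) != 4:
        raise ValueError("expected four parts separated by hyphens or spaces")
    if sum(len(p) for p in parts) != 10:
        raise ValueError("expected 10 characters in total")
    return tuple(parts)


def check_digit_ok(isbn):
    """Return True if the ISBN-10 check digit is consistent.

    Standard rule (an addition, not from the excerpt): weight the ten
    characters 10, 9, ..., 1 and require the sum to be divisible by 11.
    """
    chars = [c for c in isbn if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for weight, c in zip(range(10, 0, -1), chars):
        if c in "xX":
            if weight != 1:  # 'X' is only legal as the final check digit
                return False
            value = 10
        elif c.isdigit():
            value = int(c)
        else:
            return False
        total += weight * value
    return total % 11 == 0


print(split_isbn10("0-306-40615-2"))
print(check_digit_ok("0-306-40615-2"))
```

Note that the split only works when the separators are present, which is exactly the information most book sites throw away.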

For English-speaking countries, part one is 0 or 1, and the publisher id is variable like so:


Country Publisher ID  If number ranges Insert hyphen Block Size
                      are between:     after the: 
----------------------------------------------------------------
0       00.......19         00-19      3rd digit     1,000,000 
0       200......699        20-69      4th digit     100,000 
0       7000.....8499       70-84      5th digit     10,000 
0       85000....89999      85-89      6th digit     1,000 
0       900000...949999     90-94      7th digit     100 
0       9500000..9999999    95-99      8th digit     10 
1       55000....86979      5500-8697  6th digit     1,000 
1       869800...998999     8698-9989  7th digit     100
1       9990000..9999999    9990-9999  8th digit     10 
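
The table above is enough to put the hyphens back into a bare 10-digit ISBN, at least for the ranges it lists. A rough sketch, assuming only the rows shown here (the names RANGES, PREFIX_LEN, hyphenate, and block_size are mine, and anything outside these rows is rejected rather than guessed):

```python
# Rows from the table: (low, high, publisher ID length), keyed by group digit.
# Group 0 is classified by the first two digits after the group, group 1 by
# the first four.
RANGES = {
    "0": [(0, 19, 2), (20, 69, 3), (70, 84, 4),
          (85, 89, 5), (90, 94, 6), (95, 99, 7)],
    "1": [(5500, 8697, 5), (8698, 9989, 6), (9990, 9999, 7)],
}
PREFIX_LEN = {"0": 2, "1": 4}


def hyphenate(isbn):
    """Re-insert hyphens into a 10-character ISBN using the table's rows."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 10:
        raise ValueError("expected 10 characters")
    group = digits[0]
    if group not in RANGES:
        raise ValueError("group not covered by this table")
    prefix = int(digits[1:1 + PREFIX_LEN[group]])
    for low, high, pub_len in RANGES[group]:
        if low <= prefix <= high:
            end = 1 + pub_len  # index just past the publisher ID
            return "-".join([group, digits[1:end], digits[end:9], digits[9]])
    raise ValueError("publisher range not covered by this table")


def block_size(pub_len):
    """Titles available to one publisher: 10 ** (number of title digits)."""
    return 10 ** (8 - pub_len)


print(hyphenate("0306406152"))
print(block_size(3))
```

The block size column falls out of the arithmetic: one group digit and one check digit leave eight digits to share between publisher and title, so a shorter publisher ID means a bigger block.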

Costello's complete ISBN schema runs to about 180K, all stuff like this:


<xsd:pattern value="951\s\d([0-9]|\s){5}\d\s[0-9x]">
    <xsd:annotation>
        <xsd:documentation>
            group/country ID = 951 (space after the 3rd digit)
            Country = Finland
            check digit is 0-9 or 'x'
        </xsd:documentation>
    </xsd:annotation>
</xsd:pattern>
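
To get a feel for what one of these patterns accepts, here is the Finnish pattern carried over to Python's re module. This is only an approximation: XML Schema patterns are implicitly anchored to the whole value, so the sketch uses re.fullmatch, and the test strings are made up to exercise the pattern, not real ISBNs.

```python
import re

# The pattern value from the schema excerpt, copied verbatim.
FINLAND_SPACES = re.compile(r"951\s\d([0-9]|\s){5}\d\s[0-9x]")


def matches_finland(isbn):
    """True if the whole string matches the pattern (XSD-style anchoring)."""
    return FINLAND_SPACES.fullmatch(isbn) is not None


print(matches_finland("951 0 18435 7"))
print(matches_finland("952 0 18435 7"))
```

As written, the pattern allows only a lowercase 'x' as the check digit, and it does not pin down where the space between publisher and title falls, only that the five middle characters are digits or whitespace.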

Fascinating, but formidable. The inventors of this scheme must have been chagrined to see Amazon and the rest of the book sites discard this carefully designed information architecture. Can't blame them, though. 180K of regular expressions is a lot of overhead. And even if the hyphens were preserved, there would still be a big problem: a fragmented address space in need of some means of coalescence.

Lorcan Dempsey, who is VP for research at OCLC, wrote to let me know that there is an initiative to achieve that coalescence. From the abstract:

OCLC is investigating how best to implement IFLA's Functional Requirements for Bibliographic Records (FRBR). As part of that work, we have undertaken a series of experiments with algorithms to group existing bibliographic records into works and expressions. Working with both subsets of records and the whole WorldCat database, the algorithm we developed achieved reasonable success identifying all manifestations of a work.

Cool!

 

