Free Newsletters
InfoWorld Daily

InfoWorld
Log-in | Register

  Thursday, January 29, 2004 

Structured search, phase two

The next phase of my structured search project is coming to life. For the new version I'm parsing all 200+ of the RSS feeds to which I subscribe, XHTML-izing the content, storing it in Berkeley DB XML, and exposing it to the same kinds of searches I've been applying to my own content. Here's a taste of the kinds of queries that are now possible:

links from Tim Bray

links from Brent Simmons to InfoWorld.com

books mentioned by AKMA

books, with XQuery in the title, mentioned by Michael Rys

The paint's not dry on this thing yet. I have yet to normalize the dates, and I'm still getting the hang of DB XML, but here are some things that become immediately obvious:

  • Feeds that deliver only partial content are at a disadvantage.

  • HTML Tidy is able to coerce a surprisingly large number of the feeds I take from HTML to XHTML.

  • Once coerced, they're addressable in terms of the elements you find in HTML: links, images, tables, quotes.

Until now, I've thought the major roadblock standing in the way of more richly structured content was the lack of easy-to-use XML writing tools. But maybe I've been wrong about that. If it's going to be practical to XHTML-ize what current HTML writing tools produce, maybe we can make a whole lot more progress than I thought by working toward CSS styling conventions that will also provide hooks for more powerful searching.

At the very least, this will be a nice laboratory in which to experiment with a growing pool of XML content, using a variety of XML-capable databases. My hope, of course, is to offer a service that's as useful to you -- the writers of the blogs I'm reading, aggregating and searching -- as it is to me. And ideally, useful to you in ways that invite you to think about how to make what you write even more useful to all of us. We'll see how it goes.

 


Recent Entries


















































Sponsored Technology Links

 
 
 HOME  NEWS  BLOGS  PODCASTS  VIDEOS  TECHNOLOGIES  TEST CENTER  EVENTS  CAREERS   About | Advertise | Awards | RSS | Contact Us 

Copyright © 2008, Reprints, Permissions, Licensing, IDG Network, Privacy Policy, Terms of Service.
All Rights reserved. InfoWorld is a leading publisher of technology information and product reviews on topics including viruses,
phishing, worms, firewalls, security, servers, storage, networking, wireless, databases, and web services.

CIO :: ComputerWorld :: CSO :: Demo :: GamePro :: Games.net :: IDG Connect :: IDG World Expo
Industry Standard :: IT World :: JavaWorld :: LinuxWorld :: MacUser :: Macworld :: Network World :: PC World :: Playlist