News about Google News
InfoWorld's online folks have long complained about the absence of
InfoWorld news stories from Google News. Here is the most striking
illustration of the problem:
To summarize:
- Lots of www.infoworld.com articles are in the main Google index.
- But no www.infoworld.com articles are in the Google News index.
- However, some weblog.infoworld.com articles are in the Google News index.
From these observations I concluded that, for whatever reason, Google
News must have decided that www.infoworld.com is not a news source (although
weblog.infoworld.com is). Which raised the question: How exactly
does Google News decide what qualifies as a news source?
When I asked that question of Google spokesperson Megan Lamb, she
offered the following guidelines which, though evidently not
published anywhere, I am reporting here with her permission:
Google strives to be as inclusive as possible, without regard to political
viewpoint or ideology, while also providing a high quality experience for
our users. Some of the things we look for when evaluating news
organizations include:
- The source offers information that is updated regularly.
- It is managed by an organization (not an individual) and includes
organizational information on its site.
- The source does not include hate speech or pornography.
- The source does not allow open posting of content without editorial
review.
- The source's website is technically conducive to inclusion.
Google News has assured InfoWorld that it does, of course, meet these
criteria, and does qualify as a news source. So apparently I was
wrong to conclude that www.infoworld.com was deliberately excluded
from the club. But what could the problem be, then?
According to Google News product manager Nathan Stoll, the omission is
a technical problem rather than an editorial one. The Google News
crawler, he says, is a very different beast from the regular
Google crawler. And while the regular crawler happily includes
our stuff, the news crawler -- for reasons as yet undetermined --
doesn't.
I was surprised to learn this because I've only ever been aware of
three user-agent strings (i.e., crawler signatures) broadcast by
Google bots:
- GoogleBot (for the main index)
- GoogleBot-Image (for images)
- Feedfetcher-Google (for RSS feeds)
There's no separate signature for the news crawler. It identifies
itself as GoogleBot too. Given that the main crawler and the news
crawler use different algorithms for site traversal and page analysis,
according to Stoll, I'd expect them to identify themselves
differently. But perhaps for historical reasons, they don't.
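To make the point concrete, here is a minimal sketch of how a site might sort crawler hits by the three signatures listed above, assuming simple case-insensitive substring matching on the user-agent string. The function names, category labels, and sample strings are illustrative assumptions, not anything Google publishes. Note that nothing in the signature separates news crawls from main-index crawls.

```python
from collections import Counter

# Signature substrings named in the post, paired with hypothetical labels.
# Order matters: "googlebot-image" must be tested before the shorter
# "googlebot", which it contains.
SIGNATURES = [
    ("googlebot-image", "images"),
    ("feedfetcher-google", "feeds"),
    # The news crawler reportedly identifies itself as GoogleBot too,
    # so main-index and news crawls land in the same bucket.
    ("googlebot", "main-or-news"),
]


def classify_user_agent(user_agent):
    """Return a label for a user-agent string, or "other" if none matches."""
    ua = user_agent.lower()
    for needle, label in SIGNATURES:
        if needle in ua:
            return label
    return "other"


def tally(user_agents):
    """Count how many user-agent strings fall under each label."""
    return Counter(classify_user_agent(ua) for ua in user_agents)


# Illustrative sample strings (assumed, not taken from InfoWorld's logs).
sample = [
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "Googlebot-Image/1.0",
    "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
]

counts = tally(sample)
for label in sorted(counts):
    print(label, counts[label])
```

Running a tally like this over a server log would show whether anything calling itself GoogleBot is fetching news pages at all, but it could not say which of the two crawlers made a given request.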
As of today, InfoWorld's problem remains unresolved and is
still under investigation. Arguably it is a conflict of interest
for me to write about this matter, given that its resolution is in
InfoWorld's financial interest and therefore indirectly mine.
But the facts that have emerged about the editorial policy and
technical nature of Google News seem, well, newsworthy. And since I
am evidently a news source I thought I should pass them along.