InfoWorld
  Friday, July 11, 2003 

SpamBayes update

I've been comparing notes with Tom Yager, who notices lately that spammers' use of nonsense words, especially in Subject: headers, seems to be effective against the Bayesian filter in OS X's Mail.app. I checked, and SpamBayes is (so far) unaffected by this ploy. One of the cool things about SpamBayes is its ability to reveal how it analyzes messages. See below for its take on a message that has the Subject: line "Jon nezinyunyane inflechies" and a bunch of angle-bracketed garbage in the text.

Evidently SpamBayes ignores all the garbage tags, but since these serve as word delimiters, it winds up seeing a bunch of word fragments -- like 'innov' and 'ative' -- which it finds suspicious. And as the spam counts indicate, it has seen these fragments before, so over time their discriminatory power should only grow, not diminish.
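Here is a minimal sketch of that effect. It is not SpamBayes' actual tokenizer: the regular expression and the `fragments` helper are my own assumptions, treating anything in angle brackets as a tag that gets replaced by a space and then splitting on whitespace.

```python
import re

# Assumption: any <...> run counts as a tag and is replaced by a space,
# so it acts as a word delimiter. This is an illustration only, not the
# real SpamBayes tokenizer.
TAG = re.compile(r"<[^>]*>")

def fragments(html):
    """Strip angle-bracketed runs, lowercase, and split on whitespace."""
    text = TAG.sub(" ", html)
    return text.lower().split()

line = "<KvynRHwW>innov<FkEhUrwj>ative product in Ame<SatnmBdYiP>rica today."
print(fragments(line))
```

Run on that line from the message, the sketch produces 'innov', 'ative', 'ame' and 'rica' as separate tokens, the same fragments that show up in the clue list below.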

It's also fascinating to look at the handling of the giveaway phrase "Multi-Trillion Dollar Market." "Multi-Trillion" does not appear on SpamBayes' list of interesting tokens, though "dollar" and "market" do. That SpamBayes makes no effort to correlate these adjacent words seems like an obvious limitation, and yet it is (so far) continuing to perform spectacularly well for me despite that.
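To make the distinction concrete, here is a small sketch contrasting single-word tokens with adjacent-word pairs. Both helper names are hypothetical, and the pairing scheme is only an illustration of what correlating neighbors could look like, not something SpamBayes does in the output shown here.

```python
def unigrams(text):
    """One token per whitespace-separated word, lowercased."""
    return text.lower().split()

def bigrams(text):
    """Hypothetical alternative: one token per pair of adjacent words."""
    words = unigrams(text)
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

phrase = "Join a Multi-Trillion Dollar Market."
print(unigrams(phrase))
print(bigrams(phrase))
```

With single words, 'dollar' and 'market.' are scored independently. A pair-based scheme would instead see 'multi-trillion dollar' and 'dollar market.' as tokens in their own right, at the cost of a much larger vocabulary to train.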

Update: As I keep forgetting for some reason, and as Giorgio Valoti reminds me, Mail.app uses latent semantic analysis, not the Bayesian technique.

Spam Score: 1

word                                spamprob         #ham  #spam
'*H*'                               0                   -      -
'*S*'                               1                   -      -
'jon-'                              0.0313807          38      1
'from:addr:jonathan'                0.06584             3      0
'noheader:mime-version'             0.267816         3682   1332
'there'                             0.357648         1865   1027
'web'                               0.359379         1678    931
'noheader:reply-to'                 0.398404         8311   5444
'reply-to:none'                     0.398404         8311   5444
'your'                              0.607781         3493   5354
'now'                               0.609287         1198   1848
'header:Date:1'                     0.614892         5565   8789
'header:From:1'                     0.616075         5536   8787
'live'                              0.617098          227    362
'subject:Jon'                       0.628519          123    206
"you've"                            0.635875          294    508
'potential'                         0.637791          171    298
'header:Received:6'                 0.639839          738   1297
'url:com'                           0.643368         3651   6515
'must'                              0.651722          330    611
'year.'                             0.654359          135    253
'proto:http'                        0.657505         4086   7759
'area'                              0.657698          183    348
'break'                             0.668753           75    150
'skip:m 10'                         0.671444          690   1395
'header:Return-Path:1'              0.698156         3807   8710
'join'                              0.72811           152    403
'market.'                           0.736942           76    211
'serious'                           0.779225           64    224
'header:Message-Id:1'               0.782263         1220   4336
'sell'                              0.789965           88    328
'life'                              0.850081          110    618
'walls'                             0.86821            15     99
'url:htm'                           0.870883          144    962
'to:addr:jon'                       0.871895           87    587
'url:index'                         0.877064          152   1074
'dollar'                            0.917399           19    211
'url:173'                           0.934783            0      3
'unique,'                           0.949503            5     97
'subject:\xe9'                      0.95032             1     23
'pac'                               0.96723             3     94
'independence'                      0.969427            3    101
'earn'                              0.969434           11    352
'000'                               0.971088            3    107
'url:61'                            0.974407            1     46
'url:133'                           0.97619             0      9
'margin'                            0.977151            2     94
'$100,'                             0.977616            2     96
'url:41'                            0.981928            0     12
'ing'                               0.982844            2    126
'infor'                             0.987666            1     97
'ailability.'                       0.994822            0     43
'rofit'                             0.994822            0     43
'ative'                             0.994938            0     44
'innov'                             0.994938            0     44
'aking'                             0.99505             0     45
'busin'                             0.99505             0     45
'ited'                              0.99505             0     45
'pture'                             0.99505             0     45
'azing.'                            0.995156            0     46
'lim'                               0.995156            0     46
'ame'                               0.99545             0     49
'che'                               0.99545             0     49
'message-id:@vampiress.zzn.com'     0.995627            0     51
'from:addr:vampiress.zzn.com'       0.99579             0     53
'rica'                              0.996014            0     56
'ess'                               0.996937            0     73
'amearn'                            0.997592            0     93
'amed'                              0.997592            0     93
'fina'                              0.997592            0     93
'kage.'                             0.997592            0     93
'ncial'                             0.997592            0     93
'ney!'                              0.997592            0     93
'dre'                               0.997667            0     96
'mation'                            0.997738            0     99
'to:name:jon'                       0.998132            0    120
'to:addr:songline.com'              0.998351            0    136
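
A note on the numbers above: the rows with zero ham sightings can be reproduced with a Robinson-style adjustment, (s*x + n*p) / (s + n). The sketch below assumes a strength s of 0.45 and a prior x of 0.5, which I believe are the SpamBayes defaults; the function name is my own. Rows with ham sightings also depend on the total number of trained ham and spam messages, which this dump doesn't show, so I haven't tried to reproduce those.

```python
# Assumed settings: strength s = 0.45 and prior x = 0.5.
S = 0.45
X = 0.5

def zero_ham_spamprob(spam_count, s=S, x=X):
    """Adjusted spam probability for a token never seen in ham.

    With no ham sightings the raw ratio p is 1.0, so the training-set
    totals cancel out and only the spam count n matters.
    """
    p = 1.0
    n = spam_count
    return (s * x + n * p) / (s + n)

for token, spam in [("url:173", 3), ("innov", 44), ("to:name:jon", 120)]:
    print(f"{token!r:16} {zero_ham_spamprob(spam):.6f}")
```

Under those assumptions the three rows come out as 0.934783, 0.994938 and 0.998132, matching the table. The score climbs toward 1 as the spam count grows, which is why repeated sightings of a fragment only strengthen it as evidence.
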

Message Stream:

Date: Thu, 10 Jul 2003 21:08:32 -0800
Subject: Jon nezinyunyane inflechies
X-Sender: Cole Nickol <jonathan@vampiress.zzn.com>
<ioIVkQkR><OlpyCsaP><MSvpk>
<dHEnke><LmKTnRTowC><rrRsACDfT>
<oxwCx><fRmtryOnGq><UUkBB>
<XCKfvJ><cNPVNIa><qOaXN>
<pJIdrYdnW><ipvub><TigRpgXSHL>
<jOMnGnk><fAdDuDwhKg><vtWrFAnErq>
<kSYDLiOeO><PMcHf><lmkJTSd>
<JRFwXqJrU><aFFNaxP><PcdrQ>
<interesses><nenahospodari><interesses>
<p><font face="Trebuchet MS">Jon-</font></P>
<p></p>
<p> <nenahospodari><interesses></p>
<p><font face="Trebuchet MS">Ca<NLsQcKSV>pture Your Dre<bSVhiD>amEarn
Fina<interesses>ncial Independence</font></p>
<p><font face="Trebuchet MS">You can now for the first
time,<nenahospodari>
<BJxQNXNYyQ>own a busin<Pqktbw>ess in your area with the most unique,
<KvynRHwW>innov<FkEhUrwj>ative product in Ame<SatnmBdYiP>rica today. Work
le<mQvuxWo>ss a week with the potential to earn
$100,<interesses>000 a year. There is no sell<QgBARjeQ>ing and not
ML<gtdyhs>M. Join a Multi-Trillion Dollar Market.</font></p>
<p><font face="Trebuchet MS">The p<KChHi>rofit margin is
am<SIGOVYIdkA>azing.<nenahospodari></font></p>
<p><font face="Trebuchet MS"><nenahospodari>Break down the walls and live
this life you've only dre<interesses>amed about.</font></p>
<p><font face="Trebuchet MS">Lim<QkWkUgv>ited av<UPeOWb>ailability.
for Your Fr<interesses>ee infor<RCbUkNPVg>mation
pac<afkLWbCAa>kage.</font></p>
<p><font face="Trebuchet MS"><oCgNBti><a
href="http://61.173.41.133/3a22147895.com/index.htm">-web-site-</a></font>
</p>
<p><nenahospodari></p>
<p><font face="Trebuchet MS">Y<rcMWMN>ou must che<bkvKm>ck this out if you
are serious about m<TQYVM>aking mo<interesses>ney!</font></p>
<nenahospodari>
<p></p>
<p></p>
<p><font face="Trebuchet MS">O<interesses>pt o<yxXafJRXuQ>ut at
web<interesses>site<interesses></font></p>

 

The starship and the canoe

I am unfortunately not able to be at this year's open source conference, but I've been reading about it at the O'Reilly Network site and also on Phil Windley's site where he is, as has become his habit, doing a spectacular job of blogging the show.

Today Phil writes about a talk by George Dyson featuring materials from the archives of the Institute for Advanced Study. I'd love to have been there!

If you've never read Kenneth Brower's The Starship and the Canoe, by the way, it's a real treat. First published 25 years ago (!), it's a father/son biography of Freeman Dyson and George Dyson. The "starship" in the title refers to Project Orion, an effort by Freeman Dyson and Ted Taylor to produce a nuclear-bomb-propelled interplanetary vessel. The "canoe" refers to a seagoing kayak that George, then a tree-dwelling rebel living in British Columbia, was building. I wonder how the Dysons felt about the book then, and I wonder how they feel about it now.

 

ACM Queue

Tim Bray's videoblogging experiment today points to a wonderful Conversation between Dave Patterson and Jim Gray. Tim writes:

"Patterson and Gray are both pretty famous in our profession, but neither is as famous as he deserves to be."

I'll second that. I once got Jim (along with Jeri Edwards) to write an article for BYTE that's still a great introduction to the subject of TP monitors.

The interview Tim points to is featured in a magazine I'd not heard of before, ACM Queue, which seems to have started quite recently (March 2003). The articles appear online, but I couldn't find an RSS feed, so I made one. Ah. Life is good!

 

