fighting spam paper & links / naive bayes / anybody ?

[1/16] from: jjmmes::yahoo::es at: 25-Aug-2002 18:07

You might find this link interesting: http://www.paulgraham.com/spam.html

It describes filtering spam with naive Bayes and also has some links at the bottom of the page related to fighting spam. Is anybody interested or working on naive bayes algorithms ?

Thanks

[2/16] from: greggirwin:mindspring at: 25-Aug-2002 14:45

Hi Jose, << Is anybody interested or working on naive bayes algorithms ? >> Someone posted that on an IOS server and I think Brett Handley has taken an initial crack at it. Maybe he'll jump in here. --Gregg

[3/16] from: brett:codeconscious at: 26-Aug-2002 11:30

> Someone posted that on an IOS server and I think Brett Handley has taken an
> initial crack at it. Maybe he'll jump in here.

*jump* *stumble* *ahem*

Yes, I had a go. I naively tried to translate the LISP code (without knowing LISP) from the Paul Graham article into some REBOL code and just put it together into something that ran. I did not give a lot of thought to making a nicely structured and fast solution - I just wanted to understand what was going on.

I ran it on a small set of spam and good emails - and it worked beautifully until I realised that my logic was different to Paul's. :^) Then I fixed it and it didn't work so good :^(

Paul Graham quoted 4000 messages, I only worked with a couple of hundred good emails and 14 bad (all I've kept) so with such a low sample size it is likely that my tests of the filter will be suspect.

It would be nice if a LISP knowledgeable person could check that my implementation of the logic reasonably follows Paul's.

I've uploaded my prototype script on to my site at the address below, be warned it is not thoroughly tested and I'm certainly not letting it be final arbiter of my email just yet:

http://www.codeconscious.com/rebol/mlscripts/spam-filter.r

Regards,
Brett.
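Editorial sketch (Python, not Brett's REBOL script): for readers who want to see why a sample of only 14 spam messages makes the results shaky, here is the per-token probability rule as I read it in Paul Graham's article. The function name and the example counts are hypothetical; the doubling of good counts, the minimum of 5 occurrences and the 0.01/0.99 clamps are taken from the article and may not match what Brett's script does.

```python
def token_probability(good_count, bad_count, n_good, n_bad):
    """Spam probability for one token, or None if it was seen too rarely.

    good_count / bad_count: occurrences of the token in good / spam mail.
    n_good / n_bad: number of good / spam messages in the training set.
    """
    g = 2 * good_count  # good counts are doubled to bias against false positives
    b = bad_count
    if g + b < 5:
        return None  # too little evidence to say anything about this token
    bad_freq = min(1.0, b / n_bad)
    good_freq = min(1.0, g / n_good)
    p = bad_freq / (good_freq + bad_freq)
    return max(0.01, min(0.99, p))  # never fully certain either way


# Hypothetical counts with a training set the size Brett describes
# (about 200 good messages, 14 spam).
p_small_sample = token_probability(good_count=1, bad_count=3, n_good=200, n_bad=14)
```

With only 14 spam messages, a word seen three times in spam and once in good mail already scores above 0.95, so a handful of messages can swing the result.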

[4/16] from: greggirwin:mindspring at: 25-Aug-2002 20:39

Hi Brett, << I ran it on a small set of spam and good emails - and it worked beautifully until I realised that my logic was different to Paul's. :^) Then I fixed it and it didn't work so good :^( >> I hope the broken one, that worked better, is still available for comparison. :) Thanks for chiming in! --Gregg

[5/16] from: gchiu:compkarori at: 26-Aug-2002 16:19

> Paul Graham quoted 4000 messages, I only worked with a couple of hundred
> good emails and 14 bad (all I've kept) so with such a low sample size it is
> likely that my tests of the filter will be suspect.

I get over 50 spam a day. do you want me to send them to you :) I've got all sorts of filters running on my IMAP server, but they still get thru :(

> It would be nice if a LISP knowledgeable person could check that my

<<quoted lines omitted: 4>>

> letting it be final arbiter of my email just yet:

A working Rebol(tm) implementation would be great. Where are you storing the hash tables? -- Graham Chiu

[6/16] from: brett:codeconscious at: 26-Aug-2002 19:41

> I get over 50 spam a day. do you want me to send them to you :)

Eeek! That's very generous of you Graham ;^) Still, it could be worthwhile collecting them for a bit and then zipping them up and sending them to me as an attachment.

> Where are you storing the hash tables?

At present my little prototype is not storing them. It would be fairly straightforward to store them so that they can be incrementally updated, but the problem is any change in the tokenising process could stuff up the data. So it makes sense to work on making the tokenising process stable, perhaps more intelligent (and fast), and then to record the tokenising method with the hash information.

It's a bit of a side project for me, but then I don't get 50 spam emails a day, thankfully :^)

Regards,
Brett.
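Editorial sketch (Python, not part of Brett's script): one way to record the tokenising method alongside the stored counts, so that data built with an older tokeniser is rejected rather than silently mixed in. The version label, the token pattern, the JSON layout and all function names here are assumptions made for illustration.

```python
import json
import re

# Hypothetical label; bump it whenever tokenise() changes behaviour.
TOKENISER_VERSION = "split-v1"


def tokenise(text):
    """Lower-case the text and pull out runs of letters, digits, $, ' and -."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def update(counts, text):
    """Incrementally add one message's tokens to a counts dictionary."""
    for tok in tokenise(text):
        counts[tok] = counts.get(tok, 0) + 1


def save_counts(path, good, bad):
    """Store both tables together with the tokeniser that produced them."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"tokeniser": TOKENISER_VERSION, "good": good, "bad": bad}, f)


def load_counts(path):
    """Load the tables, refusing data built by a different tokeniser."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if data.get("tokeniser") != TOKENISER_VERSION:
        raise ValueError("counts were built with a different tokeniser; retrain")
    return data["good"], data["bad"]
```

Refusing to load mismatched data forces a retrain after a tokeniser change, which is slower but avoids the corrupted counts Brett is worried about.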

[7/16] from: brett:codeconscious at: 26-Aug-2002 19:32

> << I ran it on a small set of spam and good emails - and it worked
> beautifully until I realised that my logic was different to Paul's. :^)
> Then I fixed it and it didn't work so good :^( >>
>
> I hope the broken one, that worked better, is still available for
> comparison. :)

The broken program took the "interesting" words to be those with the 15 *highest* probabilities of being spam, instead of the 15 whose probabilities lie furthest from 0.5 in either direction. So with small sample sizes it looked like it was working really well, but I doubt that it would work in all cases. Probably worth testing against a good set of emails.

Regards,
Brett.
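Editorial sketch (Python, not Brett's REBOL code): the difference between the two selection rules can be shown on a made-up message. The token probabilities below are invented for illustration, and the combination formula is the one I understand Paul Graham's article to use; the function names are hypothetical.

```python
def most_interesting(probs, n=15):
    """Fixed rule: the n tokens whose probability lies furthest from 0.5."""
    ranked = sorted(probs.items(), key=lambda kv: abs(kv[1] - 0.5), reverse=True)
    return ranked[:n]


def highest_only(probs, n=15):
    """Broken rule: the n tokens with the highest spam probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


def combined_probability(ps):
    """Combine token probabilities: prod(p) / (prod(p) + prod(1 - p))."""
    spam = 1.0
    ham = 1.0
    for p in ps:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# Invented per-token spam probabilities for a single message.
probs = {"viagra": 0.99, "click": 0.90, "rebol": 0.01,
         "brett": 0.02, "the": 0.50, "offer": 0.80}

fixed = most_interesting(probs, n=4)
broken = highest_only(probs, n=4)

fixed_score = combined_probability([p for _, p in fixed])
broken_score = combined_probability([p for _, p in broken])
```

In this made-up case the broken rule never looks at "rebol" or "brett", the strongest evidence that the message is legitimate, so it scores the message far higher than the fixed rule does.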

[8/16] from: al:bri:xtra at: 26-Aug-2002 22:27

Brett wrote:

> ...the problem is any change in the tokenising process could stuff up the
> data. So it makes sense to work on making the tokenising process stable,
> perhaps more intelligent (and fast) and then to record the tokenising method
> with the hash information.

I've been getting spam email that's encrypted? and only decrypts? in my email program. This kind of spam could be a problem. Once I get another of this kind, I'll look more closely at the data, but not the content, bleargh! You don't want to know and neither do I!!

Andrew Martin
ICQ: 26227169
http://valley.150m.com/

[9/16] from: anton:lexicon at: 27-Aug-2002 0:32

I don't think it would be a problem for the Bayesian algorithm, as it would still get tell-tale signs from the headers.

Anton.

[10/16] from: gchiu:compkarori at: 4-Sep-2002 18:50

On Mon, 26 Aug 2002 11:30:30 +1000 "Brett Handley" <[brett--codeconscious--com]> wrote:

> It would be nice if a LISP knowledgeable person could check that my

<<quoted lines omitted: 5>>

> arbiter of my email just yet:
>
> http://www.codeconscious.com/rebol/mlscripts/spam-filter.r

There's a SourceForge project on this now: http://sourceforge.net/projects/spamprobe/

-- Graham Chiu

[11/16] from: gchiu:compkarori at: 10-Sep-2002 21:45

On Mon, 26 Aug 2002 22:27:26 +1200 "Andrew Martin" <[Al--Bri--xtra--co--nz]> wrote:

> I've been getting spam email that's encrypted? and only decrypts? in my
> email program. This kind of spam could be a problem.

Do you have any further information on this? I can't imagine why spam would be encrypted ... after all they want you to read it! -- Graham Chiu

[12/16] from: carl:cybercraft at: 10-Sep-2002 23:46

On 10-Sep-02, Graham Chiu wrote:

> On Mon, 26 Aug 2002 22:27:26 +1200
> "Andrew Martin" <[Al--Bri--xtra--co--nz]> wrote:

<<quoted lines omitted: 5>>

> I can't imagine why spam would be encrypted ... after all
> they want you to read it!

He's got a point Andrew. (: I've noticed a few big (100k+) spams arriving lately, but I've deleted them on the server without looking at them. Can't remember what their subject line was. Will download the next one if there's any more...

Oh, yes, I remember now - they were pretending to be a "returned mail" error, yet they weren't anything I'd sent, this being on an email address I don't use for sending emails.

-- Carl Read

[13/16] from: gchiu:compkarori at: 11-Sep-2002 8:15

On Mon, 26 Aug 2002 11:30:30 +1000 "Brett Handley" <[brett--codeconscious--com]> wrote:

I've been playing around with Brett's code. It took 17 mins to tokenise 770 sample spam messages and 516 "good" messages (email was first pulled from my email server, and then saved to local storage before starting the test). I ended up with 34052 unique tokens from the good mail, and 60516 tokens from the spam.

I then ran a test on the same body of good and bad emails. The script detected one of the "good" emails as being spam, and looking at that email, I found that I had misclassified that message as good whereas it was in fact spam! The script only detected 604 of the 770 (about 78%) as being spam. I suspect others will have better results than this. My email is already heavily filtered - I have about 40 filters running on my mail server, so the tests were run on messages that had got thru the filters. Also, a lot of what I consider spam actually looks like my good mail.

The only significant change I made to Brett's code was to strip out attachments before tokenising the message. However, I still need to decode text/html base64 encoded messages and tokenise them rather than discarding these attachments.

-- Graham Chiu
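Editorial sketch (Python, not Graham's REBOL changes): decoding base64-encoded text/html parts before tokenising, instead of discarding them, could look roughly like this using Python's standard `email` package. The function names, the crude tag-stripping regex and the sample message are assumptions for illustration only.

```python
import base64
import re
from email import message_from_string


def text_parts(raw):
    """Return the decoded body of every text/* part in a raw message."""
    msg = message_from_string(raw)
    out = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers, images, audio, etc.
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            out.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name in the header
            out.append(payload.decode("latin-1"))
    return out


def strip_tags(html):
    """Very crude HTML tag removal; a real filter may want to keep some tags."""
    return re.sub(r"<[^>]+>", " ", html)


# Made-up message whose only body is base64-encoded HTML.
html = "<html><body><b>Cheap</b> offer</body></html>"
raw = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="us-ascii"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(html.encode("ascii")).decode("ascii")
    + "\n"
)

decoded = text_parts(raw)
tokens = strip_tags(decoded[0]).split()
```

Non-text parts such as the jpg and MIDI attachments mentioned elsewhere in the thread are skipped, which matches the idea of stripping attachments while still reading encoded HTML bodies.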

[14/16] from: carl:cybercraft at: 11-Sep-2002 14:32

On 10-Sep-02, Carl Read wrote:

> On 10-Sep-02, Graham Chiu wrote:
>> On Mon, 26 Aug 2002 22:27:26 +1200

<<quoted lines omitted: 11>>

> what their subject line was. Will download the next one if there's
> any more...

There was, but it wasn't encrypted. It consisted of some HTML, a jpg and a 96k "audio/x-midi" file which I didn't attempt to listen to. How dull. -- Carl Read

[15/16] from: al:bri:xtra at: 11-Sep-2002 17:03

Graham wrote:

> > I can't imagine why spam would be encrypted ... after all they want you
> > to read it!

Carl wrote:

> He's got a point Andrew. (:

:) When I look at the headers and content of the spam email as plain text, it seems to be gibberish (maybe URL-encoded?). When viewing through my email client, the client "un-decodes" and displays the colourful content of the spam. I've been deleting the horrible ones (you don't want to know) and haven't got any examples yet.

Andrew Martin
ICQ: 26227169
http://valley.150m.com/

[16/16] from: gchiu:compkarori at: 15-Sep-2002 23:19

On Mon, 26 Aug 2002 11:30:30 +1000 "Brett Handley" <[brett--codeconscious--com]> wrote:

> I've uploaded my prototype script on to my site at the address below, be
> warned it is not thoroughly tested and I'm certainly not letting it be final
> arbiter of my email just yet:
>
> http://www.codeconscious.com/rebol/mlscripts/spam-filter.r

I've taken Brett's code from the IOS server (I'm not sure it's the same as the one above), and created a "web service" out of it just so that you can see what it does.

http://207.8.27.211/spam/index.html

Just paste a complete email with all the headers into the box, and "test" it to see if it is considered spam or not. The database I'm using is built from 2597 good emails and 876 spam. At the moment it does not update itself, i.e. it does not learn, as I have to consider the issue of file locking etc.

What I would like to do is to tokenise the email locally, and just send the tokens to the web service (perhaps SOAP or a Rugby service). Trouble is I don't know whether what I consider spam is what others consider spam. I would be interested to see what results people get.

-- Graham Chiu

Notes

Quoted lines have been omitted from some messages.