fighting spam paper & links / naive bayes / anybody ?
[1/16] from: jjmmes::yahoo::es at: 25-Aug-2002 18:07
You might find this link interesting :
http://www.paulgraham.com/spam.html
It describes filtering spam with naive Bayes and also
has some links at the bottom of the page related to
fighting spam.
Is anybody interested in, or working on, naive Bayes
algorithms?
Thanks
[2/16] from: greggirwin:mindspring at: 25-Aug-2002 14:45
Hi Jose,
<< Is anybody interested or working on naive bayes algorithms ? >>
Someone posted that on an IOS server and I think Brett Handley has taken an
initial crack at it. Maybe he'll jump in here.
--Gregg
[3/16] from: brett:codeconscious at: 26-Aug-2002 11:30
> Someone posted that on an IOS server and I think Brett Handley has taken an
> initial crack at it. Maybe he'll jump in here.
*jump* *stumble* *ahem*
Yes I had a go. I naively tried to translate the LISP code (without knowing
LISP) from the Paul Graham article into some REBOL code and just put it
together into something that ran. I did not give a lot of thought to making a
nicely structured and fast solution - I just wanted to understand what was
going on. I ran it on a small set of spam and good emails - and it worked
beautifully until I realised that my logic was different to Paul's. :^)
Then I fixed it and it didn't work so good :^(
Paul Graham quoted 4000 messages, I only worked with a couple of hundred
good emails and 14 bad (all I've kept) so with such a low sample size it is
likely that my tests of the filter will be suspect.
It would be nice if a LISP knowledgeable person could check that my
implementation of the logic reasonably follows Paul's.
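For anyone else comparing a translation against the article, here is a rough Python rendering of the per-token calculation as the article describes it. It is not Brett's REBOL script: the function name, argument names and dictionary layout are made up for illustration, and it assumes both corpora contain at least one message.

```python
def token_spam_probability(token, good_counts, bad_counts, ngood, nbad):
    """Spam probability for one token, following the article's formula.

    good_counts / bad_counts: dicts mapping token -> occurrences in each corpus
    (hypothetical layout). ngood / nbad: number of good and spam messages.
    Returns None when the token has been seen too rarely to score.
    """
    g = 2 * good_counts.get(token, 0)   # good counts are doubled to bias against false positives
    b = bad_counts.get(token, 0)
    if g + b < 5:                       # too little evidence for this token
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    p = bad_ratio / (good_ratio + bad_ratio)
    return max(0.01, min(0.99, p))      # clamp so no token is ever "certain"
```

With a corpus as lopsided as 200 good and 14 spam messages, a word that appears in spam only a handful of times already hits the 0.99 ceiling, which is one reason results on a small sample are hard to trust.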
I've uploaded my prototype script on to my site at the address below, be
warned it is not thoroughly tested and I'm certainly not letting it be final
arbiter of my email just yet:
http://www.codeconscious.com/rebol/mlscripts/spam-filter.r
Regards,
Brett.
[4/16] from: greggirwin:mindspring at: 25-Aug-2002 20:39
Hi Brett,
<< I ran it on a small set of spam and good emails - and it worked
beautifully until I realised that my logic was different to Paul's. :^)
Then I fixed it and it didn't work so good :^( >>
I hope the broken one, that worked better, is still available for
comparison. :)
Thanks for chiming in!
--Gregg
[5/16] from: gchiu:compkarori at: 26-Aug-2002 16:19
>Paul Graham quoted 4000 messages, I only worked with a
>couple of hundred
>good emails and 14 bad (all I've kept) so with such a low
>sample size it is
>likely that my tests of the filter will be suspect.
I get over 50 spam a day. Do you want me to send them to
you? :)
I've got all sorts of filters running on my IMAP server,
but they still get thru :(
>It would be nice if a LISP knowledgeable person could
>check that my
<<quoted lines omitted: 4>>
>letting it be final
>arbiter of my email just yet:
A working Rebol(tm) implementation would be great.
Where are you storing the hash tables?
--
Graham Chiu
[6/16] from: brett:codeconscious at: 26-Aug-2002 19:41
> I get over 50 spam a day. Do you want me to send them to
> you? :)
Eeek! That's very generous of you Graham ;^)
Still, it could be worthwhile collecting them for a bit and then zipping them
up and sending them to me as an attachment.
> Where are you storing the hash tables?
At present my little prototype is not storing them. It would be fairly
straightforward to store them so that they can be incrementally updated,
but the problem is any change in the tokenising process could stuff up the
data. So it makes sense to work on making the tokenising process stable,
perhaps more intelligent (and fast) and then to record the tokenising method
with the hash information.
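One way to record the tokenising method alongside the stored counts, sketched in Python rather than REBOL. Everything here is an assumption made for illustration: the JSON file layout, the version label, and the simple tokeniser (loosely based on the article's description of token characters: alphanumerics, dashes, apostrophes and dollar signs).

```python
import json
import re

# Hypothetical label; change it whenever tokenise() changes behaviour.
TOKENIZER_VERSION = "alnum-dash-apostrophe-dollar/1"

def tokenise(text):
    """Split text into tokens made of letters, digits, $, ' and -."""
    return re.findall(r"[A-Za-z0-9$'-]+", text)

def save_counts(path, counts):
    """Write token counts together with the tokeniser label that produced them."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"tokenizer": TOKENIZER_VERSION, "counts": counts}, f)

def load_counts(path):
    """Read token counts back, refusing data built by a different tokeniser."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if data.get("tokenizer") != TOKENIZER_VERSION:
        raise ValueError("counts were built with a different tokeniser; retrain")
    return data["counts"]
```

Refusing to load mismatched data is the cautious choice: silently mixing counts from two tokenisers would skew every probability without any visible error.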
It's a bit of a side project for me, but then I don't get 50 spam emails a
day thankfully :^)
Regards,
Brett.
[7/16] from: brett:codeconscious at: 26-Aug-2002 19:32
> << I ran it on a small set of spam and good emails - and it worked
> beautifully until I realised that my logic was different to Paul's. :^)
> Then I fixed it and it didn't work so good :^( >>
>
> I hope the broken one, that worked better, is still available for
> comparison. :)
The broken program calculated the "interesting" words to be the words with
the *highest* 15 probabilities of being spam - instead of the 15 whose
probabilities lie furthest from 0.5. So with small sample sizes it looked like it was
working really well, but I doubt that it would work in all cases. Probably
worth testing against a good set of emails.
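The difference between the two selections can be sketched in Python as follows. This is an illustration, not the REBOL script: the function names are invented, while the count of 15 and the combining formula come from the article. It assumes every probability has already been clamped away from 0 and 1, so the denominator cannot be zero.

```python
def most_interesting(probs, n=15):
    """Keep the n probabilities furthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]

def highest_only(probs, n=15):
    """The mistaken selection: simply the n largest probabilities."""
    return sorted(probs, reverse=True)[:n]

def combined_probability(probs):
    """Combine per-token probabilities: prod(p) / (prod(p) + prod(1 - p))."""
    prod = 1.0
    inverse = 1.0
    for p in probs:
        prod *= p
        inverse *= 1.0 - p
    return prod / (prod + inverse)

# A message with two strongly "good" tokens and a few spammy ones (n=3 for brevity).
probs = [0.99, 0.9, 0.01, 0.02, 0.6, 0.4]
print(combined_probability(highest_only(probs, 3)))      # looks only at spammy tokens
print(combined_probability(most_interesting(probs, 3)))  # lets strong "good" tokens vote too
```

Taking only the highest probabilities throws away the evidence that a message is good, so it will tend to flag anything containing a few spammy words. On a sample with only 14 spam messages that failure may simply not show up.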
Regards,
Brett.
[8/16] from: al:bri:xtra at: 26-Aug-2002 22:27
Brett wrote:
> ...the problem is any change in the tokenising process could stuff up the
> data. So it makes sense to work on making the tokenising process stable,
> perhaps more intelligent (and fast) and then to record the tokenising method
> with the hash information.
I've been getting spam email that's encrypted? and only decrypts? in my
email program. This kind of spam could be a problem.
Once I get another of this kind, I'll look more closely at the data, but not
the content, bleargh! You don't want to know and neither do I!!
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[9/16] from: anton:lexicon at: 27-Aug-2002 0:32
I don't think it would be a problem for the
Bayesian algorithm, as it would still get tell-tale
signs from the headers.
Anton.
[10/16] from: gchiu:compkarori at: 4-Sep-2002 18:50
On Mon, 26 Aug 2002 11:30:30 +1000
"Brett Handley" <[brett--codeconscious--com]> wrote:
>It would be nice if a LISP knowledgeable person could
>check that my
<<quoted lines omitted: 5>>
>arbiter of my email just yet:
> http://www.codeconscious.com/rebol/mlscripts/spam-filter.r
There's a SourceForge project on this now:
http://sourceforge.net/projects/spamprobe/
--
Graham Chiu
[11/16] from: gchiu:compkarori at: 10-Sep-2002 21:45
On Mon, 26 Aug 2002 22:27:26 +1200
"Andrew Martin" <[Al--Bri--xtra--co--nz]> wrote:
>I've been getting spam email that's encrypted? and only
>decrypts? in my
>email program. This kind of spam could be a problem.
Do you have any further information on this?
I can't imagine why spam would be encrypted ... after all
they want you to read it!
--
Graham Chiu
[12/16] from: carl:cybercraft at: 10-Sep-2002 23:46
On 10-Sep-02, Graham Chiu wrote:
> On Mon, 26 Aug 2002 22:27:26 +1200
> "Andrew Martin" <[Al--Bri--xtra--co--nz]> wrote:
<<quoted lines omitted: 5>>
> I can't imagine why spam would be encrypted ... after all
> they want you to read it!
He's got a point Andrew. (:
I've noticed a few big (100k+) spams arriving lately, but I've
deleted them on the server without looking at them. Can't remember
what their subject line was. Will download the next one if there's
any more...
Oh, yes, I remember now - they were pretending to be a "returned mail"
error, yet they weren't anything I'd sent, this being on an email
address I don't use for sending emails.
--
Carl Read
[13/16] from: gchiu:compkarori at: 11-Sep-2002 8:15
On Mon, 26 Aug 2002 11:30:30 +1000
"Brett Handley" <[brett--codeconscious--com]> wrote:
>Paul Graham quoted 4000 messages, I only worked with a
>couple of hundred
>good emails and 14 bad (all I've kept) so with such a low
>sample size it is
>likely that my tests of the filter will be suspect.
I've been playing around with Brett's code.
It took 17 mins to tokenise 770 sample spam messages, and
516 "good" messages ( email was first pulled from my email
server, and then saved to local storage before starting
the test).
I ended up with 34052 unique tokens from the good mail,
and 60516 tokens from the spam.
I then ran a test on the same body of good and bad emails.
The script detected one of the "good" emails as being spam,
and looking at that email, I found that I had
misclassified that message as good whereas it was in fact
spam!
The script only detected 604 of the 770 as being spam.
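Put as rates, using only the figures quoted in this message (the variable names are just for illustration):

```python
spam_total = 770        # spam messages in the test
spam_detected = 604     # flagged as spam by the script
good_total = 516        # messages filed as "good"
good_flagged = 1        # "good" messages the script flagged as spam
mislabelled = 1         # of those, how many turned out to be real spam

detection_rate = spam_detected / spam_total
missed = spam_total - spam_detected
false_positives = good_flagged - mislabelled

print(f"detected {detection_rate:.1%} of spam, missed {missed}")
print(f"false positives: {false_positives} of {good_total - mislabelled} genuine good messages")
```

That works out to roughly 78% of spam caught with no genuine false positives. Because the test ran on the same messages the token counts were built from, the figures on unseen mail could well be lower.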
I suspect others will have better results than this. My
email is already heavily filtered - I have about 40
filters running on my mail server, so the tests were run
on messages that had got thru the filters. Also, a lot of
what I consider spam actually looks like my good mail.
The only significant change I made to Brett's code was
to strip out attachments before tokenising the message.
However, I still need to decode text/html base64-encoded
messages and tokenise them rather than discarding these
attachments.
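As a point of comparison, decoding the text parts before tokenising might look like this in Python, using the standard library's email package. It is a sketch only: the function name is invented, the fallback to latin-1 for missing or unknown charsets is an assumption, and a REBOL version would need its own MIME handling.

```python
from email import message_from_string

def text_parts(raw_message):
    """Return the decoded text of every text/* part of a raw email."""
    msg = message_from_string(raw_message)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue                              # skip images, audio, multipart containers
        payload = part.get_payload(decode=True)   # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:                       # charset name Python does not know
            texts.append(payload.decode("latin-1", errors="replace"))
    return texts
```

Each returned string can then go through the same tokeniser as the headers and plain-text body, so a base64-encoded text/html part contributes tokens instead of being discarded.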
--
Graham Chiu
[14/16] from: carl:cybercraft at: 11-Sep-2002 14:32
On 10-Sep-02, Carl Read wrote:
> On 10-Sep-02, Graham Chiu wrote:
>> On Mon, 26 Aug 2002 22:27:26 +1200
<<quoted lines omitted: 11>>
> what their subject line was. Will download the next one if there's
> any more...
There was, but it wasn't encrypted. It consisted of some HTML, a jpg
and a 96k "audio/x-midi" file which I didn't attempt to listen to.
How dull.
--
Carl Read
[15/16] from: al:bri:xtra at: 11-Sep-2002 17:03
Graham wrote:
> > I can't imagine why spam would be encrypted ... after all they want you
> > to read it!
Carl wrote:
> He's got a point Andrew. (:
:)
When I look at the headers and content of the spam email as plain text, it
seems to be gibberish (maybe URL-encoded?). When viewing through my email
client, the client "un-decodes" and displays the colourful content of the
spam. I've been deleting the horrible ones (you don't want to know) and
haven't got any examples yet.
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[16/16] from: gchiu:compkarori at: 15-Sep-2002 23:19
On Mon, 26 Aug 2002 11:30:30 +1000
"Brett Handley" <[brett--codeconscious--com]> wrote:
>I've uploaded my prototype script on to my site at the
>address below, be
>warned it is not thoroughly tested and I'm certainly not
>letting it be final
>arbiter of my email just yet:
>
> http://www.codeconscious.com/rebol/mlscripts/spam-filter.r
>
I've taken Brett's code from the IOS server ( I'm not sure
it's the same as the one above ), and created a "web
service" out of it just so that you can see what it does.
http://207.8.27.211/spam/index.html
Just paste into the box a complete email with all the
headers, and "test" it to see if it is considered spam or
not.
The database I'm using is from 2597 good emails and 876
spam.
At the moment it does not update itself, i.e. it does not
learn, as I have to consider the issue of file locking
etc. What I would like to do is to tokenise the email
locally, and just send the tokens to the web service
(perhaps SOAP or a Rugby service). Trouble is I don't
know whether what I consider spam is what others consider
spam.
I would be interested to see what results people get.
--
Graham Chiu
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted