Programming IT Technology

Paul Graham on Fighting Spam 690

Ramakrishnan M writes "Paul Graham, the Lisp guru, is back with a great technique to fight spam. It is based on a trust metric, and he claims that only 5 out of 1000 spams leaked through this system, with 0 false positives. Worth looking at."
  • by Anonymous Coward on Friday August 16, 2002 @12:16PM (#4083060)
    Create an E-Mail address called, say, spam@example.net.

    Put a link to it on your website, but tell people not to use it for anything, E.G.

    <a href="mailto:spam@example.net">Spam trap - don't use me</a>

    Then, it'll get harvested along with all the others on your site. That mail box will fill up with spam, and nothing else.

    What good is that? Well, you've got a ready-made list of messages to filter *out* of your other mail boxes!

    So, just write a script that checks each inbound E-Mail against the spam list. If it matches, you *know* it's either:

    1. Spam

    or

    2. An E-Mail that somebody has also sent to the "Don't use me" address.

    In either case, you don't want to read it, so it gets auto-deleted. Nice.

    Oh, I think I'll patent this, and not tell any of you about the royalty I'm going to charge in 15 years time. Hahahahahahaha!!!

    Oh, by the way, first post, first post... NOT!
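The "script that checks each inbound E-Mail against the spam list" could be sketched roughly as follows. This is only an illustration of the idea in the comment above: the `SpamTrap` class, the `fingerprint` helper, and the choice to compare whitespace-normalized, lower-cased bodies are all assumptions, not something the poster specified.

```python
import hashlib
import re


def fingerprint(body: str) -> str:
    """Hash a message body after collapsing whitespace and lower-casing it."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SpamTrap:
    """Remembers fingerprints of everything delivered to the trap address."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def record_trap_message(self, body: str) -> None:
        # Called for every message that arrives at the "don't use me" address.
        self.seen.add(fingerprint(body))

    def is_spam(self, body: str) -> bool:
        # Called for mail arriving at any other mailbox.
        return fingerprint(body) in self.seen


trap = SpamTrap()
trap.record_trap_message("Buy   cheap WIDGETS now!\n")
print(trap.is_spam("buy cheap widgets now!"))  # True: same text after normalization
print(trap.is_spam("Lunch on Friday?"))        # False: never seen by the trap
```

An exact-match fingerprint like this only catches copies that are identical after normalization, so spam that is personalized per recipient would slip past it.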
  • by Dimensio ( 311070 ) <darkstar@LISPiglou.com minus language> on Friday August 16, 2002 @12:21PM (#4083113)
    Spammers will try to work around filters, as they don't care that no one wants their crap. Further, filtering it doesn't solve the bandwidth situation, as the lines are still tied up with the bits running through the system until it hits the filter.

    There is only one good solution for spam: killing spammers. It should be done, and it should be done brutally and painfully. When known criminal spammers like Ralsky (who ran a child pornography site at one point) are brutally murdered, others may think twice before firing up "EmailBlaster 2002".
  • by Bazzargh ( 39195 ) on Friday August 16, 2002 @12:23PM (#4083131)
    Here's how: the spam should be written as a 'multipart/alternative' with an html version of the spam as the primary alternate. The text version contains an innocuous message intended to pass the statistical spam filter. The spam message is entirely contained as an /image/ within the html. The text of the spam becomes invisible to the filter but not to the poor schmuck who gets the email.

    I'm guessing here that the inclusion of a single image tag in the html is unlikely to trigger the spam filter, and supplying a wealth of evidence that the email is 'not' spam in the unseen alternate text will let the letter through.
  • by xipho ( 193257 ) on Friday August 16, 2002 @12:26PM (#4083155)
    This is the brilliant part, and crucial to the endeavour, and so easy to implement!

    It appears all the nay-sayers here haven't even read the article (no surprise). With as little code as needed to implement this it should be a must in the next mozilla mail/pine etc. code base.
  • by mr.nicholas ( 219881 ) on Friday August 16, 2002 @12:26PM (#4083158)
    Having had the same email address since '93, I receive close to 1000 spams per day to my personal account (which is also aliased from root/postmaster/webmaster).

    I've tried everything under the planet to reduce the amount that I see in my mailbox; SpamAssassin being one of the best so far. But even that lets through quite a bit (around 10%).

    So I decided to attack it from a different angle. I wrote a series of perl-scripts that I plunked into my procmail file.

    The scripts work by checking the address of the sender each time a message is received. That address is looked up in a database. If it exists in the db, and it's marked as "authorized", it's just passed into my mailbox.

    If it's marked as denied, /dev/null.

    If it's never been seen before, an authentication message is sent to the sender asking them to reply to it to authorize themselves. If that authmessage is bounced back, a db entry is made as "denied".

    If it's replied to in a normal fashion, that email is marked as "authorized" and any queued up mail from that person is pushed out.

    The concept is that spam will almost never have a valid reply-to; so it will bounce and be marked as denied.

    Even if the email doesn't bounce, no spammer alive will reply to it; so after 30 days, that email is marked as "denied".

    Since I've set this up (for myself and my 10-year-old son who receives porn in his box (grrr!!!!)), it has worked flawlessly. The "real" email is unharmed, while the spam is stopped.

    Oh, and I have a web-based control page so that users can manually add email addresses (for lists and such).

    This week, for the first time in YEARS, I don't have spam in my mailbox anymore.

    Hurray!

    Now if I can only stop that damned dictionary-based scanning of my servers, I'll be set. Thank the gods that I don't have metered service.
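The authorize/deny/challenge flow described in this comment can be sketched as a small state table. The poster's actual scripts were Perl run from procmail and were not posted, so everything here (the `SenderGate` class, its method names, the in-memory dictionaries standing in for the database) is a hypothetical reconstruction; only the three states and the 30-day timeout come from the comment.

```python
from datetime import datetime, timedelta

AUTHORIZED, DENIED, PENDING = "authorized", "denied", "pending"
CHALLENGE_TTL = timedelta(days=30)  # unanswered challenges become "denied"


class SenderGate:
    def __init__(self):
        self.status = {}      # sender address -> state
        self.challenged = {}  # sender address -> time the challenge was sent
        self.queue = {}       # sender address -> messages held for review

    def receive(self, sender, message, now):
        """Return 'deliver', 'drop', or 'hold' for an inbound message."""
        state = self.status.get(sender)
        if state == AUTHORIZED:
            return "deliver"
        if state == DENIED:
            return "drop"
        if state is None:
            # First contact: a real system would e-mail the challenge here.
            self.status[sender] = PENDING
            self.challenged[sender] = now
        self.queue.setdefault(sender, []).append(message)
        return "hold"

    def challenge_answered(self, sender):
        """The sender replied normally: authorize and release queued mail."""
        self.status[sender] = AUTHORIZED
        self.challenged.pop(sender, None)
        return self.queue.pop(sender, [])

    def challenge_bounced(self, sender):
        """The challenge bounced: the address is treated as forged."""
        self._deny(sender)

    def expire(self, now):
        """Deny every sender whose challenge has gone unanswered too long."""
        for sender, sent in list(self.challenged.items()):
            if now - sent >= CHALLENGE_TTL:
                self._deny(sender)

    def _deny(self, sender):
        self.status[sender] = DENIED
        self.challenged.pop(sender, None)
        self.queue.pop(sender, None)
```

A manual whitelist entry (the web control page mentioned above) would simply set `status[sender] = AUTHORIZED` ahead of time, which is also how mailing lists and automated senders would have to be handled.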
  • Misleading (Score:5, Interesting)

    by RainbowSix ( 105550 ) on Friday August 16, 2002 @12:29PM (#4083187) Homepage
    He isn't fighting spam, he is filtering it. There is a difference. Filtering still costs in bandwidth. Fighting it would eliminate the source and free up the gigabytes of bandwidth lost for this marketing purpose.

    Filtering is fine for now, but ultimately it must be fought and defeated.
  • Re:A weak point... (Score:2, Interesting)

    by sebi ( 152185 ) on Friday August 16, 2002 @12:30PM (#4083192)
    You should have continued to read the article.

    To beat Bayesian filters, it would not be enough for spammers to make their emails unique or to stop using individual naughty words. They'd have to make their mails indistinguishable from your ordinary mail. And this I think would severely constrain them. Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character.

    Basically the only way to get around this proposed method of statistical analysis is to completely change the way spam copy is written. But changing that would basically defeat the whole point of spam. If, to get through a filter, you had to stop writing sales pitches, then why spam in the first place?
  • by Erore ( 8382 ) on Friday August 16, 2002 @12:35PM (#4083219)

    I'm continually amazed at the people who are beating their heads up against a very simple problem. The answer is not statistics, it is not heuristics, it is not AI, it is not procmail.

    The answer is verification...aka whitelists. Check out TMDA, tmda.sourceforge.net. This program assumes you don't want mail from anybody whom you haven't explicitly allowed, or who has verified that they are a real person and not a spammer.

    Verification is simple, and some people will point out that it could be defeated by a spammer. But, the economics of spam do not make it feasible for a spammer to attempt to defeat TMDA.

    TMDA is similar to making your phone number private. You only get phone calls from people you have given your number to, and you never get telemarketers.

    TMDA user since December 2001. Spam messages that tried to get in, 12,133, spam messages that got in 3, false positives, 0. Time I've spent tweaking and modifying the program since installation, 0 minutes.

  • by Anonymous Coward on Friday August 16, 2002 @12:35PM (#4083221)
    Huh, actually 5 minutes editing my Outlook mail rules achieved exactly the same thing and I've been nearly spam free for years even though I receive at least 300 a day from my domain. No scripts, no voodoo. Just simple point and click. There's the difference between closed source and open source. Closed source you use, open source you code.
  • Another idea (Score:2, Interesting)

    by caesar79 ( 579090 ) on Friday August 16, 2002 @12:42PM (#4083285)
    a nice idea to filter spam ...another one to fight it.

    1. The MTAs (mail transport agents like sendmail, etc.) establish trust relationships between themselves, either automatically or manually. They also maintain each user's safelist (i.e. address book + list of addresses the user wants to receive mail from).

    2. All email over the trusted links and from addresses in the safelist are delivered unfiltered.

    3. For each email sent over an untrusted link
    a. Perform MD5 over message body.
    b. Ask neighbouring trusted agents if they have received an email whose MD5 is given.
    c. If the number of positives is greater than a threshold, reject as spam.
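Step 3 of the scheme above might look something like this. It is a sketch under stated assumptions: the `Agent` class and the `looks_like_bulk` function are made-up names, the "neighbouring trusted agents" are modelled as in-process objects rather than network peers, and MD5 is used only because the comment names it.

```python
import hashlib


def body_digest(body):
    """MD5 digest of the message body, as in step 3a."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


class Agent:
    """A mail agent that remembers digests of the bodies it has received."""

    def __init__(self, name):
        self.name = name
        self.seen = set()

    def remember(self, body):
        self.seen.add(body_digest(body))

    def has_seen(self, digest):
        # Step 3b: answer a neighbour's query about a digest.
        return digest in self.seen


def looks_like_bulk(body, trusted_peers, threshold):
    """Step 3c: reject when more than `threshold` peers saw the same body."""
    digest = body_digest(body)
    positives = sum(1 for peer in trusted_peers if peer.has_seen(digest))
    return positives > threshold
```

Because the digest covers the exact body, a single changed character per recipient would produce a different MD5 and defeat the vote, so a practical version would probably need some normalization first.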
  • by FuzzyDaddy ( 584528 ) on Friday August 16, 2002 @12:43PM (#4083289) Journal
    Could this technique be used as a way to track evolving spam techniques over time?

    You could develop a corpus of spam over a long period of time, and look for shifts in the data. What this paper describes is distinguishing between a spam-corpus and a legit-corpus, but you could also compare a spam-1999 corpus to a spam-2002 corpus, and see if the spammers are up to anything new.

    Not that it would be useful, but it might be kind of cool to try it out and see.
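One simple way to try the corpus comparison suggested above is to look at which tokens gained the most relative frequency between two collections. The function name `token_shift` and the whitespace tokenizer are assumptions for illustration; nothing here comes from the article itself.

```python
from collections import Counter


def token_shift(old_corpus, new_corpus, top=10):
    """Return the tokens whose share of all tokens grew most from old to new.

    Each corpus is a list of message strings; tokens are split on whitespace.
    """
    old = Counter(w for msg in old_corpus for w in msg.lower().split())
    new = Counter(w for msg in new_corpus for w in msg.lower().split())
    old_total = sum(old.values()) or 1  # avoid dividing by zero on empty input
    new_total = sum(new.values()) or 1
    shift = {
        w: new[w] / new_total - old[w] / old_total
        for w in set(old) | set(new)
    }
    return sorted(shift.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Calling it with a spam-1999 list and a spam-2002 list would surface the vocabulary that is new or growing; reversing the arguments shows what has fallen out of use.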

  • Re:Misleading (Score:3, Interesting)

    by cybermace5 ( 446439 ) <g.ryan@macetech.com> on Friday August 16, 2002 @12:45PM (#4083307) Homepage Journal
    Wha...? Did you read the article?

    Filtering == Fighting

    The entire success of spam depends on human eyes reading it. If no one ever sees the spam, then spammers will have no money. Then they'll quit SENDING spam and have to start EATING it! Ahahaha!

    They can have the spam, egg, bacon, spam, CROW, spam, and spam.
  • by jglow ( 525234 ) on Friday August 16, 2002 @12:49PM (#4083341) Homepage Journal
    The good thing about his method is that even if a spammer gets ahold of his analysis, the more spam is received with those words, the more it slowly lowers the likelihood of such a message being scored as real email, thus dumping those messages into the spam box.
  • Method applications (Score:3, Interesting)

    by lovebyte ( 81275 ) <lovebyte2000@gm[ ].com ['ail' in gap]> on Friday August 16, 2002 @12:52PM (#4083362) Homepage
    Good method. I work with Bayesian techniques often and I had thought of the same thing but for a different purpose: automatic classification of emails. When you receive an email, your mail reader would propose a list of potential folders into which you might want to put your email after (or before) having read it. And the best thing is that it learns with time and it gets better. And as this article shows, this method can also automatically filter emails. Now if I have time to get involved in the Evolution project or kmail, ...
  • by michaelwexler ( 521484 ) on Friday August 16, 2002 @12:52PM (#4083365)
    Feel free to review the work at http://research.microsoft.com/~horvitz/junkfilter.htm [microsoft.com]

    They came up with similar processes to both filter and to categorize. Bayesian analysis is very flexible, and while Paul Graham is not the first to try this, his work looks very exciting.

    I had nothing to do with any of this work; just a fan of Bayesian research.

    Michael
  • by Anonymous Coward on Friday August 16, 2002 @12:53PM (#4083372)
    Wow, don't you people actually read articles? The Slashdot crowd is so stupid.

    You would very likely not miss that letter from your wife for a number of reasons. Just as "sex" and "sexy" would increase the probability that the mail is spam, there would be words that would decrease the probability that the mail is spam. This method doesn't cue on just one or two words, it looks for fifteen words that are strongly weighted toward spam or no-spam.

    And as you get more and more mail like this from your wife, the Bayesian algorithms will learn that sex is not much of a factor (getting a .79 probability for you based on your corpora, instead of the .97 for the author, since his wife prefers to talk dirty to him on the phone).

    Why did I write this post? If you were too stupid to read the article, there is no way you will read this. Again, the reason why I never bothered to get a Slashdot account. The community is riddled with idiots.
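To make the "fifteen words" point above concrete, here is a minimal sketch of combining per-word spam probabilities in the naive Bayesian way. The .97 for "sex" is the figure quoted in the comment; every other probability, the 0.4 default for unknown words, and the function name `spam_probability` are assumptions chosen for illustration.

```python
from math import prod

UNKNOWN = 0.4  # assumed probability for a word missing from the table
TOP_N = 15     # number of most telling words to combine


def spam_probability(tokens, token_probs, top_n=TOP_N):
    """Combine the top_n word probabilities furthest from a neutral 0.5.

    token_probs maps word -> P(spam | word) and is assumed to hold values
    strictly between 0 and 1, so the denominator can never be zero.
    """
    probs = [token_probs.get(t, UNKNOWN) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    interesting = probs[:top_n]
    spam = prod(interesting)
    ham = prod(1.0 - p for p in interesting)
    return spam / (spam + ham)


# Hypothetical per-word table; only the .97 comes from the thread.
table = {"sex": 0.97, "click": 0.90, "honey": 0.05, "dinner": 0.02, "tonight": 0.10}

print(spam_probability(["sex", "click"], table))
print(spam_probability(["sex", "honey", "dinner", "tonight"], table))
```

The second message contains the same "bad" word as the first, but the strongly innocent words outweigh it, which is the commenter's argument for why a letter from your wife would not be lost.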
  • xIf xYou xCan xRead xThis xYou xHave xWon xA xFabulous xVacation! xClick xHere xTo xRecieve xYour xPrize!

  • by prester ( 176898 ) on Friday August 16, 2002 @12:59PM (#4083405)
    Did you happen to read the article? He discusses this at length. He makes a strong argument that his system is actually pretty robust, since to get around it consistently the spam has to look just like your real email, which is pretty darn hard for them to do.

    In a lot of ways this problem is like cheating in games. As long as you're the only one who knows the exploit, you can be pretty sure that it's not going to get fixed, though you'll still get kicked off every server you play on. Similarly, with his method a spammer might be able to find a particular phrasing that's likely to get through, though his messages will still be deleted on arrival. But even if he does, if he starts sending you too many emails or starts selling his technique the filter will adapt with the spam and start filtering it out.
  • Nicely done (Score:3, Interesting)

    by hrieke ( 126185 ) on Friday August 16, 2002 @01:02PM (#4083440) Homepage
    What I want to know is:
    Would this also work with email viruses? I think it would, since the virus would also have a defined pattern to it and the program would pick it up after the first one.
    Could this be made part of the SMTP protocol or built into the backbone layer of the network? Again, I see no major reason why it couldn't.
    Problems that I have with it are:
    Since each word is treated as a token and everything else is not, I'm sure that spammers would quickly figure out that a spam like this just might work:
    <HTML>
    <BODY>
    Enlarge <!-- elephant --> penis [etc..]
    </BODY>
    </HTML>
    which would show the message but hide the balancing words, so it could be possible to shift the delta in your favor.
    Does anyone else have thoughts on how this might be broken?
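One possible answer to the hidden-comment trick raised above is to tokenize only what the reader would actually see. The sketch below drops HTML comments and tags before splitting into words; the function name `visible_tokens` and the regular expressions are illustrative assumptions and do not amount to a full HTML parser.

```python
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # <!-- ... --> blocks
HTML_TAG = re.compile(r"<[^>]+>")                    # remaining tags
WORD = re.compile(r"[A-Za-z0-9']+")


def visible_tokens(html):
    """Return lower-cased words from the parts of the HTML a reader would see."""
    without_comments = HTML_COMMENT.sub(" ", html)
    text = HTML_TAG.sub(" ", without_comments)
    return [w.lower() for w in WORD.findall(text)]


sample = "<HTML><BODY>Enlarge <!-- elephant --> your inbox</BODY></HTML>"
print(visible_tokens(sample))  # the hidden word "elephant" is not counted
```

Words hidden by other means (white-on-white text, zero-size fonts) would need separate handling, so this closes only the specific hole described in the comment.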
  • by stuartkahler ( 569400 ) on Friday August 16, 2002 @01:04PM (#4083453)
    Laws will never stop spammers. The damages are very hard to prove, especially when the judge/jury don't realize that their ISP filters their mail for 95+% of the spam already. Most people just don't GET it. And most spammers are sending the spam from another country, running a fly-by-night operation, so prosecution is nearly impossible.
    Filters are helpful, but they still require huge resources to receive the e-mail and process it. And as stated in the article, the risk of a false positive is often much worse than just receiving the spam.
    There are already only a few mail relays that are willing to send out spam, and virtually nobody accepts ANY mail from them. The spam going out is coming through illegally used mail servers. This shows what is to be the solution to the problem of spam: ISPs will only act to stop spam when the spammer is damaging their system.
    Most spam gets deleted without the enclosed links getting clicked by at least 99%. The company hosting the web site just sees their customer getting some success with their business. They don't know why, and they really don't believe/care when someone e-mails them to say that the user spammed them from a mail relay in china. The user probably paid for a 2 gig/month of traffic, and they are well under quota.
    It's time to change that. With a SETI@Home / Prime95 type application, we could easily DDoS a daily spammer off the net. Slashdot alone could easily field 10000 users willing to put their cable modems up to the task of pounding spammers' accounts (and possibly the hosting ISP) off the net. Beat them down until the account appears to be deleted. Maybe then ISPs would hold users accountable for being spammers. Web hosting contracts might start including fines ($500+) for abusing the service, rather than just the scary risk of a cancelled account. All we have to do is beat them down before the few clueless morons come buying and make it worth their while.

    Legal? Sure, I don't see why not. I can send 10 HTTP requests to the ISP in a second... I've never heard of a law that says I can't do that every second. As long as the computers involved are from willing users (sysadmins get permission in writing first), there is no 'hacking'. Every DDoS case I've ever heard of involved charges of 2k+ computers 'hacked', rather than the ensuing attack. Even if it is illegal, this is vigilantism that nobody (other than the hosting ISP) is going to complain about.
  • by einstein ( 10761 ) on Friday August 16, 2002 @01:12PM (#4083511) Homepage Journal
    that sounds like a great system... any plans to release the code? I'd love to set that up at home.
  • fighting spam (Score:3, Interesting)

    by frovingslosh ( 582462 ) on Friday August 16, 2002 @01:16PM (#4083547)
    None of what I saw in the article is, in my mind, effective in fighting spam for the following reasons:

    By the time one can apply the filters, you have already received the spam. This is a load on your resources. In some cases your in-box may even fill up (yes, I've received 1000's of the same piece of spam in the same hour, exceeding the capacity of my allotted storage and effectively DOSing me from real e-mail) or you may exceed limitations from forwarding services.

    The spammers don't really care. Or notice. Their goal is to hit millions of victims, knowing that some of them will respond. The response is all they care about. Filter your e-mail all you want, you were not going to respond to them anyway. All they care about is reaching the mark that doesn't know any better, and this filter doesn't do anything to stop that (unless it is applied automatically by ISPs, unlikely due to the fear of false positives).

    What might help is a two fold attack on what they want: responses from marks. I suggest the following:

    A massive education campaign to educate the general Internet user to never respond to (or even read) strange messages that show up in your e-mail. Banner ads would seem a good place to start, it would be a public service if a good percentage of banners were replaced with ones that educated the Internet users who still make spam profitable. This might even have the long term effect of improving banner revenue: if banners compete with spam as a way to get out a message they have a lower value than if the public is taught to not buy from spam and even to aggressively resist doing business with a spammer. In the long run an antispam banner campaign could improve banner revenue for those who help fight spam. Ideally another great way to get the word out would be UCE, but that poses a moral dilemma....

    The other thing that could affect the spammer is if the ads are not getting the desired results for the advertisers. What needs to happen here isn't filtering, it's massive negative response to the advertiser. No response doesn't hurt them, but making them deal with unwanted responses is a more suitable way to respond to those who originate unwanted messages to us in the first place. These people need to get responses that waste their time and resources like they are wasting ours. Obviously those who supply 800 numbers are a prime target for this, while those who supply only postal addresses make it too costly to respond. I think such negative response campaigns need to be coordinated from major popular sites to be truly effective (not just from a few geeks who spend their day on an anti-spam website; their efforts are much better applied to getting the spam sources into black holes and getting ISPs to block or filter spam). It sure would be nice to see the Slashdot effect applied to spammers rather than the poor schmuck who puts up a small but interesting website.

    Interested in other's thoughts in this area.

  • by LX.onesizebigger ( 323649 ) on Friday August 16, 2002 @01:23PM (#4083613) Homepage
    Even if the email doesn't bounce, no spammer alive will reply to it; so after 30 days, that email is marked as "denied".

    I've seen similar solutions before, and they are all nice and dandy except for one application: when communicating with businesses. What happens when you order a Widget from Acme, Inc. and Acme sends you your confirmation by e-mail? Your script bounces a question, and Acme's mail server either bounces back at you, making it look like it was spam in the first place, or simply doesn't respond at all.

    The system implies that anything not sent by a human being is spam. This is not necessarily the case today. A lot of today's e-mail communications are auto-generated.

    To truly combat spam, it must be fought at the source. One step closer to that would be to integrate a standardized response to the type of message you send out in mail protocols. The problem with this is that all Joe Spammer would have to do is to point his reply-to to a valid business site.

    This brings us to the next point. Forged headers are easy to detect by software and have few (although it would be wrong to say no) legitimate applications. I cannot personally understand why it is not standard operation for mail servers to recognize and bounce messages with forged headers. Sure, it would increase processing load, but if done by all servers, more spam would be stopped closer to the source, meaning less spam to process for all.

    Or am I pulling a thinko here? Anybody?

  • by balamw ( 552275 ) on Friday August 16, 2002 @01:26PM (#4083641)

    The built-in spam filters for Outlook and Hotmail are just so much less efficient than SpamAssassin or Razor/SpamNET.

    My recent experience shows about 90% of the spam I get can be detected by SpamAssassin, 70% by SpamNET and about the same for Hotmail. The Outlook/Outlook Express filters are basically blacklists and catch maybe 40% if properly maintained.

    It does sound very similar, so why haven't they been able to implement a Bayesian filter as successfully as the lisp guru?

  • by kawika ( 87069 ) on Friday August 16, 2002 @01:31PM (#4083696)
    Wow. It's described down to a level of detail that would make you think they've already written the Outlook add-in for it. I wonder why we haven't seen it yet?
  • by Anonymous Coward on Friday August 16, 2002 @01:36PM (#4083753)
    The CRM114 active filter uses the Bayesian
    technique described, but extends the probabilities
    to _phrases_ (including interrupted phrases) not just words.

    For example, the phrase

    Mary had a little lamb

    would insert hash marker entries on

    mary, had, a, little, lamb, mary had, mary a,
    mary little, mary lamb, mary had a, mary had
    little, mary had lamb, mary had a little

    and so on. My experiments say that you are
    just about out of significance at five words
    and it doesn't pay to go past that.

    The advantage of this is that it's often not
    words, but phrases that have the higher-level
    "meaning" (grammatical context?) that is even
    _more_ indicative of spam versus nonspam than
    the singular words taken alone.

    You can grab crm114 at:

    http://crm114.sourceforge.net

    -WSY
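The phrase expansion in the "Mary had a little lamb" example above can be reproduced with a short sketch: for each word, take the next few words inside a five-word window and emit the leading word followed by every in-order subset of the rest. This is only a reading of the comment, not CRM114's actual code (which, per the comment, stores hash entries rather than strings), and `phrase_features` is a made-up name.

```python
from itertools import combinations


def phrase_features(text, window=5):
    """Emit each word plus every in-order subset of the words that follow it.

    Only words inside a sliding window of `window` words are combined, so a
    feature never spans more than `window` consecutive positions.
    """
    words = text.lower().split()
    features = []
    for start, head in enumerate(words):
        tail = words[start + 1 : start + window]
        for size in range(len(tail) + 1):
            # combinations() keeps the original word order within each subset
            for picked in combinations(tail, size):
                features.append(" ".join((head,) + picked))
    return features


features = phrase_features("Mary had a little lamb")
print(len(features))
print(features[:5])
```

Each starting word contributes up to 2^(window-1) features, which is one way to see why going much past five words becomes expensive for little gain.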
  • by Fizyx ( 93551 ) on Friday August 16, 2002 @02:00PM (#4084011)
    Not to filter posts for spam, but for, you know, quality!
  • by Tim Macinta ( 1052 ) <twm@alum.mit.edu> on Friday August 16, 2002 @02:20PM (#4084228) Homepage
    I've seen similar solutions before, and they are all nice and dandy except for one application: when communicating with businesses. What happens when you order a Widget from Acme, Inc. and Acme sends you your confirmation by e-mail? Your script bounces a question, and Acme's mail server either bounces back at you, making it look like it was spam in the first place, or simply doesn't respond at all.

    The system implies that anything not sent by a human being is spam. This is not necessarily the case today. A lot of today's e-mail communications are auto-generated.

    Hmmmm... how about if you were to keep a separate address space for emails you expect to be replied to from businesses? I'll use myself as an example. I could use my main address, twm@alum.mit.edu, to receive personal email and block spam using the technique described by the original poster. When I go to order something online, I could make up addresses at my domain twmacinta.com (for example, "spamproof+amazon8291@twmacinta.com") which could be proactively added to a whitelist before I gave them. I actually worked on a system to do the second half of this solution for awhile (the whitelist aliasing) for users without their own domains, but the one drawback to the system is that it wouldn't stop spam on existing addresses. The original poster's solution sounds like it would make a very nice complement.
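The "make up an address and whitelist it before handing it out" half of this idea could be sketched as below. The `AliasWhitelist` class, the four-digit random suffix, and the in-memory set are assumptions for illustration; the address shape simply mirrors the "spamproof+amazon8291" example in the comment.

```python
import secrets


class AliasWhitelist:
    """Issues per-vendor addresses and accepts mail only for issued ones."""

    def __init__(self, domain, prefix="spamproof"):
        self.domain = domain
        self.prefix = prefix
        self.allowed = set()

    def new_alias(self, label):
        # A random suffix makes the alias hard to guess from the vendor name.
        suffix = secrets.randbelow(10_000)
        address = f"{self.prefix}+{label}{suffix:04d}@{self.domain}"
        self.allowed.add(address.lower())
        return address

    def accepts(self, recipient):
        return recipient.lower() in self.allowed


whitelist = AliasWhitelist("example.com")
alias = whitelist.new_alias("amazon")
print(alias, whitelist.accepts(alias))
```

If an alias later starts receiving spam, removing it from `allowed` both stops the flow and identifies which vendor leaked the address.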

  • by Guppy06 ( 410832 ) on Friday August 16, 2002 @02:32PM (#4084313)
    Senator Mary Landrieu
    724 Hart Senate Office Building
    Washington, DC 20510-0001

    Dear Senator Landrieu:

    Earlier this month the Federal Communications Commission (FCC) issued a record fine of nearly $5.4 million to Fax.com for transmitting unsolicited advertisements via fax machine (i.e. "junk faxing"). Coincidentally, the Associated Press published a series of three articles covering the state of unsolicited e-mail advertising ("spam"). I'm left wondering how the FCC can come down hard on junk faxers but how spammers (arguably of a lower moral class) are allowed to continue to operate nearly unmolested.

    The law Fax.com was found to be guilty of breaking is Section 227 of Title 47 of the United States Code. The relevant text follows:

    Restrictions on the use of automated telephone equipment:

    It shall be unlawful for any person in the United States (...) to use any telephone facsimile machine, computer, or other device to send an unsolicited advertisement to a telephone facsimile machine(.)

    It is my understanding that the reasoning behind this law is based on the ownership of resources. Fax machines are purchased and maintained at the owner's expense and only the owner's expense. An unsolicited advertisement sent to this fax machine amounts to nothing less than the use of these expensive resources without prior consent. In effect "junk faxing" is considered theft and as such the offenders are held accountable by law.

    What does this have to do with spam? In my opinion, everything.

    Receiving an e-mail is by all accounts more expensive than receiving a fax. While several companies are now offering stand-alone e-mail clients, I have yet to see one of those with a lower price tag than a fax machine. But even if their price tags were the same, an e-mail station requires that the owner not only pay a monthly fee for a telephone line but also a second monthly fee for the e-mail account itself.

    Of course not even an end client is enough to receive an e-mail. The e-mail account you would be paying for is maintained on a very large (and very expensive) e-mail server, complete with its dedicated (and pricey) connection to the internet. There is simply nothing comparable to an e-mail server in the faxing domain. While a bank of fax machines doesn't cost more than the price of the machines and their associated telephone lines, the price of a dedicated e-mail server and the associated connections can easily resemble that of a small car.

    So why is it that the FCC is given free rein to crack down on junk faxers but spammers are free to consume valuable equipment they do not own?

    If you are familiar with the AP articles I mentioned earlier you will know that spam is steadily eliminating the usefulness of e-mail itself. It has been estimated that spam accounts for up to 80% of the e-mail traffic to major e-mail domains such as Hotmail and Yahoo, a problem that their respective owners are all but powerless to fix. As more and more internet resources are tied up by these advertisements, the owners of these resources have had to resort to cutting off offending service providers from the rest of the internet entirely. Customers are finding themselves unable to use the internet access they have paid for simply because another customer of that same provider is abusing theirs.

    But even then the providers are powerless to drop spammers. Spammers in the recent AP articles have proudly boasted of the way they outright defraud unsuspecting internet service providers when signing up for an account. And when the provider threatens action, the spammer threatens the provider with legal action. In recent months a spammer was even successful in receiving a legal injunction against their service provider, preventing the provider from stopping the spammer from abusing their resources.

    I have little problem with receiving advertisements through the U. S. Postal Service. I know that the delivery cost for every article in my mailbox has been entirely paid by the sender. And while I am not happy with the current situation with telemarketers (I must pay for local telephone service before I have the "privilege" of being contacted by telemarketers), I must grudgingly admit that the state and federal laws designed to restrict telemarketing have been mostly successful. But I am not happy about paying several thousand dollars for a computer and $20.00 a month simply to have my e-mail account flooded to capacity with advertisements for products and services I have no interest in (and preventing legitimate e-mail from reaching me in the process). I am sure that you yourself have been bombarded with advertisements for websites featuring "nasty teens" or "penis enhancement." I have noticed that your office no longer maintains an e-mail address accessible to the public.

    The examples of spam I mentioned in the last paragraph bring me to another point: I have noticed on your website your stated commitment to enforcing decency laws on the internet, to protecting children from access to objectionable material on the internet. It should be obvious by now to even the most casual of internet users that the biggest offender in this area is the spammer. While a user must actively attempt to locate a website in order to find such material on the world wide web, the mere existence of an e-mail account all but guarantees that the owner will have such material delivered to them on a daily (if not hourly) basis.

    In my opinion the solution to this problem is very simple: expand 47 U.S.C. 227 to prohibit unsolicited e-mail advertisements in exactly the same way it prohibits unsolicited fax advertisements. Nothing more, and certainly nothing less.

    I have seen some ineffective bills drift through both houses of Congress that are written to allow unsolicited messages so long as they have an "opt-out" mechanism. Ignoring the fact that such legal loopholes would essentially negate the law entirely (can you prove that you tried to opt out?), it quite literally sickens me the way some of your fellow members of Congress feel that spam is somehow an issue dealing with the freedom of speech. The mere existence of the internet and the supposed changes it has on how business and the legal system work (even though such "changes" have been shown to be a lie) have helped to convince these poor fools that people should somehow have a right to use and abuse the property of others. Does my neighbor have the constitutional right to break my kneecap so long as they provide me with the ability to "opt out" of future kneecappings?

    The United States Constitution guarantees that all citizens are free to say what they want. It does not guarantee a soapbox upon which they can say it. Just as I am not guaranteed the right to have a billboard on Interstate 10, spammers should not have the "right" to use the resources of others simply because they're there.

    Expanding 47 U. S. C. 227 to include e-mail is an extremely important issue to me, and given the interests stated on your website I hope that it is an important issue to you as well. I know that you are up for re-election this November and I intend to find out how your competitors feel on the issue as well.
  • Re:Circumvent (Score:3, Interesting)

    by bedessen ( 411686 ) on Friday August 16, 2002 @04:07PM (#4085106) Journal
    His algorithm works because spam uses the same repetitive syntax. Because so many spam/emails are sent out - it can be flagged by pattern recognition... based on the assumption that it is written in English!

    Huh? Where do you get that? The algorithm has NO KNOWLEDGE of syntax or structure. It knows only the presence (or absence) of words in the message, nothing of how they are grouped, positioned, ordered, related, structured, etc. There is zero grammar / pattern recognition as far as I can tell. As long as your corpus or database of reference mail is in the same language as the emails you wish to test, then the algorithm would work just fine. Perhaps you were thinking it used Markov chains?
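
    To make the point about word presence concrete, here is a minimal Python sketch of an order-blind tokenizer. The `tokens` helper and its regular expression are assumptions made for illustration, not the tokenizer from the article; the only thing it is meant to show is that two messages with the same words in a different order look identical to a filter of this kind.

```python
import re

def tokens(message):
    """Return the set of distinct lowercase words in a message.

    Hypothetical helper: only presence is recorded, so word order,
    grouping, and repetition are all discarded.
    """
    return frozenset(re.findall(r"[a-z0-9$'-]+", message.lower()))

a = tokens("Buy cheap pills now")
b = tokens("now pills cheap buy BUY")
print(a == b)  # both messages reduce to the same set of words
```

    Nothing here depends on English: any language whose words the regular expression can split would reduce to a set in the same way, provided the reference corpus is in that language too.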
  • by mattmunz ( 256529 ) on Friday August 16, 2002 @04:30PM (#4085288)
    Not only is this a great idea, it goes way beyond spam. How about "delete-as-off-topic" or "delete-as-rtfm" buttons specific to a given mailing list? The same algorithm could be used for these cases.

    Take it a step further to organize your entire mailbox. How about "categorize-as-tech-support" or "categorize-as-jboss-related". Many of us already push our email around into folders for the purpose of organization. I can't see why this algorithm can't be used to assist that process as well.

    The power of this system is that it is feedback-based. The software uses known science (statistics) to mold itself to your own preferences, by paying attention to the input that you have to make to use the application in the first place.

    Why do you think there are businesses whose sole function is to track and to report on the input people make to the various machines in their lives (computer/websites/tv/etc.)? This information is powerful and we need more examples of the ethical use of it. Note that his system is completely "individual" and doesn't require sharing user input with others through a central server.

    I haven't read the entire article, but I really think this is a great idea.
  • by NoInfo ( 247461 ) on Friday August 16, 2002 @04:40PM (#4085370) Homepage Journal
    2) Instead, append an entire dictionary wordlist to the end of your spam. Without correlating the words, this would be pretty destructive.
  • by RevAaron ( 125240 ) <revaaron AT hotmail DOT com> on Friday August 16, 2002 @04:47PM (#4085429) Homepage
    I'm not sure if I'd characterize Haskell as an aborted brain child. Some people use Haskell. Some people like it. At a lot of schools in the US at least, they teach Scheme, when all the students/faculty have "accepted" C, C++, and Java as "superior" for teaching. Which is blatantly bullshit. Algol-kid languages suck, we all know that. (heh, couldn't help it) But the point still stands.
  • by incog8723 ( 579923 ) on Friday August 16, 2002 @05:32PM (#4085736)
    Maybe the concept of a P2P network could be harnessed in order to fight spam. For each spam tagged as actual spam by a real human, by a ridiculously large CRC (1024 bit or something--to rule out possibly tagging innocent mail), the CRC could be traded via the P2P network. Automatic updating, almost instantly. A client could be written in about 2k of code.
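
    A rough Python sketch of the fingerprint-trading idea might look like the following. Note the substitutions: it uses SHA-256 from the standard `hashlib` module rather than a 1024-bit CRC, the whitespace and case normalization is an assumption, and `SharedBlocklist` is a hypothetical in-memory stand-in for whatever the P2P network would actually exchange.

```python
import hashlib

def fingerprint(body):
    """Digest of a message body after an assumed normalization step."""
    normalized = " ".join(body.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class SharedBlocklist:
    """Hypothetical stand-in for the fingerprints that peers would trade."""

    def __init__(self):
        self._digests = set()

    def report(self, body):
        # Called when a human tags a message as spam.
        self._digests.add(fingerprint(body))

    def merge(self, peer_digests):
        # Called when fingerprints arrive from another peer.
        self._digests.update(peer_digests)

    def is_known_spam(self, body):
        return fingerprint(body) in self._digests

blocklist = SharedBlocklist()
blocklist.report("Make  MONEY fast")
print(blocklist.is_known_spam("make money   fast"))
```

    An exact-match digest like this only catches copies that are identical after normalization, so a spammer who varies each message slightly would slip past it.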

    Interacting with the email client would be another story, but just an idea.

    The only problem I can think of would be sabotage. Anyone could tag legitimate mass mailings as spam (such as a mailing list).

    Any comments on this idea?

  • by brw215 ( 601732 ) on Friday August 16, 2002 @07:07PM (#4086279) Homepage
    There are several classification techniques in the field of machine learning that are all more powerful than simple naive Bayes. In fact, in graduate school I built one [nyu.edu] that outperformed N.B. by a significant margin.
    If people want to claim a "great new idea" they should research what has been done in the field first.
  • by Conesus ( 148179 ) on Friday August 16, 2002 @08:50PM (#4086876) Homepage
    Ok, so the subject line looks like spam. But what I did was buy a domain (conesus.com [conesus.com]) and setup auto-forwarding on everything @ the conesus.com domain.

    Anytime someone asks for my e-mail address, it's their_business_name@conesus.com or their_personal_name@conesus.com.

    If I ever get spam from a certain address, I can block the address, and go to the site in question and change my address to something else.

    But the coolest part is if anybody sends a mass-email to me and my buds, they usually include a personal_message_to_me@conesus.com.
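
    The per-contact address scheme above can be sketched in a few lines of Python. Everything here is illustrative: `example.com` is a placeholder for the catch-all domain, and `alias_for`, `source_of`, and `should_forward` are hypothetical helpers, not part of any real mail server's configuration.

```python
DOMAIN = "example.com"  # placeholder; stands in for the catch-all domain

def alias_for(source):
    """Address to hand out to one business or person (hypothetical scheme)."""
    local = "".join(c if c.isalnum() else "_" for c in source.lower())
    return f"{local}@{DOMAIN}"

def source_of(recipient):
    """Recover which contact an incoming recipient address was given to."""
    return recipient.split("@", 1)[0].lower()

def should_forward(recipient, burned):
    """Forward unless this alias has been marked as leaked to spammers."""
    return source_of(recipient) not in burned

burned = {"shady_store"}
print(should_forward(alias_for("Shady Store"), burned))
print(should_forward(alias_for("Bank"), burned))
```

    Because the local part names the contact it was given to, any spam arriving at an alias also shows who passed the address along.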

  • by Christopher B. Brown ( 1267 ) <cbbrowne@gmail.com> on Friday August 16, 2002 @09:14PM (#4087011) Homepage
    No, the approach does not make any assumptions about words being constructed in English.

    The "foreign language" Spam that I get gets nicely refiled by Ifile [mit.edu] into my Spam/Foreign folder.

    That folder has a corpus of messages assortedly written in Han, French, Kanji, Korean, Finnish, Spanish, and Russian, and Ifile nicely recognizes that words in those languages provide evidence that messages seem most relevant to go into that folder.

    Ultimately, it all involves human classification:

    • Initially, the corpus must be "primed" with an initial set of messages that I classify into the various categories I want to distinguish between.
    • Some messages are processed by Ifile into an appropriate mail folder.

      I go through them, and read them, perhaps just browsing titles when I see that spam seems appropriately filed.

      By leaving the messages in the folder, I indicate that they were correctly filed, and should become part of the corpus.

    • Ifile drops some messages in the wrong folder.

      That then involves human intervention as I move the messages to where they should have been.

    Note that Ifile is useful for filing good messages, not merely for throwing away spam.

    Indeed, the more that you use Bayesian filtering for, the more folders with distinctive kinds of message that you have, the better it gets at discriminating where messages should go. I don't have one "Spam" folder; I've got about 8 for different sorts of spam. I don't have one 'inbox' for all my "good" mail; the mail gets thrown into a veritable huge chasm of mail folders. The more there are, the better.
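
    The prime, file, and correct loop described above can be sketched as a small multi-folder naive Bayes classifier in Python. This is not Ifile's actual code: `FolderClassifier`, the whitespace tokenization, the add-one smoothing, and the folder names in the usage lines are all assumptions chosen to keep the sketch short.

```python
import math
from collections import Counter, defaultdict

class FolderClassifier:
    """Toy multi-folder naive Bayes; a sketch, not Ifile's implementation."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.message_counts = Counter()          # folder -> messages seen

    def train(self, folder, text):
        # Used for priming, for confirming a correct filing, and for
        # correcting a wrong one: the human-chosen folder is always the label.
        self.word_counts[folder].update(text.lower().split())
        self.message_counts[folder] += 1

    def classify(self, text):
        words = text.lower().split()
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_msgs = sum(self.message_counts.values())
        best_folder, best_score = None, -math.inf
        for folder, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.message_counts[folder] / total_msgs)
            for w in words:
                # Add-one smoothing so unseen words do not zero out a folder.
                score += math.log((counts[w] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder

clf = FolderClassifier()
clf.train("Spam/Foreign", "gagnez argent rapidement")
clf.train("Lists/Lisp", "macro lambda closure")
print(clf.classify("lambda macro question"))
```

    Adding another folder is just another label passed to `train`, which is why the same mechanism serves for sorting good mail as well as for discarding spam.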

  • by kazbah ( 600283 ) on Friday August 16, 2002 @09:30PM (#4087097)

    I've had this theory for a long while on a technique that could be used to defeat spam once and for all. Despite what the author of this article states, trying to fight spam by analyzing the content is not going to defeat it, and as has been pointed out, there are many ways to work around that solution.

    Targeting the sending addresses, and most other techniques like that, simply leads to wars of one-upmanship as the spammer and spam fighter struggle to find better techniques to hide and detect spam, respectively.

    So what's the theory? Fairly simple, really, and the technology is already available, but not widely implemented. Spam largely suffers from an identity problem. Consider that junk mail that arrives in the post box can easily be identified and/or blocked through legal means if necessary, largely because we know where it comes from. The reason spam has proliferated is because SMTP traffic is largely anonymous - mail servers basically trust the mail they receive and have no real way to verify the information being presented to them. Yes, they can check From: and To: headers to verify that the email is local / remote / relay attempt, whatever. But with the number of open relays on the net, it's easy to forge and bypass these checks.

    By using SSMTP (SMTP over SSL), all email can be logged with identifying information from the original sender. If enough servers on the net start to support SSMTP, and increasingly mandate its use, eventually I'd be able to block all regular SMTP traffic. This has the added advantage of making email more secure.

    But how does this stop spam? Well, it doesn't directly stop spam, but it means that we would legitimately be able to identify who originally sent the email. Once that happens, the spammer can no longer hide behind anonymous gateways. It probably wouldn't even matter too much if open relays were accidentally left open - so long as the open relay didn't support SMTP but only supported SSMTP.

    Ideally, every user would require their own secure certs to properly identify the sender, but this would probably add too much cost for the average user, and may be rejected for privacy reasons. But so long as the mail servers themselves were configured this way, we would always be able to identify very quickly where the email was originally sourced, thus giving a recipient an easy place to target (and hence sue if it comes to that).
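
    As a rough illustration of the server-level idea, here is a Python sketch using the standard `ssl` module. It assumes that the sending server presents a certificate and that the receiving server refuses connections without one; `make_server_context` and `sender_identity` are hypothetical helper names, and certificate file paths are left out because they are site-specific.

```python
import ssl

def make_server_context():
    """TLS context for a mail server that refuses unauthenticated peers."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.verify_mode = ssl.CERT_REQUIRED  # the peer must present a certificate
    # A real deployment would also call ctx.load_cert_chain(...) and
    # ctx.load_verify_locations(...); those paths are omitted on purpose.
    return ctx

def sender_identity(peer_cert):
    """Pull the commonName out of a dict shaped like SSLSocket.getpeercert()."""
    for rdn in peer_cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return "unknown"

# Hand-written example dict, mimicking the structure getpeercert() returns.
example_cert = {"subject": ((("commonName", "mail.example.net"),),)}
print(sender_identity(example_cert))
```

    The identity recovered this way is that of the sending mail server, not of the individual user, which matches the trade-off described above.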

    As this takes off, it may actually be a way to make spam legitimate. The secure cert attached to the email could have an incentive allowing users to opt-in or opt-out automatically. A user could set their mail to say "yes, I'm willing to put up with ads if you're willing to pay me for it" putting the cost back on the person responsible for the spam in the first place - the advertiser.

    Anyway, it seems to me like a fairly simple way to solve this - but it does take a lot of co-operation to get there. Something that hasn't happened yet for IPv6, another new protocol that doesn't really seem to be getting off the ground. So what am I missing?
