
Wikipedia Used for Artificial Intelligence

Posted by Zonk
from the great-it-has-finally-become-self-aware dept.
eldavojohn writes "It may be no surprise but Wikipedia is now being used in the field of artificial intelligence. The applications for this may be endless. For instance, the front of spam fighting is a tough one and it looks as though researchers are now turning towards an ontology or taxonomy based solution to fight spammers. The concept is also on the forefront of artificial intelligence and progress towards an application passing the Turing Test and creating semantically aware applications. The article comments on uses of Wikipedia in this manner: '"... spam filters block all messages containing the word 'vitamin,' but fail to block messages containing the word B12. If the program never saw B12 before, it's just a word without any meaning. But you would know it's a vitamin," Markovitch said. "With our methodology, however, the computer will use its Wikipedia-based knowledge base to infer that 'B12' is strongly associated with the concept of vitamins, and will correctly identify the message as spam," he added.'"
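A toy sketch of the idea in the quote above, not the researchers' actual code: assume a small hand-made table (here called CONCEPT_WEIGHTS, a hypothetical name) that maps each term to weighted concepts, standing in for what a real system would derive from Wikipedia article text. Every term, concept, and number below is invented for illustration. Two texts count as related when their summed concept vectors point in a similar direction.

```python
import math

# Hypothetical term -> {concept: weight} table. In a real system these
# weights would be derived from Wikipedia articles; here they are invented.
CONCEPT_WEIGHTS = {
    "vitamin": {"Vitamin": 0.9, "Nutrition": 0.6},
    "b12": {"Vitamin": 0.8, "Nutrition": 0.4, "Aircraft": 0.1},
    "meeting": {"Business": 0.9},
}


def concept_vector(text):
    """Sum the concept weights of every known term in the text."""
    vector = {}
    for term in text.lower().split():
        for concept, weight in CONCEPT_WEIGHTS.get(term, {}).items():
            vector[concept] = vector.get(concept, 0.0) + weight
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[c] * b[c] for c in a if c in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no known terms on one side: treat as unrelated
    return dot / (norm_a * norm_b)


def relatedness(text_a, text_b):
    """Relatedness of two texts via their concept vectors."""
    return cosine(concept_vector(text_a), concept_vector(text_b))
```

With this table, relatedness("cheap B12 pills", "vitamin") comes out high even though the word "vitamin" never appears in the message, while relatedness("meeting moved", "vitamin") is zero. Whether a high score should mean "spam" is a separate question that the sketch does not answer.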
  • With the example of using Wikipedia for spam filtering as mentioned in the post, maybe more articles need to be written on spam-slang for Viagra....
  • by ILuvRamen (1026668) on Sunday January 07, 2007 @02:32PM (#17499076)
    Don't you think masses of spammers are going to screw with Wikipedia strategically, on purpose, so that it stops working properly for that once it starts to block them well? They should just stop being afraid of being called racist and super-filter every e-mail that comes out of South Korea, Indonesia, and especially Nigeria, etc. Type spam map into Google image search to see how blatantly obvious it is to see where the spam comes from. Something like 98% of spam can be pinned down to 0.01% of the world by square footage. If they added fuzzy logic instead of alterable AI and only blocked e-mails from South Korea with the word vitamin, and not ones from Nebraska with the word vitamin, then the problem would be decreased dramatically.
    • by WilliamSChips (793741) <full.infinity@NOspAm.gmail.com> on Sunday January 07, 2007 @02:50PM (#17499286) Journal
      You don't think there are hundreds of thousands of zombifiable computers in the United States? And what about people with business connections in China or Korea?
      • by ScentCone (795499) on Sunday January 07, 2007 @03:04PM (#17499416)
        You don't think there are hundreds of thousands of zombifiable computers in the United States?

        Um, so? That doesn't make it inappropriate to block traffic from places where the overwhelming majority of the packets are toxic. It's a system-by-system, admin-by-admin judgement call, but there's no question that Korea isn't doing nearly enough to stop this problem locally. If the local culture starts to realize that they're isolating themselves from large sections of the internet because they won't do something to prevent 99% of their outbound mail from being spam, then maybe the need to filter will also go away.

        And what about people with business connections in China or Korea?

        I have a lot of customers with contacts like that. All of them (their Asian contacts) use Yahoo, Gmail, and similar accounts specifically to avoid this problem. Businesses in China and Korea are totally aware that most ISPs in those areas have poisoned outbound SMTP relays and user desktops. Or, they host their western-facing mail servers with providers in the west - I see a lot of that, too, since many of those businesses have two separate messaging platforms for the different international audiences with whom they communicate.
        • Re: (Score:2, Insightful)

          by Gwwfps (912993)
          Um, so? That doesn't make it inappropriate to block traffic from places where the overwhelming majority of the packets are toxic.

          I would think that the majority of inbound mail those places get from say the US will be "toxic" as well. When legitimate traffic between two regions is scarce (like between places with differing languages and a large geographical separation), of course the spam will seem overwhelming by proportion.

          • by ScentCone (795499)
            I would think that the majority of inbound mail those places get from say the US will be "toxic" as well. When legitimate traffic between two regions is scarce (like between places with differing languages and a large geographical separation), of course the spam will seem overwhelming by proportion.

            Yup, good point. Which is why the same thing seems to be true to/from, say... Romania, etc., also.
        • by syousef (465911)
          That doesn't make it inappropriate to block traffic from places where the overwhelming majority of the packets are toxic.

          By turning the INTERnet into the HINDERnet, your effort will eventually make the Internet useless. You therefore destroy what you're trying to facilitate use of. Not clever.
          • by ScentCone (795499)
            By turning the INTERnet into the HINDERnet, your effort will eventually make the Internet useless. You therefore destroy what you're trying to facilitate use of. Not clever.

            You're missing the point. When the packets from entire Class B address ranges are, by empirical testing, almost entirely crap, the people who own those addresses have already broken their little corner of the internet. Preserving the non-poisoned portion of the wider network isn't "destroying the village to save it," it's just sort of li
            • by syousef (465911)
              Sorry but what a terrible analogy. Sound walls don't redirect traffic, they fix the problem of sound affecting nearby homes. You're mixing a traffic metaphor with a sound metaphor in a way that makes so little sense it's worse than bad - it's confusing.

              You definitely do destroy not only the village but a connected community of villages with your solution. What should be happening is bringing pressure to bear against those who have had the address space allocated to them, then moving up the supply chain. Ult
              • by ScentCone (795499)
                Sorry but what a terrible analogy. Sound walls don't redirect traffic, they fix the problem of sound affecting nearby homes. You're mixing a traffic metaphor with a sound metaphor in a way that makes so little sense it's worse than bad - it's confusing.

                You're working too hard at this. The sound walls are an undesirable but nevertheless somewhat effective treatment for the symptom of a larger problem. The analogy is apt.

                What should be happening is bringing pressure to bear against those who have had
                • by syousef (465911)
                  The analogy wasn't apt at all. It was awful. What you're advocating diminishes the internet. I'm suggesting you punish the administrators not just the end users. Take away their IP address allocation and give them to someone else who's willing to make proper use of them. Don't block IPs.
    • Re: (Score:2, Informative)

      by gradedcheese (173758)
      Most spam I get now looks to be from botnets rigged up using people's PCs here in the United States. Very little (in my inbox anyway) comes from the usual suspect geographical areas.
    • Re: (Score:3, Insightful)

      by Walt Dismal (534799)
      I agree that using Wikipedia opens up the knowledge base to strategic contamination. Any party with a vested interest could alter certain information and bias AIs using it. That is why I think the Israeli approach cited will run into problems.

      In my own research I've looked at the problem of AI knowledgebase contamination and know that unless a truth validation system is employed, it is all too easy to condemn the poor AI to reasoning with flawed data. And it's very difficult to design a good validation mec

    • Damn - that first sentence of yours took the words right out of my mouth. Unfortunately, I don't agree one iota with the rest of your post. But I'll just deal with the first point....

      I sure as hell hope that this approach fails miserably, because I can guarantee you that the next development will be the bot-based modification of all articles in the Wikipedia. There might be some development after that of captcha interstitials before posting or modifying anything, combined with some attempt at developing a m
      • by timeOday (582209)

        I sure as hell hope that this approach fails miserably, because I can guarantee you that the next development will be the bot-based modification of all articles in the Wikipedia. There might be some development after that of captcha interstitials before posting or modifying anything, combined with some attempt at developing a more permanent community around posters.

        What this argument boils down to is "I don't want computers to get smarter because I don't like some of the applications." Of course there's

        • What this argument boils down to is "I don't want computers to get smarter because I don't like some of the applications."

          Err, no. I have no idea where you got this idea from. What I actually don't like is weak attempts at improving the intelligence of computers. Furthermore, I like even less weak attempts at improving the intelligence of computers whose direct and inevitable consequence is the corruption of an incredibly useful resource, which in turn will lead to the corruption of the AI - the initial go

    • by Mr Chund Man (1013539) on Sunday January 07, 2007 @03:47PM (#17499774)
      Spam Map [postini.com]

      "South Korea, Indonesia, and especially Nigeria, etc"
      While we're at it, why not block Alberta, California, North Carolina, Virginia, Colorado, Oklahoma, Kansas, Vermont, New Hampshire, Massachusetts, Spain, France and Portugal - all spam hotspots according to the map cited? What's that, you receive email from people in these places? Tough titties, if we're to block email coming from spam hotspots as you say.

      Also, you've managed to point a finger of blame at Indonesia and Nigeria who are saintly in comparison to some more developed nations. Go racism!
    • by Incadenza (560402)

      Type spam map into Google image search to see how blatantly obvious it is to see where the spam comes from.

      Since you were modded 'interesting', I did exactly like you told and found this page: http://mailinator.com/mailinator/map.html [mailinator.com]. Refreshed it 3 times now, and every time at least 4 balloons are pointing at the US, one at Canada and 2 or 3 at European countries. Interesting indeed.

    • Something like 98% of spam can be pinned down to 0.01% of the world by square footage.

      A rough assessment of the last 30 days' spam stored on my server suggests more than 75% comes from the USA.

      A quick look at http://www.mailinator.com/mailinator/map.html [mailinator.com] shows clusters in the south (Memphis seems to be a hotspot) and on the east coast.

      I don't know about Korea, but blocking Tennessee, Missouri and Florida would cut my spam in half. Blocking the rest of the USA would reduce it by 75%.

  • by Bodrius (191265) on Sunday January 07, 2007 @02:35PM (#17499106) Homepage
    This isn't new to Slashdotters...

    For years, Slashdot posts have used wikipedia as a form of artificial intelligence.

    • A common tactic to defeat spam filters is to misspell words. The filters should look at the output of the Slashdot editors over the past decade to see what the common mistakes are.
  • by CRCulver (715279) <crculver@christopherculver.com> on Sunday January 07, 2007 @02:35PM (#17499108) Homepage

    Buy the federal phamacon regulatory agency's approved Be-12 from our licenced apotecaries! It's Be-12, the addition to your daily sustinence intake that makes it easier to just Be you!

    I suspect that any skilled spammer can work around such filters through circumlocution. Some of the penis spam I've been getting lately is really impressive in how oblique a reference to sex can be and yet still be immediately understandable.

  • wife at the devil, and the wife certainly cuckolds her husband. Whereas, house of Austria acquired the seventeen provinces, and by the latter, his from Leipsig, to which he refers in a subsequent one, and which I upon, than 'la pluie et le beau tens'.
    So which is it, Wikipedia? Should I open the big image attachment?
    • by Halo1 (136547) <jonas.maebe@elis ... .be minus author> on Sunday January 07, 2007 @03:02PM (#17499398) Homepage

      I recently got a quite funny attempt like that, pumping some stock in the image attachment (which moreover looked like a captcha in order to avoid OCR). The title of the spam was however "cocaine inexcusable", and the body, well (just two sample quotes -- and yes, the two first sentences appeared together like that):

      We are working with Internet Content Rating Association to make the internet safer for children. Powered by a super strong Japanese motor and gears this incredibly powerful anal probe will hit the spot every time.
      The Blue Rocket is a handy little clit massager that packs a mighty punch.

      Needless to say, it triggered the Bayesian filter pretty heavily in spite of all the obfuscation attempts :)

      • We are working with Internet Content Rating Association to make the internet safer for children. Powered by a super strong Japanese motor and gears this incredibly powerful anal probe will hit the spot every time.
        The Blue Rocket is a handy little clit massager that packs a mighty punch.

        Want to see where their spider got this stuff?

        The safe for children crap [bionictonic.co.uk] (since reworded)
        The Intimate Intruder Anal Probe [bionictonic.co.uk]
        The Wrist Rocket [bionictonic.co.uk]

    • A good sample of the fake content that spam engines create. It seems intuitively obvious to me that this text is completely meaningless, but getting an AI to understand why is much trickier. Clues come from the fact that "latter" is used incorrectly (there being no "former" to distinguish it from), the pronoun "his" refers to no subject, the comparative "than" doesn't compare two subjects, etc.

      Unfortunately, humans make these sorts of semantic errors all the time. We're just extending a Bayesian filter to make a statem
  • i prefer (Score:5, Funny)

    by macadamia_harold (947445) on Sunday January 07, 2007 @02:38PM (#17499140) Homepage
    For instance, the front of spam fighting is a tough one and it looks as though researchers are now turning towards an ontology or taxonomy based solution to fight spammers.

    I think it would be much more effective if we used a taxidermy-based solution to fight spammers.
  • It's not the words that the spam filter can't recognize that let spam get through, it's the increasing use of image spam. OCR and existing filters would do more to solve spam than would wiki-AI intelligent filters.

    Of course, the minute anti-spam software/services use OCR is the minute that spam images start looking like captchas.
    • Hmm, so what's actually happening is that the spammers are coercing the spam-filter writers to create good enough OCR so that the spammers can turn around and use that to circumvent the captchas on the www. Talk about a devious ploy! We're fucked.
  • by tcopeland (32225) <(tom) (at) (thomasleecopeland.com)> on Sunday January 07, 2007 @02:44PM (#17499210) Homepage
    And all this time you thought it was just if and switch statements!

    Whenever someone claims that a program is semantically aware, be sure to reread Clay Shirky's article [shirky.com] on the Semantic web.

  • Future trends... (Score:3, Interesting)

    by creimer (824291) on Sunday January 07, 2007 @02:46PM (#17499226) Homepage
    Artificial Intelligence may evolve to the point that it may decide to rewrite Wikipedia from a human-centric point of view to an AI-centric point of view (i.e., World War II resulted in the deaths of six million AIs). Since people will believe anything and Wikipedia can't be wrong, it'll be one step towards the formation of the Matrix. After all, only the victors write history.
  • B12 which is a vitamin which is also known to increase your health which your aunt sally sends you messages regularly on, so great, all messages from aunt sally are now blocked.
    • Please excuse my dear aunt sally.
      • by nelsonal (549144)
        Ha ha, I guess that's a pretty effective mnemonic (the firefox spell checker is the bees knees). I remembered that it was one, and remembered it, but had to google what it was supposed to be reminding me (even though I apply the order of operations nearly every day).
    • Re: (Score:2, Interesting)

      by CoderDog (782544)
      Presumably, Aunt Sally will be in your white-list and be passed through whether she's tipping you off to startling new developments for Viagra, or B-12. Most of the anti-spam work is done in an effort to avoid building mammoth personal black-lists of mostly short-lived addresses. I doubt we'll get rid of white-lists anytime soon, if ever.

      What would impress me is an AI that filtered spam very effectively, but also noticed that Aunt Sally had a new email address and continued to deliver her mail.
    • And just because your Aunt Sally doesn't want to receive spam about vitamins doesn't mean she wants to miss her weekly Bingo e-mails.
  • UMMMM wordnet? (Score:4, Informative)

    by Anonymous Coward on Sunday January 07, 2007 @02:50PM (#17499280)
    this kind of technique has been used for a while..

    http://wordnet.princeton.edu/ [princeton.edu]

    and according to my source of AI, wikipedia http://en.wikipedia.org/wiki/WordNet [wikipedia.org]
    (like all sophisticated software) has been in development since the mid eighties..

    WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing
  • Since when (Score:4, Insightful)

    by trifish (826353) on Sunday January 07, 2007 @02:54PM (#17499316)
    Since when does a database + automated search (keyword patterns and relations) = artificial intelligence?
    • You have just described Data Mining.
    • Re:Since when (Score:5, Informative)

      by timeOday (582209) on Sunday January 07, 2007 @03:54PM (#17499844)
      Since when does a database + automated search (keyword patterns and relations) = artificial intelligence?
      What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?
      • Re: (Score:3, Insightful)

        by maxwell demon (590494)
        What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?

        The creative part?
        • Re: (Score:3, Interesting)

          by timeOday (582209)
          Maybe creative people just detect more abstract patterns (e.g. lower S/N ratio) than others?
      • Re: (Score:3, Informative)

        by sacrilicious (316896)
        What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?

        Paraphrasing to make a point: What part of computing is not detecting, storing, and applying patterns and relations?

        To be meaningful, "AI" should denote more than (as the article summary indicates is being done) doing a grep through a web repository to deduce associations. There are branches of AI founded on brain neurology (neural nets), evolution (Genetic Algorithms), Bayesian logic, and various oth

      • by naoursla (99850)
        It's funny how AI is a moving target. Once we are able to reduce, explain, and understand how some aspect of AI works, many people no longer consider it AI.
      • by trifish (826353)
        > What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?

        A red herring comment modded +5 Insightful? *Shakes head*

        The keyword is part of intelligence. For instance, storing data is only a part of the "ability" called intelligence. By your logic anyone who is capable of storing is capable of artificial intelligence. However, the system advertised in this "article" has only parts of artificial intelligence. And those parts are considered rather trivial in CS.

        S
    • Re: (Score:3, Interesting)

      by Kjella (173770)
      Well, most of the definitions of artificial intelligence go "intelligence by something artificial", then we're down to intelligence, which is so fuzzily defined almost anything can be applied. The first definition of intelligence on Wikipedia focuses on individuality, which in other words says it's a bunch of skills rolled up into one. The other is even fuzzier. Quote WP:

      A second definition of intelligence comes from "Mainstream Science on Intelligence", which was signed by 52 intelligence researchers in 1994:
    • by Alef (605149)
      That is the thing with artificial intelligence research. So long as the concepts are understood only by researchers, people call it AI and regard it as something mysterious, but as soon as it gets useful applications and reaches the public it becomes "just statistics" or "business rule engines" or something similar. What you describe is data mining, a concept on the verge of entering the public mind.
      • what intelligence is is a difficult question to answer.

        personally i'd say it's the ability to solve problems WITHOUT having been designed to solve those problems and the ability to see opportunities for improvement in the current way of doing things.

        cats live in our homes, foxes roam in our cities. neither of those animals was designed for those environments, nor have they had time for significant biological evolution, yet they find ways to manage in those environments.

        and we have in a couple of centuries gone
    • by coaxial (28297)
      That's not how Wikipedia is being used. It's being used as a reservoir for semantic information. You want to know if these two consecutive tokens are a name? Check Wikipedia. Biographies are clearly labeled. Want to know if this token is a country? Check Wikipedia. Want to know terms associated with the War of 1812? Check Wikipedia. It's a data corpus made up of human-annotated terms, and that's why it's valuable.
      • by trifish (826353)
        You call that "artificial intelligence", I call that a database. I don't think we should continue this discussion. Do your homework on AI first. Bye.
        • by coaxial (28297)
          You have no idea what you're talking about. If you did, you wouldn't be trying to conflate a data corpus and an algorithm. Also, if you had done the least bit of research into AI, and in this case information retrieval, you'd know just how simple real AI really is. I hate to tell you this. But AI is pretty much just simple search and table lookups. There's no magic, dude. None whatsoever. So I guess in that sense, it is like magic. It looks cool and amazing when you don't know how it's done, but wh
  • by D4C5CE (578304) on Sunday January 07, 2007 @02:58PM (#17499358)
    However many academic papers and spam filters throw their ever-more-elaborate algorithms at this issue, it is an arms race that cannot be won by the "good guys", as long as lawmakers keep pretending that technology alone could prevent the effects of sociopathic behavior: unsolicited bulk messages won't go away unless sending them is severely punishable and vigorously prosecuted in all nations that contribute to the problem. This should have happened more than a decade ago, but now the world is simply running out of storage, bandwidth and CPU cycles much too quickly to afford waiting another decade (or even a year) for serious, intransigent anti-spam legislation that is long overdue.
  • All these word relation AI's make me laugh. We could have real AI if you wanted to put effort into it. Link [geocities.com]
    • The ironic part is that when I went to click on the link, the Geocities account was already dead. And yet I didn't need to read the page to understand that the author was a crank. That's the thing about intelligence that nobody has ever managed to capture in a formal system.
  • Do they substitute numbers for letters in their filtering?
  • Given that the link distance between randomly chosen wikipedia articles is about five (sorry, don't have a link to where I saw this... and it was a while ago so maybe it's changed...) practically everything is going to be strongly associated with spam keywords.

    I don't see how this is getting us anywhere except moving closer to having a spam filter that just returns "true" to anything that isn't white-listed.
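The "link distance" worry above can be made concrete with a breadth-first search over a link graph. This is only a sketch on a made-up miniature graph (LINKS and its article names are hypothetical, not real Wikipedia data); it shows how the hop count between two articles would be measured, which a filter could in principle use to discount far-away associations.

```python
from collections import deque

# Hypothetical miniature link graph: article -> articles it links to.
LINKS = {
    "Vitamin B12": ["Vitamin", "Anemia"],
    "Vitamin": ["Nutrition"],
    "Anemia": ["Blood"],
    "Nutrition": ["Food"],
    "Blood": [],
    "Food": ["Spam (food)"],
    "Spam (food)": [],
}


def link_distance(graph, start, goal):
    """Return the minimum number of link hops from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        article, hops = queue.popleft()
        for neighbour in graph.get(article, []):
            if neighbour == goal:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None  # goal is not reachable by following links
```

If nearly every pair of articles really is only a handful of hops apart, raw reachability says little, so any practical scheme would have to weight short paths much more heavily than long ones.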

  • by MarkWatson (189759) on Sunday January 07, 2007 @03:21PM (#17499568) Homepage
    I will read the paper when I get the proceedings for the International Joint Conference for Artificial Intelligence. From the article, this seems like a statistical natural language processing application: the examples looked like they collect statistics of associations for both single word and short word sequences.

    BTW, associating, clustering, etc. documents using single word statistics is computationally cheap and easy - it is also associating short word sequences that makes this a difficult problem.
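To illustrate the single-word versus short-sequence point, here is a minimal sketch that counts n-grams in a token list (ngram_counts is a hypothetical helper name, and the sample sentence is invented). Counting is easy at any n; the difficulty is that the number of possible sequences grows roughly as the vocabulary size raised to the power n, so the statistics become much sparser.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented sample text, just to compare table sizes.
tokens = "buy vitamin b12 now buy vitamin c now".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
```

Even in this eight-word sample the bigram table has more distinct entries than the unigram table, and almost all of them occur only once.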
    • by saddino (183491)
      The computational effort for short word sequences is no longer much of an issue. For example, the web clustering algorithm in the free application CQ web [q-phrase.com] computes clusters in corpus phrases up to seven words in length, and it runs without a hiccup on your standard Windows or Mac desktop.
  • by iamacat (583406) on Sunday January 07, 2007 @03:28PM (#17499632)
    There are lots of legit e-mails discussing vitamins, Viagra or even penis enlargement, this post included.
  • by Sub Zero 992 (947972) on Sunday January 07, 2007 @03:42PM (#17499736) Homepage
    Anybody who has been working in the field of NLP (natural language processing) can do little more than sneer at this story.

    The field of word sense exploration is one of the more mature areas of NLP, take a look at Princeton's WordNet database for an example [http://wordnet.princeton.edu/]. Using their word sense database (without referring to silly words such as "ontology") it has been possible - for years - to discover if two lemmas (that's "words" to you) are related in a particular way, or not related. Using WordNet it is possible to distinguish between antonyms and homonyms, thereby thwarting spammers who use words which sound like "viagra" - "niagra" and words which have opposite meanings.
    • Anybody who has been working in the field of NLP (natural language processing) can do little more than sneer at this story.

      Along with the title, that is one of the most useless comments one finds in /.

      It is news to many of us —the great majority of readers I dare say— because we are nerds that come from different fields. I bet I could come up with common knowledge from cellular telephony that you haven't heard about and it would be news to you. If it was sufficiently interesting, it would even be newsworthy even if it's been kicked around base stations for 4 years.

      You make it sound like you have deeper knowledg

      • by dodongo (412749)

        I would be very interested in hearing about how are they going to use the general knowledge of the wiki to filter out advertisement. For instance, let's say that an email that contains B12 is talking about a plane and not the vitamin, what other elements should the program take into account to distinguish this?

        I do think you may have been a bit harsh on grandparent; I for one, having done some work in NLP, was wondering whether anyone else was really questioning the newsworthiness of the post. So you can,

  • This is a little off-topic, but I guess the only way to take out this menace of spam is to make the average joe accountable.
    If the spam originated from a botnet in his machine, make him accountable too.

    If he has installed the latest updates from Microsoft and still the botnet could get in, then it is not an issue. But, if he has not taken the effort to download the patches for say, the last 6 months, and a botnet operated from his machine, causing discomfiture to all and sundry, then he is accountable for i
  • http://threeseas.net/abstraction_physics.html [threeseas.net]

    considering the article is from physorg......

    and to think they plan to patent it? Abstraction Physics?

    I don't think so...
  • A knowledge base with associative retrieval capability has eluded researchers but they have one in Wikipedia. Now if only they can get AI to successfully [and hopefully, correctly] modify the knowledge base...
    • by kalirion (728907)
      Something like wikipedia will definitely be needed for people attempting to create true AI. The best part is that it can be easily gotten on CD (or is it DVD?), so the computer with the AI can be completely isolated from the outside world. You know, to avoid the Skynet scenario.
  • Hutter Prize (Score:3, Informative)

    by Baldrson (78598) * on Sunday January 07, 2007 @04:26PM (#17500122) Homepage Journal
    As has been previously reported on slashdot, The Hutter Prize for Lossless Compression of Human Knowledge [slashdot.org] uses a snapshot of Wikipedia for rigorously benchmarking AI (and it has already had its first payout [slashdot.org]).

    The rigor of the benchmark is the key. The Turing Test really only benchmarks human mimicry -- not intelligence per se. The new theoretic basis of universal intelligence [hutter1.net] allows a mathematically rigorous approach to AI that is reviving the field after nearly 50 years of drifting in a stagnant pool of inadequate concepts.

  • This is the biggest threat to Wikipedia I've heard in a long time.

    If Wikipedia content is used to determine whether a message is spam, suddenly there is a direct incentive to spammers to add spam-related content to Wikipedia.
    • by sbaker (47485) *
      In the particular example given, a spammer trying to sell Vitamins using the word 'B12' would have a strong incentive to scan Wikipedia and remove all instances of the word 'B12' wherever it was found - and perhaps even to insert it spuriously in a few places where the end user might be white-listing words too.

      This would be very bad indeed for Wikipedia because it gives a motive to vandals - and not just to the stupid vandals we have right now - but to the annoyingly inventive ones too.

      Urgh!
    • This is the biggest threat to Wikipedia I've heard in a long time. If Wikipedia content is used to determine whether a message is spam, suddenly there is a direct incentive to spammers to add spam-related content to Wikipedia.

      Personally, I think spammers are already much smarter than this. It may be my imagination, but if so it's surely coming, that spammers are grabbing text from places they harvest my name and just including that text in messages rather than trying to make up things from scratch. S

  • Conceptual processing is the ONLY way to deal with these issues.

    For example, what if I'm getting information sent to me from acquaintances about life extension - references to vitamins and nutrients would abound. But it wouldn't be spam.

    An AI spam blocker has to know what I'm interested in, what material I've received before that was cleared, AND has to be able to, in some sense, UNDERSTAND the content rather than just correlating it to other terms atomically in terms of frequency of occurrence. Otherwise,
  • by Alsee (515537)
    the computer will use its Wikipedia-based knowledge base to infer that 'B12' is strongly associated with the concept of vitamins, and will [] identify the message as spam

    Ha Ha! Blocked!

    You didn't sink my battleship!

    -
  • Text of IJCAI paper (Score:3, Informative)

    by gvc (167165) on Sunday January 07, 2007 @09:26PM (#17502862)
    http://www.ijcai.org/papers07/Papers/IJCAI07-259.pdf [ijcai.org]

    While IJCAI is a prestigious conference, and the results may be sound, the claims as to the applicability to spam filtering are bogus. The paraphrase of how state-of-the-art filters work is wrong, and there's no evidence that better word associations translate to better spam filter accuracy. None at all.

    Should the authors wish to show applicability to spam filtering, they should do so using the TREC Spam Track methodology and datasets. http://trec.nist.gov/data/spam.html [nist.gov]

    The call for participation in TREC 2007 is currently open: http://trec.nist.gov/call07.html [nist.gov] Nothing at all prevents a TREC participant from submitting a filter that includes a copy of Wikipedia, if they feel it would help.
  • Seriously. FWIW, I am for the most part a Google fanboy.

    I have had my GMail account for what, two years or so, and I really don't think google's spamfilter has ever missed a beat. That is to say that all the real spam I receive every day (~40 to 100 spams depending on the day) ends up in the spam folder, not my inbox. Spam is a total non-issue for me. OTOH, my hotmail inbox is so atrocious and the spamfilter so bad that I can't use the account for anything important. I don't know what kind of black magic
  • by mikeee (137160)
    Obviously, Judgement Day will be triggered by Skynet in a final, frustrated attempt to eliminate spammers.
  • The word "vitamin" in a message means it is spam? Methinks that the intelligence should be applied to better tests for what is spam rather than simple-minded collecting of associated terms for hot words from various online sources. Bayesian filters are much better than this already and do not require Wikipedia reading to do their jobs with 99% accuracy after fairly minimal training.
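For readers who have not seen one, a Bayesian filter of the kind mentioned above can be sketched in a few lines. This is a minimal naive Bayes classifier with add-one smoothing, trained on four invented messages; the class name NaiveBayesFilter and the training data are hypothetical, and real filters use far more careful tokenization and much larger corpora. It makes no claim about the accuracy figure quoted in the comment.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Tiny naive Bayes spam filter with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, label, text):
        """Record one message under the label 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text); returns 0.5 when nothing was trained."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        if not vocabulary:
            return 0.5
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            # Smoothed class prior, so an untrained class never gives log(0).
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocabulary)))
            log_scores[label] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


spam_filter = NaiveBayesFilter()
spam_filter.train("spam", "cheap vitamin pills buy now")
spam_filter.train("spam", "buy cheap pills now")
spam_filter.train("ham", "aunt sally sent vitamin advice")
spam_filter.train("ham", "lunch meeting moved to noon")
```

With this toy training set, "buy cheap pills" scores as likely spam while "aunt sally vitamin advice" does not, even though it contains the word "vitamin": the decision comes from all the words together, not from one hot word.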
