Wikipedia Used for Artificial Intelligence

Posted by Zonk on Sunday January 07, 2007 @02:25PM from the great-it-has-finally-become-self-aware dept.

eldavojohn writes "It may be no surprise but Wikipedia is now being used in the field of artificial intelligence. The applications for this may be endless. For instance, the front of spam fighting is a tough one and it looks as though researchers are now turning towards an ontology or taxonomy based solution to fight spammers. The concept is also on the forefront of artificial intelligence and progress towards an application passing the Turing Test and creating semantically aware applications. The article comments on uses of Wikipedia in this manner: '"... spam filters block all messages containing the word 'vitamin,' but fail to block messages containing the word B12. If the program never saw B12 before, it's just a word without any meaning. But you would know it's a vitamin," Markovitch said. "With our methodology, however, the computer will use its Wikipedia-based knowledge base to infer that 'B12' is strongly associated with the concept of vitamins, and will correctly identify the message as spam," he added.'"
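
The inference Markovitch describes can be illustrated with a toy sketch. It assumes a tiny hand-written stand-in for Wikipedia (`ARTICLES`) and a plain cosine measure over per-article word frequencies; both are hypothetical choices made for illustration and are not taken from the researchers' implementation.

```python
import math
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles: title -> article text.
ARTICLES = {
    "Vitamin": "vitamin nutrient b12 supplement deficiency vitamin c",
    "Vitamin B12": "b12 vitamin cobalamin supplement",
    "Niagara Falls": "waterfall river ontario tourism",
}

def concept_vector(word):
    """Map a word to {article title: relative frequency of the word there}."""
    vec = {}
    for title, text in ARTICLES.items():
        tokens = text.split()
        count = Counter(tokens)[word]
        if count:
            vec[title] = count / len(tokens)
    return vec

def relatedness(a, b):
    """Cosine similarity between the concept vectors of two words."""
    va, vb = concept_vector(a), concept_vector(b)
    dot = sum(weight * vb.get(title, 0.0) for title, weight in va.items())
    norm_a = math.sqrt(sum(x * x for x in va.values()))
    norm_b = math.sqrt(sum(x * x for x in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

# "b12" and "vitamin" share articles, so they score high;
# "b12" and "waterfall" share none, so they score zero.
print(relatedness("b12", "vitamin"))
print(relatedness("b12", "waterfall"))
```

A filter built this way would treat an unseen token as suspicious when its relatedness to a known spam concept exceeds some threshold; where to set that threshold is exactly the false-positive trade-off raised in the comments.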

This discussion has been archived. No new comments can be posted.

Wikipedia needs work for spam filtering.... (Score:2, Insightful)

by MoHaG ( 1002926 ) writes:

With the example of using Wikipedia for spam filtering as mentioned in the post, maybe more articles need to be written on spam-slang for Viagra....
- - Re:Wikipedia needs work for spam filtering.... (Score:5, Insightful)
    
    by Metasquares ( 555685 ) writes: <{moc.derauqsatem} {ta} {todhsals}> on Sunday January 07, 2007 @04:31PM (#17500176) Homepage
    
    Infer too much and the false positive rate skyrockets, though...
    
uh oh, there goes wikipedia (Score:4, Interesting)

by ILuvRamen ( 1026668 ) writes: on Sunday January 07, 2007 @02:32PM (#17499076)

Don't you think masses of spammers are going to screw with Wikipedia strategically, on purpose, so that it stops working properly for this once it starts blocking them well? They should just stop being afraid of being called racist and super-filter every e-mail that comes out of South Korea, Indonesia, and especially Nigeria, etc. Type spam map into Google image search to see how blatantly obvious it is where the spam comes from. Something like 98% of spam can be pinned down to 0.01% of the world by square footage. If they added fuzzy logic instead of alterable AI and only blocked e-mails from South Korea with the word vitamin, and not ones from Nebraska with the word vitamin, then the problem would be decreased dramatically.

- Re:uh oh, there goes wikipedia (Score:5, Insightful)
  
  by WilliamSChips ( 793741 ) writes: <full...infinity@@@gmail...com> on Sunday January 07, 2007 @02:50PM (#17499286) Journal
  
  You don't think there are hundreds of thousands of zombifiable computers in the United States? And what about people with business connections in China or Korea?
  
  - Re:uh oh, there goes wikipedia (Score:5, Interesting)
    
    by ScentCone ( 795499 ) writes: on Sunday January 07, 2007 @03:04PM (#17499416)
    
    You don't think there are hundreds of thousands of zombifiable computers in the United States?
    
    Um, so? That doesn't make it inappropriate to block traffic from places where the overwhelming majority of the packets are toxic. It's a system-by-system, admin-by-admin judgement call, but there's no question that Korea isn't doing nearly enough to stop this problem locally. If the local culture starts to realize that they're isolating themselves from large sections of the internet because they won't do something to prevent 99% of their outbound mail from being spam, then maybe the need to filter will also go away.
    
    And what about people with business connections in China or Korea?
    
    I have a lot of customers with contacts like that. All of them (their Asian contacts) use Yahoo, Gmail, and similar accounts specifically to avoid this problem. Businesses in China and Korea are totally aware that most ISPs in those areas have poisoned outbound SMTP relays and user desktops. Or, they host their western-facing mail servers with providers in the west - I see a lot of that, too, since many of those businesses have two separate messaging platforms for the different international audiences with whom they communicate.
    
    - Re: (Score:2, Insightful)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
      - Re: (Score:2)
        
        by ScentCone ( 795499 ) writes:
        
        I would think that the majority of inbound mail those places get from, say, the US will be "toxic" as well. When legitimate traffic between two regions is scarce (like between places with differing languages and a large geographical separation), of course the spam will seem overwhelming by proportion.
        
        Yup, good point. Which is why the same thing seems to be true to/from, say... Romania, etc. also.
    - Re: (Score:2)
      
      by syousef ( 465911 ) writes:
      
      That doesn't make it inappropriate to block traffic from places where the overwhelming majority of the packets are toxic.
      
      Turning the INTERnet into the HINDERnet, your effort will eventually make the Internet useless. You therefore destroy what you're trying to facilitate use of. Not clever.
      - Re: (Score:2)
        
        by ScentCone ( 795499 ) writes:
        
        Turning the INTERnet into the HINDERnet, your effort will eventually make the Internet useless. You therefore destroy what you're trying to facilitate use of. Not clever.
        
        You're missing the point. When the packets from entire Class B address ranges are, by empirical testing, almost entirely crap, the people who own those addresses have already broken their little corner of the internet. Preserving the non-poisoned portion of the wider network isn't "destroying the village to save it," it's just sort of li
        
        Re: (Score:2)
        
        by syousef ( 465911 ) writes:
        
        Sorry but what a terrible analogy. Sound walls don't redirect traffic, they fix the problem of sound affecting nearby homes. You're mixing a traffic metaphor with a sound metaphor in a way that makes so little sense it's worse than bad - it's confusing.
        
        You definitely do destroy not only the village but a connected community of villages with your solution. What should be happening is bringing pressure to bear against those who have had the address space allocated to them, then moving up the supply chain. Ult
        
        Re: (Score:2)
        
        by ScentCone ( 795499 ) writes:
        
        Sorry but what a terrible analogy. Sound walls don't redirect traffic, they fix the problem of sound affecting nearby homes. You're mixing a traffic metaphor with a sound metaphor in a way that makes so little sense it's worse than bad - it's confusing.
        
        You're working too hard at this. The sound walls are an undesirable but nevertheless somewhat effective treatment for the symptom of a larger problem. The analogy is apt.
        
        What should be happening is bringing pressure to bear against those who have had
        
        Re: (Score:2)
        
        by syousef ( 465911 ) writes:
        
        The analogy wasn't apt at all. It was awful. What you're advocating diminishes the internet. I'm suggesting you punish the administrators not just the end users. Take away their IP address allocation and give them to someone else who's willing to make proper use of them. Don't block IPs.
        
        Re: (Score:2)
        
        by syousef ( 465911 ) writes:
        
        You shouldn't be targeting geography at all. NY or Korea, it makes no difference, some businesses may have a legitimate need to communicate with someone at a particular geography. The Internet's beauty is that with few exceptions (shipping costs, time zones, legislation) you don't even need to worry about someone's physical location.
        
        I'm not suggesting you block a nation. I'm suggesting you strike a deal with someone else in that country to provide the same addresses, on pain of losing them if they can't con
    - - Re: (Score:2)
        
        by ScentCone ( 795499 ) writes:
        
        interestingly, most of the Nigerian scam email i receive use Yahoo accounts, and Yahoo certainly hasn't done much to police them, so I think your point is kinda silly.
        
        also, having looked at enough email headers from spammers, while they may originate from some of those countries you mentioned, i notice many use accounts like Yahoo and gmail from U.S. servers, which shoots your whole theory down.
        
        But, it's not a theory. I'm talking about what I actually see in logs and message queues, especially on rece
- Re: (Score:2, Informative)
  
  by gradedcheese ( 173758 ) writes:
  
  Most spam I get now looks to be from botnets rigged up using people's PCs here in the United States. Very little (in my inbox anyway) comes from the usual suspect geographical areas.
- Re: (Score:3, Insightful)
  
  by Walt Dismal ( 534799 ) writes:
  
  I agree that using Wikipedia opens up the knowledge base to strategic contamination. Any party with a vested interest could alter certain information and bias AIs using it. That is why I think the Israeli approach cited will run into problems.
  In my own research I've looked at the problem of AI knowledgebase contamination and know that unless a truth validation system is employed, it is all too easy to condemn the poor AI to reasoning with flawed data. And it's very difficult to design a good validation mec
  - - Re: (Score:2)
      
      by FooAtWFU ( 699187 ) writes:
      
      Maybe the AI is working from a local copy of the Wikipedia database [wikimedia.org] that isn't vulnerable to live vandalism or anything silly like that. And maybe Wikipedia spammers are more interested in a) putting links to their sites at the bottom of articles to boost PageRank and to capture the attention of random viewers or b) putting biased promotional material and other advertisements in a relevant page. And maybe this is likely to be far more attractive of an option than spamming Wikipedia in irrelevant plac
- Re: (Score:2)
  
  by NeutronCowboy ( 896098 ) writes:
  
  Damn - that first sentence of yours took the words right out of my mouth. Unfortunately, I don't agree one iota with the rest of your post. But I'll just deal with the first point....
  
  I sure as hell hope that this approach fails miserably, because I can guarantee you that the next development will be the bot-based modification of all articles in the Wikipedia. There might be some development after that of captcha interstitials before posting or modifying anything, combined with some attempt at developing a m
  - Re: (Score:2)
    
    by timeOday ( 582209 ) writes:
    
    I sure as hell hope that this approach fails miserably, because I can guarantee you that the next development will be the bot-based modification of all articles in the Wikipedia. There might be some development after that of captcha interstitials before posting or modifying anything, combined with some attempt at developing a more permanent community around posters.
    What this argument boils down to is "I don't want computers to get smarter because I don't like some of the applications." Of course there's
    - Re: (Score:2)
      
      by NeutronCowboy ( 896098 ) writes:
      
      What this argument boils down to is "I don't want computers to get smarter because I don't like some of the applications."
      Err, no. I have no idea where you got this idea from. What I actually don't like is weak attempts at improving the intelligence of computers. Furthermore, I like even less weak attempts at improving the intelligence of computers whose direct and inevitable consequence is the corruption of an incredibly useful resource, which in turn will lead to the corruption of the AI - the initial go
- Re:uh oh, there goes wikipedia (Score:5, Interesting)
  
  by Mr Chund Man ( 1013539 ) writes: on Sunday January 07, 2007 @03:47PM (#17499774)
  
  Spam Map [postini.com]
  
  "South Korea, Indonesia, and especially Nigeria, etc"
  While we're at it, why not block Alberta, California, North Carolina, Virginia, Colorado, Oklahoma, Kansas, Vermont, New Hampshire, Massachusetts, Spain, France and Portugal - all spam hotspots according to the map cited? What's that, you receive email from people in these places? Tough titties, if we're to block email coming from spam hotspots as you say.
  
  Also, you've managed to point a finger of blame at Indonesia and Nigeria who are saintly in comparison to some more developed nations. Go racism!
  
- Re: (Score:2)
  
  by Incadenza ( 560402 ) writes:
  
  Type spam map into Google image search to see how blatantly obvious it is where the spam comes from.
  Since you were modded 'interesting', I did exactly as you said and found this page: http://mailinator.com/mailinator/map.html [mailinator.com]. Refreshed it 3 times now, and every time at least 4 balloons are pointing at the US, one at Canada and 2 or 3 at European countries. Interesting indeed.
- Re: (Score:2)
  
  by ozmanjusri ( 601766 ) writes:
  
  Something like 98% of spam can be pinned down to 0.01% of the world by square footage.
  A rough assessment of the last 30 days of spam stored on my server suggests more than 75% comes from the USA.
  A quick look at http://www.mailinator.com/mailinator/map.html [mailinator.com] shows clusters in the south (Memphis seems to be a hotspot) and on the east coast.
  I don't know about Korea, but blocking Tennessee, Missouri and Florida would cut my spam in half. Blocking the rest of the USA would reduce it by 75%.
Nothing new here... (Score:5, Funny)

by Bodrius ( 191265 ) writes: on Sunday January 07, 2007 @02:35PM (#17499106) Homepage

This isn't new to Slashdotters...

For years, Slashdot posts have used Wikipedia as a form of artificial intelligence.

- Mine Slashdot headlines (Score:2)
  
  by Ed Avis ( 5917 ) writes:
  
  A common tactic to defeat spam filters is to misspell words. The filters should look at the output of the Slashdot editors over the past decade to see what the common mistakes are.
Comment removed (Score:3, Insightful)

by account_deleted ( 4530225 ) writes: on Sunday January 07, 2007 @02:35PM (#17499108)

Comment removed based on user account deletion

- Re: (Score:2)
  
  by dangitman ( 862676 ) writes:
  
  Your penis gets spam? Damn it must hurt if you put it through a filter.
- Re: (Score:2)
  
  by Watson Ladd ( 955755 ) writes:
  
  Well, just filter out all Bloodhound Gang lyrics and we're ok.
WikiTuring Test (Score:2)

by MillionthMonkey ( 240664 ) writes:

wife at the devil, and the wife certainly cuckolds her husband. Whereas, house of Austria acquired the seventeen provinces, and by the latter, his from Leipsig, to which he refers in a subsequent one, and which I upon, than 'la pluie et le beau tens'.
So which is it, Wikipedia? Should I open the big image attachment?
- Re:WikiTuring Test (Score:4, Funny)
  
  by Halo1 ( 136547 ) writes: on Sunday January 07, 2007 @03:02PM (#17499398)
  
  I recently got a quite funny attempt like that, pumping some stock in the image attachment (which moreover looked like a captcha in order to avoid OCR). The title of the spam was however "cocaine inexcusable", and the body, well (just two sample quotes -- and yes, the first two sentences appeared together like that):
  
  We are working with Internet Content Rating Association to make the internet safer for children. Powered by a super strong Japanese motor and gears this incredibly powerful anal probe will hit the spot every time.
  The Blue Rocket is a handy little clit massager that packs a mighty punch.
  
  Needless to say, it triggered the Bayesian filter pretty heavily in spite of all the obfuscation attempts :)
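
For readers unfamiliar with why random filler words do not save such a message, here is a minimal sketch of a naive Bayes word filter. The training messages, the function names `train` and `spam_log_odds`, and the add-one smoothing are all assumptions made for illustration; real filters are considerably more elaborate.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, is_spam). Returns per-class word and message counts."""
    words = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        words[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return words, totals

def spam_log_odds(text, words, totals):
    """Log-odds that text is spam, using add-one smoothing. Above 0 leans spam.

    Assumes the training data contained at least one message of each class.
    """
    vocab_size = len(set(words[True]) | set(words[False]))
    spam_tokens = sum(words[True].values())
    ham_tokens = sum(words[False].values())
    score = math.log(totals[True] / totals[False])
    for w in text.lower().split():
        p_spam = (words[True][w] + 1) / (spam_tokens + vocab_size)
        p_ham = (words[False][w] + 1) / (ham_tokens + vocab_size)
        score += math.log(p_spam / p_ham)
    return score

# Hypothetical training data, far too small for real use.
TRAINING = [
    ("buy cheap pills now", True),
    ("hot stock tip buy now", True),
    ("meeting notes attached", False),
    ("lunch tomorrow at noon", False),
]
words, totals = train(TRAINING)
print(spam_log_odds("buy pills", words, totals) > 0)
```

Each word contributes its own evidence, so a handful of strongly spammy tokens can outweigh a lot of neutral padding.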
  
  - Re: (Score:2)
    
    by MillionthMonkey ( 240664 ) writes:
    
    We are working with Internet Content Rating Association to make the internet safer for children. Powered by a super strong Japanese motor and gears this incredibly powerful anal probe will hit the spot every time.
    The Blue Rocket is a handy little clit massager that packs a mighty punch.
    Want to see where their spider got this stuff?
    
    The safe for children crap [bionictonic.co.uk] (since reworded)
    The Intimate Intruder Anal Probe [bionictonic.co.uk]
    The Wrist Rocket [bionictonic.co.uk]
- Re: (Score:2)
  
  by mandelbr0t ( 1015855 ) writes:
  
  A good sample of the fake content that spam engines create. It seems intuitively obvious to me that this text is completely meaningless, but getting an AI to understand why is much trickier. Clues come from the fact that "latter" is used incorrectly (being no "former" to distinguish "it" from), pronoun "his" refers to no subject, comparative "than" doesn't compare two subjects, etc.
  
  Unfortunately, humans make these sorts of semantic errors all the time. We're just extending a Bayesian filter to make a statem
i prefer (Score:5, Funny)

by macadamia_harold ( 947445 ) writes: on Sunday January 07, 2007 @02:38PM (#17499140) Homepage

For instance, the front of spam fighting is a tough one and it looks as though researchers are now turning towards an ontology or taxonomy based solution to fight spammers.

I think it would be much more effective if we used a taxidermy-based solution to fight spammers.

Cool solution to yesterday's problem (Score:2)

by G4from128k ( 686170 ) writes:

It's not the words that the spam filter can't recognize that let spam get through, it's the increasing use of image spam. OCR and existing filters would do more to solve spam than would wiki-AI intelligent filters.

Of course, the minute anti-spam software/services use OCR is the minute that spam images start looking like captchas.
- Re: (Score:2)
  
  by NoOneInParticular ( 221808 ) writes:
  
  Hmm, so what's actually happening is that the spammers are coercing the spam-filter writers to create good enough OCR so that the spammers can turn around and use that to circumvent the captchas on the WWW. Talk about a devious ploy! We're fucked.
Artificial intelligence! (Score:4, Informative)

by tcopeland ( 32225 ) writes: <tom&thomasleecopeland,com> on Sunday January 07, 2007 @02:44PM (#17499210) Homepage

And all this time you thought it was just if and switch statements!

Whenever someone claims that a program is semantically aware, be sure to reread Clay Shirky's article [shirky.com] on the Semantic web.

Future trends... (Score:3, Interesting)

by __aaclcg7560 ( 824291 ) writes: on Sunday January 07, 2007 @02:46PM (#17499226)

Artificial Intelligence may evolve to the point that it may decide to rewrite Wikipedia from a human-centric point of view to an AI-centric point of view (i.e., World War II resulted in the deaths of six million AIs). Since people will believe anything and Wikipedia can't be wrong, it'll be one step towards the formation of the Matrix. After all, only the victors write history.

Uhh (Score:2)

by unborracho ( 108756 ) writes:

B12 is a vitamin, which is also known to improve your health, which your Aunt Sally regularly sends you messages about. So, great: all messages from Aunt Sally are now blocked.
- Re: (Score:2)
  
  by DavidLeblond ( 267211 ) writes:
  
  Please Excuse My Dear Aunt Sally.
  - Re: (Score:2)
    
    by nelsonal ( 549144 ) writes:
    
    Ha ha, I guess that's a pretty effective mnemonic (the Firefox spell checker is the bee's knees). I remembered that it was one, and remembered it, but had to google what it was supposed to be reminding me of (even though I apply the order of operations nearly every day).
- Re: (Score:2, Interesting)
  
  by CoderDog ( 782544 ) writes:
  
  Presumably, Aunt Sally will be in your white-list and be passed through whether she's tipping you to startling new developments for Viagra, or B-12. Most of the anti-spam work is done in an effort to avoid building mammoth personal black-lists of mostly short-lived addresses. I doubt we'll get rid of white-lists anytime soon, if ever.
  
  What would impress me is an AI that filtered spam very effectively, but also noticed that Aunt Sally had a new email address and continued to deliver her mail.
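
The white-list-first ordering described above can be sketched in a few lines. The addresses, the keyword set, and the `classify` function are hypothetical; the keyword check merely stands in for whatever content filter runs second.

```python
# Hypothetical trusted senders and blocked terms, for illustration only.
WHITELIST = {"aunt.sally@example.com"}
BLOCKED_TERMS = {"vitamin", "b12", "viagra"}

def classify(sender, body):
    """Return "deliver" or "spam". White-listed senders skip content checks."""
    if sender.lower() in WHITELIST:
        return "deliver"
    # Crude tokenization: split on whitespace and trim common punctuation.
    tokens = {t.strip(".,!?").lower() for t in body.split()}
    return "spam" if tokens & BLOCKED_TERMS else "deliver"

print(classify("Aunt.Sally@example.com", "Take your B12 vitamin!"))
print(classify("stranger@example.net", "Cheap B12 here"))
```

Note that this sketch keys on the exact address, so it has precisely the weakness mentioned: if Aunt Sally changes her address, her mail falls through to the content filter.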
- I dinna think it means what the AI thinks it means (Score:2)
  
  by HTH NE1 ( 675604 ) writes:
  
  And just because your Aunt Sally doesn't want to receive spam about vitamins doesn't mean she wants to miss her weekly Bingo e-mails.
UMMMM wordnet? (Score:4, Informative)

by Anonymous Coward writes: on Sunday January 07, 2007 @02:50PM (#17499280)

this kind of technique has been used for a while..

http://wordnet.princeton.edu/ [princeton.edu]

and according to my source of AI, wikipedia http://en.wikipedia.org/wiki/WordNet [wikipedia.org]
(like all sophisticated software) has been in development since the mid eighties..

WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing

- - Re: (Score:2, Interesting)
    
    by modeless ( 978411 ) writes:
    
    I can't imagine that wikipedia would be better for this than wordnet
    
    You must not have a very good imagination. Wikipedia articles are far larger than wordnet definitions, with much more potential to hold useful information. Wikipedia has a much larger scope than wordnet, including huge amounts of cultural, historical, and scientific data that wordnet ignores. Wikipedia has a larger team of contributors. Wikipedia has data in several other languages besides English. Wikipedia is constantly updated with
Since when (Score:4, Insightful)

by trifish ( 826353 ) writes: on Sunday January 07, 2007 @02:54PM (#17499316)

Since when does a database + automated search (keyword patterns and relations) = artificial intelligence?

- Re: (Score:2)
  
  by Flamesplash ( 469287 ) writes:
  
  You have just described Data Mining.
- Re:Since when (Score:5, Informative)
  
  by timeOday ( 582209 ) writes: on Sunday January 07, 2007 @03:54PM (#17499844)
  
  Since when does a database + automated search (keyword patterns and relations) = artificial intelligence?
  What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?
  
  - Re: (Score:3, Insightful)
    
    by maxwell demon ( 590494 ) writes:
    
    What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?
    
    The creative part?
    - Re: (Score:3, Interesting)
      
      by timeOday ( 582209 ) writes:
      
      Maybe creative people just detect more abstract patterns (e.g. lower S/N ratio) than others?
  - Re: (Score:3, Informative)
    
    by sacrilicious ( 316896 ) writes:
    
    What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?
    Paraphrasing to make a point: What part of computing is not detecting, storing, and applying patterns and relations?
    To be meaningful, "AI" should denote more than (as the article summary indicates is being done) doing a grep through a web repository to deduce associations. There are branches of AI founded on brain neurology (neural nets), evolution (Genetic Algorithms), Bayesian logic, and various oth
  - Re: (Score:2)
    
    by naoursla ( 99850 ) writes:
    
    It's funny how AI is a moving target. Once we are able to reduce, explain, and understand how some aspect of AI works, many people no longer consider it AI.
    - The target hasn't moved (Score:2)
      
      by ClosedSource ( 238333 ) writes:
      
      The issue isn't understanding how AI "works", it's understanding how to make AI work. AI isn't a moving target, we just keep assuming we're closer to it than we really are.
  - Re: (Score:2)
    
    by trifish ( 826353 ) writes:
    
    > What part of human/animal intelligence is not detecting, storing, and applying patterns and relations?
    
    A red herring comment modded +5 Insightful? *Shakes head*
    
    The keyword is part of intelligence. For instance, storing data is only a part of the "ability" called intelligence. By your logic anyone who is capable of storing is capable of artificial intelligence. However, the system advertised in this "article" has only parts of artificial intelligence. And those parts are considered rather trivial in CS.
    
    S
  - - Re: (Score:2)
      
      by timeOday ( 582209 ) writes:
      
      I hope we do have a spirit that makes us innately different from machines, but I'll just point out that an AI that can exhibit human-level intelligence would revolutionize the world, whether "weak" or "strong." In fact I'd prefer they were "weak" so we wouldn't have to give them rights or feel guilty about making them work for us.
- Re: (Score:3, Interesting)
  
  by Kjella ( 173770 ) writes:
  
  Well, most of the definitions of artificial intelligence go "intelligence by something artificial", then we're down to intelligence, which is so fuzzily defined that almost anything can be applied. The first definition of intelligence on Wikipedia focuses on individuality, which in other words says it's a bunch of skills rolled up into one. The other is even fuzzier. Quote WP:
  
  A second definition of intelligence comes from "Mainstream Science on Intelligence", which was signed by 52 intelligence researchers in 1994:
- Re: (Score:2)
  
  by Alef ( 605149 ) writes:
  
  That is the thing with artificial intelligence research. So long as the concepts are understood only by researchers, people call it AI and regard it as something mysterious, but as soon as it gets useful applications and reaches the public it becomes "just statistics" or "business rule engines" or something similar. What you describe is data mining, a concept on the verge of entering the public mind.
  - Re: (Score:2)
    
    by petermgreen ( 876956 ) writes:
    
    What intelligence is is a difficult question to answer.
    
    Personally I'd say it's the ability to solve problems WITHOUT having been designed to solve those problems, and the ability to see opportunities for improvement in the current way of doing things.
    
    Cats live in our homes and foxes roam in our cities; neither of those animals was designed for those environments, nor have they had time for significant biological evolution, yet they find ways to manage in those environments.
    
    and we have in a couple of centuries gone
- Re: (Score:2)
  
  by coaxial ( 28297 ) writes:
  
  That's not how Wikipedia is being used. It's being used as a reservoir for semantic information. You want to know if these two consecutive tokens are a name? Check Wikipedia. Biographies are clearly labeled. Want to know if this token is a country? Check Wikipedia. Want to know terms associated with the War of 1812? Check Wikipedia. It's a data corpus made up of human-annotated terms, and that's why it's valuable.
  - Re: (Score:2)
    
    by trifish ( 826353 ) writes:
    
    You call that "artificial intelligence", I call that a database. I don't think we should continue this discussion. Do your homework on AI first. Bye.
    - Re: (Score:2)
      
      by coaxial ( 28297 ) writes:
      
      You have no idea what you're talking about. If you did, you wouldn't be trying to conflate a data corpus and an algorithm. Also, if you had done the least bit of research into AI, and in this case information retrieval, you'd know just how simple real AI really is. I hate to tell you this. But AI is pretty much just simple search and table lookups. There's no magic, dude. None whatsoever. So I guess in that sense, it is like magic. It looks cool and amazing when you don't know how it's done, but wh
Just make spam a crime! (Score:4, Insightful)

by D4C5CE ( 578304 ) writes: on Sunday January 07, 2007 @02:58PM (#17499358)

However many academic papers and spam filters throw their ever-more-elaborate algorithms at this issue, it is an arms race that cannot be won by the "good guys", as long as lawmakers keep pretending that technology alone could prevent the effects of sociopathic behavior: unsolicited bulk messages won't go away unless sending them is severely punishable and vigorously prosecuted in all nations that contribute to the problem. This should have happened more than a decade ago, but now the world is simply running out of storage, bandwidth and CPU cycles much too quickly to afford waiting another decade (or even a year) for serious, intransigent anti-spam legislation that is long overdue.

- - Re: (Score:2)
    
    by Neoprofin ( 871029 ) writes:
    
    For me, drug addiction, poverty, world hunger, nuclear proliferation, racism, sexual harassment, and rising energy concerns have all been solved. Whew! Glad we got that out of the way.
    
    Just because a problem is not having an obvious and overt effect on you personally doesn't erase your knowledge that something exists. Administrators are having a problem, they're telling you with their actions. If there was no spam there'd be no spam filters, if it wasn't getting worse they wouldn't need better ones. You cl
For true AI, you need 3D spatial recognition (Score:2)

by CrazyJim1 ( 809850 ) writes:

All these word relation AI's make me laugh. We could have real AI if you wanted to put effort into it. Link [geocities.com]
- Re: (Score:2)
  
  by smallfries ( 601545 ) writes:
  
  The ironic part is that when I went to click on the link, the Geocities account was already dead. And yet I didn't need to read the page to understand that the author was a crank. That's the thing about intelligence that nobody has ever managed to capture in a formal system.
how about pen1s en1argement? (Score:2)

by gamer4Life ( 803857 ) writes:

Do they substitute numbers for letters in their filtering?
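
One way a filter could handle digit-for-letter substitution is to expand each obfuscated token into its plausible spellings and check those against a blocklist. This is a sketch under assumptions: the `SUBS` table and both function names are hypothetical, and "1" is deliberately mapped to both "i" and "l" because the subject line above uses it both ways.

```python
from itertools import product

# Hypothetical substitution table: obfuscating character -> letters it may stand for.
SUBS = {"0": "o", "1": "il", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}

def candidates(token):
    """Yield every spelling of token with each mapped character kept or replaced."""
    # Each position offers the original character plus any mapped letters.
    options = [ch + SUBS.get(ch, "") for ch in token.lower()]
    for combo in product(*options):
        yield "".join(combo)

def matches_blocklist(token, blocklist):
    """True if any candidate spelling of token appears in blocklist."""
    return any(c in blocklist for c in candidates(token))

BLOCKLIST = {"enlargement", "viagra"}  # hypothetical
print(matches_blocklist("en1argement", BLOCKLIST))
print(matches_blocklist("v1agra", BLOCKLIST))
```

The number of candidates grows exponentially with the count of ambiguous characters in a token, so a practical version would cap token length or prune against a dictionary as it expands.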
associations... (Score:2)

by pedantic bore ( 740196 ) writes:

Given that the link distance between randomly chosen wikipedia articles is about five (sorry, don't have a link to where I saw this... and it was a while ago so maybe it's changed...) practically everything is going to be strongly associated with spam keywords.
I don't see how this is getting us anywhere except moving closer to having a spam filter that just returns "true" to anything that isn't white-listed.
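
The link-distance concern can be made concrete with a breadth-first search over a link graph. The `LINKS` graph below is entirely made up for illustration, and `link_distance` is a hypothetical helper; the real Wikipedia link graph is vastly larger and denser.

```python
from collections import deque

# Hypothetical link graph: article -> articles it links to.
LINKS = {
    "Vitamin B12": ["Vitamin", "Cobalamin"],
    "Vitamin": ["Nutrition", "Dietary supplement"],
    "Dietary supplement": ["Advertising"],
    "Advertising": ["Spam (electronic)"],
    "Cobalamin": [],
    "Nutrition": [],
    "Spam (electronic)": [],
}

def link_distance(start, goal):
    """Fewest link hops from start to goal, or None if goal is unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, hops = queue.popleft()
        if page == goal:
            return hops
        for nxt in LINKS.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None

print(link_distance("Vitamin B12", "Spam (electronic)"))
```

If nearly every pair of articles sits within a handful of hops, raw hop count alone cannot separate "strongly associated" from "merely connected", which is the objection raised here; any usable association measure would need a tight hop cutoff or some weighting beyond reachability.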
Looks like good research (Score:3, Informative)

by MarkWatson ( 189759 ) writes: on Sunday January 07, 2007 @03:21PM (#17499568) Homepage

I will read the paper when I get the proceedings for the International Joint Conference on Artificial Intelligence. From the article, this seems like a statistical natural language processing application: the examples looked like they collect statistics of associations for both single words and short word sequences.

BTW, associating, clustering, etc. documents using single word statistics is computationally cheap and easy - it is also associating short word sequences that makes this a difficult problem.
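
A minimal sketch of the difference, assuming plain whitespace tokenization (the paper may tokenize quite differently): single-word statistics are just the n = 1 case, and each larger n adds another table of sequences to count and store.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, max_n):
    """Count every word sequence of length 1..max_n in the text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


counts = count_ngrams("buy vitamin b12 buy vitamin c", 2)
print(counts[("buy",)], counts[("buy", "vitamin")], counts[("vitamin", "b12")])
```

On a real corpus the number of distinct sequences grows quickly with max_n, which is where the cost comes from.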

- Re: (Score:2)
  
  by saddino ( 183491 ) writes:
  
  The computational effort for short word sequences is no longer much of an issue. For example, the web clustering algorithm in the free application CQ web [q-phrase.com] computes clusters in corpus phrases up to seven words in length, and it runs without a hiccup on your standard Windows or Mac desktop.
Not very "intelligent" (Score:5, Insightful)

by iamacat ( 583406 ) writes: on Sunday January 07, 2007 @03:28PM (#17499632)

There are lots of legit e-mails discussing vitamins, Viagra or even penis enlargement, this post included.

Not New, not newsworthy (Score:3, Informative)

by Sub Zero 992 ( 947972 ) writes: on Sunday January 07, 2007 @03:42PM (#17499736) Homepage

Anybody who has been working in the field of NLP (natural language processing) can do little more than sneer at this story.

The field of word sense exploration is one of the more mature areas of NLP; take a look at Princeton's WordNet database for an example [http://wordnet.princeton.edu/]. Using their word sense database (without referring to silly words such as "ontology") it has been possible - for years - to discover if two lemmas (that's "words" to you) are related in a particular way, or not related. Using WordNet it is possible to distinguish between antonyms and homonyms, thereby thwarting spammers who use words which sound like "viagra" - "niagra" and words which have opposite meanings.

- Re: (Score:2)
  
  by Virtual_Raider ( 52165 ) writes:
  
  Anybody who has been working in the field of NLP (natural language processing) can do little more than sneer at this story.
  Along with the title, that is one of the most useless comments one finds in /.
  It is news to many of us —the great majority of readers I dare say— because we are nerds that come from different fields. I bet I could come up with common knowledge from cellular telephony that you haven't heard about and it would be news to you. If it was sufficiently interesting, it would even be newsworthy even if it's been kicked around base stations for 4 years.
  You make it sound like you have deeper knowledg
  - Re: (Score:2)
    
    by dodongo ( 412749 ) writes:
    
    I would be very interested in hearing about how are they going to use the general knowledge of the wiki to filter out advertisement. For instance, let's say that an email that contains B12 is talking about a plane and not the vitamin, what other elements should the program take into account to distinguish this?
    I do think you may have been a bit harsh on grandparent; I for one, having done some work in NLP, was wondering whether anyone else was really questioning the newsworthiness of the post. So you can,
Make the people accountable (Score:2)

by thePig ( 964303 ) writes:

This is a little off-topic, but I guess the only way to take out this menace of spam is to make the average Joe accountable.
If the spam originated from a botnet in his machine, make him accountable too.

If he has installed the latest updates from Microsoft and still the botnet could get in, then it is not an issue. But, if he has not taken the effort to download the patches for say, the last 6 months, and a botnet operated from his machine, causing discomfiture to all and sundry, then he is accountable for i
Look up Abstraction Physics (Score:2)

by 3seas ( 184403 ) writes:

http://threeseas.net/abstraction_physics.html [threeseas.net]

considering the article is from physorg......

and to think they plan to patent it? Abstraction Physics?

I don't think so...
Perhaps this is all that we were missing for AI (Score:2)

by alexwcovington ( 855979 ) writes:

A knowledge base with associative retrieval capability has eluded researchers but they have one in Wikipedia. Now if only they can get AI to successfully [and hopefully, correctly] modify the knowledge base...
- Re: (Score:2)
  
  by kalirion ( 728907 ) writes:
  
  Something like wikipedia will definitely be needed for people attempting to create true AI. The best part is that it can be easily gotten on CD (or is it DVD?), so the computer with the AI can be completely isolated from the outside world. You know, to avoid the Skynet scenario.
Hutter Prize (Score:3, Informative)

by Baldrson ( 78598 ) * writes: on Sunday January 07, 2007 @04:26PM (#17500122) Homepage Journal

As has been previously reported on Slashdot, The Hutter Prize for Lossless Compression of Human Knowledge [slashdot.org] uses a snapshot of Wikipedia for rigorously benchmarking AI (and it has already had its first payout [slashdot.org]).
The rigor of the benchmark is the key. The Turing Test really only benchmarks human mimicry -- not intelligence per se. The new theoretic basis of universal intelligence [hutter1.net] allows a mathematically rigorous approach to AI that is reviving the field after nearly 50 years of drifting in a stagnant pool of inadequate concepts.

But spammers can add content to Wikipedia (Score:2)

by dpbsmith ( 263124 ) writes:

This is the biggest threat to Wikipedia I've heard in a long time.

If Wikipedia content is used to determine whether a message is spam, suddenly there is a direct incentive to spammers to add spam-related content to Wikipedia.
- Re: (Score:2)
  
  by sbaker ( 47485 ) * writes:
  
  In the particular example given, a spammer trying to sell Vitamins using the word 'B12' would have a strong incentive to scan Wikipedia and remove all instances of the word 'B12' wherever it was found - and perhaps even to insert it spuriously in a few places where the end user might be white-listing words too.
  
  This would be very bad indeed for Wikipedia because it gives a motive to vandals - and not just to the stupid vandals we have right now - but to the annoyingly inventive ones too.
  
  Urgh!
- The double-edged sword that is knowledge (Score:2)
  
  by NetSettler ( 460623 ) * writes:
  
  This is the biggest threat to Wikipedia I've heard in a long time. If Wikipedia content is used to determine whether a message is spam, suddenly there is a direct incentive to spammers to add spam-related content to Wikipedia.
  Personally, I think spammers are already much smarter than this. It may be my imagination, but if so it's surely coming, that spammers are grabbing text from places they harvest my name and just including that text in messages rather than trying to make up things from scratch. S
As I've Said Many Times Before (Score:2)

by Master of Transhuman ( 597628 ) writes:

Conceptual processing is the ONLY way to deal with these issues.

For example, what if I'm getting information sent to me from acquaintances about life extension - references to vitamins and nutrients would abound. But it wouldn't be spam.

An AI spam blocker has to know what I'm interested in, what material I've received before that was cleared, AND has to be able to, in some sense, UNDERSTAND the content rather than just correlating it to other terms atomically in terms of frequency of occurrence. Otherwise,
PBEM (Score:2)

by Alsee ( 515537 ) writes:

the computer will use its Wikipedia-based knowledge base to infer that 'B12' is strongly associated with the concept of vitamins, and will [] identify the message as spam

Ha Ha! Blocked!

You didn't sink my battleship!

-
Text of IJCAI paper (Score:3, Informative)

by gvc ( 167165 ) writes: on Sunday January 07, 2007 @09:26PM (#17502862)

http://www.ijcai.org/papers07/Papers/IJCAI07-259.pdf [ijcai.org]

While IJCAI is a prestigious conference, and the results may be sound, the claims as to the applicability to spam filtering are bogus. The paraphrase of how state-of-the-art filters work is wrong, and there's no evidence that better word associations translate to better spam filter accuracy. None at all.

Should the authors wish to show applicability to spam filtering, they should do so using the TREC Spam Track methodology and datasets. http://trec.nist.gov/data/spam.html [nist.gov]

The call for participation in TREC 2007 is currently open: http://trec.nist.gov/call07.html [nist.gov] Nothing at all prevents a TREC participant from submitting a filter that includes a copy of Wikipedia, if they feel it would help.

Who needs AI? (Score:2)

by DarkProphet ( 114727 ) writes:

Seriously. FWIW, I am for the most part a Google fanboy.

I have had my Gmail account for what, two years or so, and I really don't think Google's spam filter has ever missed a beat. That is to say that all the real spam I receive every day (~40 to 100 spams depending on the day) ends up in the spam folder, not my inbox. Spam is a total non-issue for me. OTOH, my Hotmail inbox is so atrocious and the spam filter so bad that I can't use the account for anything important. I don't know what kind of black magic
Skynet (Score:2)

by mikeee ( 137160 ) writes:

Obviously, Judgement Day will be triggered by Skynet in a final, frustrated attempt to eliminate spammers.
intelligence, artificial or otherwise? (Score:2)

by samantha ( 68231 ) * writes:

The word "vitamin" in a message means it is spam? Methinks that the intelligence should be applied to better tests for what is spam rather than simple-minded collecting of associated terms for hot words from various online sources. Bayesian filters are much better than this already and do not require Wikipedia reading to do their jobs with 99% accuracy after fairly minimal training.
- Re: (Score:2)
  
  by asavage ( 548758 ) writes:
  
  I think it just wasn't explained well. What it is supposed to do is recognize that an unseen word has the same meaning as a word the spam filter already knows and adjust the score of the email in the same way. Any email filter that filtered out emails based on the occurrence of any single word would have an unacceptable amount of legitimate email filtered.
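
A rough sketch of that idea, with everything in it assumed: the scores, the association table (standing in for whatever the Wikipedia-derived knowledge base provides) and the neutral default are invented, and this is not claimed to be the paper's actual method. An unseen word borrows the score of the concept it is associated with, and that score is only one input among many.

```python
# Hypothetical per-word spam scores learned from previously seen mail.
LEARNED_SCORES = {"vitamin": 0.9, "meeting": 0.1}
# Hypothetical word -> concept table standing in for the Wikipedia-based one.
ASSOCIATIONS = {"b12": "vitamin"}
# Score used when nothing is known about a word.
NEUTRAL = 0.5


def word_score(word):
    """Score a word, backing off to its associated concept if unseen."""
    word = word.lower()
    if word in LEARNED_SCORES:
        return LEARNED_SCORES[word]
    concept = ASSOCIATIONS.get(word)
    if concept in LEARNED_SCORES:
        return LEARNED_SCORES[concept]
    return NEUTRAL


print(word_score("B12"), word_score("meeting"), word_score("xyzzy"))
```

The per-word score would then be combined with the scores of the other words in the message, so "B12" alone would not decide anything.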
- Comment removed (Score:4, Interesting)
  
  by account_deleted ( 4530225 ) writes: on Sunday January 07, 2007 @02:43PM (#17499196)
  
  Comment removed based on user account deletion
  
  - Re: (Score:2)
    
    by Danny Rathjens ( 8471 ) writes:
    
    Bayesian analysis can still work, but only in combination with OCR software.
    
    That is not entirely correct. Bayesian filters work with *all* textual tokens in a message, not just the visible text in the body of the message. For example, if your image spam all has various combinations of debora@somerandomdomain in the mail headers, as a recent spambot was doing, or if your spam all used the same relays and consequently has the same Received: headers, then a Bayesian filter will still rank it higher than non-spam.
  - Re: (Score:2)
    
    by rjshields ( 719665 ) writes:
    
    Yes, but OCR is too slow to actually be useful. Plus spammers are using slanted, wobbly, coloured text, random backgrounds and all manner of methods to prevent OCR from working effectively.
    - - Re: (Score:2)
        
        by rjshields ( 719665 ) writes:
        
        Oh, so you just need a slanted, wobbly, colored text and random background detector that doesn't FP like crazy ;)
  - OCR unnecessary (Score:2)
    
    by gvc ( 167165 ) writes:
    
    The Bayesian analysis in spam filters only works on text. Spammers realized that they could get around it by filling the text portion of the message with some random passage from a Project Gutenberg file, thus making it seem innocuous, and then putting the real advertisement in a GIF or PNG file that would be displayed by HTML-capable mail readers. Bayesian analysis can still work, but only in combination with OCR software.
    
    Bayesian filters (and other statistical filters colloquially known as Bayesian) ca
  - Re: (Score:2)
    
    by T.E.D. ( 34228 ) writes:
    
    The Bayesian analysis in spam filters only works on text. Spammers realized that they could get around it by filling the text portion of the message with some random passage from a Project Gutenberg fil
    I'm using Thunderbird 1.5.0.9, and it seems to work great on those "book attack" spams. I haven't seen one get through yet, so they appear to be less likely to get through than normal spams.
    
    On a guess, I'd say that a random chunk of literature is far more likely to contain words never used in valid correspond
  - - Re: (Score:2)
      
      by gvc ( 167165 ) writes:
      
      Content based filtering is NOT working and will NEVER work!
      
      I don't usually respond to ACs, but this particular belief is common enough that I feel I should say a few words. The overall goal of spam abatement is to enhance the probability that legitimate email will be delivered in a timely and efficient manner to its intended recipient. Content-based filtering is widely deployed in this context and it is fairly effective for its intended purpose. Demonstrably more effective, and less intrusive, than for
      - Uhm... what color is the sky in your world? (Score:2)
        
        by Gary W. Longsine ( 124661 ) writes:
        
        I think the point is that many, if not most email users find themselves wading through a sea of spam despite the multiple layers of content filtering that happen between the point of origin and their inbox. The AC is partly right. Content filtering has merely delayed the death of email.
        
        College students these days are often heard to say, "I have an email address but I never use it." They prefer their cell phones because voice and SMS text messages are not yet flooded with spam. Email may not be dead,
      - Re: (Score:2)
        
        by gvc ( 167165 ) writes:
        
        The problem with content based filtering is it either increases the amount of wading due to quality control needs or decreases the amount of wading at the expense of lost messages.
        
        There's no evidence that the statement above is true. A user who has to wade through a mixture of spam and non-spam will overlook some of the non-spam. The question is whether the human or the machine will overlook more. A subsidiary question is, once overlooked, how likely is the message to be retrieved using some subsidiary
- Re:The B12 example is horrible (Score:4, Informative)
  
  by tepples ( 727027 ) writes: <tepples.gmail@com> on Sunday January 07, 2007 @03:21PM (#17499576) Homepage Journal
  
  Suppose somebody was trying to sell me a B12 bomber.
  
  Then your e-mail account's Bayes map would have the map (word B12 -> folder Aircraft) with a high probability, which would outweigh (word B12 -> article Vitamin -> folder Drug Spam).
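
A toy sketch of that comparison, with made-up numbers throughout: the per-folder word probabilities, the concept link, the discount on indirect evidence and the smoothing floor are all assumptions for illustration, and equal folder priors are assumed. Direct evidence from your own mailbox (B12 seen in Aircraft) beats evidence that only arrives through an association (B12 to Vitamin to Drug Spam).

```python
import math

# Hypothetical per-user statistics: P(word | folder) learned from this mailbox.
DIRECT = {
    "Aircraft": {"b12": 0.20, "bomber": 0.30},
    "Drug Spam": {"vitamin": 0.25},
}
# Hypothetical concept link standing in for a Wikipedia-derived association.
CONCEPT = {"b12": "vitamin"}
DISCOUNT = 0.1   # made-up penalty for evidence that arrives indirectly
FLOOR = 1e-4     # smoothing for words never seen in a folder


def word_likelihood(word, folder):
    """Direct evidence first, then discounted concept evidence, then floor."""
    seen = DIRECT[folder]
    if word in seen:
        return seen[word]
    concept = CONCEPT.get(word)
    if concept in seen:
        return DISCOUNT * seen[concept]
    return FLOOR


def classify(words):
    """Pick the folder with the highest summed log-likelihood (equal priors)."""
    scores = {
        folder: sum(math.log(word_likelihood(w.lower(), folder)) for w in words)
        for folder in DIRECT
    }
    return max(scores, key=scores.get)


print(classify(["B12", "bomber"]), "/", classify(["vitamin"]))
```

For a user with no aircraft mail at all, the Aircraft entries would be missing and the same "B12" would fall through to the discounted Drug Spam evidence instead.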
  
  - Re: (Score:2)
    
    by dkf ( 304284 ) writes:
    
    Plus, a message discussing a B12 bomber would be likely to have other high-ham words, especially in the context of an ongoing discussion on the topic. Bayesian filters (or at least the ones that are any good) pick up on this sort of thing too, and it is part and parcel of what makes real content filtering so effective. But effective content filtering has to be done on actual mailboxes; it depends on the fact that individual people don't discuss that many different topics on a normal basis...
- - Re: (Score:2)
    
    by tepples ( 727027 ) writes:
    
    Of course nothing prevents you from changing ISPs if your ISP forces unreasonable policies onto you...
    
    Unless you live in Qatar [slashdot.org]. Or more practically for residents of countries with an anglophonic majority, unless you live in an area where both the local cable company and the local DSL company have policies that you consider unreasonable.
    - - Re: (Score:2)
        
        by tepples ( 727027 ) writes:
        
        For email you can always host your own servers.
        
        You mean "smarthosting" through an e-mail provider in North America or Europe, right? Otherwise, your cable or DSL connection is on the "dynamic IP" list as well as a "spam haven country" list.
