Bayesian Tail


Posted by timothy on Wednesday December 29, 2004 @01:48PM from the seek-novelty dept.

A user writes "We all know anti-spam software that uses Bayesian filtering, and the results are amazingly good. That got me thinking: why not create a tool that monitors logfiles and uses a Bayesian filter to decide which events to display and which to hide? That's why I created btail. Btail does just that: it monitors a logfile and filters it with a Bayesian filter. The results exceeded my own expectations!"

This discussion has been archived. No new comments can be posted.


- Re:Sure... (Score:4, Funny)
  
  by I_Love_Pocky! ( 751171 ) writes: on Wednesday December 29, 2004 @08:27PM (#11214544)
  
  Why would you run this on an MS system? The critical errors are so common that btail would discard them with the rest of the log file.
  
Cool idea but may be dangerous (Score:2, Insightful)

by PhilippeT ( 697931 ) writes:

This is a cool idea but I wouldn't want to use it to filter logs on important systems... every line may be crucial.

Anyhow, credit for a decent idea.
- Re:Cool idea but may be dangerous (Score:5, Insightful)
  
  by dougmc ( 70836 ) writes: <dougmc+slashdot@frenzied.us> on Wednesday December 29, 2004 @02:08PM (#11210846) Homepage
  
  This is a cool idea but I wouldn't want to use it to filter logs on important systems... every line may be crucial.
  
  Perhaps, but doesn't the same apply to your email? Every email may be crucial as well -- but if you miss a crucial email because it was buried in spam, isn't the effect the same as if it was caught by an overzealous spam filter?
  
  - Re:Cool idea but may be dangerous (Score:5, Insightful)
    
    by cpuffer_hammer ( 31542 ) writes: on Wednesday December 29, 2004 @02:18PM (#11210969) Homepage
    
    Why not use it to colorize, or to rebuild the logs in HTML?
    
    - Re: (Score:3, Informative)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
    - Re:Cool idea but may be dangerous (Score:3, Informative)
      
      by lars_stefan_axelsson ( 236283 ) writes:
      
      Why not use it to colorize, or to rebuild the logs in HTML?
      I published a paper [chalmers.se], with GPL source code [chalmers.se] (you need Python etc) a few months back using visualisation (colorisation) to lend the user insight into the operation of a Bayesian classifier.
      It actually works pretty well, and the idea could be applied to other uses of the Naive Bayesian classifier.
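To make the colorizing suggestion in this sub-thread concrete, here is a minimal Python sketch. It assumes some classifier has already produced an estimated probability that a line is "bad"; the function name `colorize` and the two thresholds are hypothetical and are not taken from btail or from the paper linked above. Nothing is discarded: every line is still printed, only its colour changes.

```python
# ANSI escape sequences, the same family used in the awk one-liners in this thread.
RESET = "\033[0m"
RED = "\033[1;31m"
YELLOW = "\033[33m"
GREY = "\033[37m"


def colorize(line, p_bad, warn_at=0.5, alert_at=0.9):
    """Wrap a log line in an ANSI colour chosen from its estimated P(bad).

    p_bad is assumed to come from whatever classifier you use; the
    thresholds warn_at and alert_at are illustrative defaults.
    """
    if p_bad >= alert_at:
        colour = RED
    elif p_bad >= warn_at:
        colour = YELLOW
    else:
        colour = GREY
    return f"{colour}{line}{RESET}"


if __name__ == "__main__":
    # Hypothetical scores, hard-coded here in place of a real classifier.
    for text, score in [("session opened for user alice", 0.02),
                        ("smartd: 1 currently unreadable sector", 0.97)]:
        print(colorize(text, score))
```

The appeal of this design is that a misclassified line is merely the wrong colour rather than missing, which addresses the "every line may be crucial" objection.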
  - Re:Cool idea but may be dangerous (Score:3, Insightful)
    
    by GlassHeart ( 579618 ) writes:
    
    The far more important difference is that we cannot control the generation of incoming email, which is why we are reduced to filtering as intelligently as possible.
    Server logs are not the same at all. The administrator has some control over the logs that get generated, and the programmer has full control. There isn't supposed to be the equivalent of email spam at all, because useless messages should just be filtered or redirected at the source. Leaving everything at "verbose" and relying on filtering just
- Re:Cool idea but may be dangerous (Score:1, Insightful)
  
  by bzebarth ( 727391 ) writes:
  
  I guess it depends on the kind of log file and what you are looking for. If you are talking about a database log file, for instance, all error lines may start with ERR or something, and the log file may contain entries for every login and logout that you really don't care about. If you are only interested in certain types of entries it certainly seems useful.
  - Re:Cool idea but may be dangerous (Score:2)
    
    by OrangeSpyderMan ( 589635 ) writes:
    
    You mean like, errr, 'grep'? I think the whole point of this tool is for situations where 'grep' can't easily filter out the rest. In my ideal world developers always tag log entries in such a way that getting the entries you want out is as easy as grepping the file ;-)
examples (Score:4, Interesting)

by rogueuk ( 245470 ) writes: on Wednesday December 29, 2004 @01:52PM (#11210678) Homepage

Do you have any examples of what type of stuff it learns to filter and what it learns to show? The btail site is kind of lacking on what it outputs versus what it filters.

- Hey! (Score:3, Funny)
  
  by Jeremiah Cornelius ( 137 ) writes:
  
  Now even geeks can get a little tail!
Site getting sluggish already (Score:5, Informative)

by Kiaser Zohsay ( 20134 ) writes: on Wednesday December 29, 2004 @01:56PM (#11210722)

Blockquote from the readme.txt:

Step 1. compile & install

make install

Step 2. configure btail

Default configuration file:
db_bad = .btail_db_bad
db_good = .btail_db_good
db_conf = .btail_db_conf
logfile = /var/adm/messages

db_... are the database files which are filled by blearn. They are
used as reference when btail calculates if an event is bad or good.
logfile is the logfile which you want to monitor. As you see, one
needs a separate configuration file AND databases(!) for each file
to monitor.

Step 3. learn logging

blearn -g good_logging
blearn -b bad_logging

good_logging should contain events which are considered ok.
bad_logging should contain logging of events you want to see, e.g.
disk errors, invalid loggings, etc.

Step 4. use btail

btail

This will read the logfile defined in btail.conf and emit events
which are considered not-ok by the bayesian filter.

--- folkert@vanheusden.com

Still very preliminary at this point, but shows promise. Now, to build and try it out!
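For readers wondering what a filter like this might do internally, here is a minimal naive Bayes sketch in Python. It is not btail's actual code, whose internals are not shown in this thread; the class name `LineFilter`, the tokenizer, and the add-one smoothing are all assumptions made for illustration. The "good" and "bad" labels mirror the `blearn -g` and `blearn -b` training sets from the readme.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z_]+")


def tokenize(line):
    """Lowercased alphabetic tokens; digits (timestamps, PIDs) are dropped."""
    return [t.lower() for t in TOKEN_RE.findall(line)]


class LineFilter:
    """Two-class naive Bayes over log-line tokens (illustrative only)."""

    def __init__(self):
        self.counts = {"good": Counter(), "bad": Counter()}
        self.lines = {"good": 0, "bad": 0}

    def learn(self, label, lines):
        """Add example lines for label 'good' or 'bad'."""
        for line in lines:
            self.counts[label].update(tokenize(line))
            self.lines[label] += 1

    def log_odds_bad(self, line):
        """log P(bad | line) - log P(good | line), with add-one smoothing."""
        vocab = len(set(self.counts["good"]) | set(self.counts["bad"]))
        total_lines = self.lines["good"] + self.lines["bad"]
        score = 0.0
        for label, sign in (("bad", 1.0), ("good", -1.0)):
            total_tokens = sum(self.counts[label].values())
            # Smoothed class prior, then one smoothed likelihood term per token.
            logp = math.log((self.lines[label] + 1) / (total_lines + 2))
            for tok in tokenize(line):
                logp += math.log(
                    (self.counts[label][tok] + 1) / (total_tokens + vocab + 1)
                )
            score += sign * logp
        return score

    def is_bad(self, line):
        """True if the line looks more like the 'bad' examples."""
        return self.log_odds_bad(line) > 0.0


if __name__ == "__main__":
    f = LineFilter()
    f.learn("good", ["sshd: session opened for user alice",
                     "sshd: session closed for user alice",
                     "cron: job finished"])
    f.learn("bad", ["kernel: disk error on sda",
                    "kernel: I/O error reading sector"])
    for line in ["kernel: disk error on sdb",
                 "sshd: session opened for user bob"]:
        print(f.is_bad(line), line)
```

Dropping digits in the tokenizer is a deliberate choice in this sketch: timestamps and process IDs change on every line and would otherwise dilute the token statistics.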

- Re:Site getting sluggish already (Score:2)
  
  by Lars T. ( 470328 ) writes:
  
  Thanks. Now I know what to do with the source I can't get from the site.
Well, where's the story ? (Score:2, Interesting)

by noselasd ( 594905 ) writes:

Did that story end a bit quick ?
Heck, I wanna know what the results are goddamnit. What made the thing so great.
- Re:Well, where's the story ? (Score:1, Funny)
  
  by Anonymous Coward writes:
  
  The missing bit was that his expectations had been that the code would burst into violet flame, emitting strange new quantum rays that would turn his bones into mercury.
  
  When it merely turned a couple of dogs inside out, he knew the time had come to offer it as a Slashdot story.
my comment: (Score:1)

by anonieuweling ( 536832 ) writes:

Go Folkert, go! ;-]
- I concur (Score:1)
  
  by nietsch ( 112711 ) writes:
  
  Go Folkert! Your site is still standing, so let's wait and see what happens when your story hits the frontpage ;-)
What I would like to see (Score:2, Interesting)

by bhima ( 46039 ) writes:

The environment I work in is highly E-mail centric and I work on many projects. I would like to see some sort of Bayesian filtering employed to sort all of the e-mails I get into folders based on projects.
- Re:What I would like to see (Score:4, Informative)
  
  by tonkdude ( 806199 ) writes: on Wednesday December 29, 2004 @02:41PM (#11211260)
  
  I currently use CRM114, and on the mailing list someone (Evan Prodromou) has created a program that does just this using the CRM114 language. It is called "Monkeyplexer", based on the idea that you could train a monkey to sort your mailbox into folders.
  
  Pop over to the CRM114 site [sourceforge.net] and search the general list archives [sourceforge.net] for monkeyplexer to find the discussions about it.
  
  Here is the last version announcement that I could find in my mailbox:
  
  monkeyplexer is a tool for automatically sorting incoming email messages into appropriate folders. A new version of monkeyplexer, 0.7, is now available. http://bad.dynu.ca/~evan/monkeyplexer/monkeyplexer-0.7.tar.gz [bad.dynu.ca]
  
  This version includes the following changes:
  You can specify which mailboxes to use, instead of which mailboxes to exclude. This can save some typing and some time at runtime, at the expense of dynamically updating the list.
  You can tell the monkeytrainer to only train messages that were received in the last few weeks, days, hours, minutes -- whatever.
  The monkeyplexer remembers which messages have been trained for which folders. If you train a message for a different folder, the monkeyplexer will automatically forget the first folder before training for the new one.
  
  Thanks to everyone who has installed monkeyplexer already. I hope this new version helps some people out. I find it easier and more accurate.
  
  ~ESP
  
- Re:What I would like to see (Score:3, Informative)
  
  by rmohr02 ( 208447 ) writes:
  
  POPFile [sf.net] is exactly what you're looking for.
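The folder-sorting idea discussed in this sub-thread is the same naive Bayes technique with more than two classes. The following Python sketch shows the general shape only; it is not how Monkeyplexer, CRM114, or POPFile are implemented, and the class name `FolderSorter` and its methods are hypothetical.

```python
import math
from collections import Counter, defaultdict


class FolderSorter:
    """Multi-class naive Bayes: one bag of words per folder (illustrative)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word frequencies
        self.msg_counts = Counter()              # folder -> messages trained

    def train(self, folder, text):
        """Record one message as belonging to the given folder."""
        self.word_counts[folder].update(text.lower().split())
        self.msg_counts[folder] += 1

    def classify(self, text):
        """Return the folder with the highest smoothed log-probability."""
        if not self.msg_counts:
            raise ValueError("no training data")
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_msgs = sum(self.msg_counts.values())
        words = text.lower().split()
        best, best_score = None, -math.inf
        for folder, n in self.msg_counts.items():
            counts = self.word_counts[folder]
            denom = sum(counts.values()) + len(vocab) + 1
            score = math.log(n / total_msgs)  # class prior
            for w in words:
                score += math.log((counts[w] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best, best_score = folder, score
        return best


if __name__ == "__main__":
    s = FolderSorter()
    s.train("work", "quarterly report deadline meeting")
    s.train("work", "project meeting agenda")
    s.train("ebay", "your auction bid was outbid")
    s.train("ebay", "auction ending soon bid now")
    print(s.classify("meeting about the report"))
    print(s.classify("new bid on your auction"))
```

A real mail sorter would also tokenize headers and handle retraining when a message is moved, as the Monkeyplexer announcement describes; this sketch leaves that out.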
If this were Trek... (Score:5, Insightful)

by AndroidCat ( 229562 ) writes: on Wednesday December 29, 2004 @02:24PM (#11211052) Homepage

01:37 Overheat in plasma injector #1.
01:56 Plasma injector #1 offline, switching to #2 backup.
02:23 Overheat in plasma injector #2.
02:44 Failure to shutdown plasma injector #2.
02:58 Overheat in reactor core.
03:20 Containment weakening.
03:25 Containment weakening.
03:30 Containment weakening.
03:35 Five minutes to containment failure.
03:40 FIVE SECONDS TO WARP CORE BREACH!!!
Better be careful to train the filter about those warnings that don't happen very often, but when they do, you really want to know about them.

- Re:If this were Trek... (Score:3, Interesting)
  
  by aoteoroa ( 596031 ) writes:
  
  True. But if the Star Trek error log resembled real life then it might look more like:
  01:37 [error] Overheat in plasma injector #1.
  01:37 [warning] Cargo bay door 2 is open.
  01:38 [warning] Cargo bay door 2 is open.
  01:38 [warning] Oxygen sensor on deck 2 not responding.
  01:39 [warning] Cargo bay door 2 is open.
  01:40 [warning] Cargo bay door 2 is open.
  01:41 [warning] Oxygen sensor on deck 2 not responding.
  01:56 [error] Plasma injector #1 offline, switching to #2 backup.
  
  In other words real interesting errors i
  - Re:If this were Trek... (Score:2)
    
    by metalhed77 ( 250273 ) writes:
    
    If I were you I'd just write a special script to do that, probably far less hassle and more accurate.
  - Re:If this were Trek... (Score:1)
    
    by sahtanax ( 639159 ) writes:
    
    tail -f /foo/bar/mysite.com-error.log | grep -i php
    
    ... will show only the php errors. :-)
  - Re:If this were Trek... (Score:2)
    
    by ThisNukes4u ( 752508 ) writes:
    
    As others have said, it would be much easier to pipe that through grep or a perl script first instead.
  - Re:If this were Trek... (Score:1)
    
    by AndroidCat ( 229562 ) writes:
    
    Oh sure, I can definitely see where it would be useful for highlighting the important stuff in a sea of logs, and easier for crafting a general solution rather than a pile of rules and regexs.
    For my firewall sound effects program, I basically tail the ZoneAlarm logs, and play a selectable sound effect depending on the port/type. It's cute and even useful for detecting patterns (if you don't mind the noise), but I'm thinking about if Bayesian filtering could be applied to a real security report. It might be
  - Re:If this were Trek... (Score:2)
    
    by Wolfbone ( 668810 ) writes:
    
    As has been said: it's easiest just to write ad hoc filters for this sort of thing. This is what I've been using since my wmail stopped working properly:
    
    tail -F /var/log/messages | awk '/from=/,"\033[1;32m&\033[0m",$7); system("aplay sounds/newmail.wav&")}; {sub(/.*/,"\033[37m&",$1); sub(/.*/,"\033[33m&\033[0m",$3); print}'
    
    I use similar rules for alerts about SSH break-in attempts, mail relay probes and machine check exceptions. I know there are all sorts of sophisticated log analyzer and
    - Re:If this were Trek... (Score:2)
      
      by Wolfbone ( 668810 ) writes:
      
      Sorry - that didn't come out right because of the angle brackets in it:
      
      tail -F /var/log/messages | awk '/from=</ {sub(/<.*>/,"\033[1;32m&\033[0m",$7); system("aplay sounds/newmail.wav&")}; {sub(/.*/,"\033[37m&",$1); sub(/.*/,"\033[33m&\033[0m",$3); print}'
  - Re:If this were Trek... (Score:2)
    
    by Nevyn ( 5505 ) * writes:
    
    I do this kind of thing all the time, allow me to share...
    
    tail -f foo.log | grep "PHP Fatal error"
- Re:If this were Trek... (Score:2)
  
  by RAMMS+EIN ( 578166 ) writes:
  
  My guess is that you are probably exactly interested in the uncommon log messages.
  
  That said, I don't buy logfile filtering until I see it works. Sometimes you are interested in messages of one kind, sometimes in messages of another kind. I still think that fixed pattern matching can do the job better. Of course, that's what many people feel about spam filtering.
  - Re:If this were Trek... (Score:2)
    
    by ibbey ( 27873 ) * writes:
    
    That said, I don't buy logfile filtering until I see it works. Sometimes you are interested in messages of one kind, sometimes in messages of another kind. I still think that fixed pattern matching can do the job better. Of course, that's what many people feel about spam filtering.
    
    I think you're misinterpreting what the tool is meant for. Often, when you are looking at the logs, you are looking for something in particular. In those cases, as you suggest, grep is probably the best tool for the job. But, as
Bayesian is good for almost everything (Score:4, Interesting)

by Ki Master George ( 768244 ) writes: on Wednesday December 29, 2004 @02:59PM (#11211419)

Bayesian filtering could be used for lots of things outside of spam. One example could possibly be Wikis, determining spam from ham modifications (well, yes, it is spam here). I've had some other ideas that involve Bayesian, but they've escaped me for the moment.

- Bayesian is good for almost everything-Dessert. (Score:1, Interesting)
  
  by Anonymous Coward writes:
  
  "I've had some other ideas that involve Bayesian, but they've escaped me for the moment."
  
  Recovering the Slashdot lost since 2000, by eliminating most (-1) material e.g. GNAA, FP, etc. Eliminating the human bias in the moderation system (since client-side moderation is out). Tagging interesting material (a Bayesian agent).
- Re:Bayesian is good for almost everything (Score:3, Interesting)
  
  by dasunt ( 249686 ) writes:
  Bayesian filtering could be used for lots of things outside of spam. One example could possibly be Wikis, determining spam from ham modifications (well, yes, it is spam here). I've had some other ideas that involve Bayesian, but they've escaped me for the moment.
  
  Email sorting filters: imagine a Bayesian setup that can decide if a new mail should be sorted into "work", "friends", "ebay", "amazon", "project", etc.
  Interest filters: Run slashdot stories and comments through your own trained Bayesian so
- Re:Bayesian is good for almost everything (Score:2)
  
  by Phil John ( 576633 ) writes:
  
  I always thought a Bayesian (adult) web-site content filter would be a good idea, a la netnanny but without a canned list of "bad" URLs.
  
  In fact, I may just go and make a ff extension that does just that, hmmm, mebbe call it "NNSFW"?
- Re:This code belongs on (Score:3, Insightful)
  
  by rmohr02 ( 208447 ) writes:
  
  Give him a break--it is the first release, and I doubt he's had much feedback yet.
- Re: (Score:1)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re:This code belongs on (Score:2)
    
    by Juanvaldes ( 544895 ) writes:
    
    If anyone else is interested in what getopt is: GNU usage and example [gnu.org]
- Well, no it doesn't ... (Score:5, Insightful)
  
  by Chromodromic ( 668389 ) writes: on Wednesday December 29, 2004 @03:20PM (#11211634)
  
  All due respect, you're being a bit hard on the guy. He's not doing badly here.
  
  The [brackets] used in the usage message are standard in the Unix world for specifying an optional or default argument. Just look at any man page. So that, actually, is pretty straightforward. The name of the default config file would likely also be spelled out in the man page, which I would expect, so that's not confusing.
  
  As for changing the if construct into a switch, well, I'm trusting the accuracy of your excerpt, but I didn't find his code to be very difficult to read, to be honest, and certainly not a candidate for DailyWTF, which typically contains laughably horrible code.
  
  As far as other code may go, the guy states that this is in a nascent stage, so jumping on his source files seems like a bit of an easy shot :|
  
- Re:This code belongs on (Score:5, Insightful)
  
  by Hard_Code ( 49548 ) writes: on Wednesday December 29, 2004 @04:32PM (#11212588)
  
  That aside, your code would be easier to read (slashcode's broken formatting notwithstanding) if you used a switch construct.
  Speak for yourself. Given that the switch cases are all mutually exclusive, and disregarding the default case, there are only 2 paths, switch is more obfuscatory than clarifying in my opinion.
  
Reinvent the Wheel Much? (Score:5, Informative)

by runswithd6s ( 65165 ) writes: on Wednesday December 29, 2004 @05:28PM (#11213130) Homepage

(Stage Left) Enters the Controllable Regex Mutilator [sourceforge.net], crm114, with a noticeable strut. He's been there, done that.

CRM114 is a system to examine incoming e-mail, system log streams, data files or other data streams, and to sort, filter, or alter the incoming files or data streams according to the user's wildest desires. Criteria for categorization of data can be by satisfaction of regexes, by sparse binary polynomial matching with a Bayesian Chain Rule evaluator, a Hidden Markov Model, or by other means. Accuracy of the SBPH/BCR classifier has been seen in excess of 99 per cent, for 1/4 megabyte of learning text. In other words, CRM114 learns, and it learns fast.

Why learning with supervision? (Score:3, Interesting)

by MoobY ( 207480 ) writes: <anthonyNO@SPAMliekens.net> on Wednesday December 29, 2004 @06:39PM (#11213781) Homepage

I thought this app learned everything that was in the log, and then showed only the new, out-of-the-ordinary log entries that didn't quite fit in with the rest. This would allow it to filter freak events out of the log and show them to the user. How different would such an app be from the proposed btail? And how confident would you be about such an unsupervised log analyzer?
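One way to picture the unsupervised variant asked about here is a filter that needs no good/bad training sets and simply shows lines whose shape has rarely been seen before. The Python sketch below is a frequency-based novelty detector, not a Bayesian classifier and not anything btail is known to do; the names `NoveltyTail` and `template`, and the 5% threshold, are assumptions for illustration.

```python
import re
from collections import Counter


def template(line):
    """Collapse digit runs so lines differing only in numbers share a key."""
    return re.sub(r"\d+", "N", line.strip())


class NoveltyTail:
    """Flag lines whose template makes up a small share of lines seen so far."""

    def __init__(self, threshold=0.05):
        self.seen = Counter()
        self.total = 0
        self.threshold = threshold

    def observe(self, line):
        """Return True if the line is novel, then record it as seen."""
        key = template(line)
        share = self.seen[key] / self.total if self.total else 0.0
        novel = share < self.threshold
        self.seen[key] += 1
        self.total += 1
        return novel


if __name__ == "__main__":
    tail = NoveltyTail()
    log = [f"01:{m:02d} [warning] Cargo bay door 2 is open." for m in range(40)]
    log.append("03:40 FIVE SECONDS TO WARP CORE BREACH!!!")
    for line in log:
        if tail.observe(line):
            print(line)
```

The trade-off raised elsewhere in the thread applies directly: this approach surfaces rare events without training, but it cannot tell a rare harmless message from a rare critical one, and a critical message that repeats often enough will stop being shown.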

new pr0n! (Score:2)

by robdeadtech ( 232013 ) * writes:

do I get a discount if I already have a subscription to Black Tail?
Here's how to make this a lot more useful (Score:4, Interesting)

by Julian Morrison ( 5575 ) writes: on Thursday December 30, 2004 @02:21AM (#11216683)

Step 1: Allow the option to automatically discover and load canned training packages, eg: a directory under /etc. Make it automatically pick the right training file to use when called with a logfile (so eg: btail httpd.conf knows to look for the training for httpd.conf files).

Step 2: Include btail with major distros

Step 3: Any package for an app that generates logs can come with a ready-made canned training package, which gets dropped into the /etc directory.

That way, you could apt-get a package, start btail-ing its logfiles immediately without the need to tediously train the filter first. Training would still be possible, to personalise the filter.
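Step 1 of this proposal amounts to a lookup from a logfile name to a pre-built training set. A minimal Python sketch of that lookup follows. Everything specific in it is hypothetical: btail has no such feature according to the readme quoted earlier, and the directory `/etc/btail.d` and the `good`/`bad` file names are invented here purely to illustrate the idea.

```python
from pathlib import Path


def find_training(logfile, search_dirs=("/etc/btail.d",)):
    """Locate canned training files for a logfile by its base name.

    Looks for <dir>/<logfile name>/good and <dir>/<logfile name>/bad in each
    search directory and returns the first complete pair as (good, bad),
    or None if no package ships training data for this logfile.
    The layout is an assumption made for this sketch.
    """
    name = Path(logfile).name
    for directory in search_dirs:
        candidate = Path(directory) / name
        good, bad = candidate / "good", candidate / "bad"
        if good.is_file() and bad.is_file():
            return good, bad
    return None


if __name__ == "__main__":
    found = find_training("/var/log/httpd/access.log")
    if found is None:
        print("no canned training found; train manually first")
    else:
        print("using training files:", *found)
```

Accepting several search directories would let a user's personal training directory be listed ahead of the distro-supplied one, which matches the comment's point that training should remain possible to personalise the filter.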

Not for Me (Score:1, Insightful)

by schestowitz ( 843559 ) writes:

People who monitor log files know best where to look and what to ignore. It is better to incorporate filtering into the application that generates the logs.
- throw out the baby with the bathwater, will you? (Score:2)
  
  by Roman_(ajvvs) ( 722885 ) writes:
  
  I don't understand how your suggestion fits in with your initial statement, or as a comment on the usefulness of a Bayesian log filter. It's true that people know best. But after scrolling through the same "Operation completed with errors" line time and time again, the minute effort required adds up. Even a simple automated filter can assist, which this person has implemented.
  It is better to incorporate filtering into the application that generates the logs.
  That's akin to only filling a dictionary with
  - Spellcheck in Firefox (Score:1)
    
    by schestowitz ( 843559 ) writes:
    
    Yes, I contradicted myself somehow. You need to look at the two parts of my reply separately. About Firefox, try what I do and use kedit or the like to run a spellchecker (ALT+T+S). It takes 2 seconds to copy and paste text and I invoke kedit using CTRL+ALT+E (xbindkeys).
Bayesian (Score:2, Interesting)

by inertia187 ( 156602 ) writes:

Bayesian tail might be neat. I like the idea of broadening the use, but I'd much rather see bayesian filters used on my in-box for more than just spam. I envision a filter that would sort out e-mails based on subject matter. This would have the net effect of improving the filter technology because it's trying to sort e-mails you actually want to look at.

We all know that if the filter makes a mistake and hides a message in the Spam box, and chances are you'll might miss many of them, another the chance t
Bayesian AIM bot (Score:3, Interesting)

by duncangough ( 530657 ) writes: on Friday December 31, 2004 @05:31AM (#11226434) Homepage

I love Bayes stuff - and there's a very nice Python module written by divmod [divmod.org].

I was playing around with AIML to cobble together a basic chat bot when I realised that I could use a Bayesian parser to radically cut down the amount of AIML that I needed to write. AIML is an XML style of chat bot responses; it's clever in that it's highly recursive, but the downside is that you need to create a rule for every eventuality.

By adding in a bit of Bayesian guessing before the AIML parser got its hands on the conversation, I'm able to keep the AIML files very focused and give the chat bot a bit more sparkle - you don't have to train him about everything. After a while he realised that 'yo', 'hi' and 'hello' are all the same thing, so he just guesses that you're saying hello and pulls out the correct response from the AIML file (rather than creating an AIML rule to deal with all the variations on 'hello').

If you're interested I'd strongly recommend installing GrokitBot. You can get the source and a bit more explanation at my site, Suttree.com [suttree.com]

Playaholics : Free Online Games [playaholics.com]
