Data Munging with Perl

Posted by timothy on Thursday April 26, 2001 @11:45AM from the can-they-say-that-on-tv? dept.

For those inundated with data -- numbers, names, dates, temperatures, colors, seismographic sensor output, voting records(!), or anything else -- the paltry concerns of user interface may be less important than the assurance that they can make something useful from all that stuff. Data munger extraordinaire chromatic has again delivered his insightful dissection of a programming book aimed at people with Perl knowledge and a lot of data to wade through, and no, it's not from O'Reilly. Maybe it's for you.

Data Munging with Perl
author	David Cross
pages	283
publisher	Manning Publications
rating	9
reviewer	chromatic
ISBN	1-930110-00-6
summary	Dave explores Perl's unique and compelling abilities to manage and manipulate data of all types, sizes, and shades.

The Scoop

Larry Wall, so goes the story, needed to glue together two systems on opposite sides of the country. Calling on the virtues of Laziness (why throw together something for just one job?) and Hubris (why not write a new language?), he created Perl. Though it's found new niches in the post-web world, Perl earns its bread and butter munging data.

Dave Cross has put together a friendly and handy compendium of techniques, tricks, and best practices. Suitable for raw novices to experienced intermediates, Data Munging with Perl is a gentle but firm romp from flat text, past structured and binary files, to the realm of custom parsers. Clean examples and lots of modules accompany the explanations.

What's to Like?

The book plots a natural course through topics ordered by complexity. It opens with a theoretical overview of data processing. This introduces terminology and outlines the general types of data one might encounter. Additionally, the author writes with the authority of experience when exploring the basic approaches and best practices. While other books aimed at novice users shy away from programs-as-filters and data structures, Cross prefers to instill good habits from the start.

Beyond munging data, the book provides a decent introduction to idiomatic and effective Perl programming. While the brief tutorial won't magically produce new JAPHs, the thoughtful and continual devotion to good technique and skill will inspire smarter programmers. More important than knowing many useful tricks is knowing when and how to use a handful of tools -- and where to go for more.

The overall level of quality is excellent. The binary data chapter stands out as the clearest explanation available, and the information on munging dates and times will save readers plenty of grief. Additionally, the entire parsing section introduces a handful of powerful but sorely-underused tools to handle HTML, XML, and even creating custom parsers. Rounding out the curriculum is an appendix that explores the larger modules, mentioned earlier, in more detail (XML::Parser, DBI, Date::Manip).

What's to Consider?

Only two things might turn readers from this book. The first is its deceptive length. Though the book is short, the examples are clear and the text packs a lot of wallop into what's there. Careful readers who follow the links to other resources will have little trouble supplementing their education. (On the other hand, another ten pages describing Parse::RecDescent would have been a nice addition. Still, it's hard to fault the author for deferring to the module's voluminous documentation.)

Second, longtime Perl programmers may find little new material, particularly if they are familiar with the wealth of modules on the CPAN. The intended audience is clearly new and underexperienced programmers. While there's plenty of good advice presented well, the book falls more toward the tutorial side of the aisle than the reference section. This does not detract from the book, but it does narrow the base of potential readers slightly.

The Summary

Manning Publications continues its fine line of Perl books with the consistent and powerful Data Munging with Perl. Coders looking to transform data somehow and hackers who want to take advantage of Perl's unique features will improve their knowledge and understanding. If you find yourself working with files or records in Perl, this book will save you time and trouble.

Introduction
1. Data, data munging, and Perl
2. General practices to use when munging data
3. Generally useful Perl idioms
4. Pattern matching
Data Munging
1. Unstructured data
2. Record-oriented data
3. Fixed-width & binary data
Simple Data Parsing
1. More complex data formats
2. HTML
3. XML
4. Building your own parsers
Conclusion
1. Looking back -- and ahead
Appendices
1. Modules reference
2. Essential Perl

You can purchase this book at ThinkGeek.

This discussion has been archived. No new comments can be posted.

Re:The power of paper? (Score:1)

by Anonymous Coward writes:

Yes, but you can take books and loan them to your friends once you've mastered the contents. Now you are adding to someone else's earning potential and putting them in your debt by doing so.
Why is this important? I've never gotten a good job out of an ad. I only ever have gotten good jobs by friends referring me to other people. For those of you who've been around in the industry for a while, you know this is true...
Re:good to see (Score:1)

by Jason Earl ( 1894 ) writes:

Agreed. I love Python, but I still use Perl for data munging. It is without a doubt the best text processing tool available (and that's not exactly a niche market either).
Re:Perl vs. Python (Score:1)

by Mandrias ( 5341 ) writes:

Offtopic? This is a legit question that I've never seen answered to my satisfaction. Must I ask Slashdot for an answer? Because God forbid I ask this in a Perl story's comments.

Sheesh.
Re:The power of paper? (Score:1)

by Luke ( 7869 ) writes:

Offtopic? How about "Good Idea"?
Question (Score:1)

by weston ( 16146 ) writes:

What is "unstructured" data?

Just wondering....

Re:The power of paper? (Score:1)

by TetsuoShima ( 34625 ) writes:

How about retarded idea?

The original contention was that texts online are FREE, so buying a 2nd monitor (vs a book) doesn't support that argument at all.
Re:Question (Score:1)

by babbage ( 61057 ) writes:

Right. Sometimes I can't understand these posts either, but Buffy can...
Perl needs a Parser::Pony::Postings module...
(Hi Dave... :)
Re:Question (Score:1)

by babbage ( 61057 ) writes:

Agreed -- maybe I shouldn't have used it as an example, but I was trying to think of better ones & couldn't (& still can't). In a way, XML falls into the binary category in the same way that sonnets do -- it has a regularly defined structure that is, in a way, self-describing and thus structured, but compared to CSV or fixed-length formats it's very unstructured. I probably should have come up with a better example.
Re:The power of paper? (Score:1)

by emmons ( 94632 ) writes:

True, but you can't learn how to program more effectively by browsing a hierarchy of perl modules.

BTW: This two minute posting limit is REALLY annoying!

Re:yep, it's a good book but not the CookBook (Score:1)

by wganz ( 113345 ) writes:

The Cookbook is in dire need of a new edition. The latest print edition is circa 1998 and there are about 25 pages of errata. My recommendation is for everyone to hold off buying O'Reilly's Perl Cookbook until they release a new edition.
Re:The power of cntrl-F (Score:1)

by phossie ( 118421 ) writes:

to each their own - until a really, really good AI is incorporated into some browser's search function, i'll be using printed text when i can: my eyes (connected to my brain) can see if something on-topic is mentioned that i wasn't specifically searching for, whereas "Find" will just tell me what i want isn't there. as the previous poster mentioned, eyes sometimes do better with printed material vs. display, and my eyes/brain are like that.
Re:Question (Score:1)

by DrHyde ( 134602 ) writes:

I think the point is that whilst the encoding may be highly structured (such as ASCII) the data that is encoded (such as a posting on /.) is not. Your computer can not understand this post.
Re:The power of paper? (Score:1)

by toybuilder ( 161045 ) writes:

A book is nice on a 6-hour flight from Los Angeles to London. And I don't panic if I drop the book.
Re:yep, it's a good book (Score:1)

by cDarwin ( 161053 ) writes:

If this book is 1/10th as clear and useful as Object Oriented Perl (Damian Conway, Manning 2000), it's well worth the money.

good to see (Score:1)

by spankfish ( 167192 ) writes:

Perl is very mungey indeed. Can't think of a language I've found more useful for data format massage.

Re:Boycott This Book!!! (Score:1)

by Mactire_Dearg ( 211446 ) writes:

So, let me understand this correctly. You think all information should be free unless that information is something you don't want someone else using because you don't trust their motivations?
Nice internal conflict you have there...
So who should judge who can get information and who can't?
books are fine... but... (Score:1)

by Corporate Gadfly ( 227676 ) writes:

... they don't help you get out of a jam. They are great for learning about a topic from scratch and to build concepts. Where most books in general lack is the mapping to reality. That's when deja used to come in handy. groups.google is adequate in some respects but the interface is not as efficient and the archive is not that big. Oh how I wish that google would have left the old deja interface intact for now.
Re:yep, it's a good book (Score:1)

by behindthewall ( 231520 ) writes:

Looking forward to that.
Re:Answer by example (Score:1)

by vla1den ( 233261 ) writes:

Whether data is structured or not depends on who is reading it. It does not make any sense to say, "Word processing files are unstructured". A file is structured enough to display its content, so for Microsoft's programmer it is well-structured data. It can, however, contain a lot of crappy text, so from the reader's point of view it'll be unstructured. You should think about what people are going to do with the data before you structure it.
Re:The power of paper? (Score:1)

by chuqui ( 264912 ) writes:

> Get a second monitor to read documentation from. Not only would it pay for itself within 4 books,

you go ahead and buy the monitor. I'll buy the book, tell my boss I'm researching my latest project, and you'll find me out in the park by the lake with my laptop while you're stuck in your office...

Oh, and you might want to go research the retention studies that compare how well people remember what's learned on a monitor vs printed material. I'm sure there are some nice references online (I know I've seen them, but I've forgotten where...)
Follow that link (or was it kink...) (Score:1)

by abumarie ( 306669 ) writes:
Error 404: Doh! Whatever you requested exists not. Maybe it fell into the bit-bucket. Maybe there is a typo in how the document was being accessed. If this error persists, please contact webmonster@thinkgeek.com [mailto].
- ThinkGeek Main page [thinkgeek.com]
(Error 404: File or resource not found)
Re:So let me get this straight... (Score:1)

by abumarie ( 306669 ) writes:

to the birdbrain who moderated this. you wouldn't know a pun if it hit you between the eyes...
Re:So let me get this straight... (Score:1)

by abumarie ( 306669 ) writes:

I intend to leave this line of argument before it Snobols
Re:Answer by example (Score:2)

by Anonymous Coward writes:

Word processing files can be structured. One based upon SGML, XML or even WordPerfect's system of tags would be an example.
Or even, *cough*, MS-Word files. A better example of an unstructured file might be a straight ASCII text file (although nitpickers might point out that a straight text file could contain structured elements, e.g. SGML, XML, HTML).
Re:Question (Score:2)

by pod ( 1103 ) writes:

I would say the binary data you mention (sound, image, video) is very structured. You may have tried writing a display utility for TIFF or AVI or DOC, and this is extremely hard because those formats don't actually encode the content; they're just containers for other formats. You can write a loader or parser for a media file almost blindly, while the actual contents are often encoded in various formats (witness DOC, TIFF, AVI, and, uh, what's a good sound container format? PCM?), which themselves are also very well defined and structured.
Re:Answer by example (Score:2)

by sql*kitten ( 1359 ) writes:

Word processing files are unstructured.
Not if you write them properly, for example, if it's a heading, actually set it to be a heading, don't just make it bold. Good word processors support this, and that's why you can use them to write books and legal documents, which need to be maintained and updated just as programs do. Word even has built-in version control!
Manning should choose a new cover theme (Score:2)

by Zico ( 14255 ) writes:

Don't judge a book by its cover, I know, but I still think they'd sell a lot more of their books (I have two, possibly three), if they'd get some kind of cover scheme which displays anything other than those fruity portraits they're using now.

Cheers,
RIAA and Perl Data Munging (Score:2)

by KFury ( 19522 ) writes:

I'm surprised the RIAA hasn't filed an injunction against educating people how to data munge with Perl. After all, data munging is a known method of removing SDMI encryption^H^H^H^H^H^H^H protection...

Kevin Fox
--
Re:Answer by example (Score:2)

by gorilla ( 36491 ) writes:

Word processing files can be structured. One based upon SGML, XML or even WordPerfect's system of tags would be an example.
Re:Boycott This Book!!! (Score:2)

by babbage ( 61057 ) writes:

Let me get this straight. Somebody publishes a book telling everyone -- including you -- how to use Perl to analyse & leverage your data, and you want to censor it because people you don't like could use the same techniques? How exactly does this help Free Software?
I thought the GPL [gnu.org] implies that you shouldn't discriminate who gets to use your code, and I thought that the Open Source Definition [opensource.org] explicitly says that one "[5] must not discriminate against any person or group of persons [and [6]] must not restrict anyone from making use of the program in a specific field of endeavor."
I'm looking & looking, but I just don't see anything anywhere about it being okay to only pay attention to the parts you find convenient or expedient. Maybe you can point me in the right direction here?
In the meantime, this is for me, a non-spammer, regular working shmoe, a very educational & useful book. I'm not gonna support a boycott of it just because it doesn't jibe with your situational ethics...
Re:Question (Score:2)

by babbage ( 61057 ) writes:

Well, there is a degree of structure to it, but it's not nearly as regular as other examples. The point I was trying to make is that, given that there is a range of "structuredness", binary formats as a whole generally fall somewhere in the middle of the range, with some versions falling more to one side or the other.
Ascii is so structured that decoding it is trivial. Unicode is still structured, but not as trivial. Mp3 might be much more hairy, and then the ones you describe almost sound like "meta-formats", which provide a framework for bundling other formats together -- thus leading to high level structure & low level disorder, or at least complex & hard to decode order.
Re:The power of cntrl-F (Score:2)

by passion ( 84900 ) writes:

So when I'm thumbing through my book, looking for a specific phrase, I have to algorithmically and frantically scan every page in the book. This also introduces the risk that my feeble organic eyes might actually miss the word.

I prefer using my browser's control-F "find" feature, or grep, or what have you to pick out the key word(s) of my current interest.
The magic of a book (Score:2)

by BierGuzzl ( 92635 ) writes:

There's most definitely something about a book that digitized media just can't replace. It has a distinctive smell, it doesn't require batteries or a power adapter, and it doesn't expose you to radiation.
Reading a book is something that'll mentally bring you back into your inner classroom, just like smelling a box of crayons will bring you back to your inner child. Make no mistake about it, there's definitely magic in dem dar books.
Re:Boycott This Book!!! (Score:2)

by TennesseeJed ( 110599 ) writes:

It's not the tools or technology that is ever to blame, it is the misuse of same that is the issue.
Re:The power of paper? (Score:2)

by andy@petdance.com ( 114827 ) writes:

I'm just curious what can be found in a paper tomb that cannot be cobbled together from various up-to-date and *free* sources from the web.
"Death Of Books Imminent: Film At 11"
Personally, I mark up my books all the time. You know, with ink. Like when I made a big red circle & arrow pointing to the part in Unix In A Nutshell that reminds me that for join to work, the input files have to be sorted.
Re:Unstructured data (Score:2)

by Animats ( 122034 ) writes:
Several operations in Perl are more expensive than they should be:
- Removing one character from the beginning of a string (this shifts the whole string down one).
- Subscripting a string. (There's SUBSTR, but it's overly general for such a primitive function.)
- Fanning out on cases (switch or case statements).
- Operations on single characters generally (there's no char type, and single character strings are expensive.)
- Using a character as an index into a table.
Unfortunately, those are the operations that are in the inner loop of a tokenizer for any formal language (C, HTML, etc.)
A fast built-in that returned the numeric value (as ORD does) of character N of a string would be clunky, but would provide a way to speed things up without going outside Perl. The Perl-ish way for doing such things involves regular expressions, and that's an example of "if the only tool you have is a hammer, everything looks like a nail".
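For readers who have not written one, the inner loop being described (fetch the next character, look up its class in a table, branch on the class) looks roughly like the sketch below. It is written in Python only to show the shape of the loop and makes no claim about Perl's or Python's speed; the names `CHAR_CLASS` and `tokenize` and the four character classes are invented for this illustration.

```python
# Hypothetical sketch of a table-driven tokenizer inner loop.
# CHAR_CLASS and tokenize are illustrative names, not from any library.

LETTER, DIGIT, SPACE, OTHER = range(4)

# Precompute a class for every 8-bit character code, so the inner loop
# does one table lookup per character instead of several separate tests.
CHAR_CLASS = [OTHER] * 256
for code in range(256):
    ch = chr(code)
    if ch.isascii() and ch.isalpha():
        CHAR_CLASS[code] = LETTER
    elif ch.isascii() and ch.isdigit():
        CHAR_CLASS[code] = DIGIT
    elif ch in " \t\r\n":
        CHAR_CLASS[code] = SPACE


def char_class(ch):
    """Use the character's code as an index into the class table."""
    code = ord(ch)
    return CHAR_CLASS[code] if code < 256 else OTHER


def tokenize(text):
    """Split text into (kind, value) tokens by branching on character class."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        cls = char_class(text[i])
        if cls == SPACE:
            i += 1  # whitespace separates tokens and is dropped
        elif cls in (LETTER, DIGIT):
            start = i
            # Consume the whole run of characters in the same class.
            while i < n and char_class(text[i]) == cls:
                i += 1
            kind = "word" if cls == LETTER else "number"
            tokens.append((kind, text[start:i]))
        else:
            tokens.append(("punct", text[i]))
            i += 1
    return tokens
```

Calling `tokenize("<b>42 cats</b>")` yields one token per tag delimiter, word, and number. Each of the operations listed above (indexing a string, taking a character's numeric value, using it as a table index, branching on the result) happens once per input character, which is why their individual cost dominates.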
Re:Unstructured data...FOOL (Score:2)

by Animats ( 122034 ) writes:

Write a tokenizer that way, and it's even worse. You end up applying several regular expressions per character processed. I've seen an HTML parser written that way. It's very clever. It's only about 30 lines of code. It takes forever.
Theoretically, a highly optimizing regular expression compiler that looked at multiple statements containing regular expressions could generate an efficient tokenizer from such Perl code, but that's not what's inside the Perl engine.
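A common alternative to trying several separate patterns at each position is to fold them into one alternation and scan the input once. The sketch below shows that idea in Python under heavy assumptions: the toy grammar (open tag, close tag, text, stray `<`) is made up for the example, it is nowhere near a real HTML parser, and it says nothing about how HTML::Parser or the parser described above actually work.

```python
import re

# One combined pattern: each alternative is a named group, and the regex
# engine picks whichever alternative matches at the current position.
TOKEN_RE = re.compile(
    r"(?P<close></[A-Za-z][^>]*>)"   # closing tag such as </b>
    r"|(?P<open><[A-Za-z][^>]*>)"    # opening tag such as <b> or <a href="x">
    r"|(?P<text>[^<]+)"              # run of text up to the next '<'
    r"|(?P<stray><)"                 # a lone '<' that starts no tag
)


def scan(markup):
    """Return (kind, value) pairs from a single left-to-right pass."""
    # m.lastgroup is the name of the alternative that matched.
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(markup)]
```

For example, `scan("<p>Hi <b>there</b></p>")` produces open, text, open, text, close, close tokens in order. The design point is only that the pattern is compiled once and the input is walked once, rather than re-testing every pattern at every character.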
Re:The power of paper? (Score:2)

by cDarwin ( 161053 ) writes:

I did an extended consulting gig in a third world country a couple of years ago. The people I worked with there were all very intelligent, and had a good general knowledge of computer science (the kind you get from going to school). But, their knowledge of specific technologies (like perl and EJB) was very spotty. This, I found, was directly traceable to the fact that they got all of their knowledge of these subjects from the Web. They couldn't afford to buy the books, you see.
A good computer book provides thorough end to end coverage of a subject (a great one lays it out in a way that is easy to understand and possibly fun).

Re:Question (Score:2)

by boaworm ( 180781 ) writes:

>> XML is structured data.
Well, that isn't correct. XML is a markup language (eXtensible Markup Language) which very well can be used to represent both structured and unstructured data.
It all depends on what query language you use for XML; there are several different ones right now.
For unstructured data (parsing and transforming documents) XSL [w3.org], XQL [w3.org] etc are useful.
For structured data, check out XML-QL [w3.org].
Re:Question (Score:2)

by boaworm ( 180781 ) writes:

Structured data is data which is related to other data. Databases are typical examples of structured data being stored. Relations and couplings between different data.
The opposite, unstructured data, is simply when the data is not related to other parts. As someone mentioned, a plain document is a good example of unstructured data.
Re:The power of paper? (Score:2)

by perlyking ( 198166 ) writes:
Here are some reasons I buy books occasionally:
- You can read them in the bathroom.
- It would cost me more in time locating and collating from the various sources than it would buying a decent book.
- I have a crap memory and having everything to hand in a nicely indexed book helps make up for it :-)
These are personal reasons though, I guess different people have different requirements.
Re:Question (Score:2)

by perlyking ( 198166 ) writes:

Maybe the kind of data a thirty year old temp who is scared of computers enters?
I have seen a LOT of weird data in my time... :-)
Re:The power of paper? (Score:2)

by enigma42 ( 207185 ) writes:

You should try http://search.cpan.org/ It's a great resource for finding what module does what. You can search for keywords or browse the hierarchy. It even links to the man pages for a particular module so you can get up to speed without downloading anything.
Re:Question (Score:2)

by _N0EL ( 245472 ) writes:

Many links on XML here [techtarget.com].
Re:Manning should choose a new cover theme (Score:3)

by Anonymous Coward writes: on Thursday April 26, 2001 @11:50AM (#263571)

Hey! I happen to like those fruity portraits! It's what made "Object Oriented Perl" by Damian Conway stand out on the bookshelf. That book is invaluable.

Although I think "Data Munging with Perl" is probably less useful to me, and I probably won't get it, the cover is cool.

By your use of the word "fruity" I take it to mean you are afraid of looking queer. Not to worry, I'm sure you already do.

Re:Unstructured data (Score:3)

by Matts ( 1628 ) writes: on Thursday April 26, 2001 @09:23AM (#263572) Homepage

The current version of HTML::Parser is written in C, with an XS interface. Maybe you ought to pay a visit to CPAN?

Effective Perl Programming (Score:3)

by jjohn ( 2991 ) writes: on Thursday April 26, 2001 @10:19AM (#263573) Homepage Journal

Author: Joseph Hall (with Randal Schwartz)
ISBN: 0201419750
Publisher: Addison Wesley (1998)

Fun collection of Perl idioms and some good stuff on h2xs.

Re:Boycott This Book!!! (Score:3)

by el_nino ( 4271 ) writes: on Thursday April 26, 2001 @08:25AM (#263574) Homepage Journal

I'm pretty sure this is a troll, but munging has nothing to do with data mining. This is what the Jargon File has to say on the word 'munge':

munge /muhnj/ vt. 1. [derogatory] To imperfectly transform
information. 2. A comprehensive rewrite of a routine, data structure or
the whole program. 3. To modify data in some way the speaker doesn't
need to go into right now or cannot describe succinctly (compare
{mumble}). 4. To add {spamblock} to an email address.

In this case Dave means 'doing stuff with data' akin to the Jargon File's third definition of the word.
--
Niklas Nordebo | niklas at nordebo.com

Re:The power of paper? (Score:3)

by elmegil ( 12001 ) writes: on Thursday April 26, 2001 @09:16AM (#263575) Homepage Journal

You know, it's kinda hard to balance that laptop on the sink when I'm in the john.
Books don't require batteries that might run down, or suffer from any of a dozen other complaints against the "true portability" of electronic systems for getting documentation.
If you're happy with phosphors, more power to you, but if I want a reference, I want one that is as portable as I am; one without leashes to the power grid (even if they're only intermittent), and one with some editorial intelligence up front to filter down to the topics I care about, rather than a kitchen soup like the web where I have to sift through 10000 google hits to find the page that really answers my question.

Answer by example (Score:3)

by wiredog ( 43288 ) writes: on Thursday April 26, 2001 @08:02AM (#263576) Journal

Spreadsheets and database tables are structured data. Word processing files are unstructured.

Re:Unstructured data (Score:3)

by ikekrull ( 59661 ) writes: on Thursday April 26, 2001 @05:40PM (#263577) Homepage

Embarrassing? Nope.. C is used for speed.

The people who create and maintain perl are smart enough to realise that no tool is universally useful.

Mixes of C and perl simply require the appropriate compiled .pm to be put in one of Perl's library folders. Just put the files in the right directories and run.

You can precompile a binary .pm for each of your platforms and distribute it if you like, or you can just not use the C modules. Of course, you pay a price in speed for using only perl, but you can't have it all.

If ease of distribution is paramount, write the parser in C, embed a perl interpreter in it and code the perl portion appropriately.

Re:The power of paper? (Score:3)

by graxrmelg ( 71438 ) writes: on Thursday April 26, 2001 @09:11AM (#263578)

OT Note: the correct term is tome ... not tomb (which is where somebody is buried).

I figured the use of "tomb" was intentional. After all, it's where you put dead trees.

Re:Unstructured data (Score:3)

by Animats ( 122034 ) writes: on Thursday April 26, 2001 @12:26PM (#263579) Homepage

The current version of HTML::Parser is written in C...
Yes. It's embarrassing that Perl needs help from C to ... manipulate strings.
(I don't want to use that because the mix of C and Perl makes portability more difficult. All-Perl code you just put in the right directory and run. Mixes of C and Perl require compilers, package managers, makefiles, and installers. The target is shared-hosting services, where users may not have shell access. It's seriously annoying that Perl does this simple operation so slowly.)

Physical quality of Manning's books? (Score:3)

by bheckel ( 128323 ) writes: <`moc.liamg' `ta' `todhsals+lekceh.b'> on Thursday April 26, 2001 @12:02PM (#263580) Homepage Journal

Manning's Object Oriented Perl is a great book with a terribly cheap binding. My copy is already falling apart. Anyone know if they have improved the binding quality on this one?

Re:The power of paper? (Score:4)

by rho ( 6063 ) writes: on Thursday April 26, 2001 @11:35AM (#263581) Journal

The best part about a book -- a well written book, not a "How to Be an Unleashed Dummy in 21 Days" book -- is the time and care put into it by a host of professionals, whereas a Web resource tends to be cobbled together from a community of geniuses and idiots alike.
Look at Slashdot -- some of it is great, some of it would wither a pile of dog poo it's so bad. php.net is similar -- the function reference is good if you're looking for arguments to a rarely used function, but the user-contributed stuff is off-and-on useful.
That's partially why you pay $50 for a good tech book -- the team of people needed to put together a *good* book is quite expensive. You need a knowledgeable author, a clued-in editor, a savvy fact-checker... all these people cost money.

"Beware by whom you are called sane."

Re:Question (Score:4)

by holzp ( 87423 ) writes: on Thursday April 26, 2001 @07:58AM (#263582)

take a look at the source of a perl program.

yep, it's a good book (Score:4)

by jacobito ( 95519 ) writes: on Thursday April 26, 2001 @08:17AM (#263583) Homepage

Along with the Camel, "Effective Perl Programming" (Addison/Wesley, don't remember author's name), and the "Perl Cookbook," this has been one of my favorite programming books. Mind you, I'm not a seasoned hacker, so YMMV. But for anyone who already understands the basics of Perl, this book is a great way to learn something practical.

Like Chromatic, though, I really wished that the section on Parse::RecDescent had been longer...

Unstructured data (Score:4)

by Animats ( 122034 ) writes: on Thursday April 26, 2001 @08:04AM (#263584) Homepage

In this context, "unstructured data" often refers to text in a natural language. An SEC filing [sec.gov] is a good example of data with enough structure that machine processing is possible, but not enough that it's easy.
We have an engine which processes such data, but it's slow, because it's in Perl. Most of the time goes into modules recommended in this book, like HTML::Parser. The big problem is that simple tokenizing, like extracting HTML tags, is incredibly slow in Perl. The classic "get next character, get character class for character, switch on character class" operation is something Perl does very badly.
Yes, you can write low-level C functions and call them from Perl to deal with such problems, but that kills portability.

Re:The power of paper? (Score:4)

by rgmoore ( 133276 ) writes: <glandauer@charter.net> on Thursday April 26, 2001 @09:05AM (#263585) Homepage

One thing that I haven't seen mentioned yet is that books are easier to read than monitors. Monitors just can't match a book's DPI, and the higher resolution of the printed page can actually improve reading speed and retention and reduce eye strain. That may or may not be a big issue for you, but it can be a big deal and a reasonable justification for the extra expense. Another advantage of a printed book is that the author has already gone to the trouble of cobbling together the data for you so that you don't have to spend your time scrounging the web for it; if you're a consultant getting paid $100 per hour it doesn't take much time scouring the web for information to add up to more than the cost of the book.
OT Note: the correct term is tome (from the Greek word meaning to cut, and the same root as in medical procedures ending in -otomy, as tomes were originally produced by cutting a long scroll into smaller sections) not tomb (which is where somebody is buried).

Boycott This Book!!! (Score:4)

by none2222 ( 161746 ) writes: on Thursday April 26, 2001 @08:10AM (#263586)

Have you stopped to consider the consequences of the information contained in books like this? This type of effort should not be supported by the Free Software community.
Books like this give corporations the tools they need to destroy our privacy and strip us of our rights. How do you think Double Click puts the information about you it sells into useable form? With techniques it learns from this type of book. Same goes for the corporate websites you visit, your supermarket, etc.
Information wants to be free, but not the information in this book. Data mining and Data munging techniques should never have left the hallowed halls of academe. Once they enter the public domain, they are immediately exploited by greedy corporations. The author should have thought about that before writing a book like this.
If you buy or support books like this, you have lost any right to complain about your privacy being violated. If you are serious about privacy, boycott this book!

Re:The power of paper? (Score:5)

by interiot ( 50685 ) writes: on Thursday April 26, 2001 @08:38AM (#263587) Homepage

Get a second monitor to read documentation from. Not only would it pay for itself within 4 books, but it's more useful than a stack of spent books.

Re:Question (Score:5)

by babbage ( 61057 ) writes: <cdevers@cis.us[ ]hal.edu ['out' in gap]> on Thursday April 26, 2001 @08:12AM (#263588) Homepage Journal

XML is structured data.
Log files are generally fairly structured data.
CSV files are structured data.
Free flowing ASCII text is unstructured data.
Shakespeare's sonnets, however well formed, are unstructured data
(unless you can come up with a parser that recognizes iambic pentameter... :).
Falling somewhere in the middle is binary data. It has a structured format but freeform contents. Consider the various sound, image, and video formats. Maybe Shakespeare's sonnets could fall into this category too... :)
There are situations where you could want to analyze each form. Parsing Apache log files is a slightly different task than analysing formal XML documents or sloppy HTML pages or messy ASCII email. This book helps give you a feel for which situation you may be dealing with, and thus what tools & techniques might be useful for that situation.
Though some will tell you otherwise, this book has nothing to do with "Buffy the Vampire Slayer." Sorry, grep.
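
A rough sketch of the distinction drawn in the comment above, written in Python purely for illustration (the book itself uses Perl). All sample inputs, the function names, and the log pattern are assumptions invented here, not taken from the book or the comment. CSV gets a real parser, a log line gets a regular expression with named fields, and free text gets nothing better than tokenizing and counting.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample inputs, made up for this sketch.
CSV_TEXT = "name,score\nalice,5\nbob,4\n"
LOG_LINE = ('127.0.0.1 - - [26/Apr/2001:08:12:00 -0400] '
            '"GET /index.html HTTP/1.0" 200 2326')
FREE_TEXT = "Shall I compare thee to a summer's day? Thou art more lovely"

# Assumed pattern for a common-log-style line; real logs vary by configuration.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_csv(text):
    """Structured data: let a proper CSV parser build one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_log_line(line):
    """Fairly structured data: pull named fields out with a regex.

    Returns None when the line does not fit the assumed layout.
    """
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def word_counts(text):
    """Unstructured data: no fields to extract, so tokenize and count."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


if __name__ == "__main__":
    print(parse_csv(CSV_TEXT))
    print(parse_log_line(LOG_LINE))
    print(word_counts(FREE_TEXT).most_common(3))
```

The point of the three functions is only that the tool changes with the shape of the input. Sloppy HTML or messy email would likely need something more forgiving than any of these.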

Re:yep, it's a good book (Score:5)

by thoughtstream ( 140380 ) writes: on Thursday April 26, 2001 @11:45AM (#263589)

Like Chromatic, though, I really wished that the section on Parse::RecDescent had been longer...

Be careful what you wish for...
Next year I'll be writing a book about Parse::RecDescent (or its successor Parse::FastDescent) and grammatical parsing techniques.
Damian

The power of paper? (Score:5)

by tenzig_112 ( 213387 ) writes: on Thursday April 26, 2001 @08:12AM (#263590) Homepage

This is not flame bait. I'm just curious what can be found in a paper tome that cannot be cobbled together from various up-to-date and *free* sources from the web.
Perhaps I'm still stuck in the paper age (somewhere between bronze & silicon), but I find myself spending $50 a pop for programming books I only skim through. If I need reference material, I hit PHP.net (for my PHP projects [ridiculopathy.com]).
Am I missing something?

Re:The power of paper? (Score:5)

by rfsayre ( 255559 ) writes: on Thursday April 26, 2001 @08:27AM (#263591) Homepage

Yes, you are missing something. You're absolutely right that you can get all the reference material you need on the web. That's what it does best. However, when you're trying to *learn* a new language, it's better to have your editor, a couple console windows, and a book open. That speeds up the write/compile/run cycle. No flipping back and forth from the browser. You learn faster.

Art At Home [artathome.org]

Re:Boycott This Book!!! (Score:5)

by Neea ( 447021 ) writes: on Thursday April 26, 2001 @09:20AM (#263592)

Dude, take your medication before you post. This book isn't going to tell the bad guys how to get your credit card number from the porn site you just visited. The bad guys already know how to get every piece of data about you that they want. So go back to your room in your mom's basement and put the aluminum foil hat back on your head. Remember - shiny side out.



