Java Programming IT Technology

Using XML in Performance Sensitive Apps? 97

A Parser's Baggage queries: "For the last couple of years I've been working with XML-based protocols, and one thing that keeps coming up is the amount of CPU power needed to handle 10, 20, 30 or 40 concurrent requests. I've run benchmarks on both Java and C#, and my results show that on a 2 GHz CPU the upper boundary for concurrent clients is around 20, regardless of the platform. How have other developers dealt with these issues, and what kinds of arguments do you use to make the performance concerns known to the execs? I'm in favor of using XML for its flexibility, but for performance-sensitive applications the weight is simply too big. This is especially true when some executive expects and demands that it handle 1000 requests/second on a 1- or 2-CPU server. Things like stream/pull parsers help for SOAP, but when you're reading and using the entire message, pull parsing doesn't buy you any advantages."
This discussion has been archived. No new comments can be posted.

  • Sure (Score:5, Funny)

    by jsse ( 254124 ) on Wednesday July 16, 2003 @02:58AM (#6450541) Homepage Journal
    <?xml version="1.0" encoding="UTF-8"?>
    <session session="2003-06-27T17:03:39GMT+08:00" session-serialNumber="06302003b01" encode-version="1.8"><structure id="bzip2"><info cdate="2003-07-12T14:57:07+08:00" expiry-date="" id="OBD12" mdate="2003-07-12T14:57:07+08:00" name="" notes="" organization="Sd7+/OtxQ==" version="1.0"/><content code="H4sIAAAAAAAAAMy9CThW2xc/rpQpYxKJvIakEu88IJkz RKiQIXOSMfOskEqGRJJIpshcyjxLokLGIoRkCplDvP/zVhe/Dv /n+77d5/5+nude92zn7LPWXmt91mevvc+++1Vl5I7AhBB0+/v6 G5rpaNAwCBRiY3SRTkwMQid8wtza1NDO3M3UBAIDLk9C0Ajglz xEB4LEwuEQJAoN0SPcBsFh0Dg48F+yEAwUhUMC/6UCIVyfhuDQ CBRwLS4OoTO1NiH0DCH5x8XO9DwdICEchoPQQX//wNCQn78h1n Q0v1qQGDjuzzYUHIMFtSGxoGexSAQS3IZFgNpQ8A3akKg/2mCA NDBwG/rPZ2EwJO5PmWEwFBz0LByHAN0Hx6FB9yFR0D/1ANrgf+ oLA4wFB7ehQc9i4CjQOzBwDEgPLHAjqA2HBr0XB4eC25BIDKgN jQHpi8NB/3wHHJD6T1ngUAT2T92At4L8BQ4Y+M/3wmFQLBTUhg DZHA5DgcYeDsNCQc8Cwvw5pnA4HA16LzDMoHfAMWD5EFDwOxBw 5J8+DkcAwQBqw8DA/eEwoP6QgHagNiQK9CwSAwM/i0OAxhkFQ4 P6QyFAfg9HoeEgPVBAwP3ZhobiQGOP3sBGaBQK9F7ArUD3YcBY AgfcGfRewByg/jAYkD/DMThQ7MOxMAzoPiwS/CwWUATUhgU/i4 OCsAmACBho/HAosB44DAgTAbcC+R8CGP0/bY6AgrETAcWAxh4B xYFiEAGDQ8FtSMSfY4qAoTF/jh8ChoOCZIHDQPEBeAHuz3hDwM GxjwAGHyQzoAioPwQC/qePIxAoUPwiEBgsaEyRUBCOI5CA94La kDjQGCAxCNA7gNtAbSgYCCeBvAvyA4LIoPeiAID+sw0NA+UKBB oBwjoEGngY1IYFjxUGCAdQGxwNkg+zwRhgMAiQjTA4UCwAaA8F 9YfdwK+waCxIDywO7Pc4GCg3InAI8Pjh0FDws1hQHkRCoSAsRg Ie/acsSCgKlOORgEuC78OBxgoJgyP/lA8JhMef44KEYU">

    Hint: The shorter the header, the faster.

    P.S. This is a joke, for the humor-impaired
  • using DOM (Score:5, Informative)

    by mlati ( 462475 ) on Wednesday July 16, 2003 @03:00AM (#6450550)
    1. I use DOM objects, in this case the MSXML free-threaded model, to handle XML strings, and read out the string only at the last point.
    2. I would also suggest using wstring/string from the STL, as you can reserve string buffers in advance in case you have to handle the XML as strings. That's if you're using C++; I don't know much about C#/Java, sorry.

    Using this method I have managed to push it to ~200 concurrent requests.

    mlati
    • Re:using DOM (Score:3, Informative)

      by macrom ( 537566 )
      I am not 100% sure, but I believe the System.Xml namespace in C# uses DOM. Which is sad because an article a few months back in Windows Developer Journal cited a test where MSXML was the slowest parser around. I believe it was Xerces that ran the fastest.

      As mentioned above, we use std::wstring as the storage mechanism (which isolates developers from the dreaded BSTR that MSXML uses. Ick.), but beware because that isolates your non-C++ users from the interface. We're looking at moving our business rule-enfo
    • Re:using DOM (Score:2, Insightful)

      by DukeyToo ( 681226 )
      If you break it down, there are two basic methods of parsing XML - DOM-based or Stream-based. DOM requires the whole XML document to be loaded in memory, and so is inherently bad for scalability.

      Stream-based parsing combined with XPath processing is the way to go if you want to just get particular elements from the document. Even if you need to parse the whole document, I would still stick with a stream-based method.
      • Pull parsers have become a little more popular recently. There is a more thorough overview at xml.com [xml.com], by the way.

        As to your second paragraph, I don't seem to get what you are talking about. Stream-based APIs and XPath generally don't mix at all - how should an XPath expression like //foo[position()=last()] be handled in, say, a SAX handler?

        There is, however, some kind of middle ground, namely Streaming Transformations for XML [sf.net], an XSLT-ripoff based on SAX with a limited XPath lookalike. Quite useful, IM

        • I do not know of any implementations, but I do not think there is anything inherent about a stream that prevents a single xpath expression from being evaluated. The stream just has to skip over parts of the document that are not relevant to the Xpath expression.

          When I wrote my comment I thought that .NET had such a beast, but further investigation showed it does not. In any case, I do not think it could be based off of something derived from SAX, but it could be derived from an XMLReader (.NET object).

          I
      • Probably the fastest XML parser possible is FleXML [sourceforge.net]. You feed it the DTD for your format and it generates C code a la lex/yacc.
  • by PD ( 9577 ) *
    It's hard to parse. That takes cycles. You can probably tweak the parsing to make it faster, but that might not get you from 20 concurrent to 2000 concurrent.

    You've got two choices. More processors, which are pretty cheap right now; or a simpler and more specialized language to replace XML.
    • by archeopterix ( 594938 ) on Wednesday July 16, 2003 @06:02AM (#6451062) Journal
      It's hard to parse. That takes cycles. You can probably tweak the parsing to make it faster, but that might not get you from 20 concurrent to 2000 concurrent.
      In my experience XML isn't hard to parse at all. Basically, you just have to recognize tags (basic regexp) and match opening ones and closing ones (use a stack, Luke).

      The problem with perceived XML inefficiency is that many implementations build a whole parse tree in memory - that's slow mostly because of node allocations/deallocations. Removing the intermediary parse tree decreased CPU time per request by a factor of 15 in my application.
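
      For what it's worth, a toy Java sketch of the regexp-plus-stack idea (purely illustrative: it only handles plain start/end/empty tags and ignores comments, CDATA, entities, and encodings):

        import java.util.ArrayDeque;
        import java.util.Deque;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class TinyTagChecker {
            // Matches "<tag ...>" or "</tag>"; deliberately naive, no comment/CDATA/PI handling.
            private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][\\w.-]*)[^>]*>");

            public static boolean wellNested(String xml) {
                Deque<String> stack = new ArrayDeque<String>();
                Matcher m = TAG.matcher(xml);
                while (m.find()) {
                    if (m.group(0).endsWith("/>")) continue;            // empty element, nothing to match
                    if (m.group(1).isEmpty()) stack.push(m.group(2));   // opening tag
                    else if (stack.isEmpty() || !stack.pop().equals(m.group(2))) return false;
                }
                return stack.isEmpty();
            }

            public static void main(String[] args) {
                System.out.println(wellNested("<a><b>hi</b></a>"));   // true
                System.out.println(wellNested("<a><b></a></b>"));     // false
            }
        }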

      • by clintp ( 5169 ) on Wednesday July 16, 2003 @07:25AM (#6451359)
        In my experience XML isn't hard to parse at all. Basically, you just have to recognize tags (basic regexp) and match opening ones and closing ones (use a stack, Luke).
        SHHH! Don't say that too loudly!

        The XML Police that exist in several communities will come down on you like flies on manure. "You can't parse XML in regexps! That's not really parsing! You need to use the standard-flavor-of-the-month XML libraries for your language (which of course, may need dozens of prerequisite libraries)! What about CDATA? DTDs?! Encodings!? OH THINK OF THE CHILDREN! [sourceforge.net]"

        <stage_whisper>But in my experience, most of the time, you're right</stage_whisper>

        • Well, I wasn't really advocating writing your own XML parser, although if enough parameters are fixed (encoding, namespaces and such) and the DTD is simple, that might be an option. I was just trying to say that the parser does not have to be slow. Just try to find a SAX-style parser, one that lets you define events associated with tags (parsing on-the-fly) instead of one that slurps an XML file and produces a DOM-tree out of it. While the tree might prove more convenient (you can traverse it in all directi
          • Even writing your own parser isn't entirely a bad idea. It depends on your message size. A few months ago, in an all-night hacking session, I whipped up a SAX parser that was over 3 times faster than expat for messages under a certain amount (roughly 200 bytes, IIRC). Often parsers will bog down because they have lots of features most people don't need - like namespaces for instance.
            • Maybe it's time someone wrote an intelligent pre-parser. Take a cursory look at the XML and pass it on to an appropriate parser based on encoding, DTD, size, etc. Or run the document through a pipeline, where every single request takes longer to process, but you can have several in the pipe at the same time.

              There's no reason there has to be a single heroic XML parser that does everything.
      • by Viol8 ( 599362 ) on Wednesday July 16, 2003 @10:18AM (#6452246) Homepage
        In a protocol designed for efficiency you shouldn't have to parse anything at all! If some binary protocol were used you would, for example, use one char to represent the field type, another to represent the record type, and so forth. If you put all this into a packet that can be DIRECTLY mapped onto a C structure you'll save god knows how many cycles. I like the way you say you just have to recognise tags. Have you any idea of the amount of processing involved in even simple regexp matching?? This is the problem when high-level coders try to design low-level systems: they simply don't have a clue how things really work, and assume that the high-level procedures/objects they work with are some sort of magic that "just happens" and can be used everywhere with no performance degradation.
        • yes...

          I like that idea. Let's map the input directly to a C struct. For complicated items containing lists with interrelationships, you just map it to an array of such structs. The items just store offsets, so you can just add that offset to the base pointer to get the referred item.

          ... or any other item in your address space.

          This idea is perhaps the most braindead idea I have ever heard. Completely throw out security for a bit of efficiency?

          Of course you could validate your data structure, but since

            • Throw out security? Wtf are you talking about?? Anything can be encrypted if that's what you want. Do you think ssh uses XML?? Don't be a fuckwit; go get a clue, and if you manage to find one then get back to me.
            • I take it you are unfamiliar with the buffer overflow problem.

              SSH and encryption do nothing against attacks on the machine itself; they only protect messages in transit.

              Another case of the wrong solution to the wrong problem.

              My request stands.
            • Ah, there's nothing like posting on Slashdot to humiliate yourself amongst your peers. Buffer overflows and stack smashing are the worst security flaws these days. Encryption is there to keep people from reading your cleartext, and is only a subset of good security principles. Please, try to post only on topics in which you are knowledgeable.

              P.S. if the other guy returns with that clue I suggest you nab it and try using it for yourself.

          • I like that idea. Let's map the input directly to a C struct. For complicated items containing lists with interrelationships, you just map it to an array of such structs. The items just store offsets, so you can just add that offset to the base pointer to get the referred item. ... or any other item in your address space.

            Replace 'c struct' with 'cobol record' and you've pretty much got EDI, which has been around since before the people who invented XML were born. And it worked, and still does, for the pur

            • Ok. I know nothing about COBOL, so I can't discuss that. So I'll stick to XML.

              Unless you are importing completely trivial, flat records (in which case using XML seems like overkill), you need a nesting structure... let's say lists. Since you don't know the length of the list when writing your program, you need to dynamically allocate it. That sounds like a pointer. If you allocate it on the stack, read on.

              So now you need to make sure that all uses of that pointer (ie, references from one item to another
              • A binary parsing program? Oh you mean like RPC or CORBA, or any of a thousand existing debugged solutions that are more efficient in terms of processing overhead and network utilization?

                Oh that's right. I forgot. Those were designed by Neanderthals before HTTP existed, so they aren't worth looking at.

                • Yeah. like those.

                  So, then.

                  Why isn't RPC-gen used as a container format?

                  You've got me thinking, and I'm curious.
                • Of course those protocols are binary formatted packet streams, they were designed for extremely quick and low cost message passing. You might as well compare XML to GIF files or a JPEG for all the sense that last statement made. Binary formatting is appropriate for low level network transports, DCOM, RPC, CORBA, etc, but is not appropriate for a format that is designed to allow easy interchange of data between completely disparate systems e.g. 32 bit big endian machines and 64 bit little endian machines. Ev
              • If you need a (complex) binary protocol, why not use ASN.1? Mature, tested, compact (if using Packed Encoding Rules), almost readable (if using Basic Encoding Rules).

                There are many ASN.1 compilers available (most of which have a rather steep licence cost...)

              • "Additionally, this approach brings in endian ordering issues. I doubt the int representation
                in a struct is the same across PPC, x86, and big iron. Likely COBOL mandates a byte ordering"

                When you have time to tear yourself away from your Dummies Guide To Markup Languages, I suggest
                you go check out the man pages/help files on the htons(), ntohs(), htonl() and ntohl() C functions.
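
                For the Java/C# side of this thread, byte order is less of a worry: java.io.DataInputStream and java.nio.ByteBuffer read big-endian ("network order") by default. A sketch reading a hypothetical fixed record (the field layout here is invented purely for illustration):

                  import java.io.ByteArrayInputStream;
                  import java.io.DataInputStream;
                  import java.io.IOException;

                  public class BinaryRecord {
                      // Hypothetical wire layout: 1-byte record type, 4-byte id, 8-byte timestamp.
                      static void readRecord(byte[] packet) throws IOException {
                          DataInputStream in = new DataInputStream(new ByteArrayInputStream(packet));
                          int recordType = in.readUnsignedByte();
                          int id = in.readInt();          // always big-endian, i.e. network byte order
                          long timestamp = in.readLong();
                          System.out.printf("type=%d id=%d ts=%d%n", recordType, id, timestamp);
                          // java.nio.ByteBuffer.order(ByteOrder.LITTLE_ENDIAN) covers the other case.
                      }
                  }
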
          • You do realize that, to some degree, what he described is precisely how TCP/IP's wire protocol was meant to work, right?

            But of course, we all know how awful an idea TCP/IP was.
        • XML is not designed for speed, but for information exchange. Mapping onto a C structure may work well for a single platform and a single compiler, but each processor and compiler has its own ideas about the ordering of struct members and padding, e.g. Intel likes DWORD alignment if available and used to pad as required... not sure about the latest batch of processors and compilers.

          You lose portability between platforms by trying this low-level mapping. How well do you think big-endian systems will like to share

          • The big-endian/little-endian issue only arises if you are passing binary numeric fields - COMP in COBOL, int or integer in C, Pascal, etc.

            So the first rule of portability is don't use binary or packed formats - use character based ones. This approach also means you can easily translate ASCII into EBCDIC into Unicode...
          • Oh for god's sake, as I've pointed out to someone else, go check out the htons(), ntohs(), htonl() and ntohl() functions for solving endian issues. Ordering of struct members?? The C standard states that all members MUST be laid out in memory in the order they're defined in the C code, otherwise half of the Unix networking code would fail! Alignment is a non-issue when passing data from one machine to another.

            No doubt you've read about all these terms in some book and think you're being smart but
              • I'm already well familiar with the htons() and ntohs() functions, seeing as I've programmed low-level network handling code. Those functions aren't really the issue; it's the way some broken (read Microsoft) compilers *don't* follow the standards and lay their structures out differently in memory. You only need one compiler on one platform doing this to end up with a forked code base, because blasting records to the drive or network won't work with these dysfunctional compilers. That is the main issue. Let
              • "it's the way some broken (read Microsoft) compilers *don't* follow the standards and lay their"

                  I didn't know about that, but it doesn't surprise me. However, just because one company doesn't
                  comply with a standard doesn't mean that it shouldn't be used. After all, MS don't follow the
                  telnet RFCs to the letter but we still use it.

                "Your inability to see the value in losing a little performace to gain a lot of compatibilty is showing who the real clueless person in this thread is. It's all about the
                  • And I agree with you entirely, but you seem to have forgotten that the original story was called "Using XML in Performance Sensitive Apps"; in this case the guy was talking about using it in a 1000-request-per-second concurrent system, which IMO is crazy.

                    Oh yeh, I forgot about that. In which case the guy is crazy, unless he's planning on building a reasonably sized cluster or moving the transforms back onto the client machines. It's easy enough to do 1000 hits/sec, but 1000 page requests/sec is another ballp

      • PD wrote: "It's hard to parse. That takes cycles."

        In my experience XML isn't hard to parse at all. Basically, you just have to recognize tags (basic regexp) and match opening ones and closing ones (use a stack, Luke).

        The cycles he was talking about were obviously CPU processing cycles. Show me a CPU which has opcodes for regular expressions. Do you even know enough about how processors work to tell which operations will require more processing time? Even a line by line text file is easier to process th

    • by Anonymous Coward
      Why not use a simpler, easier to parse, more general language?

      Sexp parsing libraries exist for Lisp (duh), Scheme, Java, C, Perl, Python.

    • Even if the parsing could be done in zero execution time, XML is still consuming excessive network bandwidth.

      XML is very flexible, and an excellent solution when flexibility is truly required in what the next data element is.

      However, doubling (or worse) the network bandwidth used in downloading a table in order to allow each record to have a different set of fields is just plain stupid.

      A realistic compromise is to use XML to describe different "row formats" that will be used. And then to deliver

      • So compress the XML. Since it's text, and usually very regular text, it compresses nicely. A simple pretuned huffman filter will do wonders.
        • OK, if you have an extraordinary amount of CPU time to waste, then compressed XML would work just fine.

          You also need to be able to postpone processing until large chunks are received, because compression doesn't work well over small data transfers.

          In the real world however, XML wastes bandwidth and at least some processing power. Dynamic compression would reduce the bandwidth waste (perhaps even eliminate it) but only by increasing the processing power being wasted to a truly bothersome level.

          i ha

          • That's why I said to use a pretuned Huffman tree. It doesn't stress the CPU, and you can use it on your first byte. It won't work for just any random XML stream, but you can tune it for the stuff you have control over. In fact, if you are considering a roll-your-own binary format, you already have control over both ends.
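
            A rough Java sketch of compressing the XML payload with the JDK's built-in DEFLATE support (DEFLATE builds its own Huffman tables, so this is generic compression rather than a hand-tuned tree; Deflater.setDictionary() seeded with your common tag names is the closest JDK analogue to "pretuned"):

              import java.io.ByteArrayOutputStream;
              import java.io.IOException;
              import java.util.zip.Deflater;
              import java.util.zip.DeflaterOutputStream;

              public class XmlCompress {
                  static byte[] compress(String xml) throws IOException {
                      ByteArrayOutputStream buf = new ByteArrayOutputStream();
                      Deflater deflater = new Deflater(Deflater.BEST_SPEED);    // favour CPU over ratio
                      DeflaterOutputStream out = new DeflaterOutputStream(buf, deflater);
                      out.write(xml.getBytes("UTF-8"));
                      out.close();                        // flushes and finishes the DEFLATE stream
                      return buf.toByteArray();           // repetitive markup compresses very well
                  }
              }
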
      • A better solution is Gigabit Ethernet. Honestly, you read the crap that is SlashDot and yet complain about wasting bandwidth ;->

        Do you use a text-only browser to do all your web browsing, or are you downloading some of those nasty big graphics too? Aren't CSS and tables to lay out web pages a waste of bandwidth? Why not get rid of all the formatting tags that bloat web pages too and just have plain black and white web pages with (perhaps) the H1-H6 tags. Bandwidth is cheap and plentiful, especially inside

  • by KDan ( 90353 ) on Wednesday July 16, 2003 @03:12AM (#6450586) Homepage
    It might be of some use if you actually told us what libraries you used, what methods, etc., not just "I tried to parse some XML files". Is that result of 20 concurrent requests using a SAX parser or DOM? Are you using the standard Java DOM implementation (slow and bulky), or one of the slicker ones like JDOM, dom4j, etc. (there's a bunch you should have a look at)? Another thing you could do to improve performance is to identify the points where you don't really need a DOM (e.g. you're just reading the values once and discarding) and use a SAX parser instead to fill in a custom class or a hashtable or such.

    Daniel
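
    A bare-bones sketch of the "SAX fills a hashtable" idea using JAXP (the element names and document shape are hypothetical; every leaf element just ends up as a key/value pair):

      import java.io.StringReader;
      import java.util.HashMap;
      import java.util.Map;
      import javax.xml.parsers.SAXParserFactory;
      import org.xml.sax.Attributes;
      import org.xml.sax.InputSource;
      import org.xml.sax.helpers.DefaultHandler;

      public class FieldsHandler extends DefaultHandler {
          private final Map<String, String> fields = new HashMap<String, String>();
          private final StringBuilder text = new StringBuilder();

          public void startElement(String uri, String local, String qName, Attributes atts) {
              text.setLength(0);                    // start collecting character data afresh
          }
          public void characters(char[] ch, int start, int len) {
              text.append(ch, start, len);
          }
          public void endElement(String uri, String local, String qName) {
              fields.put(qName, text.toString());   // leaf elements become simple key/value pairs
          }

          public static Map<String, String> parse(String xml) throws Exception {
              FieldsHandler handler = new FieldsHandler();
              SAXParserFactory.newInstance().newSAXParser()
                      .parse(new InputSource(new StringReader(xml)), handler);
              return handler.fields;
          }
      }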

    • I assume the poster has tried a number of methods if they went to the trouble of mentioning pull-parsers.

      I doubt he/she is daft enough to be using a slow DOM implementation in situations where SAX would suffice.
    • First of all, the people who say that you should simply switch to a structured binary protocol, and get at your messages through casting are right. That'll be a lot faster. But if you're stuck with implementing a WebService then you're stuck with XML.

      As for using DOM, I'd argue that you should never use it in a performance-critical application. I understand that you need to refer to different parts of the message concurrently, so an event-based parser alone won't work. But what you ought to consider is u

  • java and c#? (Score:5, Insightful)

    by Anonymous Coward on Wednesday July 16, 2003 @03:41AM (#6450678)
    well there's your problem.

    With mod_perl, XML::LibXML, XML::LibXSLT, I EASILY get 100 per second, and my code is shitty.

    what do you do with the XML, do you generate HTML from it with XSLT or what?

    Another thing to try: intelligently cache your results in shared memory. You can easily double performance or more.
    • Re:java and c#? (Score:3, Interesting)

      by jslag ( 21657 )
      With mod_perl, XML::LibXML, XML::LibXSLT, I EASILY get 100 per second, and my code is shitty.

      Amen. All of my XML processing code for the last year has been written using the above-mentioned tools, and it's been fast enough that I haven't needed to spend time performance tuning.

      See the apache axkit project [axkit.org] for more info.
  • by setien ( 559766 ) on Wednesday July 16, 2003 @03:48AM (#6450695)
    I love XML, and I use it anywhere I can get away with it, but I know from my old job, that switching to a binary protocol that is streamlined for the task at hand can give you performance gains over XML protocols that are just plain ridiculous.
    I think the results we measured were something like 1000 times as many connections on a custom binary protocol versus an XML-based one.
    That was in C++ mind you. YMMV.
  • The homescreen app in the MS Smartphone has its config specified by XML.

    For speed, and to avoid parser-related memory leaks that may exist or be introduced by improper usage by other homescreen plugin developers, a separate app loads all the homescreen plugins, feeding them their XML config. This app then streams the plugins out in a binary format (each plugin must support streaming) and then quits, solving any memory leaks.

    Then the homescreen app streams them back in and out again as needed without the xml
  • by Bazzargh ( 39195 ) on Wednesday July 16, 2003 @04:48AM (#6450826)
    First off, any chance you could post those benchmarks? 20 requests/second seems low, I'm wondering what the rest of the setup was.

    For the first part: we had performance problems on an app where the customer had insisted on XML everywhere. However, in one particularly critical part of the system we were getting hammered by the garbage collection overhead of SAX (it's efficient for text in elements, but not for attribute values or element names).

    Anyway - we knew what was coming into the system as we were also the producers of this xml at an earlier stage. So we wrote a custom SAX parser that only supported ASCII, no DTDs, internal subsets etc; and wrote it to return element/attribute names from a pool (IIRC we used a ternary tree to store this stuff, so we didn't need to create a string to do the lookup).

    It was like night and day. XML parsing dropped from generating 80% of the garbage to about 5% and it just didn't appear on my list of performance issues from then on.

    Java strings do a lot of copying; the point is to get yourself as close to a zero-copy XML parser as you can.

    You might want to look at switching toolkits entirely as well - GLUEs benchmarks [themindelectric.com] sound a lot better than yours.
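
    A much cruder sketch of the name-pooling idea described above (this one still creates the temporary String before the lookup, which is exactly the allocation their ternary tree avoided, but it shows the intent: one canonical instance per distinct element/attribute name):

      import java.util.concurrent.ConcurrentHashMap;

      public class NamePool {
          private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<String, String>();

          // Returns one canonical instance per distinct name, so the duplicates produced
          // by the parser become garbage immediately and callers can compare with ==.
          public String canonical(String name) {
              String prior = pool.putIfAbsent(name, name);
              return prior != null ? prior : name;
          }
      }
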
  • by Bart van der Ouderaa ( 32503 ) on Wednesday July 16, 2003 @05:58AM (#6451046)
    Have you profiled your application?
    Do you test on a dedicated test system?

    If you're only getting 20 concurrent users regardless of platform (could be, it really depends on the setup and complexity of the problem), maybe the technology isn't the problem; it could be the network, etc.

    benchmarking is fine, but if you do it on the whole system you don't know what the problem really is.
    Find out precisely what the problem is (network/xml parser/your app logic /db connection/db speed). Look at your own code with a profiler to see the bottleneck.

    If you do end up blaming the parser, change it! (And I don't mean using a different parsing method; most use a SAX parser to generate the tree anyway.) There are parsers that are 50% faster than the standard ones (Xerces isn't the fastest Java parser around!). Also look at the most efficient way of using the tree (Java DOM is, as already said, slow in usage), or maybe you can go from SAX directly to your object model without using a tree, by building your own SAX parser.

    If you can't get a performance gain (which I really doubt), be honest with your client: "If you want to do it that way it's going to cost you" or "it can't be done on one machine". How did they get the idea they could handle 1000s of requests a second anyway? Work on your expectation management (basically, work on making their expectations more realistic). If you promise mountains, make sure you can deliver them first. If you can't deliver them, make them not want mountains but molehills :-)
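
    On the "change the parser" point, under JAXP that mostly means telling the factory which implementation to load; a sketch (the Xerces factory class name is given only as an example, check your parser's documentation):

      import javax.xml.parsers.SAXParser;
      import javax.xml.parsers.SAXParserFactory;

      public class ParserSwap {
          public static SAXParser newParser() throws Exception {
              // JAXP resolves the implementation from this system property (or META-INF/services).
              System.setProperty("javax.xml.parsers.SAXParserFactory",
                                 "org.apache.xerces.jaxp.SAXParserFactoryImpl");
              return SAXParserFactory.newInstance().newSAXParser();
          }
      }
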
    • If you do end up blaming the parser, change it! (And I don't mean using a different parsing method; most use a SAX parser to generate the tree anyway.) There are parsers that are 50% faster than the standard ones (Xerces isn't the fastest Java parser around!).

      I got enormous performance gains by switching from xerces-j [apache.org] to dom4j [dom4j.org] in one application. I also found its API much more straightforward.

      On the other hand, I have run into a few bugs in dom4j -- but it was simple enough to fix them and subm

  • So don't use XML. (Score:3, Insightful)

    by WasterDave ( 20047 ) <davep AT zedkep DOT com> on Wednesday July 16, 2003 @06:03AM (#6451064)
    I don't understand what the problem is here. You're saying that you like XML, but it's slow. Fine, don't use it. It's not like it's the only tool in existence, is it?

    Dave
    • by Knight2K ( 102749 )
      I would guess that using XML is to some degree a political issue that can't be avoided. Which is really symptomatic of the age-old problem of the business and technical sides not really listening to each other.
  • AOLserver and tDOM (Score:3, Informative)

    by Col. Klink (retired) ( 11632 ) on Wednesday July 16, 2003 @08:13AM (#6451531)
    I'm just going to guess at what your problem is since you didn't really tell us. I'm assuming that your application needs to load the entire DOM tree 20 times for 20 concurrent requests and that's taking either too much CPU or too much memory.

    The solution would be to load the DOM in the backend and have front-end applications access it.

    You could try using AOLserver [aolserver.com] as a multi-threaded web server and tDOM [ibm.com] as your DOM processor.
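
    A minimal Java sketch of the "load the DOM in the backend once" idea (assumes the document is effectively read-only after parsing; W3C DOM implementations are generally not safe for concurrent modification):

      import java.io.File;
      import javax.xml.parsers.DocumentBuilderFactory;
      import org.w3c.dom.Document;

      public class SharedDom {
          private static volatile Document cached;   // parsed once, then only read by request threads

          public static Document get(File xmlFile) throws Exception {
              Document doc = cached;
              if (doc == null) {
                  // Harmless if two threads race here; both parses yield equivalent documents.
                  doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xmlFile);
                  cached = doc;
              }
              return doc;
          }
      }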

    • With this lousy performance I am starting to wonder if the DOM contains his entire website and he is parsing out the page that he needs to serve. Honestly, XML and XSL are lightning fast on the servers and platform we use, which is just plain old Windows 2000 Server and MSXML 4.0. We're talking hundreds of pages marked up per second, each with maybe a dozen sections that are transformed individually (by our CMS).
  • XmlTextReader (Score:2, Informative)

    Many have asked about what libraries you are using to get at the XML. Loading up a whole DOM document is indeed quite inefficient.

    On the .NET platform, I would suggest using the XmlTextReader class. This class and its brethren are the parsers underlying Microsoft's DOM implementation and anything else that needs access to XML. The class is noted for its strong performance advantage over loading a DOM or using XPathNavigator - and it is indeed a very lightweight class. It is certainly not as comfortab
    • XmlTextReader is a pull parser, right? Therefore it still wouldn't help in situations where the entire message is used by the application - assuming, of course, the sender is not including unnecessary data in the message. I believe in those cases XmlTextReader at best will be equal to DOM. Of course using a different method like .NET Remoting may be an option, if performance is really required and web services are non-negotiable.
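
      For comparison, the Java analogue of that pull-reader style (the StAX / JSR-173 XMLStreamReader) looks roughly like this - it still visits every node when you need the whole message, but never allocates a tree:

        import java.io.StringReader;
        import javax.xml.stream.XMLInputFactory;
        import javax.xml.stream.XMLStreamConstants;
        import javax.xml.stream.XMLStreamReader;

        public class PullRead {
            static void dump(String xml) throws Exception {
                XMLStreamReader reader =
                        XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        System.out.println(reader.getLocalName());   // one event at a time, no tree
                    }
                }
                reader.close();
            }
        }
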
      • But, unlike the DOM, the XmlTextReader does not have to allocate the entire tree in memory. In fact, it shouldn't have to allocate anything except for the string to hold the text itself. It simply changes a flag to tell what kind of "element" it is looking at (Element, EndElement, CDATA, Comment, etc etc). Even when reading in the entire document, there is less overhead.

        I don't have the time to do it, but I would suggest creating a quick test to compare the relative performance of the DOM and XmlTextRe

        • I don't have the time to do it, but I would suggest creating a quick test to compare the relative performance of the DOM and XmlTextReader, or even XPathNavigator.

          I've run benchmarks; in situations where the entire document is needed and used, DOM can be better (it depends on the use case). Of course if you don't need DOM, then don't use it. The challenge from my perspective is this: if you have a consumer which uses objects and your webservice has to reason over that same object model, you'll probably have to us

  • Wrong uses of XML (Score:5, Insightful)

    by Randolpho ( 628485 ) on Wednesday July 16, 2003 @10:03AM (#6452127) Homepage Journal
    This is an example of the wrong way to use XML.

    XML is great because it's extensible and a markup language. It's great for storage, configuration files, and certain forms of data transmission (which is just a sub-set of storage).

    What XML is not good for is performance-critical transmission protocols. It's too verbose and too complex, and both are bad for protocols. That is the mistake made by the author of the article. Go with a structured protocol and skip the XML.
    • by __past__ ( 542467 )
      It's quite funny that you highlight XML being a markup language (or rather, a toolkit to build markup languages), and don't even include document markup as something it's good for.

      Despite all the hype behind XML, markup somehow doesn't really seem to be any more hip than in the dark SGML ages. Sometimes I really wonder why all the data-heads try so hard to reinvent ASN.1 with more bloat and complexity.

  • Interesting article (Score:2, Informative)

    by f00zbll ( 526151 )
    There's an interesting article that compares the different types of parser and their advantages at a fairly low level. Dennis Sosnoski's article on xml performance [sosnoski.com] was included on IBM's site a while back. It's a worthwhile read.

    I'd have to agree with people's assertion that performance intensive apps should use a custom protocol and preferably binary based or some kind of delayed stream parser that only accesses the XML node when the app calls for it. I believe Sun has an API in the works for XML stream pa

  • Explain more (Score:2, Insightful)

    by vadim_t ( 324782 )
    First, what does your program do? Why are you so sure XML takes so much time to process? And is XML really the best format for your application?

    You could get speed improvements by making things simpler. If XML data takes so much time to process on your server, then I guess you have one of two possible problems: either the amount of data is very big, or you're doing something wrong. You don't really have to use every feature of XML in your program.

    Make sure you also understand what XML is for. Sending bitmaps by trans
  • In my experience that seems about right. I'm using AxKit with caching shut off and my own Language module, and those are about the results I get. For me it's not a big deal since speed isn't that important.

    What little I have looked at the speed issue points to two things. First, caching probably helps a lot. Second, it may pay to customize the output code. From the few performance tests I've done, libxml's output code was the main slowdown. Keep in mind the performance testing was done on C++ code, not Perl code
  • Sounds like you make extensive use of DOM-based parsing. Ever look at the memory footprint required to parse a large XML document using a DOM parser? I've seen 1MB XML files that require >100MB of memory to parse to a DOM. Event-based parsing helps a lot here.

    Now, if you need to manipulate the DOM (and thus require DOM-based parsing), I would suggest caching. One thing some of the commercial XML-based databases do is to parse XML files when they are added to the DB, storing the resulting DOM rather
  • S-expressions (Score:2, Informative)

    by toomuchPerl ( 688058 )
    Why even bother w/ XML? S-expressions are truly superior, and much easier to parse. You can write an S-expression parser in about a hundred lines of Perl, and decent S-expression parsing libraries or bindings are available for C, Python, Java, and Ruby. It's much faster and the overhead is always less.

    --toomuchPerl

  • by Voivod ( 27332 ) <cryptic.gmail@com> on Wednesday July 16, 2003 @03:28PM (#6455227)
    If you are using C/C++ check out gSOAP. It goes real fast, runs on many platforms, and I've used it to talk to Java, PHP, C# etc without a problem. It does about 3000 transactions per second on my little desktop PC. Obviously 100 parallel clients aren't going to get that speed, but it sounds like it will be much faster than what you're using!

    http://www.cs.fsu.edu/~engelen/soap.html
  • Proper Parsing (Score:3, Informative)

    by jkichline ( 583818 ) on Wednesday July 16, 2003 @04:34PM (#6455817)
    I have to agree with many of the comments. The parser you choose is the most important decision. DOM is typically a memory hog and takes time. In my experience the MSXML 4.0 parser is very fast, written in C, etc. DOM is easier to use, but obviously can have some downsides. XML is great for portability and faster development, but performance concerns can arise.

    Find out where the bottleneck lies. If you are running an XSLT processor on the server, that will limit your requests/sec. I've found that streaming XML from the server to a client (such as IE6, gasp) and having the client render it to HTML is wicked fast. The XSLT parser in IE renders asynchronously, allowing the results to be displayed before the entire doc is loaded. Of course this is MS-specific stuff I've experienced, etc.

    SAX is faster for grabbing XML events. While writing a web spider, I was parsing HTML using an HTML parser. I switched from that to regex and saw crawl speed increase significantly. It depends on whether you need the whole XML doc or not.

    You may want to try loading the XML DOM once and serialize the binary. You could then ship the binary around town. Macromedia has some tools like this that can send binary objects to a flash client, etc. Limit the parsing.

    Another tip... if you have control over the XML schema, you may want to research how to structure XML for performance. I've heard that attribute heavy XML docs are more efficient than docs with embedded data, etc. Also look into some XML tricks like IDs, etc.

    Good luck in your pursuit. Choose your parser carefully. If testing turns out negative, you may just want to use some binary data. XML is a wonderful technology designed to aid in system integration, and ease of use... but it comes at a price.
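
    On the attribute-heavy point above, here is the same hypothetical record written both ways; the attribute form gives the parser far fewer nodes (and far less whitespace text) to create:

      <person surname="Smith" firstname="Jo" id="42"/>

      <person>
        <surname>Smith</surname>
        <firstname>Jo</firstname>
        <id>42</id>
      </person>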
  • We need to know more about what you are doing to really be able to understand.

    For example: Are you serializing XML to/from objects in Java or C#? Are you writing custom serializers? Or are you using the built in introspective type serializers for Objects?

    Are you using document-centric SOAP, in which case you're doing more parsing and logical operations than serialization/deserialization?

    Do you really know that SOAP is your bottleneck? Have you profiled it?

    I'm using SOAP in production with J2EE right now wi
  • by RhettLivingston ( 544140 ) on Wednesday July 16, 2003 @08:38PM (#6457681) Journal

    Most of the work in an off-the-shelf XML parser is verifying that the XML is "good" or matches some schema specification. If it's coming from one of your programs and going to one of your programs, and you've done reasonable debugging, it's good. You just parse it and use it. Not enough has been done to optimize the "trusted" app communication scenario, even though in reality that's probably 95%+ of the actual usage of XML. Very few sites are actually publishing XML that is really getting used by programs and pages other than the ones they've written.

    Parsing it is very easy and quick if you're in full control of the encoding. You can optimize your parser greatly by choosing not to handle the general case, but to instead handle only what your specific encoder generates.

    Use the protocol, pick up the buzzword for your app, but leave out the pain of the generalities meant to handle some free data exchange world that is 15 years in the future. When the semantic net comes about and applications can actually use any XML without needing to be written for that XML schema, then you can worry about the general case.
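
    In JAXP terms, the "trusted input" shortcut is mostly a matter of switching the checks off; a sketch (the load-external-dtd feature URI is Xerces-specific and given only as an example):

      import javax.xml.parsers.SAXParser;
      import javax.xml.parsers.SAXParserFactory;

      public class TrustedParser {
          static SAXParser newTrustedParser() throws Exception {
              SAXParserFactory factory = SAXParserFactory.newInstance();
              factory.setValidating(false);       // no DTD/schema validation of trusted messages
              factory.setNamespaceAware(false);   // skip namespace processing if you never use it
              // Xerces-specific feature: never fetch external DTDs at all.
              factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
              return factory.newSAXParser();
          }
      }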

  • This is somewhat simplified, but code like:
    if (strcmp(tag, "surname") == 0)
    ; // handle surname
    else if (strcmp(tag, "firstname") == 0)
    ; // handle firstname
    is obviously a whole lot slower than code like:
    if (tagByte == TAG_SURNAME)
    ; // handle surname
    else if (tagByte == TAG_FIRSTNAME)
    ; // handle firstname

    The problem with XML is that it is a general-purpose textual encoding, and as with most textual encodings it requires more bytes than a dedicated binary encoding does. The resu
    • I wholeheartedly agree. You can run into performance issues using XML in large distributed systems. ASN.1 or TI-RPC is the way to go.
  • by aminorex ( 141494 ) on Wednesday July 16, 2003 @11:24PM (#6458484) Homepage Journal
    If you really do require full XML support, the fastest libraries are the GNOME libxml et al. See the benchmark results [sourceforge.net] if you don't believe me.

    If you can do with basic parsing, the nanoxml and picoxml libraries will put everything else to shame.

  • Biztalk (Score:2, Informative)

    by badfish2 ( 316297 )
    We use Biztalk for a lot of enterprise-level XML parsing, and we get up to 200+ documents parsed per second. Of course, there's a lot of hardware being used - three 2-processor boxes handling the workload, for example. But for a system pushing and pulling messages in and out of a SQL Server database it works pretty well. And these are pretty decently sized documents, doing mapping and using all kinds of functoids and whatnot.
  • Get a 64-bit processor with lots of memory and cache EVERYTHING. I'm using an app that caches the entire database (currently about 10 MB). It's lightning fast for hundreds of users. I suppose thousands could be handled.

  • Can you be more specific about why your SOAP messages are so large? What do they contain? What platform is giving you such poor performance? BTW, it makes great sense to have the ability to use XML to communicate across heterogeneous application boundaries, but why not use a framework that abstracts the wire format? For example, you could leverage binary-format remoting on .NET or a similar technology on Java (RMI?) that has the capability to communicate via XML, but can also use its own efficient wire fo
  • If you have 20 connections and each one is doing 20 MB/s, I don't think parser issues are your problem. Similarly, it sounds likely that your problem is with the XML implementation (poor memory allocators) rather than XML per se. Or is it that XML is bloating your file format? I haven't had any problems with XML performance on clients running on 33 MHz 68k processors.
  • One person mentioned using a C struct that you can whack directly into memory, and several others suggested caching the DOM somehow. Of course these combine perfectly. Get the sender to put a unique id or MD5 in the header. If you don't have a file with that name in the cache dir, then parse and dump the parsed structure to disc. If the file exists, then pull the file into memory and send the rest of the incoming byte stream to /dev/null. Of course caching won't help you if all your incoming XML fil
  • XML can be very verbose and this could be a problem, especially with parsers doing a lot of copying and languages allocating memory slowly.

    Java strings do a lot of copying; the point is to get yourself as close to a zero-copy XML parser as you can.

    This is true for C++ as well. The std::string(const char *) constructor copies the string. This could be a performance bottleneck with a C++ wrapper over a C parsing API. I therefore use a patched version of Arabica (C++ SAX2 wrapper to Expat) [sf.net] relying
