Using XML in Performance Sensitive Apps?
A Parser's Baggage queries: "For the last couple of years I've been working with XML-based protocols, and one thing that keeps coming up is the amount of CPU power needed to handle 10, 20, 30, or 40 concurrent requests. I've run benchmarks on both Java and C#, and my results show that on a 2 GHz CPU, the upper boundary for concurrent clients is around 20, regardless of the platform. How have other developers dealt with these issues, and what kinds of arguments do you use to make the performance concerns known to the execs? I'm in favor of using XML for its flexibility, but for performance-sensitive applications, the weight is simply too big. This is especially true when some executive expects and demands that it handle 1000 requests/second on a one- or two-CPU server. Things like stream/pull parsers help for SOAP, but when you're reading and using the entire message, pull parsing doesn't buy you any advantages."
Sure (Score:5, Funny)
<session session="2003-06-27T17:03:39GMT+08:00" session-serialNumber="06302003b01" encode-version="1.8"><structure id="bzip2"><info cdate="2003-07-12T14:57:07+08:00" expiry-date="" id="OBD12" mdate="2003-07-12T14:57:07+08:00" name="" notes="" organization="Sd7+/OtxQ==" version="1.0"/><content code="H4sIAAAAAAAAAMy9CThW2xc/rpQpYxKJvIakEu88IJk
Hint: The shorter the header, the faster.
P.S. This is a joke, for the humor-impaired.
using DOM (Score:5, Informative)
2. I would also suggest using wstring/string from the STL, as you can reserve string buffers in advance in case you have to handle the XML as strings (see the sketch below). That's if you're using C++; I don't know much about C#/Java, sorry.
Using this method I have managed to push it to ~200 concurrent requests.
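Something like this is what I mean (a minimal sketch; kMaxMsg is a made-up bound you'd tune to your protocol):

#include <string>
#include <cstddef>

const std::size_t kMaxMsg = 64 * 1024; // assumed upper bound per request

// Reuse one preallocated buffer instead of constructing a fresh
// string per request.
void handle_request(std::string& buf, const char* data, std::size_t n) {
    buf.clear();          // keeps the capacity reserved below
    buf.append(data, n);  // no reallocation while n <= capacity
    // ... hand buf to the parser ...
}

int main() {
    std::string buf;
    buf.reserve(kMaxMsg); // one allocation, reused for every request
    handle_request(buf, "<msg/>", 6);
}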
mlati
Re:using DOM (Score:3, Informative)
As mentioned above, we use std::wstring as the storage mechanism (which isolates developers from the dreaded BSTR that MSXML uses. Ick.), but beware: that also isolates your non-C++ users from the interface. We're looking at moving our business rule-enfo
Re:using DOM (Score:2, Insightful)
Stream-based parsing combined with XPath processing is the way to go if you just want to get particular elements from the document. Even if you need to parse the whole document, I would still stay with the stream-based method.
Re:using DOM (Score:2)
As to your second paragraph, I don't seem to get what you are talking about. Stream-based APIs and XPath generally don't mix at all - how should an XPath expression like //foo[position()=last()] be handled in, say, a SAX handler?
There is, however, some kind of middle ground, namely Streaming Transformations for XML [sf.net], an XSLT ripoff based on SAX with a limited XPath lookalike. Quite useful, IMO.
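If all you need is one fixed path, you can also fake that middle ground by tracking the element stack yourself. A sketch with Expat (path and tag names made up; a real version would avoid the per-event string allocations):

#include <expat.h>
#include <cstdio>
#include <string>
#include <vector>

// Matches the fixed path /order/item while streaming - a tiny subset
// of XPath; nothing like //foo[position()=last()] is possible here.
struct Ctx { std::vector<std::string> path; int hits = 0; };

static void XMLCALL on_start(void* ud, const char* name, const char**) {
    Ctx* c = static_cast<Ctx*>(ud);
    c->path.push_back(name);
    if (c->path.size() == 2 && c->path[0] == "order" && c->path[1] == "item")
        ++c->hits;
}

static void XMLCALL on_end(void* ud, const char*) {
    static_cast<Ctx*>(ud)->path.pop_back();
}

int main() {
    const char doc[] = "<order><item/><note/><item/></order>";
    Ctx ctx;
    XML_Parser p = XML_ParserCreate(nullptr);
    XML_SetUserData(p, &ctx);
    XML_SetElementHandler(p, on_start, on_end);
    XML_Parse(p, doc, (int)sizeof doc - 1, 1);
    XML_ParserFree(p);
    std::printf("matched /order/item %d times\n", ctx.hits);
}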
Re:using DOM (Score:1)
When I wrote my comment I thought that
I
Re:using DOM (Score:2)
XML is just hard to parse (Score:2, Insightful)
You've got two choices: more processors, which are pretty cheap right now, or a simpler and more specialized language to replace XML.
Re:XML is just hard to parse (Score:5, Informative)
The problem with perceived XML inefficiency is that many implementations build a whole parse tree in memory; that's slow mostly because of node allocations/deallocations. Removing the intermediary parse tree decreased CPU time per request by a factor of 15 in my application.
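The shape of that fix, sketched with Expat (element names invented; not my actual code): events go straight into the application object, and no tree is ever built.

#include <expat.h>
#include <cstdio>
#include <cstring>
#include <string>

// The target object - parser events land here directly, no DOM between.
struct Person {
    std::string surname, firstname;
    std::string* cur = nullptr; // field the current element fills, if any
};

static void XMLCALL on_start(void* ud, const char* name, const char**) {
    Person* p = static_cast<Person*>(ud);
    if (std::strcmp(name, "surname") == 0)        p->cur = &p->surname;
    else if (std::strcmp(name, "firstname") == 0) p->cur = &p->firstname;
    else                                          p->cur = nullptr;
}

static void XMLCALL on_end(void* ud, const char*) {
    static_cast<Person*>(ud)->cur = nullptr;
}

static void XMLCALL on_text(void* ud, const char* s, int len) {
    Person* p = static_cast<Person*>(ud);
    if (p->cur) p->cur->append(s, len);
}

int main() {
    const char doc[] =
        "<person><surname>Doe</surname><firstname>Jo</firstname></person>";
    Person person;
    XML_Parser p = XML_ParserCreate(nullptr);
    XML_SetUserData(p, &person);
    XML_SetElementHandler(p, on_start, on_end);
    XML_SetCharacterDataHandler(p, on_text);
    XML_Parse(p, doc, (int)sizeof doc - 1, 1);
    XML_ParserFree(p);
    std::printf("%s, %s\n", person.surname.c_str(), person.firstname.c_str());
}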
Re:XML is just hard to parse (Score:5, Insightful)
The XML Police that exist in several communities will come down on you like flies on manure. "You can't parse XML in regexps! That's not really parsing! You need to use the standard-flavor-of-the-month XML libraries for your language (which of course, may need dozens of prerequisite libraries)! What about CDATA? DTDs?! Encodings!? OH THINK OF THE CHILDREN! [sourceforge.net]"
<stage_whisper>But in my experience, most of the time, you're right</stage_whisper>
Re:XML is just hard to parse (Score:3, Interesting)
Re:XML is just hard to parse (Score:2, Insightful)
Re:XML is just hard to parse (Score:3, Interesting)
There's no reason there has to be a single heroic XML parser that does everything.
Re:XML is just hard to parse (Score:4, Insightful)
If some binary protocol were used you would, for example, use one char to represent the field types, another to represent the record types, and so forth. If you put all this into a packet that can be DIRECTLY mapped onto a C structure, you'll save God knows how many cycles. I like the way you say you just have to recognise tags. Have you any idea of the amount of processing involved in even simple regexp matching? This is the problem when high-level coders try to design low-level systems: they simply don't have a clue how things really work, and they assume that the high-level procedures/objects they work with are some sort of magic that "just happens" and can be used everywhere with no performance degradation.
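For illustration, something like this (field names invented; and yes, as people point out below, you still have to handle byte order):

#include <arpa/inet.h> // ntohs/ntohl
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical fixed-size wire record; members chosen so they are
// naturally aligned and there is no padding to worry about.
struct WireRecord {
    uint8_t  record_type;
    uint8_t  field_type;
    uint16_t length; // network byte order on the wire
    uint32_t value;  // network byte order on the wire
};

bool decode(const unsigned char* buf, std::size_t n, WireRecord& out) {
    if (n < sizeof(WireRecord)) return false; // validate before touching
    std::memcpy(&out, buf, sizeof out);       // no parsing, one copy
    out.length = ntohs(out.length);
    out.value  = ntohl(out.value);
    return true;
}

int main() {
    const unsigned char pkt[8] = {1, 2, 0, 4, 0, 0, 0, 42};
    WireRecord r;
    if (decode(pkt, sizeof pkt, r))
        std::printf("type=%u len=%u value=%u\n",
                    (unsigned)r.record_type, (unsigned)r.length,
                    (unsigned)r.value);
}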
Re:XML is just hard to parse (Score:2)
I like that idea. Let's map the input directly to a C struct. For complicated items containing lists with interrelationships, you just map it to an array of such structs. The items just store offsets, so you can add that offset to the base pointer to get the referred item.
This is perhaps the most braindead idea I have ever heard. Completely throw out security for a bit of efficiency?
Of course you could validate your data structure, but since
Re:XML is just hard to parse (Score:1, Flamebait)
Do you think ssh uses XML?? Don't be a fuckwit; go get a clue, and if you manage to find one then get back to me.
Re:XML is just hard to parse (Score:2)
SSH and encryption do nothing against attacks on the machine itself; they only protect messages in transit.
Another case of wrong problem, wrong solution.
My request stands.
Re:XML is just hard to parse (Score:2)
P.S. if the other guy returns with that clue I suggest you nab it and try using it for yourself.
Re:XML is just hard to parse (Score:1)
Replace 'C struct' with 'COBOL record' and you've pretty much got EDI, which has been around since before the people who invented XML were born. And it worked, and still does, for the pur
Re:XML is just hard to parse (Score:2)
Unless you are importing completely trivial, flat records (in which case using XML seems like overkill), you need a nesting structure... let's say lists. Since you don't know the length of the list when writing your program, you need to dynamically allocate it. That sounds like a pointer. If you allocate it on the stack, read on.
So now you need to make sure that all uses of that pointer (i.e., references from one item to another
Re:XML is just hard to parse (Score:2)
A binary parsing program? Oh you mean like RPC or CORBA, or any of a thousand existing debugged solutions that are more efficient in terms of processing overhead and network utilization?
Oh that's right. I forgot. Those were designed by Neanderthals before HTTP existed, so they aren't worth looking at.
Re:XML is just hard to parse (Score:2)
So, then.
Why isn't rpcgen used as a container format?
You've got me thinking, and I'm curious.
Re:XML is just hard to parse (Score:2)
Re:XML is just hard to parse (Score:1)
There are many ASN.1 compilers available (most of which have a rather steep licence cost...)
Re:XML is just hard to parse (Score:2)
"...in a struct is the same across PPC, x86, and big iron. Likely COBOL mandates a byte ordering"
When you have time to tear yourself away from your Dummies Guide To Markup Languages, I suggest you go check out the man pages/help files for the htons(), ntohs(), htonl() and ntohl() C functions.
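For anyone following along, that's all there is to them (values arbitrary):

#include <arpa/inet.h>
#include <cstdint>
#include <cstdio>

int main() {
    uint16_t port = 8080;
    uint16_t wire = htons(port); // host order -> network (big-endian)
    std::printf("host %u -> wire %u -> host %u\n",
                (unsigned)port, (unsigned)wire, (unsigned)ntohs(wire));
}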
Yes, I see. So you'll be going with TCP/IP then? (Score:2)
But of course, we all know how awful an idea TCP/IP was.
Re:XML is just hard to parse (Score:3, Insightful)
You lose portability between platforms by trying this low-level mapping. How well do you think big-endian systems will like to share
Re:XML is just hard to parse (Score:1)
So the first rule of portability is: don't use binary or packed formats; use character-based ones. This approach also means you can easily translate ASCII into EBCDIC into Unicode...
Re:XML is just hard to parse (Score:2)
...for solving endian issues. Ordering of struct members?? The C standard states that all members MUST be laid out in memory in the order they're defined in the C code; otherwise half of the Unix networking code would fail! Alignment is a non-issue when passing data from one machine to another.
No doubt you've read about all these terms in some book and think you're being smart but
Re:XML is just hard to parse (Score:2)
Re:XML is just hard to parse (Score:2)
I didn't know about that, but it doesn't surprise me. However, just because one company doesn't comply with a standard doesn't mean that it shouldn't be used. After all, MS doesn't follow the telnet RFCs to the letter, but we still use it.
"Your inability to see the value in losing a little performace to gain a lot of compatibilty is showing who the real clueless person in this thread is. It's all about the
Re:XML is just hard to parse (Score:2)
Oh yeah, I forgot about that. In which case the guy is crazy, unless he's planning on building a reasonably sized cluster or moving the transforms back onto the client machines. It's easy enough to do 1000 hits/sec, but 1000 page requests/sec is another ballpark.
Re:XML is just hard to parse (Score:2)
The cycles he was talking about were obviously CPU processing cycles. Show me a CPU which has opcodes for regular expressions. Do you even know enough about how processors work to tell which operations will require more processing time? Even a line-by-line text file is easier to process than XML.
Re:XML is just hard to parse (Score:1, Insightful)
Sexp parsing libraries exist for Lisp (duh), Scheme, Java, C, Perl, and Python.
Parsing isn't the issue (Score:2)
Even if the parsing could be done in zero execution time, XML would still consume excessive network bandwidth.
XML is very flexible, and an excellent solution when flexibility is truly required in what the next data element is.
However, doubling (or worse) the network bandwidth used in downloading a table, just to allow each record to have a different set of fields, is plain stupid.
A realistic compromise is to use XML to describe the different "row formats" that will be used. And then to deliver
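Presumably the rows themselves then follow in a compact form; a hypothetical sketch of such a hybrid stream (format and field names invented):

<!-- XML header: describes the row layout exactly once -->
<row-format name="quote">
  <col name="symbol" type="string"/>
  <col name="price"  type="float"/>
</row-format>
IBM,84.50
SUNW,4.91

Each field name crosses the wire once, not once per record, and the stream stays self-describing.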
Re:Parsing isn't the issue (Score:3, Interesting)
Re:Parsing isn't the issue (Score:2)
OK, if you have an extraordinary amount of CPU time to waste, then compressed XML would work just fine.
You also need to be able to postpone processing until large chunks are received, because compression doesn't work well over small data transfers.
In the real world, however, XML wastes bandwidth and at least some processing power. Dynamic compression would reduce the bandwidth waste (perhaps even eliminate it), but only by increasing the processing power being wasted to a truly bothersome level.
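To make the trade-off concrete, a tiny zlib sketch (payload invented): even at level 1 you burn cycles per message, and a message this small gains little or nothing, which is the point about small transfers.

#include <zlib.h>
#include <cstdio>

int main() {
    const char xml[] = "<order><item qty='1'/><item qty='2'/></order>";
    Bytef out[256];
    uLongf outLen = sizeof out;
    // level 1 = cheapest compression; even this is real per-message work
    if (compress2(out, &outLen, (const Bytef*)xml, sizeof xml - 1, 1) == Z_OK)
        std::printf("%u -> %lu bytes\n", (unsigned)(sizeof xml - 1), outLen);
}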
Re:Parsing isn't the issue (Score:2)
Re:Parsing isn't the issue (Score:2)
Do you use a text-only browser to do all your web browsing, or are you downloading some of those nasty big graphics too? Aren't CSS and tables for laying out web pages a waste of bandwidth? Why not get rid of all the formatting tags that bloat web pages and just have plain black-and-white pages with (perhaps) the H1-H6 tags? Bandwidth is cheap and plentiful, especially inside
Is that using SAX or DOM? (Score:5, Insightful)
Daniel
Re:Is that using SAX or DOM? (Score:1)
I assume the poster has tried a number of methods if they went to the trouble of mentioning pull-parsers.
I doubt he/she is daft enough to be using a slow DOM implementation in situations where SAX would suffice.
Re:Is that using SAX or DOM? (Score:3, Insightful)
First of all, the people who say that you should simply switch to a structured binary protocol and get at your messages through casting are right. That'll be a lot faster. But if you're stuck with implementing a WebService, then you're stuck with XML.
As for using DOM, I'd argue that you should never use it in a performance-critical application. I understand that you need to refer to different parts of the message concurrently, so an event-based parser alone won't work. But what you ought to consider is u
java and c#? (Score:5, Insightful)
With mod_perl, XML::LibXML, and XML::LibXSLT, I EASILY get 100 per second, and my code is shitty.
What do you do with the XML? Do you generate HTML from it with XSLT, or what?
Another thing to try: intelligently cache your results in shared memory. You can easily double performance or more.
Re:java and c#? (Score:3, Interesting)
Amen. All of my XML processing code for the last year has been written using the above-mentioned tools, and it's been fast enough that I haven't needed to spend time performance tuning.
See the Apache AxKit project [axkit.org] for more info.
Switch to a custom protocol (Score:5, Interesting)
I think the results we measured were something like 1000 times as many connections with a custom binary protocol as with an XML-based one.
That was in C++, mind you. YMMV.
In the MS Smartphone (Score:2)
For speed, and to avoid parser memory leaks that may exist or be introduced by improper usage by other homescreen plugin developers, a separate app loads all the homescreen plugins, feeding them their XML config. This app then streams the plugins out in a binary format (each plugin must support streaming) and then quits, solving any memory leaks.
Then the homescreen app streams them back in and out again as needed, without the XML.
Re:In the MS Smartphone (Score:3, Funny)
Benchmarks, handmade parser... (Score:5, Informative)
For the first part: we had performance problems on an app where the customer had insisted on XML everywhere. However, in one particularly critical part of the system we were getting hammered by the garbage-collection overhead of SAX (it's efficient for text in elements, but not for attribute values or element names).
Anyway, we knew what was coming into the system, as we were also the producers of this XML at an earlier stage. So we wrote a custom SAX parser that only supported ASCII, with no DTDs, internal subsets, etc., and wrote it to return element/attribute names from a pool (IIRC we used a ternary tree to store this stuff, so we didn't need to create a string to do the lookup).
It was like night and day. XML parsing dropped from generating 80% of the garbage to about 5% and it just didn't appear on my list of performance issues from then on.
Java strings do a lot of copying; the point is to get as close to a zero-copy XML parser as you can.
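The pool trick in miniature (ours was a ternary tree in Java; this C++ sketch with made-up tag names just dispatches on the first character, but the point is the same: the lookup never constructs a string):

#include <cstring>

enum Tag { TAG_UNKNOWN, TAG_ORDER, TAG_ITEM, TAG_NOTE };

// Lookup straight from the parser's const char* - nothing is allocated
// on the hot path, so nothing is handed to the garbage collector.
Tag lookup(const char* name) {
    switch (name[0]) {
    case 'o': return std::strcmp(name, "order") == 0 ? TAG_ORDER : TAG_UNKNOWN;
    case 'i': return std::strcmp(name, "item")  == 0 ? TAG_ITEM  : TAG_UNKNOWN;
    case 'n': return std::strcmp(name, "note")  == 0 ? TAG_NOTE  : TAG_UNKNOWN;
    default:  return TAG_UNKNOWN;
    }
}

int main() { return lookup("item") == TAG_ITEM ? 0 : 1; }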
You might want to look at switching toolkits entirely as well; GLUE's benchmarks [themindelectric.com] sound a lot better than yours.
Re:Benchmarks, handmade parser... (Score:3, Interesting)
So what you're saying is that you stopped using XML and used something completely different that has a visual similarity to XML.
Hint: if it doesn't do Unicode, DTDs, CDATA sections and all the other crap, it's not XML.
Re:Benchmarks, handmade parser... (Score:2, Insightful)
Re:Benchmarks, handmade parser... (Score:1)
If summat can't be writ with iso8859-1, it ain't worth the writtin', so help me God.
profile your application (Score:5, Interesting)
Do you test on a dedicated test system?
If you're only getting 20 concurrent users regardless of platform (could be; it really depends on the setup and the complexity of the problem), maybe the technology isn't the problem: it could be the network, etc.
Benchmarking is fine, but if you do it on the whole system you don't know what the problem really is.
Find out precisely what the problem is (network/xml parser/your app logic
If you do end up blaming the parser, change it! (And I don't mean using a different parsing method; most use a SAX parser to generate the tree anyway.) There are parsers that are 50% faster than the standard ones (Xerces isn't the fastest Java parser around!). Also look at the most efficient way of using the tree (Java DOM is, as already said, slow in usage), or maybe you can go from SAX directly to your object model, without building a tree, using your own SAX handler.
If you can't get a performance gain (which I really doubt), be honest with your client: "If you want to do it that way, it's going to cost you" or "It can't be done on one machine." How did they get the idea they could handle thousands of requests a second anyway? Work on your expectation management (basically, work on making their expectations more realistic). If you promise mountains, make sure you can deliver them first. If you can't deliver them, make them want molehills, not mountains.
dom4j vs. xerces-j (Score:1)
If you do end up blaming the parser, change it! (And I don't mean using a different parsing method; most use a SAX parser to generate the tree anyway.) There are parsers that are 50% faster than the standard ones (Xerces isn't the fastest Java parser around!).
I got enormous performance gains by switching from xerces-j [apache.org] to dom4j [dom4j.org] in one application. I also found its API much more straightforward.
On the other hand, I have run into a few bugs in dom4j -- but it was simple enough to fix them and subm
So don't use XML. (Score:3, Insightful)
Dave
Re:So don't use XML. (Score:2, Insightful)
AOLserver and tDOM (Score:3, Informative)
The solution would be to load the DOM in the backend and have front-end applications access it.
You could try using AOLserver [aolserver.com] as a multi-threaded web server and tDOM [ibm.com] as your DOM processor.
Re:AOLserver and tDOM (Score:2)
XmlTextReader (Score:2, Informative)
On the
Re:XmlTextReader (Score:1)
Re:XmlTextReader (Score:1)
I don't have the time to do it, but I would suggest creating a quick test to compare the relative performance of the DOM and XmlTextReader.
Re:XmlTextReader (Score:1)
I've run benchmarks; in situations where the entire document is needed and used, DOM can be better (it depends on the use case). Of course, if you don't need DOM, then don't use it. The challenge from my perspective is this: if you have a consumer which uses objects, and your web service has to reason over that same object model, you'll probably have to us
Wrong uses of XML (Score:5, Insightful)
XML is great because it's extensible and a markup language. It's great for storage, configuration files, and certain forms of data transmission (which is just a sub-set of storage).
What XML is not good for is performance-critical transmission protocols. It's too verbose and too complex, and both are bad for protocols. That is the mistake made by the author of the article. Go with a structured protocol and skip the XML.
Re:Wrong uses of XML (Score:3, Insightful)
Despite all the hype behind XML, markup somehow doesn't really seem to be any more hip than in the dark SGML ages. Sometimes I really wonder why all the data-heads try so hard to reinvent ASN.1, with more bloat and complexity.
Interesting article (Score:2, Informative)
I'd have to agree with people's assertion that performance-intensive apps should use a custom protocol (preferably binary-based) or some kind of delayed stream parser that only accesses an XML node when the app calls for it. I believe Sun has an API in the works for XML stream parsing.
Explain more (Score:2, Insightful)
You could get speed improvements by making things simpler. If XML data takes so much effort to process on your server, then I guess you have one of two possible problems: either the amount of data is very big, or you're doing something wrong. You don't really have to use every feature of XML in your program.
Make sure you also understand what XML is for. Sending bitmaps by trans
Caching and IO (Score:1)
What little I have looked at the speed issue points to two things. First, caching probably helps a lot. Second, it may pay to customize the output code. From the few performance tests I've done, libxml's output code was the main slowdown. Keep in mind the performance testing was done on C++ code, not Perl code.
Event-based parsing...caching...JDOM (Score:2)
Now, if you need to manipulate the DOM (and thus require DOM-based parsing), I would suggest caching. One thing some of the commercial XML-based databases do is to parse XML files when they are added to the DB, storing the resulting DOM rather
S-expressions (Score:2, Informative)
--toomuchPerl
You picked the wrong tools (Score:3, Informative)
http://www.cs.fsu.edu/~engelen/soap.html
Re:You picked the wrong tools (Score:1)
gSOAP is just awesome!
Proper Parsing (Score:3, Informative)
Find out where the bottleneck lies. If you are running an XSLT processor on the server, that will limit your requests/sec. I've found that streaming XML from the server to a client (such as IE6, gasp) and having the client render it to HTML is wicked fast. The XSLT parser in IE renders asynchronously, allowing the results to be displayed before the entire doc is loaded. Of course, this is MS-specific stuff I've experienced, etc.
SAX is faster for grabbing XML events. While writing a web spider, I was parsing HTML using an HTML parser. I switched from that to regex and saw crawl speed increase significantly. It depends on whether you need the whole XML doc or not.
You may want to try loading the XML into a DOM once and serializing it as binary. You could then ship the binary around town. Macromedia has some tools like this that can send binary objects to a Flash client, etc. Limit the parsing.
Another tip: if you have control over the XML schema, you may want to research how to structure XML for performance. I've heard that attribute-heavy XML docs are more efficient than docs with embedded data, etc. Also look into some XML tricks like IDs, etc.
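For illustration, the two shapes that tip is about (record invented):

<!-- element-heavy: every field costs a start tag, character data,
     and an end tag -->
<user>
  <id>42</id>
  <name>jdoe</name>
</user>

<!-- attribute-heavy: one start-element event carries all the fields -->
<user id="42" name="jdoe"/>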
Good luck in your pursuit. Choose your parser carefully. If testing turns out negative, you may just want to use some binary data. XML is a wonderful technology designed to aid in system integration and ease of use... but it comes at a price.
More info needed (Score:2)
For example: Are you serializing XML to/from objects in Java or C#? Are you writing custom serializers? Or are you using the built-in introspective serializers for objects?
Are you using document-centric SOAP, in which case you're doing more parsing and logical operations than serialization/deserialization?
Do you really know that SOAP is your bottleneck? Have you profiled it?
I'm using SOAP in production with J2EE right now wi
Parse it, don't check it (Score:4, Insightful)
Most of the work in an off-the-shelf XML parser is verifying that the XML is "good" or matches some schema specification. If it's coming from one of your programs and going to one of your programs, and you've done reasonable debugging, it's good. You just parse it and use it. Not enough has been done to optimize the trusted-app communications scenario, even though in reality that's probably 95%+ of the actual usage of XML. Very few sites are actually publishing XML that is really getting used by programs and pages other than the ones they've written.
Parsing it is very easy and quick if you're in full control of the encoding. You can optimize your parser greatly by choosing not to handle the general case, but to instead handle only what your specific encoder generates.
Use the protocol and pick up the buzzword for your app, but leave out the pain of the generalities meant to handle some free-data-exchange world that is 15 years in the future. When the semantic net comes about and applications can actually use any XML without needing to be written for that XML schema, then you can worry about the general case.
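In that spirit, a deliberately naive scanner for trusted input from your own serializer. It assumes well-formed ASCII with no entities, CDATA, or DTDs (tag names invented):

#include <cstdio>
#include <cstring>

// Prints each element name; everything the general case worries about
// is simply assumed away, because our own encoder never emits it.
void scan(const char* p) {
    while ((p = std::strchr(p, '<')) != nullptr) {
        if (p[1] == '/' || p[1] == '?' || p[1] == '!') { ++p; continue; }
        const char* e = p + 1;
        while (*e && *e != ' ' && *e != '>' && *e != '/') ++e;
        std::printf("element: %.*s\n", (int)(e - (p + 1)), p + 1);
        p = e;
    }
}

int main() {
    scan("<order id='7'><item/><item/></order>");
}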
ASN.1 (Score:1)
if (strcmp(tag, "surname") == 0)
    ; /* handle surname */
else if (strcmp(tag, "firstname") == 0)
    ; /* handle firstname */

is obviously a whole lot slower than code like:

if (tagByte == TAG_SURNAME)
    ; /* handle surname */
else if (tagByte == TAG_FIRSTNAME)
    ; /* handle firstname */
The problem with XML is that it is a general-purpose textual encoding, and as with most textual encodings it requires more bytes than a dedicated binary encoding does. The result is more bytes to move and more CPU cycles to decode them.
Re:ASN.1 (Score:1)
Fastest all-around full-featured XML support libs (Score:4, Informative)
If you can get by with basic parsing, the nanoxml and picoxml libraries will put everything else to shame.
Biztalk (Score:2, Informative)
Cache (Score:1)
Application design? (Score:1)
We need more information (Score:1)
Taking a couple of the better ideas (Score:2)
Don't forget CS101 basics when dealing with XML (Score:1)
This is true for C++ as well. The std::string(const char *) constructor copies the string, which could be a performance bottleneck with a C++ wrapper over a C parsing API. I therefore use a patched version of Arabica (a C++ SAX2 wrapper over Expat) [sf.net] relying