Stephane Rodriguez Dismantles Open XML
Elektroschock writes "Stephane Rodriguez, a reengineering specialist who became popular for his article on MS Office 2007 binary data, now comprehensively debunks Microsoft's new Open XML format. With small case studies he demonstrates the impossible challenges third-party developers will face. His conclusion: it is 'defective by design.' Next week members of the International Standard Organization are likely to approve the format as a second official ISO standard for office documents, even though most nations have submitted comments. Rodriguez claims he is 'not affiliated to any pro-MS or anti-MS party/org[anization]/ass[ociation].'"
Re:This is not proof of OOXML being defective by d (Score:5, Informative)
The document contains all of these. I suggest that you read it.
By the way -- there's newly discovered undocumented Microsoft tech present in OOXML, such as SSPI ("Security Service Provider Interface"), which is a proprietary Microsoft-developed protocol for security providers, and OLE ("Object Linking and Embedding"), which is for embedding (e.g., taking an Excel spreadsheet and putting it into a Word document). Both are undefined in OOXML and only available on Microsoft Windows.
Re:Surely this is she, or he/she (Score:4, Informative)
Stéphane is a French male name. The female version is Stéphanie.
Re:One Question (Score:1, Informative)
Because Ecma did such a poor job and the spec is 6000+ pages it's not possible to review it in depth in the time we have.
Because ISO is an international standards organisation it's composed of national bodies rather than companies, and they need to give it their OK. There's a lot of online research to base opinions on. There's a lot of lobbying, and local tech groups who are talking to the national bodies and trying to get them to understand how OOXML is a bad standard. This has been a lot of work for me, personally. I don't like having to do this.
So it's up to the national body vote, and in particular the "P" countries, whose votes actually count on tech issues. Microsoft have been getting non-P countries to change their status to P if they will vote yes, in order to swing it towards yes for OOXML.
Re:This is not proof of OOXML being defective by d (Score:3, Informative)
A DOC is actually a FAT12-like filesystem (called OLE) that has files and clusters. Clusters can be lost and files can be fragmented. One of the files is the document's text; it's not plaintext but rather another obscure binary format, with text chunks separated by some kind of metadata (my brain nearly exploded when trying to understand how to separate text from the metadata and I gave up). Images, videos and embedded objects are stored as separate files in the OLE file.
Instead of a simple *.zip file with an HTML-like text file they invented a completely fucked up format that gives people nightmares. The only point seems to be making third-party compatible applications extremely difficult to write, but the plan seems to have backfired because even Microsoft's own Word Mobile doesn't work well with native *.doc files (and ironically, Documents To Go for PalmOS works better with DOC than Word Mobile!)
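A minimal sketch of how a tool might tell the two container types apart from their leading bytes. The eight-byte signature is the one OLE compound files start with, and PK\x03\x04 is what normally opens a zip archive; the helper name sniff_container is made up for illustration, and this only identifies the container, it does not parse the "filesystem" inside.

```python
# Leading bytes of an OLE compound file (the legacy .doc/.xls/.ppt container).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature that normally opens a zip archive.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(data: bytes) -> str:
    """Guess the container type from the first bytes (hypothetical helper)."""
    if data.startswith(OLE_MAGIC):
        return "ole"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

In practice you would pass it the first few bytes read from the file in binary mode.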
Re:ODF specifies ASCII number IEEE float value? (Score:4, Informative)
<table:table-row table:style-name="ro1">
<table:table-cell/>
<table:table-cell office:value-type="float" office:value="123456.123456789">
<text:p>123456.12</text:p>
</table:table-cell>
</table:table-row>
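A rough sketch of reading that row back, to show that the cell carries both a machine-readable value (the office:value attribute) and the text that was displayed (the text:p child). The <root> wrapper and the name read_cells are made up for this example; the wrapper exists only to declare the ODF namespaces, since the fragment above does not declare them itself.

```python
import xml.etree.ElementTree as ET

# ODF namespace URIs for the prefixes used in the fragment.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

# The row from the post, inside a hypothetical <root> that declares the prefixes.
ROW = (
    '<root xmlns:office="{office}" xmlns:table="{table}" xmlns:text="{text}">'
    '<table:table-row table:style-name="ro1">'
    "<table:table-cell/>"
    '<table:table-cell office:value-type="float" office:value="123456.123456789">'
    "<text:p>123456.12</text:p>"
    "</table:table-cell>"
    "</table:table-row>"
    "</root>"
).format(**NS)


def read_cells(xml_text):
    """Return (stored value, displayed text) for every cell in the fragment."""
    root = ET.fromstring(xml_text)
    cells = []
    for cell in root.iterfind(".//table:table-cell", NS):
        stored = cell.get("{%s}value" % NS["office"])   # None for an empty cell
        shown = cell.findtext("text:p", default=None, namespaces=NS)
        cells.append((stored, shown))
    return cells
```

The stored value keeps all the digits as a decimal string, while the displayed text is whatever the cell format rendered.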
Re:Can anyone repro? (Score:4, Informative)
So? It's still perfectly valid XML even without the BOM. XML is a real standard and I suggest you read it; it's not Notepad.
And don't even start talking about malformed UTF-8 since he only used characters in the ASCII subset, so even saving it as Latin-1 would have generated valid XML.
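A quick sketch of both claims, using Python's standard XML parser as a stand-in for "any conforming parser". The document content and the helper name cell_text are made up for illustration.

```python
import codecs
import xml.etree.ElementTree as ET

# A pure-ASCII document with no BOM and no encoding declaration (made-up content).
DOC = b'<?xml version="1.0"?><sheet><c>42</c></sheet>'


def cell_text(raw: bytes) -> str:
    """Parse the bytes and return the text of the <c> element."""
    return ET.fromstring(raw).findtext("c")


without_bom = cell_text(DOC)
with_bom = cell_text(codecs.BOM_UTF8 + DOC)

# Characters in the ASCII subset encode to the same bytes in UTF-8 and Latin-1.
text = DOC.decode("ascii")
same_bytes = text.encode("utf-8") == text.encode("latin-1")
```

Both parses give the same result, and the encoded bytes are identical, which is why the BOM cannot be what breaks a file like this.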
Foresight (Score:4, Informative)
Err.. Next week's news called, they want their draft story back.
There is no certain outcome of next week's vote, and the fact that we are even discussing the defects of OOXML is proof that the ISO body will have a lot of trouble just waving this through. Please refrain from taking sides just because this is a 'Microsoft standard'.
I'd say it's possible that OOXML will NOT be approved next week. It will probably have to take the long road through the ISO as a real standard proposal, not just a fast-tracked 6000 page gorilla.
Re:ODF specifies ASCII number IEEE float value? (Score:4, Informative)
So, there are numbers that floating point formats do not represent well. However, the world is not floating point numbers. And computer math is not just floating point numbers.
The numbers are stored in the XML as ASCII decimal real numbers. They're not stored as binary floating point numbers and they shouldn't have the kind of brain damage that floating point has.
Let's look at what's going on here.
User enters a number in a decimal format. User sees the number in a decimal format displayed on the screen. Excel apparently does not use floating point, or it's got a lot of compensation, because if you do things like multiply 12345.12345 * 100000000 you get 1234512345000 and not some weird approximation. I would guess that the XML output routine is using floating point (and why would be a good question).
Why is this a problem? Well, we don't know how many digits of precision to work with here or how to round things. If I write an app to work with the spreadsheet I'd probably use something like a Java BigDecimal to handle the numbers. But I don't know how to round things so that I get the right numbers. If I use a BigDecimal, 12345.123449999999 is going to be 12345.123449999999. If I multiply by 100000000 I will get 1234512344999.9999 instead of 1234512345000, as I would expect from looking at the values that were put into the spreadsheet.
Excel should be putting the proper values out in the XML or the standard should define the form of rounding/conversion to be applied.
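A minimal sketch of the problem, using Python's decimal module as a stand-in for a Java BigDecimal. The rounding to 15 significant digits is purely an assumption on my part (a common rule of thumb for doubles); nothing here says the spec defines it, which is exactly the complaint. The name round_significant is made up.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

stored = Decimal("12345.123449999999")   # what ends up in the XML
entered = Decimal("12345.12345")         # what the user typed
factor = Decimal("100000000")

# Exact decimal arithmetic faithfully carries the stored digits along.
exact_product = stored * factor          # equals 1234512344999.9999
wanted_product = entered * factor        # equals 1234512345000


def round_significant(value, digits=15):
    """Round to a number of significant digits (the 15 is a guess, not from any spec)."""
    return Context(prec=digits, rounding=ROUND_HALF_EVEN).create_decimal(value)


repaired = round_significant(stored)     # equals 12345.12345 under this guess
```

With the guessed rounding the entered value comes back, but a consumer has no way to know from the file that 15 digits is the right choice.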
Except he doesn't. (Score:4, Informative)
Stephane has for a long time presented a weak case against Office Open XML.
"1) Self-exploding spreadsheets"
His top issue "1) Self-exploding spreadsheets" has been discussed on Brian Jones' weblog:
http://blogs.msdn.com/brian_jones/archive/2007/08/15/why-there-s-no-microsoft-in-open-xml.aspx [msdn.com]
It boils down to this: the fact that it is XML does not mean that you can modify it in any way you want. There are rules for modifying files that follow the schema, and Mr Stephane is not happy with that. Had he followed the actual rules he would have had no issue.
This is a case where two locations must be updated per the spec. He can avoid updating the two locations by removing the calcChain.xml file (which is optional, and Excel will reconstruct it). He later gets upset because, he claims, performance on load will be slower if he does that.
"2) Entered versus stored values"
His second point, "2) Entered versus stored values", is an interesting distinction between entered values and stored values. It reflects the way that Excel works (and so does Gnumeric) by storing the values instead of the data that was entered by the user. This responds to the need of the spreadsheet to do something interesting with the data. For example, when you enter a date, it is stored as a number with a format applied, not as a string. This allows computations on dates to happen based on the underlying numeric value. The feature is used extensively by spreadsheets.
In the Excel/gnumeric case you have to generate a single value, in the ODF case you must generate and update the two values (which just a point before, Stephane was having a seizure about).
The precision issue that he brings up, I suspect is merely an issue with double format precision. He claims that the data is unusable and there is a loss of precision, but handing that out to a C compiler will produce the expected result with no loss of precision. I do not know how "atof" or the compiler work internally to cope with this issue, but at least my libc/gcc combo does not have this problem.
I would not be surprised if this is an artifact of floating point. Someone with more background on doubles and floating point math could probably answer the question with more authority, but a cursory read of "What Every Computer Scientist Should Know About Floating-Point Arithmetic" seems to validate that there is no error in the floating point representation for the values that he uses: http://docs.sun.com/source/806-3568/ncg_goldberg.html [sun.com]
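A small sketch of that check, with Python's float() standing in for atof (both convert decimal text to an IEEE double). It uses the value 12345.12345 and the 17-digit form 12345.123449999999 discussed elsewhere in this thread; whether those are exactly the values in his test file is an assumption.

```python
stored_text = "12345.123449999999"   # 17 significant digits
entered_text = "12345.12345"         # what the user typed

# Both strings should parse to the very same double.
same_double = float(stored_text) == float(entered_text)

# Printing that double with 17 significant digits gives the long form...
seventeen = format(float(entered_text), ".17g")

# ...while the shortest string that round-trips is the short form.
shortest = repr(float(stored_text))
```

So as long as the reader parses the text into a double, nothing is lost; the long form only causes trouble for a consumer that treats it as an exact decimal.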
3) Optimization artefacts become a feature instead of an embarrassment
His 3rd point is open for debate, like the 1st case, we have a case where he has to handle things differently. Stephane sells a commercial product to handle Excel files and I suspect that his product has to cope with the same patterns in different ways, which has naturally upset him. OOXML might be inspired by Excel's needs, but it does not mean that it has to be a 1-to-1 match.
4) VML isn't XML
VML is labeled as "deprecated" in the OOXML documentation (Section 8.6.2, page 25) and it states: "The VML format is a legacy format originally introduced with Office 2000 and is included and fully defined in this Standard for backwards compatibility reasons. The DrawingML format is a newer and richer format created with the goal of eventually replacing any uses of VML in the Office Open XML formats. VML should be considered a deprecated format included in Office Open XML for legacy reasons only and new applications that need a file format for drawings are strongly encouraged to use preferentially DrawingML."
So the standard basically says "VML is still in use, but it's better to use DrawingML". Stephane misconstrues the above statement and tries to portray this as evil.
Re:Personally.. (Score:2, Informative)
Disingenuous (Score:5, Informative)
-Q(1) What does Rodriguez's article show?
-Q(2) Is OOXML in and by itself flawed?
-Q(3) What's the practical relevance of the question whether OOXML is flawed?
-Q(4) So what's in it for Microsoft? Why do they bother?
-
- Q(1) : What does Rodriguez's article show?
- A(1) : Rodriguez's article shows that the OOXML format written by the latest Microsoft Office applications, among them MS Excel, is:
- sorely defective in that you can't be sure to get your original data back after saving it to OOXML
- impossible to change outside MS Office applications
- tied to the MS Office way of representing internationalised versions of documents because "of the way Microsoft chose to store XML using the US English locale, no matter how good your implementation is, you have to retrofit it to work just like Office does" in order to accommodate internationalised documents
- burdened with MS Office legacy formats that are supported throughout, greatly (and unnecessarily) contributing to the size and complexity of the 6,000 page standard.
- Q(2): Is OOXML flawed in and by itself?
- A(2):Yes, I think so, partly because of Rodriguez's article, partly because of flaws documented elsewhere: see http://www.noooxml.org/petition [noooxml.org] The points 2,3,4,5 listed there seem especially crippling to me:
(2) There is no provable implementation of the OOXML specification: Microsoft Office 2007 produces a special version of OOXML, not a file format which complies with the OOXML specification;
(3) There is information missing from the specification document, for example how to do an autoSpaceLikeWord95 or useWord97LineBreakRules;
(4) More than 10% of the examples mentioned in the proposed standard do not validate as XML;
(5) There is no guarantee that anybody can write software that fully or partially implements the OOXML specification without being liable to patent lawsuits or patent license fees by Microsoft;
- Q(3): What's the practical relevance of the question whether OOXML is flawed?
- A(3): Enormous. We currently see Microsoft trying to convince the world to accept OOXML as an ISO "standard", whereas it's no such thing. It's too loosely defined, and unlike the existing OpenDocument standard it has no open-source reference implementation. So there will be a morass of possible implementations, of which only Microsoft's own implementations will be guaranteed mutually compatible. That's a polite way of saying that Microsoft simply aims at continuing its format lock-in, only this time under the name of OOXML.
- Q(4) : So what's in it for Microsoft? Why do they bother?
- A(4) : Well ... Microsoft has a policy whereby it quite explicitly does not want other people's software, let alone Open Source software, to render MS Office documents correctly.
For reference, see this email, (cited from Rodriguez's article):
Is that
Re:I see Miguel has flown in fast to defend Micros (Score:3, Informative)
Call me crazy, but unlike Bush I do not divide the world into "them" and "us". I like to live in a world of colors, a world of Pantone if you will, and abandon the black and white mentality.
There are good and bad things about Microsoft. When they do something bad, I point it out, when they do something good, I do not see why I would not point it out. I also try to judge everyone with the same metric, I do not use one metric to judge Microsoft, and another one for us.
Stephane's article touches on a subject that I have plenty of experience on (I originally wrote Gnumeric, and later worked with Sun to open source StarOffice and over the years worked to grow the OOo team at Ximian and later at Novell).
Stephane's criticism lacks meat. If someone had done a review of Linux with this level of quality, we would have rightfully called it bullshit.
Miguel.
Re:This is not proof of OOXML being defective by d (Score:3, Informative)
Also, Mac Office supports OLE as well, so it's not "Windows-only".
And you claim that OLE is "newly discovered"? It's been around for over 13 years, and was present in the very first OOXML specs.
I don't know about SSPI, but given that your OLE knowledge is so woeful, I feel safe in assuming that your SSPI complaint is FUD as well.
That which is unspecified cannot be implemented. (Score:1, Informative)
OOXML docs can incorporate other, unspecified, MS technologies. Hence, OOXML can only be implemented faithfully by MS itself.