
Dark Corners of the OpenXML Standard 250

Standard Disclaimer writes "Most here on Slashdot know that Microsoft released its OpenXML specification to counter ODF and to help preserve its market position, but most people probably aren't aware of all the interesting legacy code the OpenXML specification has brought to light. This article by Rob Weir details many of the crazy legacy features in the dark corners of OpenXML. As it concludes after analyzing specification requirements like suppressTopSpacingWP, 'so not only must an interoperable OOXML implementation first acquire and reverse-engineer a 14-year old version of Microsoft Word, it must also do the same thing with a 16-year old version of WordPerfect.'"
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • by __aaclcg7560 ( 824291 ) on Thursday January 04, 2007 @11:58PM (#17469738)
    Until it supports WordStar [wikipedia.org] documents.
  • Length (Score:4, Funny)

    by jcnnghm ( 538570 ) on Friday January 05, 2007 @12:08AM (#17469792)
    I don't know why anyone would complain, the spec is only 6,000 pages long.
    • Size (Score:5, Funny)

      by Kadin2048 ( 468275 ) <slashdot...kadin@@@xoxy...net> on Friday January 05, 2007 @12:30AM (#17469924) Homepage Journal
      I don't know why anyone would complain, the spec is only 6,000 pages long.

      And the best part is, these [umn.edu] are the pages it uses... (I mean, why else do those specs cost so much?)
    • Well, if you were to start implementing that spec at the rate of one page per hour, you'd be done in just 6000 hours.
The company boss would say: why not just give 8 pages each to 750 developers, and by the end of the day we should have a fully working product?

        While this is ridiculous, I'm sure the spec could be broken up into specs for a few different modules. After all, if Microsoft wrote the spec, and has implemented the spec, then how difficult could it be?

        I once spent 18 months writing a 3000 page spec, and it only took a team of 5 another year to implement it. Of course since then whenever someone asks me if I would li

        • Re: (Score:3, Funny)

          by chthon ( 580889 )

          Yes, but modular programming is antithetical to Microsoft's way of doing things.

        • Re: (Score:2, Insightful)

          by redcane ( 604255 )
          I think they may have implemented it, and then made a spec to take into account their horrible implementation.
        • Re: (Score:3, Insightful)

          After all, if Microsoft wrote the spec, and has implemented the spec, then how difficult could it be?

          Did you read the article? Some of the spec is things like "do what MS Word 5.1.4 did with line spaces." How exactly is anyone other than MS supposed to implement that? By reverse engineering a whole slew of old products that are not even available on the market anymore?

          I once spent 18 months writing a 3000 page spec, and it only took a team of 5 another year to implement it.

          That's fine but this spec isn

    • MS are slow learners (Score:4, Interesting)

      by WebCowboy ( 196209 ) on Friday January 05, 2007 @02:14AM (#17470484)
      ...but they do learn....slowly...eventually.

      Their "open" XML format for office docs is a prime example of this.

      I think Steve Jobs was the one who first said "Microsoft just doesn't get it". Microsoft was probably the very first third-party software developer for the Mac, and this was Jobs' reaction to Microsoft's first Mac applications (a port of Multiplan, I think--which IIRC was reincarnated into Excel--and MS BASIC). They really WERE "tasteless" and ugly, and took almost no advantage of the revolutionary GUI--their DOSness really showed through. In the case of Multiplan, I think the mouse could be used only to jump the cursor to a certain cell, and that was it--the rest was all like in DOS.

      MS Windows is another example--Microsoft didn't "get it" well enough until the third major release. Now MS is SLOWLY "getting it" with the beneficial characteristics of XML standards. Microsoft's early XML efforts are like Windows 1.0--there is some very rudimentary understanding of the mechanics but not the philosophy of XML, and I wonder if this is why SOAP ended up NOT so simple (given that Microsofties were involved in its creation and seemed to be trying to make it a DCOM-in-XML-but-dumber thing). Microsoft's "Version 1" XML might look like this:

      <Soap:Envelope>
        <Soap:Body>
          <wsWriteLegacyData>
            <encodedBinaryData>
      SDFgkdfkljSDFJLDFSJKLkjdfbks df jklsdfklj;hk/jkjnb.kndf
      jk.sdfjkldfsddfsdfkkjsdfh kvbkjnkjkjksdfkjsdfkeuieru903
      oijooeoefvkmefmklef lmkseflkvfeklmlmermklemleflmdvldflk
            </encodedBinaryData>
          </wsWriteLegacyData>
        </Soap:Body>
      </Soap:Envelope>
      "See? We're using XML and SOAP! We're hip! We're cooool! You can't say we don't play by the rules now!"

      Of course, this is an obtuse, opaque and obfuscated way to use XML, and totally NOT in the spirit of interoperability and openness. I won't even go into the nifty XML tools MS has made...nifty to use, but they've done a lot to obliterate the S out of SOAP in their crazy output.

      The OOXML (Opaque and Obfuscated XML) standard is "version 2.0"--they're doing their best to eliminate ambiguity, but now we've gone over to hyper-specificity, and the standard is being shared a bit better...the problem is that they don't fully describe the interpretation of the standard's elements, so as to keep their advantage. All they've done is take every formatting option and map it to an XML element--it is monolithic and completely non-extensible. But hey, at least it's publicly available and doesn't involve weirdness like encoded binary blobs.

      In a few years MS will reach version 3.0 of "getting" XML...
  • by JoshJ ( 1009085 ) on Friday January 05, 2007 @12:09AM (#17469800) Journal
    This is why the Microsoft Office XML (let's not kid ourselves, this is far from "open") format should not become an ISO standard.
    • Re: (Score:3, Interesting)

      Totally agree. I wonder how it managed to get approved by ECMA? IIRC only IBM didn't agree to its approval; all other parties (whoever they are) agreed. I don't understand what they felt was good about this "standard" especially given that ODF had already been approved.
      • Re: (Score:2, Funny)

        I always thought the ECMA was something to be purchased. ;o)
      • Re: (Score:3, Informative)

        by mwvdlee ( 775178 )
        Nobody takes ECMA seriously anyway.

        You probably know that JavaScript has been standardized as "ECMAScript" by ECMA; everybody just ignores that standard.
    • by _|()|\| ( 159991 ) on Friday January 05, 2007 @01:51AM (#17470366)
      Prior to reading this article, I was ambivalent about Office XML. The push to standardize Office's "DNA sequence" seemed disingenuous, but at least the format was described in detail. Now I see that the table-sagging 6,000 pages is just the tip of the iceberg: this "standard" effectively includes, by reference, the source code for every prior version of Office, to which only Microsoft has access.
    • I thought ODF was an updated version of the venerable OpenDoc standard pioneered by IBM, Apple, and others. Doesn't it mean "Open Doc Format"?

      If so, it was a de facto industry standard long, long, long before OpenOffice existed.

    • Disadvantages of ISO (Score:5, Interesting)

      by BillGatesLoveChild ( 1046184 ) on Friday January 05, 2007 @02:08AM (#17470444) Journal

      Once it is ratified as an ISO standard, the standard is locked up and anyone who wants a copy has to buy it from ISO. These are copyrighted. They're not cheap: thousands of dollars, out of the reach of the average hobbyist, and not listed anywhere on the Internet. That 6,000-page draft will vanish into the mists of time.

      Larger companies can afford this, but garage companies and hobbyists definitely can't. So what's the chance of an open source project or even a small upstart challenging Microsoft's Documentonopoly? Zero.

      Want another example? ISO country codes. The country codes (e.g. .us, .jp) are actually ISO, and ISO ended up backing off from a demand for royalties for these(!) But if you want state codes (e.g. California, Kantou), well, forget it unless you want to buy them off ISO. http://www.alvestrand.no/pipermail/ietf-languages/2003-September/001472.html [alvestrand.no]

      ISO aren't the only ones guilty of doing this. The IEEE does it as well. Want the latest simulation standard? Then get out your checkbook: http://standards.ieee.org/catalog/olis/compsim.html [ieee.org]

      ISO and the IEEE are enemies of openness. Microsoft is taking a page out of their playbook.

      ISO or IEEE certification is a *bad* thing.

      • Re: (Score:2, Informative)

        by EvanED ( 569694 )
        Once it is ratified as an ISO standard, the standard is locked up and anyone who wants a copy has to buy it from ISO. These are copyrighted. They're not cheap: thousands of dollars, out of the reach of the average hobbyist, and not listed anywhere on the Internet. That 6,000-page draft will vanish into the mists of time.

        You mean like the C++ standard (ISO/IEC 14882), which can be downloaded as a PDF for $32 or purchased in hardcopy for something like $300, and for which there are multiple sources for drafts?
        • Re: (Score:2, Informative)

          They wouldn't get too far gouging you for a C++ manual. Here are some examples of what I am saying:

          ISO/IEC TR 9126 "Software engineering -- Product quality": US$153 each volume * 4 volumes = US$612
          IEEE 1278: US$151 each volume * 6 volumes = US$906

          Problem is, when you are told your software has to comply with one of these, these are the only shops in town. They prohibit copying or sharing the information. Anyone who wants to meet the standard has to send ISO or IEEE money, and there are many, many of these
  • by Anonymous Coward on Friday January 05, 2007 @12:09AM (#17469804)
    The power of legacy systems is at once both Microsoft's greatest strength and greatest weakness. Nobody in OSS is going to have the patience to rebuild the same level of backwards compatibility needed to displace them, but the code must be an absolute tarpit of accumulated cruft and security holes that's incredibly difficult for them to keep going.
    • Nobody in OSS is going to have the patience to rebuild the same level of backwards compatibility

      I'm sure there are plenty of people that would do it if they had access to the dev docs that Microsoft works from.

      The hitch here is that *not* having them means tons and tons of reverse engineering, and that's only after tracking down every release of every version of every MS Office ever. Reverse engineering can be fun, but I have a hard time imagining that figuring out character spacing in the Mandarin version
      • by Erris ( 531066 ) on Friday January 05, 2007 @04:31AM (#17471080) Homepage Journal

        The hitch here is that *not* having them means tons and tons of reverse engineering, and that's only after tracking down every release of every version of every MS Office ever.

        The real hitch, as the article hints, is that the releases are contradictory. For instance, the Mac version of small caps is different from the others. This is part of the reason Word is so bloated and does not preserve typesetting from one machine to the next.

        Ten years ago, a state agency I was working for was forced to move from WordPerfect to Word. Hundreds, if not thousands, of documents were painstakingly converted from one format to the other. The typesetting, which they had never had a problem with previously, was easily broken by moves from one machine to another or by changing printers. That is the kind of thing that no program can account for - it was broken then and cannot be recreated correctly today. It's also probably the reason for all of the nebulous "guidance" sections that don't tell you anything other than to look at, and presumably measure, old printed examples. Not even M$ knows what it was really doing in the field. As I saw at the time, no two were alike.

        Of course, the time to get things right is not in your XML, it's when you import the document. The author tells us this in so many words. The XML should be general enough to encompass any kind of typesetting. It is the importing program's task to figure out what the old format wanted things to look like. As the author points out, the spec does not do anything other than create something impossible to follow. It's not going to magically make things look right no matter how hard they wish it would.

        • Re: (Score:3, Informative)

          by stg ( 43177 )
          I had a fun problem with a version of Word (for Windows 2.0, I think) many years back. Some friends came by to print a paper for a CS class, and the files they brought were made with the Brazilian Portuguese version of Word.

          I had the English version of Word. When I tried to print, I discovered (after a lot of pages, of course) that I had to fix the formatting because some of the formatting was translated... And not even logical stuff like accents - page breaks, footnotes, etc.
  • Sweet! I actually have copies of those somewhere. The reverse engineering process will begin immediately. Now where did I put my 286....
  • Basically (Score:5, Insightful)

    by DrYak ( 748999 ) on Friday January 05, 2007 @12:11AM (#17469810) Homepage
    ODF is the former SXW format that was taken and transformed into a standard by a committee comprising several office software makers. It's supposed to describe the normal features that anyone should expect from any word processing application, be it OpenOffice.org, KWord, AbiWord, Corel WordPerfect, etc., all in a perfectly neutral way. It was designed with a function in mind (storing word processing documents in an open and interoperable way). Its benefits are comparable to the standardisation of HTML.

    OpenXML is Microsoft trying to translate its proprietary DOC file into an XML container (because it's a big buzzword) and propose it as a standard to ECMA (because everyone is talking about ODF being an ISO standard). It describes not only what is to be expected from a word processor, but also every MS-Word-specific microsoftism. It was designed with a specific piece of software in mind (and partly derives from the internal functioning of MS-Word). It's only a small improvement over the previous MS XML format (which had a lot of information hidden in a binary blob).

    The good thing for Microsoft is that they can pretend this limitation is "Not-a-bug-but-a-feature", and brag that there is a lot of stuff that MS-Word couldn't store inside an ODF file and only OpenXML can carry.

    Microsoft's plan :
    1. Embrace
    2. Extend <- They are here
    3. Extinguish
    • by Anonymous Coward on Friday January 05, 2007 @12:34AM (#17469946)

      ODF spec page count: 722 [iso.org].

      OpenXML spec page count: 6000 [regdeveloper.co.uk]!!
    • OpenXML is Microsoft trying to translate its proprietary DOC file into an XML container .... The good thing for Microsoft is that they can pretend this limitation is "Not-a-bug-but-a-feature", and brag that there is a lot of stuff that MS-Word couldn't store inside an ODF file and only OpenXML can carry.

      Pretend is the operative word. Translation is supposed to happen when you import the crufty old crap. M$ may have an advantage there, but you won't find that ability in the 6000 pages of their spe

      • Re: (Score:2, Funny)

        by medlefsen ( 995255 )
        When you hold down shift and slowly extend your finger towards the 4 key are you seriously thinking to yourself, "Ha, take that Microsoft!" Cause if you are you need a new hobby.
    • Re: (Score:3, Insightful)

      by megabyte405 ( 608258 )
      ODF is a nice idea in theory, but really, it's a similar situation (OpenOffice.org's internal data format jammed into a standard, so designed with OO.o in mind by necessity), just with more OSS-positive karma associated. There's nothing wrong with saving in a file format that matches your internal representation; in fact, it's a darn good idea (see .ABW for AbiWord, .DOC for Word, and .WPD for WordPerfect, which I would wager is the same idea). However, interoperability seems to work best when taken from the ground
      • Re:Basically (Score:4, Insightful)

        by blincoln ( 592401 ) on Friday January 05, 2007 @01:45AM (#17470338) Homepage Journal
        There's nothing wrong with saving in a file format that matches your internal representation, in fact, it's a darn good idea (see .ABW for AbiWord, .DOC for Word, .WPD for WordPerfect I would also wager is the same idea).

        I would argue that when it's taken to the extreme of Office prior to 2007, it *is* a bad thing. AFAIK, the old Word format is more or less a (very) partial RAM dump (which is why you can often find all sorts of interesting stuff in Word files that the authors think they've deleted). That makes for faster dev times, but because the load and save functions don't really "understand" the content of the file, IMO the developers made things a lot harder for themselves in the big picture. I imagine reproducing issues in testing is a particular nightmare.
        • Oh, I'm sure testing is a nightmare, and it can't be good for performance to be going from a binary memory dump to a binary memory dump probably encoded somehow and shoehorned into XML so you can use those three letters. (Apologies for the lack of solid knowledge - for legal reasons I'd rather not know too much about the intricacies of Microsoft OpenXML.) I was referring more to the fact that .doc is reasonable for use by Word, though it certainly is a pain to load and no good for interchange even betwee
      • Re:Basically (Score:5, Interesting)

        by Nicopa ( 87617 ) <nico@lichtmaier.gmail@com> on Friday January 05, 2007 @01:46AM (#17470342)
        No. ODF has several real, factual benefits. It might have originated in a single product, but it reuses existing standard technologies (SVG, CSS...). It has properly designed XML tags that act as "markup", whereas in OOXML the XML tags act as containers for chunks of data. ODF tries to separate content from style.

        And about your RTF suggestion... can I draw diagrams with RTF? Can I have a ToC? Can I do complex styling? Can I have a "gallery" of styles? Can I include images? No. RTF is not a solution.
        • Re: (Score:3, Interesting)

          by megabyte405 ( 608258 )
          Actually, I think for most of the things you suggest, you can do them - I know AbiWord supports them at least. (images, complex styles, TOC) RTF's really not the old dog it seems to be - keep in mind that for copy/paste of any sort of rich text to work in any sensible manner on Windows, one _must_ support RTF well.
        • RTF can do all of the things that you mention. But not all apps support RTF as well as others. I think Word has the most complete impl of RTF.
        • Re: (Score:3, Informative)

          by dominator ( 61418 )

          Can I draw diagrams with RTF? Can I have a ToC? Can I do complex styling? Can I have a "gallery" of styles? Can I include images? No. RTF is not a solution.

          Actually, you can. RTF can express most (if not all) of what the Microsoft Word format can. Let me answer your objections using excerpts from the RTF 1.8 specification:

          The \tc control word introduces a table of contents entry, which can be used to build the actual table of contents.

          The \stylesheet control word introduces the style sheet group, which conta

      • Re:Basically (Score:4, Informative)

        by iluvcapra ( 782887 ) on Friday January 05, 2007 @02:15AM (#17470486)

        After having written some tools on OS X that do stuff with RTF:

        RTF is well documented and you can make an RTF document on all manner of platforms (I've done it in Ruby and Cocoa), but many platforms have extended RTF in their own way in order to support special features. OS X has added a few special methods to RTF files to support Mac OS X typography, and I've noticed that different versions of Word handle document attributes (like headers and page numbers) in different ways.

        RTF is great if you want to make up something quick that is ONLY formatted text, but readers have all manner of different ways of interpreting the exact appearance of tables, page layouts and margins, and there doesn't seem to be any manageable common mechanism for including images or other documents, something Word and OO.org excel at (pun intended). Even HTML seems to be better at this.

        I use RTF output in a few little in-house tools I have, so people can get the text+attributes they create and open them in a text editor of their choice for touching-up and delivery. When my tools have to create something that is supposed to be finished, they make PDFs.

        RTF is great for interoperability, but I never expect an RTF file to contain a "finished product," unless the recipient expects quality on par with a Selectric. It is merely a relatively-open serialization format for strings with attributes.
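[Editor's note: the kind of "formatted text only" RTF the commenter describes can be produced by hand in a few lines. The sketch below (Python is my choice; `make_rtf` is a hypothetical helper, not part of any library) uses only a handful of core RTF 1.x control words (\rtf1, \fonttbl, \b, \par); real-world writers layer vendor extensions on top of these, which is exactly where readers start to diverge.]

```python
def make_rtf(paragraphs):
    """Build a minimal RTF 1.x document from (text, bold) pairs.

    Only well-known control words are used; this is a sketch of the
    format's shape, not a full implementation of the RTF spec.
    """
    body = ""
    for text, bold in paragraphs:
        # Escape the three characters RTF reserves: backslash and braces.
        run = (text.replace("\\", "\\\\")
                   .replace("{", "\\{")
                   .replace("}", "\\}"))
        body += ("{\\b %s}" % run if bold else run) + "\\par\n"
    return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\n" + body + "}"

doc = make_rtf([("Formatted text is the easy part.", False),
                ("Page layout is where readers disagree.", True)])
print(doc)
```

Any word processor mentioned in the thread (Word, OO.org, AbiWord) should open the result, though, per the comment above, the exact page geometry they render is anyone's guess.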

      • by Geof ( 153857 ) on Friday January 05, 2007 @03:03AM (#17470690) Homepage

        There's nothing wrong with saving in a file format that matches your internal representation, in fact, it's a darn good idea (see .ABW for AbiWord, .DOC for Word, .WPD for WordPerfect I would also wager is the same idea).

        Documents are worth far more than software, and they outlive the applications used to create them. See the comment [robweir.com] to the original article - reading documents after 5, 20, 30, 100 years or more is not optional. You can pay the price of developing an independent format now, or you can pay the price of reverse engineering over and over again every time you change your internal representation.

        Repeated implementation limits future change and innovation. It's expensive: it likely costs more even for Microsoft. But they can afford it; their competitors may not be able to. Plus, Microsoft already has their first implementation.

        interoperability seems to work best when taken from the ground up - when working with another application's data structure of any complexity, you simply can't do a lossless roundtrip: you're losing before you've started.

        Perhaps so. But compare that cost to the cost I've just outlined. It is in the best interest of users and software developers (maybe even of Microsoft) to bite the bullet now, do the conversion once, and develop a clean format for the future.

        Maybe you have in mind an argument you're not making, but I don't see any sufficient basis for your broad contention that using a file format based on an internal representation is a "darn good idea". In specific cases, yes (e.g. where development time or effort are the most important factors). In general, I very much doubt it. That successful applications in the past have taken that approach is weak evidence. They were developed when the up-front cost of development in a time of rapid innovation, the loss of customer lock-in, and a lack of open-format competition were good business reasons for making such a choice - even if it was inferior technically, increased cost in the long term, and was bad for consumers. In today's climate of slower innovation, competition from open formats, and customers who are waking up to their own long-term interests, the situation is different.

        Which is not to say Microsoft's apparent attempt to set the rules of the game and throw sand in the gears of change is not in their interests, or that it will be unsuccessful.

        • Re: (Score:3, Informative)

          by LizardKing ( 5245 )

          Documents are worth far more than software, and they outlive the applications used to create them. See the comment to the original article - reading documents after 5, 20, 30, 100 years or more is not optional.

          Which is why medical, legal and military records are often not held in word processor formats. For instance, the military records I have dealt with (NATO mostly) are held in SGML, conforming to carefully designed MIL DTDs that preserve structure rather than presentation. These files can be transl

      • Re: (Score:3, Insightful)

        by AuMatar ( 183847 )
        No, saving the internal representation as a file is an utterly fucktarded idea. You see, an internal representation is made to make the most sense for your implementation of various features. It changes frequently, sometimes with every patch. It has performance hacks, redundancy, etc. A file format is supposed to be a representation of the data in an easy to parse format, so it can be loaded by applications.

        So what happens when you use the internal representation as the file format? Well, you have a fi
        • Re: (Score:3, Interesting)

          by megabyte405 ( 608258 )
          Well, AbiWord serializes its internal data structure into XML, so it's not an exact dump - it lets us do things like have backward-compatible additions such as LaTeX and MathML equations and include an image preview of the equation as a fallback, for instance. There are things you can do to make your internal format more lucid, and binary->text is one of those things: I can fix almost anything that can go wrong with an AbiWord doc (usually only happens in dev releases, but sometimes strange things happe
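[Editor's note: the trade-off this subthread is circling can be sketched in a few lines. This is a hypothetical Python illustration, not how Word or AbiWord actually serializes anything: dumping the in-memory structure couples every saved file to one implementation's internals, while writing an explicit format records only what the document means.]

```python
import json
import pickle

class ParagraphV1:
    """Internal representation: includes a cached pixel width, a pure
    implementation detail that has no business living in a document file."""
    def __init__(self, text):
        self.text = text
        self._cached_pixel_width = len(text) * 7  # perf hack

def save_dump(paragraphs):
    # "Memory dump" approach: serialize the objects as-is, cache and all.
    # Rename the class or drop the cache field, and old files break.
    return pickle.dumps(paragraphs)

def save_clean(paragraphs):
    # Explicit-format approach: write only the document's meaning.
    # The internal class can be refactored freely without breaking files.
    return json.dumps({"paragraphs": [p.text for p in paragraphs]})

paras = [ParagraphV1("hello"), ParagraphV1("world")]
print(save_clean(paras))
```

The pickle here plays the role of the old .DOC "RAM dump"; the JSON plays the role of a designed interchange format like ODF's markup.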
      • by NickFortune ( 613926 ) on Friday January 05, 2007 @09:15AM (#17472688) Homepage Journal
        ODF is a nice idea in theory, but really, it's a similar situation (OpenOffice.Org internal dataformat jammed into a standard, so designed with OO.o in mind by necessity)
        The ODF format must necessarily describe the structure and layout of an office document. There's no need for it to reflect the internal data structures of any specific application, except to the extent that they too describe office documents.

        OOXML includes data elements that should be part of internal import routines rather than being enshrined in the document format, and it includes elements that are not specified except by reference to applications for which no public specs exist. This is the problem, not the fact that OOXML is derived from MS Office file formats.

        RTF. It may not get press attention, but it's actually a fairly well-documented standard, has been working as an interchange format for years, and yet is designed with enough expandability that it's still useful with the kinds of documents produced today. It's a true de-facto standard.
        Well, I was a big fan of RTF at one time. But a few years back I found that documents with any kind of formatting more complex than paragraph+justification+font just weren't surviving the trip to MS Office and back. I don't know if this was because the format couldn't cope, or because of faulty implementations. In either case, it led me to give up on RTF.

        In any event, to be a replacement, RTF would need to work for spreadsheets and presentations at a minimum - something I don't think there's a lot of support for in the current RTF specification. We'd also lose the benefits of an XML based format, which given the amount of work on the seamless integration of XML documents into databases, web services and other data management applications means losing a lot of functionality.

        for those who really want interoperability, RTF is the way to go with today's software
        Interoperability is only part of the problem. We also want a spec that can be fully and freely implemented by anyone, and which isn't under the control of any single vendor. We want a format to which we can entrust documents, knowing that in twenty years' time there will be an application capable of reading them.

        an unnecessary dichotomy is drawn between OpenXML and ODF with regard to their design goals - both are repurposed native formats for a single application.
        I don't know what you mean by native in this case, but the repurposing of OOXML isn't the problem. The problem is one of size and obfuscation and, as TFA points out, specification by reference to closed formats and the behaviour of extinct proprietary software. These are non-trivial problems with OOXML which are not (to the best of my knowledge) found in ODF.

        There's nothing wrong with ODF. Re-creating it based on the non-XML RTF would be a waste of time and effort.

    • Not to forget:

      4. Profit!!!
  • by junglee_iitk ( 651040 ) on Friday January 05, 2007 @12:18AM (#17469854)
    You want to hire a new programmer and you have the perfect candidate in mind, your old college roommate, Guillaume Portes. Unfortunately you can't just go out and offer him the job. That would get you in trouble with your corporate HR policies which require that you first create a job description, advertise the position, interview and rate candidates and choose the most qualified person. So much paperwork! But you really want Guillaume and only Guillaume.

    So what can you do?

    The solution is simple. Create a job description that is written specifically to your friend's background and skills. The more specific and longer you make the job description, the fewer candidates will be eligible. Ideally you would write a job description that no one else in the world except Guillaume could possibly match. Don't describe the job requirements. Describe the person you want. That's the trick.

    So you end up with something like this:

    * 5 years experience with Java, J2EE and web development, PHP, XSLT
    * Fluency in French and Corsican
    * Experience with the Llama farming industry
    * Mole on left shoulder
    * Sister named Bridgette

    Although this technique may be familiar, in practice it is usually not taken to this extreme. Corporate policies, employment law and common sense usually prevent one from making entirely irrational hiring decisions or discriminating against other applicants for things unrelated to the legitimate requirements of the job.

    But evidently in the realm of standards there are no practical limits to the application of the above technique. It is quite possible to write a standard that allows only a single implementation. By focusing entirely on the capabilities of a single application and documenting it in infuriatingly useless detail, you can easily create a "Standard of One".

    Of course, this begs the question of what is essential and what is not. This really needs to be determined by domain analysis, requirements gathering and consensus building. Let's just say that anyone who says that a single existing implementation is all one needs to look at is missing the point. The art of specification is to generalize and simplify. Generalizing allows you to do more with less, meeting more needs with few constraints.

    Let's take a simplified example. You are writing a specification for a file format for a very simple drawing program, ShapeMaster 2007. It can draw circles and squares, and they can have solid or dashed lines. That's all it does. Let's consider two different ways of specifying a file format for ShapeMaster.

    In the first case, we'll simply dump out what ShapeMaster does in the most literal way possible. Since it allows only two possible shapes and only two possible line styles, and we're not considering any other use, the file format will look like this:

    <document>
    <shape iscircle="true" isdotted="false"/>
    <shape iscircle="false" isdotted="true"/>
    </document>

    Although this format is very specific and very accurate, it lacks generality, extensibility and flexibility. Although it may be useful for ShapeMaster 2007, it will hardly be useful for anyone else, unless they merely want to create data for ShapeMaster 2007. It is not a portable, cross-application, open format. It is a narrowly-defined, single-application format. It may be in XML. It may be reviewed by a standards committee. But it is, by its nature, closed and inflexible.

    How could this have been done in a way that works for ShapeMaster 2007 but is also more flexible, extensible and considerate of the needs of different applications? One possibility is to generalize and simplify:

    <document>
    <shape type="circle" lineStyle="solid"/>
    <shape type="square" lineStyle="dotted"/>
    </document>

  • by Bob54321 ( 911744 ) on Friday January 05, 2007 @12:29AM (#17469916)
    I thought most people considered themselves lucky if their documents could open in successive versions of Office. Why would anyone want to implement support for really old versions if Microsoft does not do it themselves?
    • Re: (Score:2, Informative)

      by kfg ( 145172 )
      Why would anyone want to implement support for really old versions if Microsoft does not do it themselves?

      Nobody would. That's the point of it.

      KFG
  • My favorite quote (Score:4, Insightful)

    by IvyKing ( 732111 ) on Friday January 05, 2007 @12:50AM (#17470032)
    From TFA


    This is not a specification; this is a DNA sequence.


    Outrageously funny and to the point.

    • This is not a specification; this is a DNA sequence.

      It's appropriate to note that the 6000 pages will only fit the DNA of a few pathogens [psu.edu]:

      "Measured as Manhattan telephone books, each containing about 1,000 pages of 10-point type," said Simpson, "the genome of the bacterium E. coli is about a third of a book. Baker's yeast, which is my specialty, is a full book. The human genome will occupy two hundred books."

      Other parts of the article about genetic disorders, witches and demonic possession are also approp

  • by PurifyYourMind ( 776223 ) on Friday January 05, 2007 @01:48AM (#17470350) Homepage
    ...14- and 16-year-olds is illegal.
  • This was a worrying, but good, article. I'm sure MS is in a bit of a tight spot as well, if they really desire backwards compatibility (which is what they survive on, in a way). But it would make more sense to make supporting legacy documents more optional.

    When I save a Word 2007 document to the old .doc format, it warns me that "minor loss of fidelity" may happen. Similarly, when opening a document, supporting waybackthen formats could be optional/plugin-based, and the app rather warning that "minor loss of f
  • You can view all the atrocities of OpenXML that he's blogged about here [robweir.com]. Highlights include dumping bitmasks into XML as hexadecimal on a byte-by-byte basis, and an XML element for specifying whether the dates in the workbook start in 1904.
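    Both complaints are easy to illustrate with a sketch. The bit positions below are invented for illustration (a real OOXML consumer has to look each one up in the spec), but the 1,462-day epoch shift of the 1904 date system is the real behavior the `date1904` workbook flag selects:

    ```python
    from datetime import date, timedelta

    # A hex bitmask dumped byte-by-byte into an attribute, e.g.
    # <compat flags="0005"/>. The XML carries no field names at all;
    # the bit meanings here are hypothetical stand-ins.
    flags = int("0005", 16)
    suppress_top_spacing = bool(flags & 0x1)   # bit 0 (hypothetical)
    truncate_font_height = bool(flags & 0x4)   # bit 2 (hypothetical)

    # Spreadsheet serial dates: one workbook-level flag shifts the epoch
    # of every date in the file by 1,462 days.
    def serial_to_date(serial: int, date1904: bool) -> date:
        epoch = date(1904, 1, 1) if date1904 else date(1899, 12, 30)
        return epoch + timedelta(days=serial)

    print(serial_to_date(40000, False))  # 2009-07-06
    print(serial_to_date(40000, True))   # 2013-07-07
    ```

    The point of the criticism is that neither construct is self-describing: the hex blob and the epoch flag both push the semantics out of the document and into 6,000 pages of prose.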

    I can't believe this became a ratified standard.

    "Let him who has understanding calculate the number of the beast, for the number is that of a standard; and its number is three hundred and seventy-six." Common-freaking-sense 13:16-18

    - shadowmatter
    • by kahei ( 466208 )

      IMHO those are more serious problems. They're enough to make it be what I'd call a Long Ugly Hastily Written Standard, which somehow doesn't really surprise me.

      The thing the original article is freaking out about -- legacy compatibility flags -- isn't really an issue. The standard has to include the features offered by existing word processors. Sometimes those features are undocumented, obscure, and almost totally forgotten. What do you do? Find the last remaining copy of the code, figure out exactly how WP4 buggi
  • by frisket ( 149522 ) <<peter> <at> <silmaril.ie>> on Friday January 05, 2007 @04:43AM (#17471144) Homepage
    It's instructive to observe the panic-ridden frenzy with which Microsoft have approached the business of using XML as a file format. The marketing influence is all too plain to see, with the result that they feel an inner compulsion to preserve the appearance of the document at all costs, sacrificing all logic and common-sense to do it.

    OOo did the same, but with greater elegance and less haste because they were ahead of the field. Corel screwed it up with WordPerfect by keeping their stylesheet format proprietary so that transfer between WP document code and XML was made as hard as possible (a Class A blunder, given that their XML editor is actually quite good). AbiWord makes a good job of saving DocBook XML, but it's not trying to pretend it's reimportable; it screws up LaTeX formidably, though, by trying to pretend that it absolutely has to preserve line-length and font-size, which is evidence of the same neurotic attitude as Microsoft.

    The problem in all cases is not that the assorted authors and coders don't understand XML (although some of them clearly failed that test too), but that they don't understand documents. This is particularly true at Microsoft, where leaders such as Jean Paoli have been proselytizing XML for years. They still think a document is a jumble of letters; they have no idea of structure, and the DOM is simply laughable as a non-model of a document. Microsoft's particular problem with XML is that they came to it too late, and viewed it as a way of storing data, not text...indeed to this day many XML users, trained with Microsoft blinkers on, are unaware that XML can be used for normal text documents.

    With this level of ignorance surrounding Microsoft, it's hardly unexpected that they should blunder so badly.

    • To normal non-nerd users and most nerds as well, a document is a jumble of letters.

      I highlight text. I click the "B" button to make my text bold. I don't screw with styles.

      Sorry to burst your bubble, dear Holy Priest Of The Most Highest XML.
  • Where is the problem in doing the conversion (for the legacy features) in the converter, so that the new format is free from this bloat? OK, it's harder to write the converter (which has to implement these old behaviors), but it's Microsoft who wants to have the backward compatibility. So it only needs to be done once.
  • As often, purism is the enemy of progress here. Whilst it'd be great to be able to render, faithfully, every detail of any legacy document - it's an unnecessary and unrealistic constraint. One day, Microsoft themselves will choose to drop support for WPx or WW8 etc. They will. Really, they will. For owners of documents whose only record is held in proprietary formats - that will happen one day. Might as well happen with the adoption of a standard which prevents it happening again. Let's face it - PC'
  • If you were faced with output from a 15 year old program, what would you do? 15 years? In software, that's an eternity. These tags are essentially saying "here is where this old crap used to be". How many people are actually using these programs? Maintaining documents in the old format? I defy any of you out there in Linux-land to say you wouldn't take the same approach under the same set of circumstances. Actually, Linux people would probably just say "it may not open old documents properly, but tha

"The identical is equal to itself, since it is different." -- Franco Spisani
