Microsoft Releases Office Binary Formats
Microsoft has released documentation on their Office binary formats. Before jumping up and down gleefully, those working on related open source efforts, such as OpenOffice, might want to take a very close look at Microsoft's Open Specification Promise to see if it seems to cover those working on GPL software; some believe it doesn't. stm2 points us to some good advice from Joel Spolsky to programmers tempted to dig into the spec and create an Excel competitor over a weekend that reads and writes these formats: find an easier way. Joel provides some workarounds that render it possible to make use of these binary files. "[A] normal programmer would conclude that Office's binary file formats: are deliberately obfuscated; are the product of a demented Borg mind; were created by insanely bad programmers; and are impossible to read or create correctly. You'd be wrong on all four counts."
Joel (Score:2, Insightful)
Re: (Score:2, Insightful)
Re:Joel (Score:5, Insightful)
On the subject of the Office Document format, I believe that everything he says is also true; but with a few caveats. The first is the subject of Microsoft intentionally making Office Documents complicated. I fully accept (and have accepted for a long time) that Office docs were not intentionally obfuscated. However, I also accept that Microsoft was 100% willing to use the formats' inherent complexity to their advantage to maintain lock-in. The unnecessary complexity of OOXML proves this.
The other caveat is that I disagree with his workarounds. He suggests that you should use Office to generate Office files, or simply avoid the issue by generating a simpler file. There's no need to do this, as it's perfectly possible to use a subset of Office features when producing a file programmatically. Libraries like POI can produce semantically correct files, even if they aren't the most feature-rich.
Re:Joel (Score:5, Informative)
I'm not going to say anything against the Microsoft doc; he's pretty much absolutely right and it's a great introduction to why older formats are how they are in general to boot.
The Hungarian thing – no, I still don't see it. Hungarian should not be used in any language which has a reasonable typing system; it's essentially adding unverifiable documentation to variable names in a way that is unnecessary, in a language which can verify type assertions perfectly well. The examples in the article are just ones where good variable naming would have been more than sufficient. It's not good enough.
Oh god I've started another hungarian argument.
Re: (Score:3, Interesting)
That's "Systems Hungarian" in the original article, and you are correct.
"Apps Hungarian", which adds semantic meaning (dx = width, rwAcross = across coord relative to window, usFoo = unsafe foo, etc.) to the variable, not type information, is what is good and what he is advocating. It is exactly "good variable naming". You can see that you shouldn't be assigning rwAcross = bcText, because why would you assign a byte count to a coordinate?
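The prefix-mismatch idea can be sketched in a few lines. This is a hedged illustration: only the rw/bc prefixes come from Joel's examples; the function and values are invented.

```python
# Apps Hungarian sketch: prefixes encode the *semantic role* of a value,
# not its machine type.  rw = row coordinate, bc = byte count (Joel's
# prefixes); the function itself is made up for illustration.

def layoutLine(text: str, rwStart: int) -> int:
    """Return the row coordinate after placing one line at rwStart."""
    bcText = len(text.encode("utf-8"))  # bc...: a byte count
    rwNext = rwStart + 1                # rw...: a row coordinate
    # rwNext = bcText   <- would read as wrong at a glance: a bc on the rw side
    return rwNext

print(layoutLine("hello", 3))  # 4
```

Both sides of an assignment carry the same prefix when the code is right, so a mismatch is visible without chasing any declarations.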
L&O: sFoo (Score:5, Insightful)
What is the justification for putting that semantic meaning into a variable name, instead of incorporating it into class definitions?
For example, if a string can be "safe" or "unsafe", why not have "SafeString" and "UnsafeString" classes that extend String, and use instances of those, instead of having instances of the base String class named 'sFoo' and 'usFoo'?
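In a language where the string type can be subclassed, that looks roughly like this. A sketch only: the class names and the HTML-escaping rule are illustrative assumptions, not from the article.

```python
# Sketch: push the safe/unsafe distinction into types instead of names.
# UnsafeString/SafeString and the escaping rule are invented for illustration.
import html

class UnsafeString(str):
    """Raw text straight from the user; must be escaped before output."""

class SafeString(str):
    """Text already escaped for HTML output."""

def escape(s: UnsafeString) -> SafeString:
    return SafeString(html.escape(s))

def write_html(s: SafeString) -> str:
    # The check the 's'/'us' prefixes did by eye, done by the runtime:
    if not isinstance(s, SafeString):
        raise TypeError("refusing to emit unescaped text")
    return s

raw = UnsafeString("<b>hi</b>")
print(write_html(escape(raw)))  # &lt;b&gt;hi&lt;/b&gt;
```

With a static type checker the mistake is caught before the program even runs, which is the grandparent's point about verifiable versus unverifiable documentation.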
Hungarian Notation is a Visual Grammar (Score:3, Insightful)
Re: (Score:3, Insightful)
For strings it's a little more straightforward, but it gets messy quickly with numeric values. You have to overload every operator you might possibly use, including every variant where it might make sense to operate on another type. The amount of support code needed builds up fast.
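A sketch of that support-code buildup; the unit classes here are made up for illustration:

```python
# Each numeric wrapper needs its own arithmetic rules, for every operator
# and every cross-type combination.  Pixels/Points are invented unit types.

class Pixels:
    def __init__(self, v):
        self.v = v
    def __add__(self, other):
        if isinstance(other, Pixels):
            return Pixels(self.v + other.v)
        return NotImplemented  # refuse Pixels + Points, Pixels + int, ...
    # ...plus __radd__, __sub__, __mul__, __eq__, __lt__, and so on,
    # repeated again in every other wrapper class: this is the buildup.

class Points:
    def __init__(self, v):
        self.v = v

print((Pixels(3) + Pixels(4)).v)  # 7
try:
    Pixels(3) + Points(4)
except TypeError:
    print("mixed units rejected")
```

Two unit classes and one operator already take a dozen lines; a real codebase with many units and the full operator set multiplies that out.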
An
Re: (Score:2)
I'm not going to say anything against the Microsoft doc; he's pretty much absolutely right and it's a great introduction to why older formats are how they are in general to boot.
The Hungarian thing – no, I still don't see it. Hungarian should not be used in any language which has a reasonable typing system; it's essentially adding unverifiable documentation to variable names in a way that is unnecessary, in a language which can verify type assertions perfectly well. The examples in the article are just ones where good variable naming would have been more than sufficient. It's not good enough.
Oh god I've started another hungarian argument.
Hungarian notation has nothing to do with typing systems.
Hell, I'm barely a novice programmer, but even I can see that.
Hungarian notation is a good variable naming practice — as long as you use it to mirror internal program semantics, not create redundant typing information.
So far, I have tried to implement something similar to Hungarian notation in most of my programs; this article taught me a thing or two more, though some aspects touch on things way beyond my level.
Anyway, his article on Hung
Re:Joel - Hungarian Notation (Score:2)
A "typing" system doesn't help you read and understand the code. It doesn't give you any clues to the types of data being acted upon in a section of code. While I never bought into the whole hungarian notation thing (at the time it was an "ism" that people went nuts about), it did address a specific problem with code readability. The concepts addressed by hungarian notation are
Re: (Score:2)
Ok, I was going to respond to this but I will not get dragged into another one of these discussions. It's worse than tabs vs. spaces, I tells ya.
Since you're talking about C/C++ code though, I'm going to assert that that doesn't fall into the class of language I was talking about anyway. You're playing with essentially-untyped data there a lot more.
Re: (Score:3, Informative)
I have to disagree, tabs and spaces are easily handled with an "indent" program.
On VERY LARGE projects where there are hundreds of include files and hundreds of source files, it is not convenient or even possible in all cases to find the definition of an object that may be in use.
Context and type information in the name makes it easier to quickly read a section of
Re: (Score:2)
It's funny, I've argued in the past that Java's very verbose typing has advantages in exactly the way you list in your post. In the case of Java, in fact, you wouldn't need the type warts since the types would be readily available.
Re: (Score:3, Insightful)
The second type doesn't add type information. It adds meaning information. For example, an index to a table row might be rowIndex. An index to a column might
Re: (Score:2)
Your posts suggest otherwise. In fact, I think you read the first part of Joel's article and never got to the portion where he turned it all around. Joel himself argues against common Hungarian notation (i.e., Systems Hungarian). That doesn't seem to be percolating through your noggin'.
Re: (Score:2, Insightful)
No, I think that the uses he proposes for Apps Hungarian are better handled by a typing system, in languages which support such things. Obviously all sorts of hilarious kludges can be used in languages where you're dealing with insufficiently-typed data.
Re:Joel (Score:5, Informative)
http://en.wikipedia.org/wiki/Hungarian_notation [wikipedia.org]
Re:Joel (Score:4, Informative)
Uhh.. There was never a "Mr. Hungarian"
It was invented by Charles Simonyi, and the name was both a play on "Polish notation" and a nod to Simonyi's native Hungary, where the family name precedes the given name.
Re: (Score:3, Funny)
It was actually all started by cHarles Hungar, and thus the "Hungarian" label.
Re:Joel (Score:4, Informative)
First, understand that nearly every bit of "Hungarian Notation" you've ever seen is misused. The original set of prefixes suggested by Simonyi were designed to convey the PURPOSE of the variable, not simply the data type. It was adding semantic data to the variable name.
This is still valuable today.
However, in the days of lesser IDEs, the more common use of Hungarian Notation was also helpful, as it was a lot more work to trace a variable back to its declaration to identify its type.
Re: (Score:2)
The original set of prefixes suggested by Simonyi were designed to convey the PURPOSE of the variable, not simply the data type. It was adding semantic data to the variable name.
Outside of HN, the only way to include this semantic information in all the super excellent languages you mentioned is by adding a comment after the variable declaration.
That's do-able, though. Not
Re: (Score:2)
Still, how is "rwPosition" any better than "rowPosition"? (from the Wikipedia article) Sure, "i" is kinda ambiguous, but use a modern for-loop instead and get rid of it altogether. Again citing Wikipedia, some of Simonyi's suggested prefixes added semantic information, but not all.
I'll say it again: Hungarian is pointless in a modern language.
/Mike
Re: (Score:2)
They're not any different. The only reason to use the former format was to save keystrokes in the days before auto-completion. If Simonyi* invented the concept today, he would have used rowPosition rather than rwPosition**.
The thing is, he was working back in the days when programmers regularly used single character variables to save keystrokes as well as to keep their code within 80 columns. (i.e. DOS console resolution.) So he tried to push a semantic stan
Re: (Score:2)
/mike
PS: 80 col display terminals were around long before DOS - VT100s ran in either 80 or 132 col mode in the '70s.
PPS: If your code today needs more than 80 cols (or arguably 132 cols), you have bigger problems.
Re: (Score:2)
So you're telling me that you never use variable names like "xdiff", "rowStart", "tabName", "currentRow", or some other combination of semantic meaning combined with a noun?
Any programmer worth his salt uses names that are descriptive. And many of those names happen to align with Simonyi's original idea. In fact, he didn't originate the concept so much as bring it over from his work with Smalltalk.
Re: (Score:2)
Row rowCurrent = getCurrentRow();
irks me; you can see how redundant it is. _Of course_ choosing variable names is important, but Hungarian is a very specific notation; the examples above (including yours) are not Hungarian.
You need to find a balance between too terse and too verbose. Using Hungarian today can fall down in both ways. Done correctly it is too terse ("rwCur" anyone?) and if not done correctly, is redundant (see examples above) and h
Re: (Score:2)
Re: (Score:2)
Take this article, for instance - sure, he's right that trying to implement support for these specs is futile. It's the same reason why Office's OOXML "standard" is a joke. But he didn't really need to spend 6 pages saying so. And sure, the workarounds are fine if you're a Windows shop, but workarounds #2 and #3 are not simple "half day of work" if you have no experience with Microsoft technologies - it's
patent promise doesn't sound very good (Score:5, Insightful)
Re: (Score:2, Insightful)
Re:patent promise doesn't sound very good (Score:5, Informative)
Among other issues, the BorderLayout manager did not behave properly in MS's implementation. It was buggy in incompatible ways, but you're right, that in and of itself wasn't the big problem. The big problem was their insistence on both not fixing the bugs and not going along with major initiatives (such as JFC/Swing).
If by "2 or 3 years" you mean about 5 years, then I'd agree. Java development tools didn't really reach maturity until things like Eclipse came onto the scene about 5 years ago.
Eclipse vs. IDEA (Score:2)
While I agree that Eclipse did a lot to improve Java development, I have to say that, having used both it and IntelliJ IDEA, IDEA just seems better. Yes, this could be just another instance of vi vs. emacs, but, to me, IDEA seems better thought out and works more smoothly. Yes, I know IDEA costs money, but I get things done faster using IDEA, and that's worth a lot.
Re: (Score:2, Insightful)
These extensions were, of course, Windows-only - which missed the entire point of a cross-platform language.
The old 'embrace, extend, extinguish' strategy has been in the Microsoft playbook for quite a while.
Re:patent promise doesn't sound very good (Score:5, Insightful)
If their `implementation' differed from the specs, then it was not a correct implementation. If it was supposed to be a Java implementation, then by definition it was buggy. If it wasn't supposed to be one, then it had no business being called Java. That is why Sun sued them.
Re:patent promise doesn't sound very good (Score:5, Informative)
Ah, marketing. Where would we be without it?
Microsoft developed J/Direct specifically to make Java non-portable to other OSes. The MS JVM wasn't better than Sun's; it was just tied heavily into the OS, and code developed for it broke if run on any other VM.
J++ was another lock-in tool to ensure any "Java" developed in Microsoft's IDE would only run on Microsoft OSes. JBuilder was always a better package anyway.
Re:patent promise doesn't sound very good (Score:5, Interesting)
Ignore the vague language and develop software as you always have.
Re:patent promise doesn't sound very good (Score:5, Informative)
RTFA. That's in the FAQ. Yes they are.
In other words - if you do something related to a spec that isn't covered, it isn't covered. How could it be any different?!
I'm not saying that there aren't any flaws, but this kind of ill informed, badly thought out comment (a.k.a. "+5 Insightful", of course) has little value.
Re: (Score:2)
Re: (Score:3, Interesting)
Re:patent promise doesn't sound very good (Score:5, Interesting)
That is my primary concern with the entire promise. None of this bullshit not-tested-in-court crap that came up the other day: it doesn't cover implementations with slight variations in functionality.
This, it seems, is intentional. MS don't want to allow others to embrace & extend their standards.
Obfuscation (Score:2, Insightful)
But let's say you do. Now you have to find an API to do it for you. As an everyday guy, I can write my own HTTP parser, IP connection manager and so forth, without requiring a special API to do it. As a smarter guy, I'd look for the libraries that can do some of the heavy lifting for me. It's flexibility. The document structure is going to affect how I write code to work with it.
W/
Re: (Score:2, Insightful)
I think Joel makes a lot of good points and gives great insight into thinking at Microsoft.
Re: (Score:2)
Word and Excel have supported CSV and RTF back into the DOS days, back into the 5.25" floppy days.
And are you honestly saying that an 8088 with 640K of RAM could handle XML? Assuming that the concept of interchangeable markup languages even EXISTED back then?
Jesus, it's like complaining that a 30 year old television doesn't support HDMI, therefore it's poorly designed.
One possible reason for releasing the specs now (Score:5, Insightful)
If you read Joel's blog you'll see the formats are very old, consisting primarily of C structs dumped into OLE streams and written directly to disk as the .xls, .doc, and other files we see.
There's almost no parsing/validation at load time.
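A hedged sketch of what reading such a record stream looks like. The [id][length][payload] layout follows the BIFF style of the Excel binary format; the specific record ids and payload below are invented for illustration:

```python
# Reading a BIFF-style record stream: a little-endian u16 record id, a u16
# payload length, then the payload -- C structs written straight to disk.
import struct

def read_records(data: bytes):
    pos, records = 0, []
    while pos + 4 <= len(data):
        rec_id, length = struct.unpack_from("<HH", data, pos)
        payload = data[pos + 4 : pos + 4 + length]
        # Note what is *not* here: no bounds check on `length`, no
        # validation of the payload -- the "no parsing at load time"
        # design described above.
        records.append((rec_id, payload))
        pos += 4 + length
    return records

# Two made-up records: id 0x0809 with 2 payload bytes, id 0x000A with none.
stream = struct.pack("<HH", 0x0809, 2) + b"\x01\x02" + struct.pack("<HH", 0x000A, 0)
print(read_records(stream))  # [(2057, b'\x01\x02'), (10, b'')]
```

A loader that trusts the length field and memcpys payloads into fixed-size structs is exactly the kind of code that a well-documented spec makes easier to probe.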
Having this in well-organized documentation may reveal quite a lot of security issues with the old binary formats, which could lead to a wave of exploits. Exploits that won't work on Microsoft's new XML Office formats.
So while I'm not a conspiracy nut, I do believe one of Microsoft's goals here is to assist the process of those binary formats becoming obsolete, to drive Office 2007/2008 adoption.
Re:One possible reason for releasing the specs now (Score:5, Informative)
Re: (Score:2, Insightful)
Re:One possible reason for releasing the specs now (Score:5, Insightful)
Let me break your statement in pieces:
- that would increase the vulnerability of old Office
- the majority of corporate America is stuck on old Office
- you don't sell new cars by convincing people that the old ones are rubbish
You know, have you seen those white papers by Microsoft comparing XP and Vista and trying to put XP's reliability and security in a bad light?
Or have you seen those ads where Microsoft portrayed people using old versions of Office as... dinosaur-mask-wearing suits?
If the majority of corporate America uses the old Office, then the only way for Microsoft to turn a profit would be to somehow convince them this is not good for them anymore, and upgrade. You're just going against yourself there.
Re: (Score:2)
Microsoft marketing (Score:4, Insightful)
You're kidding right? That's been exactly Microsoft's marketing strategy for the last ten years. Remember the Win9X BSOD ads for Windows XP? Microsoft is in the difficult position where their only real competition is their own previous products.
Re:One possible reason for releasing the specs now (Score:4, Informative)
Re: (Score:2, Interesting)
I would say it's because they get good PR for pretending to be transparent/friendly, whilst not actually giving away any new information.
Look at page 129 of the PDF specifying the .doc format [microsoft.com]. (The page is actually labelled 128 in the corner, but it's page 129 of the PDF.) You will see there's a bit field. One of the many flags that can be set in this bit field: "fUseAutospaceForFullWidthAlpha".
The description?:
Office Doc Generation on the Server (Score:5, Informative)
Promise not a license (Score:5, Insightful)
Re:Promise not a license (Score:5, Interesting)
Here's my suggestion: someone should use these specs to create a BSD-licensed implementation as a library. Then, of course, (L)GPL programs would be free to use the implementation. Nobody gets sued, everybody is happy.
Re: (Score:2)
Re: (Score:3, Informative)
>not to sue but this is very very far from a license.
Some (hypothetical?) questions:
What would happen if those patents were in some way transferred to someone else?
Despite the promise, are you still actually infringing the patent, just with an assurance from the current patent holder that he won't do anything?
If so, what would happen if it becomes criminal to break a patent (it was quite close to be part of an EU directive not so long ago)? Toge
Re: (Score:3, Insightful)
Why not ODF or OOo? (Score:2, Interesting)
I know OOo is not a perfect Word/Excel converter, but it has served me marvelously since the StarOffice days. I wish that there was a simple command-line driven tool that could convert
Retaliation? (Score:3, Interesting)
I thought it was pretty well known (Score:2, Insightful)
Re: (Score:3, Informative)
If you read the article you would notice that the binary format of WinWord 97 (which is in fact compatible with its predecessors) was a good solution in 1992, when Word for Windows 2.0 was created. Machines had less memory and processing power than your phone, and still had to be able to open a document fast.
my conclusion is that the open office devs are crazy that they ever supported the word
Re: (Score:2)
The fact is, Word in its early versions was NOT significantly faster than its competitors and neither was Excel. Word Perfect and Lotus 1-2-3 did everything people needed and they did it within the resource constraints of the day.
The article is misleading in attempting to address the "limited resources" of the day, because most of us find it amazingly difficult to imagine operating in a 1MB operating environment. The article also fails to identify the ac
Re:I thought it was pretty well known (Score:5, Interesting)
"compound documents." oh no, run away! (Score:4, Interesting)
I don't see why just because something is organized filesystem-like (not such an awful idea) means it has to be hard to understand. Filesystems, while they can certainly get complicated, are fairly simple in concept. "My file is here. It is *this* long. Another part of it is over here..."
Wait, I thought you were trying to convince us that this doesn't reflect bad programming...
Ah, I see, you're trying to imply that it's the very design of the Word-style of word processor that is inherently flawed. Finally we're in agreement.
Anyways, it's no surprise that it's all the OLE, spreadsheet-object-inside-a-document stuff that would make it difficult to design a Word killer. (How often do people actually use that anyway?) It would basically mean reimplementing OLE, and a good chunk of Windows itself (libraries for all the references to parts of the operating system, metafiles, etc.), for your application. However, it certainly can be done. I'm not sure it's worth it, and it can't be done overnight, but it's possible. Still, you'll have a hard time convincing me that Microsoft's mid-90's idea of tying everything in an application to inextricable parts of the OS doesn't reflect bad programming. Like, what if we need to *change* the operating system? At the very least, it reflects bad foresight, seeing as they tied themselves to continually porting forward all sorts of crud from previous versions of their OS just to support these application monstrosities. This is a direct consequence of not designing the file format properly in the first place, and just using a binary structure dump.
It reminds me of a recovery effort I tried last year, trying to recover some interesting data from some files generated on a NeXT cube years ago. I realized the documents were just dumps of the Objective C objects themselves. In some ways this made the file parseable, which is good, but in other ways it meant that, even though I had the source code of the application, many of the objects that were dumped into the file were related to the operating system itself instead of the application code, which I did _not_ have the source code to, making the effort far more difficult. (I didn't quite succeed in the end, or at least I ran out of time and had to take another approach on that project.)
In their (MS's) defense, I used to do that kind of thing back then too, (dumping memory structures straight to files instead of using extensible, documented formats), but then again I was 15 years old (in 1995) and still learning C.
Re: "compound documents." oh no, run away! (Score:4, Insightful)
Re: (Score:2)
I think the design goals were flawed. That's my point. Their design goals should have included, how can we ensure that our customer's dat
Re: "compound documents." oh no, run away! (Score:4, Insightful)
And I think your ability to assess another's work is flawed, courtesy of an oversized ego. That was my point.
You have yet to provide an alternative solution to the problem. Given that one constraint is memory, your inability to be concise suggests you're not capable of coming up with one either. Certainly your "squeeze out a few extra microseconds" comment suggests you have absolutely no clue what you are talking about. Yet you persist in calling it bad design. You are strangely smug about what was quite possibly an implicit assumption forced by tough constraints, with no actual interoperability requirements, at a time when they were rarely offered let alone expected. I would stop using "IMHO" - clearly there is nothing humble about your opinion.
Why the bit about metadata, out of interest? It's as if you think the more irrelevant things you can fit into the post, the more we're supposed to be impressed.
Hmm (Score:2)
Except for the "1995" part, wasn't that pretty much how Microsoft got started?
They haven't advanced from that point by much....
Re: (Score:2, Insightful)
I don't see why just because something is organized filesystem-like (not such an awful idea) means it has to be hard to understand. Filesystems, while they can certainly get complicated, are fairly simple in concept. "My file is here. It is *this* long. Another part of it is over here..."
He didn't say file systems were complex, he said OLE compound documents were complex. Look it up on MSDN. It's a tad painful to work with.
"They were not designed with interoperability in mind."
Wait, I thought you were trying to convince us that this doesn't reflect bad programming...
Wholly out of context, Batman! They made a design decision to ignore interoperability and optimized towards small memory space. What part of that is hard to understand? You think everything should be designed up front for interoperability, regardless of context? In the mid to late 80s, there just wasn't a huge desire for this feature, as Joel states.
but then again I was 15 years old (in 1995) and still learning C.
Ah, now your post m
Re: (Score:2)
I didn't say this. I said I don't see why the fact that OLE documents being like file systems (according to TFA), means that they must necessarily be complex. i.e., I'm saying file systems aren't necessarily complex concepts, and therefore it's not an excuse for a convoluted file format. Anyways, maybe it's straining his analogy further than he intended, so I'll give y
Re: "compound documents." oh no, run away! (Score:5, Informative)
At my company, our users do that every day. Excel spreadsheets embedded in Word or PowerPoint, Microsoft Office Chart objects embedded in everything. It's what made the Word/Excel/PowerPoint "Office Suite" a killer app for businesses. MS Office integration beat the pants off the once best-of-breed and dominant Lotus 1-2-3 and WordPerfect. When you embed documents in Office, instead of a static image, the embedded doc is editable in the same UI, and can be linked to another document maintained by somebody else and updated automatically. It saves tremendous amounts of staff time.
Re: (Score:2)
IMO the powerful serialisation formats of modern languages are even worse than just dumping out C structs. If an app just dumps out C structs then you can probably figure out the binary format pretty quickly with just the source for the app and a pageful or so of information on
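The "figure it out from the source" point can be sketched like this; the struct declaration and field values are invented, and packed little-endian layout is an assumption for illustration:

```python
# Sketch: a file that is just a dumped C struct is decodable once you've
# read the declaration in the app's source.  Suppose the source contains:
#     struct Rec { int32_t id; int16_t flags; char name[8]; };
# (assume packed and little-endian for illustration).
import struct

REC_FMT = "<ih8s"  # the struct declaration, transcribed as a format string

blob = struct.pack(REC_FMT, 42, 3, b"hello")  # stand-in for the file on disk
rec_id, flags, name = struct.unpack(REC_FMT, blob)
print(rec_id, flags, name.rstrip(b"\x00"))  # 42 3 b'hello'
```

One format string per struct declaration and you're reading the file; a serialized object graph, by contrast, drags in the runtime's own classes, as the NeXT anecdote upthread describes.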
Re: (Score:2)
access (Score:2)
Re: (Score:2)
Oh right, it was so easy they got it right first time and never had to update it since?
Worst. Workaround. Ever. (Score:5, Interesting)
The second "workaround" is the same as the first, only a little more proactive. Instead of saving my documents as binary files and then converting them to another format, I should save them as a non-binary format from the start! Mission accomplished! Oh wait - how do I get the rest of the world to do the same? That could be a problem.
I fail to see the problem with using the specification Microsoft released to write a program that can read and write this binary format. If Microsoft didn't want it to be used, they would not have released it. Even if Microsoft tried to take action against open source software for using the specs that they opened, how could Microsoft prove that the open source software used those specs as opposed to reverse engineering the binary format on their own? I think this is a non-issue.
Re:Worst. Workaround. Ever. (Score:4, Insightful)
That is almost the stupidest thing I've read today (RTFA with respect to development costs to figure out why), except for this:
We can ignore the shockingly poor logic inherent to this statement and just take it at face value: doing something just because M$ wants you to would easily make the Top 10 Stupid Things To Do In IT list. It's particularly bizarre to hear it on Slashdot.
Joel being apologetic (Score:2)
Re: (Score:3, Informative)
Don't Adopt. Convert. (Score:5, Insightful)
But we're not Microsoft, and we don't have the requirements MS had when making these formats. So we should by no means perpetuate them. We should do now what MS never had reason to do: upgrade the code and drop the legacy stuff that makes most of the code such a burden, but doesn't do anything for the vast majority of users today (and tomorrow).
That's OK, because Microsoft has done that, too, already. The MS idea of "legacy to preserve" is based on MS marketing goals, which are not the same as actual user requirements. So that legacy preservation doesn't mean that, say, Office 2008 can read and write Word for Windows for Workgroups for Pen Computing files 100%. MS has dropped plenty of backwards compatibility for its own reasons. New people opening the format for modern (and future) use can do the same, but based on user requirements, not emphasis on product lines if that's not a real requirement.
So what's needed is just converters that use this code to convert to real open formats that can be maintained into the future. Not moving this code itself into apps for the rest of all time. Today we have a transition point before us which lets us finally turn our back on the old, closed formats with all their code complexity. We can write converters that can be used to get rid of those formats that benefited Microsoft more than anyone else. Convert them into XML. Then, after a while, instead of opening any Word or Excel formats, we'll be exchanging just XML, and occasionally reaching for the converter when an old file has to be used. MS will go with that flow, because that's what customers will pay for. Soon enough these old formats will be rare, and the converters will be rare, too.
Just don't perpetuate them, and Microsoft's selfish interests, by just embedding them into apps as "native" formats. Make them import by calling a module that can also just batch convert old files. We don't need this creepy old man following us around anymore.
Re: (Score:2)
Just don't perpetuate them, and Microsoft's selfish interests, by just embedding them into apps as "native" formats. Make them import by calling a module that can also just batch convert old files. We don't need this creepy old man following us around anymore.
Be very careful down that road. Particularly, don't confuse "I can import it and save it in MY format" with "this document is now accessible". The application doing the import might die off just the same in 10 or 15 years; and XML is not a wonderpill that makes a document format interchangeable. If you want to do the user a favour, don't just support full import of Office documents, but full export into a standardized format as well (and not just lip-service export).
Interoperability goes both ways; this is
Re: (Score:2)
doing the right thing (Score:5, Insightful)
When Excel started importing 1-2-3 documents, the right way to do that would have been to create an importer to your own native format, not to munge a new, slightly different format into your existing structures. Yes, you'd have had to convert some dates between 1900 and 1904 formats (and maybe detect cases where the old 1-2-3 bug could have affected the result), but at least you wouldn't be trying to maintain two formats for the rest of time.
If this is an example of programmers throughout history always doing exactly the right thing, I'd hate to see an example of code where the original author regretted some mistakes that had been made.
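For what it's worth, the 1900/1904 conversion mentioned above is small; a hedged sketch, using the standard 1462-day offset between the two epochs (which already bakes in the phantom 1900 leap day):

```python
# Converting Excel/1-2-3 serial dates between the 1900 and 1904 epoch
# systems.  The 1900 system inherits 1-2-3's bug of treating 1900 as a
# leap year, so the two epochs differ by 1462 days rather than 1461.
from datetime import date, timedelta

DELTA_1900_TO_1904 = 1462

def serial_1900_to_1904(serial: int) -> int:
    return serial - DELTA_1900_TO_1904

def date_from_1904(serial: int) -> date:
    return date(1904, 1, 1) + timedelta(days=serial)

# Serial 36526 in the 1900 system is 2000-01-01.
print(date_from_1904(serial_1900_to_1904(36526)))  # 2000-01-01
```

The hard part isn't this arithmetic; it's that files written before the conversion point still need the old interpretation, which is exactly the two-formats-forever trap the parent describes.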
Re:doing the right thing (Score:4, Interesting)
Why? Because, as the article states, Excel 4.0 was the first version that would let you go back. You could just try out Excel and, if it didn't work, no big deal, just go back to Lotus 1-2-3. It seems completely counter-intuitive, and apparently it wasn't easy to convince Microsoft management to do it, but it worked; now everyone uses Excel and Lotus 1-2-3 is ancient history.
The programmers did both the right thing and the thing which would be successful. With all due respect to the OpenOffice folks, they're not in the business of selling software. If people don't move to OpenOffice in massive numbers, it doesn't spell doom for the company, because there is no company. Doing what you suggest might be the right thing from a programmer's perspective (and I agree), but it's not compatible with a company that is trying to make a product to take over the market. This is why Microsoft is so successful - they're staffed by a large number of people (like Joel) who get this.
Re: (Score:2)
When Excel started importing 1-2-3 documents, the right way to do that would be to create an importer to your own native format. Not to munge a new slightly different format into your existing structures.
Remember, these were the XT/AT/x386 days. It was easier to munge than waste CPU cycles and memory doing conversions.
Enjoy,
I will gladly pay anyone (Score:5, Funny)
Seems that these aren't the full specs (Score:3, Interesting)
No insanely bad programmers ? (Score:2, Insightful)
Re: (Score:2)
Were they using 32-bit machines? A 32-bit machine can only address 4GB of memory total. Allowing for the OS and other apps in memory, you can't use that last address bit anyway (i.e., the OSes and machines of the day maxed out at 4GB of RAM; you could make all of it addressable, but it wasn't needed).
2GB is the limit on a lot of OSes. Right now, I can think of several filesystems that limit file sizes to 2GB (FAT16, AIX's JFS). The first of those listed filesy
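The 2GB vs. 4GB split in the parent comments falls straight out of 32-bit integer ranges; here is a minimal sketch of the arithmetic, assuming offsets stored as signed vs. unsigned 32-bit values:

```python
# 2 GB vs 4 GB limits fall directly out of 32-bit integer ranges:
# many older APIs and filesystems stored file offsets as *signed*
# 32-bit values, halving the 4 GB that 32 bits can address.
SIGNED_32_MAX = 2**31 - 1     # 2,147,483,647 bytes, ~2 GiB
UNSIGNED_32_MAX = 2**32 - 1   # 4,294,967,295 bytes, ~4 GiB

def max_file_gib(signed: bool) -> float:
    """Largest representable file offset, in GiB."""
    limit = SIGNED_32_MAX if signed else UNSIGNED_32_MAX
    return limit / 2**30

print(round(max_file_gib(signed=True), 2))   # 2.0
print(round(max_file_gib(signed=False), 2))  # 4.0
```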
The file format is not really important (Score:5, Interesting)
No, the difficulty with writing a filter for these file formats is that you have no freaking clue what the *formatter* does with the data once it gets it. I'm pretty sure even Microsoft doesn't have an exact picture of that. Hell, I barely ever understood what the WP formatter was doing half the time (and I had source code). File formats are only a small part of the battle. You have all this text that's tagged up, but no idea what the application is *actually* doing with it. There are so many caveats and strange conditions that you just can't possibly write something to read the file and get it right every time.
In all honesty I have at least a little bit of sympathy for MS WRT OOXML. Their formatter (well, every formatter for every word processor I've ever seen) is so weird and flaky that they probably *can't* simply convert over to ODF and have the files work in a backwards-compatible way. And let's face it, they've done the non-compatible thing before and got flamed to hell for it. I honestly believe that (at some point) OOXML was intended to be an honest accounting of what they wanted to have happen when you read in the file. That's why it's so crazy. You'd have to basically rewrite the Word formatter to read the file in properly. If I had to guess, I'd say that snowballs in hell have a better chance...
I *never* had specs for the Word file format (actually, I did, but I didn't look at them because they contained a clause saying that if I looked at them I had to agree not to write a file conversion tool). I had some notes that my predecessor wrote down and a bit of a guided tour of how it worked overall. The rest was just trial and error. Believe it or not, occasionally MS would send us bug reports if we broke our export filter (it was important to them for WP to export Word, because most of the legal world uses WP). But it really wasn't difficult to figure out the format. Trying to understand how to get the WP formatter (also flaky and weird) to do the same things that the Word formatter was doing... mostly impossible.
And that's the thing. You really need a language that describes how to take semantic tags and translate them into a visual representation. And you need to be able to interact with that visual representation and refer it back to the semantic tags. A file format isn't enough. I need the glue in between, and in most (all?) word processors that's the formatter. And formatters are generally written in a completely ad hoc way. Write a standard for the *formatter* (or better yet, a formatting language) and I can translate your document for you.
The trick is to do it in both directions too. Things like Postscript and PDF are great. They are *easy* to write formatters for. But it's impossible (in the general case) to take the document and put it back into the word processor (i.e. the semantic tags that generated the page layout need to be preserved in the layout description). That also has to be described.
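The "glue" this comment asks for can be shown with a toy formatter (all names here are hypothetical) that keeps a back-reference from each laid-out run to the semantic tag that produced it. Round-tripping works only because that back-reference is preserved, which is precisely what PostScript/PDF-style output discards:

```python
# Toy illustration of the round-trip problem: semantic tags -> layout
# is easy; layout -> semantic tags is possible only if the layout
# records which tag produced each run. All names are hypothetical.

def format_doc(tagged_runs):
    """'Formatter': turn (tag, text) pairs into positioned runs,
    keeping a back-reference to the originating semantic tag."""
    styles = {"h1": ("Helvetica", 18), "p": ("Times", 11)}
    y, layout = 0, []
    for tag, text in tagged_runs:
        font, size = styles[tag]
        layout.append({"y": y, "font": font, "size": size,
                       "text": text, "source_tag": tag})  # the "glue"
        y += size + 4
    return layout

def unformat_doc(layout):
    """Round-trip works *only because* source_tag was preserved;
    PostScript/PDF-style output drops it, which is the point."""
    return [(run["source_tag"], run["text"]) for run in layout]

doc = [("h1", "Chunky files"), ("p", "A body paragraph.")]
assert unformat_doc(format_doc(doc)) == doc
```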
Ah... I'm rambling. But maybe someone will see this and finally write something that will work properly. At Corel, my friend was put on the project to do just that five times... it got cancelled each time.
Re: (Score:3, Interesting)
In Windows 2000, you can open the printers control panel, choose "printing preferences" on your HP, poke the "Advanced..."
Some "solutions" from TFA (Score:2, Insightful)
In many situations, you are better off reusing the code inside Office rather than trying to reimplement it. Here are a few examples.
1. You have a web-based application that needs to output existing Word files in PDF format. Here's how I would implement that: a few lines of Word VBA code load a file and save it as a PDF using the built-in PDF exporter in Word 2007. You can call this code directly, even from ASP or ASP.NET code running under IIS. It'll work. The first time you launch Word it'll take a few seconds. The second time, Word will be kept in memory by the COM subsystem for a few minutes in case you need it again. It's fast enough for a reasonable web-based application.
2. Same as above, but your web hosting environment is Linux. Buy one Windows 2003 server, install a fully licensed copy of Word on it, and build a little web service that does the work. Half a day of work with C# and ASP.NET.
So if you are on a Linux system, you are screwed. I think this article was written by some M$ fanboy. Nothing wrong here. But saying that Linux users should just dump their software and go for Microsoft stuff, just because
It's very helpful of Microsoft to release the file formats for Microsoft Office, but it's not really going to make it any easier to import or save to the Office file formats.
I think it's wrong wrong wrong.
Re: (Score:2)
No, he's just saying that it might be cheaper to buy a Goodyear than to reinvent the wheel.
Chunky File Format (Score:5, Interesting)
The Office format, based on the chunky file format, does not really have a format per se. It is more similar to the old TIFF format: you can put almost anything in it, and the "things" that you put in it pretty much define how they are stored. So, for each object type that is saved in the file, there is a call-out that says what it is, and a DLL is used to actually read it.
It is possible for multiple groups within Microsoft to store data elements in the format without knowledge of how it is stored ever crossing groups or being "documented" outside the comments and structures in the source code that reads it.
This is not an "interchange" format like ODF; it is a binary application working format that happens to get saved, and enough people use it that it has become a standard. (With all blame resting squarely on M$'s shoulders.)
It is a great file format for a lot of things and does the job intended. Unfortunately it isn't intended to be fully documented. It is like a filesystem format such as EXT2 or JFS: sure, you can define precisely how data is stored in the filesystem, but it is virtually impossible to document all the data types that can be stored in it.
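A minimal sketch (hypothetical framing, not the actual compound-document layout) of why such a container can be precisely specified while its payloads cannot: the chunk framing is fixed, but decoding is delegated to per-type handlers that each group registers privately, so unknown types are skipped rather than documented:

```python
# Toy "chunky"/tagged container in the TIFF spirit described above:
# each record carries a type id and an opaque payload; readers look
# up a handler for types they know and skip the rest.
import struct

HANDLERS = {}  # type id -> decoder, registered by whoever owns the type

def register(type_id):
    def deco(fn):
        HANDLERS[type_id] = fn
        return fn
    return deco

@register(1)
def decode_text(payload: bytes) -> str:
    return payload.decode("utf-8")

def write_chunk(type_id: int, payload: bytes) -> bytes:
    # 2-byte type id + 4-byte length, little-endian, then the payload.
    return struct.pack("<HI", type_id, len(payload)) + payload

def read_chunks(blob: bytes):
    off, out = 0, []
    while off < len(blob):
        type_id, size = struct.unpack_from("<HI", blob, off)
        off += 6
        payload = blob[off:off + size]
        off += size
        handler = HANDLERS.get(type_id)
        # Unknown chunk types are skipped, not errors: another group's
        # private data can live in the same file untouched.
        out.append(handler(payload) if handler else ("unknown", type_id))
    return out

blob = write_chunk(1, b"hello") + write_chunk(99, b"\x00\x01")
print(read_chunks(blob))  # ['hello', ('unknown', 99)]
```

The container spec only pins down the framing; what type 99 means lives entirely in whichever DLL registered it, which is exactly the documentation problem the comment describes.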
Outlook (Score:2, Interesting)
old code costs nothing.. (Score:3, Informative)
You'd better believe it costs Microsoft quite a bit to keep it around. At the lowest level, having a codebase that big means the tools and practices needed to manage it have to be equal to the task. Here's a hint: MS does not use SourceSafe for the Office codebase. (They use the Team tools in Visual Studio, so they do eat their own dogfood, but not the lite food.)
Far more insidious is the technical debt incurred by carrying around that backwards compatibility with version-1-which-supported-1-2-3, bugs and all. Interdependencies mean a bug either can't be fixed without introducing regressions, or can only be fixed by dint of a complex scheme involving things like the 1900 vs. 1904 epoch split that Joel discusses.
Oh yes, it costs a small fortune to carry around that baggage, and only a company as big as Microsoft with Microsoft's revenues can afford it. The price might seem like 'nothing' in the billions of dollars that flow in and out of Microsoft, but ignoring the elephant in the room doesn't make the elephant go away.
Re:first post? (Score:4, Insightful)
Re: (Score:2)
* I can't actually remember how long ago it was
Re:first post? (Score:4, Informative)
As far as I remember, they only insisted on protocols (it was on the basis of a complaint from server OS vendors that MS was tying their market-leading desktop OSs to their server OSs and gaining an unfair advantage).
Re: (Score:2)
Is there some secret conspiracy I'm not aware of to butcher the use of this word? Why does every attempt to use it end in miserable failure?