Writing Code for Spacecraft
CowboyRobot writes "In an article subtitled 'And you think *your* operating system needs to be reliable,' Queue has an interview with the developer of the OS that runs on the Mars Rovers. Mike Deliman, chief engineer of operating systems at Wind River Systems, has quotes like, 'Writing the code for spacecraft is no harder than for any other realtime life- or mission-critical application. The thing that is hard is debugging a problem from another planet.' and, 'The operating system and kernel fit in less than 2 megabytes; the rest of the code, plus data space, eventually exceeded 30 megabytes.'"
hmm... (Score:2, Interesting)
Re:hmm... (Score:2, Informative)
As to releasing of their source code? From Wind Ri
Re:hmm... (Score:2, Funny)
> and the cost of development and deployment is almost 3x that of an embedded linux
When a spacecraft millions of kilometers from Earth packs it in I'm sure a project leader at NASA would be happy they saved 2/3 of the price on a relatively small ticket item.
Re:hmm... (Score:4, Insightful)
Probably not as big a cost as losing a Mars rover because your OS wasn't reliable enough.
Re:hmm... (Score:2)
Re:hmm... (Score:2)
Re:hmm... (Score:2)
Re:hmm... (Score:4, Interesting)
You get what you pay for... I've used VxWorks for a few years now, and while it does have its share of problems, and while they are sometimes difficult to deal with, it is a great platform for development. You get much better control of the system as opposed to Linux. The main problem with using Linux in an embedded environment is the user to kernel relationship; VxWorks solves it neatly by getting rid of it (everything is in kernel space). This works out very nicely for MIPS processors, which I deal with most of the time. Threading (or tasks, as VxWorks calls them) is much better than on Linux: you can at least somewhat guarantee when your tasks run, unlike with the default Linux scheduler.
I am very interested in trying QNX out, to see how it compares to vxWorks, one of these days.
-- Joe
Re:hmm... (Score:2)
Secondly, yes Tornado is rather dated, but I don't believe the debugging environment is any worse than say ddd and gdb server. In fact I generally find it quicker to develop in To
Re:hmm... (Score:2)
Re:Chinese Threat: Keep the Source Code Secret! (Score:2)
Re:hmm... (Score:5, Funny)
Mars Rover HaX0r3d and OS replaced with Linux.
Shortly thereafter, Micro$oft claims that they can enforce patent infringement on Mars...
Re:hmm... (Score:3, Funny)
More likely: "Mars Rover Draws Goatse In Sand"
hard to imagine.. (Score:2, Interesting)
CBB
Re:hard to imagine.. (Score:4, Informative)
Re:hard to imagine.. (Score:3, Interesting)
Re:hard to imagine.. (Score:2)
Re:hard to imagine.. (Score:5, Interesting)
Subject: What really happened on Mars Rover Pathfinder
The Mars Pathfinder mission was widely proclaimed as "flawless" in the early
days after its July 4th, 1997 landing on the Martian surface. Successes
included its unconventional "landing" -- bouncing onto the Martian surface
surrounded by airbags, deploying the Sojourner rover, and gathering and
transmitting voluminous data back to Earth, including the panoramic pictures
that were such a hit on the Web. But a few days into the mission, not long
after Pathfinder started gathering meteorological data, the spacecraft began
experiencing total system resets, each resulting in losses of data. The
press reported these failures in terms such as "software glitches" and "the
computer was trying to do too many things at once".
This week at the IEEE Real-Time Systems Symposium I heard a fascinating
keynote address by David Wilner, Chief Technical Officer of Wind River
Systems. Wind River makes VxWorks, the real-time embedded systems kernel
that was used in the Mars Pathfinder mission. In his talk, he explained in
detail the actual software problems that caused the total system resets of
the Pathfinder spacecraft, how they were diagnosed, and how they were
solved. I wanted to share his story with each of you.
VxWorks provides preemptive priority scheduling of threads. Tasks on the
Pathfinder spacecraft were executed as threads with priorities that were
assigned in the usual manner reflecting the relative urgency of these tasks.
Pathfinder contained an "information bus", which you can think of as a
shared memory area used for passing information between different components
of the spacecraft. A bus management task ran frequently with high priority
to move certain kinds of data in and out of the information bus. Access to
the bus was synchronized with mutual exclusion locks (mutexes).
The meteorological data gathering task ran as an infrequent, low priority
thread, and used the information bus to publish its data. When publishing
its data, it would acquire a mutex, do writes to the bus, and release the
mutex. If an interrupt caused the information bus thread to be scheduled
while this mutex was held, and if the information bus thread then attempted
to acquire this same mutex in order to retrieve published data, this would
cause it to block on the mutex, waiting until the meteorological thread
released the mutex before it could continue. The spacecraft also contained
a communications task that ran with medium priority.
Most of the time this combination worked fine. However, very infrequently
it was possible for an interrupt to occur that caused the (medium priority)
communications task to be scheduled during the short interval while the
(high priority) information bus thread was blocked waiting for the (low
priority) meteorological data thread. In this case, the long-running
communications task, having higher priority than the meteorological task,
would prevent it from running, consequently preventing the blocked
information bus task from running. After some time had passed, a watchdog
timer would go off, notice that the data bus task had not been executed for
some time, conclude that something had gone drastically wrong, and initiate
a total system reset.
This scenario is a classic case of priority inversion.
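The sequence described above can be replayed in a toy, tick-based scheduler. This is only a sketch under made-up assumptions: the three tasks, their tick counts, and the watchdog limit are invented for illustration and are not Pathfinder's real numbers. `simulate(0)` runs plain priority scheduling; `simulate(1)` lets the mutex owner borrow the priority of a blocked waiter.

```c
/* Toy fixed-priority scheduler, one task per tick. All numbers are invented. */
enum { LOW, MED, HIGH, NTASKS };   /* meteo, comms, bus manager */

struct task {
    int prio;        /* larger number = more urgent */
    int release;     /* tick at which the task becomes ready */
    int work;        /* ticks of CPU time still needed */
    int uses_mutex;  /* 1 if all of its work happens under the shared mutex */
};

#define WATCHDOG_LIMIT 10  /* hypothetical: max ticks HIGH may stay unfinished */

/* Returns the tick at which HIGH finished, or -1 if the watchdog fired. */
int simulate(int inherit)
{
    struct task t[NTASKS] = {
        [LOW]  = { 1, 0, 3, 1 },
        [MED]  = { 2, 2, 20, 0 },
        [HIGH] = { 3, 1, 1, 1 },
    };
    int owner = -1;  /* index of the task holding the mutex, -1 if free */

    for (int tick = 0; tick < 100; tick++) {
        /* Watchdog: HIGH has been pending too long, so "reset". */
        if (t[HIGH].work > 0 && tick - t[HIGH].release > WATCHDOG_LIMIT)
            return -1;

        int best = -1, best_prio = -1;
        for (int i = 0; i < NTASKS; i++) {
            if (tick < t[i].release || t[i].work == 0)
                continue;                      /* not ready or already done */
            if (t[i].uses_mutex && owner != -1 && owner != i)
                continue;                      /* blocked on the mutex */
            int eff = t[i].prio;
            if (inherit && i == owner) {
                /* Borrow the priority of any more urgent blocked waiter. */
                for (int j = 0; j < NTASKS; j++)
                    if (j != i && tick >= t[j].release && t[j].work > 0 &&
                        t[j].uses_mutex && t[j].prio > eff)
                        eff = t[j].prio;
            }
            if (eff > best_prio) {
                best = i;
                best_prio = eff;
            }
        }
        if (best < 0)
            continue;                          /* idle tick */
        if (t[best].uses_mutex)
            owner = best;                      /* take (or keep) the mutex */
        if (--t[best].work == 0) {
            if (owner == best)
                owner = -1;                    /* release on completion */
            if (best == HIGH)
                return tick;
        }
    }
    return -1;
}
```

With these invented numbers, the medium task starves the low one while it holds the mutex, so `simulate(0)` ends in a watchdog reset, while `simulate(1)` lets the high task finish a few ticks after it is released.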
HOW WAS THIS DEBUGGED?
VxWorks can be run in a mode where it records a total trace of all
interesting system events, including context switches, uses of
synchronization objects, and interrupts. After the failure, JPL engineers
spent hours and hours running the system on the exact spacecraft replica in
their lab with tracing turned on, attempting to replicate the precise
conditions under which they believed that the reset occurred. Early in the
morning, after all but one engineer had gone
mutex's always cause trouble (Score:4, Interesting)
And you'll never ever see me coding an infinite wait for a mutex. That's just asking for trouble.
Bad: in Windows, FindNextChangeNotification()
requires those IPC operations and it always gives me grief.
Good: The Linux File Activity Monitor (FAM). Lets you open and read a pipe of actions. Nice!
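The "no infinite wait" rule from the top of this comment could look something like the sketch below. It is a bounded try-lock built on C11 `atomic_flag`, not a real mutex; the names `bounded_lock`, `bounded_acquire`, and `bounded_release` are hypothetical. On a real system you would more likely pass a finite timeout to the platform's own primitive, such as `semTake()` on VxWorks or `pthread_mutex_timedlock()` on POSIX.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type; initialise with { ATOMIC_FLAG_INIT }. */
struct bounded_lock {
    atomic_flag held;
};

/* Try to take the lock at most max_tries times, then give up.
 * The caller always gets control back and can log, retry, or fail safe. */
bool bounded_acquire(struct bounded_lock *l, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (!atomic_flag_test_and_set(&l->held))
            return true;   /* flag was clear, so the lock is now ours */
        /* A real task would sleep or yield here instead of spinning. */
    }
    return false;          /* bounded failure instead of hanging forever */
}

void bounded_release(struct bounded_lock *l)
{
    atomic_flag_clear(&l->held);
}
```

The point of the design is that the failure path is explicit: the caller has to decide what to do when the lock does not arrive in time.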
Re:mutex's always cause trouble (Score:2, Insightful)
Re:hard to imagine.. (Score:5, Interesting)
Other problems in the Mars Pathfinder were related to using the VxWorks filesystem. VxWorks basically only supports FAT on top of flash. For flash, FAT is a poor choice since some areas of the disk like the root directory and FAT tables will quickly wear out. Also, I don't think VxWorks has much support for working around bad sections of flash.
As far as VxWorks memory allocation support, in an ideal world one would statically allocate all memory, but oftentimes things are not ideal. In the product I work on, we have to have dynamic memory allocation, since depending on how the product is being used at the time, different data structures are required with no way of knowing beforehand how many of a particular type are needed, and this changes dynamically. For a simple device, it's easy to statically allocate everything, or if you have enough memory where you can statically allocate everything.
In our case, we statically allocate memory where we can, but in many cases we cannot. For example, I have to maintain a data structure keeping track of all of the network gateways connected to an output interface. We can have many thousands of gateways and thousands of output interfaces. There could be anything between one and thousands of gateways on an interface. In this case, I use static arrays for information on each gateway and each output interface, but must use dynamic data structures to list all the gateways connected to an output interface. It would be prohibitive to allocate storage for 30,000 gateways with 30,000 interfaces! I also can't use a linked list of gateways per interface since it doesn't scale, a linked list having access time O(n).
Also, we use third party libraries that perform dynamic memory allocation and it would be prohibitive to change that.
By replacing Wind River's malloc code with Doug Lea's code we eliminated fragmentation problems and saw our startup time drop from 50 minutes to 3 minutes. Doug Lea's malloc code is the basis of malloc in glibc and is very efficient. We also added support for tracing heap memory allocations to keep track of which task allocated a block and where it was allocated. This alone helped tremendously in tracking down a number of memory leaks since we can just walk the heap and see exactly where all the memory is being allocated. This is a sorely missing feature in VxWorks.
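A rough idea of what allocation tracing like this can look like, as a sketch only. This is not the poster's code and not a VxWorks API: `traced_malloc`, `traced_free`, `dump_live`, the `TMALLOC` macro, and the fixed-size side table are all hypothetical, and the task id is just an integer supplied by the caller.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_TRACKED 1024   /* hypothetical cap on blocks tracked at once */

struct alloc_rec {
    void *ptr;             /* NULL means this slot is unused */
    size_t size;
    int task_id;           /* who allocated it */
    const char *file;      /* where it was allocated */
    int line;
};

static struct alloc_rec alloc_table[MAX_TRACKED];

void *traced_malloc(size_t size, int task_id, const char *file, int line)
{
    void *p = malloc(size);
    if (p == NULL)
        return NULL;
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (alloc_table[i].ptr == NULL) {
            alloc_table[i] = (struct alloc_rec){ p, size, task_id, file, line };
            return p;
        }
    }
    return p;              /* table full: block is valid but unrecorded */
}

void traced_free(void *p)
{
    if (p == NULL)
        return;
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (alloc_table[i].ptr == p) {
            alloc_table[i].ptr = NULL;
            break;
        }
    }
    free(p);
}

/* Walk the table, print every live block, and return how many there are. */
int dump_live(FILE *out)
{
    int live = 0;
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (alloc_table[i].ptr != NULL) {
            fprintf(out, "task %d: %zu bytes at %p from %s:%d\n",
                    alloc_table[i].task_id, alloc_table[i].size,
                    alloc_table[i].ptr, alloc_table[i].file,
                    alloc_table[i].line);
            live++;
        }
    }
    return live;
}

/* Records the call site automatically. */
#define TMALLOC(size, task) traced_malloc((size), (task), __FILE__, __LINE__)
```

Calling `dump_live(stdout)` at a quiet point shows which task and which source line own every block that has not been freed, which is the "walk the heap" step described above. A production version would store the record in the block header rather than a linear table.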
The lack of memory protection is another major problem for complex tasks. We have a bug where random memory locations get corrupted, and we've spent weeks trying to track down the cause without any luck.
Needless to say, all new projects where I work will not run on VxWorks. All of the chip vendors we're looking at are either dropping support for it or have already dropped it and are focusing on Linux.
BTW, this is one feature I would *REALLY* love to see added to Linux. The company I'm working for is looking at writing our next generation platform on top of an embedded Linux. We have not yet decided which one to use, but want something 2.6 based.
With priority inheritance, if a mutex is held by a low priority task and a high priority task tries to grab it, the low priority task is automatically boosted to the priority of the highest priority task that has attempted to acquire the mutex. When the mutex is released, the low priority task's priority is restored.
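The boost-and-restore rule just described fits in a few lines. The sketch below is a bookkeeping model only, with hypothetical names (`struct task`, `pi_mutex`, `pi_lock`, `pi_unlock`); it does not block or schedule anything, it just shows what happens to the owner's priority. VxWorks itself offers this behavior through the `SEM_INVERSION_SAFE` option when a mutex semaphore is created.

```c
#include <stddef.h>

struct task {
    int base_prio;   /* priority assigned by the designer */
    int cur_prio;    /* priority the scheduler currently uses */
};

struct pi_mutex {
    struct task *owner;   /* NULL when the mutex is free */
};

/* Returns 1 if the caller now owns the mutex, 0 if it would have to block.
 * When it has to block, the owner inherits the caller's priority if higher. */
int pi_lock(struct pi_mutex *m, struct task *t)
{
    if (m->owner == NULL) {
        m->owner = t;
        return 1;
    }
    if (t->cur_prio > m->owner->cur_prio)
        m->owner->cur_prio = t->cur_prio;   /* boost the low priority owner */
    return 0;
}

/* Release the mutex and drop the owner back to its own priority. */
void pi_unlock(struct pi_mutex *m)
{
    if (m->owner != NULL) {
        m->owner->cur_prio = m->owner->base_prio;
        m->owner = NULL;
    }
}
```

For example, if a priority 1 task holds the mutex and a priority 9 task calls `pi_lock()`, the holder runs at 9 until `pi_unlock()`, so a priority 5 task can no longer preempt it in the middle of the critical section.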
Some other nice features are interrupt scheduling and better priority based message passing support (which may already be present, I'm still looking into this).
Finally, one very useful feature would be the ability to guarantee a real-time thread a certain percentage of the CPU, with the option of placing a hard limit if it tries to exceed that or temporarily lowering its priority to non-realtime so as to not starve no
Re:hard to imagine.. (Score:4, Interesting)
Re:hard to imagine.. (Score:3, Funny)
Oh shit, i forgot to rerun 'lilo' before rebooting!
Re:hard to imagine.. (Score:2)
CB
Re:hard to imagine.. (Score:2)
Efficiency (Score:2, Funny)
Re:Efficiency (Score:3, Interesting)
Things like the Linux kernel have to know about hundreds or thousands of different devices, which is why they're so big.
Re:Efficiency (Score:3, Interesting)
Re:Efficiency (Score:3, Insightful)
Now, any particular instance of the kernel gets compiled for a specific processor, and only includes the drivers it needs. Which does save on some space. But a lot of that extra space comes from things like a dynamic linker/loader, graphics packages, local shells (usually in multiple flavors,) and a host of other applications that are "standard."
The thing that saves *that
Re:Efficiency (Score:5, Interesting)
Okay, so this was some 14 years ago - but it was doing a lot of work. 2 megabytes is a lot of memory! There's a phenomenal amount of code and data that can be stored in 2 meg. Maybe it's good by current standards, but - personally - I would suggest that current standards are a bad place to start from.
Re:Efficiency (Score:2)
Agreed. Take the FCS MK 98/2, which controls the Navy's Trident II missiles and performs the prelaunch guidance calculations. It takes about 20 mins to calculate a launch package (24 missiles x 8 warheads ea) from a standing start, and controls the launch sequence in real time. (Including assembling a complex data preload for the
Re:Efficiency (Score:5, Informative)
> megabytes; the rest of the code, plus data space,
> eventually exceeded 30 megabytes." This should be used as
> the example for efficient coding
You've GOT to be kidding, right? 2 meg of OS code? That's ULTRABLOAT compared to most spacecraft. In fact, for the vast majority of the space age, that would have exceeded the resources of the computer by several orders of magnitude.
I've done this kind of programming for a living (for 10 years, moved up to controls design) but the last system I programmed for has 372k of memory, total. That includes data, code, OS, everything. Runs at 432 KIPS. And it performs what is probably one of the most complex in-flight autonomous control operations ever.
Most are even more restrictive. For example, 8K of PROM, 1k of volatile memory, and 28 WORDS of non-volatile memory. This is more than adequate for most applications, if you do it right.
Many spacecraft OS's are more akin to this:
hardware interrupt
external electronics power up processor.
external electronics set PC = 80hex
run
{execute all the code}
halt
power down
Once every 1/4 a second for 15 years.
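In C terms, a wake-run-halt cycle like the one listed above has no scheduler and no loop of its own; the hardware provides the loop. A minimal sketch, where `frame_state`, `run_frame`, and the placeholder control law are all invented for illustration:

```c
#include <stdint.h>

/* Everything the code remembers between wakeups (hypothetical layout). */
struct frame_state {
    uint32_t frames;       /* wakeups handled so far */
    int32_t last_input;
    int32_t last_output;
};

/* One wakeup: read the input, compute, store the output, return.
 * Returning stands in for "halt, power down" in the listing above. */
void run_frame(struct frame_state *s, int32_t input)
{
    s->last_input = input;
    s->last_output = -input / 2;   /* placeholder control law, not a real one */
    s->frames++;
}
```

At four wakeups a second, 15 years comes to roughly 1.9 billion frames, so even the frame counter fits in 32 bits.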
The project I am currently working on uses VxWorks (and so we were quite interested in the Mars Rover problem) and it's so bloated with unnecessary features it's absurd. This is not a Windows box, it's a spacecraft processor.
I can't argue with the 30 meg of data space. Using the memory as a data recorder would be quite useful and a good picture takes a lot of space. But it's alarming to me that you could figure out how to waste maybe 4-5 meg on code. If you started with a bare home-brew OS, I would guess (and I get paid for this sort of guess) that you could do the entire flight code in 512K, with maybe 8k of data space, excluding the science data.
Only recently have space-qualified rad-hard processors with this kind of capability become available. Until then, if you said you needed 2 meg for the OS alone, you would have gotten fired on the spot and referred to mental health professionals. The availability of these processors enabled people to use high-level languages with tremendous overhead (like C++). And this was only done for employee retention purposes during the bubble. For years it was done at the assembler or even machine level. It's still not at all uncommon to do, and we've done MANY flight code patches, with only a processor handbook, an engineering paper pad, and by setting individual bits one-by-one.
Brett
Re:Efficiency (Score:2)
Note: This was in 1997... I believe that 32-bit space-hardened CPUs d
Summary of OS code (Score:4, Funny)
Re:Summary of OS code (Score:5, Funny)
Not quite bug free yet.
Re:Summary of OS code (Score:2)
Re:Summary of OS code (Score:5, Funny)
make: *** [roveros] Error 1 I'm sorry, your rover is lost in space. Insert $1 billion and press any key to try again.
Re:Summary of OS code (Score:2)
for (;;)
{
Dig(); Picture();
}
Re:Summary of OS code (Score:5, Funny)
Re:Summary of OS code (Score:2)
25
Of course you still have some kind of carriage returns/line breaks.
Re:Summary of OS code (Score:2)
$ echo "1 DIG:PICTURE:RUN"|wc -m
18
Re:Summary of OS code (Score:2)
Note: will only work as expected on platforms where the callER cleans up the stack (cdecl). "Pascal"-style calling convention (common in m$ crap) where the callEE cleans up the stack will probably run out of heap eventually.
Re:Summary of OS code (Score:2)
George Neville-Neil (Score:5, Informative)
The interviewer George Neville-Neil co-authored "The Design and Implementation of the FreeBSD Operating System" with Marshall Kirk McKusick.
Reinventing the wheel. (Score:5, Funny)
Carmack (Score:4, Interesting)
In the beginning he got into 3d game applications for a similar reason. The cutting edge is always the very outer area of human development, and Carmack makes a good example of a programmer who has taken aim at the edge of what is known to programmers. Maybe Mr. Carmack would care to comment?
Much like how Id Software develops engines, spacecraft programming is new and innovative, although the difference is that spacecraft systems have no room for error.
Wait a minute? (Score:4, Insightful)
VxWorks does not even offer memory protection and the ram can get fragmented. Not to sound trollish but I would pick something like Qnx or NetBSD for any critical app or embedded device.
It's amazing the engineers fixed it and got it to work reliably, but a better, more mission-critical operating system would have been a better choice.
Re:Wait a minute? (Score:3, Interesting)
I would pick something like Qnx or NetBSD for any critical app
Okay, let's turn NetBSD into a real-time OS. Add some "hardening" features like watchdogs etc. Hmm... what should we call it? Perhaps: SpaceBSD?
Re:Wait a minute? (Score:2)
I think QNX is a valid alternative. But is NetBSD hard-real-time?
Re:Wait a minute? (Score:2)
Also Dynamic Memory allocation makes for ... "Interesting" testing "Opportunities". That's not to say I've never done it, only that I sort of wish I hadn't
Re:Wait a minute? (Score:2)
Um, yeah, but we're talking about spacecraft here. I think that qualifies as an application that needs true Real Time behavior.
Re:Wait a minute? (Score:2)
Re:Wait a minute? (Score:5, Insightful)
Dynamically allocating memory is usually a big no-no in real time systems.
Re:Wait a minute? (Score:4, Insightful)
Why would you even want memory protection in a system like this? Memory protection is great to prevent crappy apps on your PC from doing too much damage, but in a system like the Rover it's pure overhead.
As for ram getting fragmented, it all depends on how you program it. Often, you don't even need memory allocation, so you won't have any problem with fragmentation.
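One common middle ground between "no allocation at all" and a general-purpose heap is a pool of fixed-size blocks carved out of a static array. This is a swapped-in illustration, not something the article or VxWorks is claimed to do; the names `pool_init`, `pool_alloc`, `pool_free` and the sizes are hypothetical. Because every block is the same size, the pool cannot fragment, and both operations take constant time.

```c
#include <stddef.h>

#define POOL_BLOCKS 8    /* hypothetical sizes, fixed at build time */
#define BLOCK_SIZE  64

union block {
    union block *next;               /* valid only while the block is free */
    unsigned char bytes[BLOCK_SIZE]; /* payload while the block is in use */
};

static union block pool[POOL_BLOCKS];  /* all storage is static */
static union block *free_list;

/* Thread every block onto the free list. Call once at startup. */
void pool_init(void)
{
    free_list = NULL;
    for (int i = POOL_BLOCKS - 1; i >= 0; i--) {
        pool[i].next = free_list;
        free_list = &pool[i];
    }
}

/* Pop one block, or return NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    union block *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;
}

/* Push a block back onto the free list. */
void pool_free(void *p)
{
    if (p == NULL)
        return;
    union block *b = p;
    b->next = free_list;
    free_list = b;
}
```

The worst case is also easy to test: the pool either has a block or it does not, and exhaustion shows up as a plain NULL at a known limit rather than as slow heap decay.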
Hell yes! (Score:5, Insightful)
Exactly!
The problem is that most /.ers are used to thinking of an OS as something that needs to run any arbitrary program under any arbitrary conditions and survive any arbitrary crash in those programs.
For a Rover, none of those are true. They know exactly what code is going to be run. They know exactly where it's going to sit in memory. And they test it. (This is the part that /.ers can't quite understand.) They test these programs far more rigorously than any bog-standard x86 Linux OSS program ever gets tested. Those programs have their problems, but they will be mistakes in logic (metric/imperial conversions, or thread priority inversions), not segfaults because of derefing a null pointer.
I wonder how many undergrad CS degree programs still teach correctness proofs? Not "yeah, I ran it lots of times and it didn't crash," but "I ran it 100,000 times with 100,000 different inputs, all random, and it didn't crash, but while it was running I also sat down and mathematically proved the code is correct."
Embedded programming is just plain different than "normal" programming. It's usually a mistake to try to generalize from one to the other.
(All that said, the next version of VxWorks is advertised to optionally support a "traditional Unix" process model, and I think protected memory boundaries are one of the features. In case your embedded app needs to run arbitrary third-party software which probably doesn't get stress-tested at JPL :-), you can turn all that stuff on and live with the overhead.)
Re:Hell yes! (Score:2)
In our work we have to use a component supplied by what is essentially a parent company. One high level developer/manager is very proud of the fact that he runs tests with random input. The component often still has serious, basic problems when we get it. I'm not convi
Re:Hell yes! (Score:2)
I mostly agree with you, but was trying to make a rhetorical point.
compilation error found (Score:3, Funny)
#include <stdio.h>
int main() {
    printf("Hello World!\n");
    return 0;
}
marsrover.c: 3: You are no longer on the planet Earth.
Will they quit using FAT? (Score:4, Informative)
Remember some time ago Spirit was continuously rebooting due to a flash memory problem. The use of the FAT file system in the embedded system was partly responsible for the mess.
The problem, Denise said, was in the file system the rover used. In DOS, a directory structure is actually stored as a file. As that directory tree grows, the directory file grows, as well. The Achilles' heel, Denise said, was that deleting files from the directory tree does not reduce the size of the directory file. Instead, deleted files are represented within the directory by special characters, which tell the OS that the files can be replaced with new data.
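The directory behavior described in that paragraph can be modeled in a few lines. This is a toy model, not the rover's file system code: `struct dir` and the `dir_*` functions are hypothetical, and each entry is reduced to the first byte of its name. FAT does mark a deleted entry by overwriting that byte with 0xE5.

```c
#include <stddef.h>

#define DIR_CAPACITY  64     /* hypothetical limit for the toy model */
#define ENTRY_DELETED 0xE5   /* FAT marker for a deleted directory entry */

struct dir {
    unsigned char first_byte[DIR_CAPACITY]; /* first name byte per entry */
    size_t used;             /* entries ever written: the directory's size */
};

/* Add a file: reuse a deleted slot if there is one, otherwise grow. */
int dir_create(struct dir *d, unsigned char first)
{
    for (size_t i = 0; i < d->used; i++) {
        if (d->first_byte[i] == ENTRY_DELETED) {
            d->first_byte[i] = first;
            return (int)i;
        }
    }
    if (d->used == DIR_CAPACITY)
        return -1;
    d->first_byte[d->used] = first;
    return (int)d->used++;
}

/* Delete a file: the slot is only marked, the directory never shrinks. */
void dir_delete(struct dir *d, size_t idx)
{
    if (idx < d->used)
        d->first_byte[idx] = ENTRY_DELETED;
}

/* Count the entries that still refer to real files. */
size_t dir_live(const struct dir *d)
{
    size_t live = 0;
    for (size_t i = 0; i < d->used; i++)
        if (d->first_byte[i] != ENTRY_DELETED)
            live++;
    return live;
}
```

Create ten files and delete them all, and `used` stays at ten while `dir_live()` drops to zero: the directory remembers its high-water mark, which is the growth the article is describing.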
By itself, the cancerous file might not have been an issue. Combined with a "feature" of a third-party piece of software used by the onboard Wind River embedded OS, however, the glitch proved nearly fatal.
According to Denise, the Spirit rover contains 256 Mbytes of flash memory, a nonvolatile memory that can be written and rewritten thousands of times. The rover also contains 128 Mbytes of DRAM, 96 Mbytes of which are used for data, such as buffering image files in preparation for transmitting them to Earth. The other 32 Mbytes are used for code storage. An additional 11 Mbytes of EEPROM memory are used for additional program code storage.
The undisclosed software vendor required that data stored in flash memory be mirrored in RAM. Since the rover's flash memory was twice the size of the system RAM, a crash was almost inevitable, Denise said.
Moving an actuator, for example, generates a large number of tiny data files. After the rover rebooted, the OS's heap memory would be a hair's breadth away from a crash, as the system RAM would be nearly full, Denise said. Adding another data file would generate a memory allocation command to a nonexistent memory address, prompting a fatal error.
Source: DOS Glitch Nearly Killed Mars Rover [extremetech.com]
BTW, there is another interview of Mike Deliman [pcworld.com] I read some time ago in PCWorld.
Other options being considered (Score:5, Insightful)
There are a few groups at JPL that have been actively experimenting with other options, including RTLinux and a few different variants of hard-real-time Java (basically Java with explicit memory management and no garbage collection).
Re:Other options being considered (Score:2)
Re:Other options being considered (Score:3, Informative)
Huh, its easy.. (Score:5, Funny)
GO NORTH..
you are in a red rocky landscape..
DIG.
ok. you see some red sand.
it is getting dark.
GO NORTH..
you were eaten by a grue.
my satellite debugging experience (Score:5, Interesting)
The misalignment meant the spacecraft was unable to look directly at the sun's center to record the amount of radiation streaming toward Earth. To accurately measure sunlight, the darn thing needed to be pointed to within a quarter of a degree of dead center.
It took about four and a half months to fix that problem, due to uplink difficulties. Ground controllers first had to slow the spacecraft's spin in order to transmit a series of software "patches" and then gradually speed it up to see how well the commands worked.
Then things were fixed.
Moral of the story: it is a tough job indeed!
Marketing crap (Score:4, Insightful)
I used vxworks on a reasonably large project several years ago. It's a fine piece of work, but nothing special; it's nowhere close to the quality of a recent linux kernel.
About half-way through our project we developed a need for a local filesystem on our box. We bought a FAT filesystem add-on from wind river that was annoyingly poor quality, lots of bizarre little problems, memory leaks, and of course no source to look at. In the end we didn't use it, we put together our own filesystem from freely available sources.
When I read the articles about vxworks filesystem problems nearly borking the entire Mars rover mission I laughed and laughed. I'm sure that it was the same crappy code (although I don't really know for sure).
For me it's a case study on why you shouldn't use closed source software: you can't evaluate the quality of the code on the other side of the trade-secret barrier and you wind up trusting things like glossy brochures.
jeff
Re:Marketing crap (Score:2, Interesting)
I do embedded software for a living as well, and run like heck away from any project involving WindRiver.
WindRiver is great for those people who don't know what they are doing in the embedded space. And it's useful as a red flag for telling one as such.
But for people who actually know what they are doing, and who actually do understand OS's, Linux solutions are a far better choice. The time-to-market is absolutely unbeatable; as well as all the choices that one has in order to get a
Ditto (Score:3, Insightful)
Part of the problem in my case was that VxWorks is for smaller embedded systems, which my project is NOT. I need fast disk storage, I need graphics, I need networking, I need things that VxWorks just doesn't provide very well.
Were I able to change one decision about the design of my project, I would have gone with Linux instead.
WRS *used* to have something to offer, in that they provided a r
Huh? (Score:4, Insightful)
"...and even though I chose the wrong tool for the job, it's still the tool's fault for not doing everything I need."
Re:Huh? (Score:4, Informative)
"WindRiver portrayed their tool as being able to do those things, thus I made the wrong decision based upon the false claims of the manufacturer."
You see, WRS would have you believe that VxWorks has a reasonable disk subsystem, even though they have no option of using DMA for the data transfers, a fact they conveniently don't make available.
WRS had a port of XFree available for VxWorks. However, they did not release the source for it, and they stopped supporting it, and thus it fell behind in support for the video chips now in use. Of course, they did not inform developers of their impending decision to drop support until it was too late.
WRS has a TCP/IP stack. However, they did NOT have support for DHCP, nor DNS, and on certain platforms their stack has gross errors (e.g. packets being shifted by one byte so that when they reach the application they are corrupted.)
WRS claims to have board support packages so that you don't have to develop them. They don't mention that they don't support half the hardware on most boards (e.g. they don't enable the cache on XScale processors, halving the speed of the processor).
WRS claimed they would support development under Linux as a host OS "within a couple of months" - that was back in 1998. They started supporting development under Linux this year - and then not very well.
Yes, I chose the wrong tool for the job - because WRS did not correctly represent their tool's capabilities and there was no other way to evaluate the capabilities of the tool.
Open source spaceware (Score:5, Insightful)
If that was open source, there are so many space nerds who are programmers that flaws of that magnitude would never get by the army of testers.
Many would help out simply because hey it's the *space program* and that's good enough for them. Others would want their name listed next to some obscure bug fix on a NASA site; it's good for the ego or your CV.
Simply put, even a binary distribution of that code would allow unlimited free testing for crashes. Why wouldn't NASA do it?
Because there are still people in Washington who think code mysteriously gets damaged by being public - even if such code isn't modifiable by the public who reads it.
This is evidence of advanced cluelessness in Washington, and maybe independent anti-free-source advocates (spelled M-i-c-r-o-s-o-f-t) are the cause.
But I've learned not to bash. Never explain by Microsoft malice what could be explained by stupidity. Such as using DOS on a space thing...
Re:Open source spaceware (Score:3, Informative)
Re:Open source spaceware (Score:3, Insightful)
Open Source is great and all, but it's hardly the answer to everything.
Re:Open source spaceware (Score:2)
Re:Open source spaceware (Score:5, Interesting)
Sure you can. We [terma.com] make that kind of software. The reason you won't ever see it as open source is because the various instruments on the spacecraft are covered by confidentiality agreements (or worse, in case of military hardware). And as hardware goes it is typically rather obscure stuff, requiring significant domain knowledge as well to emulate correctly.
Another issue is that these systems are rather CPU-intensive - we have a 16-CPU box for the spacecraft instruments plus a dedicated PC to emulate the flight computer itself. But you could run it on simpler hardware if you are willing to run at less than realtime speed.
Interestingly, the closest we ever get to seeing the actual flight software is binary images of it. While that is a lot closer than most slashdotters are likely to get, it is still far removed from being able to do something useful with it.
Of course the other good reason why this isn't going to be open source is because of price. For details you should really contact a salesperson, but let me give you a clue: (raises little finger to mouth) "Mwuhahaha!" ;-)
Re:Open source spaceware (Score:2)
It just adds to my point that Open Source of space software just isn't really viable. Your 1 MILLION DOLLARS!
I suppose I should rephrase my statement... You just don't dump some satellite code onto your average OSS hacker's PC and "test" it.
Re:Open source spaceware (Score:3, Informative)
Re:Open source spaceware (Score:2)
Re:Open source spaceware (Score:2)
Re:Open source spaceware (Score:2)
Re:Open source spaceware (Score:3, Insightful)
Almost certainly not, as none of that army of geeks would have the specialized hardware that the Rovers use.
Few would accomplish anything, as few would bother to study, and learn, and analyze the structure of the program.
Debugging in space: a case for dynamic systems. (Score:5, Interesting)
Perhaps not surprisingly for anyone who has heard about the management at NASA, C++ was selected for the successors to the Remote Agent on the grounds that it is supposed to be more reliable (this despite the fact that the Remote Agent was originally to be developed in C++, an effort that was abandoned after a year of failure). This caused more than a few people to be upset [google.com] (including a very personal account [flownet.com] by one of the aforementioned designers). Clearly the debugging facilities of Common Lisp are far superior to static systems like C++, something which is very useful in diagnosing unexpected error conditions in spacecraft software (read the first question on p. 3 of the interview to see what pains the JPL staff went through to adapt similar, ad-hoc methods to VxWorks). It's also clear from this interview (question: "How is application programming done for a spacecraft?" Answer: "Much the same as for anything else: software requirements are written, with specifications and test plans, then the software is written and tested, problems are fixed, and eventually it's sent off to do its job.") that NASA has in no way tried to adopt formal verification methods for its software, preferring instead to rely on the "tried and true" (at failing, maybe) poke-and-test development "methods."
Clearly, formal verification methods to eliminate bugs before critical software is deployed, and deployment in a system with advanced debugging facilities, are a clear win for spacecraft software and should be adopted as the standard model of development. Unfortunately, as in many other software development enterprises, inertia keeps outdated, inadequate systems going despite their strong correlation with failure.
Re:Debugging in space: a case for dynamic systems. (Score:4, Interesting)
Having said all of that, I'll agree that formal verification at NASA is in its infancy, and is facing an uphill battle for acceptance (witness how long the Langley group has been trying to push formal methods). It'll be interesting to see what happens with JPL's LaRS.
Out of curiousity (Score:2, Interesting)
I can't imagine it would be the cost of the memory... I mean I know it costs much much more to make chips to a very strict specification, but if you are already producing so few units, isn't your cost of production going to be extraordinarily high whether you are making 64KB
Re:Out of curiousity (Score:5, Informative)
Last I read (maybe a year ago?), NASA still used 386 and 486 chips because they didn't generate a lot of heat (compared to today's machines) and could be made to withstand higher than normal forces (through extra padding on the device, I imagine). They were more resilient to the issues you might see in space than newer processors.
Simply put, if they put the latest CPU with tons of RAM in there, and it fails, how are they going to fix it?
-- Joe
And on top of all that... (Score:3, Informative)
...the memory inside the Gameboy Advance and whatnot isn't radiation-hardened.
The grandparent poster needs to RTFA, and note what had to be done to protect circuits from Marvin the Martian's cosmic rays. The chips get physically bigger (sometimes a lot bigger), and that builds up quickly.
Re:Out of curiousity (Score:4, Informative)
Smaller memory capacity for a given surface area implies larger feature size.
By the way, the class I took was 1-on-1 with Prof. Stephen McGuire at Cornell. Extremely cool guy.
Re:Out of curiousity (Score:4, Insightful)
In the real world, once you get up in the vicinity of the Van Allen belt, you get into hard radiation. If you use typical modern high density chips, with 0.15 micron die spacing, a single particle will short/damage half a dozen traces on the chip in a single impact. If you use really old stuff, with 5 micron die spacing (and higher), a particle will be too small to hit multiple traces in a single impact. You may still get a single bit flip, but ECC will catch that, and you can deal with it. In the former case of a high density die, the failure would end up being catastrophic when a particle impacts the chip. There are practical limits to the size of die that can be mounted on a carrier, and the trace density defines the capacity of that die. Yes, it's possible to cram 32 meg of RAM into that space, but it won't last more than a few minutes in a hard radiation environment. Take that same silicon wafer, using 5 micron traces, and it'll last years exposed to the same environment, but it'll only have 1 meg of usable RAM locations due to the decrease in density. You can't just throw more of them on, because then power consumption becomes the issue. In overly simplified terms, the chip is going to use power relative to its surface area; it matters not if it's got 1 or 32 meg of addressable locations in that area. Clock frequency is the other major contributor to power consumption, hence it's not uncommon at all to see space hardware clock speeds measured in kHz rather than the MHz and GHz most folks are used to, and there are damn good reasons to leave it that way.
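To make the ECC point concrete, here is a minimal sketch of single-error correction using a textbook Hamming(7,4) code. This is an illustration only: nothing in this thread says which ECC scheme any flight hardware actually uses, and the function names are made up for the example. It shows why one flipped bit per word is survivable while two flipped bits in the same word are not.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (0..15) as a 7-bit codeword, bit positions 1..7."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # parity over positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # parity over positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # parity over positions 5, 6, 7
    # Layout: position 1=p1, 2=p2, 3=d0, 4=p4, 5=d1, 6=d2, 7=d3
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def hamming74_decode(code):
    """Return (nibble, syndrome); syndrome is the 1-based position that was
    corrected, or 0 if the word looked clean."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bit the syndrome points at
    nibble = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return nibble, syndrome


def flip(code, *positions):
    """Return a copy of the codeword with the given 1-based positions flipped."""
    c = list(code)
    for p in positions:
        c[p - 1] ^= 1
    return c


if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    print(hamming74_decode(flip(word, 5)))     # one upset: data recovered
    print(hamming74_decode(flip(word, 2, 6)))  # two upsets: silently wrong data
```

With a single upset the decoder repairs the word. With two upsets in the same word it "corrects" a third, innocent bit and hands back bad data, which is the multi-trace hit described above.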
An all up spacecraft platform has hard limits on physical size (constrained by the physical limits of the launcher), and hard limits on total mass, determined by the launch vehicle capability to the final trajectory required. The final design will budget a portion of its mass allowance to power generation, and that power is in turn budgeted to various systems. The folks doing the controllers will have a hard limit on power consumption, another on volume, and a third on mass. Working within those limits, they have to design and deploy a system that is expected to have 99.999999% reliability, operating in conditions more extreme than it's possible to actually simulate on earth.
It's a shame, but there is one thing they don't seem to teach in computer science courses anymore. Out here in the real world, reality gets in the way of all the theory. Moore's law may well say chips will get faster, and density higher as time goes on, but it becomes irrelevant when other limiting factors get in the way. Until gamma particles start to shrink, or we come up with an effective way of making sure they don't hit the electronics, 10 year old and older stuff is going to remain 'state of the art' for use in space. Die density and ability to shield are hard limitations, you can't get past them, and you won't see more modern equipment going into the reaches of space till those limitations are overcome. That's not likely to happen in the foreseeable future. The research in that area is all 'nuclear research' and that's all out of vogue these days; it's gonna take a couple more generations or a severely critical power shortage to change that.
Re:Out of curiousity (Score:5, Informative)
Spacecraft (Score:4, Funny)
My first thought was "Spacecraft? is that a new Starcraft clone I hadn't heard about?". It was then I realized I've been hanging out on the Game Programming Wiki [gpwiki.org] too much lately.
Similar, though terrestrial, problems (Score:5, Interesting)
About five years ago, I worked for a major test equipment manufacturer who was contracted to deliver a test system for POTS lines (which could eventually do ADSL prequalification) to a national telco in a major European country. The idea was to test every POTS line in the system (millions of them) every night to detect early signs of degradation so repair crews could be dispatched before dialtone was completely lost.
As you can imagine, this involved a distributed system of test heads in each central office, networked back to a central command and control site. The system worked well, but had one flaw: downloading new firmware to the test heads was fraught with problems, and often led to the test head "locking up", even though a backup copy of firmware was always present, along with a hardware watchdog timer (though it was possible to lock out the watchdog interrupt, particularly when reprogramming flash, so it was a less than perfect watchdog). In these situations, one had to dispatch a "truck roll" to the affected central office, and replace EPROMs by hand.
Needless to say, the customer was pissed. More worrying was that, even if we fixed the software download problem (which we were unable to reproduce in the lab), we'd be paying for truck rolls all over the country. This was a not insignificant amount of money.
Management frittered away time, instead of authorizing a root cause analysis, by requesting tweaks to TCP/IP operating parameters, and testing to see if the problem was getting better or worse. This did not prove illuminating, time was wasted, and the customer was getting royally angry.
Finally, a small team of us were permitted to undertake a root cause analysis to find and fix the problem: the engineer responsible for the embedded flash file system, the telecom engineer on the control side, and me, responsible for the embedded O/S and TCP/IP stack (inherited from the supplier of the embedded O/S). We wanted a month. We got two weeks. Remember, deploying experimental software to live COs requires so many layers of approval, it isn't funny, and we were worried that would be our biggest bottleneck.
Finally, the controller telecom engineer was able to reproduce the problem, by attempting to download software from our controllers to deployed equipment in a single central office (getting permission was a feat in itself -- while there was little danger of affecting telephone service, this was a live CO).
The problem was clear: the data network was slow (9600 b/s over an X.25 PVC, carrying PPP-encapsulated TCP/IP), resulting in the use of large MTUs to minimize packetizing overhead (latency wasn't an issue - throughput was). Because of the way the controller's TCP/IP stack worked, it misestimated the packet/ack round trip time: it used a one byte payload for the first packet, and full MTUs after that. The resulting packet ACK timeout and retransmissions exposed an inconsistency between controller and embedded TCP/IP stacks that caused the embedded system to lock up.
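A back-of-the-envelope sketch of why that misestimate hurts on a 9600 b/s link. The post doesn't give the MTU or say how the controller's stack computed its timeout, so the 1500-byte MTU, the 40 bytes of TCP/IP headers, and the RFC 6298 style first-sample rule (RTO = SRTT + 4*RTTVAR, i.e. 3x the first RTT sample) are all assumptions for illustration. Propagation delay and X.25/PPP framing overhead are ignored.

```python
LINK_BPS = 9600      # from the post
HEADER_BYTES = 40    # assumption: 20-byte IP + 20-byte TCP header, no options
MTU_BYTES = 1500     # assumption: the post only says "large MTUs"


def serialization_s(nbytes, bps=LINK_BPS):
    """Seconds needed to clock nbytes onto the wire at bps."""
    return nbytes * 8 / bps


def first_rto_s(rtt_sample):
    """RFC 6298 style timeout after the first RTT sample R:
    SRTT = R, RTTVAR = R / 2, RTO = SRTT + 4 * RTTVAR = 3 * R.
    The RFC's 1-second floor is deliberately left out here."""
    return rtt_sample + 4 * (rtt_sample / 2)


ack_s = serialization_s(HEADER_BYTES)                     # bare ACK coming back
small_rtt = serialization_s(HEADER_BYTES + 1) + ack_s     # 1-byte payload probe
full_rtt = serialization_s(MTU_BYTES) + ack_s             # full-MTU segment
rto = first_rto_s(small_rtt)

if __name__ == "__main__":
    print(f"RTT seen with 1-byte payload: {small_rtt:.4f} s")
    print(f"Timeout derived from it:      {rto:.4f} s")
    print(f"RTT of a full-MTU segment:    {full_rtt:.4f} s")
    print("Spurious retransmit:", rto < full_rtt)
```

Under these assumptions the timeout learned from the tiny first packet is a fraction of the time a full-MTU segment needs just to get onto the wire, so the sender retransmits before the ACK could possibly arrive. Even a 1-second floor on the timeout would still fall short of the full-MTU round trip here.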
Great. Now, how to fix it?
The fix wasn't a big deal (I implemented a fix in the embedded TCP/IP code since we didn't have source to the controller TCP/IP stack), but deploying it was: remember we couldn't download the code successfully, and we didn't want to pay for a truck roll.
At this point, I proposed something daring: download a small patch, in as few packets as possible (we could send three full MTUs safely), which would patch the existing code in place and be good enough to reliably download a complete replacement.
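For readers who haven't done this kind of thing, a rough sketch of the general idea of a tiny in-place patch. This is not the format we used; the record layout, the CRC check, and every name and size below (including the 1500-byte MTU and 40-byte headers) are hypothetical, chosen only to show the shape: a few (offset, bytes) records, verified before anything is touched, small enough to fit the three-packet budget.

```python
import zlib

MTU_BYTES = 1500     # assumption, not stated in the post
HEADER_BYTES = 40    # assumption: TCP/IP headers without options
MAX_PACKETS = 3      # from the post: three full MTUs could be sent safely


def build_patch(records):
    """Pack (offset, data) records as 4-byte offset + 2-byte length + data,
    followed by a CRC32 of everything before it."""
    body = b"".join(
        offset.to_bytes(4, "big") + len(data).to_bytes(2, "big") + data
        for offset, data in records
    )
    return body + zlib.crc32(body).to_bytes(4, "big")


def fits_budget(patch):
    """True if the patch fits in the payload of MAX_PACKETS full segments."""
    return len(patch) <= MAX_PACKETS * (MTU_BYTES - HEADER_BYTES)


def apply_patch(image, patch):
    """Verify the CRC, then return a patched copy of the image.
    The original image is never modified, so a bad patch changes nothing."""
    body, crc = patch[:-4], int.from_bytes(patch[-4:], "big")
    if zlib.crc32(body) != crc:
        raise ValueError("patch failed CRC check")
    patched = bytearray(image)
    i = 0
    while i < len(body):
        offset = int.from_bytes(body[i:i + 4], "big")
        length = int.from_bytes(body[i + 4:i + 6], "big")
        data = body[i + 6:i + 6 + length]
        if len(data) != length or offset + length > len(patched):
            raise ValueError("patch record out of range")
        patched[offset:offset + length] = data
        i += 6 + length
    return patched


if __name__ == "__main__":
    image = bytearray(range(16))
    patch = build_patch([(2, b"\xaa\xbb"), (10, b"\xcc")])
    print(fits_budget(patch), apply_patch(image, patch).hex())
```

Checking the whole patch before writing a single byte is the design point: on a box you can't reach without a truck roll, a half-applied patch is worse than none.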
The thought of "self-modifying code" freaked management out to no end: it went against every rule in the book. But all three of us stood our ground: the only other alternative was a truck roll to each central office in the country. Reluctantly, we were allowed to proceed with that fix.
At this point, we had about ten days left. I had managed to get approval to pipeline the dev and tes
Re:Similar, though terrestrial, problems (Score:2)
Re:Similar, though terrestrial, problems (Score:2)
No brainer: TCP/IP (already supported) over Ethernet or PPP (serial). I quoted 2-3 weeks to implement the application layer stuff over that.
Idiot insisted that "for an embedded system", TCP/IP was "too fat" of a footprint: replace it with a home grown solution: we "only" had 128 MB RAM, "after all".
My protests that anything I could do in three weeks would be unlikely
Sig reply (Similar, though terrestrial, problems) (Score:2)
Re:Sig reply (Similar, though terrestrial, problem (Score:2)
Honestly, it never came up.
Look, a resume proves you can "talk the talk".
An interview is your opportunity to prove that you can "walk the walk" as well.
Re:Similar, though terrestrial, problems (Score:3, Funny)
Some years ago, I started being woken up haphazardly by the phone ringing. The day of the month was random, the day of the week was random, the time of the night was random between 2 and 5 AM, but it sure freaked me, and my wife, out.
Calls to the telco had no effect. They tested (or at least pretended to) the line and said: "Oh no Sir, everyt
Re:Similar, though terrestrial, problems (Score:2)
The automated tests were not designed to make the phone ring, unless it was way too sensitive.
Of course, that might have been the case, if you had a defective phone.
Re:Similar, though terrestrial, problems (Score:2)
For all our apparent risk-taking and bravado, I think that technical people are ultra-conservative: we're just better at evaluating technical risks than most.
So, when someone counters with a "what if" that challenges a theory of a problem that we have, we give it disproportionate attention: anything possible and risky must be disproven before we can proceed.
The right thing to do, of course, is to
but a crash shutdown Spirit for two weeks! (Score:4, Interesting)
The nice thing about software is that JPL was able to upload a patch and get both rovers working properly again. They reconfigured the Galileo mission to bypass the broken high gain antenna and use the hundred times slower low gain antenna with software patches and achieved most of the mission objectives.
Re:2MB Kernel (Score:2)
vmlinux files are 3-4MBytes (2.6) AFAIK. And, as the other poster pointed out, that doesn't include the modules.
Re:2MB Kernel (Score:2)