Bug Programming Transportation

Toyota Acceleration and Embedded System Bugs (499 comments)

An anonymous reader writes "David Cummings, a programmer who worked on the Mars Pathfinder project, has written an interesting editorial in the L.A. Times encouraging Toyota to drop claims of software infallibility in their recent acceleration problems. He argues that embedded systems developers must program more defensively, and that companies should stop relying on software for safety. Quoting: 'If Toyota has indeed tested its software as thoroughly as it says without finding any bugs, my response is simple: Keep trying. Find new ways to instrument the software, and come up with more creative tests. The odds are that there are still bugs in the code, which may or may not be related to unintended acceleration. Until these bugs are identified, how can you be certain they are not related to sudden acceleration?'"
This discussion has been archived. No new comments can be posted.

  • by homer_s ( 799572 ) on Saturday March 13, 2010 @01:33PM (#31464868)
    From here [washingtonexaminer.com]:

    In the 24 cases where driver age was reported or readily inferred, the drivers included those of the ages 60, 61, 63, 66, 68, 71, 72, 72, 77, 79, 83, 85, 89—and I’m leaving out the son whose age wasn’t identified, but whose 94-year-old father died as a passenger.

    These “electronic defects” apparently discriminate against the elderly, just as the sudden acceleration of Audis and GM autos did before them. (If computers are going to discriminate against anyone, they should be picking on the young, who are more likely to take up arms against the rise of the machines and future Terminators).

    Some more data here [theatlantic.com]

  • by WrongSizeGlass ( 838941 ) on Saturday March 13, 2010 @01:46PM (#31464960)
    Exactly. Even a minor revision in an FPGA could result in unforeseen consequences. Who knows, maybe a chip manufacturer failed to document a very small change to a product line (or had a typo in the docs). The problem may not be in Toyota's code, just in their cars.
  • by A_Non_Moose ( 413034 ) on Saturday March 13, 2010 @01:53PM (#31465026) Homepage Journal

    I've said time and time again, "Never replace hardware with software" because
    something dedicated to the task will always work better, or be less failure
    prone (more often than not).

    Would Toyota be having these problems with an accelerator cable vs electronic?

    99% sure the answer is "no"...heck the solution is add some grease, make sure
    it isn't pinched/looped too tightly and/or add tension to the pedal side.

    Or, replace the damn cable with a new one...a 20 to 30 minute task.
    (less than 10min on a motorcycle)

    Oh, well, what do I know? I'm just a CS major with real world experience, pay
    no attention to the man behind the keyboard!!!

  • by zappepcs ( 820751 ) on Saturday March 13, 2010 @02:03PM (#31465108) Journal

    There are a couple of things that should be mentioned here. NASA has shown what it takes to make very small, very good code. Sure, they too have failures, but 'nearly' bug-free code is quite expensive. Second, writing code is not quite like trying to create a hand-crafted dashboard: if the dashboard fades, no one dies. Embedded software is quite a different beast from your normal desktop applications. When you add motion control and interaction with the code, the difference between them gets even more complex. Software in vehicles should be two things:

    Open - let lots of folk see what could be wrong
    Audited - audited to meet specific standards of safety and operation. Not quite the self-defeating government regulations, but more of a case by case issue: if the software has control or input to the control mechanism for the engine, braking system, suspension etc. it must meet minimum standard testing requirements. Any action that _could_ arbitrarily apply mechanical action must be tested and controlled beyond all reasonable testing/doubt. Everything should be tested, down to a pet chewing on the control cable harness.

    Consumers are encouraged to think the vehicles they buy are safe and require no special knowledge of engineering or mechanics to operate. As long as they are given to think that, then passenger vehicles should be made to be just this way.

    The problem for Toyota now is multifaceted. One, they have a PR shitstorm to deal with. Two, there is a dollar effect of this problem. Three, it's now on the shoulders of Toyota to get this part right for the rest of the passenger vehicle making industry.

    It's possible that they might walk away from this fire with only minor long term burns and the reputation for building the safest vehicles. BUT, reading the article of this post and paying attention while doing so is necessary... IMO

  • by Anonymous Coward on Saturday March 13, 2010 @02:04PM (#31465116)

    So I work on a robotics team on all of the firmware which interfaces the main computer with the actuators and sensors it needs to move. It drove me crazy because I had a bug that I couldn't replicate outside the robot.

    If you were driving the robot around under normal operating circumstances, suddenly, the motor driver board would stop working. I had a simple state machine on the motor board which would update the motors one by one, but I had a bug which I hadn't noticed because it seemed innocuous at the time:

    I updated the state machine immediately *after* I initiated an action which would eventually start interrupt code. Now, what ended up happening was that messages about new speeds coming from the main board were also interrupts, and so there was this very odd cascading failure, where the motor board would initiate the action, the message from the main board would come in, we would jump into the message receive interrupt, and then immediately we would jump into the motor control routine. Since the state machine hadn't been updated yet, we would then execute the wrong part of the state machine, which would cause an error. The motor control routine would initiate a "stop" command and then go into an error state. Because this was a time-critical operation, I didn't care if we had failed to send a single packet, so I ignored the error and jumped to the next motor. Unfortunately, the "stop" command would finish right as I tried to reset the state machine, which would put it in a different error state. And then when I tried to communicate with the next motor, I would do the same thing, over, and over, and over.

    The whole system would lock up, but by itself, all the code was completely reasonable. If you took that segment of code out of the system to verify that it worked, it would work 100% of the time. Put it into the system, and it generally took something like 10 to 20 minutes to suddenly die.
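
    A minimal sketch of the ordering hazard described above, written as a Python simulation rather than real firmware. The names (`MotorBoard`, `start_transfer`, `on_transfer_done`, `run`) are hypothetical, and the "interrupt" is modeled as a callback that may fire before the main code reaches its next line. It only illustrates why the state should be recorded before the action is started; it is not the poster's code.

```python
class MotorBoard:
    """Toy model: a transfer to a motor driver completes via an 'interrupt'."""

    def __init__(self, update_state_first):
        self.update_state_first = update_state_first
        self.state = "IDLE"
        self.errors = 0

    def on_transfer_done(self):
        # Completion handler: only valid while a transfer is marked in flight.
        if self.state != "WAITING":
            self.errors += 1
            self.state = "ERROR"
        else:
            self.state = "IDLE"

    def _kick_hardware(self, interrupt_fires_immediately):
        # If other interrupts delay the main code, the completion handler
        # can effectively run before the caller executes its next statement.
        if interrupt_fires_immediately:
            self.on_transfer_done()

    def start_transfer(self, interrupt_fires_immediately):
        if self.update_state_first:
            self.state = "WAITING"  # record intent before starting the action
            self._kick_hardware(interrupt_fires_immediately)
        else:
            self._kick_hardware(interrupt_fires_immediately)
            if self.state != "ERROR":
                self.state = "WAITING"  # too late if the handler already ran


def run(update_state_first, interrupt_fires_immediately):
    """Run one transfer and return how many errors the handler recorded."""
    board = MotorBoard(update_state_first)
    board.start_transfer(interrupt_fires_immediately)
    if not interrupt_fires_immediately:
        board.on_transfer_done()  # normal case: completion arrives later
    return board.errors
```

    With the state updated after the action, `run(False, False)` records no error, which is why the code looks fine in isolation, while `run(False, True)` records one. Updating the state first, `run(True, True)`, records none.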

  • by roman_mir ( 125474 ) on Saturday March 13, 2010 @02:23PM (#31465284) Homepage Journal

    A year ago I was watching one of the Discovery programs, I think, and they had a couple of guys who supposedly implemented a piece of software that would allow an airplane to fly and land safely if for some reason, while in the air, the tail would break off or the rudder would just stop working. They relied on a fly-by-wire airplane of course and controlled the yaw with all other surfaces by applying very slight changes to the motion. They were saying a human could do this if extremely lucky, but software was able to do it almost always.

    Just something to think about.

  • by bitingduck ( 810730 ) on Saturday March 13, 2010 @02:28PM (#31465316) Homepage

    I used to have a car where the engine would suddenly turn off for no reason while driving, often at exciting moments like getting onto the freeway. It was pretty easy to put it into neutral (it was an automatic), turn the key to "acc" and try to restart the engine (usually with success) without accidentally locking the steering wheel.

    It went on for some time until I convinced the repair guys to clean all the electrical connections from the computer to the fuel pump. The car had lived most of its life in cold places with salty roads, and then the problem appeared in California where mechanics don't think of the effects of salt water. Once the connections were clean it behaved fine.

  • by Beryllium Sphere(tm) ( 193358 ) on Saturday March 13, 2010 @02:38PM (#31465382) Journal

    I worked on an embedded flight system there, and deeply respected people like your dad.

    Boeing works under the eye of a certification authority who has to approve the safety of a design including, at least in the system I worked on, human factors. If there's anything comparable for cars, I haven't heard of it.

    Boeing would not have made a pilot have to guess at how to turn an engine off (people with older cars, it's no longer a matter of turning a key).

    Inputs were checked for consistency and validity. The specs would have anticipated what to do if the accelerator and brake were both full on at the same time.

    There was a culture of worst-case planning and redundancy.

    Also, if Boeing built a car, it would have a flight data recorder which investigators could examine and say for example "Looks like both(*) potentiometers on the accelerator went hard over at the same time, so we go look on the branches of the fault tree where there's a common-mode failure in the potentiometers or the pedal is down due to mechanical or pilot error".

    (*) If I remember correctly from my obsessive pre-purchase research on Priuses, there are two separate sensors for accelerator position.
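
    A minimal sketch of the kind of input consistency and validity checking described above, assuming two redundant pedal sensors that each report a position between 0.0 and 1.0. The function name `throttle_command`, the 0.05 disagreement threshold, and the brake-override rule are all assumptions for illustration, not Toyota's or Boeing's actual logic.

```python
def throttle_command(sensor_a, sensor_b, brake_pressed, max_disagreement=0.05):
    """Return a throttle request in [0.0, 1.0] from two redundant pedal sensors.

    Any implausible input falls back to 0.0 (no throttle).
    """
    # Validity: each reading must lie in the physically possible range.
    # A NaN reading also fails this comparison and is rejected.
    for reading in (sensor_a, sensor_b):
        if not 0.0 <= reading <= 1.0:
            return 0.0

    # Consistency: the two sensors must roughly agree with each other.
    if abs(sensor_a - sensor_b) > max_disagreement:
        return 0.0

    # Conflict: brake and accelerator together resolve in favor of the brake.
    if brake_pressed:
        return 0.0

    # Use the more conservative of the two agreeing readings.
    return min(sensor_a, sensor_b)
```

    Note that a check like this does not catch the common-mode case mentioned above, where both sensors go hard over together and still agree; only the brake override helps there.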

  • by dbc ( 135354 ) on Saturday March 13, 2010 @03:00PM (#31465564)

    ... why does everyone assume it is a software bug? I agree that it very well could be an undiscovered software bug. But there are so many more sources of erroneous behavior in an embedded system that *even* *if* the software were flawless (ummm... just go with me a minute... :) an automotive environment can cause all manner of strange glitches. I work with robots, lots of DC motors causing commutation noise on the power supply, long (several inch) distances between units that must talk to each other and therefore may have a different opinion as to ground reference voltage... many things can get wacky. Even flawless code needs a watchdog timer to get you out of weird states that power glitches put you into. Power supply spikes can cause the program counter to jump to very odd places, with odd, corrupted stuff in RAM. Ground level shifting can cause communication glitches. CAN bus is *extremely* robust, so bad data should not get through... but what does get through? Does the system as a whole get into a weird state if packets drop?
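
    A minimal sketch of the watchdog idea, simulated in Python. On real hardware the countdown runs in a dedicated timer peripheral and the reset is a hardware reset; here the `Watchdog` class, the tick counts, and the `simulate` helper are all hypothetical stand-ins.

```python
class Watchdog:
    """Counts down every tick; healthy code must kick it before it expires."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.resets = 0

    def kick(self):
        # Called by the main loop to prove it is still making progress.
        self.remaining = self.timeout_ticks

    def tick(self):
        # Called once per time step; returns True when it forces a reset.
        self.remaining -= 1
        if self.remaining <= 0:
            self.resets += 1
            self.remaining = self.timeout_ticks
            return True
        return False


def simulate(total_ticks, hang_at=None, timeout_ticks=3):
    """Run a main loop that hangs at tick `hang_at`; return the reset count."""
    dog = Watchdog(timeout_ticks)
    hung = False
    for t in range(total_ticks):
        if t == hang_at:
            hung = True  # e.g. a power glitch sends the code into the weeds
        if not hung:
            dog.kick()
        if dog.tick():
            hung = False  # the reset restarts the main loop from a known state
    return dog.resets
```

    A run with no hang never resets, and a single hang is recovered with exactly one reset: `simulate(10)` gives 0 and `simulate(10, hang_at=4)` gives 1.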

  • by cynyr ( 703126 ) on Saturday March 13, 2010 @03:05PM (#31465602)

    I'm still failing to see how the cars got locked in gear. Every car I have driven has allowed the driver to shift into neutral regardless of everything else. This is true both in automatics and definitely in my manual transmission cars (does anyone make a drive-by-wire clutch, outside of performance/race cars?). I fail to see why this is a huge issue that needs to be solved in the next 10 minutes and be 100%. How is a sticky pedal (software or otherwise) any different from the throttle body getting stuck in the wide-open position? What would these people do in that case? The handling after the problem occurs seems to be 100% driver error. TBH the first thing that would happen if my car started doing that would be for me to press on the clutch, regardless of whether it has one; that's a very hard habit to break after driving manuals (note that this behavior in an auto usually results in the brake being depressed all the way to the floor). After that I'd put it in neutral and then use the brake to slow down and pull over. Somewhere in there I'd hopefully get the 4-ways on, but that's a long, long way down the list of things that would happen.

    The "PR shitstorm" is way, way over-hyped. It would be simple for the news to state "Toyota has confirmed an issue affecting the engine speed controls, and has issued a recall. If this happens to you while driving, Toyota advises drivers to shift the car into neutral, engage the 4-ways and pull over in a safe location. If your car has a push-button start, be aware that you will need to hold it down for up to 5 seconds to shut down the engine." The fact that some people have died as a result of poor driving ability is no different than every fall here when it snows and some dumb person forgets that snow is slippery: a terrible thing to have happen, but usually 100% their fault.

    I think the driving tests in this country need to be much more rigorous. They should be done in an unfamiliar car, set up for simulating things like sudden loss of engine power, loss of 90% of the brakes, etc. There should then also be a road test (with other real cars, if the closed course goes well) to see how the driver can handle the conditions of real driving. Places with snow/ice need to have some driving on a slippery surface (rear wheels that can be allowed to rotate would work). This sort of testing should be repeated every 3-5 years, to ensure that drivers maintain a certain level of driving skill. There also needs to be a raw reflexes/object tracking test with a fairly high level of skill needed; anything below a certain level means you fail your driving exam. Fines for driving without a license should be steep, and the test facilities need to be open at least 2 shifts of the day, if not 2.5 shifts.

  • by X0563511 ( 793323 ) on Saturday March 13, 2010 @05:15PM (#31466732) Homepage Journal

    I have an example.

    In a simulator (yes, a simulator) I was flying a VTOL type aircraft. Pulled a turn at too great a speed and broke off a few control surfaces. Maddening spin, completely unrecoverable (at least for me).

    Tapped the button to enable "artificial stabilization" - which in this craft, enabled "puffers" charged with compressed air (driven by the engines, which still worked) - the computer control algorithms managed to use the remaining control surfaces and these puffers to level the craft and reduce the yaw spin to about 5 degrees/second.

    Because of the VTOL nature of this craft, I was able to land it with a very survivable sink rate (gear even survived the landing).

    My point? Good luck doing that manually. This simulation anecdote should help back-up your thoughts :)

    (simulator: X-Plane 9)
    (craft: Verticopter [verticopter.com])
    (note that a scale model actually flies, it's not just a pipe dream)

  • by RobinH ( 124750 ) on Saturday March 13, 2010 @05:30PM (#31466836) Homepage

    I do industrial automation for a living, and machine guarding/safety is a major component of the job. In the last few years, software-based safety products have appeared that are provably just as safe as hardware-only safety products. The key is that it's not just about rigorous testing, it's about correct design. If you want category 4 protection, you need to be sure that:

    1. No single component failure can leave the system in an unsafe state
    2. The component failure is detected
    3. The system cannot be restarted without correcting the failure

    Software becomes another component. Therefore you need to have redundancy in your software. Government regulators that certify these safety systems as compliant want to see you prove that a single component (i.e. unit of software) can't malfunction and leave the system in an unsafe state. What a lot of companies do is they have two independent processors each monitoring the inputs to the system in parallel, and each generating the required outputs. The processors are typically sourced from different companies, and the circuit boards are designed by different teams. The software running on each processor is written by a different team. If both processors agree on the outputs, the system drives those outputs, and if not, all power is dropped to everything and the system can't be restarted (may need to be replaced, etc.).
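
    A minimal sketch of the two-channel arrangement described above, in Python. The two channel functions stand in for the independently written software on the two processors, and the names (`DualChannelGuard`, `channel_a`, `channel_b`, `stuck_on_channel`) and the gate/e-stop inputs are hypothetical. On disagreement the guard de-energizes the output and latches, so it cannot be restarted without replacing the guard object, which here plays the role of correcting the failure.

```python
def channel_a(inputs):
    # Team A's formulation: motion allowed only with gate closed and no e-stop.
    return inputs["gate_closed"] and not inputs["estop"]


def channel_b(inputs):
    # Team B's independent formulation of the same requirement.
    if inputs["estop"]:
        return False
    return bool(inputs["gate_closed"])


def stuck_on_channel(inputs):
    # Simulated component failure: the output is stuck in the unsafe state.
    return True


class DualChannelGuard:
    """Drives the output only while both channels agree; latches off otherwise."""

    def __init__(self, first, second):
        self.first = first
        self.second = second
        self.tripped = False

    def output(self, inputs):
        if self.tripped:
            return False  # latched: stays de-energized until repaired
        a = self.first(inputs)
        b = self.second(inputs)
        if a != b:
            self.tripped = True  # failure detected: drop power and latch
            return False
        return a
```

    With two healthy channels the guard passes the agreed output through. Pairing `channel_a` with `stuck_on_channel` trips the guard the first time the gate is open, and it stays off even after the inputs return to a state where both channels would agree.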

    Those of us in the industry were skeptical of software based safety at first, but given the above facts and a decent amount of regulatory oversight, I'm satisfied that it will live up to the design criteria. That doesn't mean an error can't happen, but it makes the probability low enough that we can live with it.

    The latest thing is safety systems running their I/O across networks like DeviceNet and even Ethernet/IP (the IP stands for Industrial Protocol, not Internet Protocol). Again, I was at first skeptical, but they layer a protocol on top of the network, using timestamps and redundant processors on both ends with reasonable failure modes, so that the system is provably safe, within reasonable limits.

    So you can make safe embedded systems, but without being able to inspect the design and see that it lives up to these guidelines, Toyota can't ever *prove* that the system is safe.

  • by Stormy Dragon ( 800799 ) on Saturday March 13, 2010 @05:46PM (#31466956)

    The Times also helpfully provides a list of all the people who have died in "sudden acceleration" accidents involving Toyotas:

    Toyotas, deaths and sudden acceleration [latimes.com]

    If you look through the list at the ages mentioned, one begins to notice a rather odd pattern: 18, 21, 32, 34, 44, 45, 47, 56, 57, 58, 60, 61, 63, 66, 68, 72, 72, 77, 79, 83, 85, 89

    This is a most peculiar bug indeed in that it seems to occur primarily when the driver is elderly. Or perhaps, as with previous "sudden acceleration" scares, this will ultimately turn out to be the result of people slamming on the gas when they meant to slam on the brake and then trying to blame the car for their error.
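
    A quick tally of the ages listed above, as a Python sketch. The cutoff of 60 for "elderly" is an assumption made only for this count, and the variable names are hypothetical.

```python
from statistics import median

# Ages exactly as listed in the comment above.
ages = [18, 21, 32, 34, 44, 45, 47, 56, 57, 58, 60,
        61, 63, 66, 68, 72, 72, 77, 79, 83, 85, 89]

# Assumed cutoff for illustration: count drivers aged 60 or older.
sixty_plus = [age for age in ages if age >= 60]

total = len(ages)
older = len(sixty_plus)
middle = median(ages)

print(total, older, middle)
```

    This prints `22 12 60.5`: 12 of the 22 listed ages are 60 or over, and the median is 60.5. Whether that is unusual depends on the age distribution of Toyota drivers in general, which the list does not provide.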

  • by Anonymous Coward on Saturday March 13, 2010 @06:28PM (#31467264)

    I have done a similar type of power sequencer design, monitoring ~15 rails. I had worked on the idea of using a microcontroller a year before that, but I realized the problem is how to verify that the design works correctly.

    When I was finally tasked with the design, I did not use a microcontroller. It uses a special chip with PAL logic and programmable analog comparators. I even had to fake a state machine using time delays. It was very primitive, but it worked exactly as intended. It even saved a few boards during the initial prototype stages.

    Oh yes, it also works beautifully at sequencing down all the power rails when you pull the card out or the 2 main power sources fail. I can do that in a few microseconds, but your typical software type cannot guarantee the interrupt latency to do that.

  • by John Hasler ( 414242 ) on Saturday March 13, 2010 @07:18PM (#31467694) Homepage

    > Would Toyota be having these problems with an accelerator cable vs
    > electronic?

    GM once had a very similar problem with a 70s car with a cable. An engine mount failure would allow the engine to rotate under acceleration in such a way as to yank the cable to full throttle and then jam it, causing the car to run away. The resulting collision would knock the cable free and as collisions often break engine mounts, the evidence disappeared.

    Computerized systems are usually more reliable than mechanical ones, but they must be competently engineered.

  • Re:Infallible fail. (Score:3, Interesting)

    by jc42 ( 318812 ) on Saturday March 13, 2010 @08:18PM (#31468162) Homepage Journal

    i'd feel much better with drivers who know they should pop the car into NEUTRAL if it starts accelerating out of control for any reason, ...

    Except we have testimony from any number of the Toyota acceleration victims that they had put the transmission into the "N" position, but the car just ignored it and kept accelerating. They also claimed that they knew how to use the brake, but the car also ignored that.

    As a software guy, I'm quite familiar with ways that software will do things like this, and I find the testimony quite credible. But we might consider that the victims might be lying to us. Or that the auto company might be lying. Or both.

    Myself, I'd believe them all better if there were "black box" recording devices with logs of the incidents. Rumors have it that some auto manufacturers are considering such gadgetry. But of course it would add to the price of a car and make it less competitive in a price-conscious market. So we might not see such recording devices soon, or if we do, they'll be very limited and won't be able to answer the sorts of questions that people are asking here.

    In any case, it's a bit bemusing to see people automatically attributing failures of a "drive-by-wire" vehicle to 100% driver error. The cars we're talking about are controlled by computers, not by people; the driver is just there to give advice to the computers. How often have you pressed a button on your computer, and seen nothing happen in response? We're hearing the testimony of people saying that this happened with their computer-controlled car. Anyone with any experience at all with commercially-built computer systems would believe the "user", not the manufacturer, because we have far too much experience with commercially-built computers to believe otherwise.

    OTOH, the CS guys will tell us that there are a lot of really stupid computer users out there. And most of them are also drivers ...

  • by dokebi ( 624663 ) on Saturday March 13, 2010 @10:48PM (#31469184)

    You do realize that the Prius's gear lever is just a joystick, right? There is nothing mechanically connected, which means that if the computer is confused, then _there is nothing the driver can do_, except stomp on the brakes.

    Actually there is. The car turns itself off if you hold the power button for 3 seconds. But in a panic situation, a person would most likely press the button repeatedly, instead of holding it steady for 3 long seconds. Other manufacturers turn the car off on rapid button presses within a short time instead, which seems better.
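
    A minimal sketch of a power button that accepts either gesture, in Python. The 3-second hold comes from the comment above; the function name `should_shut_down`, the "3 presses within 2 seconds" rule, and the event format are assumptions for illustration, not any manufacturer's actual behavior.

```python
def should_shut_down(events, hold_seconds=3.0, presses_needed=3,
                     window_seconds=2.0):
    """Decide whether a sequence of button presses means 'turn the car off'.

    events: list of (press_time, release_time) tuples in seconds, in order.
    """
    press_times = []
    for press, release in events:
        # Gesture 1: one long, steady hold.
        if release - press >= hold_seconds:
            return True

        # Gesture 2: several rapid presses inside a short window,
        # which is what a panicking driver is more likely to do.
        press_times.append(press)
        recent = [t for t in press_times if press - t <= window_seconds]
        if len(recent) >= presses_needed:
            return True
    return False
```

    A 3.2-second hold and three quick jabs half a second apart both shut the car down, while a single short press or two presses five seconds apart do not.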

  • by FleaPlus ( 6935 ) on Sunday March 14, 2010 @02:51AM (#31470368) Journal

    (The last case on the news - a driver called 911 on his cell phone because his car was accelerating out of control. When prompted by the operator if he had tried putting the car in Neutral, he said no and even refused to do so when ordered to do it by the operator.)

    It's starting to look increasingly likely that this latest case was a hoax:

    http://en.wikipedia.org/wiki/Toyota_Prius#Brake_fix_and_acceleration [wikipedia.org]

    On March 8, 2010, a 2008 Prius allegedly uncontrollably accelerated to 94 miles per hour on a California Highway (US), and the Prius had to be stopped with the verbal assistance of the California Highway Patrol as news cameras watched [86]. Subsequent to the event, media investigations uncovered suspicious information about the alleged runaway Prius driver, 61-year-old James Sikes, including false police reports, suspect insurance claims, theft and fraud allegations, television aspirations, and bankruptcy.[87][88] Sikes was found to be US$19,000 behind in his Prius car payments and had US$700,000 in accumulated debt.[87] Sikes stated he wanted a new car as compensation for the incident.[87][89] Analyses by Edmunds.com and Forbes found Sikes' acceleration claims and fears of shifting to neutral implausible, with Edmunds concluding that "in other words, this is BS",[90] and Forbes comparing it to the balloon boy hoax.[88]

  • by metaforest ( 685350 ) on Sunday March 14, 2010 @04:06AM (#31470608)

    could be any number of causes.
    Here's an example from my own experience:

    I once had some embedded CODE that failed only when the system was off, and exposed to sub-freezing temps for several days.
    If it was off under any other conditions it would start up and work correctly.

    Turns out it was an uninitialized global variable. The compiler was not warning me about it, though it would for local variables.

    Now one might expect that this would fail randomly on power-up, since the contents of SRAM are undefined... most programmers will assume it's random. It's not really random. There are often only a few values that an SRAM cell will typically take on at room temperature. Additionally, as others have noted, SRAM can have a slight "memory effect"; it will also sip holding current from decoupling capacitors long after other devices around it are well below their minimum forward Vth voltage. So in every case during in-house testing, the functions that used this variable saw a positive value on power-up, which caused their first calculation to be wrong, but this had no material impact on the output since it wasn't ever wrong enough to be noticed or to cause the system to mis-operate.

    Then the system was deployed. The first few weeks the system was fine. It was late summer at the customer's site. The customer feedback was that it was working perfectly. We deployed another few units. Also working fine. Then summer ended. The first cold snap of fall took place over a weekend. On Monday morning all of the systems were started by their operators and crashed. Then restarted normally.

    Lots of head scratching. No more failures that whole week. But something was clearly up. We couldn't reproduce the problem in the lab... We didn't know about the cold snap either. A week went by. No one spotted the bum variable. The following Monday no failures. No failures for a few more weeks. Still no one spotted the bum variable, code and hardware passed review.

    Then the site got their second sub-freezing cold snap on a weekend... This one we knew about beforehand... our on-site rep mentioned during a conference call that it might snow over the weekend. Internally our team was still looking for answers, but without more failures we just didn't know what to look for. We'd left systems off for weeks at a time and they never failed to start.
    The hardware guys froze them, and overheated them. Kicked them around the lab.... none of them showed the failure seen by the customer.

    The rep called on Monday and said all of the units failed to start the first time. The rest of the week they were fine.

    The hardware team put three units in the freezer and left them there for three days to replicate the weekend conditions. All three failed to boot. Hardware probed the frozen boards... code checksum was fine.... everything looked fine. The code was crashing though, and we got a call chain out of the device.

    So I went in and looked again.... this time I spotted the uninitialized variable.

    Turns out it could only crash when the variable had a negative value in it. If it was really cold AND the system was left off long enough for all the capacitors to drain, then on power-up the SRAM in this particular device would come ready with a majority of its bits set high. If it was above freezing... the majority of bits were low. In overnight freezes there was enough capacitance on the board to keep the RAM holding.

    We had the on-site rep re-flash the systems with the new code.... bug killed.

    This was not a complex system and it was not required to meet Safety-of-Life standards. It had about 8K of code and 2K of active data, and three active IC packages.
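
    A minimal sketch of why the power-up bit pattern mattered, in Python. The post gives no details about the variable's width or the actual bit patterns, so the 16-bit signed interpretation, the example bytes, and the names (`as_signed_16`, `startup_ok`) are all assumptions for illustration.

```python
def as_signed_16(raw_bytes):
    """Interpret two bytes of RAM as a little-endian signed 16-bit integer."""
    return int.from_bytes(raw_bytes, byteorder="little", signed=True)


# Hypothetical power-up contents of the variable's RAM location.
warm_power_up = bytes([0x12, 0x00])  # mostly low bits: small positive value
cold_power_up = bytes([0xEF, 0xFF])  # mostly high bits: sign bit is set


def startup_ok(ram_word, initialize):
    """Return True if the first calculation sees a non-negative value.

    With initialize=False the code uses whatever the RAM happened to hold,
    which is the bug; with initialize=True it sets the variable explicitly.
    """
    offset = 0 if initialize else as_signed_16(ram_word)
    return offset >= 0
```

    Under these assumptions the warm pattern reads as 18 and the cold pattern as -17, so `startup_ok(cold_power_up, initialize=False)` is the only failing case, and explicit initialization makes the power-up contents irrelevant.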
