App Developers Spend Too Much Time Debugging Errors in Production Systems (betanews.com) 167

According to a new study, 43 percent of app developers spend between 10 and 25 percent of their time debugging application errors discovered in production. BetaNews adds: The survey, carried out by ClusterHQ, found that a quarter of respondents report encountering bugs discovered in production one or more times per week. Respondents were also asked to identify the most common causes of bugs. These were: inability to fully recreate production environments in testing (33 percent), interdependence on external systems that makes integration testing difficult (27 percent), and testing against unrealistic data before moving into production (26 percent). When asked to identify the environment in which bugs are most costly to fix, 62 percent selected production as the most expensive stage of app development in which to fix errors, followed by development (18 percent), staging (seven percent), QA (seven percent) and testing (six percent).
  • No surprise (Score:4, Insightful)

    by tomhath ( 637240 ) on Thursday November 03, 2016 @03:07PM (#53208281)

    43 percent of app developers spend between 10 and 25 percent of their time debugging application errors discovered in production

    That seems like an odd metric, but it doesn't surprise me. Production support has always been expensive. Especially if you can't create a full production-like environment with real world data and stupid users to test with.

    • by lgw ( 121541 )

      As a (fairly large) devops team, we probably spend 1/3 of our time in "production support", from bug investigation to various kinds of automation; it never seems to end.

      One thing we don't do though, unless there's no other way, is "debug in production". If anything seems off, you roll back, no question. (And if it's not a recent deployment, you should know that very quickly, too, from logs and metrics.) Figuring out exactly what went wrong can wait; reverting the change before it becomes a customer-visible problem can't.
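      A minimal sketch of that kind of metrics gate, assuming hypothetical error-rate samples pulled from whatever monitoring is in place (the tolerance factor and sample values are made up for illustration):

        # Sketch of a metrics-gated rollback check. The tolerance and the
        # error-rate samples are hypothetical; wire this to your own
        # monitoring source.
        from statistics import mean

        def should_roll_back(pre_deploy_rates, post_deploy_rates,
                             tolerance=2.0):
            """Flag the deploy if the post-deploy error rate exceeds the
            pre-deploy baseline by more than `tolerance` times."""
            baseline = max(mean(pre_deploy_rates), 1e-9)  # avoid zero baseline
            return mean(post_deploy_rates) > baseline * tolerance

        # Errors-per-minute samples before and after a deploy.
        if should_roll_back([0.4, 0.5, 0.6], [1.8, 2.2, 2.0]):
            print("Metrics regressed - revert first, debug later.")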

      • by tomhath ( 637240 )

        When I said "odd metric" I meant it sounded like a Yogi Berra comment: "Half of the game is 90% mental".

        Why not say that on average, development organizations spend 5 to 10% of their time fixing production bugs?

    • It does surprise me. With vendors like Adobe, Oracle, and Microsoft, I'd have thought 100% of their errors are in production (or, as other vendors might call it, "consumer-based alpha-test").
  • by MooseTick ( 895855 ) on Thursday November 03, 2016 @03:09PM (#53208287) Homepage

    This is due to finance cheaping out and not allowing the purchase of an exact "test" system to work on. Also, the rush to production is often more important than checking to be sure it all works.

    That said, it's all a risk/reward thing. Maybe it's often better to screw up production here and there than to spend tons of money and time on testing. It all depends on whether you're building software for a web site or a Mars mission. What is the impact of a failure, and is it recoverable?

    • Even if you can reproduce all of the hardware exactly, you are never going to get the same kinds of results that putting software in the hands of real users will get you.

      I'd say far more important than exact hardware duplication of a production environment would be ease of replication of real production data into QA servers for reproducing issues. That includes being able, at any time in QA, to use the account of any real production user as if you were them... The ability to do that easily saves SO MUCH

      • Re:Never can though (Score:4, Informative)

        by Cro Magnon ( 467622 ) on Thursday November 03, 2016 @03:50PM (#53208591) Homepage Journal

        That brings back memories:

        Me: "It works for me"
        Production: "It gives me this error"
        Me: "Can you show me the data"
        Prod: "It was in Missouri's data for 2014"
        Me: "It still works. Can you show me a screenprint of your data?"
        Prod: "I'm using this dataset"
        Me: "I don't have access to that (expletive unsaid) dataset. Can you show me a (more unsaid stuff) screenprint??"
        Prod: *mumbles something about privacy*
        Me: *thinks about shooting someone*

        • I've run into that as well, and I made the comment that it might be unique to the personal information pertaining to that "person". I suggested obfuscating the personal information, but not the other data, when reproducing. That will usually pinpoint the cause; if the error still can't be reproduced, it is most likely attributable to the specific personal data that was obfuscated.
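          A rough sketch of that obfuscation approach, assuming dict-shaped records (the field names and salt are hypothetical): personal fields get stable pseudonyms so records stay distinguishable, while business data is kept intact so the bug can still reproduce:

            # Deterministically mask PII fields; leave everything else alone.
            import hashlib

            PII_FIELDS = {"name", "email", "ssn", "phone"}

            def obfuscate(record, salt="qa-copy"):
                masked = {}
                for key, value in record.items():
                    if key in PII_FIELDS:
                        digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
                        masked[key] = digest[:12]  # short, stable pseudonym
                    else:
                        masked[key] = value  # keep business data for reproduction
                return masked

            print(obfuscate({"name": "Jane Doe", "state": "MO", "year": 2014}))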
        • So you are one of those that keep asking for the screen print! :)

          I was talking to the first level help desk of the company that handled the credit card transactions for the websites of the government department I was working at. Their servers were down and I was trying to get them to start working on it. The process for a transaction was that we sent their server an XML file with information about the transaction, they returned an XML file with a URL to give to the user, the user completes the transaction

      • I'd say far more important than exact hardware duplication of a production environment, would be ease of replication of real production data into QA servers for replication of issues.

        It's a bugger because of dependencies (it goes pear-shaped when this customer orders this product, which is made from parts supplied by these vendors... and none of them are in your test system), but you can cover most of it by making a copy of prod once or twice a month.

        Unless, as someone mentioned above, you're too cheap

      • Even if you can reproduce all of the hardware exactly, you are never going to get the same kinds of results that putting software in the hands of real users will get you.

        There are different kinds of bugs, which is why you have different kinds of systems and testing environments.

        A dev should be able to have an isolated environment in which to be able to test the various parts. Each part should be able to have a sufficient emulation of external parts to be able to have its own unit and functional testing. From there, several parts should be integrated at a time to do functional and integration testing, eventually building up to the entire system being fully integrated and us

    • by TWX ( 665546 )
    I spent some time working QA on a carrier-level system that was being developed for what was at the time Cingular. The biggest problem was that the investors who propped up the company wanted it to ship absolutely as soon as possible, so the company could go from a money-sink to a money-producer for them. Our investor was some heir to a fortune that was made in chemicals back in the day; he didn't really know anything about the technology of telco-grade communications systems. He was ill-qualified to
    • by zifn4b ( 1040588 ) on Thursday November 03, 2016 @03:52PM (#53208609)

      That's not the most prevalent issue. The main issue is the malpractice of Agile methodologies. What happens when you jam a 2-week task into a 1-week time box? Corners get cut in the code, the unit tests, and the QA test plans, and technical debt accrues, creating unpredictable results when someone changes brittle code in the future. Most companies are not interested in investing in REAL environments and continuous delivery pipelines with:

      • - Adequate infrastructure
      • - Adequate workstation and tools
      • - Adequate product training
      • - Reasonable time to do the work
      • - Reasonably well-defined work
      • - Development best practices: code reviews, unit tests, testing in general (yes, devs, it's also your responsibility to test; you don't just throw your crap over the wall)
      • - Automatic builds, either nightly or on commit, with automatic unit and integration tests using Bamboo/Jenkins/whatever (a minimal sketch follows below), perhaps even usage of source control at all!
      • - Investment in some type of test case database like TestRail or Zephyr so you actually know what your software is expected to do and it can actually evolve over time. This can replace traditional test plans that people put in Confluence that become stale almost immediately and lose value.
      • - Good documentation

      All of this takes a lot of effort, and you don't get it for free running around like a chicken with your head cut off. Ignore it and you reap what you sow, especially in larger-scale software efforts.
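      A minimal sketch of the automatic-build gate from the list above, written as a plain Python runner rather than an actual Bamboo/Jenkins config; the make targets are placeholders for whatever the project actually uses:

        # Build, unit tests, then integration tests, failing fast at the
        # first broken stage. Stage commands are placeholders.
        import subprocess
        import sys

        STAGES = [
            ("build", ["make", "build"]),
            ("unit tests", ["make", "test-unit"]),
            ("integration tests", ["make", "test-integration"]),
        ]

        for name, cmd in STAGES:
            print(f"--- {name} ---")
            if subprocess.run(cmd).returncode != 0:
                print(f"{name} failed; rejecting this commit.")
                sys.exit(1)

        print("All stages green - safe to merge.")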

      • by Altrag ( 195300 )

        Jamming 2 weeks of work into 1 week is going to result in cut corners no matter what methodology you're using (or even what line of business you're in, for that matter.)

        If you switch to a methodology where you're estimating in 6 month blocks and you're off by 100% like that, you're now 6 months off schedule instead of one week off -- that's even worse!

        Not to say agile isn't misimplemented regularly, but if your schedules are off by that much of a margin, you need to start by looking at how you're generating your estimates.

        • by plopez ( 54068 ) on Thursday November 03, 2016 @04:58PM (#53209023) Journal

          I have never seen a methodology survive its first contact with sales.

          • by zifn4b ( 1040588 )

            I have never seen a methodology survive its first contact with sales.

            That's because sales always asks for flying, sparkly unicorns that defecate gold bricks, without considering whether it's actually a reasonable expectation to have. Oh, and they promised it to a client, so if you could make that happen so they can get their commission and gain favor with the CEO, that would be great. k thx bye

        • by zifn4b ( 1040588 )

          If you switch to a methodology where you're estimating in 6 month blocks and you're off by 100% like that, you're now 6 months off schedule instead of one week off -- that's even worse!

          You apparently have never worked at a place like this: https://www.youtube.com/watch?... [youtube.com]

          Trust me that happening once every 6 months is waaaay better than every 2 weeks.

      • Why is it that lately people who obviously have no clue about agile methods are bashing them on /.?

        Actually, every point you bring up is a cornerstone of all the agile methods I'm aware of.

        • Comment removed based on user account deletion
          • If someone is "fragile" it is hardly a problem of the "method" used ... such teams will be producing bad software ... or good software produced badly, regardless of method.

          • by zifn4b ( 1040588 )

            But, I think it shows the industry is just really poor at executing it and end up with Fragile instead.

            Oh that would be a step up in the environment I'm in. It's all cargo cults here. You have it really good if you get at least Fragile. That means someone is actually trying to do it but failing at execution.

      • I can think of real-world examples where this sort of thing happens:

        Video Game industry - Sure, some older games had re-releases that fixed some issues, and some games were crazy buggy (YouTube Sonic 3 glitches). However, the games typically "just worked" in the '80s and '90s. Compare that to today, with multi-GB day-one patches that *should* have been part of the gold disk... had sales/marketing/management not put an improbable deadline on development.

        OS Development - See all the zero-day bugs in Windows
    • by jlowery ( 47102 )

      It all depends if you're building software for a web site or a Mars mission. What is the impact of a failure, and is it recoverable?

      For the Mars mission:
      a) about 186mph
      b) no

      http://www.space.com/34472-exo... [space.com]

    • by ghoul ( 157158 ) on Thursday November 03, 2016 @05:51PM (#53209313)

      Where I used to work - a big telco software firm whose software generates 80% of the phone bills in the US - we had a simple solution to the problem of testing at scale.

      We had two identical setups, one for production and one for staging. After UAT was almost over we would deploy to staging and then continue UAT on the staging setup with real-world data till the day of cutover (using Oracle Active-Passive to keep both in sync for the production data while not copying UAT data over to prod).

      On cutover day we would change the network switch to point to the new setup and run scripts to delete the data created by UAT.

      The nice part was that the old prod setup (a bank of 8 servers with 4 quad-core CPUs each) now became our backup machine. We would switch it to passive and continue to keep it in sync with prod for at least 7 days in case something horrible went wrong with the new setup. Changing back to the earlier prod machine was a network switch flip. The scripts were a little more difficult this time around, especially if the software bug had messed up the data, but it was still easy.

      Once production was stable, the old prod was used as staging for the next release.

      What this meant is that we did UAT on machines with identical config to the prod machines. It solved a lot of issues, and since we also used the machines as the prod backup during cutover, the cost was taken from the operations budget and not the testing budget.

      Our system test and UAT environments were almost, but not quite, as good as prod, and most testing and UAT was done there, but the last batch of UAT on the big iron gave good confidence and made cutover day a lot less stressful than it used to be.
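      The cutover mechanics reduce to a pointer swap between the two identical banks; a toy model (host names invented) of why rollback was just flipping the switch back:

        # Traffic follows a single pointer; the old prod bank stays warm
        # as the rollback target.
        class Cutover:
            def __init__(self, active, standby):
                self.active, self.standby = active, standby

            def flip(self):
                """Swap which bank serves production traffic."""
                self.active, self.standby = self.standby, self.active

        cluster = Cutover(active="prod-bank-a", standby="staging-bank-b")
        cluster.flip()          # cutover day: staging becomes prod
        print(cluster.active)   # staging-bank-b
        cluster.flip()          # something horrible went wrong: instant rollback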

  • by Anonymous Coward

    The in-house ERP system I work on has a great test environment and a huge suite of unit tests, and the corporate environment is pretty well defined, so few (if any) show-stopper bugs ever make it to production and those that do I can reproduce relatively easily.

    On the other hand, I also spend time programming machine automation and the idea of having a completely separate independent machine to test your changes on is impractical. Machines that cost hundreds of thousands of dollars don't have test systems

    • by plopez ( 54068 )

      Why can't you use virtual environments?

      • You can but,

        1) Virtual machines create their own variables
        2) A variable for every different possible configuration of hardware (CPU, GPU, RAM, number and type of storage drives, ports, etc.)
        3) A variable for every OS version
        4) A variable for every OS configuration option (these can number in the millions per version)
        5) A variable for every 3rd-party software installation, not limited to virus scanners, disk management tools, 3rd-party installers, active applications at time of install, etc.

        Just how many different v

        • by plopez ( 54068 )

          There is no such thing as 100% coverage. But virtual environments give you much more flexibility and can improve your coverage if done properly. You can also define templates which allow you to spin up basic combinations of software configurations as baselines and for regression. I wouldn't use them for performance testing, but they have made things easier in many ways, and increased flexibility and coverage.

          I agree there are things you should not use them for, but there are many things they can be used for
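          A small sketch of the template idea, with a hypothetical provision() helper standing in for whatever the virtualization API provides; each baseline combination gets its own spun-up environment:

            # Spin up one VM per baseline combination for regression runs.
            from itertools import product

            OS_VERSIONS = ["ubuntu-16.04", "windows-server-2012"]
            DB_VERSIONS = ["postgres-9.5", "oracle-12c"]
            APP_BUILDS = ["release-candidate"]

            def provision(template_name):
                print(f"spinning up VM from template: {template_name}")

            for os_v, db, build in product(OS_VERSIONS, DB_VERSIONS, APP_BUILDS):
                provision(f"{os_v}+{db}+{build}")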

      • Virtual environments for the expensive machines? Partly because they're not perfect, but largely because the behavior of the machines tends to depend on the real world. They're nice when we can get them.

  • by Anonymous Coward on Thursday November 03, 2016 @03:15PM (#53208347)

    How is "management telling people to put it into production as soon as the basic functionality works" not one of the common causes of bugs? At almost every job I've worked at, QA and Engineering would say "We need this much time to test and fix bugs before launch", and management would say "Too bad! Sales already told someone we're launching tomorrow, so we're going live with whatever we have then!"

    It isn't the lack of a good test environment, or good test data, it's being told by management that you aren't going to have any time to test...

    • That's not always unreasonable, though. If the company has already announced the launch date or signed on customers, then pushing out that date can be costly in terms of money or reputation. At that stage, assess the data from whatever tests you have performed thus far, and tell management whether there may still be bugs that will make the whole thing crash and burn, or whether it's likely that any undiscovered bug will only cause minor issues. In other words: inform them of the risks, and of the uncertainty.
      • by Altrag ( 195300 )

        That's understandable (to a degree) in the situation where there was a set schedule ahead of time and things ran over. Then management has to make a decision whether it's going to be a bigger hit to their reputation to delay vs releasing garbage.

        What we often see though, especially in direct B2B-type software where there's a more intimate relationship between vendor and customer, is that the conversation goes more like this:

        Manager: "We want something that does X"
        Engineer: "OK that will take 6 months"
        Manage

        • by ghoul ( 157158 )

          If your manager is padding your estimate by 2x, either you are a really bad estimator or the non-dev portions of the org are really dysfunctional.

          As for the sales issue, we solved it by making sure the sales bonuses were not paid out until at least 1 year after production. If there was a major screwup, the customer had probably cancelled the contract, and no bonuses were paid if the revenue never actually came to the company.

        • I worked on an application that allowed certain people at large organizations to manage the firm's phones on the switch instead of going through the telephone company. The application was provided by the telco and worked with Centrex-type phones.

          I spent 6 months adding in support for some phone options that the sales team said one customer wanted. After I finally got everything done and tested, the customer was informed and decided not to go with it. The options dealt with group pick up and were used for cr

      • If the company has already announced the launch date or signed on customers, then pushing out that date can be costly in terms of money or reputation.

        Doctor, it hurts when I do *this*.

      • And make damn sure about that last part: never put yourself or your team in a position where the responsibility of an overly hasty launch comes back to you.

        That presumes ethical management that is willing to accept responsibility for their part in any problems that arise. You're 100% correct that our job is to merely implement the solution and advise of potential problems, and theirs is to accept that advice, balance the pros/cons of a particular course of action, and make the decisions regarding what
  • by __aaclcg7560 ( 824291 ) on Thursday November 03, 2016 @03:23PM (#53208407)
    I did a six-month contract as a software tester intern after college, where I came across a crash bug on the test server that I could reproduce 100% of the time. My supervisor could not reproduce the bug and approved the patch for the production server. The production server crashed immediately from the patch. Engineers determined that a major code rewrite was required to fix the underlying problem. The production server was offline for three days and cost the company $250K in lost revenue. My contract wasn't renewed, one-third of the division got laid off after I left, and further budget cuts doomed the project. As for my supervisor, he got promoted into management.
  • Yeeeeeup, exactly this. Goddamn do I ever wish I had the resources available to me to actually do my job properly. The company won't provide resources for unit testing the hundreds of variables for our data entry forms that all interrelate to one another. Think of it as a massive fucking configuration matrix that shits all over itself. I've proposed for years entirely replacing said system with something extremely simple, but am always shot down. And since we don't have the resources to properly unit test the s

    • Given data entry with hundreds of variables that all interrelate to each other, is it possible to do unit tests? If you have ten boolean values that interact with each other, you need over a thousand unit tests (2^10 = 1024) just to cover the combinations. It gets worse from there.
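      For what it's worth, pairwise (all-pairs) testing is the usual way to tame that count: cover every pair of variable values rather than every full combination. A naive greedy sketch:

        # Exhaustive vs pairwise coverage for n interacting booleans.
        from itertools import combinations, product

        def pairwise_suite(n):
            """Greedily pick assignments until every (var_i, var_j, val_i,
            val_j) pair is covered by at least one test case."""
            needed = {(i, j, vi, vj)
                      for i, j in combinations(range(n), 2)
                      for vi, vj in product([0, 1], repeat=2)}
            suite = []
            while needed:
                best, best_cover = None, -1
                for cand in product([0, 1], repeat=n):
                    cover = sum((i, j, cand[i], cand[j]) in needed
                                for i, j in combinations(range(n), 2))
                    if cover > best_cover:
                        best, best_cover = cand, cover
                suite.append(best)
                needed -= {(i, j, best[i], best[j])
                           for i, j in combinations(range(n), 2)}
            return suite

        print(2 ** 10)                  # 1024 exhaustive cases
        print(len(pairwise_suite(10)))  # far fewer than 1024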

  • by Maxo-Texas ( 864189 ) on Thursday November 03, 2016 @04:06PM (#53208687)

    I wrote an awesome testing program that resolves the problem of differences between test and production, but I can't get it to run in a production environment.

  • It's the standard triangle. You can cut from one at the increased detriment of the others. As long as the others are finite resources you always have to cut somewhere. The problem so many developers can't understand is that the 'where' is a business problem, not a theoretical engineering issue.

    If it's more important to remain under budget, or be first to market, yeah, quality might suffer big time, and it's easy to ignore the academic's concept of a perfect engineering development lifecycle with a full Q

    • by ghoul ( 157158 )

      I think it's not time, money, quality.

      The iron triangle is time, money, scope. You can increase or decrease one by changing the other two. But if you try to reduce one without changing the other two, the iron triangle breaks open and the magic smoke inside the triangle - quality - escapes, and once it escapes you can't get it back in even if you close the triangle.

  • The prime directive of anyone associated with building software for end users must be to create bug-free, secure systems that are effortless for people to use.

    This needs to flow throughout an organization - whether you are the architect, designer, marketer, developer, tester, accountant, whatever. Everyone must be on the same page when it comes to this goal. Everyone needs to really understand what that entails in practice.

    I've been both on the building, and receiving end of things when this goes wron

  • That's what happens when your entire development pipeline aims to put a prototype into production.

    Also, "Just validate user input on the front end, it'll be fine once it hits the server" is a recipe for disaster.

  • It's a well-known exponential curve that apparently needs to be re-learned for every "new technology".

    A bug found during requirements analysis has a cost to fix of 1
    A bug found during high level design has a cost to fix of 10
    A bug found during detailed design has a cost to fix of 100
    A bug found during coding/unit test has a cost to fix of 1000
    A bug found during system test has a cost to fix of 10000
    A bug found in production has a cost to fix of 100000
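    In other words, the rule of thumb is relative cost = 10^n, counting stages from requirements analysis at n = 0; the same table as a sketch:

      # The rule of thumb above as a formula: relative cost = 10**stage.
      STAGES = ["requirements analysis", "high level design", "detailed design",
                "coding/unit test", "system test", "production"]
      for i, stage in enumerate(STAGES):
          print(f"{stage:>21}: {10 ** i:>6}x")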

  • by plopez ( 54068 ) on Thursday November 03, 2016 @04:57PM (#53209015) Journal

    We get hung up on developer costs but never on rework and fix costs. There is constant pressure to deliver untested features to make sales but never much accounting for customers who will walk at the first opportunity or sales which get cancelled due to bugs.

    And it has never changed. Waterfall, six sigma, kanban, agile, rapid prototyping, devops, etc. have not made a difference. I have seen no improvement at all over close to 30 years. And people wonder why I drink.

    • The development method, agile or waterfall, has absolutely nothing to do with the bugs you produce, find or ship.
      (*facepalm*)
      If you have been 30 years in the business and can't grasp the difference between an opcode (method) and its operands (data, results), then you have been in the wrong business for 30 years.

      And btw, devops have nothing to do with software development; they provide infrastructure, and are 'required' for every organization, regardless of what process you use. Again you are mixing up 'tools' and

      • by plopez ( 54068 )

        Ok, so what's the point? If they have nothing to do with code, what's the point? I thought the point was to deliver better product and code - or rather, functionality is the product. Methodologies and philosophies try to deliver the result better, but it always seems to go wrong. I spewed out a laundry list of things which I have always seen break down.

        The point I was making is that every time someone tries to improve the process, and therefore hopefully the end result, it gets subverted. I have ideas as to why t

        • What is the point of what? DevOps?
          Someone has to provide the infrastructure. Instead of having mediocre admins or forcing the developers to do that themselves and hence subtracting some of their work time, you have DevOps. And that is a "role" as in a position in a team and not a method, agile or not.

          The point I was making is that every time some one tries to improve process and therefore hopefully the end result, it gets subverted.
          Then stop the developers subverting it, plain and simple.

        • Ok, so what's the point? If they have nothing to do with code, what's the point?

          Just because they have nothing to do with the coding part of the process, that does not mean they have no part in the development process. Because (tada) the development process is more than just coding. If you haven't seen a difference in 30 years, man, you are working with some shitty employers.

    • The problem is sales people having too much power.
  • Respondents were also asked to identify the most common causes of bugs

    Surely the cause of bugs is programmers getting it wrong (or, if you want to go to a higher level, errors in the design or specification). None of the cited reasons cause bugs; they merely prevent their detection.

    As for the environment where bugs are most costly to fix, I would suggest that would be once they reach the consumer and can only be fixed by a product recall? Although once they reach orbit, that can be a pretty expensive place to apply a fix, too.

  • In all my years developing apps, I only had one live bug, and it was basically due to uploading the wrong version to production. Some of my apps are over 50k lines of code! Yet I can't find anyone hiring a software engineer. It's rough to know your stuff and have HR not be able to tell that you know how to code properly.
  • In the old days, if you found a bad bug in production code it often meant you had to stop the assembly line of shrink-wrapped boxes filled with floppy disks or CDs and pull back everything that was unsold in the channel. If your channel was big enough, it could cost $ millions. Today, you just patch your download sites and have the running software on your customer's machine automatically download and update it sometime after hours (or even during work hours). Companies are much less hesitant to ship buggy
    • Zero days cost more than all the floppy disks and CDs combined. Back in the day - most things were not networked. Today that's all there is*. Those flaws hurt the customer and the company, and depending on what we are talking about (e.g. network connected cars and industrial control systems) may cause loss of life and property.

      The problem space isn't as cut and dried as you would make it out to be.

      ( * Note: I know that isn't all there is...but I would argue the amount of stand alone non-networked app

  • Only 25%? You'd be lucky to have a QA env at all. In a small shop here is how it goes: idea, code, does it compile? Yes = production! Then the bugs present themselves in all their splendor and you repeat the process. Move fast and break things.

    • This is why I recommend an independent rating system in the post above. [slashdot.org]

      With a system like that - the consumer/user of software would have an idea about what is the best software to use from the standpoint of quality, user interface, and integration capabilities.

    • That may actually work well for a small shop. We've become a lot more risk-averse as we've grown. It takes a little fun out of things, but I'd rather my checkins were good.

  • by cerberusss ( 660701 ) on Friday November 04, 2016 @01:55AM (#53210941) Journal

    App developer here.

    Something is missing here; namely, we spend more time debugging issues found in production because they actually get reported. Almost every app nowadays has a crash logger that reports all crashes. Libraries like Twitter's Crashlytics are awesome like that. You get all crashes reported to you, including a ring buffer of the last 100 log messages. It's really, really awesome, and I've solved problems in production that would never have been found otherwise.
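    The ring-buffer part is simple enough to sketch (an illustration of the idea only, not the Crashlytics API):

      # Keep only the last N log lines in memory; attach them to any crash.
      from collections import deque
      import traceback

      class CrashLogger:
          def __init__(self, capacity=100):
              self.ring = deque(maxlen=capacity)  # old lines fall off the end

          def log(self, message):
              self.ring.append(message)

          def report(self, exc):
              return {"exception": repr(exc),
                      "stack": traceback.format_exc(),
                      "last_logs": list(self.ring)}

      logger = CrashLogger()
      for i in range(250):
          logger.log(f"event {i}")

      try:
          1 / 0
      except ZeroDivisionError as e:
          crash = logger.report(e)
          print(len(crash["last_logs"]))  # 100 - only the newest lines kept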

    • QA Analyst here. QA uses the same loggers, and then some, during testing and reports the issues. It is just that devs, product owners, and management would rather cram in another badly designed and untested feature a day before release than fix the deficiencies already reported. And we QA folks would report more if we were given the same time that devs get. Nobody questions devs when they say they need three weeks to do X; QA is not even asked for an estimate. Now go and look at the open bugs list that QA
  • It's just impossible to test everything in your test environment. There is NO test suite that will allow you to test everything with a 100% guarantee it will be bug-free in production. Yes, maybe in a simple, very limited app on a very limited OS it would be possible. But with the very extensive configurations (settings/drivers/etc.) of modern OSes, which run on a gazillion differently configured pieces of hardware, it's just impossible.
    Of course the marketing people of those test suites (and a lot of developers who swear

  • Meanwhile, here in the valley, the latest trend in testing is not testing. Eliminate your QA department today!

  • I blame Agile and the mockery that management made out of it. "We are agile now" means that we can put stuff on the sprint backlog at will and declare it done after three weeks. Agile also makes documentation a bad thing, so we stop writing stuff down, including requirements. We are agile now, we can change our minds at will, and everything that is inconvenient (such as fixing bugs found by QA before release) is postponed, reason: we'll hit it early next iteration. Agile enables those who do not want to commit to
    • My experiences with Agile have been much more positive. I think it's a completely valid approach that is really easy to misunderstand and screw up, kinda similar to C++ programming.

  • App Developers Spend Too Much Time Debugging Errors in Production Systems

    One thing I don't see anyone talking about here is the need to reduce the amount of change (SLOCs, FPs, whatever) done between QA cycles. Any serious system is going to have some sort of QA cycle before becoming available to the consumer.

    Take SLOCs for instance. Say, from last release to production to the first QA cycle, you have 1000 SLOCs (or FPs) of change. And experience tells you that, in your organization, it takes 2 weeks of QA to test that much code change. And your first QA cycle is 2 weeks.

    S
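    A minimal sketch of that arithmetic, assuming QA time scales linearly with the amount of changed code:

      # If ~1000 changed SLOCs take two weeks of QA (the numbers above),
      # the QA budget scales linearly with the size of the change set.
      def qa_weeks(changed_sloc, sloc_per_two_weeks=1000):
          return 2 * changed_sloc / sloc_per_two_weeks

      print(qa_weeks(1000))  # 2.0 weeks, as in the example
      print(qa_weeks(3500))  # 7.0 weeks - why you keep change sets small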
