实打实的软件错误修复
Software quality, or the lack thereof, is something everybody loves to gripe about. Now that I have my own company I finally decided to do something about it. Over the last two weeks we stopped everything atFog Creek to ship a new incremental version of FogBUGZ with the goal of eliminating all known bugs (there were about 30).
软件质量,通常是最缺乏的,通常是每个人都喜欢诟病的。 现在我终于成立了自己的公司,我最终决定为软件质量做点儿什么了。 为了清除所有已知的软件错误(大约30个),上两周我们停止了FogCreek公司的所有事情来发布FogBUGZ的一款增量版本。
As a software developer, fixing bugs is a good thing. Right? Isn't it always a good thing?
作为软件开发人员,修复软件错误通常是件好事。对吧?难道不一定是一件好事?
不一定!
Fixing bugs is only important when the value of having the bug fixed exceeds the cost of the fixing it.
是的。
修复软件错误这件事情只有当修复软件错误产生的价值超过了修复软件所需的代价时才是最重要的。
These things are hard, but not impossible, to measure. Let me give you an example. Suppose you operate a peanut-butter-and-jelly sandwich factory. Your factory produces 100,000 sandwiches a day. Recently, due to the introduction of some new flavors (garlic peanut butter with spicy Habanero jam), demand for your product has gone through the roof. The factory is operating full-out at 100,000 sandwiches, but the demand is probably closer to 200,000. You just can't make any more. And each sandwich earns you a profit of 15 cents. So you're losing $15,000 a day in potential earnings because you don't have enough capacity.
Building a new factory would cost way too much. You don't have the capital, and you're afraid that spicy/garlicky sandwiches are just a fad which will pass, anyway. But you're still losing that $15,000 a day.
这些事情很难讲透,但也绝非不可能。举个例子:假设你运营一家“花生-黄油-果冻 ”三明治工厂, 工厂每天生产100,000个三明治。 最近,由于新口味(大蒜花生黄油蘸辣哈瓦那酱)的引入,你的产品需求订单已突破天际。 工厂以全部产能,每天100,000件运作,但是需求大概是200,000. 你也没法再做更多了。每个三明治的利润是15美分。所以每天因为产能不足而导致的潜在盈利损失大约是15,000$. 造一座新工厂花费太多。你没有资本,而且你也害怕那种辣/大蒜口味的三明治不过是一时流行,它们最终要过气,不管怎样,你每天都会损失15,000$ 。
It's a good thing you hired Jason. Jason is a fourteen year old programmer who hacked into the computers that run the factory, and believes that he has come up with a way to speed up the assembly line by a factor of 2. Something about overclocking that he heard on slashdot. And it seemed to work in a test run.
幸运的是你雇了Jason。Jason是个14岁的程序员,他黑进了运营工厂的电脑系统,然后他觉得已经知道到了能够两倍速流水线的方法,这是他从slashdot上听来的超频之类的东西。而且在测试的时候确实有效。
There's only one thing stopping you from rolling it out. There's a teeny tiny wee little bug that causes a sandwich to be mushed once an hour or so. Jason wants to fix the wee bug. He thinks he can fix it in three days. Do you let him fix it, or do you roll out the software in its bug-addled state?
只有一件事情阻止你把这种方法付诸实施。 这种方法有个很小很小的纰漏,导致每一个小时会压烂一个三明治。 Jason想要修复这个错误。他觉得能在三天内修复好。你是让他修复那个错误呢?还是以这种瑕疵状态实施这种方法?
Rolling out the software three days later will cost you $45,000 in lost profits. And it will save you, um, the cost of raw materials for 72 sandwiches. (In either case Jason will get the bug fixed three days later). Well, I don't know how much sandwiches cost on your planet, but here on Earth, they're a lot less than $625.
晚三天实施这种方法会导致你损失45,000$利润。 而同时这也会节约你,恩,72个三明治的原材料费用。(两种情况Jason都可以在3天内修复错误)。呵呵,不知道你们星球上三明治卖多少钱一个,不过在我们这儿地球上,总共花费远少于625$。
Where was I. Oh yeah. Sometimes it is not worth fixing a bug. Here's another bug that's not worth fixing: if you have a bug that totally crashes your program when you open gigantic files, but it only happens to your single user who has OS/2 and who, for all you know, doesn't even use large files. Well, don't fix it. Worse things have happened at sea. Similarly I've generally given up caring about people with 16 color screens or people running off-the-shelf Windows 95 with no upgrades in 7 years. People like that don't spend much money on packaged software products. Trust me.
刚刚说到哪儿了?哦,对了,有时候不值得修复软件错误。这儿有另一个例子关于不值得修复的软件错误:如果你有一个错误当且仅当你的程序打开巨大文件的时候导致程序崩溃,而且它仅仅发生在你唯一的使用OS/2的用户身上,而大部分用户甚至都不会使用巨大文件,那就别修正这个错误了。还有一堆更糟糕的事情呢。 类似的,我基本上放弃了对使用16色显示屏或者是运行着Windows95七年没更新的用户的支持。相信我,那样的人不会花很多钱购买软件产品的。
But mostly, it's worth fixing bugs. Even if they are "harmless" bugs, they may reduce the reputation of your company and your product, which, in the long run, will have a significant impact on your earnings. It's hard to overcome the reputation of having a buggy product. When you do want to do that .01 release, here are some ideas for finding and fixing the right bugs: the ones that it is economically worth fixing.
除此之外大多数情况下,还是值得去修复软件错误的。哪怕是“无害”的软件错误,它们会对你的公司以及公司的产品的信誉带来影响,而这在长远的未来将对你公司的营收产生至关重要的影响,很难克服公司产品到处是错误的负面形象带来的影响。当你确实决定要发布一个.01版本的时候,下面是一些关于寻找和修复恰当错误的建议:要寻找那些经济上值得去修复的软件错误来修复。
Step One: Make Sure You Find Out About The Bugs.
第一步:确定你找到了那些错误。
- In the case of FogBUGZ, we have two ways of doing that. First, we trap all bugs on our free demo server, capture as much information as we can, and email the whole thing to the development team. That found an awful lot of bugs, which was very cool. For example, we discovered a bunch of people who didn't enter dates where they were supposed to in the "Fix For" screen. We didn't even have an error message in that case, we just "crashed" (which, in a web app, just means you got an ugly IIS error instead of what you expected). Oops.
- 在FogBUGZ的例子中,有两种方法来做这件事情。首先,我们会在演示服务器上捕获所有的软件错误,获得尽可能多的信息,然后把所有这些信息发电子邮件给开发团队。这个过程发现了一堆软件错误,超级酷。例如,我们发现有一堆人在“修复”页面应该要输入日期的地方没有输入日期。我们甚至没有考虑到这种错误情况,只好让程序崩溃(对于网站应用意味着一个丑陋的IIS出错页面,不是你想像的结果)。哎哟呵。
- When I worked at Juno, we had an even cooler system in place to collect bugs "from the field" automatically. We installed a handler using TOOLHELP.DLL so that every time Juno crashed, it stayed alive just long enough to dump the stack into a log file before going to its grave. The next time the program connected to the Internet to send mail, it uploaded the log file. During betas, we gathered these log files, collated all the crashes, and entered them into the bug tracking database. This found literally hundreds of crashing bugs. When you have a million users, it is amazing what will crash, often because of severe low memory conditions or severely crappy computers (can you spell Packard Bell?) You could have code like this:
- 当我在Juno工作的时候,有个更“酷”的自建系统自动的从输入表单域搜集错误。我们给Juno用TOOLHELP.DLL安装了一个句柄,这样每次Juno崩溃的时候,它都有足够的时间在它消失之前把调用栈信息导出到一个日志文件。下一次程序连接到因特网的时候,它上传日志文件,发送邮件。在Beta测试的时候,我们搜集了所有这些日志文件,校对了这些崩溃,然后把它们输入到软件错误跟踪系统数据库。这个过程几乎发现了上百个会导致程序崩溃的软件错误。当你有了上百万用户的时候,你会很吃惊什么样的场景造成程序崩溃, 通常会是极低的内存或是超级破的电脑(你能想象小霸王么?)你可能会写这样的代码:
int foo( object& r )
{
r.Blah();
return 1;
}
and you would get crashes there because the r reference was NULL, even though that's completely impossible, there's no such thing as a NULL reference, C++ guarantees it, and you don't have to believe me but when you wait long enough and have millions of users and religiously collect their stack dumps, you will find crashes in places like that and you won't believe your eyes. (And you won't fix them. Cosmic rays, man. Get a new computer and this time don't install every cool shareware taskbar lint gizmo you find. Sheesh.)
你会发现有个软件崩溃因为r引用是空的,即使看起来完全不可能,根本没有这种“空引用”的概念,C++保证这点,你可以不相信我但是当你等了足够长时间开始有上百万用户了,然后开始虔诚地搜集崩溃时调用堆栈转储,就会发现崩溃会出现在你无法直视的地方。(你没办法修复,宇宙射线导致的,亲。重新买个电脑吧,这回不要把你找到的每个很酷的共享任务栏小插件软件全部都装到电脑上。天哪)
The other thing we do is consider each and every tech support call to be evidence of a bug. When we take the call, we try to figure out what we could have done to eliminate it. For example, the old FogBUGZ Setup used to assume that FogBUGZ would run under the anonymous Internet user account. That was a good assumption 95% of the time, and a bad assumption 5% of the time, but every one of those 5% cases ended up in a call to our support line. So we modified Setup to prompt for an account.
另外一件事就是要把每一通打来的技术支持电话都当作是软件错误的证据。当我们接了这种电话,通常会尝试用已知方法来消除这些现象。例如,旧的FogBUGZ安装过程会假设FogBUGZ以匿名因特网账户运行,95%的情况下这都是很正确的假设,不过还有5%的情况是不正确的假设,那些5%的假设里的人最后就会通过技术支持热线打进来找你。所以我们最后修改了安装过程,提示用户来选择上网账户。
Step Two: Make Sure You Get Economic Feedback
第二步:确保你获得经济实用的反馈
You may not be able to figure out exactly how much it's worth to fix each bug, but there's something you can do: charge the "cost" of tech support back to the business unit. In the early nineties there was a financial reorganization at Microsoft under which each product unit was charged for the full cost of all tech support calls. So the product units started insisting that PSS (Microsoft's tech support) provide lists of Top Ten Bugs regularly. When the development team concentrated on those, product support costs plummeted.
你也许没办法精确的计算出修复每个软件错误的代价,但是你还是有些能做的事情的:把每次“技术支持”的花费全部算到业务部门的账上。在90年代早期微软有这样的财务重组,每个产品部门都会为所有技术支持花费买单。这样产品部门才会开始坚持让PSS(微软的技术支持部门)有规律的提供10大软件错误列表。当开发团队集中处理这些软件错误的时候,技术支持费用就直线下降了。
This is a bit in contradiction with the new trend of letting the tech support department pay for its own operation, something that most large companies do. At Juno tech support was expected to break even by charging people for tech support. By moving the economic burden of bugs onto the users themselves, you lose what limited ability you might have had to detect the damage they were causing. (Instead you get irate users who resent having to pay for your bug, who tell their friends, and you can't even measure how much that costs you. To be fair to Juno, the product itself was free, so stop yer bitchin.)
这和现在的新趋势—“大部分大公司通常都让技术支持部门自负盈亏”有点矛盾,在Juno,技术支持费用由用户均分。通过把软件错误的经济负担转移到用户身上,你失去了自己仅剩的那点找出导致损失的软件错误那种能力。(相反,你获得的是那些后悔付钱购买了你的错误软件,恼羞成怒的用户。他们会告诉朋友,这种代价对你将造成的损失甚至无法衡量。为了对Juno公平起见,他们家产品本身是免费的,别在那儿嚼舌头了。)
One way of resolving the two is to not charge the user when the support call was caused by a bug in your own product. Microsoft does this, and it's quite nice, and I've never paid for a call to Microsoft :) Instead, charge the $245 or whatever one developer incident costs these days back to the product unit. That blows away their profit completely for the product they sold you (several times over), and creates exactly the right economic incentives. Which reminds me of one reason DOS games were a terrible business... to get them to look good and run fast, you usually needed strange video drivers, and a single tech support call about the video drivers would blow away the profit you could make from 20 copies of your product, assuming Egghead and Ingram and the ad on MTV hadn't already guzzled awayall your earnings.
解决这两个问题的一个方法是:当用户的技术支持电话是因为你软件本身的错误导致的,那么就不要让用户付钱。 微软就是这样做的,这样显得很友好。我反正从来没因为技术支持电话给微软付过钱。相反:把一个约245$的开发支持费用记回产品部门头上,就能彻底干掉产品销售获得的利润(甚至是超过几倍)。而且这个过程完全符合经济学原理。这让我想起了为什么DOS游戏是烂生意。。。。。。为了让这些游戏看上去不错而且能快速运行, 通常需要奇怪的视频驱动,一个简单的关于视频驱动的技术支持电话就足以完全抹去你销售20份产品挣来的利润,这还是在假设在知识,理论和MTV广告上的开销没有耗光所有盈利的情况下。
Step Three: Figure Out What It's Worth To You To Fix Them All.
第三步:搞清楚全部修复这些软件错误带来的价值在哪里。
At Fog Creek Software, well, we're a tiny company (except in our own minds), and the development team just takes the tech support calls. The cost was running about 1 hour per day, which, based on our consulting rates, is somewhere around $75000 a year. We were pretty confident that we could get that down to 15 minutes a day by fixing all known bugs.
在FogCreek软件公司,恩,我们是个小公司(除了我们脑袋里的想法外),开发团队负责技术支持电话。花费是每天约1小时,根据我们的咨询费用,大约是每年75000$。我们相当自信如果能把所有的已知软件错误都修复的话这个花费能降到大约15分钟。
Using very sloppy numbers, here, that means that the net present value of the savings would be about $150,000. That justifies 62 days of work: if you can do it in less than 62 person-days, it's worth doing.
通过非常粗略的估算,这意味着节省的费用净现值约为150000$,也就是大约62天的工作时间:如果你能在62人天之内修复所有的软件错误,那么这就是值得做的。
Using the handy estimation feature built into FogBUGZ, we calculated that it would take 20 person-days (two people two weeks) to fix everything - that's $48,000 "spent" for a return of $150,000, which is a great return on investment just on the basis of the tech support savings. (Observe that you could substitute the cost of programmer's salaries and overhead instead of our consulting rate and get the same 3:1 result, since it cancels out).
使用FogBUGZ内建的便捷估算功能,我们计算出大约要花20人天(两个人俩礼拜)来修复所有的软件缺陷 – 这大约是 48,000$ “花费”,为了150,000$的回报。仅从节省技术支持费用的角度出发,这就是一比相当可观的投资回报。(注意,因为两者相等,你可以将程序员的薪水及其他支出取代咨询费,获得大约是3:1的比例)。
I haven't even begun to count the value from having a better product, but I can start doing that, too. We had 55 crashes on the demo server during the month of July with the old code, representing 17 distinct users. You have to imagine that at least one of those people decided not to buy FogBUGZ because they thought it was buggy when they ran the demo (although I don't have real statistics for that.) In any case the lost sales was probably costing us somewhere between $7,000 and $100,000 in present value. (If you were serious enough, it wouldn't be too hard to get a real number).
我都还没考虑获得更好的产品带来的价值,不过也可以那样做。 在7月份部署的旧代码演示服务器上,产生了55次崩溃,分别代表了17个不同的用户。你可以想象他们中至少有一个人,运行演示版本的时候,因为程序崩溃了,进而觉得软件到处都是错误, 进而决定不购买FogBUGZ。(当然我没有真实的统计数据。)不管怎么样,销售损失带来的损失大约在7000$到10000$净现值之间。(如果你相当严谨,要算出准确的数字也不难)。
Next question. Can you charge more for a less buggy product? That would add a whole bunch of value to debugging. I suspect that at the extremes, bug count does affect price, but I am hard pressed to think of an example from the world of packaged software where this has been the case.
下一个问题,可以因为软件错误减少了而提高软件的售价么?那会给调试代码带来一整套新的价值。我怀疑在极端情况下,软件错误数量确实影响价格,不过我怎么也想不出现实中打包销售的软件中这样的例子。
Please Don't Beat Me Up!
不要打我。
Inevitably people read essays like this and come to silly conclusions, like, Joel doesn't think you should fix bugs. In fact I think that for most of the kinds of bugs that most people fix, there's a clear return on investment. But there may be an even higher monetary value to doing something other than fixing every last bug. If you have to decide between fixing the bug for OS/2 guy and adding a new feature that will sell 20,000 copies of your software to General Electric, well, sorry, OS/2 guy. And if you're dumb enough to think that it's still more important to fix OS/2 than to add the GE feature, maybe your competitors won't be and you'll be out of business.
不可避免的,人们阅读了这样的散文然后得出简单愚蠢的结论,例如,Joel觉得你不应该修复软件错误。实际上,我觉得对于大部分人在修复的各种各样软件错误里的大多数,都有着明确的投资回报。但也许存在着比修复最后的软件错误更有价值的事情。如果你要决定:是帮那个OS/2家伙修复软件错误,还是要开发一个新功能,这个新功能可以多卖20,000分软件拷贝给通用电器公司。 那么,对不起了OS/2用户。如果你蠢到还是觉得修复OS/2比增加GE特性更重要的程度的话,竞争对手也许没那么蠢,然后不久你就要失业了。
With all that said, I'm optimistic at heart, and I believe that there is a lot of hidden value to producing very high quality products that is not very easy to capture. Your employees will be prouder. Fewer of your customers will send you back your CD in the mail after microwaving it and chopping it to bits with an ax. So I tend to err on the side of quality (indeed, we fixed every known bug in FogBUGZ, not just the big bang ones) and take pride in that, and feel confident, by the complete elimination of errors from the demo server, that we have a rock-solid product.
综上所述,我本质上是个乐观主义者,而且我相信创造高质量的软件产品隐含了很多不容易发现的价值。员工会更自豪。把软件安装CD先用微波炉烤了然后用大斧砍成一片一片再寄回给你的用户会越来越少。所以我会诟病软件质量(实际上,我们修复了FogBUGZ里的所有已知软件错误,不仅仅是那些重量级的错误),为之骄傲感到自信,通过从展示版本服务器上完全消除错误,我们获得了坚如磐石的软件产品。