This report, found on UseNet and written by DataQuest (nobody knows them), was distributed by Intel Corp. (probably to the PC manufacturers).
__________________________________________________

SPECIAL REPORT: The Great Pentium Fire Drill

DQ MONDAY December 12, 1994

In this article, we cover the following topics related to the Pentium floating-point divide bug:

   o The bounds of the bug
   o The importance of the bug
   o The perception of the bug
   o The role of the Internet
   o Why a total recall is dangerous to the industry
   o Intel's competitors
   o What Intel should do, and what it is doing
   o The software fix
   o The long-term effects of the problem
   o What a system manufacturer should do

The Bounds of the Bug

It turns out that there were some small design errors in the divide hardware in the Pentium's floating-point calculation unit. These errors lead to incorrect results in certain calculations that involve floating-point division. The errors are extremely small--in many cases, insignificant--and occur in approximately one out of every billion calculations involving a reciprocal. For the set of all divisors and dividends, we accept Intel's claim of one failure in every nine billion pairs.

The most famous calculation, now known as Coe's Ratio, and undoubtedly the most common single division performed on the Pentium, is 4195835/3145727. Developed from a thorough study of the Pentium's problem and selected for its worst-case nature, this calculation should generate a result of 1.33382045; Pentium chips afflicted with the divide problem will return an answer of 1.33373907, representing an error in the fifth significant digit. Internet correspondence indicates that this is about the largest error obtainable, and that there are approximately 1,738 unique pairs that can generate an error this big. As the total number of pairs is about 4 x 10^27, the chance of actually hitting an error this big is vanishingly small.
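
The arithmetic behind Coe's Ratio can be checked on any machine with a correct divider. The following is a minimal sketch in Python, not part of the original report: it takes the flawed result and the pair counts exactly as quoted above (it does not reproduce the hardware fault), and the helper name first_differing_digit is purely illustrative.

```python
from fractions import Fraction

# Operands and results exactly as quoted in the report.
DIVIDEND = 4195835
DIVISOR = 3145727
REPORTED_CORRECT = "1.33382045"
REPORTED_FLAWED = "1.33373907"


def first_differing_digit(a: str, b: str) -> int:
    """Return the 1-based position of the first significant digit that differs.

    Assumes both strings are decimals of the same magnitude; returns 0 if
    no digit differs within the shorter string.
    """
    digits_a = a.replace(".", "").lstrip("0")
    digits_b = b.replace(".", "").lstrip("0")
    for position, (da, db) in enumerate(zip(digits_a, digits_b), start=1):
        if da != db:
            return position
    return 0


# Exact rational quotient, so no floating-point hardware is trusted here.
exact = Fraction(DIVIDEND, DIVISOR)
relative_error = abs(exact - Fraction(REPORTED_FLAWED)) / exact
digit = first_differing_digit(REPORTED_CORRECT, REPORTED_FLAWED)

# Chance of drawing one of the worst-case pairs, using the report's counts.
worst_case_chance = Fraction(1738, 4 * 10**27)

print(f"exact quotient  : {float(exact):.8f}")
print(f"relative error  : {float(relative_error):.2e}")
print(f"first bad digit : {digit}")
print(f"worst-case odds : {float(worst_case_chance):.2e}")
```

With the quoted figures, the relative error works out to roughly 6 parts in 100,000, and the first disagreement falls in the fifth significant digit, consistent with the text.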
The other errors occur in the sixth decimal place and further out, reportedly randomly distributed.

The Importance of the Bug

It is extremely difficult to contrive a situation in which the divide error could manifest itself in a way that is material to an end result. This is because the error is normally very small, occurs at a rate that is extremely rare, and should have no effect on the huge majority of Pentium users. The only class of user likely to be affected is that of mathematicians working in the field of number theory, the class of user from which the bug was originally reported.

Clearly, users of a Pentium-based system designing devices where a divide error could conceivably be life-threatening--for example, a bridge, a ship, or a plane--should consider an upgraded part even though the error is unlikely to occur and, if it does, is effectively certain to be insignificant.

Much engineering design work involves iterative calculations that work toward a solution. In these cases, the likelihood of the bug occurring is increased, but its effect is totally negated by the subsequent iterative calculation. If the flaw were to somehow cause a nonconvergence, the process would be started over with a new set of initial conditions, and the error would be eliminated.

Financial users could conceivably be affected by the bug. However, the size of the error is again likely to be immaterial, representing hundreds of dollars in a million-dollar transaction. Although larger transactions could be affected, no financial institution would allow a transaction to take place without some level of cross-checking. The cross-check process would inevitably surface the error, and the rarity of the bug would prevent compensating errors from occurring or multiple errors from compounding.

In summary, the low level of the error relative to real-world situations, coupled with the extreme rarity of the bug, guarantees that even calculations involving lives and money are safe on a Pentium.
Numerologists and some extremely theoretical physicists, however, stand at risk of producing bad results and should request replacements.

On a final note, we observe that many Pentium users--and all Power Mac users--run their systems without parity checks on the memory subsystem. Dynamic memories exhibit a very rare soft failure in which a bit flips from a one to a zero or vice versa, the result of an encounter with an alpha particle. A typical system will encounter this change once every few years; all systems are susceptible to the problem, and it is completely random. Most of the time, this flip doesn't matter because it is overwritten by new data or is in an area of the memory that is not in use. However, intensive numerical calculations that fill memory with intermediate values could turn up an incorrect answer once every few hundred machine years. We believe that this class of error is at least as significant as the Pentium divide bug, as it occurs more frequently and can result in a much larger change in the number.

The Perception of the Bug

Having established that the bug itself is immaterial (and we would be happy to talk about this at great length--it is extremely difficult to contrive a real-world calculation in which it matters), we now turn to the perception of the bug in the press. Many of the recent writings on the subject--in many cases by people who should know better--treat the problem as if the Pentium cannot perform simple arithmetic and as if all applications that do calculations are exposed to some significant risk. Part of the problem here lies with Intel's engineering-driven PR effort, which created the impression that Intel would anoint the chosen few who would receive updated parts; this tactless bungle touched off the initial Internet furor from which most of the press articles drew.

The Role of the Internet

There is also a lesson on the use of the Internet here.
The nature of electronic mail is such that it strips personal interaction from communication. Without this interaction, simple statements or even typographical errors can be interpreted as strong opinion and can touch off torrents of strongly-worded messages. There are few low-energy opinions on the Internet, and just a few angry people can stir up an electronic uproar.

We believe that the single greatest PR error was the posting of Andy Grove's response on the Internet. The Internet response was strong and negative; many writers took their tone from the Internet correspondence. We believe that a press conference would have been much more successful and could probably have dampened much of the negative reporting that was in essence driven by the Internet.

Why a Total Recall Is Dangerous to the Industry

Several writers have suggested the idea of a total recall of the part. We at Dataquest believe that this is an imprudent and irresponsible suggestion; the severity of the bug does not merit the slightest consideration of the idea, and a recall could damage the structure of the entire PC industry and all its participants, not just Intel itself.

The size of the problem involves not only a couple of million Pentium chips in the field, but a total of perhaps 6 to 7 million units built and in the pipeline. Although Intel could afford the monumental cost of replacing existing parts, not shipping the pipeline inventory would delay shipments of Pentium processors by several months. This delay would result in a stall in sales as Pentium leaders such as Packard Bell, Gateway, and Dell stopped selling Pentium systems (which are rapidly becoming the bulk of their revenue). The inventory that these companies and other Pentium leaders hold would drop in value, resulting in a substantial loss and potentially forcing them out of business.
This shock wave would propagate back through the supply channels and result in fewer competitors in the market; higher prices and slower technological progress would be the final result of such a change. In short, a recall would be a bad thing for users and all players in the PC business (including the journalists whose salaries are funded by advertising).

Intel's Competitors

Intel's competitors have remained silent on the issue; this could have happened to any of them. IBM pulled a coup by offering to replace all Pentium chips that it has shipped, sneaky in that IBM has not shipped very many Pentium chips (we would suspect that Ambra machines aren't covered under this deal). AMD, Cyrix, NexGen, and even the PowerPC group have wisely kept their heads down, choosing not to pillory Intel (although Intel has done a good job of making its own burden on this one).

[NOTE: THIS ARTICLE WAS PREPARED EARLY MONDAY, editor]

We also note that a Pentium vacuum is an opportunity for no one. The competing parts aren't available, and even if they were, it would not be economically viable for competitors to ship low-yield advanced die (that is, early versions of the M1 from Cyrix and K5 from AMD) instead of profitable 486 parts.

What Intel Should Do, and What It Is Doing

The way to be clear about something is to put up money. In stating that the Pentium is good for almost all users, Intel stands liable for a financial or other loss that a user may incur as the result of a faulty Pentium calculation. Intel's obligations to its shareholders mandate that this decision be a good one.

The first line of defense is the Intel Lifetime Warranty, under which Intel promises to replace a Pentium if ever the user needs it because of new software or new applications. Coupled with a fair return policy, this is a good compromise for the user and Intel.
In fact, we believe that the Lifetime Warranty is a natural extension of Intel's newly found relationship with its customers' customers and will propagate through succeeding generations of processors and turn into yet another entry barrier.

Intel has established an effective telephone system to handle the replacement policy. The lines are open 24 hours a day, and Intel has contracted third-party telephone support to screen incoming calls, resolving the issue of busy phone lines. We believe that a user with a genuine need for the part could get a replacement without difficulty, and that Intel is erring on the side of caution. Numerologists, financial users, and designers working on projects that could conceivably put life or property at risk are almost certain to qualify for replacement parts; our discussions with Intel indicate that the problem will be handled fairly and efficiently, and we suspect that users who make enough fuss will get a new part whether they need it or not. Intel clearly intends to make its customers' customers happy.

The only other thing that needs to be taken care of is the PR side. We believe that Intel has made advances in cleaning up its previously ruthless image, but this debacle shows that there is much progress to be made. Intel will certainly have to put up with--and learn to smile at--many Pentium jokes over the next couple of years. This development of a softer side will be a good thing for Intel, and for the industry as a whole.

The Software Fix

Intel, in conjunction with some of the individuals who have developed the best understanding of the problem, is developing a software fix. This fix will take the form of a floating-point library that will be used in place of the standard Pentium floating-point calls and will be provided to software manufacturers.
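
A library of this kind amounts to a wrapper around the hardware divide: screen the operands cheaply, and reroute only the suspicious ones. The sketch below, in Python for readability, is an illustration under stated assumptions and not Intel's actual code. The names divisor_may_fail, safe_divide, and flawed_divide are hypothetical; the divisor bit patterns and the 15/16 rescaling come from widely circulated public descriptions of the flaw rather than from this report; and flawed_divide merely imitates a faulty chip for the single pair quoted earlier.

```python
import math

# ASSUMPTION (not from the report): divisors are treated as at risk when
# their significand starts with one of these five bit patterns, followed
# by six consecutive one bits.
RISKY_LEADING_BITS = {0b10001, 0b10100, 0b10111, 0b11010, 0b11101}


def divisor_may_fail(y: float) -> bool:
    """Fast, conservative screen on the divisor's leading significand bits."""
    if y == 0.0 or math.isinf(y) or math.isnan(y):
        return False
    mantissa, _ = math.frexp(abs(y))   # mantissa lies in [0.5, 1)
    bits = int(mantissa * (1 << 53))   # 53-bit integer with leading bit set
    top11 = bits >> 42                 # leading 1 plus the next ten bits
    return (top11 >> 6) in RISKY_LEADING_BITS and (top11 & 0b111111) == 0b111111


def flawed_divide(x: float, y: float) -> float:
    """Stand-in for a faulty divider: wrong only for the pair in the report."""
    if (x, y) == (4195835.0, 3145727.0):
        return 1.33373907
    return x / y


def safe_divide(x: float, y: float, hardware_divide=flawed_divide) -> float:
    """Divide on the hardware, rescaling first when the divisor looks risky."""
    if divisor_may_fail(y):
        # ASSUMPTION: scaling both operands by 15/16 moves the divisor out
        # of the risky range without changing the quotient. In double
        # precision the scaling itself can round for some operands, so
        # this is a sketch rather than a bit-exact fix.
        scale = 15.0 / 16.0
        return hardware_divide(x * scale, y * scale)
    return hardware_divide(x, y)


print(flawed_divide(4195835.0, 3145727.0))  # the faulty answer
print(safe_divide(4195835.0, 3145727.0))    # screened and rescaled
```

Most divisors fail the screen immediately and go straight to the hardware, which is why the overhead of such a wrapper is confined to a few bit tests per division.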
The library will use fast screening techniques that let most calculations proceed using the Pentium hardware; divide pairs that could cause a problem will be checked in more detail, and those that truly fail will be processed in such a way as to generate the correct result. This approach will slow down the Pentium's divide operations somewhat, but the slowdown is unlikely to be noticeable. Software manufacturers will incorporate the library routines into their software and will provide patches via BBS or online services that guarantee accurate results in all cases for users who depend on floating-point accuracy.

The Long-Term Effects of the Problem

Our belief is that this issue will subside over the next few weeks as users realize that, despite the press hype and PR fumbles, this error just doesn't matter. It is the equivalent of a whoopee cushion: noise and action leaving Intel red-faced and floundering but with no real damage to its business or the PC business as a whole. However, all processor designers have learnt a hard lesson about testing of their parts and why engineering approaches to PR problems don't seem to work out too well.

What a System Manufacturer Should Do

Keep shipping the product, but work with Intel to ensure that some supply of replacement parts will be available to satisfy critical users. Train customer service agents to help customers understand the problem, passing out the Intel numbers if necessary. The Intel FaxBack documents are a good first line of defense, and system manufacturers can send these out on their own. Intel would like to do all the screening itself, but we believe that system manufacturers should be responsible for their own destiny and need to play an active part in helping their customers understand the issues.
It is entirely possible that Intel's part in resolving the issues for unhappy customers could further consolidate Intel's brand equity in the user base; this is not necessarily in the best interest of system manufacturers.
_____________________________________________________________________