Suggestions for Pentium FDIV Bug Software Workarounds _____________________________________________________ Stand: 3.DEZ.1994 - V01.3 There is very high activity in UseNet's "comp.sys.intel" now, and naturally a lot of noise in such an outstanding situation. It isn't easy looking for important postings. Therefore I've collected some of the postings concerning the software issues of a workaround for the infamous FDIV Pentium bug, called by Intel a "subtle flaw" (from the official statement). There is an important note in poster no.3: "Intel set up a world-wide conference call yesterday (1-dec-1994), where we discussed...". Yes, the fact, Intel is discussing the Pentium FDIV design bug issues with independent hardware and software gurus known from the Internet community only, shows the serious side of bug. Enjoy! -- ||||||| Karl-Heinz Dittberner | Arnimallee 22 | | | | Scientific Data Processing | D-14195 Berlin, Germany | | | | Free University Berlin | Phone: (+4930) 838 2531 |w|d|v| E-Mail: wdv@vaxser.grumed.fu-berlin.de (Internet) ------------------------------ Included are the postings of: 1.) Cleve Moler, The MathWorks Inc., USA 1.DEC.1994 - 05.12 GMT 2.) Terje Mathisen, Norsk Hydro, Norway 1.DEC.1994 - 09.01 GMT 3.) Terje Mathisen, Norsk Hydro, Norway 2.DEC.1994 - 08.53 GMT -------------------------------------------------------------------------------- 1. Subject: A New Hardware/Software Pentium FDIV Workaround Xref: math.fu-berlin.de comp.sys.intel:23073 comp.soft-sys.matlab:9814 sci.math.num-analysis:15575 Path: math.fu-berlin.de!zib-berlin.de!news.mathworks.com!not-for-mail From: moler@mathworks.com (Cleve Moler) Newsgroups: comp.sys.intel,comp.soft-sys.matlab,sci.math.num-analysis Subject: A New Hardware/Software Pentium FDIV Workaround Followup-To: sci.math.num-analysis,comp.soft-sys.matlab Date: 1 Dec 1994 00:12:02 -0500 Organization: The MathWorks, Inc., Natick, MA 01760 Lines: 226 Distribution: inet Message-ID: <3bjlv2$r3v@acoma.mathworks.com> Summary: If at-risk, multiply by 15/16. Keywords: Tim Coe, Terje Mathisen, Cleve Moler Greetings. Last week, I suggested a hardware/software workaround for the Pentium FDIV bug. And, I promised that we would release a Pentium-aware version of MATLAB that incorporated the workaround. Since then, I have heard from several people suggesting improvements. In particular, Terje Mathisen and Tim Coe have made such good suggestions that I want to consider a different algorithm. I have enough confidence in this new algorithm that we are now in the process of using it for the Pentium-aware MATLAB. Mathisen is the PC programming expert from Norsk Hydro in Norway who confirmed Thomas Nicely's original bug report, and who wrote P87TEST. Tim Coe is the semiconductor design engineer from Vitesse Semiconductor whose model of the chip's behavior led him to examples of the worse-case behavior, including the 4195835/3145727 example. With this posting, the three of us are providing a status report on a more ambitious project. We believe it should be possible to modify software so that it can run on defective Pentiums and provide IEEE compliant floating point arithmetic with very little, if any, degradation in speed. This is work in progress. We're not absolutely sure yet that it is guaranteed to do everything we want. We will continue to study, refine and check the ideas and the code. (I particularly want to thank Audrey Benevento and Marc Ullman for their contributions to our work here at the MathWorks.) The algorithm that I was so enthusiastic about a week ago starts with an FDIV to compute z = x/y. Then the residual, r = x - y*z, is computed. If the residual is "small", z is deemed to be OK. If not, x and y are both scaled by 3/4, and the process is repeated. It is very difficult to define "small" in such a way that the resulting z is certainly the correctly rounded result. Our new algorithm completely avoids such delicate decisions. One personal comment: As far as I am concerned, the activity in the comp.sys.intel newsgroup over the past three weeks has been a mixed blessing. It brought this problem to the attention of a wide audience quickly. It hooked me up with experts like Mathisen and Coe that I would never meet otherwise. It definitely contributed to the creation of the solution we are proposing. But now I am finding it impossible to read everything being posted in the group. In fact, it has become counter productive. I may be missing some important posts, buried in among all the stuff of questionable technical merit. Years ago, Alex Chorin, a math professor at Berkeley, began a book review with one of my all-time favorite quotes, "This book detracts from the sum total of human knowledge" I am afraid we may have now reached that point with the activity in comp.soft-sys.intel. I will be looking to comp.soft-sys.matlab and sci.math.num-analysis for future discussion of the aspects of this problem that I am interested in. -- Cleve Moler ----------------- A New Hardware/Software Pentium FDIV Workaround Tim Coe, coe@vitsemi.com Terje Mathisen, Terje.Mathisen@hda.hydro.com Cleve Moler, moler@mathworks.com One of us (Coe) has developed a model which explains all the known examples of FDIV errors. We believe we now understand the bit patterns well enough that we can identify certain bands in the space of floating point numbers which are "at risk". The test involves looking only at the denominator, although the numerator also affects whether or not an error actually occurs. If the denominator is found to be outside the at-risk bands, the FDIV instruction can be safely used to produce the correctly rounded IEEE result. If the denominator is in one of the at-risk bands, then scaling both the numerator and denominator by 15/16 eliminates the risk. Moreover, if the scaling and subsequent FDIV instruction are carefully carried out in the extended precision format, it is still possible to produce the correctly rounded result. No attempt is made to predict before the FDIV whether or not an error will actually occur, and no attempt is made after the FDIV to assess the size of the residual. Where, exactly, are the bands of at-risk denominators used in our new algorithm? Between each two powers of two, Coe specifies five small intervals to avoid. Consider the interval 4096 <= y < 8192. For values of y in this range, the intervals to avoid have width one. Here they are, in decimal and in hex. 4607. 4607.999999... 5375. 5375.999999... 6143. 6143.999999... 6911. 6911.999999... 7679. 7679.999999... 40b1ff0000000000 40b1ffffffffffff 40b4ff0000000000 40b4ffffffffffff 40b7ff0000000000 40b7ffffffffffff 40baff0000000000 40baffffffffffff 40bdff0000000000 40bdffffffffffff A picture of the interval would look something like this. Each of the five *'s represents a subinterval to be avoided. The length of each subinterval is 1/4096-th of the length of the overall length. [.....*........*........*........*........*.......) It is interesting to note that scaling the third subinterval by 3/4 sends it into the first subinterval. This is another weakness of Moler's original proposal. But scaling by 15/16 shifts each subinterval to a safe region. Think of the 64 bit floating point representation of the denominator as consisting of 16 four-bit half bytes, or "nibbles." These are represented by the 16 characters in the hex printout. Also, think of the binary point as being just after the third byte or sixth nibble. That's between the pair of f's and the string of zeros in the hex numbers given above. Then, with the hidden bit and the binary point shown explicitly, the numbers are of the form: 1aff.xxxxxxxxxx * 2^e The algorithm says to avoid denominators which have eight consecutive ones in the positions indicated by ff and a = 1, 4, 7, 10, or 13 in the first nibble of the fraction. The exponent and the low order 40 bits of the fraction are not involved in the decision. Here is some pseudo C. The input is a pair of 64 bit double precision floating point numbers, x and y. The output is intended to be the 64 bit double precision floating point number which is closest to the exact quotient x/y. double x,y,z long double xx,yy,zz char ychar[8] nibble ynib[16], a /* Use same 64-bit storage for y, ychar and ynib */ /* Check for 8 consecutive ones if (ychar[3] = 0xff) { /* Check for the five bands a = ynib[4] if (a == 1) || (a == 4) || (a == 7) || (a == 10) || (a == 13) { xx = 0.9375*x yy = 0.9375*y zz = xx/zz z = zz return } } z = x/y return This is probably best implemented in assembly language, both because we want it to be fast, and because some C compilers don't give us proper access to the long double format. Here is today's (11/29) version of the assembly code written by one of us (Mathisen). It is intended to be used with the in-lining facility of the WATCOM 10 C compiler. Its unique feature is a 32 entry table which drives the test. The table can be set to all zeros for chips without the bug and to the appropriate nonzero quantities if a defective chip is detected at startup. The code also deals with denormals, which we have so far ignored in this discussion. We include the code here primarily to show our approach. It is not expected to be the final, complete solution. Anybody interested in details should contact Terje.Mathisen@hda.hydro.com via email. #include #include static float _fdiv_scale = 0.9375; static char _fdiv_risc[32] = {0}; long double lfdiv(long double x, long double y); #pragma aux lfdiv = \ " sub esp,12"\ " fld st"\ " fstp tbyte ptr [esp]"\ " mov eax,[esp+4]"\ " shr eax,15"\ " cmp al,255"\ " jne ok"\ " mov al,ah"\ " and eax,31"\ " cmp _fdiv_risc[eax],ah"\ " jz ok"\ " fld [_fdiv_scale]"\ " fmul st(2),st"\ " fmulp st(1),st"\ "ok:"\ " fdivp st(1),st"\ " add esp,12"\ parm reverse [8087] [8087]\ modify nomemory exact [eax 8087]\ value [8087]; void _fdiv_risc_init(void) { int i; if (fdiv_fail()) { for (i = 1; i < 16; i += 3) _fdiv_risc[i+16] = i; } } void main(void) { double z; z = lfdiv(425678.0, 3142567.0); printf("z = %G\n",z); } -------------------------------------------------------------------------------- 2. Subj: Re: A New Hardware/Software Pentium FDIV Workaround Xref: math.fu-berlin.de sci.math.num-analysis:15580 comp.soft-sys.matlab:9816 comp.sys.intel:23122 Path: math.fu-berlin.de!zib-berlin.de!news.mathworks.com!udel!gatech!howland.re ston.ans.net!pipex!sunic!trane.uninett.no!nac.no!telepost.no!hydro.com!usenet From: Terje.Mathisen@hda.hydro.com (Terje Mathisen) Newsgroups: sci.math.num-analysis,comp.soft-sys.matlab,comp.sys.intel Subject: Re: A New Hardware/Software Pentium FDIV Workaround Date: 1 Dec 1994 09:01:16 GMT Organization: Hydro Data, Norsk Hydro (Norway) Lines: 174 Message-ID: <3bk3cs$51@vkhdsu01.hda.hydro.com> References: <3bjlv2$r3v@acoma.mathworks.com> Reply-To: Terje.Mathisen@hda.hydro.com (Terje Mathisen) In <3bjlv2$r3v@acoma.mathworks.com>, moler@mathworks.com (Cleve Moler) writes: >Greetings. > >Last week, I suggested a hardware/software workaround for the Pentium >FDIV bug. And, I promised that we would release a Pentium-aware >version of MATLAB that incorporated the workaround. Since then, >I have heard from several people suggesting improvements. In >particular, Terje Mathisen and Tim Coe have made such good suggestions >that I want to consider a different algorithm. I have enough >confidence in this new algorithm that we are now in the process >of using it for the Pentium-aware MATLAB. > >Mathisen is the PC programming expert from Norsk Hydro in Norway >who confirmed Thomas Nicely's original bug report, and who wrote >P87TEST. Tim Coe is the semiconductor design engineer from >Vitesse Semiconductor whose model of the chip's behavior led him >to examples of the worse-case behavior, including the >4195835/3145727 example. > Here is my currently best (in a numerical sense) version of the FDIV workaround: It will handle all single and double precision numbers exactly, including Nan, Inf, Denormals and Zero. With long double (80-bit) numbers, it can be maximally 1ulp off from the IEEE spec. It will also handle denormalized long doubles, with the same maximum error term. A key part of the algorithm is a 16-byte table that contains non-zero values for the five critical nibble values that must occur directly after the decimal point in the mantissa. This table is initially zero'ed. During startup, the chip is tested. If the FDIV bug is found, entry number 1,4,7,10 and 13 in the table are made non-zero, which will trigger the rescaling of at-risk numbers. If the chip is determined to be bug-free, the table is left empty, and no scaling will occur. This means that the same code can be used on all cpu's, with very little overhead. Here are the results from a test run: I:\C\FDIV>fdivtest You have the FDIV bug, installing patch table: Testing 5678901.000000 / 4789012.000000: FDIV : 39 S = 1.185818912126342 3ff2f91d4068f99b FDIV + 1 iteration: 56 S = 1.185818912126342 3ff2f91d4068f99b FDIV + 2 iterations: 68 S = 1.185818912126342 3ff2f91d4068f99b FDIV + Moler test: 78 S = 1.185818912126342 3ff2f91d4068f99b fdiv + Coe test: 60 S = 1.185818912126342 3ff2f91d4068f99b FDIV + inline Coe: 48 S = 1.185818912126342 3ff2f91d4068f99b LFDIV 49 S = 1.185818912126342 3ff2f91d4068f99b Testing 4195835.000000 / 3145727.000000: FDIV : 39 S = 1.333739068902038 3ff556fec7254ed1 FDIV + 1 iteration: 56 S = 1.333820449136241 3ff557541c7c6b43 FDIV + 2 iterations: 68 S = 1.333820449136241 3ff557541c7c6b43 FDIV + Moler test: 192 S = 1.333820449136241 3ff557541c7c6b43 fdiv + Coe test: 68 S = 1.333820449136241 3ff557541c7c6b43 FDIV + inline Coe: 62 S = 1.333820449136241 3ff557541c7c6b43 LFDIV 60 S = 1.333820449136241 3ff557541c7c6b43 Testing 1.000000 / 824633702449.000000: FDIV : 39 S = 1.212659624879394e-012 3d7555555bfa8e3b FDIV + 1 iteration: 56 S = 1.212659629396902e-012 3d7555555d4fe391 FDIV + 2 iterations: 68 S = 1.212659629396902e-012 3d7555555d4fe391 FDIV + Moler test: 192 S = 1.212659629396902e-012 3d7555555d4fe391 fdiv + Coe test: 68 S = 1.212659629396902e-012 3d7555555d4fe391 FDIV + inline Coe: 62 S = 1.212659629396902e-012 3d7555555d4fe391 LFDIV 59 S = 1.212659629396902e-012 3d7555555d4fe391 The different labels correspond to: FDIV : Compiler-generated x/y FDIV + 1 iteration: One Newton-Raphson (sp?) iteration FDIV + 2 iterations: Two NR iterations FDIV + Moler test: Cleve Moler: test result of x/y by back-multiplication fdiv + Coe test: Coe test, written in C + asm (Does not handle denormals) FDIV + inline Coe: #define version of Coe test LFDIV Long double and denormal-capable inline asm version The LFDIV code is included below. This is what I currently recommend people use instead of FDIV to work around the bug. The first pair of numbers are not in one of the critical regions, so the fixup code is not invoked. As seen from the timing coloumn, a naked FDIV does take 39 cycles, as shown in the Pentium manuals. For normal numbers, not needing rescaling, or when running on a bug-free cpu, the LFDIV code has a 10-cycle total overhead. If the divisor is found to be at risk, I rescale by 15/16, which increase the running time by another 10 cycles. This means that the workaround have a normal overhead of about 25%, increasing to about 50% for worst- case numbers. ======== Pentium FDIV bug workaround (Watcom inline version) ============ /* Lookup table for critical nibble patterns, initally empty: */ static char _fdiv_risc_table[16] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}; /* Rescaling factor for at-risk numbers: 15/16 */ static float _fdiv_scale = 0.9375; /* Rescaling factor for long double denormals, 2^51, defined as a long: */ static long two_to_the_power_of_51 = 0x72000000; /* Denormal-capable long double fdiv: */ long double lfdiv(long double x, long double y); #pragma aux lfdiv = \ " sub esp,12"\ "restart:"\ " fld st"\ " fstp tbyte ptr [esp]"\ " mov eax,[esp+4]"\ " add eax,eax"\ " jnc denormal"\ " shr eax,20"\ " cmp al,255"\ " jne ok"\ " mov al,ah"\ " and eax,15"\ " cmp _fdiv_risc_table[eax],ah"\ " jz ok"\ " fld [_fdiv_scale]"\ " fmul st(2),st"\ " fmulp st(1),st"\ " jmp ok"\ "denormal:"\ " or eax,[esp]"\ " jz zero"\ " fld [two_to_the_power_of_51]"\ " fmul st(2),st"\ " fmulp st(1),st"\ " jmp restart"\ "zero:"\ "ok:"\ " fdivp st(1),st"\ " add esp,12"\ parm reverse [8087] [8087]\ modify nomemory exact [eax 8087]\ value [8087]; /* Define a single FDIV instruction as an inline asm macro, to avoid any problems with the C optimizer messing up the check for faulty FDIV's: */ double _fdiv_test(double x, double y); #pragma aux _fdiv_test = \ " fdivp st(1),st"\ parm reverse [8087] [8087]\ modify nomemory exact []\ value [8087]; #define a_bug 4195835 #define b_bug 3145727 /* Startup code to check for FDIV bug, and install nibble table if needed */ void test_fdiv(void) { double z; z = _fdiv_test(a_bug,b_bug); if (a_bug - z*b_bug > 1e-10) { int i; /* Error found, so init the pattern table! */ #ifdef DEBUG printf("You have the FDIV bug, installing patch table:\n"); #endif for (i=1; i < sizeof(_fdiv_risc_table); i += 3) _fdiv_risc_table[i] = i; } } -Terje Mathisen (include std disclaimer) "almost all programming can be viewed as an exercise in caching" -------------------------------------------------------------------------------- 3. Subj: Re: A New Hardware/Software Pentium FDIV Workaround Xref: math.fu-berlin.de sci.math.num-analysis:15618 comp.soft-sys.matlab:9848 comp.sys.intel:23441 Path: math.fu-berlin.de!zib-berlin.de!news.mathworks.com!hookup!swrinde!pipex!sunic!trane.uninett.no!nntp.uio.no!nac.no!eunet.no!nuug!telepost.no!hydro.com!usenet From: Terje.Mathisen@hda.hydro.com (Terje Mathisen) Newsgroups: sci.math.num-analysis,comp.soft-sys.matlab,comp.sys.intel Subject: Re: A New Hardware/Software Pentium FDIV Workaround Date: 2 Dec 1994 08:53:20 GMT Organization: Hydro Data, Norsk Hydro (Norway) Lines: 64 Message-ID: <3bmna0$n85@vkhdsu01.hda.hydro.com> References: <3bjlv2$r3v@acoma.mathworks.com> <3bk3cs$51@vkhdsu01.hda.hydro.com> Reply-To: Terje.Mathisen@hda.hydro.com (Terje Mathisen) In , mcgrant@rascals.stanford.edu (Michael C. Grant) writes: >I'm somewhat concerned about the cost of even this new FDIV workaround. >Can the authors post the relative speeds of a single division in >the following circumstances: > >1) an FDIV call >2) a call to the FDIV replacement on a bug-free chip >3) a call to the FDIV replacement on a buggy chip, with a > non-offending denominator >4) a call to the FDIV replacement on a buggy chip, with an > offending denominator Intel set up a world-wide conference call yesterday, where we discussed the possible workarounds. The hardware gurus, Tim Coe & Peter Tang agreed that we should rescale more numbers, since they hadn't been able to prove conformance for all the bitpatterns cleared by my initial test. The current version just checks the first 6 bits of the mantissa, including the sign bit, via a direct lookup table. Passing numbers (27 out of 32) are sent on immediately, while those at risc are rescaled. Denormal numbers are handled separately. Intel are hard at work putting this code into a format that is directly usable by compiler vendors. They will send out the official version. My latest tests, using an inline Watcom asm macro, indicate the followings timings: 1 Naked FDIV: 40 cycles 2 FDIV, no bugs: 44 cycles 3 FDIV, bug, no fix needed: 58 cycles 4 FDIV, bug, fix needed: 64 cycles So, we're talking about a 10% slowdown for bug-free chips, to be able to use the same code on all cpus. The average running time on a buggy P5 will be below 60 cycles, which corresponds to a 50% slowdown. If I specialcase inner loop code for the bugfree and buggy FDIVs, I can get the overhead down by about 3-4 cycles for both cases. > >I'm aware that Amdahl's law tells me that, since the number of divides >in a typical program is so small, the slowdown due to this new FDIV >will be minimal. I bet you won't be able to measure any differences, unless you use the internal Pentium cycle counters. > >But, for example, if you're using MATLAB more than occaisonally, I >certainly think you qualify for a free replacement! If Intel says no >to such a person, they're probably going to get into hot water. MATLAB will probably have a FDIV-bug aware version available within a few days, they intended to put my fixup code into their inner loops yesterday so they could start verifying it's correctness. I do agree though that Intel will probably have to bite the bullet and replace all the Pentiums for those who demand it. -Terje Mathisen (include std disclaimer) "almost all programming can be viewed as an exercise in caching" --------------------------------------------------------------------------------