Subject: Effects of Pentium Division Flaw and its Software Workaround
From: dgh@validgh.com (David G. Hough at validgh)
Newsgroups: comp.arch.arithmetic,sci.math.num-analysis,comp.sys.intel,comp.benchmarks
Message-ID: <678@validgh.com>
Date: 23 Feb 95 00:47:48 GMT
Followup-To: poster
Organization: validgh, PO Box 20370, San Jose, CA 95160
Lines: 571

The diffs listings at the end of the following report have been deleted to reduce its length. tbl | troff -ms source for the full report, including diffs listings, is available from dgh@validgh.com. The ASCII version that follows has some wide tables that may be easier to read in a window about 100 characters wide.

A B S T R A C T

The infamous Intel Pentium floating-point division flaw is seldom visible in the results of realistic technical applications, nor does it perceptibly affect performance. But some programs skillfully designed to look for arithmetic flaws can find it. In contrast, the expected harmless differences between 486 and Pentium elementary transcendental functions, due to the improved approximations in the latter, are often evident in ordinary applications.

Intel and Cygnus published a recommended compiler workaround that reduces the effect of the Pentium division flaw to at most one unit per division, in the least significant bit of extended precision. Intel has also modified its math library product libm.a to avoid the division flaw. The compiler and libm workarounds do not affect results of CPU chips other than flawed Pentium chips.

The two modifications to the compiler and libm avoid any severe effects of the Pentium flaw, but sometimes cause harmlessly different results in realistic technical applications. The modifications degrade performance of flawed CPU's by a median of 1%, and SPECfp92 ratios by about 9%.

Scope of Report

This report describes the results of an investigation into the Intel Pentium CPU division flaw and Intel's compiler and library workarounds for it.
The Intel Pentium CPU division flaw is caused by an incomplete table in the division hardware of all Pentium CPU's shipped prior to late 1994. On rare occasions it causes incorrect results on floating-point divisions, and even less frequently on remainder, tangent, and arctangent operations. The relative error can very rarely be as large as several parts in 10**5.

In the context of technical computing applications, this report addresses the questions:

Correctness: How do the Pentium flaw and its workarounds affect program correctness compared to an unflawed Pentium with no workarounds?

Performance: How do the Pentium flaw and its workarounds affect program performance compared to an unflawed Pentium with no workarounds?

Summary

effect of Pentium flaw with unmodified software

By using the same executables, produced by unmodified compilers and libraries, on flawed and unflawed Pentiums, no performance difference was observed, and no difference in output was observed for any application program tested. The flaw was only manifest in programs specifically devised to carefully test floating-point division and elementary transcendental functions, especially the UCBTEST programs devised by Prof. W. Kahan and his students.

Conclusion: The Pentium flaw will very rarely affect the results of typical scientific applications. These programs perform many floating-point operations but perhaps relatively few of those operations are likely to be vulnerable to producing incorrect decisions as a result of the Pentium flaw. The flaw may arise much more frequently than the frequency of misleading output, to the extent that subsequent calculations obscure flaw effects. In contrast, typical commercial spreadsheet users perform fewer floating-point operations, but an incidence of the flaw may have a higher probability of affecting the visible output, particularly if the data contains many numbers slightly different from small integers.
Fortunately almost all these spreadsheet users may employ modified software to work around the flaw or even disable floating-point hardware entirely with little perceptible performance loss.

effect of Pentium transcendental functions improvements

In contrast to the foregoing, the same executables, produced by unmodified compilers and libraries, on 486 and unflawed Pentiums, revealed differences in five realistic applications and five transcendental function test programs.

Conclusion: Persons attempting to verify flawed Pentium scientific calculations on 486 systems are far more likely to discover the Pentium transcendental improvements than the division flaw.

effect of software modifications on unflawed Pentium

Unflawed Pentium systems produce identical results when compiled either with unmodified compilers and math library, or with compilers and math library modified to avoid the division flaw. There is no average performance difference.

Conclusion: On unflawed Pentiums, the software workarounds do not affect the correctness or performance of a variety of realistic technical applications.

effect of software modifications on flawed Pentium

Comparing results of a flawed Pentium with software modifications to an unflawed Pentium without software modifications revealed some significant differences. Five realistic applications printed slightly different numerical results. Whereas a flawed Pentium with no software modifications failed ucbdivtest by six units in single precision after 750000 test cases, the flawed Pentium with software modifications failed ucbdivtest by one unit in extended precision after 61 test cases. As advertised, the software workarounds reduce the worst-case relative error from a few parts in 10**5 to one part in 10**19. The median performance degradation was 1%, and the first and third quartile points were at 86% and 100% compared to 98% and 100% for the previous case.
This performance degradation was due to the extra overhead invoked by the software modifications to check division arguments and work around potentially hazardous situations.

Conclusion: On flawed Pentiums, the software workarounds do not affect the correctness of a variety of realistic technical applications, and avoid any severe correctness problems caused by the division flaw. The performance penalty due to the software modifications on flawed Pentiums is occasionally noticeable but usually tolerable.

Test Configurations

Test configurations consisted of a host PC EISA system with an Intel CPU, a compiler and libraries, and specific compilation options. All test configurations used Solaris 2.4 for x86 with Driver Update 3 as the operating system, with 200MB of tmpfs /tmp+swap on Seagate ST3600N, ST3620N, or ST11200N FAST SCSI disks. Host details are as follows:

HOSTS

Name     Vendor   CPU                      RAM
fix1     Intel    90 MHz Pentium unflawed  64MB
fix2     Intel    90 MHz Pentium unflawed  64MB
flaw1    Intel    90 MHz Pentium flawed    64MB
flaw2    Intel    90 MHz Pentium flawed    64MB
gateway  Gateway  66 MHz 486DX2            40MB

All were NFS clients of a SPARCstation 10/41 file server containing the test programs, input data, executables, and results. Execution took place locally in /tmp.

All the Pentium systems were identical except for the CPU. They each had writeback cache enabled, an Adaptec 274x SCSI controller, and an SMC 8016T ethernet controller installed. Enabling writeback cache is critical to Pentium performance - writethrough cache produced 2X slower SPEC ratios.
COMPILERS

Name      Compiler                             Options      Math libraries
flaw8     GCC 2.6.3 with Cygnus patch aligned  -mfpflaw     workaround,fdlibm,sun
flaw8m    GCC 2.6.3 with Cygnus patch aligned  -mfpflaw     noworkaround,fdlibm,sun
noflaw8p  GCC 2.6.3 with Cygnus patch aligned  -mno-fpflaw  workaround,fdlibm,sun
noflaw8   GCC 2.6.3 with Cygnus patch aligned  -mno-fpflaw  noworkaround,fdlibm,sun
noflaw    GCC 2.6.3 with Cygnus patch          -mno-fpflaw  noworkaround,fdlibm,sun
263       GCC 2.6.3 unpatched                               fdlibm,sun

The Cygnus patch for GCC 2.6.3 converts all floating-point division instructions to register-register form in an early stage of the compiler and later converts those instructions into calls to a subroutine that works around the division flaw. In some cases the extra instructions may cause a slight performance degradation. (Intel's own compiler, not tested in this study, avoids the first conversion and calls separate workaround subroutines for each form of division instruction; higher levels of optimization can inline those functions as well.) In addition, conversion of an instruction to an external subroutine call usually inhibits a number of optimization techniques.

"aligned" in the table refers to the additional modifications described below for more predictable performance.

Math libraries were linked in the order indicated.

In all cases, Fortran programs were compiled after translation with F2C from AT&T Bell Labs.

LIBM.A'S

workaround    Intel library with workarounds
noworkaround  Intel library with no workarounds
fdlibm        Freely-distributable library
sun           Sun ProCompiler C 2.0.1 library

When released, this version of the Intel library will be a commercial product optimized for Pentium, available with or without workarounds for the Pentium division flaw; it supports single, double, and extended precision.
(Intel's own compiler, not tested in this study, links with the workaround library by default, since Pentium compilation is the default; but if compilation is specified for a 486 or earlier, then the library without the workarounds is used.)

fdlibm is a freely-distributable library, supporting double precision only, made available at netlib by SunSoft Developer Products.

The Sun ProCompiler library is a commercial product, and was included at the end of the link sequence so that certain IEEE 754 functions needed for some of the test programs would be available in all the configurations; it supports single, double, and extended precision, although the double precision support was never used because the Intel library sometimes and fdlibm always preceded it in the linking sequence.

Two levels of optimization were tested: "g" (-g) was used to determine correct output initially, but most of the tests were run with "max" GCC optimizations:

    -O3 -m486 -finline-functions -funroll-loops -fomit-frame-pointer -fwritable-strings -static -ffloat-store

Some of the foregoing may have no effect on Intel systems. After considerable experimentation I determined that -ffloat-store was required to execute many of the more sensitive test programs correctly, although it inhibits some optimizations, and that -ffast-math caused too many problems to be used routinely either with sensitive test programs or realistic applications, so I discarded all results compiled with -ffast-math or without -ffloat-store and started over. No doubt by compiling with -ffast-math and without -ffloat-store, better performance could be obtained on many programs.

The Reference Configuration used in correctness and performance comparisons was the 90MHz unflawed Pentium system fix2, with the noflaw8 compiler and libraries, using maximum optimizations.

Software modifications

It was necessary to modify the Cygnus and Intel software slightly for the purposes of this study.
Alignment: The Pentium processor requires that double-precision data be aligned on 8-byte boundaries for optimum performance; the penalty for misalignment can be almost 2X. However, the Intel ABI does not specify 8-byte alignment for double-precision data, and GCC does not so align it. I modified GCC to do such alignment everywhere but on the stack, where 8-byte alignment probably would require much more extensive compiler modifications. Instead I stabilized (rather than optimized) stack alignment by ensuring that it was consistently aligned prior to the invocation of main(), so that minor changes in storage layout such as those induced by different libraries would have less of an effect on performance measurements - important since the performance changes due to the Pentium flaw workaround software are mostly very small.

The 486 is also susceptible to performance variations due to misalignment, but the worst case is more like 1.2X.

Initial inexact exception: IEEE arithmetic requires that all exception flags including inexact be cleared at the start of a program. The tests for the presence of the Pentium division flaw in GCC's crt1.s and Intel's libm.a both executed an inexact division causing the flag to be set, which was noticed and reported as an error by a few of the sensitive test programs. So I modified the division flaw test functions to restore the exception flags after performing the test.

Test Programs

For each test configuration, tests were conducted by compiling 126 separate source programs, some of which were compiled separately for single, double, and extended precision, making a total of 168 executable programs. Some of these had many separate input files, making 615 separate executions and output files to be compared for evaluating correctness. Some of these outputs produced multiple timing data, so there were 851 timing data in all.
Of these, most were not included in the performance analysis because they ran less than ten seconds and so were not timed accurately enough under Unix. About 310 timings were usable in most of the performance comparisons. Some of the most extreme timing outliers were repeated to eliminate irreproducible glitches. Although each individual timing datum is subject to considerable variation from run to run, the overall conclusions are not likely to change much.

Results from several kinds of test programs are reported below:

sensitive

Sensitive test programs are intended to investigate correctness of difficult cases and boundary conditions. They depend intimately on correct rounding and exception handling per IEEE 754, or on close-to-correct rounding of elementary transcendental functions in libm. They mostly execute quickly and so had little effect on performance comparisons, but contribute many of the differences in the correctness comparisons. Examples:

    elefunt      Cody's elementary function tests
    cvector      Coonen's IEEE test vectors
    ucbmul       Kahan's multiplication test
    ucbdiv       Kahan's division test
    ucbsqr       Kahan's sqrt test
    paranoia     Kahan's general arithmetic tests
    liu          Liu's elementary function tests (older version)
    ucbeef       Liu's elementary function tests (current version)
    kcvector     Ng's elementary function tests
    fkcvector    Ng's elementary function tests
    ucblibtest   Ng's elementary function tests
    ucbflibtest  Ng's elementary function tests

    SP           single-precision version
    DP           double-precision version
    QP           extended-precision version

kernels

Performance measurement programs based on collections of short loops are useful for detecting small-scale phenomena like cache variations, due to slight differences in compiled code. They often exaggerate the impact on realistic applications. Being artificial, they often lack measures of correctness other than ad-hoc checksums applied to their computed outputs.
Examples:

    sl#N    Livermore loop N, single precision
    dl#N    Livermore loop N, double precision
    sd#N    Digital Review loop N, single precision
    dd#N    Digital Review loop N, double precision
    sn#lll  NAS Kernels loop lll, single precision
    dn#lll  NAS Kernels loop lll, double precision

benchmark suites

SPECfp92, SPECint92, PERFECT, and the Los Alamos benchmark suites are collections of performance test programs intended to represent realistic technical applications. They usually include some kind of internal test of correctness of output to guard against flaws in hardware, optimizing compilers, or libraries. Often the benchmark versions of source codes and input data have been sanitized somewhat, compared to the real thing, in order to achieve better portability among test platforms.

    SPEC        0nn.*
    PERFECT     adm,arc2d,bdna,dyfesm,flo52,mdg,mg3b,ocean,qcd2,spec77,track,trfd
    Los Alamos  gamteb,hydro,intmc,photon,vgam
    slalom      scalable benchmark code

realistic applications

Over the years SunSoft Developer Products has received a number of large realistic applications that present interesting correctness or performance problems, from Sun's field engineers, technical support, customers, and even USENET postings. However, most proved unexceptional in this study. Examples:

    3e2         SPICE 3E2 semiconductor device simulation
    dnacompare  dna sequence comparison
    geodetic8   geodetic distance in spacetime
    g4          exact rational system analyzer
    herwig57    hadron emission reactions
    jetset74    jet fragmentation physics
    launch      junction calculations by modal matching
    ray         rayshade 4.06 graphics rendering program
    reweight    particles in detector

Performance ratios

The following table reports ratios related to SPECfp92 and SPECint92. These ratios are comparable within this report, but not with those measured or reported elsewhere, because the SPEC source programs and test infrastructure were modified for this investigation.
In particular, the best SPEC ratios reported elsewhere for Pentium systems have been obtained with compilers that optimize more aggressively and specifically for Pentium than GCC 2.6.3 and F2C. As with other SPEC ratios, the reference performance times come from a VAX 11/780 rather than the noflaw8.max.fix2 reference times used elsewhere in this report.

SPEC 92 ratios

Comp      Opt  Host     fp  int
flaw8     max  fix1     48   62
noflaw8   max  fix2     48   62
noflaw8p  max  flaw2    48   62
noflaw    max  fix2     47   62
noflaw8   max  flaw2    47   62
flaw8     max  flaw1    44   62
flaw8m    max  flaw1    44   62
263       max  fix1     43   62
noflaw8   max  gateway  15   27

Relative Performance Graphs

Relative performance is defined as the ratio of the reference execution time to the measured execution time, expressed as a percentage. Thus if a particular test required 120 seconds in the flaw8.max.flaw1 configuration and 80 seconds in the reference configuration, the relative performance percentage would be 80/120 = 67% because only 2/3 of the reference performance was obtained.

After all the relative performance percentages were computed for a particular configuration, they were sorted and the 0%, 25%, 50%, 75%, and 100% quartile levels were determined and plotted in the graphs below. A typical line

    scale ticks: 25% 33% 44% 58% 76% 90% 100% 112% 132% 175% 230% 300% 400%

    bar                      #    comp   opt  host
    6==31===-----------74    313  nofl8  max  gate

indicates the following information for the noflaw8.max.gateway configuration, based upon 313 timing data: the worst relative performance percentage was 6%, the median was 31%, and the best was 74%. The median performance reflects the relative difference between the 66 MHz 486DX2 Gateway system and the reference system. The ==31=== double bars indicate the region between the 25% and 75% quartiles, which contains half the data; those relative performance points may be read from the scale as approximately 27% and 37%. The scale is logarithmic so that relative performances of 400% and 25% are equally distant from 100%.
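
The relative performance calculation described above can be sketched in Python. This is an illustration only, not the scripts used in this study: the function names and the sample timings are hypothetical, and linear interpolation between sorted points is an assumed quantile rule that the report does not specify.

```python
import math


def relative_performance(reference_seconds, test_seconds):
    """Relative performance as a percentage of the reference configuration."""
    return 100.0 * reference_seconds / test_seconds


def quartile_levels(percentages):
    """Return the 0%, 25%, 50%, 75%, and 100% levels of the sorted data.

    Linear interpolation between neighbouring points is an assumption;
    the report does not say which quantile rule was used.
    """
    data = sorted(percentages)
    levels = []
    for q in (0.0, 0.25, 0.5, 0.75, 1.0):
        pos = q * (len(data) - 1)
        lo = math.floor(pos)
        hi = math.ceil(pos)
        levels.append(data[lo] + (pos - lo) * (data[hi] - data[lo]))
    return levels


def log_distance_from_reference(percentage):
    """Signed distance from 100% on a logarithmic scale, in powers of 2."""
    return math.log2(percentage / 100.0)


# Worked example from the text: 120 s measured against an 80 s reference.
example = relative_performance(80.0, 120.0)

# Hypothetical (reference, test) timings in seconds, for illustration only.
timings = [(80.0, 120.0), (50.0, 50.0), (30.0, 31.0), (40.0, 38.0), (90.0, 100.0)]
levels = quartile_levels(relative_performance(r, t) for r, t in timings)

print("example: %.0f%%" % example)
print("levels: " + " ".join("%.0f" % v for v in levels))
```

On the logarithmic scale, log_distance_from_reference(400) and log_distance_from_reference(25) have the same magnitude and opposite signs, which is why 400% and 25% plot equally far from 100%.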
The performance spread is so slight in most of the comparison configurations that the 25%, 50%, and 75% quartiles are superimposed.

Relative run performance

    scale ticks: 25% 33% 44% 58% 76% 90% 100% 112% 132% 175% 230% 300% 400%

    bar                      #    comp    opt  host
    79--100108               312  nofl8   max  fl2
    77---100116              312  nofl8p  max  fl2
    68-----100112            314  fl8     max  fx1
    59---===97=-126          313  263     max  fx1
    59------=99-122          313  nofl    max  fx2
    58-----==99108           312  fl8     max  fl1
    58------=99110           313  fl8m    max  fl1
    6==31===-----------74    313  nofl8   max  gate

The following provides the same information in numerical form:

      0%  25%  50%  75% 100%  tests  comp.opt.host
     100  100  100  100  100    314  noflaw8.max.fix2
      79  100  100  100  108    312  noflaw8.max.flaw2
      77   99  100  100  116    312  noflaw8p.max.flaw2
      68   98  100  100  112    314  flaw8.max.fix1
      59   78   97  100  126    313  263.max.fix1
      59   91   99  100  122    313  noflaw.max.fix2
      58   86   99  100  108    312  flaw8.max.flaw1
      58   89   99  100  110    313  flaw8m.max.flaw1
       6   27   31   36   74    313  noflaw8.max.gateway

Relative Performance Extremes

The following tables list the tests displaying the eight most extreme relative performance percentages for each comparison configuration. The 100% reference performance configuration is noflaw8.max.fix2.
Relative performance extremes

      % test            % test            % test            % test          comp    opt  host
     59 dl#1           59 dl#1x1001x8    60 dl#19          60 dl#12         263     max  fx1
    106 ducbmul       117 liu SP        118 094.fpppp     126 ducbdiv       263     max  fx1
     68 qucbsqr        70 fpppp          70 ducbsqr        86 ducbeef       fl8     max  fx1
    105 sl#11         107 g4.m35        108 g4.m33        112 015.doduc     fl8     max  fx1
     59 dl#1           60 fpppp          60 dl#19          60 dl#1x1001x8   nofl    max  fx2
    109 015.doduc     111 liu SP        117 dn#emit       122 094.fpppp     nofl    max  fx2
     58 dd#AIRREL      58 sd#AIRREL      63 dd#EGYPT       63 sd#EGYPT      fl8     max  fl1
    107 g4.m34        107 g4.m36        108 g4.m22        108 g4.m35        fl8     max  fl1
     58 sd#AIRREL      58 dd#AIRREL      63 sd#EGYPT       63 dd#EGYPT      fl8m    max  fl1
    104 sl#11         104 dl#16         106 dn#cfft2d     110 g4.m33        fl8m    max  fl1
     79 dd#ALAM18      92 052.alvinn     94 dr3 DP         96 dn#cfft2d     nofl8   max  fl2
    104 spiff         104 094.fpppp     106 g4.m36        108 g4.m22        nofl8   max  fl2
     77 cslalom        77 dd#ALAM18      83 fpppp          86 zhuge         nofl8p  max  fl2
    104 sl#11         104 dl#16         108 g4.m35        116 015.doduc     nofl8p  max  fl2
      6 intmc1000      15 fpppp          16 dl#10          18 dl#10         nofl8   max  gate
     53 023.eqntott    54 dd#PRIME       56 dn#gmtry       74 dn#vpenta     nofl8   max  gate

Faster than 100%?

For several Pentium configurations, there were a few tests that ran faster than the reference configuration; some of these tests even overcame the burden of division flaw workaround code. Since relative performance differences of less than 5% are usually not significant, the interesting tests attain relative performance in excess of 105%. There appear to be two causes for such high performance results: alignment and cache effects.

Recall that the "aligned" modifications to GCC did not optimize the alignment of variables allocated on the stack - the modifications merely served to stabilize the alignment among the various *flaw8 configurations. So it is not surprising that by chance some programs happened to achieve better alignment with the "263" and "noflaw" compilations while others achieved worse alignment, compared to the stabilized alignment of the *flaw8 compilations.

Internal and external cache effects may also have a role.
Occasional bad luck will cause cache thrashing with code that might otherwise run faster. This familiar RISC phenomenon is likely to affect Pentium too.

Differences Files

The following files list the output differences between various configurations and the reference configuration, noflaw8.max.fix2. They are not expected to be self-explanatory in detail, but support the previous general statements.

Comments on specific difference lists

flaw8.max.fix1: There are no differences between fixed Pentium outputs, with or without workarounds.

noflaw.max.fix2: Two programs show differences due to differences in alignment. liu SP, a transcendental function test program, demonstrates a bug in the Intel log1pf function that causes it to access uninitialized storage; differences in alignment cause different storage to be accessed. spice3e2.ltra 3 probably demonstrates a bug in the spice3e2 circuit simulator related to uninitialized storage too. ucblibtest.lgamma demonstrates a bug in the Intel libm that sometimes causes lgamma(nan) to dump core.

noflaw8.max.flaw2: These differences arise due to the Pentium flaw. liu DP demonstrates an atan flaw. sucbdiv and ucbpla are sensitive division test programs designed to test rounding in nearly half-way cases, and to test SRT division, respectively.

noflaw8p.max.flaw2: These differences arise due to the Pentium flaw in the test programs and to differences in libm workarounds. Libm workarounds produce numerical differences in the slalom and savage benchmarks and elefunt test. The workaround libm introduces gratuitous underflow exceptions in atan2, which are reported by ucblibtest.

flaw8m.max.flaw1: These differences arise due to Pentium flaws in libm, and to the division workaround in the test programs. The doduc and hydro2d SPEC benchmarks show minor numerical differences, along with the EGYPT kernel from the Digital Review suite, launch-junction, and rollbar.
The division workaround causes ucbdivtest to fail in extended precision by one unit in the least significant bit rather than in single precision by ten units. ucbpitest QP, a trigonometric function test program, also shows very small differences due to the division workaround.

flaw8.max.flaw1: These differences remain after all workarounds are applied. They are just those that arose under noflaw8p and flaw8m above.

noflaw8.max.gateway: More extensive than any of the foregoing are the differences between the transcendental function implementations in 486 and Pentium. These affect all the application programs that noticed differences between flawed and unflawed Pentium. The ucbeeftest QP results indicate the benefit of the improved transcendentals in extended precision: worst atan error reduced from 1.3 units to 0.7 units; worst log error reduced from 1.1 to 1.0 units. The ucbflibtest.log2 results indicate an unexpected difference: the fyl2x instruction is exact for exact arguments (like log2(2)) on 486, but on Pentium it sets the inexact flag gratuitously, since the correct exact result is still computed.

Acknowledgements

Intel lent the four Pentium test systems fix1, fix2, flaw1, and flaw2, and provided copies of its libm.a with and without Pentium flaw workarounds.

Cygnus Software provided the patched version of GCC 2.6.3 supporting -mno-fpflaw and -mfpflaw at its public ftp site.

SunSoft Developer Products provided some of its compiler/hardware validation software, the Solaris 2.4 operating system with driver update 3, and the ProCompiler 2.0.1 C and libm.a, and made fdlibm available at netlib@research.att.com.

AT&T Bell Labs made the F2C Fortran-to-C translator available at netlib@research.att.com.

Advertisement

Studies like this one can be performed to evaluate comparative correctness and performance of computer system hardware, operating systems, compilers and libraries.
--
David Hough   dgh@validgh.com
Consultant on system correctness, performance evaluation, and IEEE 754 binary floating-point arithmetic --- Send for business announcement