|
Preface |
14 |
|
|
List of Figures |
16 |
|
|
List of Tables |
19 |
|
|
I Introduction, Basic Definitions, and Standards |
22 |
|
|
1 Introduction |
23 |
|
|
1.1 Some History |
23 |
|
|
1.2 Desirable Properties |
26 |
|
|
1.3 Some Strange Behaviors |
27 |
|
|
1.3.1 Some famous bugs |
27 |
|
|
1.3.2 Difficult problems |
28 |
|
|
2 Definitions and Basic Notions |
33 |
|
|
2.1 Floating-Point Numbers |
33 |
|
|
2.2 Rounding |
40 |
|
|
2.2.1 Rounding modes |
40 |
|
|
2.2.2 Useful properties |
42 |
|
|
2.2.3 Relative error due to rounding |
43 |
|
|
2.3 Exceptions |
45 |
|
|
2.4 Lost or Preserved Properties of the Arithmetic on the Real Numbers |
47 |
|
|
2.5 Note on the Choice of the Radix |
49 |
|
|
2.5.1 Representation errors |
49 |
|
|
2.5.2 A case for radix 10 |
50 |
|
|
2.6 Tools for Manipulating Floating-Point Errors |
52 |
|
|
2.6.1 The ulp function |
52 |
|
|
2.6.2 Errors in ulps and relative errors |
57 |
|
|
2.6.3 An example: iterated products |
57 |
|
|
2.6.4 Unit roundoff |
59 |
|
|
2.7 Note on Radix Conversion |
60 |
|
|
2.7.1 Conditions on the formats |
60 |
|
|
2.7.2 Conversion algorithms |
63 |
|
|
2.8 The Fused Multiply-Add (FMA) Instruction |
71 |
|
|
2.9 Interval Arithmetic |
71 |
|
|
2.9.1 Intervals with floating-point bounds |
72 |
|
|
2.9.2 Optimized rounding |
72 |
|
|
3 Floating-Point Formats and Environment |
74 |
|
|
3.1 The IEEE 754-1985 Standard |
75 |
|
|
3.1.1 Formats specified by IEEE 754-1985 |
75 |
|
|
3.1.2 Little-endian, big-endian |
79 |
|
|
3.1.3 Rounding modes specified by IEEE 754-1985 |
80 |
|
|
3.1.4 Operations specified by IEEE 754-1985 |
81 |
|
|
3.1.5 Exceptions specified by IEEE 754-1985 |
85 |
|
|
3.1.6 Special values |
88 |
|
|
3.2 The IEEE 854-1987 Standard |
89 |
|
|
3.2.1 Constraints internal to a format |
89 |
|
|
3.2.2 Various formats and the constraints between them |
90 |
|
|
3.2.3 Conversions between floating-point numbers and decimal strings |
91 |
|
|
3.2.4 Rounding |
92 |
|
|
3.2.5 Operations |
92 |
|
|
3.2.6 Comparisons |
93 |
|
|
3.2.7 Exceptions |
93 |
|
|
3.3 The Need for a Revision |
93 |
|
|
3.3.1 A typical problem: ``double rounding'' |
94 |
|
|
3.3.2 Various ambiguities |
96 |
|
|
3.4 The New IEEE 754-2008 Standard |
98 |
|
|
3.4.1 Formats specified by the revised standard |
99 |
|
|
3.4.2 Binary interchange format encodings |
100 |
|
|
3.4.3 Decimal interchange format encodings |
101 |
|
|
3.4.4 Larger formats |
111 |
|
|
3.4.5 Extended and extendable precisions |
111 |
|
|
3.4.6 Attributes |
112 |
|
|
3.4.7 Operations specified by the standard |
116 |
|
|
3.4.8 Comparisons |
118 |
|
|
3.4.9 Conversions |
118 |
|
|
3.4.10 Default exception handling |
119 |
|
|
3.4.11 Recommended transcendental functions |
122 |
|
|
3.5 Floating-Point Hardware in Current Processors |
123 |
|
|
3.5.1 The common hardware denominator |
123 |
|
|
3.5.2 Fused multiply-add |
123 |
|
|
3.5.3 Extended precision |
123 |
|
|
3.5.4 Rounding and precision control |
124 |
|
|
3.5.5 SIMD instructions |
125 |
|
|
3.5.6 Floating-point on x86 processors: SSE2 versus x87 |
125 |
|
|
3.5.7 Decimal arithmetic |
126 |
|
|
3.6 Floating-Point Hardware in Recent GraphicsProcessing Units |
127 |
|
|
3.7 Relations with Programming Languages |
128 |
|
|
3.7.1 The Language Independent Arithmetic (LIA) standard |
128 |
|
|
3.7.2 Programming languages |
129 |
|
|
3.8 Checking the Environment |
129 |
|
|
3.8.1 MACHAR |
130 |
|
|
3.8.2 Paranoia |
130 |
|
|
3.8.3 UCBTest |
134 |
|
|
3.8.4 TestFloat |
135 |
|
|
3.8.5 IeeeCC754 |
135 |
|
|
3.8.6 Miscellaneous |
135 |
|
|
II Cleverly Using Floating-Point Arithmetic |
136 |
|
|
4 Basic Properties and Algorithms |
137 |
|
|
4.1 Testing the Computational Environment |
137 |
|
|
4.1.1 Computing the radix |
137 |
|
|
4.1.2 Computing the precision |
139 |
|
|
4.2 Exact Operations |
140 |
|
|
4.2.1 Exact addition |
140 |
|
|
4.2.2 Exact multiplications and divisions |
142 |
|
|
4.3 Accurate Computations of Sums of Two Numbers |
143 |
|
|
4.3.1 The Fast2Sum algorithm |
144 |
|
|
4.3.2 The 2Sum algorithm |
147 |
|
|
4.3.3 If we do not use rounding to nearest |
149 |
|
|
4.4 Computation of Products |
150 |
|
|
4.4.1 Veltkamp splitting |
150 |
|
|
4.4.2 Dekker's multiplication algorithm |
153 |
|
|
4.5 Complex numbers |
157 |
|
|
4.5.1 Various error bounds |
158 |
|
|
4.5.2 Error bound for complex multiplication |
159 |
|
|
4.5.3 Complex division |
162 |
|
|
4.5.4 Complex square root |
167 |
|
|
5 The Fused Multiply-Add Instruction |
169 |
|
|
5.1 The 2MultFMA Algorithm |
170 |
|
|
5.2 Computation of Residuals of Division and Square Root |
171 |
|
|
5.3 Newton--Raphson-Based Division with an FMA |
173 |
|
|
5.3.1 Variants of the Newton--Raphson iteration |
173 |
|
|
5.3.2 Using the Newton--Raphson iteration for correctly rounded division |
178 |
|
|
5.4 Newton--Raphson-Based Square Root with an FMA |
185 |
|
|
5.4.1 The basic iterations |
185 |
|
|
5.4.2 Using the Newton--Raphson iteration for correctly rounded square roots |
186 |
|
|
5.5 Multiplication by an Arbitrary-Precision Constant |
189 |
|
|
5.5.1 Checking for a given constant C if Algorithm 5.2 will always work |
190 |
|
|
5.6 Evaluation of the Error of an FMA |
193 |
|
|
5.7 Evaluation of Integer Powers |
195 |
|
|
6 Enhanced Floating-Point Sums, Dot Products, and Polynomial Values |
198 |
|
|
6.1 Preliminaries |
199 |
|
|
6.1.1 Floating-point arithmetic models |
200 |
|
|
6.1.2 Notation for error analysis and classical error estimates |
201 |
|
|
6.1.3 Properties for deriving validated running error bounds |
204 |
|
|
6.2 Computing Validated Running Error Bounds |
205 |
|
|
6.3 Computing Sums More Accurately |
207 |
|
|
6.3.1 Reordering the operands, and a bit more |
207 |
|
|
6.3.2 Compensated sums |
209 |
|
|
6.3.3 Implementing a ``long accumulator'' |
216 |
|
|
6.3.4 On the sum of three floating-point numbers |
216 |
|
|
6.4 Compensated Dot Products |
218 |
|
|
6.5 Compensated Polynomial Evaluation |
220 |
|
|
7 Languages and Compilers |
222 |
|
|
7.1 A Play with Many Actors |
222 |
|
|
7.1.1 Floating-point evaluation in programming languages |
223 |
|
|
7.1.2 Processors, compilers, and operating systems |
225 |
|
|
7.1.3 In the hands of the programmer |
226 |
|
|
7.2 Floating Point in the C Language |
226 |
|
|
7.2.1 Standard C99 headers and IEEE 754-1985 support |
226 |
|
|
7.2.2 Types |
227 |
|
|
7.2.3 Expression evaluation |
230 |
|
|
7.2.4 Code transformations |
233 |
|
|
7.2.5 Enabling unsafe optimizations |
234 |
|
|
7.2.6 Summary: a few horror stories |
235 |
|
|
7.3 Floating-Point Arithmetic in the C++ Language |
237 |
|
|
7.3.1 Semantics |
237 |
|
|
7.3.2 Numeric limits |
238 |
|
|
7.3.3 Overloaded functions |
239 |
|
|
7.4 FORTRAN Floating Point in a Nutshell |
240 |
|
|
7.4.1 Philosophy |
240 |
|
|
7.4.2 IEEE 754 support in FORTRAN |
243 |
|
|
7.5 Java Floating Point in a Nutshell |
244 |
|
|
7.5.1 Philosophy |
244 |
|
|
7.5.2 Types and classes |
245 |
|
|
7.5.3 Infinities, NaNs, and signed zeros |
247 |
|
|
7.5.4 Missing features |
248 |
|
|
7.5.5 Reproducibility |
249 |
|
|
7.5.6 The BigDecimal package |
250 |
|
|
7.6 Conclusion |
251 |
|
|
III Implementing Floating-Point Operators |
253 |
|
|
8 Algorithms for the Five Basic Operations |
254 |
|
|
8.1 Overview of Basic Operation Implementation |
254 |
|
|
8.2 Implementing IEEE 754-2008 Rounding |
256 |
|
|
8.2.1 Rounding a nonzero finite value with unbounded exponent range |
256 |
|
|
8.2.2 Overflow |
258 |
|
|
8.2.3 Underflow and subnormal results |
259 |
|
|
8.2.4 The inexact exception |
260 |
|
|
8.2.5 Rounding for actual operations |
260 |
|
|
8.3 Floating-Point Addition and Subtraction |
261 |
|
|
8.3.1 Decimal addition |
264 |
|
|
8.3.2 Decimal addition using binary encoding |
265 |
|
|
8.3.3 Subnormal inputs and outputs in binary addition |
266 |
|
|
8.4 Floating-Point Multiplication |
266 |
|
|
8.4.1 Normal case |
267 |
|
|
8.4.2 Handling subnormal numbers in binary multiplication |
267 |
|
|
8.4.3 Decimal specifics |
268 |
|
|
8.5 Floating-Point Fused Multiply-Add |
269 |
|
|
8.5.1 Case analysis for normal inputs |
269 |
|
|
8.5.2 Handling subnormal inputs |
273 |
|
|
8.5.3 Handling decimal cohorts |
274 |
|
|
8.5.4 Overview of a binary FMA implementation |
274 |
|
|
8.6 Floating-Point Division |
277 |
|
|
8.6.1 Overview and special cases |
277 |
|
|
8.6.2 Computing the significand quotient |
278 |
|
|
8.6.3 Managing subnormal numbers |
279 |
|
|
8.6.4 The inexact exception |
280 |
|
|
8.6.5 Decimal specifics |
280 |
|
|
8.7 Floating-Point Square Root |
280 |
|
|
8.7.1 Overview and special cases |
280 |
|
|
8.7.2 Computing the significand square root |
281 |
|
|
8.7.3 Managing subnormal numbers |
282 |
|
|
8.7.4 The inexact exception |
282 |
|
|
8.7.5 Decimal specifics |
282 |
|
|
9 Hardware Implementation of Floating-Point Arithmetic |
283 |
|
|
9.1 Introduction and Context |
283 |
|
|
9.1.1 Processor internal formats |
283 |
|
|
9.1.2 Hardware handling of subnormal numbers |
284 |
|
|
9.1.3 Full-custom VLSI versus reconfigurable circuits |
285 |
|
|
9.1.4 Hardware decimal arithmetic |
286 |
|
|
9.1.5 Pipelining |
287 |
|
|
9.2 The Primitives and Their Cost |
288 |
|
|
9.2.1 Integer adders |
288 |
|
|
9.2.2 Digit-by-integer multiplication in hardware |
294 |
|
|
9.2.3 Using nonstandard representations of numbers |
294 |
|
|
9.2.4 Binary integer multiplication |
295 |
|
|
9.2.5 Decimal integer multiplication |
297 |
|
|
9.2.6 Shifters |
298 |
|
|
9.2.7 Leading-zero counters |
298 |
|
|
9.2.8 Tables and table-based methods for fixed-point function approximation |
300 |
|
|
9.3 Binary Floating-Point Addition |
302 |
|
|
9.3.1 Overview |
302 |
|
|
9.3.2 A first dual-path architecture |
303 |
|
|
9.3.3 Leading-zero anticipation |
305 |
|
|
9.3.4 Probing further on floating-point adders |
309 |
|
|
9.4 Binary Floating-Point Multiplication |
310 |
|
|
9.4.1 Basic architecture |
310 |
|
|
9.4.2 FPGA implementation |
310 |
|
|
9.4.3 VLSI implementation optimized for delay |
312 |
|
|
9.4.4 Managing subnormals |
315 |
|
|
9.5 Binary Fused Multiply-Add |
316 |
|
|
9.5.1 Classic architecture |
317 |
|
|
9.5.2 To probe further |
319 |
|
|
9.6 Division |
319 |
|
|
9.6.1 Digit-recurrence division |
320 |
|
|
9.6.2 Decimal division |
323 |
|
|
9.7 Conclusion: Beyond the FPU |
323 |
|
|
9.7.1 Optimization in context of standard operators |
324 |
|
|
9.7.2 Operation with a constant operand |
325 |
|
|
9.7.3 Block floating point |
327 |
|
|
9.7.4 Specific architectures for accumulation |
327 |
|
|
9.7.5 Coarser-grain operators |
331 |
|
|
9.8 Probing Further |
334 |
|
|
10 Software Implementation of Floating-Point Arithmetic |
335 |
|
|
10.1 Implementation Context |
336 |
|
|
10.1.1 Standard encoding of binary floating-point data |
336 |
|
|
10.1.2 Available integer operators |
337 |
|
|
10.1.3 First examples |
340 |
|
|
10.1.4 Design choices and optimizations |
342 |
|
|
10.2 Binary Floating-Point Addition |
343 |
|
|
10.2.1 Handling special values |
344 |
|
|
10.2.2 Computing the sign of the result |
346 |
|
|
10.2.3 Swapping the operands and computing the alignment shift |
347 |
|
|
10.2.4 Getting the correctly rounded result |
349 |
|
|
10.3 Binary Floating-Point Multiplication |
355 |
|
|
10.3.1 Handling special values |
355 |
|
|
10.3.2 Sign and exponent computation |
357 |
|
|
10.3.3 Overflow detection |
359 |
|
|
10.3.4 Getting the correctly rounded result |
360 |
|
|
10.4 Binary Floating-Point Division |
363 |
|
|
10.4.1 Handling special values |
364 |
|
|
10.4.2 Sign and exponent computation |
365 |
|
|
10.4.3 Overflow detection |
368 |
|
|
10.4.4 Getting the correctly rounded result |
369 |
|
|
10.5 Binary Floating-Point Square Root |
375 |
|
|
10.5.1 Handling special values |
376 |
|
|
10.5.2 Exponent computation |
378 |
|
|
10.5.3 Getting the correctly rounded result |
379 |
|
|
IV Elementary Functions |
387 |
|
|
11 Evaluating Floating-Point Elementary Functions |
388 |
|
|
11.1 Basic Range Reduction Algorithms |
392 |
|
|
11.1.1 Cody and Waite's reduction algorithm |
392 |
|
|
11.1.2 Payne and Hanek's algorithm |
394 |
|
|
11.2 Bounding the Relative Error of Range Reduction |
395 |
|
|
11.3 More Sophisticated Range Reduction Algorithms |
397 |
|
|
11.3.1 An example of range reduction for the exponential function |
399 |
|
|
11.3.2 An example of range reduction for the logarithm |
400 |
|
|
11.4 Polynomial or Rational Approximations |
401 |
|
|
11.4.1 L2 case |
402 |
|
|
11.4.2 L, or minimax case |
403 |
|
|
11.4.3 ``Truncated'' approximations |
405 |
|
|
11.5 Evaluating Polynomials |
406 |
|
|
11.6 Correct Rounding of Elementary Functions tobinary64 |
407 |
|
|
11.6.1 The Table Maker's Dilemma and Ziv's onion peelingstrategy |
407 |
|
|
11.6.2 When the TMD is solved |
408 |
|
|
11.6.3 Rounding test |
409 |
|
|
11.6.4 Accurate second step |
413 |
|
|
11.6.5 Error analysis and the accuracy/performance tradeoff |
414 |
|
|
11.7 Computing Error Bounds |
415 |
|
|
11.7.1 The point with efficient code |
415 |
|
|
11.7.2 Example: a ``double-double'' polynomial evaluation |
416 |
|
|
12 Solving the Table Maker's Dilemma |
418 |
|
|
12.1 Introduction |
418 |
|
|
12.1.1 The Table Maker's Dilemma |
419 |
|
|
12.1.2 Brief history of the TMD |
423 |
|
|
12.1.3 Organization of the chapter |
424 |
|
|
12.2 Preliminary Remarks on the Table Maker's Dilemma |
425 |
|
|
12.2.1 Statistical arguments: what can be expected in practice |
425 |
|
|
12.2.2 In some domains, there is no need to find worst cases |
429 |
|
|
12.2.3 Deducing the worst cases from other functions or domains |
432 |
|
|
12.3 The Table Maker's Dilemma for Algebraic Functions |
433 |
|
|
12.3.1 Algebraic and transcendental numbers and functions |
433 |
|
|
12.3.2 The elementary case of quotients |
435 |
|
|
12.3.3 Around Liouville's theorem |
437 |
|
|
12.3.4 Generating bad rounding cases for the square root using Hensel 2-adic lifting |
438 |
|
|
12.4 Solving the Table Maker's Dilemma for Arbitrary Functions |
442 |
|
|
12.4.1 Lindemann's theorem: application to some transcendental functions |
442 |
|
|
12.4.2 A theorem of Nesterenko and Waldschmidt |
443 |
|
|
12.4.3 A first method: tabulated differences |
445 |
|
|
12.4.4 From the TMD to the distance between a grid and a segment |
447 |
|
|
12.4.5 Linear approximation: Lefèvre's algorithm |
449 |
|
|
12.4.6 The SLZ algorithm |
456 |
|
|
12.4.7 Periodic functions on large arguments |
461 |
|
|
12.5 Some Results |
462 |
|
|
12.5.1 Worst cases for the exponential, logarithmic, trigonometric, and hyperbolic functions |
462 |
|
|
12.5.2 A special case: integer powers |
471 |
|
|
12.6 Current Limits and Perspectives |
471 |
|
|
V Extensions |
473 |
|
|
13 Formalisms for Certifying Floating-Point Algorithms |
474 |
|
|
13.1 Formalizing Floating-Point Arithmetic |
474 |
|
|
13.1.1 Defining floating-point numbers |
475 |
|
|
13.1.2 Simplifying the definition |
477 |
|
|
13.1.3 Defining rounding operators |
478 |
|
|
13.1.4 Extending the set of numbers |
481 |
|
|
13.2 Formalisms for Certifying Algorithms by Hand |
482 |
|
|
13.2.1 Hardware units |
482 |
|
|
13.2.2 Low-level algorithms |
483 |
|
|
13.2.3 Advanced algorithms |
484 |
|
|
13.3 Automating Proofs |
485 |
|
|
13.3.1 Computing on bounds |
486 |
|
|
13.3.2 Counting digits |
488 |
|
|
13.3.3 Manipulating expressions |
490 |
|
|
13.3.4 Handling the relative error |
494 |
|
|
13.4 Using Gappa |
495 |
|
|
13.4.1 Toy implementation of sine |
495 |
|
|
13.4.2 Integer division on Itanium |
499 |
|
|
14 Extending the Precision |
503 |
|
|
14.1 Double-Words, Triple-Words… |
504 |
|
|
14.1.1 Double-word arithmetic |
505 |
|
|
14.1.2 Static triple-word arithmetic |
508 |
|
|
14.1.3 Quad-word arithmetic |
510 |
|
|
14.2 Floating-Point Expansions |
513 |
|
|
14.3 Floating-Point Numbers with Batched Additional Exponent |
519 |
|
|
14.4 Large Precision Relying on Processor Integers |
520 |
|
|
14.4.1 Using arbitrary-precision integer arithmetic for arbitrary-precision floating-point arithmetic |
522 |
|
|
14.4.2 A brief introduction to arbitrary-precision integer arithmetic |
523 |
|
|
VI Perspectives and Appendix |
527 |
|
|
15 Conclusion and Perspectives |
528 |
|
|
16 Appendix: Number Theory Tools for F-P Arithmetic |
529 |
|
|
16.1 Continued Fractions |
529 |
|
|
16.2 The LLL Algorithm |
532 |
|
|
Bibliography |
537 |
|
|
Index |
574 |
|