ciando eBooks - ein Service Ihrer Bibliothek

	Preface	14
	List of Figures	16
	List of Tables	19
	I Introduction, Basic Definitions, and Standards	22
	1 Introduction	23
	1.1 Some History	23
	1.2 Desirable Properties	26
	1.3 Some Strange Behaviors	27
	1.3.1 Some famous bugs	27
	1.3.2 Difficult problems	28
	2 Definitions and Basic Notions	33
	2.1 Floating-Point Numbers	33
	2.2 Rounding	40
	2.2.1 Rounding modes	40
	2.2.2 Useful properties	42
	2.2.3 Relative error due to rounding	43
	2.3 Exceptions	45
	2.4 Lost or Preserved Properties of the Arithmetic on the Real Numbers	47
	2.5 Note on the Choice of the Radix	49
	2.5.1 Representation errors	49
	2.5.2 A case for radix 10	50
	2.6 Tools for Manipulating Floating-Point Errors	52
	2.6.1 The ulp function	52
	2.6.2 Errors in ulps and relative errors	57
	2.6.3 An example: iterated products	57
	2.6.4 Unit roundoff	59
	2.7 Note on Radix Conversion	60
	2.7.1 Conditions on the formats	60
	2.7.2 Conversion algorithms	63
	2.8 The Fused Multiply-Add (FMA) Instruction	71
	2.9 Interval Arithmetic	71
	2.9.1 Intervals with floating-point bounds	72
	2.9.2 Optimized rounding	72
	3 Floating-Point Formats and Environment	74
	3.1 The IEEE 754-1985 Standard	75
	3.1.1 Formats specified by IEEE 754-1985	75
	3.1.2 Little-endian, big-endian	79
	3.1.3 Rounding modes specified by IEEE 754-1985	80
	3.1.4 Operations specified by IEEE 754-1985	81
	3.1.5 Exceptions specified by IEEE 754-1985	85
	3.1.6 Special values	88
	3.2 The IEEE 854-1987 Standard	89
	3.2.1 Constraints internal to a format	89
	3.2.2 Various formats and the constraints between them	90
	3.2.3 Conversions between floating-point numbers and decimal strings	91
	3.2.4 Rounding	92
	3.2.5 Operations	92
	3.2.6 Comparisons	93
	3.2.7 Exceptions	93
	3.3 The Need for a Revision	93
	3.3.1 A typical problem: ``double rounding''	94
	3.3.2 Various ambiguities	96
	3.4 The New IEEE 754-2008 Standard	98
	3.4.1 Formats specified by the revised standard	99
	3.4.2 Binary interchange format encodings	100
	3.4.3 Decimal interchange format encodings	101
	3.4.4 Larger formats	111
	3.4.5 Extended and extendable precisions	111
	3.4.6 Attributes	112
	3.4.7 Operations specified by the standard	116
	3.4.8 Comparisons	118
	3.4.9 Conversions	118
	3.4.10 Default exception handling	119
	3.4.11 Recommended transcendental functions	122
	3.5 Floating-Point Hardware in Current Processors	123
	3.5.1 The common hardware denominator	123
	3.5.2 Fused multiply-add	123
	3.5.3 Extended precision	123
	3.5.4 Rounding and precision control	124
	3.5.5 SIMD instructions	125
	3.5.6 Floating-point on x86 processors: SSE2 versus x87	125
	3.5.7 Decimal arithmetic	126
	3.6 Floating-Point Hardware in Recent GraphicsProcessing Units	127
	3.7 Relations with Programming Languages	128
	3.7.1 The Language Independent Arithmetic (LIA) standard	128
	3.7.2 Programming languages	129
	3.8 Checking the Environment	129
	3.8.1 MACHAR	130
	3.8.2 Paranoia	130
	3.8.3 UCBTest	134
	3.8.4 TestFloat	135
	3.8.5 IeeeCC754	135
	3.8.6 Miscellaneous	135
	II Cleverly Using Floating-Point Arithmetic	136
	4 Basic Properties and Algorithms	137
	4.1 Testing the Computational Environment	137
	4.1.1 Computing the radix	137
	4.1.2 Computing the precision	139
	4.2 Exact Operations	140
	4.2.1 Exact addition	140
	4.2.2 Exact multiplications and divisions	142
	4.3 Accurate Computations of Sums of Two Numbers	143
	4.3.1 The Fast2Sum algorithm	144
	4.3.2 The 2Sum algorithm	147
	4.3.3 If we do not use rounding to nearest	149
	4.4 Computation of Products	150
	4.4.1 Veltkamp splitting	150
	4.4.2 Dekker's multiplication algorithm	153
	4.5 Complex numbers	157
	4.5.1 Various error bounds	158
	4.5.2 Error bound for complex multiplication	159
	4.5.3 Complex division	162
	4.5.4 Complex square root	167
	5 The Fused Multiply-Add Instruction	169
	5.1 The 2MultFMA Algorithm	170
	5.2 Computation of Residuals of Division and Square Root	171
	5.3 Newton--Raphson-Based Division with an FMA	173
	5.3.1 Variants of the Newton--Raphson iteration	173
	5.3.2 Using the Newton--Raphson iteration for correctly rounded division	178
	5.4 Newton--Raphson-Based Square Root with an FMA	185
	5.4.1 The basic iterations	185
	5.4.2 Using the Newton--Raphson iteration for correctly rounded square roots	186
	5.5 Multiplication by an Arbitrary-Precision Constant	189
	5.5.1 Checking for a given constant C if Algorithm 5.2 will always work	190
	5.6 Evaluation of the Error of an FMA	193
	5.7 Evaluation of Integer Powers	195
	6 Enhanced Floating-Point Sums, Dot Products, and Polynomial Values	198
	6.1 Preliminaries	199
	6.1.1 Floating-point arithmetic models	200
	6.1.2 Notation for error analysis and classical error estimates	201
	6.1.3 Properties for deriving validated running error bounds	204
	6.2 Computing Validated Running Error Bounds	205
	6.3 Computing Sums More Accurately	207
	6.3.1 Reordering the operands, and a bit more	207
	6.3.2 Compensated sums	209
	6.3.3 Implementing a ``long accumulator''	216
	6.3.4 On the sum of three floating-point numbers	216
	6.4 Compensated Dot Products	218
	6.5 Compensated Polynomial Evaluation	220
	7 Languages and Compilers	222
	7.1 A Play with Many Actors	222
	7.1.1 Floating-point evaluation in programming languages	223
	7.1.2 Processors, compilers, and operating systems	225
	7.1.3 In the hands of the programmer	226
	7.2 Floating Point in the C Language	226
	7.2.1 Standard C99 headers and IEEE 754-1985 support	226
	7.2.2 Types	227
	7.2.3 Expression evaluation	230
	7.2.4 Code transformations	233
	7.2.5 Enabling unsafe optimizations	234
	7.2.6 Summary: a few horror stories	235
	7.3 Floating-Point Arithmetic in the C++ Language	237
	7.3.1 Semantics	237
	7.3.2 Numeric limits	238
	7.3.3 Overloaded functions	239
	7.4 FORTRAN Floating Point in a Nutshell	240
	7.4.1 Philosophy	240
	7.4.2 IEEE 754 support in FORTRAN	243
	7.5 Java Floating Point in a Nutshell	244
	7.5.1 Philosophy	244
	7.5.2 Types and classes	245
	7.5.3 Infinities, NaNs, and signed zeros	247
	7.5.4 Missing features	248
	7.5.5 Reproducibility	249
	7.5.6 The BigDecimal package	250
	7.6 Conclusion	251
	III Implementing Floating-Point Operators	253
	8 Algorithms for the Five Basic Operations	254
	8.1 Overview of Basic Operation Implementation	254
	8.2 Implementing IEEE 754-2008 Rounding	256
	8.2.1 Rounding a nonzero finite value with unbounded exponent range	256
	8.2.2 Overflow	258
	8.2.3 Underflow and subnormal results	259
	8.2.4 The inexact exception	260
	8.2.5 Rounding for actual operations	260
	8.3 Floating-Point Addition and Subtraction	261
	8.3.1 Decimal addition	264
	8.3.2 Decimal addition using binary encoding	265
	8.3.3 Subnormal inputs and outputs in binary addition	266
	8.4 Floating-Point Multiplication	266
	8.4.1 Normal case	267
	8.4.2 Handling subnormal numbers in binary multiplication	267
	8.4.3 Decimal specifics	268
	8.5 Floating-Point Fused Multiply-Add	269
	8.5.1 Case analysis for normal inputs	269
	8.5.2 Handling subnormal inputs	273
	8.5.3 Handling decimal cohorts	274
	8.5.4 Overview of a binary FMA implementation	274
	8.6 Floating-Point Division	277
	8.6.1 Overview and special cases	277
	8.6.2 Computing the significand quotient	278
	8.6.3 Managing subnormal numbers	279
	8.6.4 The inexact exception	280
	8.6.5 Decimal specifics	280
	8.7 Floating-Point Square Root	280
	8.7.1 Overview and special cases	280
	8.7.2 Computing the significand square root	281
	8.7.3 Managing subnormal numbers	282
	8.7.4 The inexact exception	282
	8.7.5 Decimal specifics	282
	9 Hardware Implementation of Floating-Point Arithmetic	283
	9.1 Introduction and Context	283
	9.1.1 Processor internal formats	283
	9.1.2 Hardware handling of subnormal numbers	284
	9.1.3 Full-custom VLSI versus reconfigurable circuits	285
	9.1.4 Hardware decimal arithmetic	286
	9.1.5 Pipelining	287
	9.2 The Primitives and Their Cost	288
	9.2.1 Integer adders	288
	9.2.2 Digit-by-integer multiplication in hardware	294
	9.2.3 Using nonstandard representations of numbers	294
	9.2.4 Binary integer multiplication	295
	9.2.5 Decimal integer multiplication	297
	9.2.6 Shifters	298
	9.2.7 Leading-zero counters	298
	9.2.8 Tables and table-based methods for fixed-point function approximation	300
	9.3 Binary Floating-Point Addition	302
	9.3.1 Overview	302
	9.3.2 A first dual-path architecture	303
	9.3.3 Leading-zero anticipation	305
	9.3.4 Probing further on floating-point adders	309
	9.4 Binary Floating-Point Multiplication	310
	9.4.1 Basic architecture	310
	9.4.2 FPGA implementation	310
	9.4.3 VLSI implementation optimized for delay	312
	9.4.4 Managing subnormals	315
	9.5 Binary Fused Multiply-Add	316
	9.5.1 Classic architecture	317
	9.5.2 To probe further	319
	9.6 Division	319
	9.6.1 Digit-recurrence division	320
	9.6.2 Decimal division	323
	9.7 Conclusion: Beyond the FPU	323
	9.7.1 Optimization in context of standard operators	324
	9.7.2 Operation with a constant operand	325
	9.7.3 Block floating point	327
	9.7.4 Specific architectures for accumulation	327
	9.7.5 Coarser-grain operators	331
	9.8 Probing Further	334
	10 Software Implementation of Floating-Point Arithmetic	335
	10.1 Implementation Context	336
	10.1.1 Standard encoding of binary floating-point data	336
	10.1.2 Available integer operators	337
	10.1.3 First examples	340
	10.1.4 Design choices and optimizations	342
	10.2 Binary Floating-Point Addition	343
	10.2.1 Handling special values	344
	10.2.2 Computing the sign of the result	346
	10.2.3 Swapping the operands and computing the alignment shift	347
	10.2.4 Getting the correctly rounded result	349
	10.3 Binary Floating-Point Multiplication	355
	10.3.1 Handling special values	355
	10.3.2 Sign and exponent computation	357
	10.3.3 Overflow detection	359
	10.3.4 Getting the correctly rounded result	360
	10.4 Binary Floating-Point Division	363
	10.4.1 Handling special values	364
	10.4.2 Sign and exponent computation	365
	10.4.3 Overflow detection	368
	10.4.4 Getting the correctly rounded result	369
	10.5 Binary Floating-Point Square Root	375
	10.5.1 Handling special values	376
	10.5.2 Exponent computation	378
	10.5.3 Getting the correctly rounded result	379
	IV Elementary Functions	387
	11 Evaluating Floating-Point Elementary Functions	388
	11.1 Basic Range Reduction Algorithms	392
	11.1.1 Cody and Waite's reduction algorithm	392
	11.1.2 Payne and Hanek's algorithm	394
	11.2 Bounding the Relative Error of Range Reduction	395
	11.3 More Sophisticated Range Reduction Algorithms	397
	11.3.1 An example of range reduction for the exponential function	399
	11.3.2 An example of range reduction for the logarithm	400
	11.4 Polynomial or Rational Approximations	401
	11.4.1 L2 case	402
	11.4.2 L, or minimax case	403
	11.4.3 ``Truncated'' approximations	405
	11.5 Evaluating Polynomials	406
	11.6 Correct Rounding of Elementary Functions tobinary64	407
	11.6.1 The Table Maker's Dilemma and Ziv's onion peelingstrategy	407
	11.6.2 When the TMD is solved	408
	11.6.3 Rounding test	409
	11.6.4 Accurate second step	413
	11.6.5 Error analysis and the accuracy/performance tradeoff	414
	11.7 Computing Error Bounds	415
	11.7.1 The point with efficient code	415
	11.7.2 Example: a ``double-double'' polynomial evaluation	416
	12 Solving the Table Maker's Dilemma	418
	12.1 Introduction	418
	12.1.1 The Table Maker's Dilemma	419
	12.1.2 Brief history of the TMD	423
	12.1.3 Organization of the chapter	424
	12.2 Preliminary Remarks on the Table Maker's Dilemma	425
	12.2.1 Statistical arguments: what can be expected in practice	425
	12.2.2 In some domains, there is no need to find worst cases	429
	12.2.3 Deducing the worst cases from other functions or domains	432
	12.3 The Table Maker's Dilemma for Algebraic Functions	433
	12.3.1 Algebraic and transcendental numbers and functions	433
	12.3.2 The elementary case of quotients	435
	12.3.3 Around Liouville's theorem	437
	12.3.4 Generating bad rounding cases for the square root using Hensel 2-adic lifting	438
	12.4 Solving the Table Maker's Dilemma for Arbitrary Functions	442
	12.4.1 Lindemann's theorem: application to some transcendental functions	442
	12.4.2 A theorem of Nesterenko and Waldschmidt	443
	12.4.3 A first method: tabulated differences	445
	12.4.4 From the TMD to the distance between a grid and a segment	447
	12.4.5 Linear approximation: Lefèvre's algorithm	449
	12.4.6 The SLZ algorithm	456
	12.4.7 Periodic functions on large arguments	461
	12.5 Some Results	462
	12.5.1 Worst cases for the exponential, logarithmic, trigonometric, and hyperbolic functions	462
	12.5.2 A special case: integer powers	471
	12.6 Current Limits and Perspectives	471
	V Extensions	473
	13 Formalisms for Certifying Floating-Point Algorithms	474
	13.1 Formalizing Floating-Point Arithmetic	474
	13.1.1 Defining floating-point numbers	475
	13.1.2 Simplifying the definition	477
	13.1.3 Defining rounding operators	478
	13.1.4 Extending the set of numbers	481
	13.2 Formalisms for Certifying Algorithms by Hand	482
	13.2.1 Hardware units	482
	13.2.2 Low-level algorithms	483
	13.2.3 Advanced algorithms	484
	13.3 Automating Proofs	485
	13.3.1 Computing on bounds	486
	13.3.2 Counting digits	488
	13.3.3 Manipulating expressions	490
	13.3.4 Handling the relative error	494
	13.4 Using Gappa	495
	13.4.1 Toy implementation of sine	495
	13.4.2 Integer division on Itanium	499
	14 Extending the Precision	503
	14.1 Double-Words, Triple-Words…	504
	14.1.1 Double-word arithmetic	505
	14.1.2 Static triple-word arithmetic	508
	14.1.3 Quad-word arithmetic	510
	14.2 Floating-Point Expansions	513
	14.3 Floating-Point Numbers with Batched Additional Exponent	519
	14.4 Large Precision Relying on Processor Integers	520
	14.4.1 Using arbitrary-precision integer arithmetic for arbitrary-precision floating-point arithmetic	522
	14.4.2 A brief introduction to arbitrary-precision integer arithmetic	523
	VI Perspectives and Appendix	527
	15 Conclusion and Perspectives	528
	16 Appendix: Number Theory Tools for F-P Arithmetic	529
	16.1 Continued Fractions	529
	16.2 The LLL Algorithm	532
	Bibliography	537
	Index	574