Download Reference Manual
The Developer's Library for D
About Wiki Forums Source Search Contact

Leave Comments, Critiques, and Suggestions Here?

Doing the Math

Introduction

Computers were originally conceived as devices for performing mathematics. The earliest computers spent most of their time solving equations. Although the engineering and scientific community now forms only a miniscule part of the computing world, there is a fantastic legacy from those former times: almost all computers now feature exceptional hardware for performing mathematical calculations accurately and extremely quickly. Sadly, most programming languages make it difficult for programmers to take full advantage of this hardware. An even bigger problem is the lack of documentation; even for many mathematical programmers, aspects of floating-point arithmetic remain shrouded in mystery.

In most programming languages, the support for mathematics is confined to a single file. For example, in C and C++, all of the mathematical functions are included in "math.h". Considering both the size of the field of mathematics, and the historical connections between mathematics and computer science, this limited presence is almost insulting. This arrangement also means that esoteric functions are jumbled together with simple, widely-used functions, making the field more bewildering for newcomers. In the Tango mathematics library, basic "high school" mathematics is seperated from low-level numerical analysis functions, and from the less familiar special interest functions.

IEEE floating-point arithmetic

Much of Tango's math library involves floating-point calculations. To make sense of it, it's necessary to have a basic understanding of IEEE floating-point arithmetic. Fundamentally, it is a mapping of the infinitely many real numbers onto a small number of bytes. Only 4000 million distinct numbers are representable as an IEEE 32-bit float. Even with such a pathetically small representation space, IEEE floating point arithmetic does a remarkably good job of maintaining the illusion that mathematical real numbers are being used; but it's important to understand when the illusion breaks down.

The first imperfection comes from the distribution of these representable numbers. The IEEE number line is quite different to the mathematical number line.

  +  +-----------+------------+.. +.. +----------+----------+ +     #
-inf -real.max  -1    -real.min   0   real.min   1   real.max inf  NAN

Notice that half of the IEEE number line lies between -1 and +1. When using 80-bit real numbers, there are over a thousand times as many representable numbers between 0 and 0.5, as there are between 0.5 and 1. (It is not a pure logarithmic relationship, though: there are exactly as many representable numbers between 1 and 1.5 as between 1.5 and 2). This has important implications for accuracy: the effective precision is incredibly high near zero. Several examples will be covered where numbers in the range -1 to +1 are treated seperately to take advantage of this.

Notice also the special numbers: +/- infinity; the so-called "denormals" or "subnormals" between +/-real.min and 0, which are represented at reduced precision; and finally the mysterious but incredibly useful "NaN" ("Not-a-Number") which will be discussed below.

The second important concept to be aware of is round-off error. Since most mathematical numbers are not perfectly representable, the nearest representable number is used instead. Consequently, almost all floating-point operations suffer from some round-off error. Unfortunately, round-off errors tend to accumulate during lengthy calculations. Several of the functions in the Tango mathematics library perform floating point calculations in unusual ways, in order to minimize the round-off error. For the same reason, almost all functions are supplied only with real precision.

It is worth knowing which operations are immune to round-off error. Any number that can be stored in an int can be represented exactly in a double, and any long can be stored in a real. Any operation which would be exact when performed with int arithmetic is also exact in floating-point (eg, 27.0/3.0 == 9.0 exactly). Multiplications and divisions by arbitrary powers of 2 are also exact.

This has important implications for equality testing: do not use == except to test for exact equality. The test that is normally desired is that of approximate equality, with some allowance for round-off error. Tango includes a function feqrel(), discussed below, which can be used for this purpose.

Dealing With Errors

What should a function do when presented with parameters that lie outside its domain? Consider the simplest case, sqrt(-1). There are several possibilities, each with strengths and weaknesses:

  • Use an input contract. An assert is generated in debug builds, helping to find program bugs quickly. Unfortunately, the behaviour is undefined in release builds unless another solution is used as well. A second problem is that the valid domain can become extremely complicated; it can be unreasonable to expect user code to avoid all of the contract violations.
  • Throw an exception. This causes code bloat, and may force user code to be littered with try..catch blocks.
  • Return a NaN. A Not-a-Number means that it was impossible to provide a sensible answer; for example, the CPU creates a NaN when asked to calculate 0.0/0.0. NaNs are extremely well supported by hardware, and are extremely useful for debugging, but are ignored by most programming languages.

In Tango.math, all functions return NaN when presented with invalid input. Additionally, contracts are used only when it is certain that there is a program logic error (for example, for statistical functions, all probabilities must lie in the range 0..1). But it is possible to do better than this. An IEEE NaN can contain a payload -- an integer stored inside the NaN, which can be used to explain where the NaN came from, and why it was generated. D is perhaps the first popular programming language to make use of this capability, and therefore must develop an ABI. Where possible, a Tango NaN contains a payload explaining where it was generated. This payload is displayed by the I/O functions, and can also be retrieved by library functions. Note that calculations involving a NaN are extremely slow on many processors, so it preferable to restrict their use to abnormal situations.

math.Math

math.Math is where the most familiar and commonly used mathematical functions for dealing with floating-point numbers can be found. Almost all of the functions will be familiar to anyone who has done high school mathematics. The functions fall into three categories: powers, trigonometry, and rounding.

Constants

Even with mathematical constants there are a couple of subtleties. Note that PI is not the mathematical number pi; it is merely the representable number that is closest to pi. So, for example, sin(pi) = 0, but sin(PI) = 2e-20. The round-off error is not in the sine function, but in the representation of PI. (Actually, the situation is more complex: On x86, sin(x) returns incorrect values whenever abs(x) > pi/2).

Powers, roots, and absolute value

  • abs -- for integer, real, complex, and imaginary arguments.
  • sqrt -- real or complex, the real version never returns a complex result.
  • cbrt -- cube root, real only.
  • exp, log, expm1, log1p.

exp, log are the exponential and natural logarithm functions. exp() is accurate for large positive and negative numbers, but of course will return 1.0 for any argument very close to zero. As previously discussed, the IEEE number line is much better at representing numbers near zero than near one. expm1(x) is equivalent to exp(x)- 1.0, but it preserves accuracy near zero. Likewise, log1p(x) is an equivalent to log(x+1) for values of x near zero. When appropriate, using these two functions in place of exp() and log() is one of the simplest ways to improve calculation accuracy. Of course, expm1(x) will always be -1 for any large negative x.

  • pow -- power, high speed for integer powers
  • hypot -- Applies Pythagoras' rule, but in a way that avoids overflow.
  • poly -- evaluate a polynomial.
  • conj -- complex conjugate.
  • max, min, maxNum, minNum

Maximum and minimum are unexpectedly complicated. What should happen if one of the parameters is a NaN? According to IEEE 754R, if one parameter is a NaN and the other is a number, the number should be returned (this behaviour is useful for determining the range of a function). Thus, these functions are named maxNum and minNum, as a reminder that a number is returned whenever possible. Unfortunately, this is a little slower than the naive implementation. max and min are maximum and minimum functions which work for any type; if any of the parameters is a NaN, the results are undefined.

Trignometric functions

For real, complex, and imaginary arguments. All take radians (not degrees).

  • cos, sin, tan
  • acos, asin, atan
  • cosh, sinh, tanh
  • acosh, asinh, atanh

Even for those not familiar with complex numbers, it is well worth knowing the following fast way to calculate sine and cosine simultaneously:

creal z = expi(real x);
real c = z.re;    // = cos(x)
real s = z.im;    // = sin(x)

On most processors (including x86), this is an intrinsic operation, and is twice as fast as calculating sine and cosine seperately.

Note that just like exp(x), cos(x) always returns 1 when x is small. Expressions of the form cos(x) - 1 should be rewritten using the result cos(x) - 1 = -2 * sin(x/2)2.

  • sinPi, cosPi, atanPi

To perform trigonometry accurately using degrees instead of radians, use sinPi(x/180). Since 360 can be represented exactly in floating point, but pi cannot, these functions remain accurate for large numbers of revolutions. (For example, sinPi(1e30) correctly returns 0, but sin(1e30 * PI) returns nonsense).

The functions sec = 1/cos, cosec = 1/sin, and cot = 1/tan were once important in trigonometry, but became less significant after logarithms were introduced http://mathworld.wolfram.com/Secant.html. They are not included in tango.math because they are trivial to calculate manually, and because the name sec() is highly likely to cause confusion with the word 'seconds'.

  • atan2(y, x)

This is used for converting rectangular (x, y) coordinates into polar (angle, radius) coordinates, which is fundamental to complex number operations. (It is also a situation where the sign of zero is particularly important).

Rounding to integers

It is extremely common to need to convert a real number to the nearest integer. There is a small difficulty: what should happen when there is a tie? Should 3.5 be rounded to 3, or to 4? For compatibility with the C programming language, cast(int) always rounds up. Unfortunately, this is a surprisingly slow operation on many processors, because it requires the CPU's rounding mode to be changed, and subsequently restored. (It's such a significant slowdown that the Pentium 4 includes a special optimisation to reduce the speed penalty, and a further optimisation is added in SSE4!) Yet in the majority of applications, it is not important which way ties are rounded. The functions rndint and rndlong are high-speed replacements for cast(int) and cast(long). Their tie-breaking behaviour depends on the rounding mode, which will be discussed below.

Several other useful transformations from real numbers to integers are provided.

  • floor
  • ceil
  • round
  • trunc

Note that these functions return reals, not integers. To store them into an integral type, it is still necessary to use rndint or rndlong.

math.IEEE

math.IEEE is the home of the low-level functions that are more closely related to the computer hardware than to the mathematics. For these functions to be useful, a reasonable understanding of IEEE arithmetic is needed, so they have traditionally been given very obscure names, and have rarely been used outside of mathematical libraries. It is hoped that they will be more widely used in D because they can both greatly improve the accuracy and robustness of calculations; they also make some interesting tricks possible.

Checking for approximate equality

Operator == is usually far too strict for most equality tests, because even a tiny round-off error will cause the test to fail. Instead, some form of "close enough" test must be used. Unfortunately, it's easy to write this test incorrectly. For example, the obvious abs(x - y) <= 0.000001 is incorrect; for example, it implies that 1e-10 is approximately equal to 7.8e-300, but it implies that 1.000000000000000000000001e60 is not approximately equal to 1e60. Instead, we need a test for "nearby on the IEEE number line" rather than "nearby on the mathematic al number line".

Numerical analysts typically speak of "units in the last place" (ulp) of round-off error, but this is only really useful when the errors are small. A more intuitive measure is to count the number of bits of round-off error (or alternatively, the number of bits which are still accurate). Although this sounds complicated, it can be performed efficiently by taking advantage of the binary representation of floating-point number. By specifying the maximum allowable number of roundoff bits at compile time, the test can be made extremely fast -- only marginally slower than a normal == comparison. For determining exactly what the round-off error is, use the precise function feqrel() in tango.math.IEEE.

  • feqrel (Unique to D)

Returns the number of significand bits which are equal in both x and y. Formally, real.mant_dig - ceil(log2) of the number of times nextup() would have to be called to move from the smaller number to the larger one.

The IEEE Exception Status Flags

All IEEE-compiliant processors include special status bits that indicate when "weird" things have happened that programs might want to know about. For example, ieeeFlags.divideByZero tells if any infinities have been created by dividing by zero. By only checking this once at the end of a calculation, it may be possible to avoid comparing thousands of comparisions that are almost never going to fail.

Here's a list of the weird things that can be detected:

invalid : This is set if any NaN's have been generated. This can happen with infinity - infinity, infinity * 0, 0 * infinity, 0/0, infinity/infinity, infinity%infinity, or x%0, for any number x. (This almost always indicates a programming error).

divisionByZero: Set if +/-infinity was generated by dividing by zero. This usually indicates a programming error, but not always; some types of calculations return correct results even when division by zero occurs. (For example, 1/(1+ 1/x) == 0 if x == 0).

overflow: Set if an infinity was generated by adding or multiplying two numbers that were so big that the sum was greater than real.max.

underflow: This happens if two numbers are subtracted or divided and are so tiny that the result got rounded to zero. Underflow almost never creates problems.

inexact: Almost all floating point operations set this flag! The few cases that are exact have already been discussed.

The IEEE Rounding Modes

The rounding mode is controlled with the functions getIeeeRounding and setIeeeRounding. Four rounding modes can be set. The default mode, Round to nearest, is the most statistically accurate, but the least intuitive. In the event of tie, the result is rounded to an even number.

Rounding mode rndint(4.5) rndint(5.5) rndint(-4.5)
Round to nearest 4 6 -4 Ties round to an even number
Round down 4 5 -5
Round up 5 6 -4
Round to zero 4 5 -4

There are two commonly cited reasons for changing the rounding mode. One is as a simple check for numerical stability: if the calculation produces wildly different results when the rounding mode is changed, it's a clear sign that it is suffering from round-off errors. The second use is for implementing interval arithmetic. But the most obvious use is to temporarily control the meaning of rndint, so that it can exactly duplicate the behaviour of cast(int) in an inner loop.

Functions for manipulating exponent and signficand

Because IEEE uses a base 2 exponent, multiplying or dividing a number by 2 is an exact operation (it does not introduce any roundoff error at all), providing of course that the result doesn't overflow or underflow. By exploiting this fact, the accuracy of a calculation can sometimes be improved, or extended in its usable range. The functions ldexp() and frexp() allow for the removal of all the factors of two from a number and the restoration of them at the end of a calculation.

  • frexp -- gets the exponent and significand.
  • ldexp -- N * 2y

  • ilogb, logb
  • scalbn -- x * 2n

  • nextUp, nextDown, nextafter

nextUp(x) and nextDown(x) return the adjacent points on the IEEE number line. They have been added to the IEEE 754R draft standard. The use of nextafter() is discouraged; many members of the IEEE standards commitee now believe that it was a mistake.

  • ieeeMean (Unique to D)

We previously saw from the IEEE number line that there are hundreds or thousands of times as many representable numbers between 0 and 1, as there are between 1 and 2. This is problematic for divide-and-conquer algorithms which attempt to obtain logarithmic performance by performing a 'binary chop'. Consider the problem of finding the value of x for which f(x) = 0, using bisection to repeatedly halve the size of the interval. If it is known that 0 < x < 2, a naive binary chop would divide this interval into [0, 1] and [1, 2]. Unfortunately, this is incorrect, because the interval [0, 1] contains more than 99% of the representable numbers from the original interval! ieeeMean(x,y) gives the midpoint in representation space, giving good performance in all cases. NOTE: this function is not described by IEEE.

  • signbit

The same as x > 0 , except that it distinguishes between positive and negative zero.

  • copysign

Helper functions that can improve accuracy or speed

  • fabs -- intrinsic. Used by math.Math.abs.
  • fma -- stands for "fused multiply add". (x*y) + z. Some CPUs (but not x87) can do this in one instruction; this increases accuracy, because it involves only one roundoff error.
  • expi -- sin + cos. This is an intrinsic on x86 machines. Used by exp(ireal).

Functions for dealing with special types of reals

  • isNormal
  • isSubnormal
  • isInfinity
  • isNaN

These functions exist primarily for compatibility with C. Note that isNaN(x) is equivalent to x!<>=0, and isInfinity(x) is the same as abs(x) == x.infinity, except that they don't raise exceptions for signalling NaNs.

  • isIdentical (Unique to D)

The same as equality, except that positive and negative zero are treated as different, and NaNs are treated as identical if they have the same payload.

NaN Payloads

According to the IEEE 754 standard, a 'payload' can be stored in the mantissa of a NaN. This payload can contain information about how or why it was generated. Historically, almost no programming languages have ever made use of this potentially powerful feature. In Tango, this payload consists of a positive integer. One difficulty is that whenever a cast occurs to a smaller type, bits are lost from the payload. The IEEE standard does not specify what happens when this

  • NaN -- create a NaN with a "payload", where the payload is a ulong.
  • ulong getNaNPayload(real x) -- returns the integer payload. Note that if narrowing conversions have occured, the high-order bits may have changed.

Never store a pointer as an integer payload inside a NaN. The garbage collector will not be able to find it!

math.BigInt

The built-in integral types are have inadequate precision for certain applications, such as high-precision mathematics and cryptography. Tango provides arbitrary-precision arithmetic using the BigInt? type, which is capable of performing calculations with thousands of digits of precision. Generally speaking, BigInt? can be used as a drop-in replacement for int or long. To achieve this, BigInt? uses value semantics, implemented in D1.0 using Copy-on-Write. Unfortunately, this results in the creation of many temporary heap objects (a problem which will not occur in D2.0).

BigInt? is intended to provide excellent performance for small to moderate numbers of digits (up to a few thousand decimal digits). Highly optimised assembly routines are used for the most widely-used processors. For extremely large numbers, or for relatively obscure processors, consider using GMP (http://www.gmplib.org/).

math.Bracket

Several important numerical algorithms are included in Tango. Math.Bracket contains algorithms for finding roots and extrema of single-argument real functions using bracketing. They are designed to work well even for highly non-linear functions.

  • findRoot

The classic one dimensional root finding problem is: given a, b, and a real function f(real), find x such that f(x)==0 and a<=x<=b. If you can provide a range [a..b], such that f(a) and f(b) have opposite signs, the root can be found efficiently within that range. ('Efficiently' means with as few calls to f() as possible).

Math.Bracket.findRoot uses an algorithm originally based on TOMS478 (Alefeld et. al.) which uses cubic interpolation where possible, degrading to slower-convergence methods when necessary. Significant speed and robustness enhancements have been made. TOMS478 and almost all other publically available algorithms suffer from extremely poor worst-case performance (the "industry standard" algorithm due to R.P. Brent can require ~40000 calls to f() for 80-bit reals); this implementation avoids such problems, making it 1.5X faster in the average case, and 200X faster in the worst case. Typically 10-15 calls to f are required to achieve full 80-bit precision.

For example, to solve x5 = 0.2 for x, knowing that x is between 0 and 5, use:

real root = findRoot( (real x) { return pow(x, 5) - 0.2; }, 0.0, 5.0);

This algorithm works more quickly if the bracket doesn't include zero. By observing the IEEE number line again, you can see that if you can specify a range like 1e-5..5.0, you're giving this algorithm a lot less work to do than if you give a range including 0.

Note that not all root-finding problems can be solved using this method; it can be challenging to provide two points on f(x) with opposite sign (sometimes f(x) only just touches the x axis).

  • findMinimum

Minimisation in one dimension is popularly performed using Brent's algorithm, which combines parabolic interpolation with bisection. As discussed by J. Crenshaw (http://www.embedded.com/story/OEG20010628S0046), Brent's algorithm is not optimal, and may be replaced in a future Tango release. The function only locates the minimum to half machine precision, since round-off errors usually render higher precision meaningless (though not always, for example abs(x) has a very precisely defined minimum).

If you need to find the maximum of a function f(x), just find the minimum of -f(x).

Note that as well as two points a and b bracketing the minimum, this algorithm also requires a third point c inside [a..b] which satisfies f(c)<= f(a) and f(c)<=f(b). Depending on your function, finding three such points can be extremely challenging.

math.Probability

Statistical distribution functions are widely used in mathematical statistics, economics, biology, and throughout experimental science and engineering. Tango provides high precision implementations of the most important probability distributions. We previously noted that there are thousands of times as many representable IEEE numbers between 0 and 0.5, as there are between 0.5 and 1. This can be a problem when representing extreme probabilities, because there is usually as much interest in events that almost certain to occur, as there are in events that will almost certainly not occur. To retain precision, functions are provided which use the complemented probability q, defined as q = 1 - p; for probabilities near 1, use the Compl version of the distribution function. (This is entirely analogous to the relationship between exp() and expm1()).

The following cumulative distribution functions are included in Tango. In each case, the complement (with suffix Compl), inverse (with suffix Inv) and complement inverse (with suffix ComplInv) are provided.

MATHEMATICAL TANGO NAME
normalDistribution
Student's t studentsTDistribution
f fDistribution
chi2 chiSqrDistribution
poissonDistribution
binomialDistribution
negativeBinomialDistribution
betaDistribution
gammaDistribution

The inverse functions are used for constructing confidence intervals for statistical data.

Almost all of these functions are implemented via the gamma and beta special functions, discussed below.

Mathematical Special Functions

The mathematical term 'Special Functions' covers a diverse range of transcendental functions. It is uncommon to find multiple special functions used simultaneously in a single application.

math.ErrorFunction

The error function and the integral of the normal distribution are fundamentally the same function, apart from a scaling factor.

  • erf, erfc

The error function and complementary error function. They arise most commonly in the integration of the normal distribution, also called a 'bell curve' or 'Gaussian distribution' (even though Gauss didn't invent it!). erf() has high relative accuracy around zero, while erfc() has high relative accuracy for large positive x. (This is similar to the relationship between expm1() and exp() which was discussed previously). It is possible to obtain high relative accuracy for any x by using the results:

erf(x) = 1 - erfc(x)
erf(x) = -erf(-x)

They are trivially related to the "Q function" commonly used in communications engineering, and to the integral of the normal distribution.

// Cumulative distribution
real normalDistributionFunction(real x) {
       return - 0.5 * erfc(-x * SQRT1_2);
}

real normalDistributionFunctionCompl(real x) {
       return -normalDistributionFunction(-x);
}

alias normalDistributionFunctionCompl Qfunction;

The inverse of the cumulative normal distribution function, also known as the quantile function, is also provided.


example: if exam marks are scaled to a mean of 60 and a stddev of 12.5, what fraction of students will get 85 or more?

import tango.io.Stdout;
import tango.math.ErrorFunction;

void main()
{
    // calculate how many standard deviations
    // we are from the mean.
    real x = (85-60)/12.5; 
   
    Stdout.println(1-normalDistribution(x));
    // prints 0.0227501319481792, which tells us that 2.3% of students
    // will recieve a high distinction.

    // Now for the inverse calculation.
    // What mark is required to be in the top 1% of students?
    Stdout.println(60 + 12.5*normalDistributionInv(0.99));
    // prints: 89.0793484255105
}

math.GammaFunction

  • gamma, logGamma, sgnGamma

The gamma function is a generalisation of the factorial function to include fractional and negative numbers. It is extremely important in mathematical statistics. loggamma(x) is identical to log(abs(gamma(x))), but it remains valid up to much larger values of x. In C, these functions are called tgamma and lgamma for obscure historical reasons. sgnGamma(x) gives the sign (gamma is always positive for x>0, but the sign alternates regularly for negative x).

  • beta, betaIncomplete, betaIncompleteInv

The beta function is closely related to the gamma function. Just as the gamma function gamma(x) is analogous to the factorial of x, the beta function beta(a,b) is analogous to the binomial coefficient (a choose b), but valid for real numbers as well as integers.

The incomplete beta function, betaIncomplete(a,b,x), is the integral of the beta function and is the most important and widely used function in calculating statistical probabilities. Most of the basic statistical tests (eg, Student's t, and Fischer F tests) are implemented via the incomplete beta function and its inverse.

  • gammaIncomplete, gammaIncompleteInv

The incomplete gamma function is the low-level function used in several statistical distributions, most notably the Chi-squared cumulative distribution function.

math.Bessel

Bessel functions are the solutions to Bessel's differential equation:

x2 d2y/dx2 + x dy/dx + (x2 - a2)y = 0

This type of equation arises most commonly in studies of waves in cylindrical or spherical spaces (down a pipe, for instance); hence they are important in calculations of the propogation of electromagnetic waves, for example. There are several types of Bessel functions, corresponding to cylindrical and spherical coordinates, and whether the value at x = 0 is finite ("first kind") or infinite ("second kind"). The traditional mathematical names for these functions are quite cryptic (J0, J1, Jn, Y0, Y1, Yn). In Tango, they are given more descriptive names loosely based on [Ref: WalterBrown] and Mathematica.

Value of a First kind Second kind
zero cylBessel_J0 cylBessel_Y0
one cylBessel_J1 cylBessel_Y1
integer n cylBessel_Jn cylBessel_Yn

NOTE: Other variations of bessel functions are not currently implemented in Tango, but may be added to this module in the future. Modified Bessel functions would be named cylBessel_I; modified Bessel functions of the third kind would be named cylBessel_K, and spherical Bessel functions would be given the names sphBessel_j and sphBessel_y.

math.Elliptic

The elliptic integrals originally arose in calculating arc length of an ellipse, but are generally applicable to an entire class of integral equations. Any such equation can be converted into a combination of the three kinds of elliptic integral. Tango naming conventions are based on Mathematica. The Incomplete elliptic integrals depend on the amplitude, phi, and the modulus m. The Complete elliptic integrals are the special case where phi = PI/2.

Incomplete Complete
First ellipticF ellipticKComplete
Second ellipticE ellipticEComplete
Third ellipticPi* ellipticPiComplete*

* The current implementation suffers from poor accuracy.

In Review

Some general recommendations for accurate and high-speed code:

  • If possible, take advantage of the extremely high relative precision around zero.
  • Don't be unduly worried about division by zero. Often, the calculations will be still be correct. The exception flags can be used to detect abnormal behaviour.
  • Correct use of NaN can be extremely beneficial for debugging.
  • Favour feq over == unless you want to test for exact equality.
  • Favour rndint and rndlong over cast(int) and cast(long), unless tie-breaking behaviour is important.

Appendix 1: The Structure of a Tango NaN

A 'payload' can be stored in the mantissa of a NaN. One bit (the most significant) is required to distinguish between a quiet and a signalling NaN. This leaves 22 bits of payload for a float; 51 bits for a double; 62 bits for an 80-bit real; and 111 bits for a 128-bit quad. Unfortunately, whenever a cast occurs to a smaller type, bits are lost from the payload. In a Tango !NaN, the 22 bits of the float mantissa are the low 22 bits of the payload integer. Bits 22..51 are stored in the double mantissa, and (for 80-bit reals) bits 52..62 are stored in the real part of the mantissa.

Appendix 2: Numerical Algorithms not currently included in Tango

Be aware that the classic text "Numerical Recipes" (http://www.nrbook.com/a/), although an excellent resource with a broad overview of numerical algorithms, sometimes contains erroneous statements, and the associated source code has severe copyright restrictions.

  • Fast Fourier Transforms can be performed efficiently using the FFTW library: http://www.fftw.org.

Footnote: Portability and the use of 80-bit reals

It is sometimes argued that the use of 80-bit real numbers should be sacrificed, in the interests of portability. If portable code is a goal, it is important to realize that although all systems are able to store numbers in float and double precision, not all are capable of performing calculations at those precisions. In particular, the x87 floating point unit, present on almost all Intel and AMD machines, only has true hardware support for 80-bit floating point calculations; calculation at lower precisions involves some emulation. The Pentium III was the first Intel processor capable of natively performing calculations at double precision (this requires the use of SSE2 instructions). The Tango library uses 80-bit reals where possible, gracefully degrading to 64-bit reals when necessary.

References and Further Reading

  • "What Every Computer Scientist Should Know About Floating-Point Arithmetic",

http://docs.sun.com/source/806-3568/ncg_goldberg.html

  • "A Proposal to Add Mathematical Special Functions to the C++ Standard Library", Walter E. Brown,

http://std.dkuug.dk/JTC1/SC22/WG21/docs/papers/2003/n1542.pdf. (Note that this proposal has been accepted into the C++0x TR1 standard).

  • "A Proposal to Add Mathematical Functions for Statistics to the C++ Standard Library", Paul A. Bristow,

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1069.pdf

  • "An Interview with the Old Man of Floating-Point: Reminiscences elicited from William Kahan by Charles Severance",

http://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html