tango.math.internal.BignumX86

Optimised asm arbitrary precision arithmetic ('bignum') routines for X86 processors.
All functions operate on arrays of uints, stored LSB first. If there is a destination array, it will be the first parameter. Currently, all of these functions are subject to change, and are intended for internal use only. The symbol [#] indicates an array of machine words which is to be interpreted as a multi-byte number.
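The calling convention described above (arrays of 32-bit words, least significant first, destination as the first parameter) can be illustrated with a minimal portable C sketch of an add-with-carry loop. The name `bignum_add` and its signature are hypothetical, chosen only for illustration; the module's real routines are hand-written x86 asm and may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the module's actual API.
   Computes dest[0..n) = a[0..n) + b[0..n) + carry, where index 0 holds
   the least significant word. Returns the carry out of the top word. */
uint32_t bignum_add(uint32_t *dest, const uint32_t *a, const uint32_t *b,
                    size_t n, uint32_t carry)
{
    for (size_t i = 0; i < n; ++i) {
        /* A 64-bit intermediate keeps the carry that a 32-bit add would lose. */
        uint64_t sum = (uint64_t)a[i] + b[i] + carry;
        dest[i] = (uint32_t)sum;          /* low 32 bits are the result word */
        carry = (uint32_t)(sum >> 32);    /* high bits are the carry (0 or 1) */
    }
    return carry;
}
```

The asm versions would typically use the CPU's add-with-carry instruction instead of a 64-bit intermediate, but the word order and parameter order are the same idea.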

License:

BSD style: see license.txt

Authors:

Don Clugston

In simple terms, there are 3 modern x86 microarchitectures: (a) the P6 family (Pentium Pro, PII, PIII, PM, Core), produced by Intel; (b) the K6, Athlon, and AMD64 families, produced by AMD; and (c) the Pentium 4, produced by Marketing.

This code has been optimised for the Intel P6 family. The code generally remains near-optimal for Intel Core2 after translating EAX -> RAX, etc., since all these CPUs use essentially the same pipeline and are typically limited by memory access. The code uses techniques described in Agner Fog's superb Pentium manuals, available at www.agner.org. It is not optimised for AMD CPUs, which can do two memory loads per cycle (Intel CPUs can only do one). Despite this, performance is superior on AMD. Performance is dreadful on P4.

Timing results (cycles per int)

              --Intel Pentium--   --AMD--
              PM      P4    Core2     K7
+,-           2.25   15.6    2.25     1.5
<<,>>         2.0     6.6    2.0      5.0
(<< MMX)      1.7     5.3    1.5      1.2
*             5.0    15.0    4.0      4.3
mulAdd        5.7    19.0    4.9      4.0
div          30.0    32.0   32.0     22.4
mulAcc(32)    6.5    20.0    5.4      4.9

mulAcc(32) is multiplyAccumulate() for a 32*32 multiply, so it includes function call overhead. The timing for div is quite unpredictable, but it's probably too slow to be useful. On 64-bit processors, these times should halve if run in 64-bit mode, except for the MMX functions.
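To make the mulAdd and mulAcc rows more concrete, here is a rough C sketch of what such routines might compute. It assumes that mulAdd means "multiply an array by a single word and add into the destination", and that a full multiply is built by calling it once per word of the second operand. The names `bignum_mul_add` and `bignum_mul`, and their signatures, are hypothetical and are not taken from the module.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dest[0..n) += src[0..n) * multiplier + carry.
   Words are stored least significant first. Returns the final carry word. */
uint32_t bignum_mul_add(uint32_t *dest, const uint32_t *src, size_t n,
                        uint32_t multiplier, uint32_t carry)
{
    for (size_t i = 0; i < n; ++i) {
        /* Worst case is (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so this fits. */
        uint64_t t = (uint64_t)src[i] * multiplier + dest[i] + carry;
        dest[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    return carry;
}

/* Hypothetical schoolbook multiply: dest[0..na+nb) = a[0..na) * b[0..nb).
   dest must not overlap a or b. */
void bignum_mul(uint32_t *dest, const uint32_t *a, size_t na,
                const uint32_t *b, size_t nb)
{
    for (size_t i = 0; i < na + nb; ++i)
        dest[i] = 0;
    for (size_t j = 0; j < nb; ++j) {
        /* Row j accumulates a * b[j], shifted up by j words; the carry
           out lands in the first word above the row, which is still zero. */
        dest[na + j] = bignum_mul_add(dest + j, a, na, b[j], 0);
    }
}
```

Under this reading, a 32*32 multiply makes one call to the inner routine per word of the second operand, which is where the function call overhead in the mulAcc(32) timing would come from.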