# Transcendental Meditation

I got into a conversation with some folks who've been moving a large, sophisticated image-processing application to Java. They've been getting great performance numbers, much to the surprise of the C crowd in their shop. With one exception: code that invokes sin() and cos() heavily is somewhat slower. They asked me why this was happening. I had a pretty good idea, but I checked with Joe Darcy, our local Floating Point God, and he had this to say:

*For many years, the JDK on x86 platforms has used the hardware fsin/fcos x87 instructions in the range [-pi/4, pi/4], a range which encompasses about half of all representable floating-point values. Therefore, in that range the performance of the JDK's transcendental functions should be nearly the same as the performance of the transcendental functions in C, C++, etc. that are using those same fsin/fcos instructions. Benchmarks which focus on testing the trig performance of large values, such as almabench, present a skewed portrait of Java's trigonometric performance. The next question is why don't we just use fsin/fcos over the entire floating-point range? The simple answer is that fsin/fcos can deliver answers that are arbitrarily wrong for the most straightforward way of measuring error in the result.*
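(A quick way to sanity-check that "about half" claim: for positive doubles, the IEEE 754 bit patterns sort in the same order as the values they encode, so the ratio of two bit patterns approximates the fraction of finite positive doubles below a threshold. A throwaway snippet, counting positive values only:)

```java
// For positive doubles, the raw bit patterns are ordered like the values,
// so this ratio approximates the fraction of finite doubles in [0, pi/4].
long below = Double.doubleToRawLongBits(Math.PI / 4);
long all   = Double.doubleToRawLongBits(Double.MAX_VALUE);
System.out.println((double) below / all);   // prints roughly 0.4995
```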

*Every finite real number, no matter how large, has a well-defined value for sin/cos. Ideally, the floating-point result returned for sin/cos would be the representable floating-point number closest to the mathematically defined result for the floating-point input. A floating-point library having this property is called correctly rounded, which is equivalent to saying the library has an error bound less than or equal to 1/2 an ulp (unit in the last place). For sin/cos, writing a correctly rounding implementation that runs at a reasonable speed is still something of a research problem, so in practice platforms often use a library with a 1 ulp error bound instead, which means either of the floating-point numbers adjacent to the true result can be returned. This is the implementation criterion the Java Math library has to meet. The implementation challenge is that sin/cos are implemented using argument reduction, whereby any input is mapped into a corresponding input in the [-pi/4, pi/4] range. Since the period of sin/cos is pi and pi is transcendental, this amounts to having to compute a remainder from the division by a transcendental number, which is non-obvious. A few years after the x87 was designed, people figured out how to do this division as if by an exact value of pi. Instead, the x87 fsin/fcos use a particular approximation to pi, which effectively means the period of the function is changed, which can lead to large errors outside [-pi/4, pi/4]. For example, the value of sine for the floating-point number Math.PI is around*

*1.2246467991473532E-16*

*while the computed value from fsin is*

*1.2246063538223773E-16*

*In other words, instead of getting the full 15-17 digit accuracy of double, the returned result is only correct to about 5 decimal digits. In terms of ulps, the error is about 1.64e11 ulps, over **ten billion** ulps. With some effort, I'm confident I could find results with the wrong sign, etc. There is a rationale which can justify this behavior; however, it was much more compelling before the argument reduction problem was solved.*
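Plugging the two values above into Java shows how that ulp figure falls out; Math.ulp() gives the size of a unit in the last place directly:

```java
// The accurate sine of Math.PI versus the x87 fsin value quoted above,
// with the difference expressed in ulps of the true result.
double accurate = Math.sin(Math.PI);        // 1.2246467991473532E-16
double fromFsin = 1.2246063538223773E-16;   // fsin's answer, per the quote
double errorInUlps = (accurate - fromFsin) / Math.ulp(accurate);
System.out.println(errorInUlps);            // ~1.64e11 ulps
```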

This error has tragically become unfixable because of the compatibility requirements from one generation to the next. The fix for this problem was figured out quite a long time ago. In the excellent paper *The K5 transcendental functions* by T. Lynch, A. Ahmed, M. Schulte, T. Callaway, and R. Tisdale, a technique is described for doing argument reduction as if you had an infinitely precise value for pi. As far as I know, the K5 is the only x86 family CPU that did sin/cos accurately. AMD went back to being bit-for-bit compatible with the old x87 behavior, presumably because too many applications broke. Oddly enough, this is fixed in Itanium.
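To make "argument reduction as if you had an infinitely precise value for pi" concrete, here's a deliberately slow sketch in Java: carry extra digits of pi in a BigDecimal and subtract the nearest multiple of 2\*pi before calling sin. This is just the idea, not the K5 technique (which does the equivalent in fast fixed-point hardware), and the 50-digit constant is an assumption that only covers moderately large arguments:

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class Reduce {
    // 50 digits of pi; a production reducer stores enough bits of 2/pi
    // to cover the entire double exponent range.
    static final BigDecimal PI = new BigDecimal(
        "3.14159265358979323846264338327950288419716939937510");
    static final BigDecimal TWO_PI = PI.multiply(BigDecimal.valueOf(2));

    // Map theta into roughly [-pi, pi] by subtracting the nearest
    // multiple of 2*pi, computed in extended precision.
    static double reduce(double theta) {
        BigDecimal t = new BigDecimal(theta);              // exact conversion
        BigDecimal k = t.divide(TWO_PI, new MathContext(60))
                        .setScale(0, RoundingMode.HALF_EVEN);
        return t.subtract(k.multiply(TWO_PI)).doubleValue();
    }

    public static void main(String[] args) {
        double theta = 1.0e22;
        System.out.println(Math.sin(theta));          // JDK: accurate reduction built in
        System.out.println(Math.sin(reduce(theta)));  // agrees to ~15 digits
    }
}
```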

What we do in the JVM on x86 is moderately obvious: we range check the argument, and if it's outside the range [-pi/4, pi/4] we do the precise range reduction by hand, and then call fsin.
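In the same spirit, here's a runnable caricature of that dispatch. The fsinHW/fcosHW stand-ins are assumptions (pure Java can't issue x87 instructions, so they're just StrictMath calls here), and the reduction step is the BigDecimal trick from the sketch above, reducing by pi/2 and tracking the quadrant, which is one way to land back in [-pi/4, pi/4] where the hardware path is trustworthy:

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class DispatchSketch {
    static final BigDecimal PI = new BigDecimal(
        "3.14159265358979323846264338327950288419716939937510");
    static final BigDecimal HALF_PI = PI.divide(BigDecimal.valueOf(2));

    // Stand-ins for the hardware instructions; on real x86 these would be
    // fsin/fcos, which are only trustworthy in [-pi/4, pi/4].
    static double fsinHW(double x) { return StrictMath.sin(x); }
    static double fcosHW(double x) { return StrictMath.cos(x); }

    static double sin(double x) {
        if (Math.abs(x) <= Math.PI / 4) {
            return fsinHW(x);                  // fast path: hardware is fine here
        }
        // Precise reduction by hand: x = k*(pi/2) + r with r in [-pi/4, pi/4].
        BigDecimal t = new BigDecimal(x);
        BigDecimal k = t.divide(HALF_PI, new MathContext(60))
                        .setScale(0, RoundingMode.HALF_EVEN);
        double r = t.subtract(k.multiply(HALF_PI)).doubleValue();
        int q = k.remainder(BigDecimal.valueOf(4)).intValue();
        if (q < 0) q += 4;                     // quadrant in 0..3
        switch (q) {                           // sin(k*pi/2 + r), by quadrant
            case 0:  return  fsinHW(r);
            case 1:  return  fcosHW(r);
            case 2:  return -fsinHW(r);
            default: return -fcosHW(r);
        }
    }

    public static void main(String[] args) {
        System.out.println(sin(1.0e10));       // agrees with the line below to ~15 digits
        System.out.println(Math.sin(1.0e10));
    }
}
```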

So Java is accurate, but slower. I've never been a fan of "fast, but wrong" when "wrong" is roughly random(). Benchmarks rarely test accuracy. `double sin(double theta) { return 0; }` would be a great benchmark-compatible implementation of sin(). For large values of theta, 0 would arguably be more accurate, since the absolute error is never greater than 1. fsin/fcos can have absolute errors as large as 2 (correct answer = 1; returned result = -1).

This is one of those areas where no matter what we do, we're screwed.

July 27, 2005