This page reproduces the content of http://www.slideshare.net/NaokiShibata/efficient-evaluation-methods-of-elementary-functions-suitable-for-simd-computation (uploaded 2010/05/31).


Naoki Shibata : Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation, Computer Science - Research and Development (Proceedings of the International Supercomputing Conference ISC'10), Volume 25, Numbers 1-2, pp. 25-32, 2010, DOI: 10.1007/s00450-010-0108-2 (May 2010).

http://www.springerlink.com/content/340228x165742104/

http://freshmeat.net/projects/sleef

Data-parallel architectures like SIMD (Single Instruction Multiple Data) or SIMT (Single Instruction Multiple Thread) have been adopted in many recent CPU and GPU architectures. Although some SIMD and SIMT instruction sets include double-precision arithmetic and bitwise operations, there are no instructions dedicated to evaluating elementary functions like trigonometric functions in double precision. Thus, these functions have to be evaluated one by one using an FPU or using a software library. However, traditional algorithms for evaluating these elementary functions involve heavy use of conditional branches and/or table look-ups, which are not suitable for SIMD computation. In this paper, efficient methods are proposed for evaluating the sine, cosine, arc tangent, exponential and logarithmic functions in double precision without table look-ups, scattering from, or gathering into SIMD registers, or conditional branches. We implemented these methods using the Intel SSE2 instruction set to evaluate their accuracy and speed. The results showed that the average error was less than 0.67 ulp, and the maximum error was 6 ulps. The computation speed was faster than the FPUs on Intel Core 2 and Core i7 processors.

- Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation

Naoki Shibata

Shiga University

International Supercomputing Conference 2010

- Overview

• Methods for evaluating the following functions in double precision using SIMD instructions

– sin, cos, tan

– log, exp

– asin, acos, atan

• Fast

– Two times as fast as FPU evaluation

• Accurate

– Maximum error is within 6 ulps of the true value

- Overview (contd.)

• Advantages over existing methods

– No conditional branches

– No gathering/scattering

– No table look-ups

[Figure: x and y each pass through an identical series of SIMD operations, producing sin x and sin y.]

Apply exactly the same series of SIMD operations to x and y, and we get sin x and sin y.

- Outline

• Overview

• Background

– Care needed for SIMD optimization

– Related works

• Proposed method

– Trigonometric functions

– Inverse trigonometric functions

– Exponential function

– Logarithmic function

• Evaluation

– Accuracy

– Speed

– Code size

• Conclusion

- Background

• SIMD instructions are now pervasive

– SSE in x86 processors

– Altivec in Power / PowerPC processors

– NEON in ARM processors

– Cell Broadband Engine

– Many GPU models

• Length of SIMD registers is going to be extended

– 256 bits in Sandy Bridge

– 512 bits in Larrabee (or Knights Ferry?) GPUs

- Background (contd.)

• SIMD instruction sets do not include instructions for evaluating elementary functions.

– sin, cos, tan, log, exp, asin, acos, atan

• Two possibilities

– FPU

• Elementary functions are available only on limited architectures.

– Software library

• Many are not optimized for SIMD calculation

- Care needed in SIMD optimization

• We need special care for SIMD optimization

– Memory access and conditional branches are slow

• with modern processor models with long pipelines

– Table look-up is slow

– Gathering and scattering operations are slow

• Some traditionally slow operations are not slow anymore

– Division and square root can now be evaluated using one instruction each

– No extra registers needed

• We need new, specialized algorithms to make efficient use of the SIMD unit

- Gathering and scattering operations are slow

• We need to compose/decompose each element of a vector register

– At least one instruction is needed for each element

– SIMD ALU may be idle during this operation

• Register spills may happen

– This requires extra execution of instructions

– This may cause memory access

- Table look-up is slow

• Table look-up is frequently used in traditional implementations for evaluating elementary functions

– Both in HW and SW

[Figure: the elements x and y are scattered from the SIMD register, passed through the look-up table one at a time (this part must be repeated for each element), and the results x' and y' are gathered back.]

- Division and SQRT are not too slow

• 69 clocks of latency and throughput for 1 execution of the SQRTPD instruction[*]

• 69 clocks of latency and throughput for 1 execution of the DIVPD instruction[*]

• Memory access latency is 165 clocks (C2Q 9300, measured by CPU-Z)

• The good point is that they do not require extra instruction executions or registers

[*] Intel 64 and IA-32 Architectures Optimization Reference Manual

- Challenge

• Usually, evaluation of elementary functions is performed using calculation with extra precision

– e.g. the x86 FPU evaluates using 80-bit calculation and rounds the result to 64 bits.

• With extra-precision calculation, obtaining accurate output is not very hard.

• But in the proposed method, only 64-bit precision calculation is available

• Thus, an error in each step of evaluation leads to an error in the output

So, we cannot tolerate error in each step

- Related Works

• The GNU C Library and the FPU emulator in the Linux OS utilize conditional branches

– Thus, not suitable for SIMD computation

• There is much research on evaluating elementary functions in hardware

– Many of these methods utilize table look-ups

– Not suitable for SIMD computation

• There are several multiple-precision floating-point computation libraries

– Their design is very different from our implementation

- Outline

• Overview

• Background

– Care needed for SIMD optimization

– Related works

• Proposed method

– Trigonometric functions

– Inverse trigonometric functions

– Exponential function

– Logarithmic function

• Evaluation

– Accuracy

– Speed

– Code size

• Conclusion

- Trigonometric functions

• Finds sin x and cos x at the same time

• Consists of two steps

– Step 1 : The argument is reduced to within 0 and π/4, utilizing the symmetry and periodicity

– Step 2 : Evaluate the sin and cos functions on the reduced argument

• tan x can be evaluated by simply calculating (sin x / cos x)

- Reducing argument within 0 and π/4

• Given an argument d, find s and an integer q so that d = q · (π/4) + s and 0 <= s < π/4

• Is it as easy as just dividing d by π/4? – No!

• Finding q is easy, but s may be inaccurate because of cancellation error

– when d is a large number close to a multiple of π/4

• Some implementations, like the FPU on x86 processors, exhibit this problem

- Cancellation error

• Cancellation error happens when we subtract one number from another where

– the two numbers are very close

– both numbers have limited precision and are already rounded

• Suppose that we are calculating 1 – x with 5 digits of precision.

– The true value of x is 0.999985, which is rounded to 0.99999

– True value : 1 – 0.999985 = 0.000015

– Rounded : 1 – 0.99999 = 0.00001

– In this case, the result has only 1 digit of accuracy.

Loss of accuracy is caused by cancellation

- Solution for this problem

• Basic idea : calculate this part with extra precision, utilizing properties of IEEE 754

• q is assumed to be expressible in 26 bits, which is half of the mantissa

• π/4 is split into three parts, π/4 = A + B + C, so that all of the bits in the lower half of the mantissa of A and B are zero

- Solution for this problem (contd.)

• Multiplying these numbers by q does not produce any error

– because the lower halves of these numbers are zero

• Subtracting the resulting numbers from d in descending order does not produce cancellation error

– The results of each step are all accurate

• In this way, we can obtain s accurately.

- Step 2 in evaluating trigonometric functions

• Simply summing up the terms of the Taylor series does not produce an accurate result

– Because of accumulation of errors

• Using another kind of reduction is common

– Utilizing the triple angle formula of sine

– e.g. first divide x by 3, evaluate the Taylor series, and then apply the triple angle formula

– This speeds up calculation by reducing x

- Step 2 (contd.)

• This method is not good for our case.

– When sin x is small, the difference between two terms is so large that it produces rounding error

• So, we use the double angle formula of cosine instead: cos 2x = 2 cos² x – 1

• sin x can be found by the following formula: sin x = √(1 – cos² x)

- Step 2 (contd.)

• But we have another problem

– When x is close to 0, cos² x gets close to 1, and we have cancellation error

• So, we use the double angle formula of (1 – cos x) instead

• Let f(x) be (1 – cos x), and we get f(2x) = 2 f(x) (2 – f(x)) and sin x = √(f(x) (2 – f(x)))

• And these do not produce cancellation or rounding errors

- Inverse trigonometric functions

• We are going to find atan x, and obtain asin x and acos x from atan x

• Again, just evaluating terms of the Taylor series produces accumulation of errors

• There seems to be no good existing way of argument reduction for inverse trigonometric functions

– We propose a new way of argument reduction for atan

- New argument reduction for evaluating arc tangent (1/3)

• Suppose x = atan d, thus tan x = d

• If d <= 1, we evaluate the atan function on the argument 1/d instead of d

• Then subtract the result from π/2

• Now, we can assume that d > 1 and π/4 < x < π/2

- New argument reduction for evaluating arc tangent (2/3)

• We use the following formulas:

(13) cot x = 1 / tan x

(14) cot(x/2) = cot x + √(1 + cot² x)

• We use (13) to calculate the cotangent from the argument d

• Then we repeatedly use (14) to find the cotangent of (x / 2ⁿ)

– which is actually "enlarging" cot x

• By (13), enlarging cot x corresponds to reducing tan x

- New argument reduction for evaluating arc tangent (3/3)

• Suppose that atan 2 = θ, i.e. tan θ = 2

• Apply (13) to get cot θ = 1/2

• Apply (14) to get cot(θ/2) = 1/2 + √(1 + 1/4) ≈ 1.618

• Apply (13) to get tan(θ/2) ≈ 0.618

• 0.618 is less than 2, thus we have reduced the argument

(13) cot x = 1 / tan x    (14) cot(x/2) = cot x + √(1 + cot² x)

After argument reduction, we calculate the Taylor series of atan

- asin and acos functions

• We have a problem if we simply use the following formulas: asin x = atan(x / √(1 – x²)), acos x = atan(√(1 – x²) / x)

– These produce cancellation errors when |x| is close to 1

• This problem can be avoided by a small modification of the formulas, e.g. computing (1 – x)(1 + x) instead of 1 – x²

- Exponential function

• Consists of two steps

– Similar to trigonometric functions

• Step 1 : The argument is reduced to within 0 and logₑ 2

• Step 2 : Further reduce the argument, and evaluate the Taylor series

- Step 1

• We find s and an integer q so that d = q · logₑ 2 + s and 0 < s <= logₑ 2

• This step is very similar to the first step for trigonometric functions

• The same problem arises, and we solve it in the same way

- Step 2

• We can use the following formula to further reduce the argument: exp x = (exp(x/2))²

• And then evaluate the Taylor series

• But if x is close to 0, we will have rounding errors

– Since the difference between 1 and the other terms is large

• So, we find (exp x) – 1 instead of exp x

• And the remaining part is similar to sin and cos

- Logarithmic function

• We use the following series, rather than the Taylor series: log x = 2 (z + z³/3 + z⁵/5 + ...), where z = (x – 1)/(x + 1)

• This series converges faster than the Taylor series

• Just evaluating this series is enough.

• This series is well known.

- Outline

• Overview

• Background

– Care needed for SIMD optimization

– Related works

• Proposed method

– Trigonometric functions

– Inverse trigonometric functions

– Exponential function

– Logarithmic function

• Evaluation

– Accuracy

– Speed

– Code size

• Conclusion

- Evaluation

• Accuracy

– We compared the output of the proposed method with the output of the MPFR library

– We measured the evaluation accuracy within a few ranges for each function

• Speed

– We compared the speed of the proposed methods, an FPU, and the MPFR library

– We used a Core i7 920 (2.66GHz) and a Core 2 Duo E7200 (2.53GHz)

- Accuracy

Accuracy results for sin, cos and tan

Accuracy results for atan

Accuracy results for asin and acos

- Accuracy (contd.)

Accuracy results for exp

Accuracy results for log

In all cases, error does not exceed 6 ulps

- Speed

Proposed method is about two times as fast as FPU calculation

- Code size

• The total code size is very small

• Suitable for the Cell B.E.

– which has only 256K bytes of directly accessible scratch pad memory in each SPE.

- Conclusion

• We proposed efficient methods for evaluating elementary functions in double precision

• They do not include table look-ups, scattering from, or gathering into SIMD registers, or conditional branches.

• The average and maximum errors were less than 0.67 ulp and 6 ulps, respectively.

• The evaluation speed was faster than the FPUs on Intel Core 2 and Core i7 processors.

- Thank you

• An implementation of the proposed method is now available as public domain software.

• Contact

http://freshmeat.net/projects/sleef
