This page reproduces the contents of http://www.slideshare.net/insideHPC/unum-computing-an-energy-efficient-and-massively-parallel-approach-to-valid-numerics.

Uploaded more than a year ago (2015/03/01) in Technology

In this deck, John Gustafson presents: An Energy Efficient and Massively Parallel Approach to Valid Numerics.

"Written by one of the foremost experts in high-performance computing and the inventor of Gustafson’s Law, The End of Error: Unum Computing explains a new approach to computer arithmetic: the universal number (unum). The unum encompasses all IEEE floating-point formats as well as fixed-point and exact integer arithmetic. This new number type obtains more accurate answers than floating-point arithmetic yet uses fewer bits in many cases, saving memory, bandwidth, energy, and power."

Watch the video presentation: http://wp.me/p3RLHQ-dTk

- AN ENERGY-EFFICIENT AND MASSIVELY PARALLEL APPROACH TO VALID NUMERICS

John L. Gustafson, Ph.D.

CTO, Ceranovo

Director, Massively Parallel Technologies, Etaphase, Clustered Systems

Copyright © 2014, 2015 John L. Gustafson

- Big problems facing computing

• Too much energy and power needed per calculation

• More hardware parallelism than we know how to use

• Not enough bandwidth (the “memory wall”)

• Rounding errors more treacherous than people realize

• Rounding errors prevent use of parallel methods

• Sampling errors turn physics simulations into guesswork

• Numerical methods are hard to use, require experts

• IEEE floats give different answers on different platforms

- The ones vendors care most about

• Too much energy and power needed per calculation

• More hardware parallelism than we know how to use

• Not enough bandwidth (the “memory wall”)

• Rounding errors more treacherous than people realize

• Rounding errors prevent use of parallel methods

• Sampling errors turn physics simulations into guesswork

• Numerical methods are hard to use, require experts

• IEEE floats give different answers on different platforms

- Too much power and heat needed

• Huge heat sinks

• 20 MW limit for exascale

• Data center electric bills

• Mobile device battery life

• Heat intensity means bulk

• Bulk increases latency

• Latency limits speed

- More parallel hardware than we can use

• Huge clusters usually partitioned into 10s, 100s of cores

• Few algorithms exploit millions of cores except LINPACK

• Capacity is not a substitute for capability!

- Not enough bandwidth (“Memory wall”)

Operation                  | Energy consumed | Time needed
64-bit multiply-add        | 200 pJ          | 1 nsec
Read 64 bits from cache    | 800 pJ          | 3 nsec
Move 64 bits across chip   | 2000 pJ         | 5 nsec
Execute an instruction     | 7500 pJ         | 1 nsec
Read 64 bits from DRAM     | 12000 pJ        | 70 nsec

Notice that 12000 pJ @ 3 GHz = 36 watts!
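
The 36-watt figure is energy per operation multiplied by operations per second. A minimal sketch of that arithmetic, assuming one such operation is issued on every cycle of a 3 GHz clock (the helper name `power_watts` is illustrative and does not come from the slides):

```python
# Energy per operation in picojoules, copied from the table on this slide.
ENERGY_PJ = {
    "64-bit multiply-add": 200,
    "Read 64 bits from cache": 800,
    "Move 64 bits across chip": 2000,
    "Execute an instruction": 7500,
    "Read 64 bits from DRAM": 12000,
}


def power_watts(energy_pj: float, rate_hz: float) -> float:
    """Power drawn if an operation costing energy_pj runs rate_hz times per second."""
    return energy_pj * 1e-12 * rate_hz


if __name__ == "__main__":
    # Assumption: one operation per clock cycle at 3 GHz.
    for name, pj in ENERGY_PJ.items():
        print(f"{name:26s} {power_watts(pj, 3e9):6.1f} W")
```

On the same assumption, the multiply-add itself costs well under a watt, which is the slide's point: moving the operands costs far more than computing with them.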

One-size-fits-all overkill 64-bit precision wastes energy, storage, bandwidth

- Happy 100th Birthday, Floating Point

1914: Torres proposes automatic computing with a fraction and a scaling factor.

2014: We still use a format designed for World War I hardware capabilities!

- Floats designed for visible scratch work

• OK for manual calculations

• Operator sees, remembers errors

• Can head off overflow, underflow

• Automatic math hides all that

• No one sees processor “flags”

• Disobeys algebraic laws

• Wastes bit patterns as NaNs

• IEEE 754 “standard” is really the IEEE 754 guideline; optional rules spoil consistent results

- Analogy: Printing in 1970 vs. 2014

1970: 30 sec per page

2013: 30 sec per page

Faster technology is for better prints, not thousands of low-quality prints per second.

Why not do the same thing with computer arithmetic?

- This is just… sad.
- Floats prevent use of parallelism

• No associative property for floats

• (a + b) + (c + d) (parallel) ≠ ((a + b) + c) + d (serial)

• Looks like a “wrong answer”

• Programmers trust serial, reject parallel
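
The serial-versus-parallel mismatch is easy to reproduce with ordinary IEEE doubles. The sketch below uses four values chosen for illustration (they are not from the slides) and compares the two groupings against `math.fsum`, which returns the correctly rounded sum:

```python
import math

# Illustrative inputs: large terms that cancel, plus two small terms.
a, b, c, d = 1e16, 1.0, -1e16, 1.0

serial = ((a + b) + c) + d      # left-to-right, as a serial loop would sum
pairwise = (a + b) + (c + d)    # tree-shaped, as a parallel reduction would sum
exact = math.fsum([a, b, c, d])  # correctly rounded reference

if __name__ == "__main__":
    print("serial  :", serial)
    print("pairwise:", pairwise)
    print("exact   :", exact)
```

Here neither grouping matches the reference, and the two groupings disagree with each other, so a programmer comparing a parallel run against a trusted serial run sees a "wrong answer" either way.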

• IEEE floats report rounding, overflow, underflow in processor register bits that no one ever sees.

- A New Number Format: The Unum

• Universal numbers

• Superset of IEEE types, both 754 and 1788

• Integers → floats → unums

• No rounding, no overflow to ∞, no underflow to zero

• They obey algebraic laws!

• Safe to parallelize

• Fewer bits than floats

• But… they’re new

• Some people don’t like new

“You can’t boil the ocean.” (Former Intel exec, when shown the unum idea)

- Three ways to express a big number

Avogadro’s number: ~6.022×10^23 atoms or molecules

Sign-Magnitude Integer (80 bits):

0 1111111100001010101001111010001010111111010100001001010011000000000000000000000

Fields: sign | lots of digits

IEEE Standard Float (64 bits):

0 10001001101 1111111000010101010011110100010101111110101000010011

Fields: sign | exponent (scale) | fraction

Unum (29 bits):

0 11001101 111111100001 1 111 1011

Fields: sign | exp. | frac. | ubit | exp. size | frac. size

Self-descriptive “utag” bits (ubit, exp. size, frac. size) track and manage uncertainty, exponent size, and fraction size

- Why unums use fewer bits than floats

• Exponent smaller by about 5 – 10 bits, typically

• Trailing zeros in fraction compressed away, saves ~2 bits

• Shorter strings for more common values

• Cancellation removes bits and the need to store them
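
To make the 29-bit string for Avogadro's number concrete, here is a sketch that decodes its six fields with exact rational arithmetic. It rests on assumptions the slides do not spell out: each size field stores the field width minus one, the exponent bias is 2^(es-1) - 1 as in IEEE formats, and a set ubit means "the open interval up to the next representable value". The function name `decode_unum` is hypothetical, and subnormal and infinite patterns are not handled.

```python
from fractions import Fraction


def decode_unum(sign, exp, frac, ubit, esize_field, fsize_field):
    """Decode a normalized unum given as bit strings; returns (lower, upper)."""
    es = int(esize_field, 2) + 1   # assumption: field stores width minus one
    fs = int(fsize_field, 2) + 1   # assumption: field stores width minus one
    if len(exp) != es or len(frac) != fs:
        raise ValueError("field widths disagree with the utag")
    e = int(exp, 2)
    if e == 0:
        raise ValueError("subnormal values are not handled in this sketch")
    bias = 2 ** (es - 1) - 1       # assumption: IEEE-style exponent bias
    scale = Fraction(2) ** (e - bias)
    lower = (1 + Fraction(int(frac, 2), 2 ** fs)) * scale
    # ubit set: the value lies strictly between lower and lower + one ULP.
    upper = lower + scale / 2 ** fs if ubit == "1" else lower
    if sign == "1":
        lower, upper = -upper, -lower
    return lower, upper


if __name__ == "__main__":
    fields = ("0", "11001101", "111111100001", "1", "111", "1011")
    lo, hi = decode_unum(*fields)
    print("total bits:", sum(len(f) for f in fields))
    print(f"open interval: ({float(lo):.6e}, {float(hi):.6e})")
```

Under these assumptions the string denotes an open interval about one part in 8000 wide that contains 6.022×10^23, rather than a single rounded point.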

IEEE Standard Float (64 bits):

0 10001001101 1111111000010101010011110100010101111110101000010011

Unum (29 bits):

0 11001101 111111100001 1 111 1011

- Open ranges, as well as exact points

Figure panels: “Bit string meanings using IEEE Float rules” and “Bit string meanings in unum format”

Complete representation of all real numbers using a finite number of bits

- Ubounds are the hull of 1 or 2 unums

• Includes closed, open, half-open intervals

• Includes ±∞, empty set, quiet and signaling NaN

• Unlike traditional intervals, ubounds are closed and lossless under set operations

- The three layers of computing

Diagram labels: grammar rules for exact, inexact; high-efficiency floats and real intervals; limited set of fused operations

- The Warlpiri unums

Before the aboriginal Warlpiri of Northern Australia had contact with other civilizations, their counting system was “One, two, many.”
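
One way to read "One, two, many" as a number system is sketched below. It assumes the Warlpiri unums split the extended real line into 13 disjoint pieces: the exact values 0, ±1, ±2, ±∞ and the open intervals between neighbors, with "many" being (2, ∞). That count matches the 13^n figure quoted later in the deck, but the enumeration itself is an assumption, and the names `warlpiri_pieces` and `classify` are illustrative.

```python
import math
from itertools import product


def warlpiri_pieces():
    """The 13 disjoint sets assumed to make up the Warlpiri number line."""
    points = [-math.inf, -2, -1, 0, 1, 2, math.inf]
    pieces = []
    for left, right in zip(points, points[1:]):
        pieces.append(("exact", left, left))   # a single exact value
        pieces.append(("open", left, right))   # everything strictly between
    pieces.append(("exact", math.inf, math.inf))
    return pieces


def classify(x):
    """Return the one piece that contains x."""
    for kind, lo, hi in warlpiri_pieces():
        if (kind == "exact" and x == lo) or (kind == "open" and lo < x < hi):
            return (kind, lo, hi)
    raise ValueError("not a real number")


if __name__ == "__main__":
    print(len(warlpiri_pieces()), "pieces on the line")
    print(len(list(product(warlpiri_pieces(), repeat=2))), "boxes tile the plane")
    print("1.5 lands in", classify(1.5))
    print("1000 lands in", classify(1000.0))
```

Every real number lands in exactly one piece, which is what makes the system closed even at this tiny size.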

Maybe they were onto something.

- Fixed-size unums: faster than floats

• Warlpiri ubounds are one byte, but closed system for reals

• Unpacked unums pre-decode exception bits, hidden bit

Circuit required for “IEEE half-precision float = ∞?”

Circuit required for “unum = ∞?” (any precision)

- Floating Point II: The Wrath of Kahan

• Berkeley professor William Kahan is the father of modern IEEE Standard floats

• Also the authority on their many dangers

• Every idea to fix floats faces his tests that expose how the new idea is even worse

Working unum environment completed August 13, 2013.

Can unums survive the wrath of Kahan?

- A Typical Kahan Challenge

• Correct answer: (1, 1, 1, 1).

• IEEE 32-bit: (0, 0, 0, 0) FAIL

• IEEE 64-bit: (0, 0, 0, 0) FAIL

• Myth: “Getting the same answer with increased precision means the answer is correct.”

• IEEE 128-bit: (0, 0, 0, 0) FAIL

• Extended precision math packages: (0, 0, 0, 0) FAIL

• Interval arithmetic: Um, somewhere between –∞ and ∞. EPIC FAIL

• Unums, 6-bit average size: (1, 1, 1, 1) CORRECT

I have been unable to find a problem that “breaks” unum math.

- Kahan’s “Smooth Surprise”

Find minimum of log(|3(1–x)+1|)/80 + x^2 + 1 in 0.8 ≤ x ≤ 2.0
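
The float side of this experiment can be sketched in a few lines. The log term dives to –∞ where 3(1 – x) + 1 = 0, that is at x = 4/3, but the dip is so narrow that point sampling never lands in it. The grid size of 500,000 mirrors the "half a million" on this slide; the function names are illustrative.

```python
import math


def f(x: float) -> float:
    """Kahan's function; singular where 3(1 - x) + 1 = 0, i.e. at x = 4/3."""
    t = abs(3 * (1 - x) + 1)
    log_term = math.log(t) / 80 if t > 0 else -math.inf
    return log_term + x * x + 1


def sampled_minimum(n: int = 500_000, lo: float = 0.8, hi: float = 2.0):
    """Return (x, f(x)) for the smallest value seen on a uniform grid of n points."""
    best_x, best_f = lo, f(lo)
    for i in range(1, n):
        x = lo + (hi - lo) * i / (n - 1)
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f


if __name__ == "__main__":
    x, fx = sampled_minimum()
    print(f"sampled minimum: f({x}) = {fx:.6f}")
    # Even the double closest to 4/3 does not reveal the singularity.
    print(f"f(4/3 as a double) = {f(4 / 3):.6f}")
```

Sampling reports the left endpoint as the minimum, and evaluating at the double nearest 4/3 still returns a value above f(0.8), so no amount of point sampling in this format exposes the true behavior.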

Plot, test using half a million double-precision IEEE floats. Shows minimum at x = 0.8. FAIL

Plot, test using a few dozen very low-precision unums. Shows minimum where x spans 4/3. CORRECT

- Kahan on the computation of powers

I sent Kahan my fixed-cost, fixed-storage method, and he said it looked “impractical.”

I asked if he had a method that shows the following computation is exact:

5.9604644775390625^0.875 = 4.76837158203125
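
That this identity is exact can be confirmed with rational arithmetic. The check below is not the fixed-cost, fixed-storage method mentioned on this slide, which the deck does not describe; it only uses the fact that 0.875 = 7/8, so x^0.875 = y holds exactly when x^7 = y^8.

```python
from fractions import Fraction

x = Fraction("5.9604644775390625")
y = Fraction("4.76837158203125")

# x**(7/8) == y exactly if and only if x**7 == y**8 (both sides positive).
is_exact = (x ** 7 == y ** 8)

if __name__ == "__main__":
    print("x =", x, " y =", y)
    print("x^(7/8) == y exactly:", is_exact)
    print("float pow gives:", 5.9604644775390625 ** 0.875)
```

Both numbers are ratios of a power of 5 to a power of 2 (x = 5^8/2^16, y = 5^7/2^14), which is why the eighth root comes out rational.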

Have not heard from him since.

- Two can play this game, Professor K.

• Stable fixed point found by floats, not by traditional intervals

• Unums find both stable point, and unstable point at origin

• Finding exact stable point is mathematically incorrect!

• Adding a tiny wobble sin(x)/x destroys floats, but not unums.

Unums can do anything floats can do, through explicit use of the guess function.

- Rump’s Royal Pain

Compute 333.75y^6 + x^2(11x^2y^2 – y^6 – 121y^4 – 2) + 5.5y^8 + x/(2y)

where x = 77617, y = 33096.

• Using IBM (pre-IEEE Standard) floats, Rump got

• 1.172603 in 32-bit precision

• 1.1726039400531 in 64-bit precision

• 1.172603940053178 in 128-bit precision

• Using IEEE double precision: 1.18059×10^21

• Correct answer: –0.82739605994682136…!

Didn’t even get sign right
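
Rump's expression can be evaluated both ways in a few lines. The sketch below runs the same formula once in IEEE doubles and once in exact rational arithmetic; the rational path is a stand-in for checking the true value and is not the unum computation the deck refers to. The `num` parameter is an illustrative device for switching number types.

```python
from fractions import Fraction


def rump(x, y, num=float):
    """Rump's expression; num converts the decimal constants to the working type."""
    return (num("333.75") * y**6
            + x**2 * (11 * x**2 * y**2 - y**6 - 121 * y**4 - 2)
            + num("5.5") * y**8
            + x / (2 * y))


if __name__ == "__main__":
    approx = rump(77617.0, 33096.0, float)
    exact = rump(Fraction(77617), Fraction(33096), Fraction)
    print("IEEE double :", approx)
    print("exact       :", exact, "=", float(exact))
```

The exact value is the fraction –54767/66192. The large terms cancel to about 37 significant digits, far more than a double carries, so the float result depends on evaluation order and has no correct digits.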

Unums: Correct answer to 23 decimals using an average of only 75 bits per number. Not even IEEE 128-bit precision can do that. Precision, range adjust automatically.

- Some fundamental principles

Bound the answer as tightly as possible within the numerical environment, or admit defeat.

• No more guessing

• No more “the error is O(h^n)” type estimates

• The smaller the bound, the greater the information

• Performance is information per second

• Maximize information per bit

• Fused operations are always explicitly distinct from their non-fused versions and results are identical across platforms

- Polynomials: bane of classic intervals

Dependency and closed endpoints lose information (amber)

Unum polynomial evaluator loses no information.

- Polynomial evaluation solved at last

Mathematicians have sought this for at least 60 years.
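
The dependency problem behind this search shows up in even the smallest polynomial. The sketch below evaluates p(x) = x^2 – x over [0, 1] with textbook closed-interval arithmetic, where each occurrence of x is treated as an independent interval. The `Interval` class is a minimal illustration written for this example, not a library API and not the unum evaluator.

```python
from fractions import Fraction


class Interval:
    """Closed interval [lo, hi] with textbook interval arithmetic."""

    def __init__(self, lo, hi):
        self.lo, self.hi = Fraction(lo), Fraction(hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def bounds(self):
        return (self.lo, self.hi)


def naive_range(x: Interval):
    """Evaluate x*x - x, forgetting that all three x's are the same number."""
    return (x * x - x).bounds()


if __name__ == "__main__":
    lo, hi = naive_range(Interval(0, 1))
    print(f"naive interval result: [{lo}, {hi}]")
    print("true range on [0, 1] : [-1/4, 0]")
```

The naive result is eight times wider than the true range, and the excess grows with polynomial degree and interval width.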

“Dependency Problem” creates sloppy range when input is an interval

Unum evaluation refines answer to limits of the environment precision

- Uboxes and solution sets

• A ubox is a multidimensional unum

• Exact or ULP-wide in each dimension

• Sets of uboxes constitute a solution set

• One dimension per degree of freedom in solution

• Solves the main problem with interval arithmetic

• Super-economical for bit storage

• Data parallel in general

- Calculus considered harmful

• Computers are discrete

• Calculus is continuous

• Ensures sampling errors

• Changes problem to fit the tool

- Deeply Unsatisfying Error Bounds

• Classical numerical texts teach this “error bound”: Error ≤ (b – a) h^2 |f''(ξ)| / 24

• What is f''? Where is ξ? What is the bound??

• Bound is often infinite, which means no bound at all
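
The bound quoted on this slide has the form of the composite midpoint-rule error bound, and its h^2 factor is where the "four times better for half the step" scaling comes from. A sketch that checks both on a case where f'' is known, using f = exp on [0, 1] (an example chosen here, not taken from the slides):

```python
import math


def midpoint_rule(f, a, b, n):
    """Composite midpoint rule with n equal steps of width h = (b - a) / n."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def error_ratio(f, exact, a, b, n):
    """Error with n steps divided by error with 2n steps (half the step size)."""
    e1 = abs(midpoint_rule(f, a, b, n) - exact)
    e2 = abs(midpoint_rule(f, a, b, 2 * n) - exact)
    return e1 / e2


if __name__ == "__main__":
    exact = math.e - 1                      # integral of exp over [0, 1]
    n = 10
    h = 1.0 / n
    err = abs(midpoint_rule(math.exp, 0.0, 1.0, n) - exact)
    bound = (1.0 - 0.0) * h**2 * math.e / 24  # max |f''| on [0, 1] is e
    print(f"actual error {err:.3e} <= bound {bound:.3e}")
    print(f"error ratio when h is halved: {error_ratio(math.exp, exact, 0.0, 1.0, n):.4f}")
```

The bound is only usable here because the maximum of |f''| is known in closed form; for most real problems it is not, which is the objection the slide raises.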

• “Whatever it is, it’s four times better if we make h half as big” creates demand for supercomputing that cannot be satisfied.

- Two “ubox methods”, both mindless

• Paint bucket: find one solution point, test neighbors and classify as solution or fail until no more neighbors to test

• Works if solution is known to be a connected set

• Requires a starting point “seed”

• Wave front of trial uboxes can be computed in parallel

• Try the universe: Use Warlpiri uboxes (4-bit precision) to tile all of n-space; increment exponent and fraction size automatically

• 13^n things to do in parallel (!)

• Finds every solution, no matter what, since all of n-space is represented

• Detects ill-posed problems and solves them anyway

• Parallelism adjusts from 3 to trillions

- Quarter-circle example

• Suppose all we know is x^2 + y^2 = 1, and x and y are ≥ 0

• Suppose we have at most 2 bits exponent, 4 bits fraction

Task: Bound the quarter circle area. Bound the value of π.

- Set is connected; need a seed

• We know x = 0, y = 1 works

• Find its eight ubox neighbors in the plane

• Test x^2 + y^2 = 1, x ≥ 0, y ≥ 0

• Solution set is green

• Trial set is amber

• Failure set is red

• Stop when no more trials

- Exactly one neighbor passes

• Unum math automatically excludes cases that floats would accept

Figure callout: Not part of the unit circle

• Trials are neighbors of new solutions that

• Are not already failures

• Are not already solutions

• Note: no calculation of y = √(1 − x^2)

- The new trial set

• Five trial uboxes to test

• Perfect, easy parallelism for multicore

• Each ubox takes only 15 to 23 bits

• Ultra-fast operations

• Skip to the final result…

- The complete quarter circle

• The complete solution, to this finite precision level

• Information is reciprocal of green area

• Use to find area under arc, bounded above and below

• Proves value of π to an accuracy of 3%
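
The paint-bucket walk over the quarter circle can be imitated with plain rational boxes. The sketch below is a simplification, not the unum procedure: it uses closed square boxes on a fixed 1/16 grid instead of uboxes with exact and open pieces, so its bounds are looser than the deck's. All names (`touches_arc`, `paint_bucket`, `pi_bounds`) are illustrative.

```python
from fractions import Fraction

N = 16  # boxes per axis; box (i, j) covers [i/N, (i+1)/N] x [j/N, (j+1)/N]


def touches_arc(i, j):
    """True if x^2 + y^2 = 1 has a point in the closed box (i, j)."""
    nearest = i * i + j * j                     # N^2 * (x^2 + y^2) at the near corner
    farthest = (i + 1) ** 2 + (j + 1) ** 2      # same quantity at the far corner
    return nearest <= N * N <= farthest


def paint_bucket(seed):
    """Flood outward from a seed box, keeping boxes that touch the arc."""
    solutions, failures, trials = set(), set(), {seed}
    while trials:
        box = trials.pop()
        if not touches_arc(*box):
            failures.add(box)
            continue
        solutions.add(box)
        i, j = box
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                nb = (i + di, j + dj)
                inside = 0 <= nb[0] < N and 0 <= nb[1] < N
                if inside and nb not in solutions and nb not in failures:
                    trials.add(nb)
    return solutions


def pi_bounds(solutions):
    """Bound the area under the arc column by column, then multiply by 4."""
    lower = upper = Fraction(0)
    for i in range(N):
        rows = [j for (col, j) in solutions if col == i]
        lower += Fraction(min(rows), N * N)       # arc stays above the lowest box
        upper += Fraction(max(rows) + 1, N * N)   # and below the highest box top
    return 4 * lower, 4 * upper


if __name__ == "__main__":
    boxes = paint_bucket((0, N - 1))   # seed: the box containing x = 0, y = 1
    lo, hi = pi_bounds(boxes)
    print(f"{len(boxes)} solution boxes; {float(lo):.4f} < pi < {float(hi):.4f}")
```

Only integer comparisons, additions, and multiplications are needed, and every trial box in a wave can be tested independently of the others.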

• No calculus needed, or divides, or square roots

- Compressed Final Result

• Coalesce uboxes to largest possible ULP values

• Lossless compression

• Total data set: 603 bits!

• 6x faster graphics than current methods

Instead of ULPs being the source of error, they are the atomic units of computation

- Fifth-degree polynomial roots

• Analytic solution: There isn’t one.

• Numerical solution: Huge errors from rounding

• Unums: quickly return x = –1, x = 2 as the exact solutions. No rounding. No underflow. Just… the correct answer. With as few as 4 bits for the operands!

- The power of open-closed endpoints

Root-finding just works.

- Classical Numerical Analysis

• Time steps

• Use position to estimate force

• Use force to estimate acceleration

• Update the velocity

• Update the position

• Lather, rinse, repeat

• Accumulates rounding and sampling error, both unknown

• Cannot be done in parallel
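
The loop described on this slide can be sketched as follows for a unit mass on a spring, a test case chosen here because its exact solution, x(t) = cos(t), is known. The function name and parameters are illustrative.

```python
import math


def time_step_oscillator(dt: float, steps: int, k_over_m: float = 1.0):
    """Step x'' = -(k/m) x from x = 1, v = 0 using the loop on the slide."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        a = -k_over_m * x   # position -> force -> acceleration
        v += a * dt         # update the velocity
        x += v * dt         # update the position
    return x, v


if __name__ == "__main__":
    dt, steps = 0.01, 1000
    x, _ = time_step_oscillator(dt, steps)
    exact = math.cos(dt * steps)
    print(f"stepped x = {x:.6f}, exact x = {exact:.6f}, error = {abs(x - exact):.2e}")
```

Each pass through the loop needs the result of the previous pass, so the steps cannot run concurrently, and the error is visible here only because the exact answer happens to be known.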

- A New Type of Parallelism

• Space steps, not time steps

• Acceleration, velocity bounded in any given space interval

• Find traversal time as a function of space step (2D ubox)

• Massively parallel!

• No rounding error

• No sampling error

• Obsoletes existing ODE methods

- Pendulums Done Right

• Physics teaches us it’s a harmonic oscillator with period 2π√(L/g)

• Force-fits nonlinear ODE into linear ODE for which calculus works.
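
How far the linearized period drifts from the real one can be checked against the classical closed form for the nonlinear pendulum, T = 2π√(L/g) / AGM(1, cos(θ0/2)), where AGM is the arithmetic-geometric mean and θ0 is the release angle. This formula comes from classical analysis, not from the ubox method in the deck; it is used here only as a reference value, and the function names are illustrative.

```python
import math


def small_angle_period(length_m: float, g: float = 9.81) -> float:
    """Textbook period of the linearized pendulum."""
    return 2 * math.pi * math.sqrt(length_m / g)


def agm(a: float, b: float, tol: float = 1e-15) -> float:
    """Arithmetic-geometric mean of two positive numbers."""
    while abs(a - b) > tol * a:
        a, b = (a + b) / 2, math.sqrt(a * b)
    return a


def true_period(length_m: float, theta0_rad: float, g: float = 9.81) -> float:
    """Period of the nonlinear pendulum released from rest at angle theta0."""
    return small_angle_period(length_m, g) / agm(1.0, math.cos(theta0_rad / 2))


if __name__ == "__main__":
    for degrees in (5, 30, 90, 150):
        ratio = true_period(1.0, math.radians(degrees)) / small_angle_period(1.0)
        print(f"release angle {degrees:3d} deg: true period / textbook period = {ratio:.4f}")
```

At a 90-degree release the textbook formula is already off by about 18 percent, and the gap keeps widening as the release angle grows.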

• WRONG answer

- Physical Truth vs. Force-Fit Solution

Bends the problem to fit solution methods

- Uboxes for linear solvers

• If the A and b values in Ax=b are rounded, the “lines” have width from uncertainty

• Apply a standard solver, and get the red dot as “the answer”, x. A pair of floating-point numbers.

• Check it by computing Ax and see if it rigorously contains b. Yes, it does.

• Hmm… are there any other points that also work?

Figure axes: x from 0.74 to 0.78, y from 0.64 to 0.70.

- Float, Interval, and Ubox Solutions

• Point solution (black dot) just gives one of many solutions; disguises answer instability

• Interval method (gray box) yields a bound too loose to be useful

• The ubox set (green) is the best you can do for a given precision

• Uboxes reveal ill-posed nature… yet provide solution anyway

• Works equally well on nonlinear problems!

Figure axes: x from 0.7544 to 0.7550, y from 0.6610 to 0.6618.

- Other Apps with Ubox Solutions

• Photorealistic computer graphics

• N-body problems

• Structural analysis

• Laplace’s equation

• Perfect gas models without statistical mechanics

Imagine having provable bounds on answers for the first time, yet with easier programming, less storage, less bandwidth use, less energy/power demands, and abundant parallelism.

- Revisiting the Big Challenges-1

• Too much energy and power needed per calculation

• Unums cut the main energy hog by about 50%

• More hardware parallelism than we know how to use

• Uboxes reveal vast sources of data parallelism, the easiest kind

• Not enough bandwidth (the “memory wall”)

• More use of CPU transistors, fewer bits moved to/from memory

• Rounding errors more treacherous than people realize

• Unums eliminate rounding error, automate precision choice

• Rounding errors prevent use of multicore methods

• Unums restore algebraic laws, eliminating the deterrent

- Revisiting the Big Challenges-2

• Sampling errors turn physics simulations into guesswork

• Uboxes produce provable bounds on physical behavior

• Numerical methods are hard to use, require expertise

• “Paint bucket” and “Try the universe” are brute force general methods that need no expertise… not even calculus

- The End of Error

• A complete text on unums and uboxes is available from CRC Press as of this month: http://www.crcpress.com/product/isbn/9781482239867

• Aimed at general reader; mathematicians will hate its casual style

• Complete prototype environment is available as Mathematica notebook through publisher

Thank you!