Part I: Fundamental Concepts

Spring 2006
Parallel Processing, Fundamental Concepts, Slide 1

About This Presentation

This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

Edition   Released      Revised       Revised
First     Spring 2005   Spring 2006   Spring 2006

I Fundamental Concepts

Provide motivation, paint the big picture, introduce the 3 Ts:

• Taxonomy (basic terminology and models)

• Tools for evaluation or comparison

• Theory to delineate easy and hard problems

Topics in This Part

Chapter 1 Introduction to Parallelism
Chapter 2 A Taste of Parallel Algorithms
Chapter 3 Parallel Algorithm Complexity
Chapter 4 Models of Parallel Processing

1 Introduction to Parallelism

Set the stage for presenting the course material, including:

• Challenges in designing and using parallel systems
• Metrics to evaluate the effectiveness of parallelism

Topics in This Chapter

1.1 Why Parallel Processing?
1.2 A Motivating Example
1.3 Parallel Processing Ups and Downs
1.4 Types of Parallelism: A Taxonomy
1.5 Roadblocks to Parallel Processing
1.6 Effectiveness of Parallel Processing

1.1 Why Parallel Processing?

[Fig. 1.1 plot: processor performance (KIPS, MIPS, GIPS, TIPS) vs. calendar year, 1980–2010; performance rises about 1.6x per year through the 80286, 68000, 80386, 80486, 68040, Pentium, R10000, and Pentium II.]

Fig. 1.1 The exponential growth of microprocessor performance,

known as Moore’s Law, shown over the past two decades (extrapolated).

The Semiconductor Technology Roadmap

Calendar year       2001   2004   2007   2010   2013   2016
Half-pitch (nm)      140     90     65     45     32     22
Clock freq. (GHz)      2      4      7     12     20     30
Wiring levels          7      8      9     10     10     10
Power supply (V)     1.1    1.0    0.8    0.7    0.6    0.5
Max. power (W)       130    160    190    220    250    290

From the 2001 edition of the roadmap [Alla02]

Factors contributing to the validity of Moore's law: denser circuits and architectural improvements.

Measures of processor performance:
Instructions per second (MIPS, GIPS, TIPS, PIPS)
Floating-point operations per second (MFLOPS, GFLOPS, TFLOPS, PFLOPS)
Running time on benchmark suites

[Fig. 1.1 plot repeated: processor performance vs. calendar year, 1980–2010.]

Why High-Performance Computing?

Higher speed (solve problems faster)

Important when there are “hard” or “soft” deadlines;

e.g., 24-hour weather forecast

Higher throughput (solve more problems)

Important when there are many similar tasks to perform;

e.g., transaction processing

Higher computational power (solve larger problems)

e.g., weather forecast for a week rather than 24 hours,

or with a finer mesh for greater accuracy

Categories of supercomputers

Uniprocessor; aka vector machine

Multiprocessor; centralized or distributed shared memory

Multicomputer; communicating via message passing

Massively parallel processor (MPP; 1K or more processors)

The Speed-of-Light Argument

The speed of light is about 30 cm/ns.

Signals travel at a fraction of the speed of light (say, 1/3).

If signals must travel 1 cm during the execution of an instruction, that instruction will take at least 0.1 ns; thus, performance will be limited to 10 GIPS.

This limitation is eased by continued miniaturization, architectural methods such as cache memory, etc.; however, a fundamental limit does exist.

How does parallel processing help? Wouldn't multiple processors need to communicate via signals as well?
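The 10-GIPS figure above is simple arithmetic; a minimal sketch of the calculation (Python used purely as a calculator here):

```python
# Speed-of-light bound on single-processor performance (slide's argument).
SPEED_OF_LIGHT_CM_PER_NS = 30.0                 # ~30 cm/ns in vacuum
signal_speed = SPEED_OF_LIGHT_CM_PER_NS / 3.0   # signals at ~1/3 of c
distance_cm = 1.0                               # signal path per instruction

time_per_instr_ns = distance_cm / signal_speed           # 0.1 ns minimum
instr_per_second = 1.0 / (time_per_instr_ns * 1e-9)      # 1e10 = 10 GIPS
print(time_per_instr_ns, instr_per_second / 1e9)
```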

Why Do We Need TIPS or TFLOPS Performance?

Reasonable running time = fraction of an hour to several hours (10^3–10^4 s)
In this time, a TIPS/TFLOPS machine can perform 10^15–10^16 operations

Example 1: Southern oceans heat modeling (10-minute iterations)
4096 E-W regions × 1024 N-S regions × 12 layers in depth
300 GFLOP per iteration × 300 000 iterations per 6 yrs = 10^16 FLOP

Example 2: Fluid dynamics calculations (1000 × 1000 × 1000 lattice)
10^9 lattice points × 1000 FLOP/point × 10 000 time steps = 10^16 FLOP

Example 3: Monte Carlo simulation of nuclear reactor
10^11 particles to track (for 1000 escapes) × 10^4 FLOP/particle = 10^15 FLOP

Decentralized supercomputing (from MathWorld News, 2006/4/7):
Grid of tens of thousands of networked computers discovers 2^30 402 457 – 1, the 43rd Mersenne prime, as the largest known prime (9 152 052 digits)

The ASCI Program

[Fig. 24.1 plot: performance (TFLOPS, log scale 1 to 1000) vs. calendar year 1995–2010, annotated with Plan, Develop, and Use phases. Milestones: ASCI Red, 1+ TFLOPS, 0.5 TB; ASCI Blue, 3+ TFLOPS, 1.5 TB; ASCI White, 10+ TFLOPS, 5 TB; ASCI Q, 30+ TFLOPS, 10 TB; ASCI Purple, 100+ TFLOPS, 20 TB.]

Fig. 24.1 Milestones in the Accelerated Strategic (Advanced Simulation &)

Computing Initiative (ASCI) program, sponsored by the US Department of

Energy, with extrapolation up to the PFLOPS level.

The Quest for Higher Performance

Top Three Supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16)

1. IBM Blue Gene/L (LLNL, California)
   Material science, nuclear stockpile sim
   32,768 proc's, 8 TB, 28 TB disk storage
   Linux + custom OS
   71 TFLOPS, $100 M
   Dual-proc PowerPC chips (10-15 W power)
   Full system: 130K proc's, 360 TFLOPS (est)

2. SGI Columbia (NASA Ames, California)
   Aerospace/space sim, climate research
   10,240 proc's, 20 TB, 440 TB disk storage
   Linux
   52 TFLOPS, $50 M
   20x Altix (512 Itanium2) linked by Infiniband

3. NEC Earth Sim (Earth Sim Ctr, Yokohama)
   Atmospheric, oceanic, and earth sciences
   5,120 proc's, 10 TB, 700 TB disk storage
   Unix
   36 TFLOPS*, $400 M?
   Built of custom vector microprocessors
   Volume = 50x IBM, power = 14x IBM

* Led the Top500 list for 2.5 yrs

Supercomputer Performance Growth

[Fig. 1.2 plot: supercomputer performance (MFLOPS to PFLOPS) vs. calendar year, 1980–2010; shows vector supers (Cray X-MP, Y-MP), CM-2 and CM-5, $30M and $240M MPPs, micros (80386, 80860, Alpha), and ASCI goals as dotted extrapolations.]

Fig. 1.2 The exponential growth in supercomputer performance over

the past two decades (from [Bell92], with ASCI performance goals and

microprocessor peak FLOPS superimposed as dotted lines).

1.2 A Motivating Example

[Fig. 1.3 table: the numbers 2 through 30 in columns labeled Init., Pass 1, Pass 2, Pass 3; in each pass the smallest remaining number m (2, then 3, then 5, then 7) is confirmed prime and its multiples are erased, leaving 2, 3, 5, 7, 11, 13, 17, 19, 23, 29.]

Fig. 1.3 The sieve of Eratosthenes yielding a list of 10 primes for n = 30. Marked elements have been distinguished by erasure from the list.

Any composite number has a prime factor that is no greater than its square root.
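The sieve passes described above translate directly to code; a minimal sequential sketch (the slides are language-agnostic; Python is used here only for illustration):

```python
def sieve(n):
    """Sieve of Eratosthenes: return all primes up to n."""
    marked = [False] * (n + 1)       # marked = erased from the list
    primes = []
    for m in range(2, n + 1):
        if not marked[m]:
            primes.append(m)
            # Starting at m*m suffices: any smaller composite was already
            # erased by a prime factor no greater than its square root.
            for multiple in range(m * m, n + 1, m):
                marked[multiple] = True
    return primes

print(sieve(30))   # the 10 primes for n = 30
```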

Single-Processor Implementation of the Sieve

[Fig. 1.4 diagram: a processor P with Current Prime and Index registers operating on a bit-vector indexed 1 to n.]

Fig. 1.4 Schematic representation of single-processor

solution for the sieve of Eratosthenes.

Control-Parallel Implementation of the Sieve

[Fig. 1.5 diagram: processors P1, P2, ..., Pp, each with its own Index register, connected to a shared memory that holds the Current Prime and the bit-vector (1 to n), plus an I/O device.]

Fig. 1.5 Schematic representation of a control-parallel

solution for the sieve of Eratosthenes.

Running Time of the Sequential/Parallel Sieve

[Fig. 1.6 timing chart: time axis 0 to 1500. With p = 1, the primes 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 are processed one after another, t = 1411. With p = 2 the work is split between two processors, t = 706. With p = 3, t = 499.]

Fig. 1.6 Control-parallel realization of the sieve of Eratosthenes with n = 1000 and 1 ≤ p ≤ 3.

Data-Parallel Implementation of the Sieve

Assume at most √n processors, so that all prime factors dealt with are in P1 (which broadcasts them); √n ≤ n/p.

[Fig. 1.7 diagram: P1 holds Current Prime and Index over cells 1 to n/p; P2 covers n/p+1 to 2n/p; ...; Pp covers n–n/p+1 to n; the processors are linked by a communication medium.]

Fig. 1.7 Data-parallel realization of the sieve of Eratosthenes.

One Reason for Sublinear Speedup: Communication Overhead

[Fig. 1.8 plots: (left) computation time falls while communication time grows with the number of processors, so total solution time has a minimum; (right) actual speedup falls away from ideal speedup as the number of processors grows.]

Fig. 1.8 Trade-off between communication time and computation time in the data-parallel realization of the sieve of Eratosthenes.

Another Reason for Sublinear Speedup: Input/Output Overhead

[Fig. 1.9 plots: (left) a constant I/O time added to the shrinking computation time; (right) actual speedup saturates below ideal speedup.]

Fig. 1.9 Effect of a constant I/O time on the data-parallel realization of the sieve of Eratosthenes.

1.3 Parallel Processing Ups and Downs

Using thousands of "computers" (humans + calculators) for 24-hr weather prediction in a few hours

Fig. 1.10 Richardson's circular theater for weather forecasting calculations. [Diagram: a conductor at the center coordinates the human computers.]

1960s: ILLIAC IV (U Illinois): four 8 x 8 mesh quadrants, SIMD

1980s: Commercial interest; technology was driven by government grants & contracts. Once funding dried up, many companies went bankrupt

2000s: Internet revolution: info providers, multimedia, data mining, etc. need lots of power

Trends in High-Technology Development

[Chart, 1960–2000: for Graphics, Networking, and RISC, a pipeline of government research (GovRes) feeding industrial research (IndRes) feeding industrial development (IndDev) leads to a $1B business, with transfer of ideas/people between stages; for Parallelism, government research restarts repeatedly and industrial development never reaches the sustained $1B stage. Evolution of parallel processing has been quite different from other high-tech fields.]

Development of some technical fields into $1B businesses and the roles played by

government research and industrial R&D over time (IEEE Computer, early 90s?).

Trends in Hi-Tech Development (2003)

Status of Computing Power (circa 2000)

GFLOPS on desktop: Apple Macintosh, with G4 processor

TFLOPS in supercomputer center:

1152-processor IBM RS/6000 SP (switch-based network)

Cray T3E, torus-connected

PFLOPS on drawing board:

1M-processor IBM Blue Gene (2005?)

32 proc’s/chip, 64 chips/board, 8 boards/tower, 64 towers

Processor: 8 threads, on-chip memory, no data cache

Chip: defect-tolerant, row/column rings in a 6 x 6 array
Board: 8 x 8 chip grid organized as 4 x 4 x 4 cube
Tower: Boards linked to 4 neighbors in adjacent towers
System: 32 x 32 x 32 cube of chips, 1.5 MW (water-cooled)

1.4 Types of Parallelism: A Taxonomy

[Fig. 1.11 diagram: Flynn's categories cross single/multiple instruction streams with single/multiple data streams; Johnson's expansion further splits MIMD by memory (global/distributed) and communication (shared variables/message passing).]

SISD: single instr stream, single data stream (uniprocessors)
SIMD: single instr stream, multiple data streams (array or vector processors)
MISD: multiple instr streams, single data stream (rarely used)
MIMD: multiple instr streams, multiple data streams (multiproc's or multicomputers), subdivided into:
  GMSV: global memory, shared variables (shared-memory multiprocessors)
  GMMP: global memory, message passing (rarely used)
  DMSV: distributed memory, shared variables (distributed shared memory)
  DMMP: distributed memory, message passing (multicomputers)

Fig. 1.11 The Flynn-Johnson classification of computer systems.

1.5 Roadblocks to Parallel Processing

Grosch's law: Economy of scale applies, or power = cost²
  No longer valid; in fact we can get more bang per buck in micros
Minsky's conjecture: Speedup tends to be proportional to log p
  Has roots in analysis of memory bank conflicts; can be overcome
Tyranny of IC technology: Uniprocessors suffice (x10 faster/5 yrs)
  Faster ICs make parallel machines faster too; what about x1000?
Tyranny of vector supercomputers: Familiar programming model
  Not all computations involve vectors; parallel vector machines
Software inertia: Billions of dollars investment in software
  New programs; even uniprocessors benefit from parallelism spec

Amdahl’s law: Unparallelizable code severely limits the speedup

Amdahl's Law

f = fraction unaffected; p = speedup of the rest

s = 1 / [f + (1 – f)/p] ≤ min(p, 1/f)

[Fig. 1.12 plot: speedup s vs. enhancement factor p (0 to 50) for f = 0, 0.01, 0.02, 0.05, 0.1; each curve saturates near 1/f.]

Fig. 1.12 Limit on speed-up according to Amdahl’s law.
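The curves of Fig. 1.12 come straight from the formula; a small sketch:

```python
def amdahl_speedup(f, p):
    """Speedup when a fraction f is unaffected and the rest is sped up p-fold."""
    return 1.0 / (f + (1.0 - f) / p)

# The speedup never exceeds min(p, 1/f):
for f in (0.01, 0.02, 0.05, 0.1):
    for p in (10, 50, 1000):
        assert amdahl_speedup(f, p) <= min(p, 1.0 / f) + 1e-9

print(amdahl_speedup(0.1, 50))   # already close to the 1/f = 10 ceiling
```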

1.6 Effectiveness of Parallel Processing

p    Number of processors
W(p) Work performed by p processors
T(p) Execution time with p processors; T(1) = W(1); T(p) ≤ W(p)
S(p) Speedup = T(1) / T(p)
E(p) Efficiency = T(1) / [p T(p)]
R(p) Redundancy = W(p) / W(1)
U(p) Utilization = W(p) / [p T(p)]
Q(p) Quality = T³(1) / [p T²(p) W(p)]

Fig. 1.13 Task graph exhibiting limited inherent parallelism: 13 unit-time tasks (numbered 1–13) with dependencies; W(1) = T(1) = 13, T(∞) = 8.

Reduction or Fan-in Computation

Example: Adding 16 numbers, 8 processors, unit-time additions

[Fig. 1.14 computation graph: a binary tree of 15 additions reduces the 16 numbers to their sum in 4 levels.]

Zero-time communication:
E(8) = 15 / (8 × 4) = 47%
S(8) = 15 / 4 = 3.75
R(8) = 15 / 15 = 1
Q(8) = 1.76

Unit-time communication:
E(8) = 15 / (8 × 7) = 27%
S(8) = 15 / 7 = 2.14
R(8) = 22 / 15 = 1.47
Q(8) = 0.39

Fig. 1.14 Computation graph for finding the sum of 16 numbers.
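The figures quoted above follow from the slide-27 definitions (with W(1) = T(1)); a quick numerical check, plugging in T(1) = 15, T(8) and W(8) from the fan-in tree:

```python
def metrics(T1, Tp, Wp, p):
    """Speedup, efficiency, redundancy, quality per the slide-27 definitions."""
    S = T1 / Tp
    E = T1 / (p * Tp)
    R = Wp / T1
    Q = T1**3 / (p * Tp**2 * Wp)
    return S, E, R, Q

# Zero-time communication: 15 additions in 4 levels.
print(metrics(15, 4, 15, 8))   # S = 3.75, E ~ 0.47, R = 1, Q ~ 1.76
# Unit-time communication: 7 time steps, 22 units of work.
print(metrics(15, 7, 22, 8))   # S ~ 2.14, E ~ 0.27, R ~ 1.47, Q ~ 0.39
```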

ABCs of Parallel Processing in One Slide

A  Amdahl's Law (Speedup Formula)
Bad news: sequential overhead will kill you, because
Speedup = T(1)/T(p) = 1/[f + (1 – f)/p] ≤ min(1/f, p)
Morale: For f = 0.1, speedup is at best 10, regardless of peak OPS.

B  Brent's Scheduling Theorem
Good news: optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure
T(1)/p ≤ T(p) ≤ T(1)/p + T(∞) = (T(1)/p)[1 + p/(T(1)/T(∞))]
Result: For a reasonably parallel task (large T(1)/T(∞)), or for a suitably small p (say, p ≪ T(1)/T(∞)), good speedup and efficiency are possible.

C  Cost-Effectiveness Adage
Real news: the most cost-effective parallel solution may not be the one with highest peak OPS (communication?), greatest speedup (at what cost?), or best utilization (hardware busy doing what?).
Analogy: Mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
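Brent's bounds above can be sanity-checked by simulating a naive greedy schedule. The DAG below is a small hypothetical example (not the task graph of Fig. 1.13), chosen only to exercise the inequality:

```python
def greedy_schedule_length(deps, p):
    """Time steps for a naive greedy schedule: each step runs up to p ready
    unit-time tasks. deps maps each task to its set of prerequisite tasks."""
    done, t = set(), 0
    while len(done) < len(deps):
        ready = [v for v in deps if v not in done and deps[v] <= done]
        done |= set(sorted(ready)[:p])     # run at most p ready tasks this step
        t += 1
    return t

# Hypothetical DAG: 7 independent tasks plus a chain 0 -> 7 -> 8.
deps = {i: set() for i in range(7)}
deps[7] = {0}
deps[8] = {7}
T1, Tinf = len(deps), 3                   # total work 9; critical path length 3
for p in (1, 2, 3, 4, 8):
    Tp = greedy_schedule_length(deps, p)
    assert T1 / p <= Tp <= T1 / p + Tinf  # Brent's bounds hold
```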

2 A Taste of Parallel Algorithms

Learn about the nature of parallel algorithms and complexity:
• By implementing 5 building-block parallel computations
• On 4 simple parallel architectures (20 combinations)

Topics in This Chapter

2.1 Some Simple Computations

2.2 Some Simple Architectures

2.3 Algorithms for a Linear Array

2.4 Algorithms for a Binary Tree

2.5 Algorithms for a 2D Mesh

2.6 Algorithms with Shared Variables

2.1 Some Simple Computations

[Fig. 2.1 diagram: inputs x0, x1, ..., x(n–1) combined one at a time with an accumulator initialized to the identity element; at t = 0, 1, ..., n the running result grows until s = x0 ⊗ x1 ⊗ ... ⊗ x(n–1) emerges after n steps.]

Fig. 2.1 Semigroup computation on a uniprocessor.

Parallel Semigroup Computation

[Diagram: inputs x0 through x10 combined pairwise in a binary tree of ⌈log2 n⌉ levels, producing s = x0 ⊗ x1 ⊗ ... ⊗ x(n–1) at the root.]

Semigroup computation viewed as tree or fan-in computation.

Parallel Prefix Computation

[Diagram: prefix computation on a uniprocessor; at step t the accumulator holds s(t–1) = x0 ⊗ x1 ⊗ ... ⊗ x(t–1), so all prefixes s0, s1, ..., s(n–1) are produced in n steps.]

The parallel version is much trickier compared to that of semigroup computation; it requires a minimum of log2 n levels.

Prefix computation on a uniprocessor.

The Five Building-Block Computations

Semigroup computation: aka tree or fan-in computation
  All processors to get the computation result at the end
Parallel prefix computation:
  The ith processor to hold the ith prefix result at the end
Packet routing:
  Send a packet from a source to a destination processor
Broadcasting:
  Send a packet from a source to all processors
Sorting:
  Arrange a set of keys, stored one per processor, so that the ith processor holds the ith key in ascending order

2.2 Some Simple Architectures

[Fig. 2.2 diagram: processors P0 through P8 connected in a line; the ring variant adds a link from P8 back to P0.]

Fig. 2.2 A linear array of nine processors and its ring variant.

Max node degree   d = 2
Network diameter  D = p – 1   (ring: ⌊p/2⌋)
Bisection width   B = 1       (ring: 2)

(Balanced) Binary Tree Architecture

[Fig. 2.3 diagram: P0 at the root; P1 and P4 as its children; P2, P3 under P1; P5, P6 under P4; P7, P8 as the deepest leaves.]

Complete binary tree: 2^q – 1 nodes, 2^(q–1) leaves
Balanced binary tree: leaf levels differ by at most 1

Max node degree   d = 3
Network diameter  D = 2⌈log2 p⌉ (or 1 less)
Bisection width   B = 1

Fig. 2.3 A balanced (but incomplete) binary tree of nine processors.

Two-Dimensional (2D) Mesh

[Fig. 2.4 diagram: a 3 x 3 grid of processors P0–P8; the torus variant adds wraparound links in each row and column. A nonsquare mesh (r rows, p/r columns) is also possible.]

Max node degree   d = 4
Network diameter  D = 2√p – 2   (torus: ≈ √p)
Bisection width   B ≈ √p        (torus: ≈ 2√p)

Fig. 2.4 2D mesh of 9 processors and its torus variant.

Shared-Memory Architecture

[Fig. 2.5 diagram: processors P0–P8 modeled as a complete graph, every pair directly connected.]

Max node degree   d = p – 1
Network diameter  D = 1
Bisection width   B = ⌈p/2⌉ ⌊p/2⌋

Costly to implement; not scalable. But: conceptually simple, easy to program.

Fig. 2.5 A shared-variable architecture modeled as a complete graph.

Architecture/Algorithm Combinations

[Table: rows = the four architectures (linear array/ring, binary tree, 2D mesh/torus, shared memory); columns = the five computations (semigroup, parallel prefix, packet routing, broadcasting, sorting).]

We will spend more time on linear array and binary tree, and less time on mesh and shared memory (studied later).

2.3 Algorithms for a Linear Array

[Diagram: processors P0–P8 in a linear array.]

Initial values:      5 2 8 6 3 7 9 1 4
                     5 8 8 8 7 9 9 9 4
                     8 8 8 8 9 9 9 9 9
                     8 8 8 9 9 9 9 9 9
                     8 8 9 9 9 9 9 9 9
                     8 9 9 9 9 9 9 9 9
Maximum identified:  9 9 9 9 9 9 9 9 9

Fig. 2.6 Maximum-finding on a linear array of nine processors.

For general semigroup computation:
Phase 1: Partial result is propagated from left to right
Phase 2: Result obtained by processor p – 1 is broadcast leftward
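The rows of Fig. 2.6 can be reproduced by simulating one neighbor-exchange step at a time; a minimal sketch in which every processor replaces its value by the maximum over itself and its immediate neighbors:

```python
def max_steps(values):
    """Simulate max-finding on a linear array, one exchange step per row."""
    rows = [list(values)]
    v = list(values)
    while len(set(v)) > 1:
        # Each processor takes the max of itself and its (1 or 2) neighbors.
        v = [max(v[max(0, i - 1):i + 2]) for i in range(len(v))]
        rows.append(v)
    return rows

rows = max_steps([5, 2, 8, 6, 3, 7, 9, 1, 4])
assert rows[1] == [5, 8, 8, 8, 7, 9, 9, 9, 4]   # second row of Fig. 2.6
assert rows[-1] == [9] * 9                       # maximum identified everywhere
```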

Linear Array Prefix Sum Computation

[Diagram: processors P0–P8 in a linear array.]

Initial values: 5 2  8  6  3  7  9  1  4
                5 7  8  6  3  7  9  1  4
                5 7 15  6  3  7  9  1  4
                5 7 15 21  3  7  9  1  4
                5 7 15 21 24  7  9  1  4
                5 7 15 21 24 31  9  1  4
                5 7 15 21 24 31 40  1  4
                5 7 15 21 24 31 40 41  4
Final results:  5 7 15 21 24 31 40 41 45

Fig. 2.7 Computing prefix sums on a linear array of nine processors.

Diminished parallel prefix computation:
The ith processor obtains the result up to element i – 1

Linear-Array Prefix Sum Computation

[Diagram: processors P0–P8, two items per processor.]

Initial values:          5  2  8  6  3  7  9  1  4
                         1  6  3  2  5  3  6  7  5
Local prefixes:          5  2  8  6  3  7  9  1  4
                         6  8 11  8  8 10 15  8  9
+ Diminished prefixes:   0  6 14 25 33 41 51 66 74
= Final results:         5  8 22 31 36 48 60 67 78
                         6 14 25 33 41 51 66 74 83

Fig. 2.8 Computing prefix sums on a linear array with two items per processor.
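The three-phase scheme of Fig. 2.8 (local prefixes, diminished prefix of the block totals, combine) is easy to check in code; a sketch:

```python
from itertools import accumulate

def block_prefix_sums(blocks):
    """Prefix sums with several items per processor (Fig. 2.8 scheme):
    1) each processor computes prefix sums over its own items;
    2) a diminished prefix sum of the block totals is computed;
    3) each processor adds its diminished prefix to its local prefixes."""
    local = [list(accumulate(b)) for b in blocks]
    totals = [lp[-1] for lp in local]
    diminished = [0] + list(accumulate(totals))[:-1]   # sum of earlier blocks
    return [[d + x for x in lp] for d, lp in zip(diminished, local)]

# The two-items-per-processor example of Fig. 2.8:
first  = [5, 2, 8, 6, 3, 7, 9, 1, 4]
second = [1, 6, 3, 2, 5, 3, 6, 7, 5]
result = block_prefix_sums(list(zip(first, second)))
assert [r[0] for r in result] == [5, 8, 22, 31, 36, 48, 60, 67, 78]
assert [r[1] for r in result] == [6, 14, 25, 33, 41, 51, 66, 74, 83]
```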

Linear Array Routing and Broadcasting

[Diagram: processors P0–P8 in a linear array, with left-moving and right-moving packets.]

Routing and broadcasting on a linear array of nine processors.

To route from processor i to processor j: compute j – i to determine distance and direction
To broadcast from processor i: send a left-moving and a right-moving broadcast message

Linear Array Sorting (Externally Supplied Keys)

[Fig. 2.9 diagram: the keys 5 2 8 6 3 7 9 1 4 enter the array sequentially from the left; at each step a processor keeps the smaller of its stored key and the incoming key and passes the larger one rightward, so the keys settle into ascending order 1 2 3 4 5 6 7 8 9 across the array.]

Fig. 2.9 Sorting on a linear array with the keys input sequentially from the left.

Linear Array Sorting (Internally Stored Keys)

[Diagram: processors P0–P8 in a linear array.]

In odd steps 1, 3, 5, etc., odd-numbered processors exchange values with their right neighbors (in even steps, even-numbered processors do):

5 2 8 6 3 7 9 1 4
5 2 8 3 6 7 9 1 4
2 5 3 8 6 7 1 9 4
2 3 5 6 8 1 7 4 9
2 3 5 6 1 8 4 7 9
2 3 5 1 6 4 8 7 9
2 3 1 5 4 6 7 8 9
2 1 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9

Fig. 2.10 Odd-even transposition sort on a linear array.

T(1) = W(1) = p log2 p
T(p) = p
W(p) ≈ p²/2
S(p) = log2 p (Minsky's conjecture?)
R(p) = p/(2 log2 p)
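The steps of Fig. 2.10 are the classic odd-even transposition sort; a compact sketch that reproduces the rows above (with 0-indexed positions, the "odd steps" compare the pairs starting at index 1):

```python
def odd_even_transposition_sort(a):
    """Odd-even transposition sort (Fig. 2.10): in each step, alternate halves
    of the processors compare-exchange with their right neighbors."""
    a, rows = list(a), []
    for step in range(1, len(a) + 1):
        start = 1 if step % 2 == 1 else 0   # odd steps: pairs (1,2), (3,4), ...
        for i in range(start, len(a) - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
        rows.append(list(a))
    return rows

rows = odd_even_transposition_sort([5, 2, 8, 6, 3, 7, 9, 1, 4])
assert rows[0] == [5, 2, 8, 3, 6, 7, 9, 1, 4]   # after the first (odd) step
assert rows[-1] == list(range(1, 10))            # sorted after p steps
```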

2.4 Algorithms for a Binary Tree

[Diagrams: two copies of the nine-processor binary tree of Fig. 2.3; in the first, values flow upward to the root (semigroup computation); in the second, the result flows back down to all nodes (broadcasting).]

Semigroup computation and broadcasting on a binary tree.

Binary Tree Parallel Prefix Computation

[Fig. 2.11 diagrams: upward propagation combines subtree values (x0 ⊗ x1, x2 ⊗ x3, x4, and so on, up to the root); downward propagation sends each node the combination of everything to its left, starting with the identity at the root; combining the two phases leaves leaf i holding its prefix result x0 ⊗ x1 ⊗ ... ⊗ xi.]

Fig. 2.11 Parallel prefix computation on a binary tree of processors.

Node Function in Binary Tree Parallel Prefix

Two binary operations: one during the upward propagation phase, and another during downward propagation.

[Diagram: during upward propagation a node receives [i, j – 1] from its left child and [ j, k] from its right child and sends [i, k] to its parent; during downward propagation it receives [0, i – 1] from its parent, passes [0, i – 1] to the left child, and sends [0, j – 1] to the right child.]

Insert latches for systolic operation (no long wires or propagation path).
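The upward/downward node functions above are the core of tree-based parallel prefix; a sequential sketch of the two sweeps, with recursion standing in for the per-node hardware (inclusive prefixes, addition as the semigroup operation):

```python
def tree_prefix(xs, op=lambda a, b: a + b):
    """Parallel prefix in the style of Fig. 2.11, simulated sequentially:
    the upward sweep computes the left subtree's total; the downward sweep
    passes that total into the right subtree as its left context."""
    n = len(xs)
    if n == 1:
        return [xs[0]]
    mid = n // 2
    left, right = xs[:mid], xs[mid:]
    # Upward propagation: total of the left subtree.
    left_total = left[0]
    for x in left[1:]:
        left_total = op(left_total, x)
    # Downward propagation: right subtree's prefixes gain the left total.
    left_res = tree_prefix(left, op)
    right_res = [op(left_total, r) for r in tree_prefix(right, op)]
    return left_res + right_res

assert tree_prefix([5, 2, 8, 6, 3, 7, 9, 1]) == [5, 7, 15, 21, 24, 31, 40, 41]
```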

Usefulness of Parallel Prefix Computation

Ranks of 1s in a list of 0s/1s:

Data:

0 0 1 0 1 0 0 1 1 1 0

Prefix sums:

0 0 1 1 2 2 2 3 4 5 5

Ranks of 1s:

1 2 3 4 5

Priority arbitration circuit:

Data:

0 0 1 0 1 0 0 1 1 1 0

Dim’d prefix ORs:

0 0 0 1 1 1 1 1 1 1 1

Complement:

1 1 1 0 0 0 0 0 0 0 0

AND with data:

0 0 1 0 0 0 0 0 0 0 0

Carry-lookahead network: with signals p (propagate), g (generate), a (annihilate), the semigroup operation ¢ satisfies
  p ¢ x = x
  a ¢ x = a
  g ¢ x = g
Example signal string: p g a g g p p p g a cin, combined in the direction of indexing.
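The ranks-of-1s and priority-arbitration examples above reduce to one-line prefix operations; a sketch using the slide's data:

```python
from itertools import accumulate

data = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0]

# Ranks of 1s: the prefix sum at each 1 is its rank among the 1s.
prefix = list(accumulate(data))
assert prefix == [0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 5]
ranks = [s for d, s in zip(data, prefix) if d == 1]
assert ranks == [1, 2, 3, 4, 5]

# Priority arbitration: diminished prefix OR, complement, AND with data
# grants only the first (highest-priority) 1.
dim_or = [0] + [int(any(data[:i])) for i in range(1, len(data))]
assert dim_or == [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
grant = [d & (1 - o) for d, o in zip(data, dim_or)]
assert grant == [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```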

Binary Tree Packet Routing

[Diagram: the nine-node binary tree with two indexing schemes: positional labels (XXX at the root; LXX, RXX at depth 1; LLX, LRX, RLX, RRX at depth 2; RRL, RRR at depth 3) and preorder indexing P0–P8.]

Node index is a representation of the path from the tree root.

Packet routing on a binary tree with two indexing schemes.

Binary Tree Sorting

Small values "bubble up," causing the root to "see" the values in ascending order.

[Fig. 2.12 diagrams (a)–(d): the keys 5, 2, 3, 1, 4 stored at tree nodes; in each step, smaller keys move toward the root.]

Linear-time sorting (no better than linear array).

Fig. 2.12 The first few steps of the sorting algorithm on a binary tree.

The Bisection-Width Bottleneck in a Binary Tree

Bisection Width = 1

Linear-time

sorting is the

best possible

due to B = 1

Fig. 2.13 The bisection width of a binary tree architecture.

2.5 Algorithms for a 2D Mesh

Finding the max value on a 2D mesh:

Initial values   Row maximums   Column maximums
5 2 8            8 8 8          9 9 9
6 3 7            7 7 7          9 9 9
9 1 4            9 9 9          9 9 9

Computing prefix sums on a 2D mesh:

Row prefix sums   Diminished prefix sums     Broadcast in rows
                  in last column             and combine
5 7 15            5 7 15  |  0               5  7 15
6 9 16            6 9 16  | 15               21 24 31
9 10 14           9 10 14 | 31               40 41 45

Routing and Broadcasting on a 2D Mesh

[Diagram: the 3 x 3 mesh and torus of Fig. 2.4; a nonsquare mesh (r rows, p/r columns) is also possible.]

Routing: send along the row to the correct column; then route in the column
Broadcasting: broadcast in the source row; then broadcast in all columns

Routing and broadcasting on a 9-processor 2D mesh or torus.

Sorting on a 2D Mesh Using Shearsort

Number of iterations = log2 √p
Compare-exchange steps in each iteration = 2√p
Total steps = (log2 p + 1) √p

Each iteration is a snake-like row sort followed by a top-to-bottom column sort; a final left-to-right row sort finishes Phase 3:

Initial   Snake-like   Top-to-bottom   Snake-like   Top-to-bottom   Left-to-right
values    row sort     column sort     row sort     column sort     row sort
5 2 8     2 5 8        1 4 3           1 3 4        1 3 2           1 2 3
6 3 7     7 6 3        2 5 8           8 5 2        6 5 4           4 5 6
9 1 4     1 4 9        7 6 9           6 7 9        8 7 9           7 8 9
          (Phase 1 ..............)     (Phase 2 ..............)     (Phase 3)

Fig. 2.14 The shearsort algorithm on a 3 x 3 mesh.
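The phases of Fig. 2.14 can be replayed in a few lines; a sketch of shearsort on an r x r mesh, simulated with ordinary sorts standing in for the compare-exchange steps:

```python
import math

def shearsort(grid):
    """Shearsort (Fig. 2.14): alternate snake-like row sorts (even rows
    ascending, odd rows descending) with top-to-bottom column sorts for
    ceil(log2 r) iterations, then finish with a left-to-right row sort."""
    g = [list(row) for row in grid]
    r = len(g)
    for _ in range(max(1, math.ceil(math.log2(r)))):
        for i in range(r):                    # snake-like row sort
            g[i].sort(reverse=(i % 2 == 1))
        for j in range(len(g[0])):            # top-to-bottom column sort
            col = sorted(g[i][j] for i in range(r))
            for i in range(r):
                g[i][j] = col[i]
    for i in range(r):                        # final left-to-right row sort
        g[i].sort()
    return g

# Reproduces the 3 x 3 example of Fig. 2.14:
assert shearsort([[5, 2, 8], [6, 3, 7], [9, 1, 4]]) == [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```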

2.6 Algorithms with Shared Variables

[Diagram: the complete-graph shared-variable architecture of Fig. 2.5.]

Semigroup computation: each processor can perform the computation locally
Parallel prefix computation: same as semigroup, except only data from smaller-index processors are combined
Packet routing: trivial
Broadcasting: one step with all-port (p – 1 steps with single-port) communication
Sorting: each processor determines the rank of its data element, followed by routing

3 Parallel Algorithm Complexity

Review algorithm complexity and various complexity classes:

• Introduce the notions of time and time/cost optimality

• Derive tools for analysis, comparison, and fine-tuning

Topics in This Chapter

3.1 Asymptotic Complexity

3.2 Algorithm Optimality and Efficiency
3.3 Complexity Classes
3.4 Parallelizable Tasks and the NC Class
3.5 Parallel Programming Paradigms
3.6 Solving Recurrences

3.1 Asymptotic Complexity

[Fig. 3.1 plots: three graphs of f(n) against bounding functions c g(n) and c′ g(n) beyond a threshold n0, illustrating f(n) = O(g(n)), f(n) = Ω(g(n)), and f(n) = Θ(g(n)).]

Fig. 3.1 Graphical representation of the notions of asymptotic complexity.

Examples: 3n log n = O(n²);  ½ n log² n = Ω(n);  2000 n² = Θ(n²)

Little Oh, Big Oh, and Their Buddies

Notation          Growth rate             Example of use
f(n) = o(g(n))    strictly less than      T(n) = cn² + o(n²)
f(n) = O(g(n))    no greater than         T(n, m) = O(n log n + m)
f(n) = Θ(g(n))    the same as             T(n) = Θ(n log n)
f(n) = Ω(g(n))    no less than            T(n) = Ω(√n)
f(n) = ω(g(n))    strictly greater than   T(n) = ω(log n)

Growth Rates for Typical Functions

Table 3.1 Comparing the Growth Rates of Sublinear and Superlinear Functions (K = 1000, M = 1 000 000):

Sublinear           Linear   Superlinear
log2²n   n^1/2      n        n log2²n   n^3/2
------   ------     ------   --------   ------
9        3          10       90         30
36       10         100      3.6 K      1 K
81       31         1 K      81 K       31 K
169      100        10 K     1.7 M      1 M
256      316        100 K    26 M       31 M
361      1 K        1 M      361 M      1000 M

Table 3.3 Effect of Constants on the Growth Rates of Running Times Using Larger Time Units and Round Figures:

n        (n/4) log2²n   n log2²n   100 n^1/2   n^3/2
------   ------------   --------   ---------   ------
10       20 s           2 min      5 min       30 s
100      15 min         1 hr       15 min      15 min
1 K      6 hr           1 day      1 hr        9 hr
10 K     5 day          20 day     3 hr        10 day
100 K    2 mo           1 yr       9 hr        1 yr
1 M      3 yr           11 yr      1 day       32 yr

Warning: Table 3.3 in text needs corrections.

Some Commonly Encountered Growth Rates

Notation        Class name           Notes
O(1)            Constant             Rarely practical
O(log log n)    Double-logarithmic   Sublogarithmic
O(log n)        Logarithmic
O(log^k n)      Polylogarithmic      k is a constant
O(n^a), a < 1                        e.g., O(n^1/2) or O(n^(1–ε))
O(n / log^k n)                       Still sublinear
---------------------------------------------------------------
O(n)            Linear
---------------------------------------------------------------
O(n log^k n)                         Superlinear
O(n^c), c > 1   Polynomial           e.g., O(n^(1+ε)) or O(n^3/2)
O(2^n)          Exponential          Generally intractable
O(2^(2^n))      Double-exponential   Hopeless!

3.2 Algorithm Optimality and Efficiency

Lower bounds: theoretical arguments based on bisection width, and the like.
Upper bounds: deriving/analyzing algorithms and proving them correct.

Shifting lower bounds: Ω(log n) (Zak's thm., 1982) → Ω(log²n) (Ying's thm., 1988).
Improving upper bounds: O(n²) (Bert's alg., 1988) → O(n log n) (Anne's alg., 1991) → O(n log log n) (Dana's alg., 1994) → O(n) (Chin's alg., 1996).
When the two bounds meet, we have an optimal algorithm.

Typical complexity classes: sublinear (log n, log²n, n/log n), linear (n), superlinear (n log log n, n log n, n²).

Fig. 3.2  Upper and lower bounds may tighten over time.

Some Notions of Algorithm Optimality

Time optimality (optimal algorithm, for short):
T(n, p) = g(n, p), where g(n, p) is an established lower bound
[n = problem size; p = number of processors]

Cost-time optimality (cost-optimal algorithm, for short):
pT(n, p) = T(n, 1); i.e., redundancy = utilization = 1

Cost-time efficiency (efficient algorithm, for short):
pT(n, p) = Θ(T(n, 1)); i.e., redundancy = utilization = Θ(1)
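These definitions are easy to check numerically. A minimal sketch with hypothetical timing figures (the step counts below are illustrative, not from the text):

```python
def cost_metrics(T1, Tp, p):
    """Cost-related figures for a p-processor run.
    T1 = T(n,1) sequential time, Tp = T(n,p) parallel time."""
    speedup    = T1 / Tp       # S = T(n,1) / T(n,p)
    efficiency = speedup / p   # E = S / p
    cost       = p * Tp        # pT(n,p); cost-optimal when this equals T(n,1)
    return speedup, efficiency, cost

# Hypothetical example: T(n,1) = 1024 steps, T(n,8) = 128 steps
S, E, C = cost_metrics(1024, 128, 8)
print(S, E, C)   # E = 1 means the algorithm is cost-optimal
```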

Beware of Comparing Step Counts

[Figure: machine or algorithm A reaches the solution in 4 steps; machine or algorithm B takes 20 steps.]

For example, one algorithm may need 20 GFLOP, another 4 GFLOP (but float division is a factor of 10 slower than float multiplication).

Fig. 3.3  Five times fewer steps does not necessarily mean five times faster.

3.3 Complexity Classes

Conceptual view of the P, NP, NP-complete, and NP-hard classes:

• P: polynomial time (tractable)
• NP: nondeterministic polynomial time; contains P
• NP-complete: the hardest problems in NP (e.g., the subset sum problem)
• NP-hard: at least as hard as NP-complete problems (intractable?)
• Open question: P = NP?

A more complete view of complexity classes also shows exponential-time (intractable) problems, Pspace and Pspace-complete, and Co-NP and Co-NP-complete.

Some NP-Complete Problems

Subset sum problem: Given a set of n integers and a target

sum s, determine if a subset of the integers adds up to s.

Satisfiability: Is there an assignment of values to variables in

a product-of-sums Boolean expression that makes it true?

(Remains NP-complete even if each OR term is restricted to have exactly three literals)

Circuit satisfiability: Is there an assignment of 0s and 1s to

inputs of a logic circuit that would make the circuit output 1?

Hamiltonian cycle: Does an arbitrary graph contain a cycle

that goes through all of its nodes?

Traveling salesman: Find a lowest-cost or shortest-distance

tour of a number of cities, given travel costs or distances.
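For concreteness, the subset sum problem can be decided by brute force, but only by trying all 2^n subsets; this exponential blow-up is exactly why NP-complete problems are considered intractable for large n. A minimal sketch:

```python
from itertools import combinations

def subset_sum(values, target):
    """Brute-force decision: does some subset of `values` sum to `target`?
    Tries all 2^n subsets, so the running time is O(2^n)."""
    return any(sum(c) == target
               for r in range(len(values) + 1)
               for c in combinations(values, r))

print(subset_sum([3, 34, 4, 12, 5, 2], 9))    # True: 4 + 5 = 9
print(subset_sum([3, 34, 4, 12, 5, 2], 30))   # False
```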

3.4 Parallelizable Tasks and the NC Class

NC (Nick's class): subset of problems in P for which there exist parallel algorithms using p = n^c processors (polynomially many) that run in O(log^k n) time (polylog time). NC is the class of "efficiently" parallelizable problems.

P-complete problem: given a logic circuit with known inputs, determine its output (circuit value problem).

Open questions: P = NP?  NC = P?

Fig. 3.4  A conceptual view of complexity classes and their relationships.
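The circuit value problem mentioned above is easy to state in code: evaluate the gates in topological order. A minimal sketch (the gate-list format and the example circuit are mine, for illustration only):

```python
def circuit_value(inputs, gates):
    """Evaluate a combinational logic circuit listed in topological order.
    gates: list of (name, op, operand_names); op is 'AND', 'OR', or 'NOT'.
    Returns the value of the last gate (the circuit output)."""
    val = dict(inputs)
    for name, op, args in gates:
        a = [val[x] for x in args]
        val[name] = (all(a) if op == "AND" else
                     any(a) if op == "OR" else
                     not a[0])
    return val[gates[-1][0]]

# Hypothetical circuit: out = (x AND y) OR (NOT z)
out = circuit_value({"x": True, "y": False, "z": False},
                    [("g1", "AND", ["x", "y"]),
                     ("g2", "NOT", ["z"]),
                     ("out", "OR", ["g1", "g2"])])
print(out)   # True
```

Sequentially this takes one pass over the gates; the P-completeness of the problem suggests that no general polylog-time parallel evaluation is likely.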

3.5 Parallel Programming Paradigms

Divide and conquer

Decompose problem of size n into smaller problems; solve subproblems independently; combine subproblem results into final answer:

T(n) = T_d(n) + T_s + T_c(n)

[T_d = time to decompose; T_s = time to solve in parallel; T_c = time to combine]
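A toy illustration of the three phases above, using Python threads to solve the two halves independently (the function name and grain of decomposition are mine):

```python
from concurrent.futures import ThreadPoolExecutor

def dc_sum(a):
    """Divide-and-conquer sum of a list."""
    mid = len(a) // 2                                  # decompose: T_d(n)
    with ThreadPoolExecutor(max_workers=2) as pool:
        lo, hi = pool.map(sum, (a[:mid], a[mid:]))     # solve in parallel: T_s
    return lo + hi                                     # combine: T_c(n)

print(dc_sum(list(range(100))))   # 4950
```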

Randomization

When it is impossible or difficult to decompose a large problem into

subproblems with equal solution times, one might use random decisions

that lead to good results with very high probability.

Example: sorting with random sampling

Other forms: Random search, control randomization, symmetry breaking

Approximation

Iterative numerical methods may use approximation to arrive at solution(s).

Example: Solving linear systems using Jacobi relaxation.

Under proper conditions, the iterations converge to the correct solutions;

more iterations → greater accuracy
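The Jacobi example can be sketched directly: each iteration recomputes every unknown from the previous iterate only, so all n updates could run in parallel, and under diagonal dominance the iterates converge (more iterations, greater accuracy). The small 2×2 system below is my own illustration:

```python
def jacobi(A, b, iters=50):
    """Jacobi relaxation for Ax = b; converges if A is diagonally dominant."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # Every x[i] depends only on the previous iterate, so the n
        # updates are independent and could execute in parallel.
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

# Diagonally dominant example with exact solution x = [1, 2]
A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [6.0, 7.0]
print(jacobi(A, b))
```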

3.6 Solving Recurrences

f(n) = f(n – 1) + n   {rewrite f(n – 1) as f((n – 1) – 1) + n – 1}
     = f(n – 2) + n – 1 + n
     = f(n – 3) + n – 2 + n – 1 + n
     . . .
     = f(1) + 2 + 3 + . . . + n – 1 + n
     = n(n + 1)/2 – 1 = Θ(n²)

This method is known as unrolling.

f(n) = f(n/2) + 1   {rewrite f(n/2) as f((n/2)/2) + 1}
     = f(n/4) + 1 + 1
     = f(n/8) + 1 + 1 + 1
     . . .
     = f(n/n) + 1 + 1 + 1 + . . . + 1   [log₂n times]
     = log₂n = Θ(log n)

More Examples of Recurrence Unrolling

f(n) = 2f(n/2) + 1
     = 4f(n/4) + 2 + 1
     = 8f(n/8) + 4 + 2 + 1
     . . .
     = n f(n/n) + n/2 + . . . + 4 + 2 + 1
     = n – 1 = Θ(n)

f(n) = f(n/2) + n
     = f(n/4) + n/2 + n
     = f(n/8) + n/4 + n/2 + n
     . . .
     = f(n/n) + 2 + 4 + . . . + n/4 + n/2 + n
     = 2n – 2 = Θ(n)

Solution via guessing: guess f(n) = Θ(n) = cn + g(n); then cn + g(n) = cn/2 + g(n/2) + n. Thus, c = 2 and g(n) = g(n/2).

Still More Examples of Unrolling

f(n) = 2f(n/2) + n
     = 4f(n/4) + n + n
     = 8f(n/8) + n + n + n
     . . .
     = n f(n/n) + n + n + n + . . . + n   [log₂n times]
     = n log₂n = Θ(n log n)

Alternate solution method: f(n)/n = f(n/2)/(n/2) + 1; let g(n) = f(n)/n; then g(n) = g(n/2) + 1 = log₂n.

f(n) = f(n/2) + log₂n
     = f(n/4) + log₂(n/2) + log₂n
     = f(n/8) + log₂(n/4) + log₂(n/2) + log₂n
     . . .
     = f(n/n) + log₂2 + log₂4 + . . . + log₂(n/2) + log₂n
     = 1 + 2 + 3 + . . . + log₂n
     = log₂n (log₂n + 1)/2 = Θ(log² n)

Master Theorem for Recurrences

Theorem 3.1:
Given f(n) = a f(n/b) + h(n), with a and b constants and h an arbitrary function, the asymptotic solution to the recurrence is (c = log_b a):

f(n) = Θ(n^c)          if h(n) = O(n^(c–ε)) for some ε > 0
f(n) = Θ(n^c log n)    if h(n) = Θ(n^c)
f(n) = Θ(h(n))         if h(n) = Ω(n^(c+ε)) for some ε > 0

Example: f(n) = 2f(n/2) + 1
a = b = 2; c = log_b a = 1
h(n) = 1 = O(n^(1–ε))
f(n) = Θ(n^c) = Θ(n)
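All three cases can be spot-checked numerically by evaluating the recurrence exactly and dividing by the predicted growth rate (a sketch; it assumes f(1) = 1 and n a power of b):

```python
import math

def solve(a, b, h, n):
    """Evaluate f(n) = a f(n/b) + h(n) exactly, with f(1) = 1."""
    return 1 if n == 1 else a * solve(a, b, h, n // b) + h(n)

n = 2 ** 16
# Case 1: f(n) = 2f(n/2) + 1; c = 1, h(n) = O(n^(1-eps))  ->  Theta(n)
print(solve(2, 2, lambda m: 1, n) / n)                    # 2 - 1/n, a constant
# Case 2: f(n) = 2f(n/2) + n; c = 1, h(n) = Theta(n)      ->  Theta(n log n)
print(solve(2, 2, lambda m: m, n) / (n * math.log2(n)))   # 1 + 1/log2(n)
# Case 3: f(n) = f(n/2) + n;  c = 0, h(n) = Omega(n^eps)  ->  Theta(h(n)) = Theta(n)
print(solve(1, 2, lambda m: m, n) / n)                    # 2 - 1/n, a constant
```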

Intuition Behind the Master Theorem

Theorem 3.1: Given f(n) = a f(n/b) + h(n), with a and b constants and h an arbitrary function, the asymptotic solution is (c = log_b a):

f(n) = Θ(n^c) if h(n) = O(n^(c–ε)) for some ε > 0
    f(n) = 2f(n/2) + 1 = 4f(n/4) + 2 + 1 = . . . = n f(n/n) + n/2 + . . . + 4 + 2 + 1
    The last term dominates.

f(n) = Θ(n^c log n) if h(n) = Θ(n^c)
    f(n) = 2f(n/2) + n = 4f(n/4) + n + n = . . . = n f(n/n) + n + n + n + . . . + n
    All terms are comparable.

f(n) = Θ(h(n)) if h(n) = Ω(n^(c+ε)) for some ε > 0
    f(n) = f(n/2) + n = f(n/4) + n/2 + n = . . . = f(n/n) + 2 + 4 + . . . + n/4 + n/2 + n
    The first term dominates.

4 Models of Parallel Processing

Expand on the taxonomy of parallel processing from Chap. 1:

• Abstract models of shared and distributed memory

• Differences between abstract models and real hardware

Topics in This Chapter

4.1 Development of Early Models

4.2 SIMD versus MIMD Architectures

4.3 Global versus Distributed Memory

4.4 The PRAM Shared-Memory Model

4.5 Distributed-Memory or Graph Models

4.6 Circuit Model and Physical Realizations

4.1 Development of Early Models

Associative memory (a memory array with comparison logic, searched via a comparand and a mask register):
• Parallel masked search of all words
• Bit-serial implementation with RAM

Associative processor:
• Add more processing logic to PEs

Table 4.1  Entering the second half-century of associative processing
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Decade   Events and Advances                   Technology            Performance
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
1940s    Formulation of need & concept         Relays
1950s    Emergence of cell technologies        Magnetic, cryogenic   Mega-bit-OPS
1960s    Introduction of basic architectures   Transistors
1970s    Commercialization & applications      ICs                   Giga-bit-OPS
1980s    Focus on system/software issues       VLSI                  Tera-bit-OPS
1990s    Scalable & flexible architectures     ULSI, WSI             Peta-bit-OPS
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––

The Flynn-Johnson Classification Revisited

Single vs. multiple instruction stream(s) crossed with single vs. multiple data stream(s):

• SISD: "Uniprocessor"
• SIMD: "Array processor"
• MISD: rarely used
• MIMD: further subdivided by global vs. distributed memory and by communication/synchronization via shared variables vs. message passing:
  GMSV (global memory, shared variables): "Shared-memory multiprocessor"
  GMMP (global memory, message passing)
  DMSV (distributed memory, shared variables): "Distributed shared memory"
  DMMP (distributed memory, message passing): "Distributed-memory multicomputer"

Fig. 4.1  The Flynn-Johnson classification of computer systems.
Fig. 4.2  [MISD example: a single data stream passes from input to output through instruction streams I1–I5.]

4.2 SIMD versus MIMD Architectures

Most early parallel machines had SIMD designs:
• Attractive to have skeleton processors (PEs)
• Eventually, many processors per chip
• High development cost for custom chips, high cost
• MSIMD and SPMD variants

Most modern parallel machines have MIMD designs:
• COTS components (CPU chips and switches)
• MPP: massively or moderately parallel?
• Tightly coupled versus loosely coupled
• Explicit message passing versus shared memory
• Network-based NOWs and COWs (networks/clusters of workstations)
• Grid computing
• Vision: plug into wall outlets for computing power

SIMD timeline, 1960–2010: ILLIAC IV (late 1960s), DAP (1970s), Goodyear MPP (early 1980s), TMC CM-2 (late 1980s), MasPar MP-1 (c. 1990), ClearSpeed array coprocessor (2000s).

4.3 Global versus Distributed Memory

Options for the processor-to-memory network:
• Crossbar: expensive
• Bus(es): bottleneck
• MIN (multistage interconnection network): complex

Fig. 4.3  A parallel processor with global memory: p processors (0 to p–1) reach m memory modules (0 to m–1) through a processor-to-memory network, with a separate processor-to-processor network and parallel I/O.

Removing the Processor-to-Memory Bottleneck

Challenge: cache coherence.

Fig. 4.4  A parallel processor with global memory and processor caches: each of the p processors has a private cache between it and the processor-to-memory network leading to the m memory modules, with a separate processor-to-processor network and parallel I/O.

Distributed Shared Memory

Some terminology:
• NUMA: nonuniform memory access (distributed shared memory)
• UMA: uniform memory access (global shared memory)
• COMA: cache-only memory architecture

Fig. 4.5  A parallel processor with distributed memory: p processor nodes (0 to p–1), each paired with a local memory, connected by an interconnection network, with parallel I/O.

4.4 The PRAM Shared-Memory Model

Fig. 4.6  Conceptual view of a parallel random-access machine (PRAM): p processors (0 to p–1) share m memory locations (0 to m–1).

PRAM Implementation and Operation

PRAM cycle:
• All processors read memory locations of their choosing
• All processors compute one step independently
• All processors store results into memory locations of their choosing

Fig. 4.7  PRAM with some hardware details shown: processors with control logic access the shared memory through a memory access network and controller.
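The three-phase cycle can be mimicked in software: gather all reads first, then compute and write, so every "processor" in a step sees the same memory state. A sketch of a Θ(log n)-cycle PRAM-style reduction (the function name is mine; n is assumed a power of 2):

```python
def pram_sum(values):
    """Synchronous PRAM-style sum: log2(n) cycles over a shared memory."""
    mem = list(values)                       # the shared memory
    stride = 1
    while stride < len(mem):
        # Read phase: every active processor fetches its two operands.
        reads = [(i, mem[i], mem[i + stride])
                 for i in range(0, len(mem), 2 * stride)]
        # Compute + store phase: writes happen only after all reads are done.
        for i, a, b in reads:
            mem[i] = a + b
        stride *= 2
    return mem[0]

print(pram_sum(list(range(8))))   # 0 + 1 + ... + 7 = 28
```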

4.5 Distributed-Memory or Graph Models

Fig. 4.8 The sea of interconnection networks.
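Entries of Table 4.2, which follows, can be verified for small cases by building the graph explicitly. A sketch computing the diameter of the l-cube (hypercube) by breadth-first search over node labels:

```python
from collections import deque

def hypercube_diameter(l):
    """BFS from node 0 over the 2^l labels; neighbors differ in one bit."""
    dist = {0: 0}
    q = deque([0])
    while q:
        u = q.popleft()
        for bit in range(l):
            v = u ^ (1 << bit)          # flip one address bit
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

for l in range(1, 6):
    print(l, hypercube_diameter(l))     # diameter = l, as in Table 4.2
```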

Some Interconnection Networks (Table 4.2)

––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Network name(s)           Number of nodes    Diameter   Bisection    Node      Local
                                                        width        degree    links?
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
1D mesh (linear array)    k                  k – 1      1            2         Yes
1D torus (ring, loop)     k                  k/2        2            2         Yes
2D mesh                   k²                 2k – 2     k            4         Yes
2D torus (k-ary 2-cube)   k²                 k          2k           4         Yes¹
3D mesh                   k³                 3k – 3     k²           6         Yes
3D torus (k-ary 3-cube)   k³                 3k/2       2k²          6         Yes¹
Pyramid                   (4k² – 1)/3        2 log₂k    2k           9         No
Binary tree               2^l – 1            2l – 2     1            3         No
4-ary hypertree           2^l(2^(l+1) – 1)   2l         2^(l+1)      6         No
Butterfly                 2^l(l + 1)         2l         2^l          4         No
Hypercube                 2^l                l          2^(l–1)      l         No
Cube-connected cycles     2^l l              2l         2^(l–1)      3         No
Shuffle-exchange          2^l                2l – 1     2^(l–1)/l    4 unidir. No
De Bruijn                 2^l                l          2^l/l        4 unidir. No
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
¹ With folded layout.

4.6 Circuit Model and Physical Realizations

Scalability dictates hierarchical connectivity.

Fig. 4.9  Example of a hierarchical interconnection architecture: low-level clusters joined through bus switches (gateways).

Signal Delay on Wires No Longer Negligible

Fig. 4.10  Intrachip wire delay as a function of wire length (0–6 mm): the delay curve for hypercube wiring lies highest, above those for a 2-D torus and a 2-D mesh.

Pitfalls of Scaling Up (Fig. 4.11)

Scaled-up ant on the rampage! What is wrong with this picture?

An ant scaled up in length from 5 mm to 50 m (a factor of 10⁴) grows in weight by a factor of one trillion. The thickness of its legs must then grow by a factor of one million, from 0.1 mm to 100 m, to support the new weight: the scaled-up ant collapses under its own weight.
