This page reproduces the content of http://www.slideshare.net/DICEStudio/five-rendering-ideas-from-battlefield-3-need-for-speed-the-run


Uploaded 2011/08/18

Siggraph 2011 talk by John White (NFS) & Colin Barre-Brisebois (BF3) at EA about rendering & performance

More Performance! Five Rendering Ideas from Battlefield 3 and Need For Speed: The Run

John White (NFS)

Colin Barré-Brisebois (BF3)

Advances in Real-Time Rendering in Games - Agenda

Motivations

The Techniques

Separable Bokeh Depth-of-Field

Hi-Z / Z-Cull Reverse Reload Tricks

Chroma Sub-Sampled Image Processing

Tiled-Based Deferred Shading on Xbox 360

Temporally-Stable Screen-Space Ambient Occlusion

Q&A

Advances in Real-Time Rendering in Games - Motivations

Frostbite 2 is DICE’s Next-Generation Engine for Current Generation Platforms

5 Pillars:

Animation

Audio

Scale

Destruction

Rendering

Powers Battlefield 3 and Need For Speed: The Run

Advances in Real-Time Rendering in Games - More Info

Lots of FB2 papers on DICE website

publications.dice.se

Also see Alex Ferrier’s talk on “1000 points of light”, Tues, 2pm, East Building, Ballroom A/B

Advances in Real-Time Rendering in Games - Separable Bokeh Depth-of-Field

Advances in Real-Time Rendering in Games - Real World Bokeh - Disc

Photo Courtesy of Mohsin Hasan 2011

Advances in Real-Time Rendering in Games - Real World Bokeh – Pentagonal

Photo Courtesy of Mohsin Hasan 2011

Advances in Real-Time Rendering in Games - Circle of Confusion Calculation

Calculate the per-pixel CoC from real-world camera parameters

Lens length (derived from FOV)

F/Stop

Focal Plane

CoC is a simple MADD on the raw Z depth [Demers04] [Jones08]

CoC = abs(z * CoCScale + CoCBias)

CoCScale = (A * focallength * focalplane * (zfar - znear)) / ((focalplane - focallength) * znear * zfar)

CoCBias = (A * focallength * (znear - focalplane)) / ((focalplane - focallength) * znear)
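The scale and bias only need to be computed once per frame on the CPU; the shader then does a single multiply-add per pixel. The following is a minimal sketch of that split, not the shipped code. It assumes a D3D-style projection for the raw depth, takes the bias denominator as (focalplane - focallength), which is what the thin-lens derivation gives, and uses hypothetical names (`CocParams`, `makeCocParams`, `circleOfConfusion`, `rawDepth`).

```cpp
#include <cmath>

// Hypothetical parameter bundle; the names are illustrative, not from the talk.
struct CocParams {
    float scale;  // CoCScale
    float bias;   // CoCBias
};

// aperture is the lens aperture diameter (A). All distances share one unit.
inline CocParams makeCocParams(float aperture, float focalLength, float focalPlane,
                               float zNear, float zFar) {
    CocParams p;
    p.scale = (aperture * focalLength * focalPlane * (zFar - zNear)) /
              ((focalPlane - focalLength) * zNear * zFar);
    p.bias = (aperture * focalLength * (zNear - focalPlane)) /
             ((focalPlane - focalLength) * zNear);
    return p;
}

// Per-pixel work: one multiply-add on the raw depth-buffer value z in [0, 1].
inline float circleOfConfusion(float z, const CocParams& p) {
    return std::fabs(z * p.scale + p.bias);
}

// Raw depth of a point at view-space distance d, assuming a standard
// D3D-style perspective projection (0 at the near plane, 1 at the far plane).
inline float rawDepth(float d, float zNear, float zFar) {
    return (zFar / (zFar - zNear)) * (1.0f - zNear / d);
}
```

With these definitions a point exactly on the focal plane gets a CoC of (approximately) zero, and other distances match the thin-lens value A * focallength * |d - focalplane| / (d * (focalplane - focallength)).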

Advances in Real-Time Rendering in Games - Pre Multiplied CoC

For a 16-bit source, premultiply the colour by the CoC

Store the CoC in alpha

Recover the colour by doing col.rgb /= col.a

Ensure the CoC always has a small non-zero value so the colour can always be recovered

Advances in Real-Time Rendering in Games - Blur Process

Gaussian blur

Common in DX9 games. Cheap

2D Area samples

Limited kernel size before texture tap explosion

GS expanded Point Sprites

Heavy fill rate

CryEngine3 DX11 and Unreal Engine 3 Samaritan demo

Advances in Real-Time Rendering in Games - Gaussian vs. real world bokeh

Arbitrary blurs in image space are O(N^2)

Gaussian blurs can be made separable O(N)

What 2D blurs can be made separable?

Gaussian

Box

Skewed Box

Advances in Real-Time Rendering in Games - Other separable blurs

Gaussian, Box and Skewed Box

Advances in Real-Time Rendering in Games - Hexagonal Blurs

Decompose a hexagon into 3 rhombi

Each rhombus can be computed via separable blur

7 Passes in total. 3 shapes x 2 blurs + 1 combine

[Figure: hexagon decomposed into rhombi 1, 2 and 3; first pass and second pass]

Advances in Real-Time Rendering in Games - Hexagonal Blurs – Pass Reduction

Hexagonal blur using a separable filter

But 7 passes and 6 blurs is not competitive

Need to reduce passes

[Figure: hexagon decomposed into rhombi 1, 2 and 3; first pass and second pass]

Advances in Real-Time Rendering in Games - Hexagonal Blurs – Pass Reduction 1

[Figure: Pass 1 blurs Up and Down Left; Pass 2 blurs Down Left and Down Right, with the results added together]

Advances in Real-Time Rendering in Games - Hexagonal blurs – Pass reduction 2

[Figure: Pass 1 and Pass 2, with the blur results added together]

Advances in Real-Time Rendering in Games - Hexagonal Bokeh

Advances in Real-Time Rendering in Games - Hexagonal vs Gaussian

Gaussian

2 Passes with a total of 2 blurs

Hexagonal

2 Passes (3 resolves) with a total of 4 blurs

BUT each blur only needs half the taps, therefore the same number of taps

BUT each tap contributes equally, unlike Gaussian, so fewer taps are needed for a given aesthetic filter kernel width!

PLUS We can improve further

Advances in Real-Time Rendering in Games - Iterative Refinement

Because the blurs are equally weighted, we can use iterative refinement on the blurring [Sousa08]

Multiple passes fill in the under-sampling

Dual iteration blur needs a total of 5 passes with a total of 8 half blurs.

Advances in Real-Time Rendering in Games - Iterative Refinement

[Figure: iterative refinement, Pass 1 to Pass 5, with blur results added together]

Advances in Real-Time Rendering in Games - Pseudo Scatter filter

Proper bokeh should have its blur scattered to its neighbours

However, pixel shaders are designed to gather; they can’t scatter their results

Typical blurs set the filter kernel size from the CoC of the current pixel

Instead, default to a big CoC and reject samples based on the sampled texel's CoC

–Extra method to stop bleeding artifacts, and it can sharpen up smooth gradients
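One way to picture the reject-on-sampled-CoC idea is the 1D gather below. It always searches the largest kernel and keeps a neighbour only when that neighbour's own CoC is big enough to reach the current pixel. This is a simplified sketch with equal weights and a hypothetical function name; it is not the shader from the talk and it ignores the premultiplied-alpha details.

```cpp
#include <cstdlib>
#include <vector>

// Gather-based approximation of scatter on one row of pixels.
// colour[i] is the source value and coc[i] its circle of confusion in pixels
// (assumed >= 0). maxRadius is the largest kernel that is searched.
inline std::vector<float> pseudoScatterBlur(const std::vector<float>& colour,
                                            const std::vector<float>& coc,
                                            int maxRadius) {
    const int n = static_cast<int>(colour.size());
    std::vector<float> out(colour.size(), 0.0f);
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        float weight = 0.0f;
        for (int d = -maxRadius; d <= maxRadius; ++d) {
            const int j = i + d;
            if (j < 0 || j >= n) continue;
            // Accept the texel only if its own CoC would have scattered onto pixel i.
            if (coc[j] >= static_cast<float>(std::abs(d))) {
                sum += colour[j];
                weight += 1.0f;
            }
        }
        out[i] = sum / weight;  // the centre tap (d == 0) always passes, so weight >= 1
    }
    return out;
}
```

With every CoC at zero the row comes back unchanged, so in-focus pixels neither bleed outwards nor pick up their neighbours.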

Advances in Real-Time Rendering in Games - Hi Z culling

When downsampling the premultiplied CoC buffer, output the computed CoC as depth

You can then draw the plane at a Z depth of 0.001f

In-focus pixels will be quickly rejected by Hi-Z

Same for iterative refinement

Draw the undersampled pass at a higher Z value, the fine pass at a small Z value

–Requires an explicit copy afterwards to re-fill

Advances in Real-Time Rendering in Games - Hi-Z / Z-Cull

Reverse Reload Tricks

Advances in Real-Time Rendering in Games - Hi-Z (1/)

•Ubiquitous on modern hardware

•Stores a low-res version of the Z buffer

•Can use this to conservatively reject groups of pixels

•Saves fragment shading of pixels known to be occluded

Advances in Real-Time Rendering in Games - Hi-Z (2/)

Advances in Real-Time Rendering in Games - Hi-Z (3/)

Empty Space

Solid Space

Advances in Real-Time Rendering in Games - Volume Rendering (1/)

•In Deferred Renderers it is common to reproject screen

pixels back into world space

–Common for lights (point, spot, line)

–Shadow volumes

•Draw a convex bounding polyhedron projected in screen space

•In shader, reject pixels which are not in the volume bounds

Advances in Real-Time Rendering in Games - Volume Rendering (2/)

A

B

C

Advances in Real-Time Rendering in Games - Reverse Hi-Z Reload (X360)

Alias a Render Target on existing depth buffer

Init aliased RT to D3DHIZFUNC_GREATER_EQUAL

Draw Full screen quad

–NULL Pixel Shader

–Zfunc == Never

Hi-Z is now primed in reverse

Similar technique on Playstation3 (See DevNet)

Advances in Real-Time Rendering in Games - Reverse Hi-Z (1/)

Empty Space

Solid Space

Solid Space

Empty Space

Advances in Real-Time Rendering in Games - Reverse Hi-Z (2/)

The GPU will now cull fragments if they are closer than the

value in the depth buffer

By rendering the backfaces of convex polyhedra, pixels

beyond the faces will quickly reject

If the camera is inside the volume then only pixels inside

the volume will pass

Perfect for cascaded shadow maps

Advances in Real-Time Rendering in Games - Reverse Hi-Z (3/)

A

B

C

Advances in Real-Time Rendering in Games - Reverse Hi-Z for CSM rendering

Each cascade for a directional light is bounded by a cuboid

in world space

Only world space pixels inside the cuboid will project onto

the shadow map

By drawing the cuboid backfaces only these pixels will

pass the reverse Z test

Advances in Real-Time Rendering in Games - CSM Cuboids

CSM1

CSM0

C

A

B

Advances in Real-Time Rendering in Games - CSM Cuboids

Evaluate as a separate pass [Sousa08]

–Input is depth buffer

–Creates a L8 mask texture input into directional light pass

Can do a prior full screen pass to tag pixels that are back-facing with respect to the light source in stencil

–Heuristic on sun angle with camera

Potential for ¼ res with bilateral upsample

Stencil is updated to denote already processed pixels

Advances in Real-Time Rendering in Games - Reverse Hi-Z CSM (1/)

Cascade 0

Advances in Real-Time Rendering in Games - Reverse Hi-Z CSM (2/)

Cascade 1

Advances in Real-Time Rendering in Games - Reverse Hi-Z CSM (3/)

Cascade 2

Advances in Real-Time Rendering in Games - Reverse Hi-Z CSM (4/)

Cascade 3

Advances in Real-Time Rendering in Games - Min/Max Shadow Maps (1/)

Downsample and dilate SM, keeping track of min and max depths

Advances in Real-Time Rendering in Games - Min/Max Shadow Maps (2/)

Dilated min/max SM allow us to know if a pixel is ...

–Fully In shadow

–Fully out of shadow

–Partially in shadow (conservatively)

Draw each cuboid twice

First Pass

–Single simple tap shader

Second Pass

–High quality PCF shader

Advances in Real-Time Rendering in Games - Min/Max Shadow Maps Simple Pass

If Z < MinMax.min return (1,1,1,1)

If Z > MinMax.max return (0,0,0,1)

Otherwise return (0,0,0,0)

Mask

Stencil
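The simple pass amounts to a three-way classification against the dilated min/max shadow map. The sketch below treats the remaining case as "otherwise", and it assumes the first three components are the shadow mask while the fourth marks the pixel as already resolved; both are readings of the slide rather than confirmed details, and the type and function names are hypothetical.

```cpp
struct Float4 { float x, y, z, w; };

// Dilated min/max occluder depths sampled from the downsampled shadow map.
struct MinMaxDepth {
    float minDepth;
    float maxDepth;
};

// z is the receiver depth in shadow-map space.
inline Float4 simpleShadowPass(float z, const MinMaxDepth& mm) {
    if (z < mm.minDepth) return {1.0f, 1.0f, 1.0f, 1.0f};  // in front of every occluder: fully lit, resolved
    if (z > mm.maxDepth) return {0.0f, 0.0f, 0.0f, 1.0f};  // behind every occluder: fully shadowed, resolved
    return {0.0f, 0.0f, 0.0f, 0.0f};                       // undecided: left for the PCF pass
}
```

Only pixels that come back unresolved would then need the high quality PCF shader in the second pass.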

Advances in Real-Time Rendering in Games - Min/Max Shadow Maps PCF Pass (1/)

Second pass is standard PCF filter

Mask

Stencil

Advances in Real-Time Rendering in Games - Min/Max Shadow Maps (3/)

Do for all cascades

Advances in Real-Time Rendering in Games - Min/Max Shadow Maps PCF Pass (2/)

Final Overdraw

Advances in Real-Time Rendering in Games - Conditional Tests

When we render the cuboid we can count how many pixels pass

If Zero then no pixels will sample from the shadow map

–So why even render the shadow map!

–Draw the cuboid first and only if pixels pass draw the actual shadow map

Zero passed pixels occur for two reasons

–All pixels are further away

–All pixels have been touched by a closer cascade (stencil cleared)

Advances in Real-Time Rendering in Games - Chroma Sub-Sampled

Image Processing

Advances in Real-Time Rendering in Games - Chroma Sub-Sampling

Not a new idea. Used in TV broadcasts as well as JPEG/MPEG compression

Decompose image into luminance and chroma

Store Luma at full res only. Chroma at lower

Advances in Real-Time Rendering in Games - Chroma Sub-Sampling - Motivation

Post processing requires lots of bandwidth

Easy to optimise ALU down

Quickly hit a performance ceiling, especially with 16 bits per component

Reading and writing a 720p image with four 16-bit components is 14MB of bandwidth

Assuming 14GB/sec bandwidth and perfect cache usage, this is 1ms for a single pass

[Figure: image channels Red, Green, Blue, X]
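The bandwidth figures can be checked with a few lines of arithmetic. The helpers below are illustrative only (hypothetical names) and assume one read plus one write of every pixel, four components of two bytes each, and a flat 14 GB/sec with no cache effects.

```cpp
#include <cstdint>

// Bytes moved by one full-screen pass that reads and writes every pixel once.
constexpr std::uint64_t passBytes(std::uint64_t width, std::uint64_t height,
                                  std::uint64_t channels, std::uint64_t bytesPerChannel) {
    return 2 * width * height * channels * bytesPerChannel;
}

// Time for that traffic at a given bandwidth, ignoring caches and latency.
constexpr double passMilliseconds(std::uint64_t bytes, double bytesPerSecond) {
    return 1000.0 * static_cast<double>(bytes) / bytesPerSecond;
}
```

For 1280x720 this gives 14,745,600 bytes per pass, a little over 1 ms at 14 GB/sec, and a single-channel buffer of the same size moves exactly a quarter of that.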

Advances in Real-Time Rendering in Games - Chroma Sub Sampling - Motivation

Instead reduce down to Luma only

¼ of the bandwidth required

Requires extra processing on the Colour

But this can be 2 channel at ¼ res ( 1/8 original size )

Also can get away with fewer taps

Luma

Cb

Cr

Advances in Real-Time Rendering in Games - Chroma Sub Sampling

Bandwidth is reduced to 1/4

So, are the shaders now 4X quicker?

No. We are ALU bound again

Texture units and ALU are designed for 4 component SIMD

We are only using 1 component

Need to pack 4 luma values together and process together

Advances in Real-Time Rendering in Games - Chroma Sub Sampling

Pack 4 adjacent pixels together into one RGBA pixel

Only need 1 texread to get 4 luma values

So a 1280x720 luma buffer is a 320x720 ARGB buffer

With the packed buffer bilinear filtering is not correct

Have to manually filter horizontally using DOTP

Colin will go into this later
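A small CPU sketch of the packing and the manual horizontal filter may help here. It packs four adjacent luma values into one four-component texel and then interpolates between neighbouring pixels with dot products against weight vectors, since hardware bilinear would blend whole texels rather than adjacent pixels. The names and the clamp-at-row-end behaviour are assumptions for the example.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Texel = std::array<float, 4>;

// Pack a row of luma so that texel k holds pixels 4k .. 4k+3.
// Assumes luma.size() is a multiple of 4.
inline std::vector<Texel> packLumaRow(const std::vector<float>& luma) {
    std::vector<Texel> packed(luma.size() / 4);
    for (std::size_t i = 0; i < luma.size(); ++i)
        packed[i / 4][i % 4] = luma[i];
    return packed;
}

inline float dot4(const Texel& a, const Texel& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Linear interpolation between pixel x and pixel x+1 (fraction t in [0, 1]),
// done with dot products instead of hardware bilinear filtering.
inline float sampleLumaLinear(const std::vector<Texel>& packed, std::size_t x, float t) {
    const std::size_t k = x / 4;
    const std::size_t c = x % 4;
    Texel w0 = {0.0f, 0.0f, 0.0f, 0.0f};
    w0[c] = 1.0f - t;
    if (c < 3) {
        w0[c + 1] = t;  // both pixels live in the same texel
        return dot4(packed[k], w0);
    }
    // The right-hand neighbour lives in the next texel (clamped at the row end).
    const std::size_t kNext = (k + 1 < packed.size()) ? k + 1 : k;
    Texel w1 = {0.0f, 0.0f, 0.0f, 0.0f};
    w1[(kNext == k) ? 3 : 0] = t;
    return dot4(packed[k], w0) + dot4(packed[kNext], w1);
}
```

A 1280-pixel row packs into 320 texels, matching the 320x720 buffer size quoted above.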

Advances in Real-Time Rendering in Games - Chroma Sub Sampling

Advances in Real-Time Rendering in Games - Butterfly Packing

Overlay each quadrant into ARGB

Mirror around the image center point

Bilinear now works except across the boundaries

•Re-draw a strip with additive blend and swizzling

•R<->G and B<->A for horizontal

•R<->B and G<->A for vertical

Radial blurs just work

Advances in Real-Time Rendering in Games - Butterfly Unpacking

When rendering the fullscreen quad, two attributes are needed

UV, used in mirror mode: 0,0 / 2,0 / 0,2 / 2,2

Second attribute: 640,-640,-360,-360 / -640,640,-360,-360 / -640,-640,360,-360 / -640,-640,-360,360

Dot with the saturated second component

Advances in Real-Time Rendering in Games - Future Work

Use for hexagonal blurs

Output packed tonemap

•Only perform temporal AA for luma

•Packed luma used for MLAA passes

Advances in Real-Time Rendering in Games - Tiled-based

Deferred Shading on Xbox 360

Advances in Real-Time Rendering in Games - Tiled-based Deferred Shading? (1/)

Want more interesting lighting with more dynamic lights!

Platform is fixed, so make better use of rendering resources

[Swoboda09] and [Coffin11] on Playstation3™, by

[Andersson09] in DirectCompute, and other hybrids

Christina, Johan and I teamed-up for this version on 360

Load-balance and compute lighting where it matters:

1.Divide the screen in screen-space tiles

2.Cull analytical lights (point, cone, line), per tile

3.Compute lighting for all contributing lights, per tile

Advances in Real-Time Rendering in Games - Tiled-based Deferred Shading? (2/)

Advances in Real-Time Rendering in Games - Tiled-based Deferred Shading? (3/)

Advances in Real-Time Rendering in Games - Tiled-based Deferred Shading? (4/)

Advances in Real-Time Rendering in Games - Tiled-based Deferred Shading? (5/)

Advances in Real-Time Rendering in Games - How Does This Fit on Xbox 360?

We don't have DirectCompute nor SPUs on 360...

Fortunately, Xenos is powerful, and will crunch ALU

For maximal throughput, data at rendering time has to

be cleverly pre-digested

If timed properly, we can also use the CPUs to help the

GPU along the way...

GPU is better at analyzing a scene than CPUs…

Let’s use it to classify the scene

Advances in Real-Time Rendering in Games - GPGPU Culling (1/)

Our screen is divided into 920 tiles of 32x32 pixels

Downsample and classify the scene from 720p to 40x23

(1 pixel == 1 tile)

Find each tile’s Min/Max depth

Find each tile’s material permutations

Downsampling is done in multi-pass and via MRTs

Similar to [Hutchinson10]
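The tile count and the per-tile classification can be sketched on the CPU as follows. The talk does this on the GPU in several downsampling passes with MRTs; the single-pass loop, the bitmask encoding of material permutations and all names below are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct TileInfo {
    float minDepth = std::numeric_limits<float>::max();
    float maxDepth = std::numeric_limits<float>::lowest();
    std::uint32_t materialMask = 0;  // OR of per-pixel material bits (assumed encoding)
};

// Number of tiles needed to cover `pixels`, rounding up for partial tiles.
constexpr int tilesAcross(int pixels, int tileSize) {
    return (pixels + tileSize - 1) / tileSize;
}

// Reduce a full-resolution depth and material buffer to one TileInfo per tile.
inline std::vector<TileInfo> classifyTiles(const std::vector<float>& depth,
                                           const std::vector<std::uint32_t>& materialBits,
                                           int width, int height, int tileSize) {
    const int tilesX = tilesAcross(width, tileSize);
    const int tilesY = tilesAcross(height, tileSize);
    std::vector<TileInfo> tiles(static_cast<std::size_t>(tilesX) * tilesY);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            TileInfo& t = tiles[static_cast<std::size_t>(y / tileSize) * tilesX + (x / tileSize)];
            const std::size_t i = static_cast<std::size_t>(y) * width + x;
            t.minDepth = std::min(t.minDepth, depth[i]);
            t.maxDepth = std::max(t.maxDepth, depth[i]);
            t.materialMask |= materialBits[i];
        }
    }
    return tiles;
}
```

For 1280x720 and 32-pixel tiles, `tilesAcross` gives 40 columns and 23 rows (the last row is only partly covered), which is where the 920 tiles come from.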

Advances in Real-Time Rendering in Games - GPGPU Culling (2/)

Advances in Real-Time Rendering in Games - GPGPU Culling (3/)

Advances in Real-Time Rendering in Games - GPGPU Culling (4/)

Advances in Real-Time Rendering in Games - GPGPU Culling (5/)

Build mini-frusta for each tile

Cull lights against sky-free tiles in a shader

Store the culling results in a texture:

Column == Light ID

Row == Tile ID

Actually, 4 lights can be processed at once (A-R-G-B)

Read back the contribution results on the CPU and

prepare for lighting!
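On the CPU side, reading the culling texture back could look roughly like the sketch below. It assumes one row per tile, four lights per texel (so light ID = 4 * column + channel) and a non-zero channel meaning "this light touches the tile". The exact channel order and encoding are not spelled out in the talk, so treat these as placeholders, along with the type and function names.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Texel = std::array<float, 4>;                    // four lights per texel (assumed order A, R, G, B)
using CullTexture = std::vector<std::vector<Texel>>;   // indexed as [tile row][texel column]

// Collect the IDs of all lights flagged as touching the given tile.
inline std::vector<int> lightsForTile(const CullTexture& results, int tile) {
    std::vector<int> lights;
    const std::vector<Texel>& row = results[static_cast<std::size_t>(tile)];
    for (std::size_t column = 0; column < row.size(); ++column) {
        for (std::size_t channel = 0; channel < 4; ++channel) {
            if (row[column][channel] > 0.0f)
                lights.push_back(static_cast<int>(column * 4 + channel));
        }
    }
    return lights;
}
```

The resulting per-tile lists are what the lighting setup would then regroup by light type and material permutation.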

Advances in Real-Time Rendering in Games - I Need a Light

Parse the culling results texture on CPU

For each light type,

For each tile,

For each material permutation,

Regroup & set the light parameters for the PS constants

Set up the shader loop counter*

Additively render lights with a single draw call (to the final HDR

lighting buffer)

Advances in Real-Time Rendering in Games - Results (1/)

Advances in Real-Time Rendering in Games - Results (2/)

Advances in Real-Time Rendering in Games - Timeline

[Figure: frame timeline. GPU: G-Buffer, DS, C, Other, Light. CPU: G-Buffer, Other, Prepare]

DS: Downsample / Classify

C: Cull

Light: Lighting pass

We kick CPU jobs from the GPU using a MEMEXPORT

shader (i.e.: write token at specific address, job starts)

Advances in Real-Time Rendering in Games - Don’t Upset The GPU (1/)

Constant Waterfall sucks!

This WILL kill performance

To prevent, use the aL register when iterating over lights

[Pritchard10]

If set properly, ALU / lighting will run at 100% efficiency

In C++ Code

int lightCounter[4] = { count, start, step, 0 };

pDevice->SetPixelShaderConstantI(0, lightCounter, 1);

Advances in Real-Time Rendering in Games - Don’t Upset The GPU (2/)

int tileLightCount : register(i0);
float4 lightParams[NUM_LIGHT_PARAMS] : register(c0);

[Figure: diagram labels: start, count*step, step]

[loop]
for (int iLight = 0; iLight < tileLightCount; iLight++)
{
    float4 params1 = lightParams[iLight + 0]; // mov r0 c0[0+aL]
    float4 params2 = lightParams[iLight + 1]; // mov r1 c0[1+aL]
    float4 params3 = lightParams[iLight + 2]; // mov r2 c0[2+aL]
    …
}

Advances in Real-Time Rendering in Games - Don’t Upset The GPU (3/)

Use Dr.PIX, and check shader disassembly!

These shaders are ALU bound

Simplify your math, especially in the loops!

Get rid of complicated non 1:1 instructions (e.g. smoothstep)

Play with microcode: -normalize(v) is faster than normalize(-v)

Move code around to help with dual-issuing:

/* 14 */ mul r5.xyz, r4.yzx, r4.yzx

this + mulsc r0.w, c254.y, r0.z

Use shader predicates to help the compiler ([flatten], [branch], [isolate],

[ifAny], [ifAll]), and tweak GPRs!

Advances in Real-Time Rendering in Games - Don’t Upset The GPU (4/)

Use GPU freebies

Texture sampler scale/bias (*2-1)

Simplify / remove unneeded code via permutations

Upload constants via the constant buffer pointers

We use async pre-compiled command buffers

(APCBs)

Keep them lean & mean (check contents in PIX)

For more info, check out Ivan’s awesome

presentation from Gamefest 2011 [Nevraev11]

Advances in Real-Time Rendering in Games - Performance

Light Type          Performance (8 lights/tile, every tile)
Point               4.0 ms
Point (with Spec)   7.8 ms
Cone                5.1 ms
Cone (with Spec)    5.3 ms
Line                5.8 ms

Classification: 1.35 ms (with resolves)

Advances in Real-Time Rendering in Games - Temporally-stable

Screen-Space Ambient Occlusion

Advances in Real-Time Rendering in Games - SSAO in Frostbite 2 (1/)

SSAO for mid-range PC & consoles, HBAO for high-end PC

Line sample [Loos10], with linear depth in a texture

Linearize depth for better precision/distribution

kZ = -far * near / (far - near)

kW = far / (far - near)

linearDepth = kZ / (z - kW)

Sample linear depth texture with linear sampling

Scale SSAO parameters over distance

Final compositing with Hi-Stencil, reject sky pixels

4x4 random noise texture is sufficient, 1:1 (texel:pixel)

Advances in Real-Time Rendering in Games - SSAO in Frostbite 2 (2/)

Line sampling, from Volumetric Obscurance [Loos10] - HBAO in Frostbite 2

Advances in Real-Time Rendering in Games - Blurring the line (1/)

Dynamic AO is done best with edge-preserving blur /

bilateral filtering

On consoles, we have really tight budgets

Scenes are pretty action-packed, halos not too noticeable

AO should be a subtle effect

We need to find the fastest way to blur AO, and it has to look soft! (e.g.: 9x9 Gaussian, with bilinear)

Advances in Real-Time Rendering in Games - Fast Grayscale Blur - 8 as 8888 (1/)

Reduce the number of taps: aliasing the AO results from

R8 as A8R8G8B8

1 horizontal tap == 4 taps (ARGB)

Combine with bilinear sampling (vertical pass only)

9x9 Gaussian = 3 horizontal taps and 5 vertical taps

On PS3: alias the memory directly

On 360: Formats are different in memory, use resolve

remap textures. See FastUntile XDK sample.

Advances in Real-Time Rendering in Games - Fast Grayscale Blur (1/)

Horizontal: 9 “samples”, 3 point taps

Vertical: 9 “samples”, 5 bilinear taps

Advances in Real-Time Rendering in Games - Fast Grayscale Blur (2/)

For a 640x360 SSAO Buffer (720p / 2)

Technique                               PlayStation 3   Xbox 360
9x9 Gaussian                            0.5 ms          0.65 ms (0.52 ms + 0.132 ms resolve)
9x9 Gaussian (Bilinear, as R8)          0.40 ms         0.43 ms (0.3 ms + 0.132 ms resolve)
9x9 Gaussian (Bilinear, as A8R8G8B8)    0.10 ms         0.18 ms (0.143 ms + 0.034 ms resolve)

Average (total) AO performance (compute + blur + blit) :

(360: 1.25-1.5 ms ; PS3: 1.5-2.0 ms)

Advances in Real-Time Rendering in Games - Thank You

Christina Coffin

Johan Andersson

Ivan Nevraev

Daniel Collin

Khalid Khalkhouli

Andrew Routledge

Stephen Hill

Aurelio Reis

Alex Ferrier

Fredrik Seehussen

Alex Fry

Natalya Tatarchuk

Mohsin Hasan

Advances in Real-Time Rendering in Games - Questions

John White – Bokeh, Z Cull Reverse Reload, Chroma Subsampling

Colin Barré-Brisebois – Tile-based Deferred Shading, SSAO

Advances in Real-Time Rendering in Games - References (1/)

ANDERSSON, J., “Parallel Graphics in Frostbite - Current & Future”, Beyond Programmable Shading, SIGGRAPH 2009.

BAVOIL, L., SAINZ, M., and DIMITROV, R., “Image-space horizon-based ambient occlusion”, SIGGRAPH 2008.

COFFIN, C., “SPU Based Deferred Shading for Battlefield 3 on Playstation 3”, GDC 2011.

DEMERS, J., “Depth of Field : A Survey of Techniques”, GPU Gems, Ch.23.

HUTCHINSON, N. et al., “Screen Space Classification for Efficient Deferred Shading”, SIGGRAPH 2010.

JONES, M., “Optimal CoC Calculation”, Rendering @ EA internal mail, 2008.

Advances in Real-Time Rendering in Games - References (2/)

LOOS, B. J., and SLOAN, P-P., “Volumetric Obscurance”, 2010.

NEVRAEV, I., “Xbox 360 Precompiled Command Buffers”, Microsoft Gamefest London 2011.

PRITCHARD, C., “Xbox 360 Shaders and Performance: How Not to Upset the GPU”, Microsoft Gamefest Seattle 2010.

SOUSA, T., “Crysis Next Gen Effects”, GDC 2008.

SWOBODA, M., “Deferred Lighting and Post-Processing on PLAYSTATION®3”, GDC 2009.

KAWASE, M., “Frame Buffer Postprocessing Effects in DOUBLE-S.T.E.A.L (Wreckless)”, GDC 2003.
