This page reproduces the content of http://www.slideshare.net/Mark_Kilgard/gpuaccelerated-path-rendering (uploaded 2012/09/15, category: Technology).

Preprint for SIGGRAPH Asia 2012

Copyright ACM, 2012

For thirty years, resolution-independent 2D standards (e.g. PostScript, SVG) have depended on CPU-based algorithms for the filling and stroking of paths. However, advances in graphics hardware have largely ignored the problem of accelerating resolution-independent 2D graphics rendered from paths.

Our work builds on prior work to re-factor the path rendering task to leverage existing capabilities of modern pipelined and massively parallel GPUs. We introduce a two-step “Stencil, then Cover” (StC) paradigm that explicitly decouples path rendering into one GPU step to determine a path’s filled or stroked coverage and a second step to rasterize conservative geometry intended to test and reset the coverage determinations of the first step while shading color samples within the path. Our goals are completeness, correctness, quality, and performance, but we go further to unify path rendering with OpenGL’s established 3D rendering pipeline. We have built and productized our approach to accelerate path rendering as an OpenGL extension.
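The two decoupled steps can be mimicked on the CPU to make the idea concrete. The following is a minimal sketch under stated assumptions, not the extension's implementation: it handles only a straight-edged polygon, uses one sample per pixel at the pixel center, and replaces the stencil buffer with a Python list. The names `stencil_fill` and `cover_fill` are hypothetical. The stencil step counts signed triangle-fan coverage (a winding number) per sample; the cover step walks conservative bounding-box geometry, shades samples whose stencil value is non-zero, and resets them.

```python
import math

def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); positive when counter-clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (b[0] - o[0]) * (a[1] - o[1])

def strictly_inside(p, a, b, c):
    """True when p lies strictly inside triangle (a, b, c), for either orientation."""
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (d1 > 0 and d2 > 0 and d3 > 0) or (d1 < 0 and d2 < 0 and d3 < 0)

def stencil_fill(polygon, width, height):
    """Stencil step: accumulate a winding count per pixel-center sample."""
    stencil = [[0] * width for _ in range(height)]
    anchor = polygon[0]
    for i in range(1, len(polygon) - 1):
        b, c = polygon[i], polygon[i + 1]
        facing = cross(anchor, b, c)
        if facing == 0:
            continue  # degenerate fan triangle covers no samples
        delta = 1 if facing > 0 else -1  # front-facing increments, back-facing decrements
        for y in range(height):
            for x in range(width):
                if strictly_inside((x + 0.5, y + 0.5), anchor, b, c):
                    stencil[y][x] += delta
    return stencil

def cover_fill(polygon, stencil, frame, color):
    """Cover step: test, shade, and reset stencil values inside the bounding box."""
    xs = [v[0] for v in polygon]
    ys = [v[1] for v in polygon]
    x0, x1 = max(0, math.floor(min(xs))), min(len(frame[0]), math.ceil(max(xs)))
    y0, y1 = max(0, math.floor(min(ys))), min(len(frame), math.ceil(max(ys)))
    for y in range(y0, y1):
        for x in range(x0, x1):
            if stencil[y][x] != 0:   # stencil test
                frame[y][x] = color  # shade the covered sample
                stencil[y][x] = 0    # reset for the next path

# Usage: a concave polygon whose fan contains one back-facing triangle.
polygon = [(0, 0), (4, 0), (4, 3), (2, 3), (2, 1), (0, 1)]
stencil = stencil_fill(polygon, 4, 3)
frame = [["."] * 4 for _ in range(3)]
cover_fill(polygon, stencil, frame, "#")
```

Because the cover step only consults the stencil values, anything that writes the stencil (another path used for clipping, for example) can influence what gets shaded, which is the flexibility the decoupling is meant to expose.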

Computer Graphics Proceedings, Annual Conference Series, 2012

GPU-accelerated Path Rendering

Mark J. Kilgard

Jeff Bolz

NVIDIA Corporation∗

Figure 1: GPU-accelerated scenes rendered at super-real-time rates with our system: Snail Bob Flash-based game (5ms) by permission of

Andrey Kovalishin and Maxim Yurchenko, Van Gogh SVG scene with gradients (5.26ms) by permission of Enrique Meza C, complete (shown

clipped) SIGGRAPH web page (4.8ms), and SVG scene with path clipping (1.9ms) by permission of Michael Grosberg, all rendered on a

GeForce 560M laptop.

Abstract

For thirty years, resolution-independent 2D standards (e.g. PostScript, SVG) have depended on CPU-based algorithms for the filling and stroking of paths. Advances in graphics hardware have largely ignored accelerating resolution-independent 2D graphics rendered from paths.

We introduce a two-step “Stencil, then Cover” (StC) programming interface. Our GPU-based approach builds upon existing techniques for curve rendering using the stencil buffer, but we explicitly decouple in our programming interface the stencil step to determine a path’s filled or stroked coverage from the subsequent cover step to rasterize conservative geometry intended to test and reset the coverage determinations of the first step while shading color samples within the path. Our goals are completeness, correctness, quality, and performance, yet we go further to unify path rendering with OpenGL’s established 3D and shading pipeline. We have built and productized our approach to accelerate path rendering as an OpenGL extension.

CR Categories: I.3.2 [Computer Graphics]: Graphics Systems - Stand-alone systems

Keywords: path rendering, vector graphics, OpenGL, stencil buffer

∗e-mail: {mjk,jbolz}@nvidia.com

Copyright ACM, (2012). This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version will be published in ACM Transactions on Graphics, http://doi.acm.org

ACM SIGGRAPH 2012, Singapore, November 28–December 1, 2012

1 Introduction

When people surf the web, read PDF documents, interact with smart phones or tablets, use productivity software, play casual games, or in everyday life encounter the full variety of visual output created digitally (advertising, books, logos, signage, etc.), they are experiencing resolution-independent 2D graphics.

While 3D graphics dominates graphics research, we observe that most visual interactions between humans and computers involve 2D graphics. Sometimes this type of computer graphics is called vector graphics, but we prefer the term path rendering because the latter term emphasizes the path as the unifying primitive for this approach to rendering.

1.1 Terminology of Path Rendering

A path is a sequence of trajectories and contours. In this context, a trajectory is a connected sequence of path commands. Path commands include line segments, Bézier curve segments, and partial elliptical arcs. Each path command has an associated set of numeric parameters known as path coordinates. When a pair of path coordinates defines a 2D (x, y) location, this pair is a control point. Intuitively a trajectory corresponds to pressing a pen’s tip down on paper, dragging it to draw on the paper, and eventually lifting the pen.

A contour is a trajectory with the same start and end point; in other words, a closed trajectory. These contours and trajectories may be convex, self-intersecting, nested in other contours, or may intersect other trajectories/contours in the path. There is generally no bound on the number of path segments or trajectories/contours in a path. For a rendering “primitive,” paths can be quite complex.

Paths are rendered by either filling or stroking the path. Conceptually, path filling corresponds to determining what points (framebuffer sample locations) are logically “inside” the path. Stroking is roughly the region swept out by a fixed-width pen, centered on the trajectory, that travels along the trajectory orthogonal to the trajectory’s tangent direction.

1.2 History, Standards, Motivation, and Contributions

Seminal work by Warnock and Wyatt [1982] introduced a coherent model for path rendering. Since that time, many standards and programming interfaces have incorporated path rendering constructs

into their 2D graphics framework. Without being exhaustive, we note

• document presentation and printing: PostScript [Adobe Systems 1985], PDF [Adobe Systems 2008a]

• font specification: PostScript fonts [Adobe Systems 1992]

• immersive web: Flash [Adobe Systems 2008b], HTML 5’s Scalable Vector Graphics [SVG Working Group 2011a]

• 2D programming interfaces: OpenVG [Khronos Group 2008]

• productivity software: Illustrator, Photoshop, Office

Despite path rendering’s 30 year heritage and broad adoption, it has not benefited from acceleration by graphics hardware to anywhere near the extent 3D graphics has. Most path rendering today is performed by the CPU with sequential algorithms, not particularly different from their formulation 30 years ago. Our motivation is to harness existing GPUs to improve the overall experience achievable with path rendering.

We present a productized system for GPU-accelerated path rendering in the context of the OpenGL graphics system; see some of our rendering results in Figure 1. Our system works on the three most-recent architectural generations of GeForce and Quadro GPUs, and we expect all recent GPUs can support the algorithms and programming interface we describe.

The primary contributions delivered by our system are:

• A novel “stencil, then cover” programming interface for path rendering, well-suited to acceleration by GPUs.

• Our programming interface’s efficient implementation within the OpenGL graphics system to avoid CPU bottlenecks.

• Accompanying algorithms to handle tessellation-free stenciled stroking of paths, standard stroking embellishments such as dashing, clipping paths to arbitrary paths, and mixing 3D and path rendering.

Section 2 reviews prior path rendering systems. Section 3 explains our approach; we cite the crucial prior research that our approach integrates in this section. Section 4 compares our quality and performance to other implementations and highlights our system’s novel ability to mix with 3D and GPU-shaded rendering. Section 5 discusses opportunities for future work.

1.3 New Demands on Path Rendering

Historically, applications mostly “pre-render” 2D content specified with paths into bitmaps for glyphs and icons/images for vector artwork, then cache and blit those rasterized results as needed. Rendering directly from the path data generally proved too slow to be viable. Early window systems based on path rendering concepts such as Sun’s NeWS [Gosling et al. 1989] and Adobe’s Display PostScript [Adobe Systems 1993] were arguably overly ambitious in basing their 2D rendering model around path rendering rather than resolution-dependent 2D bitmap rendering as did the more successful GDI and X11-based systems that proved easier for 2D graphics hardware to accelerate.

1.3.1 Increasing Screen Density and Resolution

Smart phones and tablets have created new platforms free from legacy display limitations such as relatively low (by today’s available technology) display density (measured in dots-per-inch or DPI) and the dated visual appearance established by resolution-dependent bitmap graphics. Apple’s new iPad display has a display density of 264 DPI, greatly surpassing the 100 DPI density norm for PC screens. These handheld devices are carried directly on one’s person so their screen real estate is relatively fixed, so improvements in display appearance are likely to come through increasing screen density rather than enlarging screen area.

Pixel resolutions for conventional monitors are increasing too. Large 2560x1600 resolution screens are mass-produced and readily available. Driving such high resolutions with CPU-based path rendering schemes is untenable at real-time rates. Indeed the very heterogeneity of modern displays in terms of pixel resolution, screen size, and their combination (pixel density) strengthens the case for improving path rendering performance.

1.3.2 Multi-touch Interfaces

Mobile devices also rely on multi-touch screens for input so the user is extremely aware of the latency between touch gestures and the resulting screen update. The user is literally pointing at the pixels they expect to see updated. Multi-touch encourages rotation and scaling. When imagery can easily be rotated, scaled, sub-pixel translated, and even projected, assumptions that all text and graphics will be orthographically aligned to the screen’s pixel grid are no longer a given so rendering all path content directly from paths makes sense.

1.3.3 Immersive Web Standards

The proximate HTML 5 web standard exposes path rendering functionality in a standard and pervasive way through both Scalable Vector Graphics (SVG) and the Canvas element.

JavaScript performance has increased to the point that dynamic content can be orchestrated within a standards-based HTML 5 web page such that the system’s path rendering performance is often a bottleneck.

1.3.4 Power Wall

Minimizing power consumption has become a mantra for computer system design across all classes of devices, whether mobile devices or not. When power is at a premium, moving CPU- and bandwidth-intensive computations such as pixel manipulation and rasterization to more power-efficient GPU circuitry can reduce overall power consumption while improving interactivity and minimizing update latency. GPU-acceleration of path rendering is precisely such an opportunity.

2 Prior Path Rendering Systems

2.1 CPU-based Path Rendering Systems Critiqued

Path rendering is historically and still typically performed by CPU-based scan line renderers. Paths are converted into a set of edges. These edges are transformed and sorted in Y-order. Then the edges are intersected with each scan line, sorted in X-order, and pixels in the path region are updated.

The scan-line rendering approach is notable for being work-efficient and cache-friendly. No computation is expended on pixels that are obviously outside the path and only active edges are considered when processing a given scan line. Such scan line renderers use a “chunker” strategy where, rather than the chunk being a 2D tile, the chunk is a single scan line. This leads to a reasonably friendly access pattern for CPU caches. Additionally the scan

line enter/leave counts are transient. In contrast to a window-sized ancillary buffer such as a depth or stencil buffer, the scan line enter/leave counts can live in the cache and have their storage recycled for each processed scan line.

Figure 2: Performance ratios rendering SVG content at window resolutions from 100x100 to 1100x1100. A ratio of 1.0 means the NV_path_rendering (16 samples per pixel) performance is equal to the other renderer; higher ratios indicate how many multiples faster NV_path_rendering is than the alternative. Note the logarithmic Y axis. Scenes were selected for their variety. Benchmark configuration is a GeForce 650 and fast Core i7 CPU.

While work-efficient and cache-friendly as noted, this CPU-intensive approach is quite sequential. Every path must be transformed into screen space. Every path must be scan line rasterized. Every scan line must be intersected with the active edge list. Every sorted active edge list must be scanned left-to-right. There is not an easy way to pipeline all these tasks or exploit massive parallelism, such as is routine for GPU-accelerated 3D graphics. Hence this is an approach that maps well to the CPU but cannot be obviously accelerated in this form by the GPU.

2.2 GPU-based Path Rendering Systems

Over the years, many attempts have been made, with varying degrees of success, to accelerate path rendering with GPUs. We postpone discussion of prior techniques for GPU rendering of curves with the stencil buffer to Section 3 since they are the basis for our approach.

2.2.1 Acceleration of Path Rendering Programming Interfaces

Cairo [Packard and Worth 2003] is an open-source path rendering implementation. An early attempt at GPU-acceleration called Glitz [Nilsson and Reveman 2004] has since been abandoned. Glitz operated at the level of the XRender [Packard 2001] extension so did not accelerate paths directly. Arguably, Glitz was more a GPU-assisted back-end than a GPU-accelerated one. More recently, Cairo has worked on a first-class GPU back-end but the immediate mode nature of the Cairo API and converting CPU-transformed paths to spans limits the acceleration opportunities.

Microsoft’s Direct2D [Kerr 2009] API is layered upon Direct3D. Direct2D operates by transforming paths on the CPU and then performing a constrained trapezoidal tessellation of each path. The result is a set of pixel-space trapezoids and additional shaded geometry to compute fractional coverage for the left and right edges of the trapezoids. These trapezoids and shaded geometry are then rasterized by the GPU. The resulting performance is generally better than entirely CPU-based approaches and requires no ancillary storage for multisample or stencil state; Direct2D renders directly into an aliased framebuffer with properly antialiased results. Direct2D’s primary disadvantage is that the ultimate performance is determined not by the GPU (doing fairly trivial rasterization) but rather by the CPU performing the transformation and trapezoidal tessellation of each path and Direct3D validation work.

Skia is the C++ path rendering API used by Google’s Android and Chrome browsers. Skia has a conventional CPU-based path renderer but has recently integrated a new OpenGL ES2-accelerated back-end called Ganesh. Ganesh has experimented with two accelerated approaches. The first used the stencil buffer to render paths. Because of API overheads, this first approach was replaced with a second approach where the CPU-based rasterizer computes a coverage mask which is loaded as a texture upon every path draw to provide the GPU proper antialiased coverage. This hybrid scheme is often bottlenecked by the dynamic texture updates required for every rendered path.

The Khronos standards body worked to develop an API standard known as OpenVG with the explicit goal of enabling hardware-acceleration of path rendering (the VG stands for vector graphics). Various companies and research groups have worked to develop OpenVG hardware designs [FreeScale, Multimedia Applications Division 2010; Huang and Chae 2006; Kim et al. 2008] that, based on available descriptions, are fairly similar to the conventional CPU-based scan line rasterizer scheme, recast as a hardware unit. Reported performance levels are quite low compared to what we report.
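For reference, the conventional scan-line scheme of Section 2.1 (edges sorted in Y, intersected with each scan line, sorted in X, then scanned left to right with enter/leave counts) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than any particular renderer's code: it takes a single closed polygon with straight edges, samples once per pixel at the pixel center, uses the non-zero winding rule, and rescans all edges per scan line instead of maintaining an incremental active edge list. The name `scanline_fill` is hypothetical.

```python
def scanline_fill(polygon, width, height):
    """Return rows of 0/1 coverage for a closed polygon using the non-zero rule."""
    # Convert the path into edges: (y_min, y_max, x_at_y_min, dx/dy, direction).
    edges = []
    n = len(polygon)
    for i in range(n):
        (x0, y0), (x1, y1) = polygon[i], polygon[(i + 1) % n]
        if y0 == y1:
            continue  # horizontal edges never cross a scan line's sample row
        direction = 1 if y1 > y0 else -1  # +1 when the edge heads toward increasing y
        if y0 > y1:
            x0, y0, x1, y1 = x1, y1, x0, y0
        edges.append((y0, y1, x0, (x1 - x0) / (y1 - y0), direction))
    edges.sort(key=lambda e: e[0])  # Y-order

    rows = []
    for row in range(height):
        sy = row + 0.5
        # Intersect the edges active on this scan line and sort crossings in X-order.
        crossings = sorted(
            (x0 + (sy - y0) * slope, d)
            for (y0, y1, x0, slope, d) in edges
            if y0 <= sy < y1
        )
        line = [0] * width
        winding = 0  # transient enter/leave count, recycled for every scan line
        k = 0
        for col in range(width):
            sx = col + 0.5
            while k < len(crossings) and crossings[k][0] <= sx:
                winding += crossings[k][1]
                k += 1
            line[col] = 1 if winding != 0 else 0
        rows.append(line)
    return rows

# Usage: a concave polygon on a 4x3 pixel grid.
coverage = scanline_fill([(0, 0), (4, 0), (4, 3), (2, 3), (2, 1), (0, 1)], 4, 3)
```

Every stage here depends on the previous one finishing for the same scan line, which is the sequential structure that makes this formulation a poor fit for a massively parallel GPU.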

2.2.2 Vector Texture Schemes

An unconventional approach to GPU-accelerating path rendering is cleverly encoding path content into GPU memory, typically as a texture, and then using a programmable shader essentially to decode the path content. Nehab and Hoppe [2008] and Qin [2009] adopt variations on this approach. While this approach has some interesting advantages such as being able to directly “texture map” 3D geometry with path rendered content, these approaches suffer from the need to preprocess a static path scene into a specific texture encoding. This makes this approach unsuitable for editable or dynamic path rendering. Additionally, many rendering approximations and authoring limitations are needed to make vector texture schemes tractable.

2.2.3 Discussion of Deficiencies

The norm for CPU-based path rendering systems is maintaining roughly 16 coverage samples per pixel (details vary). This creates a challenge for GPU-based schemes because GPUs often support 1, 2, 4, or 8 samples per pixel through multisampling. This often creates a situation where the GPU-accelerated path rendering is inferior to the CPU-based path rendering quality.

When path rendering schemes are layered upon existing OpenGL or Direct3D APIs, we have observed performance being limited by the state change rate of the underlying 3D API. Often path rendering can result in many state changes per path when scenes can easily consist of 100s or 1000s of paths. In this case, the API overhead can substantially limit the overall performance. Our experience studying prior approaches to using GPUs for path rendering indicates these approaches are often more GPU-assisted than GPU-accelerated, with this attributable to continuous CPU involvement or substantial CPU-based preprocessing.

3 Our Approach

In contrast to other systems for accelerating path rendering with GPUs, our approach explicitly reveals the coverage determinations. These determinations, for both filling and stroking, appear as stencil buffer updates. A crucial insight underlying our approach is never determining the boundary between the “inside” and “outside” of a stroke or fill. Instead, we rely on point-sampled determinations of whether a particular (x, y) framebuffer location is inside or outside the stroke or fill. For antialiasing, we rely on GPU multisampling to provide multiple sample coverage positions, each with its own sub-pixel stencil value.

3.1 Stencil, then Cover

We perform path rendering in two steps. This is not unique; all path rendering schemes involve two steps. The two steps may be “tessellate, then render tessellation” [Kerr 2009] or “intersect with scan line, then paint pixels” [Packard and Worth 2003] or “ray cast, then shade” [Nehab and Hoppe 2008] but each rendering of a path is inherently sequential in the sense that determining what pixels are covered must precede shading and blending those pixels.

What is novel in our approach is explicitly decoupling the two steps. We call our approach, with its two decoupled steps, “Stencil, then Cover” (StC). Rather than a single DrawPath operation that hides the two-step nature of path rendering within the implementation, an OpenGL application using our extension first “stencils” the path in the stencil buffer [Akeley and Foran 1995], then “covers” the path to shade it.

This explicitly decoupled approach has advantages not available in interfaces that appear to offer a one-step DrawPath command. Our two-step approach makes arbitrary path clipping, mixing with 3D graphics, programmable blend modes, and other novel path rendering usages possible.

3.2 Filling

3.2.1 Improvements to Prior Methods

Our approach to filling paths is inspired by the work of Loop and Blinn [2005] who developed an efficient fragment shader-based approach to determining whether or not an (x, y) sample is inside or outside a given quadratic or cubic Bézier hull. In the Loop-Blinn formulation, inexpensive arithmetic on interpolated texture coordinates provides a Boolean predicate which when true indicates the fragment’s sample position is not inside the Bézier region.

Our approach is not the first time the stencil buffer has been utilized for stenciling paths. Kokojima et al. [2006] applied the Loop-Blinn scheme in conjunction with the stencil buffer to determine the winding number of TrueType glyph outlines. Kokojima et al. showed the general filled polygon algorithm of Lane et al. [1983], subsequently popularized for use with the stencil buffer [Neider et al. 1993], can naturally combine with the Loop-Blinn quadratic discard shader to determine the samples inside an arbitrarily complex TrueType outline. After stenciling each glyph into the stencil buffer, conservative geometry based on a convex hull or bounding box can test against the non-zero stencil values, shade those samples, and reset the stencil values back to an initial zero state.

Kokojima’s approach does not immediately extend to cubic Bézier segments because the inside region within a cubic Bézier hull is not necessarily convex. Rueda et al. [2008] addressed this by providing simple topological strategies to subdivide cubic Bézier hulls using Bézier subdivision to guarantee convexity, but used an overly expensive discard fragment shader based on Bézier normalization rather than applying the Loop-Blinn cubic formulation.

Our approach to handling cubic Bézier segments builds on all this work by combining cubic Bézier convex subdivision rules with the Loop-Blinn cubic formulation. We also perform the discard shaders at sample-rate rather than pixel-rate for improved coverage determinations and antialiasing. We use interpolation at explicit sample positions and our target GPU’s sample mask functionality to evaluate multiple samples within a pixel in a single shader instance.

PostScript, SVG, and other standards support partial circular and elliptical arcs so an additional discard shader, expressed in Cg, handles these cases:

    void roundCoverage(float2 st : TEXCOORD0.CENTROID)
    {
      if (st.s*st.s + st.t*st.t > 1) discard;
    }

with the (s, t) texture coordinates assigned so (0,0) is centered at the origin of roundness to discard samples outside the arc region contained in a sequence of one or more polygonal hulls bounding such arcs.

3.2.2 Baked Form of Filled Paths

In order to render a filled path, we “bake” the path into a resolution-independent representation from which the path can be stenciled under arbitrary projective transforms. This baking process takes time linearly proportional to the number of path commands. The resulting baked path data resides completely on the GPU. The required GPU storage is also linearly proportional to the number of

path commands. For a static path, the baking process needs to be done just once; the baking process must be repeated if the path’s commands or coordinates change, but edits to the path, including insertions and deletions of commands, require just re-baking the path segments at or immediately adjacent to the edits.

Once baked, a filled path is reduced to five sets of primitives:

1. Polygonal anchor geometry (structured as triangle fans), rendered with no shader.

2. Quadratic discard triangles, rendered with a Loop-Blinn quadratic discard shader.

3. Cubic discard triangles (if the cubic Bézier hull is a triangle) and quadrilaterals, rendered with a Loop-Blinn cubic discard shader.

4. Arc discard triangles, rendered with the roundCoverage discard shader shown above.

5. Conservative covering geometry, typically a triangle fan or quadrilateral.

Primitive sets #1 through #4 are rendered during the stencil fill step. Two-sided stencil testing increments non-discarded stencil samples of front-facing primitives; back-facing primitives instead decrement non-discarded stencil samples. Primitive set #5 is rendered during the cover fill step. Primitive sets #2 through #4 have properly assigned texture coordinates that drive each set’s respective discard shader. Figure 3 visualizes the baked anchor, discard, and cover geometry.

Figure 3: Filled path, with control points, with anchor geometry, and with cubic Bézier discard hulls, and conservative cover geometry.

This approach to path filling is theoretically sound because the stencil rendering reduces to a winding number computation consistent with a discrete formulation of Jordan’s Theorem [Fabris et al. 1997].

All the data for a baked path can be stored within a single allocation of GPU memory to minimize the expense of stenciling or covering the path. Because the baked representation is completely resolution-independent, robust, and entirely on the GPU, the CPU overhead to launch the stenciling and/or covering of an already baked path object is minimal.

We implement our approach as an OpenGL extension named NV_path_rendering [Kilgard 2012]. Performing the stencil and cover steps within the graphics driver avoids the API and driver validation overhead (see Section 4.2) that plagued other GPU-based approaches. Figure 4 shows how our new path pipeline co-exists with the existing pipelines in OpenGL for pixel and vertex processing.

Figure 4: High-level data flow of OpenGL showing pixel, vertex, and new path pipelines.

3.3 Stroking

Our stroking approach operates similarly to our filling approach whereby we stencil, then cover stroked paths from a baked resolution-independent representation residing on the GPU that requires minimal CPU overhead to both stencil and cover.

3.3.1 Quadratic Bézier Stroking

Analytically determining the points contained by a stroked curved segment is not easy. The boundary of the stroked region of a quadratic Bézier corresponds to an offset curve. While the quadratic Bézier curve generating the offset curve is 2nd order, the offset curve for this generating segment’s boundary is 6th order [Salmon 1960]. This makes exactly determining an intersection with this boundary infeasible, particularly within the execution context of a GPU’s fragment shader. The boundary becomes even more vexing for partial elliptical arcs and cubic Bézier segments. The boundary for a general cubic Bézier curve is 10th order [Farouki and Neff 1990]!

Quadratic Bézier Segment Point Containment. Hence our approach involves simply determining if a given (x, y) point is inside or outside the stroked region of a quadratic Bézier segment. This can be reduced to solving a 3rd order equation.

A quadratic Bézier segment Q, defined by the segment’s three control points C0, C1, and C2, can be converted to monomial form $Q(t) = At^2 + Bt + C$ for $t \in [0, 1]$. A point P is judged within the stroke of Q when there is a parametric value s on Q such that $Q'(s) \cdot (P - Q(s)) = 0$ and the squared distance between Q(s) and P is within the squared stroke radius. (The dot product of a quadratic function and the derivative of a quadratic function is 3rd order.) Intuitively this corresponds to finding the 1 or 3 points Q(s) with a tangent direction orthogonal to the segment connecting P and Q(s). Such solutions s will be local minima or maxima for the distance between P and points on Q so computing the squared distance $d = (Q(s) - P) \cdot (Q(s) - P)$ for each solution s indicates if P is within the stroke of Q(t) for $t \in [0, 1]$ when both $s \in [0, 1]$ and d is less than or equal to the square of half the path’s stroke width. Figure 5 visualizes this procedure.

Solving the cubic equation at every rasterized sample is expensive, but the computation can be simplified somewhat. The cubic equation can be rearranged into an easier-to-solve depressed cubic [Cardano 1545] of the form $t^3 + G(x, y)\,t + H(x, y) = 0$. While the

coefficients G(x, y) and H(x, y) are different for every path-space (x, y) location, the functions G and H are linear in terms of (x, y) so a vertex shader can evaluate G(x, y) and H(x, y) at hull positions and exploit the GPU’s ability to interpolate linearly G and H at positions within the hull.

Figure 5: Visualization of points within and outside the stroked region of a quadratic Bézier segment and their basis for inclusion or not.

Care is taken when an arrangement of quadratic Bézier control points is collinear, collocated, or very nearly so. In such cases, we demote the quadratic Bézier segment to its linear degenerate equivalent for robustness.

Stroked Quadratic Segment Hull Construction. To harness this approach for rendering, we construct a hull around the quadratic Bézier stroked segment. As shown in Figure 6, the hull is typically concave, consisting of seven vertices, though the hull may be convex when the quadratic stroke’s width is wide relative to its arc length. Ruf [2011] has addressed the problem of a tight bounding representation for quadratic strokes, but his approach involves parabolic edges with the assumption the CPU can evaluate such edges efficiently; for our purposes, we want a triangular decomposition of the hull suitable for GPU rasterization.

Figure 6: Examples of concave (top row) and convex (bottom row) stroked quadratic Bézier segment hulls.

While solving the cubic equation, even in depressed form, is expensive, we note that stroked regions are typically small and narrow in screen space so this expensive process is used sparingly in practice. Even when strokes are wide, the massively parallel nature of the GPU makes this approach quite fast. Most important to us, once a quadratic stroke is “baked” for rendering, it can be rendered under an arbitrary linear transformation, including projection, without any further CPU re-processing. The hull vertices and their coefficients for G and H can be stored in GPU memory so that stencil-only rendering the quadratic stroke involves simply configuring the appropriate buffers, the appropriate vertex and fragment shader pair, and rendering the hull geometry of the quadratic stroke.

Higher-order-than-Quadratic Stroking. Path rendering standards incorporate cubic Bézier segments and partial elliptical arcs; these involve cubic and rational quadratic generating curves for rasterized offset curve regions. The direct evaluation approach applied to generating quadratic Bézier curves is not tractable.

Instead we subdivide cubic Bézier segments and partial elliptical arcs into an approximating sequence of quadratic Bézier segments. To maintain a curved boundary appearance at all magnifications, our subdivision approach maintains $G^1$ continuity at quadratic Bézier segment boundaries. No matter how much you zoom into the boundary of higher-order stroked segments, there is never any sign of linear edges or even a false discontinuity in the curvature.

Following the approach of Kim and Ahn [2009], we bound the subdivision such that the true higher-order generating curve never escapes a specified percentage threshold of the stroke width of the approximating quadratic stroke sequence. We also subdivide at key topological features, specifically points of self-intersection and minimum curvature.

3.3.2 Stroking Embellishments

Stroking of line segments, end caps, and joins is straightforward. Stroked line segments are drawn as stencil-only rectangles. Polygon caps (square and triangular) and joins (bevel and miter) are likewise drawn as stencil-only triangles. This geometry can be drawn without any fragment shader. Round caps and joins are drawn with the same roundCoverage stencil-only sample-rate discard shader (Section 3.2.1) used for partial circular and elliptical arcs for filling with the (s, t) texture coordinates assigned appropriately to discard samples outside the circular region of the round cap or join. The baking process for stroked paths includes generating the rectangles and triangles for line segment and polygonal caps and joins. Geometry for round caps and joins is generated along with the texture coordinates to drive the round coverage discard shader.

3.3.3 Dashing

Dashing is a feature of all major path rendering standards except Flash. Dashing complicates stroking by turning on and off the stroking along a path based on an application-specified repeating on-off pattern specified in units of arc length. Our stroke baking process applies the dash pattern while gathering the geometry for the stroked path. While complicated in its details, our dashing process is similar to other path rendering implementations in its high-level structure. The primary difference is curved path segments are reduced to quadratic Bézier segments in our approach instead of line segments. Whereas the arc length computations in conventional path rendering systems typically involve recursive subdivision until the curved segment approximates a line segment, our approach can stop subdividing at quadratic Bézier segments. Unlike higher order curves, the arc length of a quadratic Bézier segment (a

6 - Computer Graphics Proceedings, Annual Conference Series, 2012

segment of a parabola) has a closed form analytical solution:

$$
\int_0^1 \sqrt{Q'_x(t)^2 + Q'_y(t)^2}\,dt =
\frac{(b^2 - 4ac)\,\ln\!\left(\dfrac{b + 2\sqrt{ac}}{b + 2c + 2\sqrt{c(c+a+b)}}\right) + 2(b + 2c)\sqrt{c(c+a+b)} - 2b\sqrt{ac}}{8c^{3/2}}
$$

with copious common subexpressions and where a = B · B, b = 2B · C, and c = C · C. Our interest in this approach is our desire to minimize use of expensive recursive subdivision algorithms while baking stroked paths, particularly during dashing. Some numerical care must be taken to avoid negative square roots, negative logarithms, and division by zero, but these cases occur when quadratic segments are nearly linear.

Our dashing approach results in a resolution-independent baked form of the dashed stroked path. Once dashed and baked, no further CPU-based processing is necessary to render dashed paths. This is in contrast to other implementations of dashed stroking where dashing has a considerable CPU processing expense during rendering. While our implementation must of course represent each segment resulting from dashing, our render-time algorithm is completely oblivious to whether the original path was dashed.

3.3.4 Baked Form of Stroked Paths

Once baked, a stroked path is reduced to four sets of primitives:

1. Polygonal geometry (line segments, bevel and miter joins, square and triangular end caps) with no shader.

2. Triangle fans corresponding to quadratic Bézier segment hulls (curved path segments), rendered with a stroked quadratic discard shader.

3. Triangle fans corresponding to round hulls (round end caps and joins) rendered with a round coverage shader.

4. Conservative covering geometry, typically a triangle fan or quadrilateral.

Primitive sets #1 through #3 are rendered during the stencil stroke step. Primitive set #4 is rendered during the cover stroke step.

The REPLACE stencil operation used for stroking is order-invariant. Therefore we select a static rendering order during the baking process that minimizes GPU state changes during rendering.

The geometry, texture coordinates, and per-hull quadratic discard shader coefficients are all packed into a single GPU buffer allocation. The rendering process for stenciling the baked path is very straightforward, requiring no more than three GPU state reconfigurations, one per primitive set above.

The GPU storage for the linear and quadratic path segments in a baked stroked path is linearly proportional to the number of segments (post-dashing). For cubic Bézier segments and partial elliptical arcs, the storage depends on their required level of subdivision. Because this subdivision is tied to the stroke width, narrower stroke widths require more storage while wider stroke widths require less storage.

3.4 Clipping to Arbitrary Paths

All major path rendering standards support clipping a draw path to the filled region of a clip path. Our two-step “stencil, then cover” approach readily supports clipping to arbitrary paths. We briefly describe the process assuming an 8-bit stencil buffer, initially cleared to zero:

1. Stencil the clip path into the stencil buffer with a “stencil fill” operation.

2. Perform a “cover fill” operation to coalesce the samples matching the fill rule so that the most-significant stencil bit is set and all the lower bits are cleared. For example, if a sample’s stencil value is non-zero, replace the stencil value with 0x80. This step updates only the stencil buffer (disable any color writes).

3. Stencil the draw path into the stencil buffer with a “stencil fill” operation, but (a) modify only the bottom 7 bits of the stencil buffer, and (b) discard any rasterized samples without the topmost bit of the stencil buffer set.

4. Perform a shaded “cover” operation on the draw path. Update any color sample whose stencil value’s bottom 7 bits are non-zero and zero the bottom 7 bits of the sample’s stencil value. Write shaded color samples during this step; due to the stencil configuration, only samples within both the clip and draw paths get shaded and updated.

5. Finally to undo the clip path’s stencil manipulation from step 1, perform a “cover” operation on the clip path to reset the most significant stencil bit back to zero.

Many variations on this approach are possible. For example, steps 3 and 4 can be repeated for each path in a layered group of paths. This avoids having to re-render the clip path for each and every path in a group of paths.

Most standards allow nested clipping of paths to other paths. Clever manipulation of the stencil bit-planes allows such nested clipping. Standards such as SVG allow for clipping to the union of an arbitrary number of paths as shown in Figure 7. Again, we can accomplish this by clever use of stencil bit-planes and re-coalescing coverage from different clip paths.

Figure 7: Complex clipping scenario. Our approach: 8.69ms @ 1000x1000x16. Cairo: 909ms @ 1000x1000. System: Core i7 + GeForce 560M GPU.

3.5 Painting

What path rendering standards often call “painting” a filled or stroked path is called shading in 3D graphics. Our goal is to allow the full generality of GPU-based programmable shading to be exposed when painting paths.
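Returning to the five clipping steps of Section 3.4, the stencil bit manipulation can be sketched as a CPU-side simulation over a handful of samples. This is only an illustration under stated assumptions: the function name `clip_then_draw` and the per-sample winding arrays are hypothetical stand-ins for what GPU rasterization of the clip and draw paths would contribute, a non-zero fill rule is assumed, and no OpenGL calls are made.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulates the five steps on n independent samples of an 8-bit stencil
   buffer that starts cleared to zero. clip_winding[i] and draw_winding[i]
   are hypothetical per-sample winding counts for the clip and draw paths.
   On a GPU each step is a separate pass over all samples; because samples
   are independent, running the steps per sample gives the same result. */
void clip_then_draw(const int *clip_winding, const int *draw_winding,
                    uint8_t *stencil, uint8_t *shaded, size_t n)
{
    assert(clip_winding && draw_winding && stencil && shaded);
    for (size_t i = 0; i < n; ++i) {
        uint8_t s = stencil[i];

        /* Step 1: stencil fill of the clip path (wrapping 8-bit count). */
        s = (uint8_t)(s + clip_winding[i]);

        /* Step 2: cover fill, coalescing non-zero coverage into the top bit. */
        if (s != 0)
            s = 0x80;

        /* Step 3: stencil fill of the draw path into the bottom 7 bits,
           discarding samples whose top bit is not set. */
        if (s & 0x80)
            s = (uint8_t)(0x80 | ((s + draw_winding[i]) & 0x7F));

        /* Step 4: shaded cover of the draw path; shade where the bottom
           7 bits are non-zero, then zero those bits. */
        shaded[i] = 0;
        if (s & 0x7F) {
            shaded[i] = 1;
            s &= 0x80;
        }

        /* Step 5: cover of the clip path resets the top bit. */
        s &= 0x7F;
        stencil[i] = s;
    }
}
```

Only samples with non-zero winding for both paths end up shaded, and the stencil buffer returns to zero, ready for the next clipped draw.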

7 - ACM SIGGRAPH 2012, Singapore, November 28–December 1, 2012

Figure 8: Bump map shader applied to path rendered text, rendered from two different light positions, shown in yellow.

During the cover step where a conservative bounding box or convex hull is rendered to cover fully the stenciling of the path, the application can configure arbitrary OpenGL shading. This could be fixed-function shading, assembly-level shaders, or shaders written in a high-level language such as Cg or GLSL.

In conventional path rendering systems, linear and radial gradients are a common form of paint for paths. We note how straightforward radial gradient paint can be implemented, including mipmapped filtering of the lookup table accesses, with the following Cg shader:

    void radialFocalGradient(float3 str : TEXCOORD0.CENTROID,
                             float4 c : COLOR.CENTROID,
                             out float4 color : COLOR,
                             uniform sampler1D ramp : TEXUNIT0)
    {
      color = c*tex1D(ramp, length(str.xy) + str.z);
    }

The texture coordinates needed for this shader can be generated as a linear function of the path-space coordinate system. Painting need not be limited to conventional types of path rendering paint. Arbitrary fragment shader processing can be performed during the cover step (see Figure 8).

3.6 Blending and Blend Modes

OpenGL blending is sufficient for most path rendering where the default path compositing operation is the “over” blend mode, assuming pre-multiplied alpha. Color writes during our cover step apply the currently configured OpenGL blend state. Modern GPUs also have efficient first-class support for blending in the widely-used sRGB device color space.

Sophisticated path rendering systems have additional blend modes [SVG Working Group 2011b] beyond the standard Porter-Duff compositing algebra [1984]. Digital artists are familiar with these modes with names such as ColorDodge, HardLight, etc. However GPU blending does not support these blend modes because they are rare, complex, and not used by 3D graphics. While some of these blend modes can be simulated with multiple rendering passes, many of these modes are impossible to construct from conventional GPU blending operations.

Our “stencil, then cover” approach makes it possible to implement these blend modes despite their lack of direct GPU hardware support. Normally, GPUs do not reliably support reading-as-a-texture a framebuffer currently being rendered. However a recent OpenGL extension called NV_texture_barrier [Bolz 2009] provides a reliable memory barrier under restrictive conditions. A fragment shader must ensure there is a single read and write for any particular pixel done from that pixel’s fragment shader instance.

The “stencil, then cover” approach provides precisely such a “no double blending” guarantee. So by preceding each cover operation (whether fill or stroke) with an OpenGL glTextureBarrierNV command and reading the pixel’s color value as a fetch to the framebuffer bound as a texture, reliable programmable blending with the fragment shader is possible.

4 Discussion

4.1 Quality

Our system’s rendering quality is directly tied to how many color and stencil samples the framebuffer maintains per pixel. This determines the quality of our antialiasing. Our GeForce GPUs support up to 16 samples per pixel while our Quadro GPUs support 32 and 64 samples per pixel as well.

At 16 samples per pixel, our rendering quality compares quite favorably with CPU-based path renderers. Because our GPUs have 8 bits of sub-pixel precision, irregular coverage sample positions, and our point containment determinations are numerically sound, we are well-justified in stating our quality exceeds what can reasonably be expected for CPU-based path renderers. We focus on two aspects of path rendering quality where our implementation has superior quality.

4.1.1 Stroking Quality

For stroking, our quadratic Bézier stroke discard shader is mathematically consistent with the sweep of an orthogonal pen traversing the path’s trajectory. In Figure 9 we compare our very fast stroking result to alternatives that are generally substantially slower on a difficult cubic Bézier stroke test case. Notice three path rendering implementations get this test case quite wrong, whereas NV_path_rendering matches the OpenVG reference implementation and Direct2D version.

Figure 9: Various path rendering implementations drawing a difficult cubic Bézier curve (the centurion head).

4.1.2 Conflation Avoidance

Conflation is an artifact in path rendering systems that occurs when coverage (a Boolean concept) is conflated with opacity. This generally occurs when sub-pixel coverage is converted to a fractional value and multiplied into the alpha color component for compositing. While this approach is standard practice, it can result in noticeably incorrect colors.

Conflation is particularly noticeable when two opaque paths exactly seam at a shared boundary. Say path A covers 40% of the pixel and an adjacent path B covers the other 60%. But if A is drawn first,


the pixel picks up 40% of A’s color and 60% of the background color. Now when B is drawn, the pixel gets 60% of B’s color and 40% of the combination of 40% of A’s color and 60% of the background color. The result is some fraction of the background color has leaked into the pixel when a more accurate assessment of coverage would have no background color.

Flash content is particularly prone to conflation artifacts because path edges are typically authored for exact sharing of edges. Adobe’s Flash player specifically works to avoid conflation artifacts. This is possible because Flash player has complete knowledge of all the paths in a Flash shape and how those path edges are shared. Exact sharing of edges is helpful from a content creation standpoint because a shared edge can be stored once and used by two paths (more compact) and reduces the overall layered depth complexity of the scene by avoiding overlaps. Because NV_path_rendering maintains distinct sub-pixel color samples, the scene in Figure 10 renders free of conflation artifacts.

Figure 10: Flash scene with shared edges. NV_path_rendering shows no conflation while Direct2D (and Cairo, Skia, Qt, and OpenVG) shows conflation. Upper left corner shows the background clear color; conflation is tinted by this color in the bottom scenes. Notice the conflated blue tint on the girl’s cheek.

4.2 Performance

The rendering performance of NV_path_rendering scales with GPU performance. Because the baked paths reside on the GPU and are resolution-independent, once baked, path rendering performance is decoupled from CPU performance. Figure 2 charts the performance of NV_path_rendering relative to alternatives, including GPU-accelerated alternatives such as Direct2D and Skia’s Ganesh approach.

Our performance advantage is attributable to the overall rendering and shading performance of our underlying GPUs. Several aspects are particularly noteworthy. Our underlying GPUs support a fast stencil culling mode so hundreds of pixels can be culled in a single clock if a coarse grain test can determine the stencil test for all the pixels would fail. This mitigates much of what might otherwise seem very inefficient about the “stencil, then cover” approach. Also stencil processing generally is very well optimized. The 8-bit memory transactions during the stencil and cover steps can often run at memory bus saturating rates. Our OpenGL driver implementation makes use of a configurable front-end processor within the GPU (not otherwise accessible to applications) to transition quickly between the stencil step and cover step and back. This avoids the driver performing expensive revalidations of CPU-managed state so our rendering stays GPU-limited rather than CPU-bottlenecked, even when presented with otherwise overwhelming numbers of small paths.

4.3 New Functionality

Because NV_path_rendering is integrated into the OpenGL pipeline and the coverage information is accessible through the stencil buffer, we are able to implement unconventional algorithms such as mixing path rendering with arbitrary 3D graphics.

Figure 11 demonstrates an example of this capability. No textures are used in this scene. Arbitrary zooming into the tigers’ detail is supported. Notice how the tigers properly occlude each other and the teapot. Due to the perspective 3D view, the path rendering is properly rendered in perspective as well.

Figure 11: Mixing 3D and path rendering in a single window.

5 Future Work

We believe our performance can be further improved. We are investigating hardware improvements to mitigate some of the memory bandwidth expense involved in our underlying stencil-based algorithms. In particular, we are seeking to reduce the GPU memory footprint.

Web browser architecture should change to incorporate GPU-accelerated path rendering. Today web browsers respecify paths every time a web page with path content is re-rendered assuming respecifying paths is cheap relative to the expense of rendering them. When path rendering is fully GPU-accelerated, a retained model of rendering is more appropriate and efficient. We believe web browsers should behave more like video games in this respect to exploit the GPU.

Mobile devices are power constrained so off-loading path rendering to a graphics processor designed for efficient pixel processing makes good sense. Mobile devices in particular prize a low-latency experience for the user so the sooner the device can complete its resolution-independent 2D rendering, the better the user experience and the sooner the device can power down to a low power state.


Acknowledgements

Michael Toksvig corrected Mark’s 3D bigotry and insisted 2D rendering deserved acceleration. Chris Dalton assisted building our test bed. Tero Karras provided crucial math insights. Barthold Lichtenbelt supported this work throughout.

References

ADOBE SYSTEMS. 1985. PostScript Language Reference Manual, 1st ed. Addison-Wesley Longman Publishing Co., Inc.

ADOBE SYSTEMS. 1992. Adobe Type 1 Font Format, 2nd ed. Addison-Wesley Longman Publishing Co., Inc.

ADOBE SYSTEMS. 1993. Display PostScript System–Introduction: Perspective for Software Developers.

ADOBE SYSTEMS. 2008. Document management–Portable document format–Part 1: PDF 1.7. Also published as ISO 32000.

ADOBE SYSTEMS. 2008. SWF File Format Specification, version 10.

AKELEY, K., AND FORAN, J., 1995. Apparatus and method for controlling storage of display information in a computer system. US Patent 5,394,170.

BOLZ, J., 2009. NV_texture_barrier. http://www.opengl.org/registry/specs/NV/texture_barrier.txt

CARDANO, G. 1545. Artis magnae sive de regulis algebraicis, liber unus.

FABRIS, A., SILVA, L., AND FORREST, A. 1997. An efficient filling algorithm for non-simple closed curves using the point containment paradigm. In Proceedings of X Brazilian Symposium on Computer Graphics and Image Processing, 2–9.

FAROUKI, R., AND NEFF, C. 1990. Algebraic properties of plane offset curves. Computer Aided Geometric Design 7, 101–127.

FREESCALE, MULTIMEDIA APPLICATIONS DIVISION. 2010. i.MX35 accelerated 2D graphics: Optimizing 2D graphics with OpenVG and i.MX35, application note, doc. # an3975.

GOSLING, J., ROSENTHAL, D. S. H., AND ARDEN, M. J. 1989. The NeWS book: an introduction to the network/extensible window system. Springer-Verlag.

HUANG, R., AND CHAE, S.-I. 2006. Implementation of an OpenVG rasterizer with configurable anti-aliasing and multi-window scissoring. In Proceedings of the 6th IEEE International Conference on Computer and Information Technology, IEEE Computer Society, CIT ’06, 179.

KERR, K. 2009. Introducing Direct2D. MSDN Magazine (June).

KHRONOS GROUP, 2008. OpenVG specification version 1.1.

KILGARD, M., 2012. NV_path_rendering. http://www.opengl.org/registry/specs/NV/path_rendering.txt

KIM, Y., AND AHN, Y. 2009. Explicit error bound for quadratic spline approximation of cubic spline. Journal of the Korean Society for Industrial and Applied Mathematics 13, 4, 257–265.

KIM, D., CHA, K., AND CHAE, S.-I. 2008. A high-performance OpenVG accelerator with dual-scanline filling rendering. Consumer Electronics, IEEE Transactions on 54, 3 (August), 1303–1311.

KOKOJIMA, Y., SUGITA, K., SAITO, T., AND TAKEMOTO, T. 2006. Resolution independent rendering of deformable vector objects using graphics hardware. In ACM SIGGRAPH 2006 Sketches, SIGGRAPH ’06.

LANE, J. M., MAGEDSON, R., AND RARICK, M. 1983. An algorithm for filling regions on graphics display devices. ACM Trans. Graph. 2, 3 (July), 192–196.

LOOP, C., AND BLINN, J. 2005. Resolution independent curve rendering using programmable graphics hardware. In ACM SIGGRAPH 2005 Papers, SIGGRAPH ’05, 1000–1009.

NEHAB, D., AND HOPPE, H. 2008. Random-access rendering of general vector graphics. In ACM SIGGRAPH Asia 2008 papers, SIGGRAPH Asia ’08, 135:1–135:10.

NEIDER, J., DAVIS, T., AND WOO, M. 1993. OpenGL Programming Guide, 1st edition. See “Drawing Filled, Concave Polygons Using the Stencil Buffer”, 398–399.

NILSSON, P., AND REVEMAN, D. 2004. Glitz: hardware accelerated image compositing using OpenGL. In Proceedings of the FREENIX Track: 2004 USENIX Annual Technical Conference, 29–40.

PACKARD, K., AND WORTH, C. 2003. A realistic 2D drawing system. A rejected SIGGRAPH 2003 paper submission.

PACKARD, K. 2001. Design and implementation of the X Rendering Extension. In Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference, USENIX Association, 213–224.

PORTER, T., AND DUFF, T. 1984. Compositing digital images. In Proceedings of the 11th annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’84, 253–259.

QIN, Z. 2009. Vector Graphics for Real-time Rendering. PhD thesis. University of Waterloo.

RUEDA, A. J., RUIZ DE MIRAS, J., AND FEITO, F. R. 2008. GPU-based rendering of curved polygons using simplicial coverings. Computer Graphics 32, 5 (Oct.), 581–588.

RUF, E. 2011. An inexpensive bounding representation for offsets of quadratic curves. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG ’11, 143–150.

SALMON, G. 1960. A Treatise on Conic Sections. Chelsea New York (reprint).

SVG WORKING GROUP, 2011. Scalable Vector Graphics (SVG) 1.1 (2nd edition).

SVG WORKING GROUP, 2011. SVG compositing specification. W3C working draft March 15, 2011.

WARNOCK, J., AND WYATT, D. K. 1982. A device independent graphics imaging model for use with raster devices. In Proceedings of the 9th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’82, 313–319.