X265 OPEN SOURCE H.265 ENCODER OPTIMIZATION DETAILS
H.265/HEVC FINALIZED JANUARY 25, 2013
NOTABLE CHANGES FROM H.264
• H.264’s 16x16 macroblocks replaced with 64x64 CUs and QuadTrees
  ‒ Coding QuadTree can be recursively split down to 8x8 blocks
  ‒ At all levels, the coding blocks can choose inter or intra prediction
  ‒ The final coding blocks can be further split
  ‒ The residual is signaled in a second QuadTree which can have more depth than the coding QT
• Inter prediction has more accuracy
  ‒ HPEL filter has 8 taps, QPEL has 7 taps (H.264 has 6-tap HPEL and averaged QPEL)
  ‒ Merge candidates replace the H.264 direct and skip modes
  ‒ AMVP allows motion prediction to be selected from a list; in H.264 it was entirely implicit
4 | PRESENTATION TITLE | NOVEMBER 19, 2013 | CONFIDENTIAL
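
The recursive coding QuadTree described above can be sketched as follows. This is a minimal illustration, not x265 code: the function names and the `split_top_left_only` rule are hypothetical, and a real encoder would decide each split by comparing rate-distortion costs rather than by position.

```python
from typing import Callable, Iterator, Tuple

Block = Tuple[int, int, int]  # (x, y, size) in luma samples


def coding_quadtree(x: int, y: int, size: int,
                    should_split: Callable[[int, int, int], bool],
                    min_size: int = 8) -> Iterator[Block]:
    """Yield the leaf coding blocks of one CU in Z-scan order."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        # Z-scan order: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                yield from coding_quadtree(x + dx, y + dy, half,
                                           should_split, min_size)
    else:
        yield (x, y, size)


def split_top_left_only(x: int, y: int, size: int) -> bool:
    # Hypothetical decision rule, used only to make the example deterministic:
    # keep splitting the block that sits at the CU origin.
    return x == 0 and y == 0


leaves = list(coding_quadtree(0, 0, 64, split_top_left_only))
for leaf in leaves:
    print(leaf)
```

With this rule the 64x64 CU ends up as four 8x8 blocks, three 16x16 blocks, and three 32x32 blocks, visited in the order an encoder would have to analyze them.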
H.265/HEVC FINALIZED JANUARY 25, 2013
NOTABLE CHANGES FROM H.264
• More intra predictions
  ‒ DC and planar modes, similar to H.264
  ‒ 33 angular predictions with emphasis on near-vertical and near-horizontal angles
  ‒ 35 predictions in total (for all block sizes from 32x32 to 4x4) but few special cases
• Sample Adaptive Offset loop filter for reduced compression artifacts
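
The mode count and the emphasis on near-horizontal and near-vertical angles can be checked with a short sketch. The angle values below are the per-mode displacement table (in 1/32-sample units) as given in the HEVC specification for angular modes 2 to 34; the variable names are illustrative, and the table should be verified against the standard before being relied on.

```python
PLANAR, DC = 0, 1  # the two non-angular modes

# Displacement per mode, in 1/32-sample units, for angular modes 2..34
# (assumed to match the HEVC specification's intraPredAngle table).
ANGLES = [32, 26, 21, 17, 13, 9, 5, 2, 0, -2, -5, -9, -13, -17, -21, -26, -32,
          -26, -21, -17, -13, -9, -5, -2, 0, 2, 5, 9, 13, 17, 21, 26, 32]
INTRA_PRED_ANGLE = dict(zip(range(2, 35), ANGLES))

TOTAL_MODES = 2 + len(INTRA_PRED_ANGLE)  # planar + DC + angular

# Gap between neighbouring angular modes: small gaps mean finer angular
# resolution around that direction.
steps = [abs(b - a) for a, b in zip(ANGLES, ANGLES[1:])]

print("total modes:", TOTAL_MODES)
print("pure horizontal / vertical modes:",
      [m for m, a in INTRA_PRED_ANGLE.items() if a == 0])
print("smallest and largest step:", min(steps), max(steps))
```

The steps shrink to 2 around the pure horizontal and vertical modes and grow to 6 toward the diagonals, which is the "emphasis" the slide refers to.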
H.265/HEVC PARALLELIZATION CONSIDERATIONS
NOTABLE CHANGES FROM H.264
• WaveFront Parallel Processing
  ‒ Each row of largest CU blocks can be encoded in parallel, with a two block lag to row above
  ‒ The CABAC state of block 2 is communicated to block 0 of row below
  ‒ <1% loss of compression efficiency, much more efficient than slices or tiles
• Tiles: split each frame into regular rectangular parts, encode each in parallel
• Deblocking only on 8x8 boundaries, and better ordering of operations
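
The wavefront dependency pattern can be simulated in a few lines. This is a sketch under simplifying assumptions, not the x265 scheduler: every block is assumed to take exactly one time step, there are unlimited workers, and `wpp_schedule` is a hypothetical helper name.

```python
import math
from collections import Counter


def wpp_schedule(width: int, height: int, ctu: int = 64, lag: int = 2):
    """Return finish[r][c], the time step at which each block completes."""
    cols = math.ceil(width / ctu)
    rows = math.ceil(height / ctu)
    finish = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = finish[r][c - 1] if c > 0 else 0
            # A block waits until the row above is `lag` blocks ahead
            # (clamped at the end of the row).
            above = finish[r - 1][min(c + lag - 1, cols - 1)] if r > 0 else 0
            finish[r][c] = max(left, above) + 1
    return finish


finish = wpp_schedule(1920, 1080)
serial_steps = len(finish) * len(finish[0])
wavefront_steps = finish[-1][-1]
peak_parallel = max(Counter(t for row in finish for t in row).values())

print(serial_steps, wavefront_steps, peak_parallel)
```

For 1080p with 64x64 blocks this idealized model finishes the 510 blocks in 62 steps with at most 15 rows active at once; real encode times vary per block, so actual gains are lower.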
H.265/HEVC PARALLELIZATION CONSIDERATIONS
THE FINE PRINT
• Larger block sizes reduce the effectiveness of frame parallelism
  ‒ Only a quarter of the available block rows as H.264 for the same resolution video
  ‒ After accounting for deblocking and SAO, there is a three row (192 line) lag between references
  ‒ Wavefront analysis or tiles must be used in conjunction with frame parallelism to make up for this
  ‒ A high percentage of B frames to P frames alleviates this bottleneck
• Large blocks increase serial operations, add longer data dependencies
  ‒ Each CU in the quad-tree must be analyzed in Z-scan order
  ‒ Since each CU can choose intra, all prior blocks must generate recon pixels; no shortcuts
  ‒ Variations in CU encode times reduce the effectiveness of wavefront analysis by causing stalls
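
The row counts behind these figures are simple arithmetic, worked here for 1080p as an illustrative resolution. The helper name `block_rows` is hypothetical; the 64-line row height and three-row lag are taken from the slide.

```python
import math


def block_rows(height: int, block: int) -> int:
    """Number of block rows needed to cover `height` lines."""
    return math.ceil(height / block)


HEIGHT = 1080
h264_rows = block_rows(HEIGHT, 16)   # rows of 16x16 macroblocks
hevc_rows = block_rows(HEIGHT, 64)   # rows of 64x64 CUs

lag_rows = 3
lag_lines = lag_rows * 64            # lines a reference must be ahead
lag_fraction = lag_rows / hevc_rows  # share of the frame's rows that represents

print(h264_rows, hevc_rows, lag_lines, round(lag_fraction, 3))
```

With 68 macroblock rows in H.264 versus 17 CU rows in HEVC, a dependent frame has a quarter as many rows to schedule against, and the three-row lag covers roughly a sixth of the frame before the next frame can begin.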
X265 – A SHORT HISTORY
• x265 Consortium founded in April of 2013
  ‒ Dual commercial and GPLv2+ license
  ‒ Development primarily centered in Chennai, India with contributions from China and US
  ‒ Started from the HEVC reference encoder (HM), less than half of HM source remains today
  ‒ Achieved 1080p 15fps in June
  ‒ Public announcement and first open source release in July
• Optimizations
  ‒ WPP wavefront CTU analysis and frame parallelism
  ‒ Compiler intrinsic SIMD based performance primitives
  ‒ Hand-written assembly performance primitives
  ‒ Data flow improvements, early outs, RDO reductions
• Today
  ‒ 1080p@30fps or 720p@200fps on 16-core SandyBridge Xeon
X265 – A SHORT HISTORY
• Ecosystem
  ‒ Licensed to reuse x264 source code and algorithms
  ‒ Open development on mailing list and IRC
  ‒ Public repositories on Bitbucket and VideoLan.org
  ‒ Integration into VLC, libav, ffmpeg, and Handbrake in various stages of completion
• x264 feature adoption
  ‒ Lookahead / slicetype decision and scene cut detection
  ‒ Motion estimation and bitcost functions
  ‒ CLI interface and public C interface
  ‒ Assembly primitives for SAD, SATD, SSD, etc.
  ‒ ABR and CRF rate control; VBV adoption in progress by O/S contributor
• It took eight years for x264 to dominate the H.264 encoding market
  ‒ We would like to achieve dominance in the HEVC market sooner
Encoding and GPUs
GPU CONSIDERATIONS
A SAD HISTORY
• Historically, GPUs have been poor for video encoding
  ‒ Intra prediction requires blocks above and to the left to be fully encoded and decoded
  ‒ Inter prediction requires blocks above and to the left to be fully analyzed
  ‒ Rate distortion optimizations require all blocks to be encoded in scan order
  ‒ Together, these dependencies severely limit the amount of parallelism that can be exposed to the GPU
• Encoder data dependencies are complex
  ‒ Copying data to and from GPU device memory generally outweighs any performance improvements
  ‒ Even zero copy memory is insufficient; the CPU and GPU must share structures at full speed
• Previous attempts at GPU encoding take shortcuts
  ‒ One can ignore some of these dependencies at the cost of compression efficiency and quality
  ‒ In x264, we only used the GPU for lookahead analysis that has no intra and RDO dependencies
APU CONSIDERATIONS
A WELL BALANCED COMPUTE PROCESSOR
• Heterogeneous architecture
  ‒ GPU compute units can perform high bandwidth operations and highly parallel operations
  ‒ CPU performs necessary serial and logistical operations
  ‒ CPU and GPU can see each other’s memory
• x265 opportunity
  ‒ Via WPP and frame parallelism we can expose two dozen parallel CU blocks to be encoded
  ‒ Each parallel CU block requires recursive analysis
  ‒ Control must transfer between the CPU and GPU many times to complete analysis
  ‒ GPU performs all cost estimates for inter and intra compression, loop filters, and pixel weighting
  ‒ CPU makes QT split and encode decisions, entropy encoding, and dependency tracking
  ‒ Many CUs can be busy on the GPU at once; only four may use the CPU cores at a time
  ‒ Making use of the GPU compute units with minimal CPU overhead is the key
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.