MEMBERSHIP TABLE

Founder (6): AMD, ARM, Imagination Technologies, MediaTek Inc., Qualcomm Inc., Samsung Electronics Co Ltd
Promoter (1): LG Electronics
Contributor (25): Analog Devices Inc., Apical, Broadcom, Canonical Limited, CEVA Inc., Digital Media Professionals, Electronics and Telecommunications Research Institute (ETRI), General Processor, Huawei, Industrial Technology Res. Institute, Marvell International Ltd., Mobica, Oracle, Sonics Inc., Sony Mobile Communications, Swarm64 GmbH, Synopsys, Tensilica Inc., Texas Instruments Inc., Toshiba, VIA Technologies, Vivante Corporation
Supporter (13): Allinea Software Ltd, Arteris Inc., Codeplay Software, Fabric Engine, Kishonti, Lawrence Livermore National Laboratory, Linaro, MultiCoreWare, Oak Ridge National Laboratory, Sandia Corporation, StreamComputing, SUSE LLC, UChicago Argonne LLC (Operator of Argonne National Laboratory)
Academic (17): Institute for Computing Systems Architecture, Missouri University of Science & Technology, National Tsing Hua University, NMAM Institute of Technology, Northeastern University, Rice University, Seoul National University, System Software Lab (National Tsing Hua University), Tampere University of Technology, TEI of Crete, The University of Mississippi, University of North Texas, University of Bologna, University of Bristol Microelectronic Research Group, University of Edinburgh, University of Illinois at Urbana-Champaign Department of Computer Science
INFLECTIONS IN PROCESSOR DESIGN

Single-Core Era
  Enabled by: Moore's Law, voltage scaling
  Constrained by: power, complexity
  Programming: Assembly → C/C++ → Java …

Multi-Core Era
  Enabled by: Moore's Law, SMP architecture
  Constrained by: power, parallel SW, scalability
  Programming: pthreads → OpenMP / TBB …

Heterogeneous Systems Era
  Enabled by: abundant data parallelism, power-efficient GPUs
  Temporarily constrained by: programming models, communication overhead
  Programming: Shader → CUDA → OpenCL → C++ and Java
When hsa_shut_down is invoked, the reference count is decreased by one.
When the reference count drops below one, all resources associated with the runtime instance (queues, signals, topology information, etc.) become invalid, and any attempt to reference them in subsequent API calls results in undefined behavior.
The user may then call hsa_init to initialize the HSA runtime again.
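A minimal C sketch of this reference-counting behavior, assuming the standard hsa.h header from an HSA runtime implementation:

    #include <hsa.h>
    #include <stdio.h>

    int main(void) {
        /* Each successful hsa_init increments the runtime's reference count. */
        if (hsa_init() != HSA_STATUS_SUCCESS) return 1;
        hsa_init();       /* count is now 2 (e.g., a library initializing on its own) */

        hsa_shut_down();  /* count drops to 1; the runtime is still valid */
        hsa_shut_down();  /* count drops to 0; queues, signals, etc. become invalid */

        /* After a full shutdown, the runtime may be initialized again. */
        if (hsa_init() == HSA_STATUS_SUCCESS) {
            puts("runtime re-initialized");
            hsa_shut_down();
        }
        return 0;
    }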
The runtime delivers asynchronous notifications by invoking user-defined callbacks.
For instance, queues are a common source of asynchronous events because the tasks queued by an application are asynchronously consumed by the packet processor. Callbacks are associated with queues when they are created. When the runtime detects an error in a queue, it invokes the callback associated with that queue and passes it an error flag (indicating what happened) and a pointer to the erroneous queue.
The HSA runtime does not implement any default callbacks.
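As an illustration, a callback is attached at queue-creation time through hsa_queue_create; the sketch below assumes the standard HSA runtime C API and omits error handling:

    #include <hsa.h>
    #include <stdint.h>
    #include <stdio.h>

    /* User-defined callback: invoked by the runtime when it detects an
       error in the associated queue. Since there are no default
       callbacks, the application must supply one to be notified. */
    static void queue_error_callback(hsa_status_t status,
                                     hsa_queue_t *queue, void *data) {
        const char *msg = NULL;
        hsa_status_string(status, &msg);
        fprintf(stderr, "error on queue %p: %s\n",
                (void *)queue, msg ? msg : "unknown");
    }

    /* Create a queue on `agent` and associate the callback with it. */
    hsa_status_t create_queue_with_callback(hsa_agent_t agent,
                                            hsa_queue_t **queue) {
        return hsa_queue_create(agent,
                                4096,                  /* size, in packets */
                                HSA_QUEUE_TYPE_SINGLE, /* one producer */
                                queue_error_callback,  /* per-queue callback */
                                NULL,                  /* callback user data */
                                UINT32_MAX, UINT32_MAX,
                                queue);
    }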
An HSA memory node delineates a set of system components (host CPUs and HSA Components) that have "local" access to the memory resources attached to that node's memory controller, with the appropriate HSA-compliant access attributes.
One of the key features of HSA is its ability to share global pointers between the host application and code executing on the HSA component.
This means that an application can pass a pointer to memory allocated on the host directly to a kernel dispatched to a component, without an intermediate copy.
When a buffer created on the host is also accessed by a component, programmers are encouraged to register the corresponding address range beforehand.
Registering memory expresses an intention to access (read or write) the buffer from a component other than the host. It is a performance hint that lets the runtime know ahead of time which buffers components will access.
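A short sketch of this registration hint, assuming hsa_memory_register / hsa_memory_deregister from the HSA runtime C API:

    #include <hsa.h>
    #include <stdlib.h>

    /* Allocate a buffer on the host, then register it to hint that a
       component (e.g., the GPU) will access it. Assumes hsa_init has
       already been called; error handling omitted. */
    void use_registered_buffer(size_t n) {
        float *buf = malloc(n * sizeof *buf);

        /* Hint: this address range will be read/written by a component. */
        hsa_memory_register(buf, n * sizeof *buf);

        /* ... dispatch a kernel that receives `buf` directly, no copy ... */

        hsa_memory_deregister(buf, n * sizeof *buf);
        free(buf);
    }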
A single processor (core) is sequential if "the result of an execution is the same as if the operations had been executed in the order specified by the program."
A multiprocessor is sequentially consistent if "the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program."
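The classic store-buffering litmus test illustrates the definition; a C11 sketch (assuming a platform that provides C11 <threads.h>):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    /* Store-buffering litmus test. Under sequential consistency the
       outcome r1 == 0 && r2 == 0 is impossible: in any interleaving of
       the four operations, at least one store precedes the opposing
       load. Weaker memory models can produce 0 0. */
    atomic_int x, y;
    int r1, r2;

    int t0(void *arg) { (void)arg; atomic_store(&x, 1); r1 = atomic_load(&y); return 0; }
    int t1(void *arg) { (void)arg; atomic_store(&y, 1); r2 = atomic_load(&x); return 0; }

    int main(void) {
        thrd_t a, b;
        thrd_create(&a, t0, NULL);
        thrd_create(&b, t1, NULL);
        thrd_join(a, NULL);
        thrd_join(b, NULL);
        /* C11 atomics default to seq_cst, so "0 0" never appears here. */
        printf("%d %d\n", r1, r2);
        return 0;
    }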
REGULAR GPGPU WORKLOADS

Define Problem Space → Partition Hierarchically → Communicate Locally (N times) → Communicate Globally (M times)

Generally, HSA works well with regular data-parallel workloads:
well-defined (regular) data partitioning + well-defined (regular) synchronization pattern = a regular workload.
OPENCL HAS MEMORY MODELS TOO: MAPPING ONTO HSA'S MEMORY MODEL
OPENCL 1.X MEMORY MODEL MAPPING
It is straightforward to provide a mapping from OpenCL 1.x to the proposed model:

OpenCL Operation   HSA Memory Model Operation
Atomic load        ld_global_wg / ld_group_wg
Atomic store       atomic_st_global_wg / atomic_st_group_wg
atomic_op          atomic_op_global_comp / atomic_op_group_wg
barrier(…)         fence ; barrier_wg
OpenCL 1.x atomics are unordered and so map to atomic_op_X
Assume a function: function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);
which, given a deque (passed in %queue):
- returns (%k) the position of the left-most RN
  - atomic_ld_global_scacq is used to read nodes from the array
  - makes one if necessary (i.e., if there are only LN or DN); atomic_cas_global_scar is required to make the new RN
- returns (%left) the left node (i.e., the value to the left of the left-most RN position)
- Atomically add the number of packets to reserve to writeIndex
- The producer must wait until packetID < readIndex + size before writing to the packet
- The queue can be sized so that this wait is unlikely (or impossible)
- Suitable when many threads use one queue
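A minimal C11 sketch of this reservation protocol, using a hypothetical queue layout (the field names below are illustrative, not the HSA queue structure):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical user-mode queue for illustration. */
    typedef struct {
        _Atomic uint64_t writeIndex; /* next packet ID to hand out */
        _Atomic uint64_t readIndex;  /* first packet not yet consumed */
        uint64_t size;               /* capacity, in packets */
    } queue_t;

    /* Reserve one packet ID. Many producer threads may call this
       concurrently; the atomic add gives each caller a unique ID. */
    static uint64_t reserve_packet(queue_t *q) {
        uint64_t id = atomic_fetch_add(&q->writeIndex, 1);

        /* Wait until the slot is free: the packet processor advances
           readIndex as it consumes packets. A generously sized queue
           makes this loop rarely (or never) spin. */
        while (id >= atomic_load(&q->readIndex) + q->size)
            ; /* spin; a real implementation might yield or back off */

        return id; /* caller writes slot id % size, then rings the doorbell */
    }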
AGENT DISPATCH PACKET

Start Offset (Bytes)  Format    Field Name  Description
0                     uint16_t  header      Packet header
2                     uint16_t  type        The function to be performed by the destination Agent. The type value is split into the following ranges:
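The packet corresponds to a C structure along these lines (a sketch based on the HSA 1.0 runtime header; consult hsa.h for the normative layout):

    #include <stdint.h>
    #include <hsa.h>

    /* Sketch of the agent dispatch packet layout (64 bytes total). */
    typedef struct agent_dispatch_packet_sketch_s {
        uint16_t header;                /* packet header (format, barrier bit, fence scopes) */
        uint16_t type;                  /* function the destination Agent performs */
        uint32_t reserved0;
        void *return_address;           /* where the Agent stores the result */
        uint64_t arg[4];                /* function arguments */
        uint64_t reserved2;
        hsa_signal_t completion_signal; /* signaled when execution completes */
    } agent_dispatch_packet_sketch_t;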
Initiated when the launch conditions are met:
- All preceding packets in the queue must have exited the launch phase
- If the barrier bit in the packet header is set, all preceding packets in the queue must have exited the completion phase
Includes a memory acquire fence.
Execute the packet
Barrier packets remain in the active phase until their conditions are met.
The first step is a memory release fence, which makes the packet's results visible.
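On the host side, waiting on the packet's completion signal with acquire semantics pairs with that release fence; a sketch using the HSA signal API:

    #include <hsa.h>
    #include <stdint.h>

    /* Wait for a packet's completion phase: the packet processor
       performs a release fence and then sets the completion signal;
       waiting with acquire semantics makes the packet's results
       visible to this thread. */
    void wait_for_completion(hsa_signal_t completion_signal) {
        while (hsa_signal_wait_acquire(completion_signal,
                                       HSA_SIGNAL_CONDITION_EQ, 0,
                                       UINT64_MAX,
                                       HSA_WAIT_STATE_BLOCKED) != 0)
            ; /* keep waiting until the signal value reaches 0 */
    }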