Key Design Goals (reminder)
• Keep data off-heap, metadata on-heap: heap < 0.5 GB
• Support reasonable JNI interplay as desired
• Specified, compatible wire-level formats
• Pipelined, vectorized columnar execution
• Nested data and late schema
• Full SQL
Julian Hyde’s work on SQL parser
• GitHub push soon
• Support for basic scan, project and filter
  – Includes sub-queries, scalar function pass-through, nested references and the any data type
• Next up: Group By, Union, Join
Topics
• Configuration
• In-Memory Formats
• Schema Management
• RPC Framework
• Specific RPC Protocols
• Cluster Coordination and Cache
Configuration
• Leverage HOCON for modular configuration
  – JSON++ for configuration: allows composite configuration definitions and looser syntax
• Hierarchical precedence
  – Common module loads the drill-default.conf top-level configuration
  – All other classpath drill-module.conf files are loaded to integrate additional classes
  – drill-override.conf provides user-level properties
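A hypothetical sketch of how the three-layer precedence above plays out; the keys and values are illustrative, not Drill's actual configuration names:

```hocon
# drill-default.conf (shipped in the common module): baseline values
drill.exec {
  rpc.user.port: 31010
  work.max.width: 1000
}

# drill-module.conf (one per module on the classpath): integrates that
# module's classes by appending to shared lists rather than replacing them
drill.logical.function.packages += "org.apache.drill.example.fn"

# drill-override.conf (user-level): highest precedence, overrides defaults
drill.exec.rpc.user.port: 32010
```

Because HOCON merges objects and supports `+=` appends, each module can contribute configuration without clobbering what earlier files defined, while the user override file always wins.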
Schema Management and RecordBatch
• RecordBatch is the relational operator's unit of work
• Targets ~256k in size, designed to fit in a single core's L2 cache
• Internally manages a set of fields
  – Focused on fields required for completion of the query; inference provides some type information
  – Untouched or asterisk fields may be stored in secondary compound inline fields depending on RecordReader implementation
• Each next() call moves the set of records forward
  – Each movement forward reports whether a new schema was found; if so, the consumer should reconfigure based on the updated schema
  – Schema can be expanded from one type to the any type; ultimately it may be able to contract as well (e.g. nullable to non-nullable)
  – A change in an incoming schema does not necessarily modify the outgoing schema
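The next()-driven schema-change contract above can be sketched as the following consumer loop. This is a minimal illustration with invented names (`IterOutcome`, `consume`, the mock upstream), not Drill's actual API:

```java
import java.util.List;

public class RecordBatchSketch {
    // Outcome of advancing a batch stream; NONE means the stream is exhausted.
    enum IterOutcome { OK, OK_NEW_SCHEMA, NONE }

    interface RecordBatch {
        IterOutcome next();         // advance to the next set of records
        List<String> getSchema();   // current field names
        int getRecordCount();
    }

    /** Consumes a batch stream, reconfiguring whenever a new schema appears. */
    static int consume(RecordBatch incoming) {
        int schemaChanges = 0;
        while (true) {
            IterOutcome out = incoming.next();
            if (out == IterOutcome.NONE) break;
            if (out == IterOutcome.OK_NEW_SCHEMA) {
                schemaChanges++;    // rebuild per-field state / vectors here
            }
            // process incoming.getRecordCount() records...
        }
        return schemaChanges;
    }

    /** Mock upstream whose schema expands once mid-stream (e.g. int -> any). */
    static RecordBatch mockBatches() {
        return new RecordBatch() {
            int calls = 0;
            public IterOutcome next() {
                calls++;
                if (calls == 1) return IterOutcome.OK_NEW_SCHEMA; // initial schema
                if (calls == 3) return IterOutcome.OK_NEW_SCHEMA; // schema expanded
                if (calls > 4)  return IterOutcome.NONE;
                return IterOutcome.OK;
            }
            public List<String> getSchema() { return List.of("a", "b"); }
            public int getRecordCount() { return 1000; }
        };
    }

    public static void main(String[] args) {
        System.out.println(consume(mockBatches())); // two schema changes observed
    }
}
```

Note the last bullet in the sketch's terms: an operator that projects only field `a` would swallow an upstream OK_NEW_SCHEMA that changed field `b`, so its own downstream never sees a schema change.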
In-memory Formats
• Values are managed in one of three ValueModes: ValueVector, RLE or Dict
• More concrete than some research work such as C-Store, but also allows for simpler implementation with most of the benefits
• The physical plan describes the ValueMode of particular fields (a field-level physical property)
• Depending on the requirements of a query and on operator capabilities, data can be maintained in a compressed value-based structure
  – Decision occurs at the physical plan level prior to scans (requires format foreknowledge)
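To make the RLE ValueMode concrete, here is a hypothetical run-length encoding of a column into (value, run-length) pairs; long runs collapse dramatically, and an operator such as a filter on the run value can work on runs directly without decompressing. The class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class RleSketch {
    /** Encodes a column into (value, run length) pairs. */
    static List<int[]> rleEncode(int[] column) {
        List<int[]> runs = new ArrayList<>();
        for (int i = 0; i < column.length; ) {
            int j = i;
            while (j < column.length && column[j] == column[i]) j++;
            runs.add(new int[] { column[i], j - i }); // (value, run length)
            i = j;
        }
        return runs;
    }

    public static void main(String[] args) {
        int[] col = { 7, 7, 7, 7, 2, 2, 9 };
        for (int[] run : rleEncode(col))
            System.out.println(run[0] + " x" + run[1]);
    }
}
```

Whether a field is carried as ValueVector, RLE or Dict is decided in the physical plan before the scan runs, which is why the planner needs foreknowledge of the stored format.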
ValueVector
• Primary common representation is ValueVector, a vectorized (array) uncompressed structure
• Off-heap native buffers, manually reference counted and fronted by Netty 4's ByteBuf abstraction
• Supports zero-copy transfer semantics when moving between operators
• Zero data serialization/deserialization allows direct writes to and from sockets along with batch-level metadata
• Ultimately generate a JNI operator stub so that individual operators or groups of operators can live outside the core system
• Designed to leverage shared mmap between StorageEngine record readers and Drillbits to minimize overhead and reduce the need for storage-engine-level pushdown
• Data type variations include required, nullable and repeated
• First few implementations done, such as SInt32, variable-length bytes and nullables
• Repeated will support cross-field references for record-level and repeated-node boundaries
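A minimal sketch of a fixed-width vector for required SInt32 values. Drill fronts manually reference-counted native buffers with Netty 4's ByteBuf; a direct ByteBuffer stands in here so the example is self-contained, but the shape is the same: values live off the Java heap, while only small metadata (the value count) is on-heap. All names are illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SInt32VectorSketch {
    private final ByteBuffer data; // off-heap (direct) backing buffer
    private int valueCount;        // on-heap metadata

    SInt32VectorSketch(int capacity) {
        // 4 bytes per value; fixed width makes access a simple offset multiply
        data = ByteBuffer.allocateDirect(capacity * 4).order(ByteOrder.LITTLE_ENDIAN);
    }

    void set(int index, int value) {
        data.putInt(index * 4, value);
        valueCount = Math.max(valueCount, index + 1);
    }

    int get(int index) { return data.getInt(index * 4); }

    int getValueCount() { return valueCount; }

    public static void main(String[] args) {
        SInt32VectorSketch v = new SInt32VectorSketch(1024);
        for (int i = 0; i < 4; i++) v.set(i, i * 10);
        System.out.println(v.get(3)); // 30
    }
}
```

Because the values are already a contiguous off-heap buffer, handing the vector to another operator or writing it to a socket is a pointer/refcount transfer plus batch metadata, not a serialization step.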
RPC Framework
• Zero-copy byte buffer transfers wrapped in a protobuf envelope
• Fully symmetric push+pull based protocol
• Top-level envelope uses standard protobuf envelope encoding so that any language can interact: CompleteRpcMessage
  – Composed of three parts: RpcHeader, ProtobufBody, RawBody; RawBody is optional (bytes)
  – For Java, we manually encode/decode the top-level envelope so that we can keep RawBody off-heap
• Fully asynchronous using futures
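The three-part envelope can be illustrated with a simple length-prefixed framing. The real CompleteRpcMessage uses protobuf wire encoding for the envelope; plain 4-byte length prefixes stand in here to show why the optional RawBody can stay in its own (off-heap) buffer while the header and protobuf body are decoded. Names and layout are a sketch, not Drill's wire format:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class RpcEnvelopeSketch {
    /** Frames header, protobuf body and optional raw body as length-prefixed parts. */
    static ByteBuffer encode(byte[] header, byte[] protobufBody, byte[] rawBody) {
        int rawLen = rawBody == null ? 0 : rawBody.length;
        ByteBuffer out = ByteBuffer.allocate(12 + header.length + protobufBody.length + rawLen);
        out.putInt(header.length).put(header);
        out.putInt(protobufBody.length).put(protobufBody);
        out.putInt(rawLen);
        if (rawBody != null) out.put(rawBody); // in the real system: zero-copy slice
        out.flip();
        return out;
    }

    /** Decodes a frame back into { header, protobufBody, rawBody }. */
    static byte[][] decode(ByteBuffer in) {
        byte[] header = new byte[in.getInt()]; in.get(header);
        byte[] proto  = new byte[in.getInt()]; in.get(proto);
        byte[] raw    = new byte[in.getInt()]; in.get(raw);
        return new byte[][] { header, proto, raw };
    }

    public static void main(String[] args) {
        ByteBuffer frame = encode("hdr".getBytes(), "body".getBytes(), new byte[] {1, 2, 3});
        byte[][] parts = decode(frame);
        System.out.println(Arrays.toString(parts[2])); // [1, 2, 3]
    }
}
```

Because the raw body's length is known up front, a Java decoder can slice it off as a reference to the incoming buffer instead of copying it onto the heap, which is the point of hand-coding the envelope on the Java side.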
RPC Protocols: Two Key Types
User to Bit
• UserClient and UserServer
• Supports RunQuery > Handle, RequestResults > QueryResult, CancelQuery > Ack
• Query results can operate in one of three modes: STREAM_FULL, STREAM_FIRST, QUERY_FOR_STATUS
Bit to Bit
• Each Drillbit can interact with all other Drillbits
• Locations are managed via a cluster cache
• Either Drillbit can act as server or client (bi-directional)
• Managed via BitCom, which maintains server sessions and client connections as necessary
• Supports activity such as fragment announcement, send record batch, node progress, cancel fragment
Cluster Coordination and Cache
• Cluster coordination is done through the ClusterCoordinator abstraction
  – Manages node-level service registration, currently singular across both RPC types
  – Leverages Netflix's Curator framework
  – Manages a cache of available Drillbits and associated capabilities per node
  – Used by clients and Drillbits
• DistributedCache implemented through embedded Hazelcast
  – Sets up a distributed topic for queue depth management
  – Will be used for query plan caching and other shared state
  – Expected to be used only by Drillbits, not clients
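A hypothetical shape for the ClusterCoordinator abstraction: Drillbits register an endpoint on startup, and clients and peers read the cached set of available endpoints when assigning work. The real implementation sits on ZooKeeper via Netflix's Curator framework; a concurrent map stands in here so the sketch is self-contained, and all names are invented:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ClusterCoordinatorSketch {
    // nodeId -> service endpoint ("host:port"); the real registry lives in ZooKeeper
    private final Map<String, String> endpoints = new ConcurrentHashMap<>();

    /** Register this Drillbit's service endpoint on startup. */
    void register(String nodeId, String endpoint) { endpoints.put(nodeId, endpoint); }

    /** Remove a Drillbit on shutdown (Curator handles this via ephemeral nodes). */
    void unregister(String nodeId) { endpoints.remove(nodeId); }

    /** Cached view of available Drillbits, consulted for fragment assignment. */
    Set<String> getAvailableBits() { return endpoints.keySet(); }

    public static void main(String[] args) {
        ClusterCoordinatorSketch coord = new ClusterCoordinatorSketch();
        coord.register("bit-1", "node1:31011");
        coord.register("bit-2", "node2:31011");
        System.out.println(coord.getAvailableBits().size()); // 2
    }
}
```

Keeping this behind an abstraction is what lets both clients and Drillbits share one view of cluster membership while the DistributedCache (Hazelcast) stays Drillbit-only.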
Other Discussion
• Timothy: Overview of Supersonic exploration
• David: Ideas around HBase and other work
Where we need help
• Addition of a Values operator to the Reference Interpreter for the SQL parser
• Modify the reference interpreter to avoid modification of existing records
• First-level code reviews
• Physical plan definition and documentation
• More test cases
• TPC-H logical and physical plans
• Simple identity transformer/optimizer (logical > physical)
• Execution fragment format
• First full-execution-level storage operator, potentially using mmap shared memory
• Foreman implementation for query processing management
• Review and evaluation of newer file formats and interaction with in-memory formats
• First POP implementations
• Lots of scalar function vector implementations