Conceptual Level Architecture
• Hive Components:
– Parser (ANTLR): HiveQL → Abstract Syntax Tree (AST)
– Semantic Analyzer: AST → DAG of MapReduce Tasks
• Logical Plan Generator: AST → operator trees
• Optimizer (logical rewrite): operator trees → operator trees
• Physical Plan Generator: operator trees → MapReduce Tasks
– Execution Libraries:
• Operator implementations, UDF/UDAF/UDTF
• SerDe & ObjectInspector, Metastore
• FileFormat & RecordReader
Hive User/Application Interfaces
• [Flattened diagram] Client entry points sit on top of the Driver: CliDriver (the Hive shell script), HiPal*, HiveServer (ODBC), and JDBCDriver (inheriting Driver).
*HiPal is a Web-based Hive client developed internally at Facebook.
Parser
• ANTLR is a parser generator.
• The grammar file: $HIVE_SRC/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g
• Hive.g defines keywords, tokens, and the translation from HiveQL to the AST (ASTNode.java); the Driver invokes the parser through ParseDriver.
• Every time Hive.g is changed, you need to 'ant clean' first and rebuild using 'ant package'.
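As a loose illustration of what the generated lexer does with keywords and tokens (this is a toy stand-in, not the ANTLR-generated code from Hive.g):

```java
import java.util.*;

// Toy illustration of lexing: split a HiveQL string into tokens,
// normalizing case-insensitive keywords while leaving identifiers
// untouched. The real lexer is generated by ANTLR from Hive.g;
// this class is NOT Hive code.
public class ToyLexer {
    static final Set<String> KEYWORDS =
        new HashSet<>(Arrays.asList("SELECT", "FROM", "WHERE", "GROUP", "BY"));

    public static List<String> tokenize(String hql) {
        List<String> tokens = new ArrayList<>();
        for (String word : hql.trim().split("\\s+")) {
            String up = word.toUpperCase(Locale.ROOT);
            // Keywords are case-insensitive; identifiers keep their spelling.
            tokens.add(KEYWORDS.contains(up) ? up : word);
        }
        return tokens;
    }
}
```

The parser proper then assembles such tokens into ASTNode trees according to the grammar rules in Hive.g.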
Semantic Analyzer
• BaseSemanticAnalyzer is the base class for DDLSemanticAnalyzer and SemanticAnalyzer; the Driver picks one per statement.
– SemanticAnalyzer handles queries, DML, and some DDL (create-table).
– DDLSemanticAnalyzer handles alter table etc.
Logical Plan Generation
• SemanticAnalyzer.analyzeInternal() is the main function.
– doPhase1(): recursively traverses the AST, checks for semantic errors, and gathers metadata, which is put in QB and QBParseInfo.
– getMetaData(): queries the metastore for metadata about the data sources and puts it into QB and QBParseInfo.
– genPlan(): takes the QB/QBParseInfo and the AST and generates an operator tree.
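The three passes can be sketched with a simplified stand-in (method names mirror the slide, but the bodies are illustrative only; the real methods work on ASTNode and the metastore):

```java
import java.util.*;

// Simplified stand-in for SemanticAnalyzer.analyzeInternal()'s three
// passes. NOT Hive code: tables, metadata, and operators are plain
// strings here instead of QB/QBParseInfo and Operator objects.
public class ToyAnalyzer {
    final List<String> tables = new ArrayList<>();          // filled by doPhase1
    final Map<String, String> metadata = new HashMap<>();   // filled by getMetaData

    // Pass 1: walk the (toy) AST and record the source tables.
    void doPhase1(List<String> astTableRefs) {
        tables.addAll(astTableRefs);
    }

    // Pass 2: look up each table in a (toy) metastore catalog;
    // a missing table is a semantic error.
    void getMetaData(Map<String, String> metastore) {
        for (String t : tables) {
            String schema = metastore.get(t);
            if (schema == null) throw new IllegalStateException("unknown table " + t);
            metadata.put(t, schema);
        }
    }

    // Pass 3: emit a root operator per source table (here just a label).
    List<String> genPlan() {
        List<String> ops = new ArrayList<>();
        for (String t : tables) ops.add("TableScan(" + t + ")");
        return ops;
    }
}
```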
Logical Plan Generation (cont.)
• genPlan() is called recursively for each subquery (QB) and outputs the root of the operator tree.
• For each subquery, genPlan() creates operators "bottom-up"*, starting from FROM → WHERE → GROUP BY → ORDER BY → SELECT.
• For the FROM clause, it generates a TableScanOperator for each source table, then calls genLateralView() and genJoinPlan().
*Hive code actually names each leaf operator as "root" and its downstream operators as children.
Logical Plan Generation (cont.)
• genBodyPlan() is then called to handle the WHERE-GROUPBY-ORDERBY-SELECT clauses:
– genFilterPlan() for the WHERE clause
– genGroupByPlanMapAgg1MR/2MR() for map-side partial aggregation
– genGroupByPlan1MR/2MR() for reduce-side aggregation
– genSelectPlan() for the SELECT clause
– genReduceSink() for marking the boundary between map/reduce phases
– genFileSink() to store intermediate results
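Putting the clause order together, a toy version of this stacking (real Hive builds Operator<?> objects; here each stage is just a name, and the helper names in the comments come from the slide above):

```java
import java.util.*;

// Illustrative only: the order in which genBodyPlan() stacks operators
// for "SELECT ... FROM t WHERE ... GROUP BY ...". NOT Hive code.
public class ToyBodyPlan {
    public static List<String> genBodyPlan(boolean hasWhere, boolean hasGroupBy) {
        List<String> plan = new ArrayList<>();
        plan.add("TableScan");                     // FROM clause
        if (hasWhere)   plan.add("Filter");        // genFilterPlan()
        if (hasGroupBy) {
            plan.add("GroupBy(map-side partial)"); // genGroupByPlanMapAgg*()
            plan.add("ReduceSink");                // map/reduce boundary
            plan.add("GroupBy(reduce-side)");      // genGroupByPlan*MR()
        }
        plan.add("Select");                        // genSelectPlan()
        plan.add("FileSink");                      // genFileSink()
        return plan;
    }
}
```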
Optimizer
• The resulting operator tree, along with other parsing info, is stored in ParseContext and passed to the Optimizer.
• The Optimizer is a set of transformation rules on the operator tree.
• The transformation rules are specified by a regexp pattern on the tree and a Worker/Dispatcher framework.
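A minimal sketch of the rule/dispatcher idea, assuming (hypothetically) that a rule is a regex over operator names and a worker rewrites whatever the rule matches; the real framework matches patterns over the path of visited nodes:

```java
import java.util.*;
import java.util.regex.*;

// Toy rule/dispatcher: walk a list of operator names; wherever the
// rule's regex matches, dispatch to the worker to rewrite that node.
// Names and shapes are illustrative, not Hive's actual framework.
public class ToyOptimizer {
    public interface Worker { String process(String op); }

    public static List<String> transform(List<String> ops,
                                         String rulePattern, Worker worker) {
        Pattern rule = Pattern.compile(rulePattern);
        List<String> out = new ArrayList<>();
        for (String op : ops) {
            // Dispatch: apply the worker only to operators the rule matches.
            out.add(rule.matcher(op).matches() ? worker.process(op) : op);
        }
        return out;
    }
}
```

A predicate-pushdown-style rule, for example, would match Filter operators and rewrite their placement; here the "rewrite" is just a relabeling.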
Physical Plan Generation
• genMapRedWorks() takes the QB/QBParseInfo and the operator tree and generates a DAG of MapReduce Tasks.
• The generation is also based on the Worker/Dispatcher framework while traversing the operator tree.
• Different task types: MapRedTask, ConditionalTask, FetchTask, MoveTask, DDLTask, CounterTask.
• validate() on the physical plan is called at the end of Driver.compile().
Preparing Execution
• Driver.execute() takes the output of Driver.compile() and prepares the hadoop command line (in local mode) or calls ExecDriver.execute() (in remote mode):
– Start a session
– Execute PreExecutionHooks
– Create a Runnable for each Task that can be executed in parallel and launch threads within a certain limit
– Monitor thread status and update the session
– Execute PostExecutionHooks
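The bounded parallel-launch step can be sketched as follows. Hive hand-rolls Threads and tracks them in the session; the ExecutorService here is a stand-in for that machinery, not what Hive does internally:

```java
import java.util.*;
import java.util.concurrent.*;

// Toy version of the launch loop: run independent tasks on a pool
// capped at maxParallel threads and block until all finish (the
// "monitor thread status" step). NOT Hive code.
public class ToyLauncher {
    public static List<String> runAll(List<Callable<String>> tasks, int maxParallel) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get());   // wait for each task to finish
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```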
Preparing Execution (cont.)
• Hadoop jobs are started from MapRedTask.execute():
– Get info on all needed JAR files, with ExecDriver as the starting class
– Serialize the physical plan (MapRedTask) to an XML file
– Gather other info such as the Hadoop version and prepare the hadoop command line
– Execute the hadoop command line in a separate process
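The XML serialization uses the JDK's java.beans long-term persistence, which is why descriptor fields need public getters/setters (see the Operator slide). A round-trip sketch with a hypothetical descriptor class:

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// ToyDesc is a hypothetical stand-in for a plan descriptor class.
// XMLEncoder can only persist JavaBeans: public class, public no-arg
// constructor, and a public getter/setter per field.
public class PlanXml {
    public static class ToyDesc {
        private int numReducers;
        public int getNumReducers() { return numReducers; }
        public void setNumReducers(int n) { numReducers = n; }
    }

    public static ToyDesc roundTrip(ToyDesc in) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (XMLEncoder enc = new XMLEncoder(bos)) {
            enc.writeObject(in);                  // serialize the plan to XML
        }
        try (XMLDecoder dec = new XMLDecoder(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (ToyDesc) dec.readObject();    // deserialize on the other side
        }
    }
}
```

A field without a public setter/getter would silently be dropped in the round trip, which is exactly the failure mode the Operator slide warns about.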
Starting Hadoop Jobs
• ExecDriver deserializes the plan from the XML file and calls execute().
• execute() sets the number of reducers, the job scratch dir, the starting mapper class (ExecMapper), the starting reducer class (ExecReducer), and other info in the JobConf, and submits the job through hadoop.mapred.JobClient.
• The query plan is serialized again into a file and put into DistributedCache, to be shipped to mappers/reducers before the job is started.
Operator
• ExecMapper creates a MapOperator as the parent of all root operators in the query plan and starts executing the operator tree.
• Each Operator class comes with a descriptor class, which carries metadata from compilation to execution.
– Any metadata variable that needs to be passed should have a public setter & getter in order for the XML serializer/deserializer to work.
• The operator interface contains:
– initialize(): called once per operator lifetime
– startGroup()*: called once for each group (groupby/join)
– process()*: called once for each input row
– endGroup()*: called once for each group (groupby/join)
– close(): called once per operator lifetime
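A skeleton of this lifecycle, with a toy filter as an example. Real Hive operators extend org.apache.hadoop.hive.ql.exec.Operator and forward rows to child operators; this sketch just collects its output:

```java
import java.util.*;

// Toy mirror of the operator lifecycle described above. NOT Hive's
// Operator class; the method names follow the slide.
public abstract class ToyOperator {
    public void initialize() {}                  // once per operator lifetime
    public void startGroup() {}                  // once per group (groupby/join)
    public abstract void process(Object[] row);  // once per input row
    public void endGroup() {}                    // once per group
    public void close() {}                       // once per operator lifetime
}

// Example subclass: keeps rows whose first column is non-null,
// standing in for an arbitrary WHERE predicate.
class ToyFilterOperator extends ToyOperator {
    final List<Object[]> output = new ArrayList<>();
    @Override public void process(Object[] row) {
        if (row[0] != null) output.add(row);
    }
}
```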
Example: Adding a Semijoin Operator
• Left semijoin is similar to inner join, except that only the left-hand-side table is output, and join values are not duplicated when there are duplicate keys in the RHS table.
– IN/EXISTS semantics
– SELECT * FROM S LEFT SEMI JOIN T ON S.KEY = T.KEY AND S.VALUE = T.VALUE
– Output all columns of table S if its (key, value) matches at least one (key, value) pair in T
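The reference semantics can be pinned down with a toy in-memory version (not Hive's CommonJoinOperator; rows are reduced to their join keys for brevity). The hash-set membership test also plays the role of the early-exit described later: one match suffices, extra RHS duplicates change nothing:

```java
import java.util.*;

// Toy LEFT SEMI JOIN: emit each left row at most once if its key
// appears in the right side, no matter how many duplicates the
// right side holds (IN/EXISTS semantics).
public class ToySemiJoin {
    public static List<String> leftSemiJoin(List<String> left, List<String> right) {
        Set<String> rightKeys = new HashSet<>(right);   // dedups the RHS keys
        List<String> out = new ArrayList<>();
        for (String row : left) {
            if (rightKeys.contains(row)) out.add(row);  // keep LHS row on a match
        }
        return out;
    }
}
```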
Semijoin Implementation
• Parser: add the SEMI keyword.
• SemanticAnalyzer:
– doPhase1(): keep a mapping of the RHS table name and its join key columns in QBParseInfo.
– genJoinTree(): set the new join type in joinDesc.joinCond.
– genJoinPlan():
• Generate a map-side partial groupby operator right after the TableScanOperator for the RHS table. The input & output columns of the groupby operator are the RHS join keys.
Semijoin Implementation (cont.)
• SemanticAnalyzer:
– genJoinOperator(): generate a JoinOperator (left semi type) and set the output fields to the LHS table's fields.
• Execution:
– In CommonJoinOperator, implement left semi join with an early-exit optimization: as soon as the RHS side of the left semi join is non-null, return the row from the LHS table.
Debugging
• Debugging compile-time code (Driver through ExecDriver) is relatively easy, since it runs in the JVM on the client side.
• Debugging execution-time code (after ExecDriver calls hadoop) needs some configuration. See the wiki: http://wiki.apache.org/hadoop/Hive/DeveloperGuide#Debugging_Hive_code