SFHUG presentation from February 2, 2016. One of the key values of the Hadoop ecosystem is its flexibility. A myriad of components make up this ecosystem, allowing Hadoop to tackle otherwise intractable problems. However, having so many components imposes a significant integration, implementation, and usability burden. Features that ought to work in all the components often require sizable per-component effort to ensure correctness across the stack.
Lenni Kuff explores RecordService, a new solution to this problem that provides an API to read data from Hadoop storage managers and return it as canonical records. This eliminates the need for each component to support individual file formats, handle security, perform auditing, and implement sophisticated IO scheduling and the other common processing that sits at the bottom of any computation.
Lenni discusses the architecture of the service and the integration work done for MapReduce and Spark. Many existing applications on those frameworks can take advantage of the service with little to no modification. Lenni demonstrates how this provides fine-grained (column-level and row-level) security through Sentry integration and improves performance for existing MapReduce and Spark applications by up to 5×. Lenni concludes by discussing how this architecture can enable significant future improvements to the Hadoop ecosystem.
About the speaker: Lenni Kuff is an engineering manager at Cloudera. Before joining Cloudera, he worked at Microsoft on a number of projects, including the SQL Server storage engine, SQL Azure, and Hadoop on Azure. Lenni graduated from the University of Wisconsin-Madison with degrees in computer science and computer engineering.