On the heels of the recent Spark Summit, this Q&A with Spark inventor Matei Zaharia, who is also the CTO and co-founder of Databricks (and a professor at MIT), covers the difference between Hadoop MapReduce and Spark, the ingredients of a successful open source project, and the story of how Spark almost helped a friend win a million dollars.
Matei Zaharia won the 2014 Doctoral Dissertation Award for his innovative solution to tackling the surge in data processing workloads, and accommodating the speed and sophistication of complex multi-stage applications and more interactive ad-hoc queries. His work proposed a new architecture for cluster computing systems, achieving best-in-class performance in a variety of workloads while providing a simple programming model that lets users easily and efficiently combine them.
The main difference between Spark SQL and systems like Presto is that Spark SQL is integrated into the full Spark engine, so it can also call more complex non-SQL code written in Spark (e.g. the machine learning library), and likewise you can call it inside a normal Spark program (e.g. run SQL on an RDD of objects you have there). The goal is to enable much richer integration between SQL and complex analytics. As far as I know, none of the other SQL engines for Hadoop are doing that yet.
BTW this paper on Spark SQL talks a bit about the motivation of integrating SQL with the traditional Spark API.
Today the Association for Computing Machinery (ACM) announced that CSAIL researcher Matei Zaharia has won the 2014 Doctoral Dissertation Award for his innovative solutions for tackling the surge in data processing workloads.
Apache Spark has been an integral part of Mesos from its inception. Spark is one of the most widely used big data processing systems for clusters. Matei Zaharia, the CTO of Databricks and creator of Spark, talked about Spark's advanced data analysis power and new features in its upcoming 2.0 release in his keynote.