SQL Server PDW w. Polybase: What, Why, How Srinath Shankar
Transcription
SQL Server PDW w. Polybase: What, Why, How Srinath Shankar
SQL Server PDW w. Polybase: What, Why, How Srinath Shankar Microsoft Gray Systems Lab Madison, WI graysystemslab.com Today, many organizations have data sets in both a relational DBMS and Hadoop. While systems like Hive and Impala can be used to query HDFS-resident data, they are not capable of answering queries that combine data from the two universes of structured (relational) and unstructured (Hadoop) data. In this talk I will describe Polybase, an extension to the SQL PDW parallel database system that is capable of executing queries against both types of data. Polybase is agnostic on both the type of the Hadoop cluster (Linux or Windows) and whether it is a separate cluster or whether the Hadoop nodes are co-located with the nodes of the PDW appliance. Using PDW’s CTAS (“create table as select”) syntax Polybase provides the ability to move data in parallel between nodes of the Hadoop and PDW clusters. In addition, users can create external tables over HDFS-resident data (all standard HDFS file formats are supported as well as custom file formats as long as RecordReader and RecordWriter implementations are provided) allowing queries to reference data stored in HDFS as if the data had been loaded into a relational table. While some other parallel database systems provide similar capabilities, Polybase pushes the state-ofthe-art further into two significant ways. First, when optimizing a SQL query that references data stored in HDFS, the Polybase query optimizer makes a cost-based decision (using statistics that it has stored on the HDFS file in the PDW catalog) on whether or not it should transform relational operators over HDFSresident data into MapReduce jobs for execution on the Hadoop cluster. Consider for example a simple query with two selections, a join, and an aggregate. If one of the two input tables is stored in HDFS, the Polybase query optimizer will evaluate whether or not to perform the select operator on that table as a Map job on the Hadoop cluster or whether it is more efficient to pull the entire file in PDW and perform the selection using the SQL Server instances running on the PDW appliance. If both input tables are in HDFS, not only will the Polybase QO consider pushing the selection as Map jobs, but it will also consider the benefits of pushing the join and aggregate as well. All of this is totally transparent to the user and is driven by a state-of-the art parallel query optimizer initially developed at the Gray Systems Lab. About the Speaker Srinath Shankar is a Research SDE working for Microsoft at the Jim Gray Systems Lab in Madison, WI. His primary field of work is database management systems, with a focus on parallel and distributed data management. He obtained his PhD at the University of Wisconsin, Madison and his bachelor’s degree at IIT Madras before that. At Microsoft, he helped develop a cost-based query optimizer for SQL Server Parallel DataWarehouse (PDW), a shared-nothing parallel DBMS. He currently works on Polybase, a splitquery execution framework capable of processing both relational data in PDW and unstructured data in Hadoop.