Windowing Functions on MySQL with InfiniDB
Transcription
Windowing Functions on MySQL with InfiniDB
Dipti Joshi, Data Architect, InfiniDB
Copyright © 2014 InfiniDB. All Rights Reserved.

What is InfiniDB?
- Massively parallel MySQL storage engine for fast analytics
- Linear scale to handle exponential growth
- Open source
- Runs on premise, on AWS cloud, or on a Hadoop HDFS cluster
- Standard ANSI SQL compliance
- First MySQL storage engine to support ANSI SQL-compliant windowing functions

InfiniDB Parallelism
- User Module: processes SQL requests
- Performance Module: executes the queries
- Single server or MPP

Analytic Use Cases: With Traditional MySQL and With InfiniDB

Example: Website Visitor Tracking Database
Website visit table:

id_visit  server_date  visitor_id  visit_total_time
(unique visit identifier, date of visit, visitor identifier, time spent during the visit)
1         02-02-2014   John_A_001  600
2         02-02-2014   Tom_J_002   400
3         03-28-2014   Dan_K_001   200
4         02-03-2014   John_A_001  70
5         03-04-2014   Jane_M_009  340
6         03-05-2014   Jane_M_009  660
7         03-06-2014   John_A_001  800

Goal: find the top visitor of the site every month, based on time spent on the site.

Top Visitor by Time Spent on the Site: Simple!
Summing visit_total_time per visitor over the website visit table gives:

visitor_id  SUM(visit_total_time)
John_A_001  1470
Jane_M_009  1000
Tom_J_002   400
Dan_K_001   200

    SELECT visitor_id, SUM(visit_total_time) total_time
    FROM log_visit
    GROUP BY visitor_id
    ORDER BY 2 DESC
    LIMIT 1;
Top Visitor for Each Month
Starting from the website visit table, two intermediate results are needed.

Totals for each visitor by month:

Month     visitor_id  monthly_visitor_time = SUM(visit_total_time)
February  John_A_001  670
February  Tom_J_002   400
March     Dan_K_001   200
March     Jane_M_009  1000
March     John_A_001  800

MAX of total visitor time for each month:

Month     MAX(monthly_visitor_time)
February  670
March     1000

Top Visitor for Each Month with MySQL: Getting Complicated!

    SELECT t1.visitor_id, t1.visit_month, t2.m_time
    FROM
        -- Totals for each visitor by month
        (SELECT visitor_id,
                MONTH(visit_server_date) visit_month,
                SUM(visit_total_time) total_time
         FROM log_visit
         GROUP BY visitor_id, visit_month) t1
    JOIN
        -- MAX of total visitor time for each month
        (SELECT visit_month,
                MAX(total_time) m_time
         FROM (SELECT visitor_id,
                      MONTH(visit_server_date) visit_month,
                      SUM(visit_total_time) total_time
               FROM log_visit
               GROUP BY visitor_id, visit_month) subq
         GROUP BY visit_month) t2
    ON  t1.total_time = t2.m_time
    AND t1.visit_month = t2.visit_month;
Top 2 Visitors for Each Month
Starting again from the website visit table.

Totals for each visitor by month:

Month     visitor_id  monthly_visitor_time = SUM(visit_total_time)
February  John_A_001  670
February  Tom_J_002   400
March     Dan_K_001   200
March     Jane_M_009  1000
March     John_A_001  800

Top 2 total visitor time for each month:

Month     Top 2 monthly_visitor_time
February  670
February  400
March     1000
March     800

Top N Visitors for Each Month
- With MySQL: getting complicated! Compute the totals for each visitor by month, then a separate top N of total visitor time for January, for February, and so on through December.
- With a windowing function: simple! Compute the totals for each visitor by month, then the rank of each visitor within each month.

Simplified: How Do We Do That?
The windowing function is applied to the totals for each visitor by month (aliased t1 on the slide):

    SELECT visitor_id, total_time, visit_month,
           RANK() OVER (PARTITION BY visit_month
                        ORDER BY t1.total_time DESC) time_rank

Rows are partitioned by month, ordered by total_time within each partition, and ranked:

Month     visitor_id  total_time  time_rank
February  John_A_001  670         1
February  Tom_J_002   400         2
March     Jane_M_009  1000        1
March     John_A_001  800         2
March     Dan_K_001   200         3

- Top 1: time_rank = 1
- Top 2: time_rank <= 2
- Top N: time_rank <= N
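The RANK() approach above can be tried end to end with a small sketch. It is not InfiniDB itself: it assumes Python's bundled sqlite3 module (SQLite 3.25 or later, which supports window functions) as a stand-in, rewrites the slide dates in ISO form so that strftime can extract the month, and uses server_date as the date column name. The helper top_n_per_month is a hypothetical name introduced only for this illustration.

```python
import sqlite3

# Visit rows copied from the slides; dates rewritten as ISO strings (assumption).
visits = [
    (1, "2014-02-02", "John_A_001", 600),
    (2, "2014-02-02", "Tom_J_002", 400),
    (3, "2014-03-28", "Dan_K_001", 200),
    (4, "2014-02-03", "John_A_001", 70),
    (5, "2014-03-04", "Jane_M_009", 340),
    (6, "2014-03-05", "Jane_M_009", 660),
    (7, "2014-03-06", "John_A_001", 800),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_visit (id_visit INTEGER, server_date TEXT,"
    " visitor_id TEXT, visit_total_time INTEGER)"
)
conn.executemany("INSERT INTO log_visit VALUES (?, ?, ?, ?)", visits)

# Innermost query: totals for each visitor by month (t1 on the slide).
# Middle query: rank each visitor within its month.
# Outer query: keep the top N ranks.
TOP_N_SQL = """
SELECT visit_month, visitor_id, total_time, time_rank
FROM (
    SELECT visit_month, visitor_id, total_time,
           RANK() OVER (PARTITION BY visit_month
                        ORDER BY total_time DESC) AS time_rank
    FROM (
        SELECT CAST(strftime('%m', server_date) AS INTEGER) AS visit_month,
               visitor_id,
               SUM(visit_total_time) AS total_time
        FROM log_visit
        GROUP BY visit_month, visitor_id
    ) AS t1
) AS ranked
WHERE time_rank <= ?
ORDER BY visit_month, time_rank
"""


def top_n_per_month(n):
    """Return (month, visitor, total_time, rank) rows with rank <= n."""
    return conn.execute(TOP_N_SQL, (n,)).fetchall()


top_2 = top_n_per_month(2)
for row in top_2:
    print(row)
```

Changing the argument from 2 to any N gives the top N per month without adding another subquery or join, which is the point the slide makes.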
Another Use Case: Running Average
Website sales: items daily revenue table

Item Id  Server_date  Revenue   Running Average
1        02-01-2014   20000.00  20000.00
2        02-01-2014   15000.00  15000.00
3        02-01-2014   17250.00  17250.00
1        02-02-2014   5001.00   12500.50
3        02-03-2014   25010.00  25010.00
3        02-04-2014   21034.00  23022.00
2        02-04-2014   34029.00  34029.00
3        02-05-2014   4120.00   12577.00
2        02-05-2014   7138.00   20583.50

Goal: for each item, for each day, find the average of the revenue for that day and the previous day.

Running Daily Average per Item
- MySQL: a union of two Cartesian join queries (the item's revenue of the previous day and the item's revenue of the current day). Complex!
- Windowing function: simple!

Simplified: How Do We Do That?

    SELECT item_id, server_date, daily_revenue,
           AVG(daily_revenue) OVER (PARTITION BY item_id
                                    ORDER BY server_date
                                    RANGE INTERVAL '1' DAY PRECEDING) running_avg
    FROM web_item_sales

Item Id  Server_date  Revenue   Running Average
1        02-01-2014   20000.00  20000.00
1        02-02-2014   5001.00   12500.50
2        02-01-2014   15000.00  15000.00
2        02-04-2014   34029.00  34029.00
2        02-05-2014   7138.00   20583.50
3        02-01-2014   17250.00  17250.00
3        02-03-2014   25010.00  25010.00
3        02-04-2014   21034.00  23022.00
3        02-05-2014   4120.00   12577.00

Complex Analytics with Traditional MySQL vs InfiniDB
Traditional MySQL:
- Complex sub-query joins
- Unions of Cartesian joins
- Reduced efficiency
InfiniDB windowing functions:
- Easy
- Powerful
- Simplified syntax
- Liberating from complexity

InfiniDB Windowing Functions
Windowing Functions
- Aggregate over a series of related rows
- Simplified functions for complex statistical analytics over a sliding window per row
  - Cumulative, moving, or centered aggregates
  - Simple statistical functions like rank, max, min, average, median
  - More complex functions such as distribution, percentile, lag, lead
  - Without running complex sub-queries or writing stored procedures
- Applications
  - Data warehousing advanced aggregate analytics
  - Business intelligence
  - Mathematical time-series functions

Partition
PARTITION BY expr1, expr2, ..., exprn
- One or more columns or expressions on which rows are grouped for the windowing function calculation
- Each input row belongs to one partition
- Similar to GROUP BY, but each row retains its identity for output
[Diagram: input rows with columns c1, c2, c3, shown next to the GROUP BY output rows and the PARTITION BY output rows]

PARTITION BY ... ORDER BY
ORDER BY
- One or more columns or functions
- Column does not need to be in the projection list
- Rows within the partitions are ordered by the given columns
Frame
- The FRAME for each row is a subset of the PARTITION for the row
  - Range of rows within the partition
  - Range of values within the partition
- The default frame for a row is the entire partition
- The windowing function is calculated by aggregation over the FRAME
- As the row moves, the frame can move

Example: SUM(X) OVER (PARTITION BY Y ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
The frame for each row runs from that row to the last row of its partition.

Row Number  X   Y  Partition     SUM(X) over the row's frame
1           1   1  rows 1 to 4   22
2           4   1  rows 1 to 4   21
3           7   1  rows 1 to 4   17
4           10  1  rows 1 to 4   10
5           2   2  rows 5 to 7   15
6           5   2  rows 5 to 7   13
7           8   2  rows 5 to 7   8
8           3   3  rows 8 to 10  18
9           6   3  rows 8 to 10  15
10          9   3  rows 8 to 10  9

What Is a Windowing Function?
- Provides an aggregated value based on a group of rows: the PARTITION
- Performs a calculation across a set of rows that are somehow related to the current row: the FRAME
- Rows are not grouped into a single output row; the rows retain their separate identities
- Returns multiple rows for each PARTITION

Traditional vs Windowing Aggregation
Traditional aggregate function with GROUP BY:
- Computes aggregates by creating groups
- Output is one row for each group
- Only one way of aggregating for each group
- Only one way of grouping for each SELECT
Windowing aggregate function:
- Computes aggregates via partitions and frames
- Output is one row for each input row
- Different rows in the same partition can have different frames
- The same SELECT can use different partitions for each aggregate function
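The ten-row SUM(X) frame example and the GROUP BY contrast can both be reproduced with a short sketch. It assumes Python's sqlite3 module (SQLite 3.25 or later) in place of InfiniDB, a hypothetical table name t, and an added ORDER BY row_no inside the OVER clause, since the slide does not say how rows are ordered within a partition.

```python
import sqlite3

# (row number, X, Y) exactly as listed on the frame slide.
rows = [
    (1, 1, 1), (2, 4, 1), (3, 7, 1), (4, 10, 1),
    (5, 2, 2), (6, 5, 2), (7, 8, 2),
    (8, 3, 3), (9, 6, 3), (10, 9, 3),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (row_no INTEGER, x INTEGER, y INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# Windowing aggregate: one output row per input row.
# The frame for each row is "this row through the end of its partition".
# ORDER BY row_no is an assumption added to make the frame deterministic.
FRAME_SQL = """
SELECT row_no,
       SUM(x) OVER (PARTITION BY y
                    ORDER BY row_no
                    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sum_x
FROM t
ORDER BY row_no
"""
frame_sums = [sum_x for _, sum_x in conn.execute(FRAME_SQL)]

# Traditional aggregate: one output row per group.
GROUP_SQL = "SELECT y, SUM(x) FROM t GROUP BY y ORDER BY y"
group_sums = conn.execute(GROUP_SQL).fetchall()

print(frame_sums)   # ten values, one per input row
print(group_sums)   # three rows, one per value of y
```

The first row of each partition has the whole partition as its frame, so its windowed sum equals the GROUP BY total for that value of Y; every later row sees a shorter frame.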
Windowing Function Processing Order

    Analytic Function(arg1, ..., argn) OVER ( [PARTITION BY <...>] [ORDER BY <...>] [<FRAME_CLAUSE>] )

The OVER clause indicates a query result set that the function operates on. Processing happens in this order:
- JOIN, WHERE, GROUP BY, and HAVING clauses of the main query
- PARTITIONS created, ordered, and the function applied to each row
- Final ORDER BY of the main query

PARTITION BY
- One or more columns or functions, for example PARTITION BY server_date, item_id or PARTITION BY MONTH(server_date)
- Column does not need to be in the projection list
- If omitted, all the input rows are in one partition

ORDER BY
- One or more columns or functions
- Column does not need to be in the projection list

FRAME
- ROWS [BETWEEN <start> AND <end> | <end>], with bounds:
  - CURRENT ROW
  - UNBOUNDED PRECEDING
  - UNBOUNDED FOLLOWING
  - <number of rows> PRECEDING
  - <number of rows> FOLLOWING
- RANGE [BETWEEN <start> AND <end> | <end>], with bounds:
  - UNBOUNDED PRECEDING
  - UNBOUNDED FOLLOWING
  - <value1> PRECEDING
  - <value2> FOLLOWING

InfiniDB Windowing Functions
- In-database statistical windowing functions
- Distributed computation over distributed data
- Aggregate
  - MAX, MIN, COUNT, SUM, AVG
  - STD, STDDEV_SAMP, STDDEV_POP, VAR_SAMP, VAR_POP
- Ranking
  - ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, NTILE, PERCENTILE, PERCENTILE_CONT, PERCENTILE_DISC, MEDIAN
- FIRST/LAST
  - NTH_VALUE, FIRST_VALUE, LAST_VALUE
- LEAD/LAG
  - LAG, LEAD

EXAMPLES
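As a first example of the frame clause, the sketch below contrasts a ROWS frame with a RANGE frame on the daily revenue data from the running-average use case. It makes several assumptions that are not in the slides: Python's sqlite3 module (SQLite 3.28 or later) stands in for InfiniDB, and because SQLite has no INTERVAL '1' DAY syntax, each date is replaced by an integer day_no (1 for 02-01-2014 through 5 for 02-05-2014) so that RANGE 1 PRECEDING means "one day back".

```python
import sqlite3

# (item_id, day_no, daily_revenue); day_no is the day of February 2014.
# The integer day_no column is an assumption standing in for server_date.
sales = [
    (1, 1, 20000.0), (1, 2, 5001.0),
    (2, 1, 15000.0), (2, 4, 34029.0), (2, 5, 7138.0),
    (3, 1, 17250.0), (3, 3, 25010.0), (3, 4, 21034.0), (3, 5, 4120.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE web_item_sales (item_id INTEGER, day_no INTEGER,"
    " daily_revenue REAL)"
)
conn.executemany("INSERT INTO web_item_sales VALUES (?, ?, ?)", sales)

# avg_rows:  frame = previous physical row + current row.
# avg_range: frame = rows whose day_no lies in [current day - 1, current day].
FRAME_SQL = """
SELECT item_id, day_no,
       AVG(daily_revenue) OVER (PARTITION BY item_id ORDER BY day_no
                                ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
           AS avg_rows,
       AVG(daily_revenue) OVER (PARTITION BY item_id ORDER BY day_no
                                RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)
           AS avg_range
FROM web_item_sales
ORDER BY item_id, day_no
"""
result = conn.execute(FRAME_SQL).fetchall()
for row in result:
    print(row)
```

The two columns differ only where a day is missing. Item 3 has no row for day 2, so on day 3 the ROWS frame still reaches back to day 1, while the RANGE frame contains day 3 alone. The RANGE column is the one that matches "that day and the previous day".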
Moving Aggregate
AVG(expression): average of the expression over the frame of the current row.

Moving centered daily average:

    SELECT item_id, server_date, daily_revenue,
           AVG(daily_revenue) OVER (PARTITION BY item_id
                                    ORDER BY server_date
                                    RANGE BETWEEN INTERVAL '1' DAY PRECEDING
                                              AND INTERVAL '1' DAY FOLLOWING) centered_avg
    FROM web_item_sales

Item Id  Server_date  Revenue  AVG(daily_revenue) OVER(...)
1        02-01-2014   400.00   1200.00
1        02-02-2014   2000.00  1300.00
1        02-03-2014   1500.00  2500.00
1        02-04-2014   4000.00  2750.00
2        02-01-2014   500.00   750.00
2        02-02-2014   1000.00  900.00
2        02-03-2014   1200.00  1400.00
2        02-04-2014   2000.00  1600.00

RANK / DENSE_RANK
- RANK(): ranking of the row within the row's partition, with gaps after ties
- DENSE_RANK(): ranking of the row within the row's partition, with no gaps

    SELECT item_id, server_date, daily_revenue,
           RANK() OVER (PARTITION BY server_date
                        ORDER BY daily_revenue DESC) item_rank,
           DENSE_RANK() OVER (PARTITION BY server_date
                              ORDER BY daily_revenue DESC) item_dense_rank
    FROM web_item_sales

Item Id  Server_date  Revenue   RANK  DENSE_RANK
1        02-01-2014   20000.00  2     2
2        02-01-2014   15000.00  3     3
7        02-01-2014   15000.00  3     3
4        02-01-2014   5001.00   5     4
6        02-01-2014   21034.00  1     1
4        02-02-2014   4120.00   2     2
5        02-02-2014   7138.00   1     1

NTILE(N)
Bucket number of a row when the partition is divided into N buckets: PERCENTILE = NTILE(100), QUARTILE = NTILE(4).

NTILE(3) of an item based on daily revenue:

    SELECT item_id, server_date, daily_revenue,
           NTILE(3) OVER (PARTITION BY server_date
                          ORDER BY daily_revenue) item_ntile
    FROM web_item_sales

Item Id  Server_date  Revenue   NTILE(3)
1        02-01-2014   20000.00  2
2        02-01-2014   15000.00  1
3        02-01-2014   17250.00  2
4        02-01-2014   5001.00   1
5        02-01-2014   25010.00  3
6        02-01-2014   21034.00  3
4        02-02-2014   4120.00   1
5        02-02-2014   7138.00   2
6        02-02-2014   12577.00  3
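The difference between RANK() and DENSE_RANK() only shows up after a tie, which the two items with revenue 15000.00 provide. The sketch below reruns the RANK / DENSE_RANK slide data, assuming Python's sqlite3 module (SQLite 3.25 or later) instead of InfiniDB, ISO-formatted dates, and the alias item_dense_rank for the second column.

```python
import sqlite3

# (item_id, server_date, daily_revenue) from the RANK / DENSE_RANK slide.
sales = [
    (1, "2014-02-01", 20000.00),
    (2, "2014-02-01", 15000.00),
    (7, "2014-02-01", 15000.00),
    (4, "2014-02-01", 5001.00),
    (6, "2014-02-01", 21034.00),
    (4, "2014-02-02", 4120.00),
    (5, "2014-02-02", 7138.00),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE web_item_sales (item_id INTEGER, server_date TEXT,"
    " daily_revenue REAL)"
)
conn.executemany("INSERT INTO web_item_sales VALUES (?, ?, ?)", sales)

# Both functions use the same partition and ordering; only the handling of
# the rank that follows a tie differs.
RANK_SQL = """
SELECT server_date, item_id,
       RANK() OVER (PARTITION BY server_date
                    ORDER BY daily_revenue DESC) AS item_rank,
       DENSE_RANK() OVER (PARTITION BY server_date
                          ORDER BY daily_revenue DESC) AS item_dense_rank
FROM web_item_sales
ORDER BY server_date, item_rank, item_id
"""
ranks = conn.execute(RANK_SQL).fetchall()
for row in ranks:
    print(row)
```

Items 2 and 7 share rank 3 under both functions. The next item (item 4) is ranked 5 by RANK(), which skips 4, and 4 by DENSE_RANK(), which does not.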
NTH_VALUE(expression, n)
The nth value of the expression in the frame.
- FIRST_VALUE = first value in the frame
- LAST_VALUE = last value in the frame

In this example the frame runs from the current row to the end of the partition:

    SELECT item_id, server_date, daily_revenue,
           NTH_VALUE(daily_revenue, 3) OVER (PARTITION BY server_date
                                             ORDER BY daily_revenue
                                             RANGE UNBOUNDED FOLLOWING)
    FROM web_item_sales

Item Id  Server_date  Revenue   NTH_VALUE(daily_revenue, 3)
1        02-01-2014   20000.00  25010.00
2        02-01-2014   15000.00  20000.00
3        02-01-2014   17250.00  21034.00
4        02-01-2014   5001.00   17250.00
5        02-01-2014   25010.00  NULL
6        02-01-2014   21034.00  NULL
4        02-02-2014   4120.00   12577.00
5        02-02-2014   7138.00   NULL
6        02-02-2014   12577.00  NULL

LAG / LEAD
- LAG(expression, offset): value of the expression in the row that is offset rows before the current row in the partition
- LEAD(expression, offset): value of the expression in the row that is offset rows after the current row in the partition

    SELECT item_id, server_date, daily_revenue,
           LEAD(daily_revenue, 1) OVER (PARTITION BY server_date
                                        ORDER BY daily_revenue),
           LAG(daily_revenue, 1) OVER (PARTITION BY server_date
                                       ORDER BY daily_revenue)
    FROM web_item_sales

Item Id  Server_date  Revenue   LEAD(daily_revenue, 1)  LAG(daily_revenue, 1)
1        02-01-2014   20000.00  21034.00                17250.00
2        02-01-2014   15000.00  17250.00                5001.00
3        02-01-2014   17250.00  20000.00                15000.00
4        02-01-2014   5001.00   15000.00                NULL
5        02-01-2014   25010.00  NULL                    21034.00
6        02-01-2014   21034.00  25010.00                20000.00
4        02-02-2014   4120.00   7138.00                 NULL
5        02-02-2014   7138.00   12577.00                4120.00
6        02-02-2014   12577.00  NULL                    7138.00
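The NTH_VALUE, LEAD, and LAG columns above can be checked together in one query. This is a sketch under assumptions rather than InfiniDB output: it uses Python's sqlite3 module (SQLite 3.25 or later), ISO-formatted dates, an offset of 1 for both LEAD and LAG, and spells the NTH_VALUE frame out as RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING because SQLite does not accept the single-bound shorthand. The column aliases are introduced only for readability.

```python
import sqlite3

# (item_id, server_date, daily_revenue) from the NTH_VALUE and LAG/LEAD slides.
sales = [
    (1, "2014-02-01", 20000.00), (2, "2014-02-01", 15000.00),
    (3, "2014-02-01", 17250.00), (4, "2014-02-01", 5001.00),
    (5, "2014-02-01", 25010.00), (6, "2014-02-01", 21034.00),
    (4, "2014-02-02", 4120.00), (5, "2014-02-02", 7138.00),
    (6, "2014-02-02", 12577.00),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE web_item_sales (item_id INTEGER, server_date TEXT,"
    " daily_revenue REAL)"
)
conn.executemany("INSERT INTO web_item_sales VALUES (?, ?, ?)", sales)

# Within each day, rows are ordered by ascending revenue.
# LEAD/LAG look one row ahead/behind in that ordering; NTH_VALUE picks the
# third row of a frame that starts at the current row.
SQL = """
SELECT server_date, item_id,
       NTH_VALUE(daily_revenue, 3) OVER (
           PARTITION BY server_date ORDER BY daily_revenue
           RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS third_value,
       LEAD(daily_revenue, 1) OVER (
           PARTITION BY server_date ORDER BY daily_revenue) AS next_revenue,
       LAG(daily_revenue, 1) OVER (
           PARTITION BY server_date ORDER BY daily_revenue) AS prev_revenue
FROM web_item_sales
ORDER BY server_date, item_id
"""
result = conn.execute(SQL).fetchall()
for row in result:
    print(row)   # SQL NULL is shown as None
```

Rows with fewer than three rows left in their frame get NULL from NTH_VALUE, and the lowest and highest revenue of each day get NULL from LAG and LEAD respectively.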
More Analytics Use Cases
- Ranking
  - Top N or bottom N items by monthly sales revenue
  - Top N or bottom N visitors per month by monthly spending on the site
  - Items in the top N-tile range of monthly sales revenue
- Reporting aggregates
  - Report for each page: total revenue per month that resulted from clicks on the page, average monthly revenue per page, and its percentage contribution towards the site's monthly revenue
- Moving aggregates
  - Running total revenue per item over the previous 7 days
  - Year-to-date revenue per visitor
  - Sliding standard deviation of hourly stock price over the day
- Lead and lag analytics
  - Report for each page: current and previous order
  - Report for each page: how much the page lags behind the best performer by revenue

Summary
- InfiniDB is the first MySQL storage engine to support ANSI SQL-compliant windowing analytic functions
- Windowing analytic functions simplify complex analytics

Thanks!
- More at http://infinidb.co
- Download InfiniDB at http://infinidb.co/download
- Follow us on Twitter: @InfiniDB
- Follow the presenter: @dipti_smg
- Visit our booth

Questions?