Presntation DataStorm Kick-off

Transcription

Presntation DataStorm Kick-off
Environmental Data Processing and
Communication Infrastructure
DataStorm 2013 Workshop on Large-Scale Data
Management
Paulo Carreira | IST & INESC-­‐ID
Source: GREENPLUM blog
Drivers
Green Economy
Will require integrating faster from data
faster from a wider ranges of data sources
Green Produc:on
Metering & Sensing
Energy-­‐Use Op:miza:on
BPM Op:miza:on User Engagement
Smart Produc
:on
Green Services
Smart Grids
Smart Transpor
ta:on
Green ICT
Green Consump:on
Smart Smart Consum
Facili:es
p:on
Shift towards predictive
operation
past
present
historical data analytics
real-time data analytics
predictive operation
future
Enablers
Internet-of-Things
All objects in the world are identified with minuscule devices
•
We could localize and count everything
Objects can communicate over an ubiquitous infrastructure
•
Coordinating each other to improve our well-being
Internet-of-Things
All objects in the world are identified with
minuscule devices
•
We could localize and count everything
All objects communicate over an ubiquitous
infrastructure
•
Coordinating each other to improve our well-being
Open Source Electronics
After the devastating tsunami in 2011, DYIers in
Japan built their own devices to detect radiation
levels, then posted their finding on the Internet.
Hackerspaces worldwide
DY Geiger Counter
A couple of motivating
examples
Smart Cities Example
Query
“Which groups of users had the smaller
energy foot-print over the last week?”
Smart Cities Example
Smart Buildings Example
Smart
Building
Smart
Building
Building
Aggregators
...
Smart
Grid
Building
Aggregator
Utility
Aggregator
Aggregator
Smart
Building
...
Building
Aggregator
Smart Buildings Example
Query
“Which office rooms will be require more
energy in the two hours?”
Smart Buildings Example
Answer based on a simulation model!
Real-time Information
Processing
Stream Processing Systems
filters
sources
sinks
channels
Applications
• Finance (real-time trading decisions)
• Fraud detection
• Business Intelligence
• Building Automation
• Assisted Living
• Monitoring
• ...
Several Challenges
• Billions of events
• Detecting very complex event conditions
• Minimization of the development effort (to
setup a system that reacts to complex
situations)
• ...
Engineering Stream
Processing Software
• Currently Stream Processing software is (still)
developed ad-hoc
• In the 1970s Database Management
Systems detached applications from data
storage logic
• By the 2000s Data Steam Management
Systems promise to detach applications from
streaming data processing logic
Database Management
Systems
Query
Data is
stored;
user queries
Data Sinks
Queries are
run
persistent
store
Data Stream
Management Systems
Query
user queries
Data Sources
Data Sinks
Queries are
stored;
persistence
Data is run
DSMSs Available
• Research Systems
‣ TelegraphCQ, COUGAR, Borealis, STREAMON
• Commercial
‣ Oracle 11g RT
‣ (IBM DB2, Microsoft SQL-Server soon to
follow)
• Open Source
‣ Esper, S4, ...
DBMS vs DSMS
•
•
DBMS
•
•
•
Queries are run when submitted and terminate
Data is pulled
Optimization occurs upfront
DSMS
•
•
•
Queries are installed and run continuously
Data is pushed
Optimization is dynamic
Streaming Queries:
A short introduction
Preliminaries
• Model
• Abstracted as an append-only relations
with transient tuples
• Streams have continuous and infinite
nature
• The events in the stream arrive online
• Events arrive out-of-order
Challenges
• Unbounded Memory: Queries require
an unbounded amount of memory to
evaluate precisely.
• Approximate query processing
• Sliding window query processing
• High data-flow rate: Data arrives at a
pace (multi-GB) that floods the CPU
• Sampling
• Data synopsis
Basics
• Most Relational Algebra operators extend
naturally to stream processing
• Select, project, join are non-blocking
• Therefore, SQL also extends naturally
• Some new operators need to be introduced
Real-Time Query Block
select col_expr1, ..., col_exprn
from stream_def
where select_cond
group by aggr_expr
having having_cond
output supression_expr
order by ordering_expr
Window
• Batch vs Sliding
• Based on length, based on time (or both)
Windows
Last 5 withdrawals with amount >= 200
select *
from Withdrawal.win:length(5)
where amount >= 200
Withdrawals over the last 5 seconds
select *
from Withdrawal.win:time(5 sec)
Withdrawals at each 5 seconds (batch)
select *
from Withdrawal.win:time_batch(5 sec)
Aggregation and Grouping
Sums of the amounts every 5-seconds
select sum(amount)
from Withdrawal.win:time_batch(5 sec)
Accounts where the average withdrawal amount per account
for the last hour of withdrawals is greater then 1000
select account, avg(amount)
from Withdrawal.win:time(1 hour)
group by account
having amount > 1000
Output control
The ids of temperature sensor events every 10 seconds
select sensorId
from TemperatureSensorEvent
output every 10 seconds
The first of a series temperature drops below 21C a the
specified output rate
select *
from TemperatureSensor where temp<21
output first every 60 seconds
The total price for all orders in the 30- minute time window,
output every minute but only after 30 minutes have passed
select sum(price)
from OrderEvent.win:time(30 min)
output after 30 min snapshot every 1 min
Ordering
Batches of 5 or more stock tick events that are sorted first by
price ascending and then by volume descending:
select symbol
from StockTickEvent.win:time(60 sec)
output every 5 events
order by price, volume desc
Insert and remove streams
What should be result of the query that alerts all accounts
that have more that 2 withdrawals with the last 5 mins (of
1000 euro)?
select accnt_no, count(*) as n_withdrawals
from Withdrawal.win:time(5 min)
where amount > 1000 group by acct_no
having count(*) = 2
query
1 min
<123, 1100>
+
2 min
<123, 2000>
+
<123, 2>
U+
3 min
<123, 3000>
+
<123, 2>
U-
Insert and remove streams
+ + + + + + + + + + + + + + + +
- - - - - - - - - - - - - - - -
• Sources, channels, operators and sinks must handle
positive (insert) and negative (remove) update
streams
• The semantics of all operators must be defined
accordingly.
Insert and remove streams
Select only the events that are leaving the 30 second time
window.
select rstream *
from TempSensor.win:time(30 sec)
Research Challenges/
Opportunities
Ubiquitous integration
• Embedded stream processing
•
•
Even small devices will have to integrate data
from multiple source and in real time to perform
optimally
How can they ship parts of the processing to
peers and to the cloud?
Optimization and
constraint solving
• Answering to optimization queries
•
•
E.g. what is the combination of actuators that
maximize comfort and minimize energy
consumption
This requires incorporating MAX-SAT solving
algorithms into the stream query processors
Querying persistent
and streaming data
• Efficiently answering queries that
join streamed with persisted (big
data)
•
•
Some very recent research (2008- ) on active
data warehousing, new join and update
algorithms
Still unclear how this will scale to big data
containers
Model-Driven Engineering
• Systematic approaches to developing
‘data intensive applications’
•
Creating languages tailored for different users and
domains
•
•
•
Applications to energy/building automation domain
(e.g. energy-awareness built into the language)
Simplifying (already overloaded) stream query
languages
Creating development environments that assist
Open Data Infrastructure
• Enabling open access to data storage
and querying
•
•
•
Standard Common Data Models
Public data infrastructures
Policy making
pjc@inesc-id.pt