Agile Data Warehousing Service User Guide

Transcription

Page #1
Agile Data Warehousing Service User
Guide
Document publish date: 11/01/14
Copyright © GoodData Corporation 2007 - 2014
All Rights Reserved.
GOODDATA CONFIDENTIAL
Page #2
PROPRIETARY AND CONFIDENTIAL INFORMATION. This document may
not be disclosed to any third party, reproduced, modified or distributed without
the prior written permission of GoodData Corporation.
GOODDATA CORPORATION PROVIDES THIS DOCUMENTATION AS-IS
AND WITHOUT WARRANTY, AND TO THE MAXIMUM EXTENT
PERMITTED, GOODDATA CORPORATION DISCLAIMS ALL IMPLIED
WARRANTIES, INCLUDING WITHOUT LIMITATION THE IMPLIED
WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT AND
FITNESS FOR A PARTICULAR PURPOSE.
Page #3
Table of Contents
Table of Contents
3
Getting Started with Data Warehouse
8
Data Warehouse and Vertica
8
Project Hierarchy
9
Key Terminology
9
Data Warehouse Quick Start Guide
12
Prerequisites
12
Creating a Data Warehouse Instance
12
Reviewing Your Data Warehouse Instances
15
Connecting to Data Warehouse from CloudConnect
17
Connecting to Data Warehouse from a SQL Client Tool
18
Deprovision Your Data Warehouse Instance
19
Data Warehouse Management Guide
Managing your Data Warehouse Instances
20
20
Pass Data Warehouse Instance Ownership
20
Data Warehouse Instance Details Page
20
Managing Users and Access Rights
22
Data Warehouse User Roles
22
Adding a User in Data Warehouse
23
Getting a List of Data Warehouse Users
26
Data Warehouse User Details
28
Page #4
Changing a User Role in the Data Warehouse Instance
28
Removing a User from Data Warehouse
29
Resource Limitations
30
Data Warehouse Backups
30
Data Warehouse Developer Guide
31
Data Warehouse System Architecture Overview
31
Data Warehouse and the GoodData Platform Data Flow
31
Data Warehouse Architecture
32
Data Warehouse Technology
34
Column Storage and Compression in Data Warehouse
34
Data Warehouse Logical and Physical Model
35
Intended Usage for Data Warehouse
37
Working with Data Warehouse from CloudConnect
37
Creating a Connection between CloudConnect and Data Warehouse
Loading Data through CloudConnect to Data Warehouse
37
40
Project Parameters for Data Warehouse
41
Creating Tables in Data Warehouse from CloudConnect
41
Loading Data to Data Warehouse Staging Tables through
CloudConnect
43
Merging Data from Data Warehouse Staging Tables to Production
45
Exporting Data from Data Warehouse using CloudConnect
47
Connecting to Data Warehouse from SQL Client Tools
Download the JDBC Driver
49
50
Page #5
Data Warehouse Driver Version
50
Prepare the JDBC connection string
51
Access Data Warehouse From SQuirrel SQL
52
Connecting to Data Warehouse from Java
56
Connecting to Data Warehouse from JRuby
56
Install JRuby
56
Access Data Warehouse using the Sequel library
57
Installing database connectivity to Ruby
58
Installing the Data Warehouse JRuby support
58
Example Ruby code for Data Warehouse
58
Database Schema Design
59
Logical Schema Design - tables and views
60
Primary and Foreign Keys
60
Altering Logical Schema
61
Physical Schema Design - projections
61
Columns Encoding and Compression
62
Columns Sort Order
63
Segmentation
63
Configuring the initial superprojection with CREATE TABLE
command
64
Creating a New Projection with CREATE PROJECTION command
65
Changing Physical Schema
66
Loading Data into Data Warehouse
68
Page #6
Loading Compressed Data
69
Use RFC 4180 Compliant CSV files for upload
70
Error Handling
71
Merging Data Using Staging Tables
72
Statistics Collection
73
Querying Data Warehouse
74
Performance Tips
76
Do Not Overnormalize Your Schema
76
Use Run Length Encoding (RLE)
76
Use the EXPLAIN Keyword
76
Use Monitoring Tables
77
Write Large Data Updates Directly to Disk
79
Avoid Unnecessary UPDATEs
80
General Projection Design Tips
80
Minimize Network Joins
80
Choose Projection Sorting Criteria
84
Limitations and Differences in Data Warehouse from Other RDBMS
86
Single Schema per Data Warehouse Instance
86
No Vertica Admin Access
86
Use COPY FROM LOCAL to Load Data
86
Limited Parameters of the COPY Command
87
Limited Access to System Tables
87
Limited Access to System Functions
91
Page #7
Reserved Entity Names
92
Database Designer Tool not available
92
JDBC Driver Limitations
92
Page #8
Getting Started with Data Warehouse
GoodData Agile Data Warehousing Service (Data Warehouse) is a fully
managed, columnar data warehousing service for the GoodData platform.
l
Agile Data Warehousing Service may be known to some customers by its
former name, Data Storage Service. See A Note about Names.
Data Warehouse is designed for storage of the full history of your business data
and for easy and quick data extracts. Using standard technologies, you can
quickly deliver Data Warehouse data into information marts (such as GoodData
projects) or other information delivery systems.
In this cloud-based service, Data Warehouse instances can be provisioned and
managed through scripts or through the GoodData gray pages.
l
l
l
For more information about managing your instance, see Data
Warehouse Management Guide.
For more information about the GoodData APIs, see GoodData
API Documentation.
The gray pages are a simple web wrapper over the GoodData APIs. For
more information, see https://developer.gooddata.com/article/accessinggray-pages-for-a-project.
Using a provided JDBC driver and SQL queries, a developer may interact with a
created Data Warehouse instance through CloudConnect Designer or a locally
installed SQL client tool.
l
For more information on accessing Data Warehouse, best practices for
schema design, and loading and extracting data, see Data
Warehouse Developer Guide.
A Note about Names:
Some customers may be familiar with Data Warehouse under its former name,
Data Storage Service. This document may contain references to "Data Storage
Service" or "DSS". Most of these references occur in code snippets, which have
not been updated to the new name.
Data Warehouse and Vertica
Page #9
Data Warehouse is running on top of an HP Vertica backend, which enables
developers to use the scalability of Vertica’s distributed massively parallel (MPP)
columnar engine and powerful SQL extensions.
Version in Use: HP Vertica 6.1.
Project Hierarchy
Your projects are organized into dashboards, dashboard tabs, reports, and the
metrics that are contained within those reports. At the lowest level, facts,
attributes, and source data represent the foundational components that are
aggregated to form the metrics displayed in dashboard reports.
Figure: Project Hierarchy
Key Terminology
Term
Definition
Data
Raw records that are loaded into project 1, IBM, $50000,
data sets for use in the project’s data
10/10/2012
model.
Examples
Facts
Individual numerical measurements
attached to each data set in the source
data.
Facts are always numbers and are the
smallest units of data.
Attributes
Descriptors used to break apart metrics
and provide context to report data.
Attributes dictate how metrics are
calculated and represented. Attributes
may be text (e.g. region) or numerical
(e.g. size) data.
Page #10
Opportunity amount
(i.e. $25,000)
Campaign clicks (i.e.
212);
Website views (i.e.
4,508)
(by) month;
(by) store;
(by) employee;
(by) region;
(by) department
Metrics
Aggregations of facts or counts of
distinct attribute values, which are
represented as numbers in reports.
Metrics are defined by customizable
aggregation formulas.
Metrics represent what is being
measured in a report.
Reports
sum of sales;
average salary;
total costs;
count of Opportunity
(attribute)
Visualizations of data that fall into one of A table showing
three categories: tables, charts, and
employee salaries
headline reports.
(metric) broken down
by quarter (attribute)
All reports contain at least one metric
(what is being measured), and often
A line graph showing
contain one or more attributes (dictating revenue (metric)
how that metric is broken down).
generated across each
month in the past year
(attribute)
A bar graph showing
sales figures (metric)
broken down by region
(attribute)
Dashboard
Tabs
The pages in the GoodData Portal in
which reports (either tables or charts)
and other dashboard elements (lines,
embedded content from the web,
widgets, and filters) are displayed.
Page #11
ROI
Funnel/Goals
Dashboard tabs are typically used to
organize reports within a given
dashboard.
Dashboards Groups of one or more dashboard tabs
that contain reports belonging to a
common category of interest.
From Leads to Won
Deals
Projects
Sales Management
A set of dashboards and the users who
have permission to interact with them. A
project also includes the underlying
dashboard, tabs, reports, metrics, and
data models.
(marketing dashboard)
Leads to Cash
Subscription
Management
Projects are often provisioned for use by
an entire team or department. In these
cases, a change made by one user is
visible to all.
Data Set
A collection of related facts and
attributes typically provided from a
single data source.
An Opportunity data
set, containing facts
related to attributes
like Name, Opportunity
Amount, and Stage.
Logical
Data Model
(LDM)
A model of the definition of all facts,
attributes, and datasets in a project, as
well as the relationships between them.
To see an example,
click Model in the
Manage page of the
GoodData Portal.
Page #12
Data Warehouse Quick Start Guide
Prerequisites
To get started using Agile Data Warehousing Service, please verify that you have
the following in place:
1. A GoodData platform account. If you are already accessing the GoodData
Portal, you need these credentials in your Data Warehouse implementation.
2. Data Warehouse-enabled authorization token. Your existing project
authorization token must be enhanced by GoodData Customer Support to
enable the creation and management of your Data Warehouse instances.
3. Application access. This Quick Start Guide covers Java-based graphical
database client tools such as SQuirrel SQL and GoodData’s CloudConnect
Designer application. You can also connect to Data Warehouse
programmatically from Java, JRuby, or other Java-based languages and
platforms.
4. (Optional) GoodData project as a target. If you plan to load the data from
Data Warehouse into the GoodData platform, you must have the
Administrator role for at least one GoodData project. For more information
on GoodData projects, see Project Hierarchy.
Creating a Data Warehouse Instance
Use the steps in this section to initialize a new Data Warehouse instance.
In most environments, only one Data Warehouse instance is needed. An Data
Warehouse instance can receive data from multiple sources and can be used to
populate one or more GoodData projects (datamarts), which deliver the
information to business users.
l
You may find it useful to maintain separate instances for development,
testing, and production uses.
To initialize a new Data Warehouse instance, you must provide the following
information:
l
Name of the Data Warehouse instance
l
Description (optional)
l
Authorization token
Page #13
NOTE: Your project authorization token must be enabled
to create Data Warehouse instances. Please file a request
with GoodData Customer Support.
Steps:
To create your Data Warehouse from the gray pages, please complete the
following steps:
1. Login to the GoodData Portal:
https://secure.gooddata.com
2. If you are logged in, reload the page, which refreshes your session.
3. Navigate to the following URL:
https://secure.gooddata.com/gdc/dss/instances
4. If you have not logged into the GoodData Portal previously, you must enter
your credentials first. Navigate to the above URL after logging in.
5. The gray page for creating an Data Warehouse instance is displayed. Any
previously created Data Warehouse instances are displayed above the
form.
Page #14
Figure: Create an Data Warehouse instance
6. Enter the name, description and project authorization token for your
instance into the form. Click Create.
7. In rare cases, you may receive the following error message. If so, please
refresh the page opened to the GoodData Portal:
This server could not verify that you are authorized to
access the document requested. Either you supplied the wrong
credentials (e.g., bad password), or your browser doesn't
understand how to supply the credentials required.Please see
Authenticating to the GoodData API for details.
8. The task is queued for execution in the platform. You may use the link in the
gray page to query the status of this task. Reload the page until you see the
following:
Page #15
Figure: Completed task of Data Warehouse instance creation
9. Click the link to access your Data Warehouse instance. See Data
Warehouse Instance Details Page.
After you have created Data Warehouse instances, you may access them through
the gray pages:
Page #16
Figure: List of Data Warehouse Instances
Each Data Warehouse record includes basic information, including links to the
Data Warehouse Detail page for the instance and to a page for managing users
who can access the Data Warehouse.
l
l
Page #17
For more information on the Data Warehouse Detail page, see ADS User
Details.
For more information on managing users, see Managing Users and Access
Rights.
Data Warehouse Status:
The status field identifies the current status of the instance:
l
ENABLED - available and ready for read-write operations
l
DELETED - a deleted instance.
l
ERROR - an instance that failed to be created.
Connecting to Data Warehouse from CloudConnect
Using the JDBC driver, you can create integrations between CloudConnect
Designer and your Data Warehouse instance.Then, using CloudConnect
components you can create ETL graphs to load data into your Data Warehouse
instance, transform it within the database using SQL, and extract it from Data
Warehouse for use in your GoodData projects.
l
l
l
CloudConnect Designer is a downloadable application for building ETL
graphs and logical data models for your GoodData projects. For more
information on CloudConnect Designer, see Developer Tools.
To get started building your first GoodData project, see Developer Tools.
If you have already installed it, please upgrade CloudConnect Designer to
the most recent version of CloudConnect Designer. The driver is provided
as part of the upgrade package.
Steps:
1. Create CloudConnect connection for your Data Warehouse instance. See
Creating a Connection between CloudConnect and Data Warehouse.
2. Your connection must include the appropriate JDBC connection string. See
Prepare the JDBC connection string.
Page #18
3. After the connection is established, you can begin loading data into Data
Warehouse from CloudConnect.
4. For your project, you should create project parameters for username,
password, and JDBC string. See Project Parameters for Data Warehouse.
5. You can create tables using a CREATE TABLE statement using the
DBExecute component. See Creating Tables in Data Warehouse from
CloudConnect.
NOTE:Data Warehouse does not support upsert
operations and does not enforce unique row identifiers at
load time.
6. Data is loaded through CloudConnect by using the COPY LOCAL
command in the DBExecute component. See Loading Data to Data
Warehouse Staging Tables through CloudConnect.
7. When data has been loaded into the staging tables, you can perform any
necessary in-database operations. Data can then be merged into your
production tables. See Merging Data from Data Warehouse Staging Tables
to Production.
8. When data is ready to be moved from Data Warehouse to the datamart, you
can extract it using the DBInputTable component and pass the metadata
into a GD Dataset Writer component for storage in your GoodData project.
See Exporting Data from Data Warehouse using CloudConnect.
Connecting to Data Warehouse from a SQL Client Tool
Data Warehouse supports connection from Java based SQL client tools such as
SQuirrel SQL using the GoodData JDBC driver. To connect, please complete the
following steps.
Steps:
1. Download and install the JDBC driver. See Download the JDBC Driver.
2. Prepare the connection string. See Prepare the JDBC connection string.
3. Create the connection from your favorite Java-based SQL tool. Example
documentation is provided for the following:
l
Page #19
See Access Data Warehouse From SQuirrel SQL.
If needed, you can deprovision an Data Warehouse instance.
l
l
When an instance is deleted, it is still listed as one of your Data Warehouse
instances. However, its status is marked as DELETED.
Deleted instances are still visible through the gray pages. However, data
cannot be added to or removed from the instance using the JDBC driver.
Deprovisioning an instance removes it from access
through the JDBC driver. Before you deprovision an
instance, you should review all of the users and projects
that are connected to the instance. All users should be
informed of the change in advance of removing the
instance, so that they can verify that none of their projects
is affected. Also, you should make arrangements so that
any project using the Data Warehouse instance has access
to data through another resource, such as a replacement
Data Warehouse instance.
Deleting an instance cannot be undone.
Steps:
1. Visit your list of instances:
2. Click the instance you wish to remove.
3. Then, click Delete.
4. The status of the instance is changed: DELETED.
Page #20
Data Warehouse Management Guide
This section provides guidance in how to manage your Data Warehouse
instances and the users in those instances through the GoodData gray pages.
l
The gray pages reflect the structure of the underlying GoodData APIs. The
commands you execute through the gray pages can be managed
programmatically through the APIs.
Data Warehouse users are specific to the instances in which you create them.
They are not equivalent to platform users.
Managing your Data Warehouse Instances
Through the gray pages, you can review your Data Warehouse instances and
manage aspects of them.
Topics:
l
Creating an Data Warehouse Instance
l
l
l
l
If the Data Warehouse instance needs to be passed to another user, please add
the user in the instance.
l
See Adding a User in Data Warehouse.
After the user has been added, please contact GoodData Customer Support.
You may access the details of individual Data Warehouse instances through the
self links in the List of Data Warehouse Instances or through direct URI:
Page #21
Figure: Your created Data Warehouse instance
Key URLs:
You may wish to retain the following URLs for later use in the gray pages:
l
l
self - The URL to the Data Warehouse instance is used to construct a
JDBC connection string required to connect to your Data Warehouse
instance using CloudConnect or other Java-based tool.
users - URL to list of users in the Data Warehouse instance
l
Page #22
jdbc - URL to the JDBC access point for the Data Warehouse instance,
which is used internally by the JDBC driver to establish a database
connection. See Connecting to Data Warehouse from a SQL Client Tool.
Managing Users and Access Rights
This section provides details on how to manage Data Warehouse users and their
permissions within your instance.
NOTE: You must have GoodData platform account before you
may be added as a new user to an Data Warehouse instance.
Topics:
l
l
l
Get List of Data Warehouse Users
l
l
Change a User’s Role in the Data Warehouse Instance
l
The following roles may be assigned to Data Warehouse users. General
permissions are listed for each role:
Data Admin role
Role identifier: dataAdmin
This role should be assigned to any Data Warehouse user who needs to use the
instance for loading and processing data.
l
Read all tables or views.
l
Import data into Data Warehouse tables.
Page #23
l
Create, drop or purge Data Warehouse tables.
l
Create other objects such as functions and views in the database.
NOTE: The Data Admin role is sufficient for basic use of the
Data Warehouse instance.
Admin role
Role identifier: admin
This role should be reserved for the user or users who need to have control over
the other users in the Data Warehouse instance.
l
All permissions of the Data Admin role, plus the following:
l
Add user.
l
Remove user. User cannot be the Owner of the instance.
l
l
Change a user’s role. User cannot be the Owner of the instance and cannot
be changed to the Owner role.
Edit the name or description of an Data Warehouse instance.
Data Warehouse instance owner
The user who created the Data Warehouse instance is automatically assigned
ownership of the instance. Ownership is not a formal role in the instance. l
l
The Owner of an Data Warehouse instance cannot be changed via
programming.
In some cases, Data Warehouse instance ownership may be changed. For
more information, please contact GoodData Customer Support.
The Owner is also automatically assigned the Admin role. The Owner has all of
the permissions of the Admin role, as well as the permission to delete the Data
Warehouse instance.
Use the following steps to add a user to the Data Warehouse instance via the
gray pages.
Page #24
NOTE: The account creating the new user must be an Admin for
the Data Warehouse instance.
NOTE: Only users with existing GoodData platform accounts
may be added to the Data Warehouse instance.
NOTE: Users are added silently. No email is delivered to the
user.
Steps:
1. Get a list of users to add. For each user, you must acquire either the user
profile URI or the GoodData platform account identifier.
NOTE: A user’s profile URI is available when the user logs in
through the gray pages. You may also query for the list of users
in a project via API; the returned JSON includes user profile
identifiers for each user in the project.
2. Determine the role to apply to the user. See Data Warehouse User Roles.
3. Acquire the instance identifier for the Data Warehouse instance to which
you wish to add the user. See Prepare the JDBC connection string.
4. Visit the following gray page:
https://secure.gooddata.com/gdc/dss/instances/[DW_ID]
/users
5. The list of users and their roles is displayed. At the bottom of the page,
complete the form:
Page #25
Figure: Add new Data Warehouse user form
6. From the drop-down, select the Data Warehouse user role to assign to the
user. See Data Warehouse User Roles.
7. Enter the Profile URI for the user or the user’s GoodData platform identifier.
Do not enter both.
NOTE: When adding users by GoodData platform login
identifier, you should retrieve and store the profile URI, as
other API endpoints may not permit use of the login
identifier for entry.
8. To add the user, click Add user to the storage.
9. In rare cases, you may receive the following error message. If so, please
refresh the page opened to the GoodData Portal:
This server could not verify that you are authorized to
access the document requested. Either you supplied the
wrong credentials (e.g., bad password), or your browser
doesn't understand how to supply the credentials
required.Please see Authenticating to the GoodData API
for details.
10. The task is queued for execution in the platform. You may use the link in the
gray page to query the status of this task. Reload the page until you see the
following:
Page #26
Figure: User added to Data Warehouse instance
11. The user has been added to the instance.
12. Repeat these steps to add additional users to the Data Warehouse
instance.
Getting a List of Data Warehouse Users
To retrieve the list of users in your Data Warehouse instance, please visit the
following URL:
https://secure.gooddata.com/gdc/dss/instances/[DW_ID]/users
NOTE: Admin users can see all users in the Data Warehouse
instance. More restricted users can retrieve only information on
their own accounts.
Page #27
Figure: List of Users in an Data Warehouse Instance
You may use the form at the bottom of the screen to add users to the Data
Warehouse instance. See Adding a User in Data Warehouse.
Profile identifiers:
profile - This URI provides access to the profile of an user.
In the following profile URI:
/gdc/account/profile/2374a6d5d45cca7a405d6c690
The profile identifier is the long string at the end of the profile URI. For example in
the following URL:
Page #28
2374a6d5d45cca7a405d6c690
To review details of the user or to make changes to the user account, click the self
link.
l
See Data Warehouse User Details.
In the Data Warehouse User Details gray page, you can review the profile of the
selected user of your Data Warehouse instance.
l
In the list of Data Warehouse users in your instance, click the self link for
the user.
Figure: Data Warehouse User Details Page
l
To verify the username of the Data Warehouse user, click the profile
link.
l
See Change a User Role in the Data Warehouse Instance.
l
See Removing a User from Data Warehouse.
Changing a User Role in the Data Warehouse Instance
Page #29
Use the following steps to change the role assigned to a user in your Data
Warehouse instance.
NOTE: The user applying the change must be an Admin in the
instance.
NOTE: The owner of the Data Warehouse instance cannot be
demoted from an Admin role or removed from the instance.
Steps:
1. In the list of Data Warehouse users in your instance, click the self link for the
user. The user details are displayed. See Data Warehouse User Details.
2. To verify the user’s identity, click the profile link.
3. From the role drop-down, select the new user role. See Data
Warehouse User Roles.
4. Click Update role.
5. The user’s role is updated. Verify the value for role.
Please complete the following steps to remove a selected user from your Data
Warehouse instance.
NOTE: The user applying the change must an Admin in the
instance.
NOTE: The owner of the instance cannot be demoted from an
Admin role or removed from the instance.
Steps:
1. In the list of Data Warehouse users in your instance, click the self link for the
user. The user details are displayed. See Data Warehouse User Details.
2. To verify the user’s identity, click the profile link.
Page #30
3. Click Delete.
4. The user is removed from the instance.
Resource Limitations
l
l
Memory limitation: by default, individual customer queries are not allowed to
allocate more than 10 GB of RAM. This limitation may vary depending on
your license. For more information, please contact GoodData Account
Management.
Time limitations: queries running for longer than 2 hours to execute are
terminated.
Data Warehouse Backups
GoodData performs standard backups of the Vertica clusters hosting Data
Warehouse on a daily basis. In the unlikely event of service disruption or other
network-wide failure, data is restored to the previous backup.
NOTE: GoodData does not provide on-demand data recovery.
Users may manage their own backups through third-party tools.
This backup policy matches the backup policy provided by the GoodData
platform.
Page #31
Data Warehouse Developer Guide
Data Warehouse System Architecture Overview
This section outlines Agile Data Warehousing Service integration with the
GoodData BI platform architecture and data flows, as well as the architecture of
Data Warehouse itself.
Data Warehouse and the GoodData Platform Data Flow
Figure: Data Warehouse and Platform Data Flows
Typically, the data flow is the following:
1. Source data may be uploaded by the customer to GoodData’s incoming
data storage via WebDAV, where it is collected by the BI automation layer,
which is typically custom CloudConnect ETL or Ruby scripts. Source data
may also be retrieved directly by the automation layer from the source
system’s API.
2. For more information on project-specific storage, see
https://developer.gooddata.com/article/project-specific-storage.
Page #32
3. ETL graphs may be created and published from CloudConnect Designer.
For more information on CloudConnect Designer, see Developer Tools.
4. Ruby scripts may be built and deployed using the GoodData Ruby SDK.
See http://sdk.gooddata.com/gooddata-ruby/.
NOTE: Developers may access Data Warehouse remotely
with SQL via the JDBC driver provided by GoodData.
Other JDBC drivers cannot be used with Data Warehouse.
For more information, see Download the JDBC Driver.
5. The automation layer imports the data into Data Warehouse and does any
necessary in-database processing using SQL. See Querying Data
Warehouse.
6. At this point, the data is ready for import into a presentation layer. Data is
extracted from Data Warehouse using SQL and is typically imported into a
GoodData project.
7. During extraction, any additional transformations may be performed in the
database using SQL or using CloudConnect Designer before the data is
uploaded to the presentation layer.
8. The data is available through the presentation layer for users.
For a simple example of a data flow implementation in CloudConnect, see
Working with Data Warehouse from CloudConnect.
Data Warehouse Architecture
Internally, Agile Data Warehousing Service consists of a number of shared or
dedicated clusters.
Page #33
Figure: Agile Data Warehousing Service architecture
Each Data Warehouse instance is spread across multiple nodes within either
shared clusters or a dedicated cluster.
NOTE: Data Warehouse instances are isolated; it is not
possible to run a query that references data stored in two
different Data Warehouse instances, even if both instances are
accessible to the same user.
Licensees of the GoodData platform receive an authorization token for creating
projects in the platform. After it has been enhanced, this token may also be used
to create an Data Warehouse instance, and it ensures that your Data Warehouse
instance is created on an appropriate cluster.
NOTE: Your project authorization token must be enabled to
create Data Warehouse instances. Please file a request with
GoodData Customer Support.
l
See Management Guide.
Page #34
Data Warehouse Technology
Agile Data Warehousing Service is based on the HP Vertica database. Each
node in an Data Warehouse cluster runs an instance of the Vertica database.
Data Warehouse supports the SQL:99 standard with several Vertica-specific
extensions.
l
l
For more information on the limitations against the SQL:99 standard or the
Vertica documentation, see Limitations.
For additional details, see Vertica documentation.
NOTE: You need GoodData’s Data Warehouse JDBC Driver to
connect to Data Warehouse. The Vertica JDBC driver cannot be
used to connect to Data Warehouse. See Download the JDBC
Driver.
Column Storage and Compression in Data Warehouse
Unlike standard row-based relational databases, the Data Warehouse stores data
using a columnar storage mechanism:
Figure: Row vs. Column storage
Columnar storage is particularly suitable for improve disk performance when
retrieving complex analytical queries or running in-database business
transformations. Queries can be answered by accessing only the columns
required by the query, which fits well with data warehousing and other readintensive use cases.
Moreover, the data in columns are compressed using various encoding and
compression mechanisms, which further improves the disk I/O. For example, run
length encoding can be applied to the “symbol” column:
Page #35
Figure: Columnar storage enhances disk storage and access
Data Warehouse Logical and Physical Model
Schema design elements such as tables and views are considered a database’s
logical database model. These objects provide information about available data
elements.
However, they do not define how the data is actually stored on the disk or how
they are distributed across the nodes within an Data Warehouse cluster. Those
structures are part of the physical data model.
The structures that define how table columns are organized on the disk and how
the data are distributed in the cluster are called projections:
Figure: Logical model vs. physical model
The following parameters of a physical data representation can be configured
using a projection:
l
l
l
Columns to be included and column encoding (run-length encoding, delta
encoding, etc)
Sequence of the table columns
Segmentation: The rows to keep on the same node and the projections to
be replicated instead of split
Page #36
A single table may have multiple projections to support different query types. A
projection does not have to necessarily include all table columns. For example, in
a large transactional table, you may wish to have a projection sorted by
timestamp to support queries over the most recent data. You may also need a
projection sorted by customer to support a customer segmentation query, which
could be expensive. The table would have two additional projections to support
these use cases.
l
l
A projection is similar to a materialized view in traditional database. Like a
materialized view, a projection stores result sets on disk, instead of recomputing them with each query. As data is added or updated, these results
are automatically refreshed.
For more information on projections, see Physical Schema.
Data Warehouse users create SQL queries against the logical model. The
underlying engine automatically selects the appropriate projections.
l
As a feature of Vertica, Data Warehouse databases lack indexes. In place of
indexes, you use optimized projections to optimize queries.
Each logical table requires a physical model. This model can be described by a
projection for the table that includes all table columns. A projection with all table
columns is called a superprojection.
l
l
The CREATE TABLE command automatically creates a superprojection for
the new table.
Proprietary SQL extensions are available for configuring the parameters of
the default superprojection.
When building your Data Warehouse database, you can start with one
superprojection for each table. You may consider adding additional projections
from HP Vertica to improve performance of slow queries.
l
l
Additional projections can be defined using the CREATE PROJECTION
command.
In addition to specifying the columns in the projection, developers may
specify the compression, sort order, and the distribution of data across the
nodes of the cluster (segmentation) for the projection.
For more information about designing and optimizing the data model:
l
See Database Schema Design.
l
See Performance Tips.
l
Page #37
For additional details, see the Physical Schema section of the Vertica
documentation.
Intended Usage for Data Warehouse
Agile Data Warehousing Service is a relational database designed for data
warehousing use cases. Although Data Warehouse may be accessed from any
JDBC-capable client application and supports a superset of the SQL:99
specification, Data Warehouse is expected to be used as a data warehouse and
persistent staging environment.
NOTE: It is not recommended to use Data Warehouse as an
OLTP database with a large number of parallel queries and data
updates with a very short expected response time.
The CloudConnect Designer installation package includes the GoodData JDBC
driver, which is needed to connect to Agile Data Warehousing Service.
l
l
l
CloudConnect Designer is a downloadable application for building ETL
graphs and logical data models for your GoodData projects. For more
information on CloudConnect Designer, see Developer Tools.
To get started building your first GoodData project, see Developer's Getting
Started Tutorial.
If you have already installed it, please upgrade CloudConnect Designer to
the most recent version of CloudConnect Designer. The driver is provided
as part of the upgrade package.
Creating a Connection between CloudConnect and Data
Warehouse
Please complete the following to create a connection between CloudConnect and
Data Warehouse.
Steps:
Page #38
1. Open your project’s ETL graph or create a new one.
2. To create a new database connection in CloudConnect using the built-in
Data Warehouse driver, secondary-click Connections in the Project
Outline. Then select Connections > Create DB Connection....
NOTE: Do not create your connection from the File menu.
3. Select a <custom> database connection.
4. You may wish to use project parameters for the username and password
and to pass them into the connection at runtime. See Loading Data through
CloudConnect to Data Warehouse.
5. The connection requires a connection string. Remember to insert the
identifier for the Data Warehouse instance (DW_ID) as part of the
connection string. See Prepare the JDBC connection string.
6. Your CloudConnect connection should look similar to the following:
Page #39
Figure: Data Warehouse connection from CloudConnect
7. Click Validate connection to test it.
8. If the connection succeeds, save it.
You have configured your graph in CloudConnect to connect to the specified
Data Warehouse instance. This connection must be referenced in each
component instance that interacts with Data Warehouse.
Reusing the connection:
The connection to Data Warehouse is local to the graph in which you created it. It
must be copied into other graphs to be used in other projects, if project
parameters have been created for it.
l
By default, each database component instance creates a new connection.
Page #40
NOTE: To avoid repeating yourself, the connection settings
should reference CloudConnect variables rather than using
hard-coded constants. See Project Parameters for Data
Warehouse.
NOTE: To create a connection that can be reused by multiple
connections, you may deselect the Thread-safe button in the
Advanced tab of the Connection Settings dialog. However, this
configuration should be avoided unless truly necessary for
operations such as retrieving an auto-incremented key. Do not
turn off the thread safety for connections used by components
expected to issue long-running queries. Transactions running
for more than 2 hours will be terminated.
Loading Data through CloudConnect to Data
Warehouse
Using the DBExecute component, you specify the CloudConnect connection to
use and the COPY LOCAL commands to execute against your Data Warehouse
instance.
l
If you have not done so already, create a connection in CloudConnect
Designer so that the application can interact with Data Warehouse. See
Connecting to Data Warehouse from CloudConnect.
NOTE: You must use the COPY LOCAL command to load data
into Data Warehouse. For more information on the command,
supported parameters, and its Data Warehouse-specific
implementation, see Loading Data into Data Warehouse.
Data Warehouse does not support upsert operations and does not require unique
record identifiers at load time. For this reason, you should utilize staging tables to
load your data and then to perform a merge. See Merge data using staging tables.
Tip: In CloudConnect Designer, you should build the load and
merge operations separately. You can validate that the loading
Page #41
operation has successfully completed before kicking off the
merge operation.
Project Parameters for Data Warehouse
When you are creating CloudConnect projects, you should parameterize values
that may change based on the target GoodData project. For example, if the same
basic ETL process is to be used for multiple projects from multiple source
systems, you should turn access parameters such as username, password, and
access URL into CloudConnect parameters.
In your project, you should define the following project parameters:
Parameter
Description
DSS_USER
The Data Warehouse user identifier to use to connect
DSS_PASSWORD
This parameter can be used at runtime to apply a password
to connect to Data Warehouse.
DSS_JDBC_URL
This parameter should be used to define the JDBC URL to
access the Data Warehouse project.
Creating Tables in Data Warehouse from CloudConnect
Before you load data into staging tables, you must create the tables for staging
and production.
Tip: For organization purposes, it is a recommended practice
that you create your table initialization ETL in a separate graph
in CloudConnect Designer.
In this example, staging tables with the “in_” prefix are created for a dataset called
opportunities.
Steps:
Page #42
1. In the graph, add a DBExecute component. Edit the component.
2. Properties:
1. DB Connection: select the connection you created
2. SQL query: see below.
3. Print statements: true
4. Transaction set: All statements
3. For the SQL query, you must create the staging tables. The example below
creates the table for in_opportunities. Note the use of the in_ prefix for the
staging environment:
CREATE TABLE IF NOT EXISTS in_opportunities (
_oid IDENTITY PRIMARY KEY,
id VARCHAR(32),
name VARCHAR(255) NOT NULL,
created TIMESTAMP NOT NULL,
closed TIMESTAMP,
stage VARCHAR(32) NOT NULL,
is_closed BOOLEAN NOT NULL,
is_won BOOLEAN NOT NULL,
amount DECIMAL(20,2),
last_modified TIMESTAMP
)
Page #43
4. Your DBExecute component should look like the following:
Figure: Creating staging tables
5. Save your graph.
Loading Data to Data Warehouse Staging Tables through
CloudConnect
Using a separate DbExecute component, you can use the COPY LOCAL
command to populate your staging tables with data from a locally referenced file.
l
l
For more information on the COPY LOCAL command, see Loading Data
into Data Warehouse.
For more information on creating the staging tables, see Creating Tables in
Data Warehouse from CloudConnect.
In the following example, you create a DbExecute instance to load the staging
table for in_opportunities from the local file opportunities.csv.
Steps:
Page #44
2. Properties:
3. For the SQL query, you must specify at least two commands in the following
order.
4. Before copying into the table, the TRUNCATE command is used to ensure
the staging table is empty.
5. The COPY LOCAL commands to copy from the local source file
(opportunities.csv in this case) to the staging table you created.
6. Commands are separated by a semicolon. Your SQL might look like the
following:
TRUNCATE in_opportunities;
COPY in_opportunities
(id, name, created, closed, stage, is_closed, is_won,
amount, last_modified)
FROM LOCAL '${DATA_SOURCE_DIR}/opportunities.csv'
SKIP 1
ABORT ON ERROR
Page #45
Figure: Loading staging tables
8. Save your graph.
Merging Data from Data Warehouse Staging Tables to
Production
After data has been staged in Data Warehouse, you can use the following basic
steps to merge into your production environment. In this case, you create a
DBExecute instance to MERGE INTO records from the staging tables.
Steps:
2. Properties:
Page #46
3. For the SQL query, you must specify the MERGE INTO commands to merge
from staging to production:
MERGE INTO opportunities t
USING in_opportunities s
ON s.id = t.id
WHEN MATCHED THEN
UPDATE SET name = s.name,
created = s.created,
closed = s.closed,
stage = s.stage,
is_closed = s.is_closed,
is_won = s.is_won,
amount = s.amount
WHEN NOT MATCHED THEN
INSERT
(id, name, created, closed, stage, is_closed, is_
won, amount)
VALUES
(s.id, s.name, s.created, s.closed, s.stage,
s.is_closed, s.is_won, s.amount)
Figure: Merging into Production
Save your graph.
Page #47
After this graph is executed, you may truncate the staging tables.
Exporting Data from Data Warehouse using CloudConnect
To export data from Data Warehouse using CloudConnect, you deploy the
DBInputTable component to extract data from your production tables. This
component can be connected to a Writer component to store the data in its new
destination. Typically, this component is the GD Dataset Writer component, which
writes the data to a specified GoodData project.
Steps:
1. Add the DBInputTable component. Edit it.
2. Properties:
3. Data policy: Strict is recommended.
4. Print statements: false
3. For the SQL query, you must specify the SELECT command to retrieve the
data from the production table. In the SQL Query Editor, the fields in the
query must be mapped to the output metadata fields for consumption by the
next component in the graph:
Page #48
Figure: SQL Query for exporting data tables
Tip: You should specify manually each field in the table
that you are extracting. If the table schema changes in the
future, then the ETL process continues to function, as long
as the change does not include modifications to the
source fields. Avoid using SELECT *.
4. Click OK.
5. The data that is extracted is mapped to the metadata of the DBInputTable
component.
6. To write to a GoodData project, add a GD Dataset Writer component.
Page #49
7. Create an edge between the two components.
8. In the GD Dataset Writer component, specify the GoodData project
identifier, the target dataset, and the field mappings from DBInputTable
metadata to dataset fields.
9. Save your graph.
Your graph should look like the following:
Figure: Final export graph
Connecting to Data Warehouse from SQL Client Tools
This section describes how to connect to Data Warehouse from your preferred
SQuirrel SQL Java-based client tools.
NOTE: CloudConnect Designer is pre-packaged with the JDBC
driver and automatically receives any updates if the driver is
updated. Downloading and installing it in CloudConnect
Designer is unnecessary. See Working with Data
Warehouse from CloudConnect.
There are many free and commercial Java-based SQL client tools. Feel free to
use your preferred tool, as the set up is consistent.
Steps:
These are the basic steps:
1. Download the Data Warehouse JDBC driver. See Download the JDBC
Driver.
2. Add the Data Warehouse JDBC driver into your tool. For more information,
please consult the production documentation provided with your SQL client
tool.
Page #50
3. Build your Data Warehouse instance’s JDBC connection string. See
Prepare the JDBC connection string.
4. Use the Data Warehouse driver and JDBC connection string to set up a
connection.
5. You may be also asked for the driver class name:
com.gooddata.dss.jdbc.driver.DssDriver
l
See Access Data Warehouse From SQuirrel SQL.
Download the JDBC Driver
Connection to Data Warehouse is supported only by using the JDBC driver
provided by GoodData.
This driver enables Data Warehouse connectivity from:
l
CloudConnect ETL Designer
NOTE: The driver is pre-installed in supporting versions of
CloudConnect Designer.
l
Java-based visual SQL client tools
l
Java programming environment, such as JRuby
NOTE: For third-party SQL client tools, the driver is available at
the following URL:
https://developer.gooddata.com/downloads/dss/
ads-driver.zip
To download the JDBC driver, click Developer Tools.
Data Warehouse Driver Version
If you have having issues with your Data Warehouse connection, you may be
asked by GoodData Customer Support to provide the version number of the
JDBC driver that you are using for Data Warehouse.
Page #51
NOTE: You cannot use a Vertica JDBC driver for Data
Warehouse. You must use the Data Warehouse driver provided
by GoodData.
To acquire the Data Warehouse driver version:
l
l
l
l
An updated version of CloudConnect Designer always contains the latest
version of the Data Warehouse driver. CloudConnect enables automatic
updates of the application.
If you have downloaded the driver from the Developer Portal, you can locate
the driver version through one of the following methods:
When the ZIP file is unzipped, the driver version number is embedded in the
filename.
If you no longer have the ZIP file, the version number can be retrieved
through standard method calls on the Data Warehouse driver. Use:
Driver.getMajorVersion() & Driver.getMinorVersion()
DatabaseMetaData.getDriverMinorVersion() &
DatabaseMetaData.getDriverMajorVersion()
DatabaseMetaData.getDriverVersion()
For more information, please visit GoodData Customer Support.
Prepare the JDBC connection string
Also known as the JDBC URL, the JDBC connection string instructs Java-based
database tools how to connect to a remote database.
For Data Warehouse, the format of the JDBC connection string is the following:
jdbc:dss://secure.gooddata.com/gdc/dss/instances/[DW_ID]
Page #52
To acquire your DW_ID:
1. Review your Data Warehouse instances at the following URL:
2. For the Data Warehouse instance to use, click the self link.
3. Copy the last part of the URL:
Figure: Data Warehouse Instance ID
Suppose your Data Warehouse instance URL was the following:
https://secure.gooddata.com/gdc/dss/instances/
Your JDBC connection string is the following:
jdbc:dss://secure.gooddata.com/gdc/dss/instances/
This connection string must be applied in CloudConnect Designer or the
database tool of your choice.
l
l
SQuirrel SQL is a powerful, open-source JDBC database interface. For more
information, see http://squirrel-sql.sourceforge.net/.
Steps:
Page #53
To connect from SQuirrel SQL to your instance, please complete the following
steps.
1. If you have not done so already, download and install the JDBC driver. See
Download the JDBC Driver.
2. Download and install SQuirrelSQL. See http://squirrelsql.sourceforge.net/#installation.
3. Launch the application. Select File > New Session Properties.
4. To add the JDBC driver, select Drivers > New Driver.
5. Properties:
1. Name: Enter something like GoodData Data Warehouse JDBC.
2. Example URL: Use the following:
jdbc:dss://secure.gooddata.com/gdc/dss/instances/
[Data Warehouse_ID]
3. Website URL: (optional) You may enter:
https://developer.gooddata.com
4. Click the Extra Class Path tab. Click Add. Navigate your local hard
drive to locate the JDBC driver you downloaded.
5. For the Class Name, enter the following value:
com.gooddata.dss.jdbc.driver.DssDriver
Page #54
6. The configuration for your new driver should look like the following:
Figure: New JDBC driver for SQuirreLSQL
7. Click OK.
8. A success message indicates that the driver has been properly installed
and registered with the application.
9. In the left navigation bar, click Drivers. Select the GoodData JDBC driver
from the list.
10. Create a new database alias for the connection. From the menu, select
Aliases > Connect....
Page #55
11. Click the Plus icon.
12. Properties:
1. Name: Suggest GoodData Data Warehouse JDBC.
2. Driver: Select the GoodData JDBC driver that you just created.
3. URL: This value should be modified to be a direct reference to your
Data Warehouse instance. The final value of the URL should be the
internal identifier of the Data Warehouse instance. See Reviewing
Your Data Warehouse Instances.
4. User Name and Password: Specify the GoodData platform account to
use to connect to the instance.
13. Your alias should look like the following:
Figure: SQuirreLSQL alias for Data Warehouse
14. Click Test to validate the connection.
Page #56
15. If the connection works, click OK.
16. You are now able to connect to Data Warehouse.
Connecting to Data Warehouse from Java
You can connect the Data Warehouse from Java using the Data Warehouse
JDBC Driver. See Download the JDBC Driver.
NOTE: Data Warehouse is designed as a service for building
data warehousing solutions, and it is not expected to be used as
an OLTP database. To provide Data Warehouse data to your
end users, you should push it into either a GoodData project to
deliver analytical dashboards or to your operational database,
which should be optimized to be a backend of your user-facing
application code.
For more information on GoodData projects, see Project Hierarchy.
Connecting to Data Warehouse from JRuby
You may use the following set of instructions to connect to Data Warehouse using
Ruby.
Install JRuby
The driver for connecting to Data Warehouse is available only as a JDBC driver
at this time. No native library in Ruby is available. As a result, you must first install
JRuby.
Steps:
1. Install Ruby.
2. The easiest method is to install using the Ruby Version Manager (RVM).
3. If you don’t have the RVM installed, please visit https://rvm.io/rvm/install.
4. After you have RVM installed, run the following command to install the
Java-based implementation of Ruby onto your local machine:
Page #57
$ rvm install jruby
5. To switch to the installed version of JRuby:
$ rvm use jruby
6. Optionally, to make JRuby your default Ruby environment, use the following
command:
$rvm --default use jruby
l
(Optional) To restore the original Ruby system as your default, you
may use:
$rvm use system
7. To verify your installed version of Ruby, execute the following command:
$ ruby -v
8. The output should look like the following:
jruby 1.7.9 (1.9.3p392) 2013-12-06 87b108a on Java HotSpot
(TM) 64-Bit Server VM 1.6.0_65-b14-462-11M4609 [darwin-x86_
64]
Access Data Warehouse using the Sequel library
The Sequel library provides relatively low-level access with nice abstractions and
a friendly programming interface, active development, and JDBC support. It has
been tested for use with JRuby for purposes of integrating with Data Warehouse.
l
l
l
Page #58
For more information on the Sequel library, see
http://sequel.jeremyevans.net/.
For quick start documentation, see
http://sequel.jeremyevans.net/rdoc/files/doc/cheat_sheet_rdoc.html.
For a complete reference guide, see http://sequel.jeremyevans.net/rdoc/.
Installing database connectivity to Ruby
Use the following command:
$ gem install sequel
The Ruby database abstraction layer (Sequel) is installed.
Installing the Data Warehouse JRuby support
To install Data Warehouse support for JRuby, you must clone a Git repository and
perform the following installation. Please execute the following steps in the order
listed below.
Steps:
$ git clone https://github.com/gooddata/gooddata-dss-ruby
$ cd jdbc-dss
$ rvm use jruby
$ rake install
Example Ruby code for Data Warehouse
The following Ruby script provides a simple example for how to connecting to
Data Warehouse using JRuby and then to execute a simple query using Sequel:
#!/usr/bin/env ruby
require 'rubygems'
require 'sequel'
require 'jdbc/dss'
Jdbc::Data Warehouse.load_driver
Page #59
Java.com.gooddata.dss.jdbc.driver.DssDriver
# replace with your Data Warehouse instance:
dss_jdbc_url =
'jdbc:dss://secure.gooddata.com/gdc/dss/instances/[DW_ID]'
# replace with your GoodData platform login name:
username = 'joe.user@example.com'
# replace with your GoodData platform password:
password = 'MyPassword'
# example query
Sequel.connect dss_jdbc_url, :username => username, :password
=> password do |conn|
conn.run "CREATE TABLE IF NOT EXISTS my_first_table (id
INT, value VARCHAR(255))"
conn.run "INSERT INTO my_first_table (id, value) VALUES (1,
'one')"
conn.run "INSERT INTO my_first_table (id, value) VALUES (2,
'two')"
conn.fetch "SELECT * FROM my_first_table WHERE id < ?", 3
do |row|
puts row
end
end
NOTE: Data Warehouse is designed as a service for building
data warehousing solutions and it is not expected to be used as
an OLTP database. If you are looking for a way of exposing the
data in Data Warehouse to your end users, consider pushing
necessary data from Data Warehouse into either a GoodData
project to deliver analytical dashboards or to your operational
database that is optimized to be a backend of your end user
facing application code.
Database Schema Design
Data Warehouse makes a clean distinction between the logical data model,
which defines the tables and columns, and the physical data model, which
identifies how the data is organized using the columnar storage and distributed
across the cluster nodes.
Page #60
Figure: Agile Data Warehousing Service architecture
l
For more information on differences between the LDM and the PDM, see
Data Warehouse Logical and Physical Model.
Logical Schema Design - tables and views
The logical database schema can be created with a standard CREATE TABLE
command. Similarly, views can be created using the CREATE VIEW command.
An Data Warehouse view is just a persisted SELECT statement. There is no
significant performance difference between querying a derived result set inlined
as a sub-select versus persisted as view.
l
Materialized views are not supported.
Primary and Foreign Keys
Data Warehouse does not enforce the uniqueness of primary keys. However, a
non-unique value in a primary key column causes errors in the following
situations:
l
l
Page #61
During the load, if data is loaded into a table that has a pre-joined projection
In join queries at query time, if there is not exactly one dimension row that
matches each foreign key value.
Tip: To ensure the uniqueness of your primary keys, use
staging tables and the MERGE command (see Merging Data
Using Staging Tables). If you want to store a version history in
your table, the identifier of the source entity should be neither
declared as a PRIMARY KEY nor referenced by a FOREIGN
KEY column.
Similarly, the referential integrity declared by a foreign key constraint is not
enforced during the data load, unless there is a pre-join projection. However, it
may result in a constraint validation error later if a join query is processed or a
new pre-join projection is created.
Altering Logical Schema
Data Warehouse supports table modification via the standard ALTER TABLE
command.
The underlying columnar storage enables adding or removing columns very
quickly, even for very large tables.
NOTE: New columns are not automatically propagated to
associated views, even if a view is created with the wildcard (*).
To add new columns to a view, the view must be recreated
using the CREATE OR REPLACE VIEW command.
Limitations:
l
Maximum table columns: 1600 columns
l
You cannot add columns to a temporary table
Physical Schema Design - projections
A projection defines how the records specified by logical tables are actually
stored and distributed across the cluster nodes.
Page #62
The following parameters of a physical data representation can be configured
using a projection:
l
l
l
Columns to be included and column encoding (run-length encoding, delta
encoding, etc)
Column ordering
Segmentation: The rows to keep on the same node and the projections to
be replicated instead of split
NOTE: Having multiple projections for the same table impedes
data updates. You should retain only the necessary projections
in your production design.
When a new table is created using the CREATE TABLE command, a new
superprojection is created automatically.
l
For more information on the CREATE TABLE extensions that can configure
the initial superprojection, see Configuring the Initial Superprojection with
CREATE TABLE command.
Additional projections can be created using the CREATE PROJECTION
command.
l
l
See Creating a New Projection with CREATE PROJECTION command.
For more information on replacing existing projections, see Changing
Physical Schema.
Columns Encoding and Compression
Data Warehouse column storage space can be reduced by applying encoding
and compression techniques. Available encoding methods include:
l
run-length encoding (sorted repeating values replaced with the value and a
number of occurrences)
l
dictionary
l
various delta encodings
Page #63
By default, the AUTO encoding is used on column values. This method applies
LZO compression to CHAR/VARCHAR, BOOLEAN, BINARY/VARBINARY, and
FLOAT columns.
l
For INTEGER, DATE/TIME/TIMESTAMP, and INTERVAL type columns,
Data Warehouse uses a compression scheme based on the delta between
consecutive column values.
Tip: For sorted, many value columns such as primary keys, the
AUTO encoding is usually the best choice. For repeating sorted
low cardinality columns, run-length encoding (RLE) may be the
right choice.
For more detailed information on individual encoding types, see
https://my.vertica.com/docs/6.1.x/HTML/index.htm#9273.htm.
Columns Sort Order
The sort order optimizes for queries based on the query predicate, especially
WHERE clauses, GROUP BY or ORDER BY.
See also the Choose Projection Sorting Criteria performance tip.
Segmentation
In a typical Data Warehouse instance, data may be distributed across three or
more nodes of a shared or dedicated cluster. By default, Data Warehouse and the
underlying Vertica database can manage automatically this distribution. However,
in a high-performance database, developers may require better control over how
data is distributed.
In a Vertica database, segmentation controls how data from a table may be
distributed across nodes of a cluster. When a table or projection is created, you
can specify segmentation parameters to define how data is distributed.
l
If segmentation is not specified explicitly, data is segmented by a hash of
the projection columns; columns with fewer than 8 bytes are listed first,
followed by larger columns up to the first 32 columns of the table.
Page #64
NOTE: If segmentation is not configured for your Data
Warehouse instance,the default segmentation is likely to utilize
broadcast joins, which can impact performance.
A custom segmentation can be specified using SEGMENTED BY or
UNSEGMENTED clauses of the CREATE PROJECTION or CREATE TABLE
commands.
The following segmentations options are available:
l
l
l
l
l
l
l
Hash segmentation
Using SEGMENTED BY expression ALL NODES clause after CREATE
PROJECTION or CREATE TABLE command
The expression is expected to return an integer x for each row in the range:
0 <= x < 263.
You should compute this expression using HASH or MODULARHASH
function on one or more columns.
Replication
Using the UNSEGMENTED ALL NODES clause after CREATE
PROJECTION or CREATE TABLE command
Recommended for small tables with no more than a few million rows that
are joined with large ones
NOTE: Avoid using range segmentation, which Vertica
supports. Range segmentation ties your data to explicitly named
nodes on the actual underlying cluster.
For more information about tuning the segmentations of your physical design, see
the Minimize Network Joins section.
Configuring the initial superprojection with CREATE TABLE command
Each Data Warehouse table must have at least one projection with all columns: a
superprojection. Without a superprojection, the database engine does not
understand how the data should be stored.
Page #65
Whenever a new table is created using the CREATE TABLE command, a
superprojection is also created automatically. The parameters of this initial
projection can be configured by the following extensions of the CREATE TABLE
command:
l
l
ORDER BY specifies the sort order of the default superprojection
SEGMENTED BY expression ALL NODES and UNSEGMENTED ALL
NODES configure the segmentation of the default superprojection
Example:
CREATE TABLE customer (
id INTEGER PRIMARY KEY,
name_first VARCHAR(255),
name_last VARCHAR(255)
) ORDER BY id
SEGMENTED BY HASH(id) ALL NODES
NOTE: Unlike the CREATE PROJECTION command, the
KSAFE 1 parameter is not required when defining the initial
projection.
Creating a New Projection with CREATE PROJECTION command
A new projection can be created using the CREATE PROJECTION command:
CREATE PROJECTION projection-name
( columns-and-encodings )
AS
SELECT columns FROM tables
[ WHERE join-predicates ]
[ ORDER BY columns ]
[ hash-segmentation-clause
| range-segmentation-clause
| UNSEGMENTED ALL NODES ]
[ KSAFE [ k-num ] ]
Page #66
In this command, you may define the list of columns included in the projection,
optionally with encoding information. The sort criteria can be specified as a
SELECT query in the ORDER BY clause.
The SELECT query may include a join of multiple tables, which creates a prejoined projection.
NOTE: A physical design with a pre-join projection requires a
strict referential integrity between the involved tables, which can
make your loading process more fragile.
Unlike a materialized view, a projection is intended to hold the raw table data. For
this reason, the SELECT queries in projection definitions cannot include complex
transformations, aggregations, or analytic functions.
NOTE: The k-num parameter must be always set to 1 when
creating an Data Warehouse projection using the CREATE
PROJECTION command.
NODE: For every projection created using the CREATE PROJECTION command,
one of the following must be true:
l
The projection is replicated (by specifying UNSEGMENTED ALL NODES)
l
The projection is segmented and created with KSAFE 1.
Example:
CREATE PROJECTION customer_p AS
SELECT id, name_first, name_last FROM customer
ORDER BY id
SEGMENTED BY HASH(id) ALL NODES
KSAFE 1
NOTE: After a new projection is created, the data should be
refreshed using the START_REFRESH function. For additional
details, see Changing Physical Schema.
Changing Physical Schema
Page #67
In addition to your initial superprojections, new projections can be created using
the CREATE PROJECTION command. A single table can have multiple
projections if necessary.
NOTE: Having multiple projections for the same table impedes
data updates. You should retain only the necessary projections
in your production design.
NOTE: if you want to replace a superprojection, a new
superprojection must be created before the old one is removed.
Steps:
To change the physical model, please complete the following steps.
1. Review the existing projections in your Data Warehouse instance:
SELECT * FROM projections p WHERE p.projection_schema =
'[DW_ID]'
2. Run the CREATE PROJECTION commands to create new projections
3. Run the following command to start an asynchronous process of populating
the new projections with data:
SELECT START_REFRESH();
4. Check the status of the refresh process until the process finishes:
SELECT * FROM projection_refreshes;
5. Drop projections that are no longer necessary using the DROP
PROJECTION command.
Page #68
NOTE: If you are replacing only a superprojection, you must
wait until the previously executed projection refresh completes.
You cannot drop a projection if it’s the only refreshed
superprojection.
Troubleshooting:
If a projection refresh fails, the projection_refreshes table reports the
following error:
failed: projection is unsafe
Make your projection safe by specifying KSAFE 1 in the CREATE PROJECTION
command. You will need the drop your projection and re-create it with the KSAFE
parameter.
Occasionally, the following error may occur when dropping a projection:
ROLLBACK: Projection cannot be dropped because history after
AHM would be lost
Please try again later. If the problem persists, please contact GoodData
Customer Support.
Loading Data into Data Warehouse
Data is loaded into Data Warehouse using the COPY LOCAL command.
NOTE: You must use the LOCAL keyword with the COPY
commmand for Data Warehouse. See Use COPY FROM
LOCAL to Load Data.
NOTE: Although INSERT commands are available, it is
strongly recommended that you use the COPY command for
batch uploads over row-by-row inserts for an optimal load
performance.
Page #69
The COPY command supports the following options:
COPY table [ column_list ] FROM LOCAL file_list
[ BZIP | GZIP ] [WITH PARSER GdcCsvParser]
[ DELIMITER STRING_LIT ]
[ ESCAPE BY STRING_LIT ]
[ ENCLOSED BY STRING_LIT ]
[ SKIP NUMBER_LIT ]
[ REJECTMAX NUMBER_LIT ]
[ EXCEPTIONS exceptions_file ]
[ REJECTED DATA rejected_data_file ]
[ ABORT ON ERROR ]
[ AUTO | DIRECT | TRICKLE ]
NOTE:Some common parameters of the COPYcommand are not
supported in Data Warehouse. See Limited Parameters of the
COPY Command.
NOTE: To load uncompressed data, do not include the BZIP or
GZIP keywords, and reference an uncompressed source file.
The UNCOMPRESSED keyword is not supported.
NOTE: Unlike most databases, Data Warehouse does not
enforce the uniqueness of primary key columns during load;
dupe rows are inserted silently. However, duplicate keys may
trigger an error at query time in join queries. Do not assume that
duplicate rows will break the load or will be merged in target
tables. To avoid inserting unwanted duplicates, use staging
tables. See Merge data using staging tables.
Loading Compressed Data
To minimize network bandwidth and latency issues, you should compress your
data prior to upload to Data Warehouse.
NOTE: Only BZIP and GZIP formats are supported by Data
Warehouse.
Example:
Page #70
COPY customers FROM LOCAL 'customers.csv.gz' GZIP
NOTE: To load uncompressed data, do not include the BZIP or
GZIP keyword in the above command, and reference an
uncompressed file.
Use RFC 4180 Compliant CSV files for upload
By default, the COPY command expects delimited data even if the delimiter
character is not present inside individual data fields.
In the RFC 4180 document, the CSV format describes an encoding structure with
a delimiter, double quotes, or even newline characters within data fields.
The following example is a valid CSV file with a header line and a single data
record:
product_id,product_name,product_description,product_price
12345,"1"" by 5 Yards Duct Tape","Great choice for your
creative projects
Super performance strength
Available in white, red, green and black",9.95
This CSV file looks like the following in a spreadsheet application:
product_
product_name
id
12345
1" by 5 Yards Duct
Tape
product_
price
product_description
Great choice for your creative
projects
9.95
Super performance strength
Available in white, red, green
and black
To load CSV data with all escaping possibilities defined in RFC 4180, you must
explicitly specify the CSV parser using WITH PARSER GdcCsvParser, which
is a GoodData-specific CSV parser in Data Warehouse.
Example:
Page #71
COPY customers FROM LOCAL 'customers.csv.gz' GZIP WITH PARSER
GdcCsvParser
Managing Escape Characters:
To load CSV data with all escaped characters, as specified in RFC 4180, you
must explicitly specify the CSV parser using the GdcCsvParser, a GoodDataspecific parser for Data Warehouse, and include the escape character using
ESCAPE AS:
COPY customers FROM LOCAL 'customers.csv.gz' GZIP
WITH PARSER GdcCsvParser ESCAPE AS '"'
Error Handling
By default, any row with fields that cannot be inserted into target columns is
discarded silently.
This behavior can be overridden with the following modifiers to the COPY
command:
l
l
l
l
REJECTMAX number - A maximum number of permitted rejected records
before a load fails
EXCEPTIONS ’path’ - Local file containing load exceptions, which
includes useful error messages for debugging issues
REJECTED DATA ’path’ - Local file where rejected rows are to be
stored
ABORT ON ERROR - The load is cancelled (i.e. no data are loaded) on the
first rejected load
Examples:
COPY customers FROM LOCAL 'customers.csv' ABORT ON ERROR;
COPY customers FROM LOCAL 'customers.csv.gz' GZIP WITH PARSER
GdcCsvParser
Page #72
EXCEPTIONS '/tmp/exceptions.txt' REJECTED DATA
'/tmp/rejected.csv'
REJECTMAX 100
Merging Data Using Staging Tables
For the following reasons, you should use staging tables when loading data into
Data Warehouse:
l
l
Data Warehouse does not support upsert operations.
Data Warehouse does not enforce the uniqueness of primary key during
data load. However, duplicate records may trigger an error at query time in
join queries.
You cannot load your data directly into the target table and expect that any
matching records already in the table are automatically overwritten.
Tip: To manage adding data for records that may already exist
in the target table, you should load your data into an empty
staging table first. Use the MERGE command to merge the
staged data into the target table.
Segmentation in Staging:
There are performance impacts in shuffling data across cluster nodes.
Tip: Where possible, use the same segmentation for the staging
area as is used in the target table’s projections.
Example:
CREATE TEMP TABLE in_customer (
id VARCHAR(32) PRIMARY KEY,
name_first VARCHAR(255),
name_last VARCHAR(255),
created_at DATETIME,
is_deleted BOOLEAN,
) ON COMMIT PRESERVE ROWS
SEGMENTED BY HASH(id) ALL NODES; -- consistent with the
Page #73
target table
COPY in_customer FROM LOCAL '/data/customers.csv' ABORT ON
ERROR DIRECT;
MERGE /*+direct*/ -- "direct" improves the performance of
large batch operations
INTO customer tgt USING in_customer src
ON src.id = tgt.id
WHEN MATCHED THEN UPDATE
SET name_first = src.name_first, name_last = src.name_
last,
created_at = src.created_at, is_deleted = src.is_
deleted
WHEN NOT MATCHED THEN INSERT
(id, name_first, name_last, created_at, is_deleted)
VALUES
(src.id, src.name_first, src.name_last, src.created_at,
src.is_deleted);
Statistics Collection
The query optimizer uses statistics about data in your projections to build the
optimal query plan.
l
l
l
Keep the statistics up to-date by running the SELECT ANALYZE_
STATISTICS SQL command. The ANALYZE_STATISTICS function
returns 0 when it completes successfully.
The ANALYZE_STATISTICS command works on a 10% sample of the
specified disk data. The sample size can be overridden by using the
ANALYZE_HISTOGRAM function instead.
Both ANALYZE_STATISTICS and ANALYZE_HISTOGRAM functions autocommit the current transaction.
Examples:
-- collect statistics for a single table
SELECT ANALYZE_STATISTICS('table');
-- collect statistics for a specific column
SELECT ANALYZE_STATISTICS('table.column');
-- collect table statistics based on a 0.5% sample
Page #74
SELECT ANALYZE_HISTOGRAM('table', 0.5);
-- collect ble statistics based on all data
SELECT ANALYZE_HISTOGRAM('table', 100);
NOTE: Collecting statistics is a CPU- and memory-intensive
operation. You should run ANALYZE_STATISTICS only after
the data in your projections changes significantly.
Significant changes include the following:
l
First data load into a table
l
A data introduces a significant deviation in the data distribution.
l
A new projection is created and refreshed
l
l
l
The number of rows or minimum/maximum values in table’s columns
change by 50%
New primary key values are added to tables with referential integrity
constraints (In this case, both parent and child tables should be reanalyzed.)
The relative table size, compared to tables to which it is being joined,
changes materially. For example, if a table becomes only five times larger
than the other, when it was previously 50 times larger, statistics should be
analyzed.
Querying Data Warehouse
Data Warehouse is built on HP Vertica, a leading-edge columnar database, and
supports the SQL:99 standard with Vertica specific extensions.
l
For Vertica version information, see Data Warehouse and Vertica.
Vertica supports SQL standards for creating and querying for data. For more
information on query capabilities, please use the following Vertica references:
l
l
Page #75
General querying for data: See Documentation of the SELECT SQL
Command.
SQL Functions:
l
Aggregate Functions
l
Analytic Functions
l
Date/Time Functions
l
Formatting Functions
l
Geospatial Package SQL Functions
l
IP Conversion Functions
l
Mathematical Functions
l
NULL-handling Functions
l
Pattern Matching Functions
l
Regular Expression Functions
l
Sequence Functions
l
String Functions
l
Timeseries Functions
l
URI Encode/Decode Functions
l
HP Vertica Meta-functions
NOTE: Vertica functions that require a superuser permissions
and most System Information Functions are not supported by
Data Warehouse.
l
Analytic functions:Analytic functions return aggregated results. However,
the result set is not grouped. Group values are returned with each record.
See Using SQL Analytics.
l
l
Page #76
Time Series analytics:These analytic functions evaluate the values of a
given set of variables over time. Those values are then grouped into
buckets, based on a defined time interval, for analysis and aggregation. See
Using Time Series Analytics.
Event Series joins: This HP Vertica SQL extension enables analysis of
two series when their measurement intervals don’t align precisely. For
example, mismatched timestamps can be compared. You can compare
values from the two series directly, rather than normalizing the two series to
the same interval before comparison. See Event Series Joins.
Performance Tips
In high-volume environments, small changes to the database schema or methods
for using the database can have significant impacts on overall performance. Use
the tips in this section to improve performance in your Agile Data Warehousing
Service solution.
Do Not Overnormalize Your Schema
For a columnar database engine, denormalization is cheap from a storage point
of view, while table joins are expensive.
Example:
When storing the history of changes, you should store all columns of record
versions in the same table as the source data, instead of retaining the history of
each column in an extra table, to avoid table joins.
Use Run Length Encoding (RLE)
For sorted columns with many repeating values, you should explicitly use runlength encoding (RLE) in your projections. The default (AUTO) encoding may
save some space, but it's slower.
However, RLE is not suitable for big high cardinal columns, such as primary keys.
These should use the default encoding.
Use the EXPLAIN Keyword
Page #77
If your query is running slow, the EXPLAINcommand provides a quick overview
of the query plan. Examining the query plan may help to identify possible sources
of the inefficiency and can be used to derive possible improvements to the
physical database design.
The query plan can be retrieved by running the EXPLAIN command followed by
the actual SQL query.
For more information about analyzing the query plans with the EXPLAIN keyword
please use the following Vertica references:
l
EXPLAIN SQL Command Documentation
l
Understanding Query Plans
Use Monitoring Tables
QUERY_EVENTS system table
The QUERY_EVENTS system table provides useful information about queries that
have been recently executed in your Data Warehouse instance.
l
See https://my.vertica.com/docs/6.1.x/HTML/index.htm#17580.htm.
EVENT_TIMESTAMP column:
Timestamp that is recorded when the event occurs, which may assist in
identifying the code that was being executed at the time of the event.
EVENT_TYPE column:
The following values in the EVENT_TYPE column may indicate a problem that
requires attention:
PREDICATE OUTSIDE HISTOGRAM
The query optimizer encountered a predicate that was false for the entire
histogram created by ANALYZE_STATISTICS or ANALYZE HISTOGRAM.
NO HISTOGRAM
Page #78
The query optimizer encountered a predicate on a column lacking a histogram.
MEMORY LIMIT HIT
The optimizer used all allocated memory when creating the query plan. You
should simplify your query instead of increasing the memory allocation.
EVENT_DESCRIPTION column:
The EVENT_DESCRIPTION column provides clear information about the
captured events. Below are example messages:
GROUP BY key set did not fit in memory, using external sort
grouping
To fix, you should consider a projection sorted by the GROUP BY key to enable
pipelined GROUP BY sorting, which forces only the group that is currently being
processed to be retained in memory.
l
See https://my.vertica.com/docs/6.1.x/HTML/index.htm#16340.htm.
Many rows were resegmented during plan execution
To fix, you should consider using identically segmented projections.
l
For more information, see Minimize Network Joins.
The optimized encountered a predicate on a column for which
it does not have a histogram
The database needs to have statistics updated. For more information, see
Statistics Collection.
Statement identifiers:
Page #79
In the QUERY_EVENTS table, the combination of TRANSACTION_ID, REQUEST_
ID, and STATEMENT_ID fields uniquely identify the SQL statement, which is
located in the QUERY_REQUESTS table (see below).
QUERY_REQUESTS table
This table retains information about query plans, optimization, and execution
events. For any unique combination of TRANSACTION_ID and STATEMENT_ID,
the value of the REQUEST field contains the SQL request.
REQUEST_TYPE column:
The type of the statement (QUERY, DDL, or other)
REQUEST column:
The text of the SQL statement.
SUCCESS column:
Boolean value indicates whether the query executed properly.
l
Errors are logged in the ERROR_MESSAGES system table.
START_TIMESTAMP column:
Beginning timestamp for the logged event.
END_TIMESTAMP column:
Ending timestamp for the logged event.
Write Large Data Updates Directly to Disk
By default, all data updates are passed through an in-memory store first.
However, the capacity of this memory store is limited, which is why it is more
efficient to direct large bulk data uploads or modifications directly into disk. This
can be achieved by the DIRECTkeyword in the COPYcommand and the
/*+direct*/ hint in INSERT, DELETE and MERGE commands.
Examples:
Page #80
COPY table FROM LOCAL 'file.csv' WITH PARSER GdcCsvParser()
ABORT ON ERROR DIRECT;
MERGE /*+direct*/ INTO opportunities t
USING in_opportunities s
ON s.id = t.id
WHEN MATCHED THEN
UPDATE SET name = s.name,
created = s.created,
closed = s.closed,
stage = s.stage,
is_closed = s.is_closed,
is_won = s.is_won,
amount = s.amount
WHEN NOT MATCHED THEN
INSERT
(id, name, created, closed, stage, is_closed, is_won,
amount)
VALUES
(s.id, s.name, s.created, s.closed, s.stage, s.is_
closed, s.is_won, s.amount);
Avoid Unnecessary UPDATEs
An UPDATEis implemented as a combination of INSERT and UPDATE. Try to
design your model and loading routines to avoid UPDATEs of large tables.
Example 1: When storing full history of data, do not update old records with an
end-of-validity timestamp. Instead, insert new versions only and retrieve end-ofvalidity timestamp at query time when necessary.
Example 2: When computing derived columns for existing records, consider
inserting them into a separate table, rather than updating existing records.
General Projection Design Tips
When designing projections for your Data Warehouse instance, please observe
the following recommendations.
Minimize Network Joins
Page #81
Your Data Warehouse instance is typically created within a shared or dedicated
cluster of 3+ nodes running HP Vertica, and the data in your tables may be
spread across the cluster nodes. Data Warehouse provides options for controlling
how your data is distributed.
If you retain the system defaults, it is very likely that your table joins will join
records sitting on different parts of the clusters, which affect performance. This
section some insight into how to minimize network joins.
Basic Problem:
The following picture includes two randomly segmented projections. To join the
tables using these projections by cust_id, you must combine records from
different nodes. For example, orders associated with customer #2 are split across
all three nodes:
Figure: Randomly segmented projections
Solution #1: All joined records on the same node
To make the join operation more efficient, the records to be joined should be
available on the same node, as in the following:
Page #82
Figure: Identically segmented projections
NOTE: To force records from multiple tables to be stored on the
same node for performance efficiency, all tables should contain
identically segmented projections.
Segmentation can be defined in CREATE TABLE statements, like the following:
CREATE TABLE customers (
cust_id INTEGER,
name VARCHAR(255)
) SEGMENTED BY HASH(cust_id) ALL NODES;
CREATE TABLE orders (
order_id INTEGER,
cust_id, INTEGER,
order_dt DATE,
total DECIMAL(12,2)
) SEGMENTED BY HASH(cust_id) ALL NODES;
Solution #2: Replicate data across all nodes.
For smaller tables (up to a few hundred thousand records), network joins can be
avoided by replicating all customer data across all nodes:
Page #83
Figure: Replicated projection of the customers table
The corresponding SQL is the following:
cust_id INTEGER,
name VARCHAR(255)
) UNSEGMENTED ALL NODES;
order_id INTEGER,
cust_id, INTEGER,
order_dt DATE,
total DECIMAL(12,2)
);
Solution #3: Constrain data to a single node.
If all tables are small (up to a few million records), you should consider retaining
all data on a single node only:
Page #84
Figure: Both projections on the same single node only
You can force this constraint by segmenting your projections by a constant that is
unique to your implementation such as the Data Warehouse identifier identifier
(the HQrKTXGedJ6OngbUJ4QAHrb0pEw5oEif string in the example below).
cust_id INTEGER,
name VARCHAR(255)
) SEGMENTED BY HASH('HQrKTXGedJ6OngbUJ4QAHrb0pEw5oEif') ALL
NODES;
order_id INTEGER,
cust_id, INTEGER,
order_dt DATE,
total DECIMAL(12,2)
) SEGMENTED BY HASH('HQrKTXGedJ6OngbUJ4QAHrb0pEw5oEif') ALL
NODES;
Choose Projection Sorting Criteria
When developing your database projections, you should start with the default
projection, which is a superprojection containing all fields in the table. Soon,
however, you may discover that you need to optimize projections for performance
or to create custom projections to address specific query use cases.
Page #85
Tip: In a production implementation, it is not necessary to
optimize your physical model for each query that you intend to
run. Instead, you should start slowly and add or modify
projections to address specific performance issues or
requirements of your database.
The sorting criteria that you use in your projections depend on the uses for those
projections. Sort criteria should be specified based on requirements for speedy
retrieval, memory footprint, and join use cases for the projection.
These use cases are best demonstrated by example. The example below is
provided for illustrative purposes only.
Suppose you have the following three queries on two tables (table and table2):
SELECT a, b, c FROM table WHERE c = 'xxxx' ORDER BY b
SELECT b, c, SUM(a) FROM table GROUP BY b, c
SELECT t.a, t.b, t2.x FROM table t JOIN table2 t2 ON t.a =
t2.y
To build a physical model that is fully optimized for these three queries, you must
create the projections as outlined in the picture below:
Figure: Example projections
l
l
l
Page #86
projection1 is sorted by c and then by b. This sorting enables quick location
of the filtered value for c and returns the result sorted by b.
projection2 is sorted by columns used in the GROUP BY clause. This
sorting helps to minimize the memory footprint of the GROUP BY query, as
the database can retain only group-specific data in memory, instead of
maintaining a hash table containing all groups.
If table and table2 are large, projection3 and projection4 enable a merge
join of pre-sorted columns, instead of loading the smaller table into memory
and performing a hash join. Note that the columns C and Z are not present
in the projections, as they are not used by the query.
NOTE: When building your Data Warehouse database, you are
not required to create optimized query-specific projections from
scratch. You should start with one superprojection for each table
and consider adding additional projections from HP Vertica to
improve performance of slow queries.
Limitations and Differences in Data Warehouse from
Other RDBMS
Data Warehouse provides access to most features of HP Vertica with a few
exceptions listed in this section.
Single Schema per Data Warehouse Instance
The current version of Data Warehouse does not permit creation of your own
schemata. When you provision an Data Warehouse instance, a default schema
with the same name as the DW_ID is created and automatically added into your
search path. As a result, you are not required to qualify your tables and views with
the schema name.
Tip: To separate a logical group of entities, you should establish
a naming convention (e.g. “in_” prefix for input stage tables.).
No Vertica Admin Access
The admin-only features of Vertica are not accessible through Data Warehouse.
Use COPY FROM LOCAL to Load Data
Page #87
The LOCAL keyword in the COPY command tells Vertica that data is being
loaded from the client, instead of from a file already located on the nodes of the
Vertica cluster.
NOTE: The COPY command without the LOCAL keyword does
not work in Data Warehouse.
Limited Parameters of the COPY Command
The following parameters are currently supported by the COPY command. All
other parameters are not supported.
l
LOCAL
l
WITH PARSER GdcCsvParser
l
DELIMITER
l
ESCAPE AS
l
ENCLOSED BY
l
SKIP
l
REJECTMAX
l
EXCEPTIONS
l
REJECTED DATA
l
ABORT ON ERROR
l
AUTO, DIRECT, or TRICKLE
Only a plain list of columns is expected (optionally) after the table name.
NOTE: The parameters of the COPY command must be
specified in the same order as in the above list. For example,
ESCAPE AS must always precede any reference to
ENCLOSED BY.
Limited Access to System Tables
Page #88
The majority of Vertica’s system tables from V_MONITORING and V_CATALOG
schemata for introspecting your logical (tables) and physical (projections) data
model in Data Warehouse are accessible.
The available system tables can be accessed only without the schema quantifier:
SELECT * FROM tables will work but SELECT * FROM v_catalog.tables
will not.
The following tables are not available:
l
ALL_TABLES
l
DATABASES
l
ELASTIC_CLUSTER
l
EPOCHS
l
FAULT_GROUPS
l
GRANTS
l
LICENSE_AUDITS
l
NODES
l
ODBC_COLUMNS
l
PASSWORDS
l
PROFILE_PARAMETERS
l
PROFILES
l
RESOURCE_POOL_DEFAULTS
l
RESOURCE_POOLS
l
ROLES
l
STORAGE_LOCATIONS
l
SYSTEM_COLUMNS
l
SYSTEM_TABLES
l
USER_AUDITS
l
USERS
l
ACTIVE_EVENTS
l
COLUMN_STORAGE
l
CONFIGURATION_CHANGES
l
CONFIGURATION_PARAMETERS
l
CPU_USAGE
l
CRITICAL_HOSTS
l
CRITICAL_NODES
l
CURRENT_SESSION
l
DATA_COLLECTOR
l
DATABASE_BACKUPS
l
DATABASE_CONNECTIONS
l
DATABASE_SNAPSHOTS
l
DEPLOY_STATUS
l
DESIGN_STATUS
l
DISK_RESOURCE_REJECTIONS
l
DISK_STORAGE
l
ERROR_MESSAGES
l
EVENT_CONFIGURATIONS
l
HOST_RESOURCES
l
IO_USAGE
l
LOAD_STREAMS
Page #89
l
LOCK_USAGE
l
LOCKS
l
LOGIN_FAILURES
l
MEMORY_USAGE
l
MONITORING_EVENTS
l
NETWORK_INTERFACES
l
NETWORK_USAGE
l
NODE_RESOURCES
l
NODE_STATES
l
PARTITION_REORGANIZE_ERRORS
l
PARTITION_STATUS
l
PROCESS_SIGNALS
l
PROJECTION_RECOVERIES
l
PROJECTION_REFRESHES
l
QUERY_METRICS
l
REBALANCE_PROJECTION_STATUS
l
REBALANCE_TABLE_STATUS
l
RECOVERY_STATUS
l
RESOURCE_POOL_STATUS
l
RESOURCE_QUEUES
l
RESOURCE_REJECTIONS
l
RESOURCE_USAGE
l
STORAGE_TIERS
Page #90
Page #91
l
STORAGE_USAGE
l
SYSTEM
l
SYSTEM_RESOURCE_USAGE
l
SYSTEM_SERVICES
l
SYSTEM_SESSIONS
l
TUNING_RECOMMENDATIONS
l
UDX_FENCED_PROCESSES
l
USER_LIBRARIES
l
USER_LIBRARY_MANIFEST
l
USER_SESSIONS
l
WOS_CONTAINER_STORAGE
l
For the full list and the documentation of individual
system tables, see
https://my.vertica.com/docs/6.1.x/HTML/index.htm#9338
.htm.
Limited Access to System Functions
The following System Information Functions are not supported by Data
Warehouse:
l
CURRENT_DATABASE
l
CURRENT_SCHEMA
l
CURRENT_USER
l
HAS_TABLE_PRIVILEGE
l
SESSION_USER
l
USER
l
USERNAME
https://my.vertica.com/docs/6.1.x/HTML/index.htm - 16772.htm
Page #92
Moreover, the HP Vertica Meta-functions that require superuser permissions are
not supported neither.
Reserved Entity Names
Your table or view names may not collide with an existing Vertica system table.
l
For the full list of system tables, see
https://my.vertica.com/docs/6.1.x/HTML/index.htm#9338.htm.
For example, the following command fails:
CREATE TABLE users ( id INTEGER PRIMARY KEY, login VARCHAR
(32))
The error message is the following:
Referencing object "users" is not allowed
Database Designer Tool not available
Vertica includes the Database Designer tool to provide hints for designing the
physical model (projections). This tool is not available to Data Warehouse users.
JDBC Driver Limitations
The following methods of the JDBC interface are not implemented:
java.sql.Driver
l
getParentLogger
java.sql.Connection
l
abort
l
createArrayOf
l
createBlob
l
createClob
Page #93
l
createNClob
l
createSQLXML
l
createStruct
l
getClientInfo()
l
getClientInfo(java.lang.String)
l
getNetworkTimeout
l
getTypeMap
l
isWrapperFor
l
nativeSQL
l
prepareCall(java.lang.String)
l
prepareCall(java.lang.String, int, int)
l
prepareCall(java.lang.String, int, int, int)
l
prepareStatement(java.lang.String)
l
prepareStatement(java.lang.String, int)
l
prepareStatement(java.lang.String, int, int)
l
prepareStatement(java.lang.String, int, int, int)
l
prepareStatement(java.lang.String, int[])
l
prepareStatement(java.lang.String, java.lang.String[])
l
releaseSavepoint
l
rollback(java.sql.Savepoint)
l
setClientInfo(java.util.Properties)
l
setClientInfo(java.lang.String, java.lang.String)
l
setNetworkTimeout
l
setSavepoint()
l
setSavepoint(java.lang.String)
l
setTypeMap
l
unwrap
Page #94
java.sql.DatabaseMetaData
l
getRowIdLifetime()
l
supportsStoredFunctionsUsingCallSyntax()
l
autoCommitFailureClosesAllResultSets()
l
getClientInfoProperties()
l
getFunctions(java.lang.String,java.lang.String,java.lang.String)
l
l
getFunctionColumns
(java.lang.String,java.lang.String,java.lang.String,java.lang.String)
getPseudoColumns
(java.lang.String,java.lang.String,java.lang.String,java.lang.String)
l
generatedKeyAlwaysReturned()
l
unwrap(java.lang.Class)
l
isWrapperFor(java.lang.Class)
java.sql.Statement
l
addBatch
l
cancel
l
clearBatch
l
closeOnCompletion
l
execute(java.lang.String, int[])
l
execute(java.lang.String, java.lang.String[])
l
executeBatch
l
executeUpdate(java.lang.String, int[])
l
executeUpdate(java.lang.String, java.lang.String[])
l
getMaxFieldSize
l
getQueryTimeout
l
getResultSetHoldability
l
isClosed
l
isCloseOnCompletion
l
isPoolable
l
isWrapperFor
l
setCursorName
l
setEscapeProcessing
l
setMaxFieldSize
l
setPoolable
l
setQueryTimeout
l
unwrap
Page #95
java.sql.ResultSet
l
getTime(java.lang.String,java.util.Calendar)
l
getTime(int,java.util.Calendar)
l
getDate(int,java.util.Calendar)
l
getDate(java.lang.String,java.util.Calendar)
l
getTimestamp(int,java.util.Calendar)
l
getTimestamp(java.lang.String,java.util.Calendar)
l
isClosed()
l
getObject(int,java.lang.Class)
l
getObject(int,java.util.Map)
l
getObject(java.lang.String,java.util.Map)
l
getObject(java.lang.String,java.lang.Class)
l
getBytes(int)
l
getBytes(java.lang.String)
l
getArray(int)
l
getArray(java.lang.String)
l
getURL(java.lang.String)
l
getURL(int)
l
unwrap(java.lang.Class)
l
getRef(java.lang.String)
l
getRef(int)
l
isWrapperFor(java.lang.Class)
l
getCursorName()
l
getHoldability()
l
getNCharacterStream(int)
l
getNCharacterStream(java.lang.String)
l
getSQLXML(java.lang.String)
l
getSQLXML(int)
l
getNClob(int)
l
getNClob(java.lang.String)
Page #96
l
getClob(java.lang.String)
l
getClob(int)
l
getBlob(java.lang.String)
l
getBlob(int)
l
getCharacterStream(java.lang.String)
l
getCharacterStream(int)
l
getBinaryStream(int)
l
getBinaryStream(java.lang.String)
l
getUnicodeStream(java.lang.String)
l
getUnicodeStream(int)
l
getAsciiStream(int)
l
getAsciiStream(java.lang.String)
l
getRowId(int)
l
getRowId(java.lang.String)
Page #97
java.sql.ResultSet (special cases)
In addition to the above, java.sql.ResultSet instances returned by
DataBaseMetaData methods or by the Statement#getGeneratedKeys method do
not implement the following methods:
l
clearWarnings
l
getFetchSize
l
getWarnings
l
setFetchSize

Agile Data Warehousing Service User Guide

Transcription

Similar documents

HAIRHOUSE WAREHOUSE

230 BOSSARDSVILLE ROAD Sciota, Monroe County, PA

Cirrus Tech Core Capabilities THE CRANE

Rooms To Go Distribution Center

Egulf-1d

rampion headquarters

11820 Central Parkway

Y-WMS TM

1/35 Biscayne Way, JANDAKOT, WA, 6164

Simard