Teradata ISV Partner Technical Guide
Integration with Teradata

June 2015

Organization Name: Teradata Partners Integration Lab
Location: 17095 Via Del Campo, San Diego, CA 92127
PREFACE
Revision History:

Version   Author(s)             Date        Comments
A01       Partner Engineering   9/20/2005   Initial Revision
A02       Partner Engineering   9/22/2005   Minor Updates
A03       Partner Engineering   3/19/2009   Update
A04       Partner Engineering   7/20/2009   Minor Update
A05       Partner Engineering   9/3/2010    Update
A06       Partner Engineering   2/26/2013   Update
A07       Partner Engineering   1/2/2014    Update
A08       Partner Engineering   6/8/2015    Update and Addition of FAQ material
Audience:
The audience for this document is Teradata ISV Partners.
Contents
Contents 3
1. Teradata Partners Program .............................................................................................. 5
Section 1.1 Teradata Database -- Introduction .......................................................................... 5
Section 1.2 Teradata Support ..................................................................................................... 6
Section 1.3 Teradata Partner Intelligence Network and the Teradata Education Network ....... 6
2. Teradata Basics ............................................................................................................... 7
Section 2.1 Unified Data Architecture ....................................................................................... 7
Section 2.2 Data Types .............................................................................................................. 8
Section 2.3 Primary Index......................................................................................................... 14
Section 2.4 NoPI Objects .......................................................................................................... 15
Section 2.5 Secondary Indexes ................................................................................................ 16
Section 2.6 Intermediate/Temporary Tables ............................................................................. 17
Section 2.7 Locking .................................................................................................................. 17
Section 2.8 Statistics ................................................................................................................. 19
2.8.1 Random AMP Sampling .............................................................................................. 20
2.8.2 Full Statistics Collection .............................................................................................. 22
2.8.3 Collection with the USING SAMPLE option .............................................................. 23
2.8.4 Collect Statistics Summary Option .............................................................................. 24
2.8.5 Summary: Teradata Statistics Collection ..................................................................... 25
2.8.6 New opportunities for statistics collection in Teradata 14.0[1] .................................... 26
2.8.7 Recommended Reading ............................................................................................... 28
Section 2.9 Stored Procedures .................................................................................................. 29
Section 2.10 User Defined Functions (UDF) ............................................................................ 32
UDFs are invoked by qualifying the database name where they are stored, e.g.
DBName.UDFname(), or, if stored in the special database called SYSLIB, without database name
qualification, e.g. UDFname(). ......................................................................... 33
Section 2.11 Table Operators .................................................................................................... 35
Section 2.12. QueryGrid ........................................................................................................... 37
Section 2.13 DBQL ................................................................................................................... 39
Section 2.14 Administrative Tips ............................................................................................. 43
3. Workload Management ................................................................................................. 45
Section 3.1 Workload Administration ...................................................................................... 45
4. Migrating to Teradata ................................................................................................... 46
Section 4.1 Utilities and Client Access ..................................................................................... 46
Section 4.1.1 Teradata Load/Unload Protocols & Products ..................................................... 46
Section 4.1.2 Input Data Sources with Scripting Tools ............................................................ 48
Section 4.1.3 Teradata Parallel Transporter .............................................................................. 48
Section 4.1.4 Restrictions & Other Techniques ........................................................................ 58
Section 4.2 Load Strategies & Architectural Options ............................................................... 61
Section 4.2.1 ETL Architectural Options ................................................................................. 61
Section 4.2.2 ISV ETL Tool Advantages vs. Teradata Tools ................................................... 62
Section 4.2.3 Load Strategies.................................................................................................... 62
Section 4.2.4 ETL Tool Integration .......................................................................................... 63
Section 4.3 Concurrency of Load and Unload Jobs .................................................................. 64
Section 4.4 Load comparisons ................................................................................................. 64
5. References ..................................................................................................................... 64
Section 5.1 SQL Examples ....................................................................................................... 64
Derived Tables ...................................................................................................................... 65
Recursive SQL ...................................................................................................................... 65
Sub Queries ........................................................................................................................... 66
Case Statement ...................................................................................................................... 67
Sum (SQL-99 Window Function) ......................................................................................... 69
Rank (SQL-99 Window Function)........................................................................................ 70
Fast Path Insert ...................................................................................................................... 71
Fast Path Delete .................................................................................................................... 72
Section 5.2 Set vs. Procedural................................................................................................... 72
Section 5.3 Statistics collection “cheat sheet” .......................................................................... 75
Section 5.4 Reserved words ...................................................................................................... 78
Section 5.5 Orange Books and Suggested Reading .................................................................. 78
1. Teradata Partners Program
Section 1.1 Teradata Database -- Introduction
This document is intended for ISV partners new to the Teradata partner program. There is a
wealth of documentation available on the database and client/utility software; in fact, the
majority of information contained in this document came from those sources.
ISVs need to understand that although Teradata is an ANSI-standard relational database
management system, it differs from other databases, and to leverage Teradata's strengths ISVs
will need to understand these differences to integrate effectively. It is primarily
those key differences and strengths that are pointed out in this document; this document is
intended as a quick-start guide and not a replacement for education or for the utilization of the
extensive user documentation provided for Teradata.
The test environment Teradata normally recommends for our partners is the Teradata Partner
Engineering lab in San Diego. For testing in this environment, you can connect to Teradata
servers in the lab via high-speed internet connections.
For some partners, the Teradata Partner Engineering lab will not be sufficient as it does not cover
all requirements. For example, the lab is not an applicable environment for performance testing.
While Teradata can accommodate some requests on a case-by-case basis, it is not a function of
the lab.
Therefore, if you decide to execute your testing on your own premises, the following test
environments are supported:
Teradata Database Support Matrix:
Operating System               TD 13.00   TD 13.10   TD 14.00   TD 14.10   TD 15.00   TD 15.10
Linux SLES 9                   X          X
Linux SLES 10                  X          X          X          X (SP3)    X (SP3)    X (SP3)
Linux SLES 11                                                   X (SP1)    X (SP1)    X
MP-RAS                         X
Windows Server 2003 (32-bit)   X
Windows Server 2003 (64-bit)   X
Teradata Database client applications run on many operating systems. See Teradata Tools and
Utilities Supported Platforms and Product Versions, at: http://www.info.teradata.com
The Teradata Database software can be installed on a Teradata or non-Teradata server. The minimum hardware
requirements for non-Teradata systems are described in the Field Installation Guide for Teradata
Node Software for Windows, and in the Field Installation Guide for Teradata Node Software for
Linux SLES9, SLES10 and SLES11 manuals. The latest guides can be found on
http://www.info.teradata.com/.
Section 1.2 Teradata Support
Partners that have an active partner support contract with us can submit incidents via T@YS
(http://tays.teradata.com). In addition, T@YS provides access to download drivers and patches and
to search the knowledge repositories.
All incidents are submitted on-line via T@YS to our CS organization. Our CS organization
assumes the partner has a reasonable level of Teradata knowledge. It is not in the scope of the CS
organization to educate the partner, to hand-hold a partner through an installation of Teradata software or
through the resolution of an issue, or to provide general Teradata consulting advice.
To sign up for T@YS, submit your name, title, address, phone number and e-mail to your global
alliance support partner.
Section 1.3 Teradata Partner Intelligence Network and the
Teradata Education Network
The Teradata Partner Intelligence Network is the single most valuable and complete source of
information for Teradata ISV Partners. Partners that have an active partner support contract with
Teradata can access this Network at http://partnerintelligence.teradata.com.
All partners should receive an orientation session for the Teradata Partner Intelligence Network
as part of their Teradata Partner benefit package. This network is the one source for all the tools
and resources partners need to develop and promote their integrated solutions for Teradata.
Partners that have an active partner support contract with us also have access to the Teradata
Education Network (TEN) (http://www.teradata.com/ten). In order to sign up, submit your name,
title, address, phone number and e-mail to your global alliance support partner.
Access to TEN is free. Depending on the type of education selected, there may be a cost as
follows:
 Web Based Training - offered at no cost
 Recorded Virtual Classes - offered at no cost
 Live Virtual Classes - offered at a nominal fee per course
 Instructor Led Classes - offered at a discount
Of particular interest is “Teradata Essentials for Partners.” The primary focus of this four-day
technical class is to provide a foundational understanding of Teradata’s design and
implementation to Alliance Partners. The class is given in a lecture format and provides a
technical and detailed description of the following topics:
 Data Warehousing
 Teradata concepts, features, and functions
 Teradata physical database design - make the correct index selections by understanding distribution, access, and join techniques
 Explains, space utilization of tables and databases, join indexes, and hash indexes are discussed in relation to physical database design
 Teradata Application Utilities - BTEQ and TPT (Load, Update, Export and Stream adapters); details on when and how to use
 Key Teradata features and utilities (up to Teradata 14.0) are included
For more training and education information, including a Partner Curriculum Map with
descriptions of the courses as well as additional recommended training, visit the “Education”
webpage on the Teradata PartnerIntelligence website.
http://partnerintelligence.teradata.com
2. Teradata Basics
Section 2.1 Unified Data Architecture
In October 2012, Teradata introduced the Teradata® Unified Data Environment™ and the
Teradata Unified Data Architecture™ (UDA). The UDA integrates the Teradata analytics platform, the
Teradata Aster discovery platform, and Hadoop technology.
In the UDA, data is intended to flow where it is most efficiently processed. In the case of
unstructured data with multiple formats, it can be quickly loaded in Hadoop as a low-cost
platform for landing, staging, and refining raw data in batch. Teradata’s partnership with
Hortonworks enables enterprises to use Hadoop to capture extensive volumes of historical data
and perform massive processing to refine the data with no need for expensive and specialized
knowledge and resources.
The data can then be leveraged in the Teradata Aster analytics discovery platform for real-time,
deep statistical analysis and discovery, allowing users to identify patterns and trends in the data.
The patented Aster SQL-MapReduce® parallel programming framework combines the analytic
power of MapReduce with the familiarity of SQL.
After analysis in the Aster environment, the relevant data can be routed to the Teradata Database,
integrating the data discovered by the Teradata Aster discovery platform with all of the existing
operational data, resulting in intelligence that can be leveraged across the enterprise.
Some of the benefits that can be realized by employing the UDA:
 Capture and refine data from a wide variety of sources.
 Perform multi-structured data preprocessing.
 Develop rapid analytics.
 Process embedded analytics, analyzing both relational and non-relational data.
 Produce semi-structured data as output, often with metadata and heuristic analysis.
 Solve new analytical workloads with reduced time to insight.
 Use massively parallel storage in Hadoop to efficiently retain data.
For more information on UDA, contact your Partner Integration Lab consultant, or visit
the PartnerIntelligence website at http://partnerintelligence.teradata.com.
Section 2.2 Data Types
These data types are derived from the Teradata Database 15.00 SQL Reference Manual.
Every data value belongs to an SQL data type. For example, when you define a column in a
CREATE TABLE statement, you must specify the data type of the column. The set of data
values that a column defines can belong to one of the following groups of data types:
• Array/VARRAY
• Byte and BLOB
• Character and CLOB
• DateTime
• Geospatial
• Interval
• JSON
• Numeric, including Number
• Parameter
• Period
• UDT
• XML
Array/VARRAY Data Type
A data type used for storing and accessing multidimensional data. The ARRAY data type can
store many values of the same specific data type in a sequential or matrix-like format. ARRAY
data types can be single or multi-dimensional.
Teradata Database supports a one-dimensional (1-D) ARRAY data type and a multidimensional
(n-D) ARRAY data type, with up to 5 dimensions.
Single - The 1-D ARRAY type is defined as a variable-length ordered list of values of the same
data type. It has a maximum number of values that you specify when you create the ARRAY
type. You can access each element value in a 1-D ARRAY type using a numeric index value.
Multi-dimensional – an n-D ARRAY is a mapping from integer coordinates to an element type.
The n-D ARRAY type is defined as a variable-length ordered list of values of the same data type.
It has 2-5 dimensions, with a maximum number of values for each dimension, which you
specify when you create the ARRAY type.
You can also create an ARRAY data type using the VARRAY keyword and syntax for Oracle
compatibility.
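
A minimal sketch of creating and using a 1-D ARRAY type follows; the type, table, and column names are illustrative and not taken from the manual:

-- Define a 1-D ARRAY type holding up to 5 INTEGER elements.
CREATE TYPE quarterly_sales AS INTEGER ARRAY[5];

-- Use the type as a column data type.
CREATE TABLE store_sales (
    store_id INTEGER,
    sales    quarterly_sales
) PRIMARY INDEX (store_id);

-- Access an individual element by its numeric index.
SELECT store_id, sales[1] FROM store_sales;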
Byte and BLOB Data Types
The BYTE, VARBYTE and BLOB data types are stored in the client system format – they are
never translated by Teradata Database. They store raw data as logical bit streams. For any
machine, BYTE, VARBYTE, and BLOB data is transmitted directly from the memory of the
client system. The sort order is logical, and values are compared as if they were n-byte, unsigned
binary integers suitable for digitized images. The following are examples of Byte data types.
BYTE - Represents a fixed-length binary string.
VARBYTE - Represents a variable-length binary string.
BLOB - Represents a large binary string of raw bytes. A binary large object (BLOB)
column can store binary objects, such as graphics, video clips, files, and
documents.
Character and CLOB Data Types
In general, CHARACTER, VARCHAR, and CLOB data types represent character data.
Character data is automatically translated between the client and the database. Its form-of-use is
determined by the client character set or session character set. The form of character data internal
to Teradata Database is determined by the server character set attribute of the column. The
following are examples of Character data types.
CHAR[(n)] - Represents a fixed length character string for Teradata Database internal
character storage.
VARCHAR (n) - Represents a variable length character string of length 0 to n for
Teradata Database internal character storage.
LONG VARCHAR - Specifies the longest permissible variable length character string
for Teradata Database internal character storage.
CLOB - Represents a large character string. A character large object (CLOB) column
can store character data, such as simple text, HTML, or XML documents.
Teradata does not support converting a character to its underlying ASCII integer value. To
accomplish this task there is a function CHAR2HEXINT that returns a hex representation of a
character, and there is also an ASCII() UDF in the Oracle library.
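
For example (a minimal sketch; the literal shown is arbitrary):

-- Built-in function: returns the hexadecimal representation of the character string.
SELECT CHAR2HEXINT('A');

-- If the optional Oracle UDF library has been installed, ASCII() returns the numeric code point.
SELECT ASCII('A');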
DateTime Data Types
DateTime values represent dates, times, and timestamps. Use the following SQL data types to
specify DateTime values.
DATE - Represents a date value that includes year, month, and day components.
TIME [WITH TIME ZONE] - Represents a time value that includes hour, minute, second,
fractional second, and [optional] time zone components.
TIMESTAMP [WITH TIME ZONE] - Represents a timestamp value that includes year,
month, day, hour, minute, second, fractional second, and [optional] time zone components.
Date and Time DML syntax can be tricky in Teradata. Below are a few common date/time
queries that have proven to be useful. Note that some of the output shown is dependent upon the
“Date/Time Format” parameter selected in the ODBC Setup Options.
a) Current Date and Time:

• SELECT Current_Date;         -- Retrieves the system date (mm/dd/yyyy)
• SELECT Current_Time;         -- Retrieves the system time (hh:mm:ss)
• SELECT Current_TimeStamp;    -- Retrieves the system timestamp

b) Timestamps:

• SELECT CAST (Current_Timestamp AS DATE);      -- Extracts the date (mm/dd/yyyy)
• SELECT CAST (Current_Timestamp AS TIME(6));   -- Extracts the time (hh:mm:ss)
• SELECT Day_Of_Week
  FROM Sys_Calendar.Calendar
  WHERE Calendar_Date = CAST(Current_Timestamp AS DATE);
                                                -- Uses the TD System Calendar to compute the day of week (#)
• SELECT EXTRACT (DAY FROM Current_Timestamp);     -- Extracts the day of the month
• SELECT EXTRACT (MONTH FROM Current_Timestamp);   -- Extracts the month
• SELECT EXTRACT (YEAR FROM Current_Timestamp);    -- Extracts the year
• SELECT CAST(Current_Timestamp AS DATE) - CAST (AnyTimestampCol AS DATE);
                                                -- Computes the number of days between timestamps
• SELECT Current_Timestamp - AnyTimestampColumn DAY(4) TO SECOND(6);
                                                -- Computes the length of time between timestamps

c) Month:

• SELECT CURRENT_DATE - EXTRACT (DAY FROM CURRENT_DATE) + 1;
                                                -- Computes the first day of the month
• SELECT Add_Months ((CURRENT_DATE - EXTRACT (DAY FROM CURRENT_DATE) + 1),1)-1;
                                                -- Computes the last day of the month
• SELECT Add_Months (Current_Date, 3);          -- Adds 3 months to the current date
Geospatial Data Types
Geospatial information identifies the geographic location of features and boundaries on the
planet. Geospatial data types provide a way for applications to store, manage, retrieve,
manipulate, analyze, and display geographic information to interface with Teradata Database.
Teradata Database geospatial data types define methods that perform geometric calculations and
test for spatial relationships between two geospatial values. You can use geospatial types to
represent geometries having up to three dimensions. The following are examples of Geospatial
data types.
ST_Geometry -- A Teradata proprietary internal UDT that can represent any of the
following geospatial types:
•ST_Point: 0-dimensional geometry that represents a single location in
two-dimensional coordinate space.
•ST_LineString: 1-dimensional geometry usually stored as a sequence of
points with a linear interpolation between points.
•ST_Polygon: 2-dimensional geometry consisting of one exterior
boundary and zero or more interior boundaries, where each interior
boundary defines a hole.
•ST_GeomCollection: Collection of zero or more ST_Geometry values.
•ST_MultiPoint: 0-dimensional geometry collection where the elements
are restricted to ST_Point values.
•ST_MultiLineString: 1-dimensional geometry collection where the
elements are restricted to ST_LineString values.
•ST_MultiPolygon: 2-dimensional geometry collection where the
elements are restricted to ST_Polygon values.
•GeoSequence: Extension of ST_LineString that can contain tracking
information, such as time stamps, in addition to geospatial information.
The ST_Geometry type supports methods for performing geometric calculations.
MBR -- Teradata Database also provides a UDT called MBR that provides a way to
obtain the minimum bounding rectangle (MBR) of a geometry for tessellation purposes.
ST_Geometry defines a method called ST_MBR that returns the MBR of a geometry.
The ST_Geometry and MBR UDTs are defined in the SYSUDTLIB database.
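
As a brief sketch (table and column names are illustrative), an ST_Geometry column can be loaded from well-known text (WKT) and queried through the methods the type defines:

CREATE TABLE landmarks (
    landmark_id INTEGER,
    location    ST_Geometry
) PRIMARY INDEX (landmark_id);

-- WKT text is implicitly converted to ST_Geometry on insert.
INSERT INTO landmarks VALUES (1, 'POINT(-117.10 32.91)');

-- Point methods return the individual coordinates.
SELECT landmark_id, location.ST_X(), location.ST_Y()
FROM landmarks;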
Interval Data Types
An interval value is a span of time. There are two mutually exclusive interval type categories.
Year- Month represents a time span that can include a number of years and months.
•INTERVAL YEAR
•INTERVAL YEAR TO MONTH
•INTERVAL MONTH
Day-Time represents a time span that can include a number of days, hours, minutes, or seconds.
•INTERVAL DAY
•INTERVAL DAY TO HOUR
•INTERVAL DAY TO MINUTE
•INTERVAL DAY TO SECOND
•INTERVAL HOUR
•INTERVAL HOUR TO MINUTE
•INTERVAL HOUR TO SECOND
•INTERVAL MINUTE
•INTERVAL MINUTE TO SECOND
•INTERVAL SECOND
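
A brief sketch of interval usage (the column and table names are illustrative):

-- An interval literal can be added to or subtracted from a DateTime value.
SELECT CURRENT_TIMESTAMP + INTERVAL '30' DAY;

-- Subtracting two timestamps with a result interval qualifier yields a Day-Time interval.
SELECT (end_ts - start_ts) DAY(4) TO SECOND(6)
FROM shipment_log;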
JSON Data Type
JSON (JavaScript Object Notation) is a text-based data interchange format, often used in web
applications to transmit data. JSON has been widely adopted by web application developers
because, compared to XML, it is easier for humans to read and write and easier for machines to parse and
generate. JSON documents can be stored and processed in Teradata Database.
Teradata Database can store JSON records as a JSON document or store JSON records in
relational format. Teradata Database provides the following support for JSON data.
• Methods, functions, and stored procedures that operate on the JSON data type, such as parsing
and validation.
• Shredding functionality that allows you to extract values from JSON documents up to 16MB in
size and store the extracted data in relational format.
• Publishing functionality that allows you to publish the results of SQL queries in JSON format.
• Schema-less or dynamic schema with the ability to add a new attribute without changing the
schema. Data with new attributes is immediately available for querying. Rows without the new
column can be filtered out.
• Use existing join indexing structures on extracted portions of the JSON data type.
• Apply advanced analytics to JSON data.
• Functionality to convert an ST_Geometry object into a GeoJSON value and a GeoJSON value
into an ST_Geometry object.
• Allows JSON data of varying maximum length; JSON data can be internally compressed.
• Collect statistics on extracted portions of the JSON data type.
• Use standard SQL to query JSON data.
• JSONPath provides simple traversal and regular expressions with wildcards to filter and
navigate complex JSON documents.
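
A minimal sketch of defining and querying a JSON column (the table, column, and JSONPath names are illustrative):

-- Store JSON documents of up to 4000 characters per row.
CREATE TABLE customer_events (
    event_id INTEGER,
    payload  JSON(4000)
) PRIMARY INDEX (event_id);

-- Extract a single attribute with the JSONExtractValue method and a JSONPath expression.
SELECT event_id,
       payload.JSONExtractValue('$.customer.name')
FROM customer_events;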
Numeric Data Types
A numeric value is either an exact numeric number (integer or decimal) or an approximate
numeric number (floating point). The following are examples of Numeric data types.
BIGINT -- Represents a signed, binary integer value from
-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
INTEGER -- Represents a signed, binary integer value from -2,147,483,648 to
2,147,483,647.
SMALLINT -- Represents a signed binary integer value in the range -32768 to 32767.
BYTEINT -- Represents a signed binary integer value in the range -128 to 127.
REAL, DOUBLE PRECISION, FLOAT - Represents a value in sign/magnitude
form from 2.226E-308 to 1.797E+308.
DECIMAL [(n[,m])] and NUMERIC [(n[,m])] - Represents a decimal number of n
digits, with m of those n digits to the right of the decimal point.
NUMBER – Represents a numeric value with optional precision and scale limitations.
If the default data type used to aggregate numeric values is causing an overflow, then a cast to
BIGINT may resolve the problem. For the differences between NUMBER, DECIMAL, FLOAT,
and BIGINT, see the SQL Data Types and Literals manual.
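
For example (illustrative table and column names), casting inside the aggregate avoids overflowing the default result type:

-- SUM over an INTEGER column can overflow the INTEGER result type;
-- casting the column to BIGINT before aggregating avoids the numeric overflow error.
SELECT SUM(CAST(quantity AS BIGINT)) AS total_quantity
FROM order_items;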
Parameter Data Types
These are data types that can be used only with input or result parameters in a function, method,
stored procedure, or external stored procedure. They include the following data types.
TD_ANYTYPE - An input or result parameter data type that is used in UDFs,
UDMs, and external stored procedures, and that can accept any system-defined
data type or user-defined type (UDT). The parameter attributes and return type are
determined at execution time.
VARIANT_TYPE - An input parameter data type that can be used to package
and pass in a varying number of parameters of varying data types to a UDF as a single
UDF input parameter.
Period Data Types
A period data type represents a set of contiguous time granules that extends from a beginning
bound up to but not including an ending bound. Use Period data types to represent time periods.
The following are examples of Period data types.
PERIOD(DATE) - Represents an anchored duration of DATE elements that
include year, month, and day components.
PERIOD(TIME[(n)][WITH TIME ZONE]) - Represents an anchored duration
of TIME elements that include hour, minute, second, fractional second,
and [optional] time zone components.
PERIOD(TIMESTAMP[(n)][WITH TIME ZONE]) - Represents an anchored
duration of TIMESTAMP elements that include year, month, day, hour,
minute, second, fractional second, and [optional] time zone components.
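
A minimal sketch of a PERIOD(DATE) column (object names are illustrative):

CREATE TABLE job_assignment (
    emp_id   INTEGER,
    duration PERIOD(DATE)
) PRIMARY INDEX (emp_id);

-- A period value is constructed from its beginning and ending bounds.
INSERT INTO job_assignment VALUES (1, PERIOD(DATE '2015-01-01', DATE '2015-06-30'));

-- BEGIN and END extract the two bounds of the period.
SELECT emp_id, BEGIN(duration), END(duration)
FROM job_assignment;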
UDT Data Types
UDT data types are custom data types that you define to model the structure and behavior of data
that your application deals with. Teradata Database supports distinct and structured UDTs. The
following are examples of UDT data types.
Distinct - A UDT that is based on a single predefined data type, such as
INTEGER or VARCHAR.
Structured - A UDT that is a collection of one or more fields called attributes,
each of which is defined as a predefined data type or other UDT (which
allows nesting).
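
For example (the type and table names are illustrative), a distinct UDT can be created and used as a column type:

-- A distinct UDT based on a single predefined data type.
CREATE TYPE euro AS DECIMAL(10,2) FINAL;

CREATE TABLE invoices (
    invoice_id INTEGER,
    amount     euro
) PRIMARY INDEX (invoice_id);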
XML Data Type
The XML data type allows you to store XML content in a compact binary form that preserves
the information set of the XML document, including the hierarchy information and type
information derived from XML validation. The document identity is preserved, as opposed to
XML shredding, which only extracts values out of the XML document.
Related Topics
For detailed information on data types, see the SQL Data Types and Literals, SQL Geospatial
Types, and Teradata XML manuals. Also refer to the SQL Functions, Operators,
Expressions, and Predicates manual for a list of data type conversion functions, including
support of Oracle data type conversion functions.
Section 2.3 Primary Index
A primary index is required for all Teradata database tables. If you do not assign a primary
index explicitly when you create a table, then the system assigns one automatically according to
the following rules:
Stage 1. WHEN a primary key column is defined, but a primary index is not, THEN the system
selects the primary key column set to be the primary index and defines it as a UPI.

Stage 2. WHEN neither a primary key nor a primary index is defined, THEN the system selects
the first column having a UNIQUE constraint to be the primary index and defines it as a UPI.

Stage 3. WHEN no primary key, primary index, or uniquely constrained column is defined,
THEN the system selects the first column defined for the table to be the primary index.
If the first column defined in the table has a LOB data type, then the CREATE TABLE operation
aborts and the system returns an error message. In this case, WHEN the table has only one
column and its table kind is SET, the system defines the index as a UPI; for anything else, it
defines the index as a NUPI.
Use the CREATE TABLE statement to create primary indexes. Data accessed using a primary
index is always a one-AMP operation because a row and its primary index are stored together in
the same structure. This is true whether the primary index is unique or non-unique, and whether
it is partitioned or non-partitioned.
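
For example (table and column names are illustrative), the index is declared directly in the CREATE TABLE statement:

-- Unique primary index (UPI): at most one row per index value.
CREATE TABLE customer (
    customer_id INTEGER,
    last_name   VARCHAR(50)
) UNIQUE PRIMARY INDEX (customer_id);

-- Non-unique primary index (NUPI): duplicate index values are allowed.
CREATE TABLE order_line (
    order_id INTEGER,
    line_no  INTEGER,
    item_id  INTEGER
) PRIMARY INDEX (order_id);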
Purpose of the Primary Index
The primary index has four purposes:

 To define the distribution of the rows to the AMPs.
  With the exception of NoPI tables and column-partitioned tables and join indexes,
  Teradata Database distributes table rows across the AMPs on the hash of their primary
  index value. The determination of which hash bucket, and hence which AMP the row is
  to be stored on, is made solely on the value of the primary index.
  The choice of columns for the primary index affects how even this distribution is. An
  even distribution of rows to the AMPs is usually of critical importance in picking a
  primary index column set.

 To provide access to rows more efficiently than with a full table scan.
  If the values for all the primary index columns are specified in the constraint clause in the
  DML statement, single-AMP access can be made to the rows using that primary index
  value.
  With a partitioned primary index, faster access is also possible when all the values of the
  partitioning columns are specified or if there is a constraint on partitioning columns.
  Other retrievals might use a secondary index, a hash or join index, a full table scan, or a
  mix of several different index types.

 To provide for efficient joins.
  If there is an equality join constraint on the primary index of a table, it may be possible to
  do a direct join to the table (that is, rows of the table might not have to be redistributed,
  spooled, and sorted prior to the join).

 To provide a means for efficient aggregations.
  If the GROUP BY key is on the primary index of a table, it is often possible to perform a
  more efficient aggregation.
The following restrictions apply to primary indexes:
 No more than one primary index can be defined on a table.
 No more than 64 columns can be specified in a primary index definition.
 You cannot include columns having XML, BLOB, BLOB-based UDT, CLOB, CLOB-based UDT, XML-based UDT, Period, ARRAY, VARRAY, VARIANT_TYPE, Geospatial, or JSON data types in any primary index definition.
 You cannot define a primary index for a NoPI table until Teradata Database 15.10.
 You cannot define a primary index for column-partitioned tables and join indexes until Teradata Database 15.10.
 Primary index columns cannot be defined on row-level security constraint columns.
 You cannot specify multivalue compression for primary index columns.
Section 2.4 NoPI Objects
Starting with Teradata Database 13.0, a table can be defined without a primary index. This
feature is referred to as the NoPI Table feature. A NoPI object is a table or join index that does
not have a primary index and always has a table kind of MULTISET. Without a PI, the hash
value as well as AMP ownership of a row is arbitrary. A NoPI table is internally treated as a hash
table; it is just that typically all the rows on one AMP will have the same hash bucket value.
The basic types of NoPI objects are:
• Nonpartitioned NoPI tables
• Column-partitioned tables and join indexes (these may also have row partitioning)
The chief purpose of NoPI tables is as staging tables. FastLoad can efficiently load data into
empty nonpartitioned NoPI staging tables because NoPI tables do not have the overhead of row
distribution among the AMPs and sorting the rows on the AMPs by rowhash.
Nonpartitioned NoPI tables are also critical to support Extended MultiLoad Protocol
(MLOADX). A nonpartitioned NoPI staging table is used for each MLOADX job.
NoPI tables are not intended to be used by end-user queries; they exist primarily as a landing
place for FastLoad or MultiLoad staging data.
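
A minimal sketch of a NoPI staging table (names are illustrative):

-- NO PRIMARY INDEX creates a NoPI table; the table kind is always MULTISET.
CREATE MULTISET TABLE stg_transactions (
    txn_id     INTEGER,
    txn_detail VARCHAR(200)
) NO PRIMARY INDEX;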
For more information, please refer to the Teradata Database Design manual.
Section 2.5 Secondary Indexes
A secondary index is never required for Teradata Database tables, but they can sometimes
improve system performance, particularly in decision support environments. When column
demographics suggest their usefulness, the Optimizer selects secondary indexes to provide faster
set selection.
While secondary indexes are exceedingly useful for optimizing repetitive and standardized
queries, the Teradata Database is also highly optimized to perform full-table scans in parallel.
Because of the strength of full-table scan optimization in the Teradata Database, there is little
reason to be heavy-handed about assigning multiple secondary indexes to a table.
Secondary indexes are less frequently included in query plans by the Optimizer than the primary
index for the table being accessed.
You can create secondary indexes when you create the table via the CREATE TABLE statement,
or you can add them later using the CREATE INDEX statement
Data access using a secondary index varies depending on whether the index is unique or non-unique.
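
For example (object names are illustrative), a USI and a NUSI can be added after the table exists:

-- Unique secondary index (USI).
CREATE UNIQUE INDEX (employee_number) ON personnel.employee;

-- Named non-unique secondary index (NUSI).
CREATE INDEX dept_nusi (department_number) ON personnel.employee;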
 Restrictions on Secondary Indexes
The following restrictions apply to secondary indexes:
 Teradata Database tables can have up to a total of 32 secondary, hash, and join indexes.
 No more than 64 columns can be included in a secondary index definition.
 You can include UDT columns in a secondary index definition.
 You cannot include columns having XML, BLOB, CLOB, BLOB-based UDT, CLOB-based UDT, XML-based UDT, Period, JSON, ARRAY, or VARRAY data types in any secondary index definition.
 You can define a simple NUSI on a geospatial column, but you cannot include a column having a geospatial data type in a composite NUSI definition or in a USI definition.
 You can include row-level security columns in a secondary index definition.
 You cannot include the system-derived PARTITION or PARTITION#Ln columns in any secondary index definition.
 Space Considerations for Secondary Indexes
Creating a secondary index causes the system to build a subtable to contain its index rows, thus
adding another set of rows that requires updating each time a table row is inserted, deleted, or
updated.
Secondary index subtables are also duplicated whenever a table is defined with FALLBACK, so
the maintenance overhead is effectively doubled.
When compression at the data block level is enabled for their primary table, secondary index
subtables are not compressed.
Section 2.6 Intermediate/Temporary Tables
In addition to derived tables, there are two Teradata table structures that can be used as
intermediate or temporary table types: Volatile tables and Global Temporary tables. Volatile
Tables are created in memory and last only for the duration of the session. Neither the definition
nor the contents of a volatile table persist across a system restart. There is no entry in the Data
Dictionary and no transaction logging. Global Temporary Tables also have no transaction
logging. Global Temporary Tables are entered in the data dictionary, but are not materialized for
the session until the table is loaded with data. Space usage is charged to login user temporary
space. Each user session can materialize as many as 2000 global temporary tables at a time.
One should specify a primary index (PI) for a temp table where access or join by the PI is
anticipated. Not specifying a PI will cause a default to a NoPI table or a PI on the first column
of the table regardless of the data demographics. If you do not know in advance what the best PI
candidate will be, then specify a NoPI table to ensure even distribution of the rows across all
AMPs.
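
A brief sketch of both table types (names are illustrative):

-- Volatile table: session-local, no Data Dictionary entry, no transaction logging.
CREATE VOLATILE TABLE vt_daily_totals (
    sale_date DATE,
    total     DECIMAL(18,2)
) PRIMARY INDEX (sale_date)
ON COMMIT PRESERVE ROWS;

-- Global temporary table: the definition is stored in the dictionary;
-- contents are materialized per session in the user's temporary space.
CREATE GLOBAL TEMPORARY TABLE gt_daily_totals (
    sale_date DATE,
    total     DECIMAL(18,2)
) PRIMARY INDEX (sale_date)
ON COMMIT PRESERVE ROWS;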
Section 2.7 Locking
The Teradata Database can lock several different types of resources in several different ways.
Most locks used on Teradata resources are obtained automatically. Users can override some
locks by making certain lock specifications, but cannot downgrade the severity of a lock;
Teradata Database only allows overrides when it can assure data integrity. The data integrity
requirement of a request decides the type of lock that the system uses.
A request for a locked resource by another user is queued until the process using the resource
releases its lock on that resource.
Lock Levels:
Three levels of database locking are provided:
Object Locked    Description
Database         Locks all objects in a database.
Table            Locks all rows in the table or view and any index and fallback subtables.
Row Hash         Locks the primary copy of a row and all rows that share the same hash code within the same table (primary and fallback rows, and secondary index subtable rows).
Levels of Lock Types
Lock Type    Description
Exclusive    The requester has exclusive rights to the locked resource. No other person can read from, write to, or access the locked resource in any way. Exclusive locks are applied only to databases or tables, never to rows.
Write        The requester has exclusive rights to the locked resource except for readers not concerned with data consistency (access lock readers).
Read         Several readers can hold read locks on a resource, during which the system permits no modification of that resource. Read locks ensure consistency during read operations such as those that occur during a SELECT statement.
Access       The requester is willing to accept minor inconsistencies of the data while accessing the database (an approximation is good enough). An access lock permits modifications on the underlying data while the SELECT operation is in progress.
The same information is shown in the following table:
Lock Request    Lock Currently Held
                None       Access     Read       Write      Exclusive
Access          Granted    Granted    Granted    Granted    Queued
Read            Granted    Granted    Granted    Queued     Queued
Write           Granted    Granted    Queued     Queued     Queued
Exclusive       Granted    Queued     Queued     Queued     Queued
Automatic Database Lock Levels
The Teradata Database applies most of its locks automatically. The following table illustrates
how the Teradata Database applies different locks for various types of SQL statements.
Type of SQL Statement                                Locking Level by Access Type              Locking Mode
                                                     UPI/NUPI/USI      NUSI/Full Table Scan
SELECT                                               Row Hash          Table                   Read
UPDATE                                               Row Hash          Table                   Write
DELETE                                               Row Hash          Table                   Write
INSERT                                               Row Hash          Not Applicable          Write
CREATE DATABASE / DROP DATABASE / MODIFY DATABASE    Not Applicable    Database                Exclusive
CREATE TABLE / DROP TABLE / ALTER TABLE              Not Applicable    Table                   Exclusive
It is generally recommended that queries that do not require READ lock access use the SQL
clause "LOCKING ROW FOR ACCESS." This will allow non-blocked access to tables that are
being updated.
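
For example (table and column names are illustrative):

-- Request an ACCESS lock so the query is not blocked by concurrent writes;
-- the trade-off is that it may read rows from in-flight updates.
LOCKING ROW FOR ACCESS
SELECT order_id, status
FROM orders
WHERE order_id = 1001;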
In addition to the information listed here, refer to the orange book, “Understanding Oracle and
Teradata Transactions and Isolation Levels for Oracle Migrations” for a further understanding of
the Teradata database differences.
Section 2.8 Statistics
Good primary index selection and timely and appropriate statistics collection may very well be
the two most important factors in obtaining Teradata Database performance.
Over the last two decades, Teradata software releases have consistently provided improvements
and enhancements in the way statistics are collected, and then utilized by the cost-based Teradata
Optimizer. The Optimizer doesn’t perform a detailed evaluation of every possible query plan
(multiple joins could produce billions of possibilities). Instead, it uses sophisticated algorithms to
identify and select the most promising candidates for detailed evaluation, then picks what it
perceives as the best plan among those. The essential task of the optimizer is to produce the
optimal execution plan (the one with the lowest cost) from many possible plans. The basis on
which different plans are compared with each other is the cost which is derived from the
estimation of cardinalities of the temporary or intermediate relations, after an operation such as
selections, joins and projections. The estimations in Teradata are derived primarily from statistics
and random AMP samples. Accurate estimations are crucial to get optimal plans.
Providing statistical information for performance optimization is critical to optimal query plans,
but collecting statistics can prove difficult due to the demands of time and system resources.
Without full or all-AMP sampled statistics, query optimization must rely on extrapolation and
dynamic AMP sample estimates of table cardinality, which does not collect all of the statistics
that a COLLECT STATISTICS request does.
Besides estimated cardinalities, dynamic AMP samples also collect a few other statistics, but far
fewer than are collected by a COLLECT STATISTICS request.
Statistics and demographics provide the Optimizer with information it uses to reformulate
queries in ways that permit it to produce the least costly access and join plans. The critical issues
you must evaluate when deciding whether to collect statistics are not whether query optimization
can or cannot occur in the face of inaccurate statistics, but the following pair of probing
questions.
• How accurate must the available statistics be in order to generate the best possible query plan?
• How poor a query plan are you willing to accept?
Different strategies can be used to attain the right balance between the need for statistics and the
demands of time and resources.
The main strategies for collecting statistics are: Random AMP Sampling and Full Sampling.
2.8.1 Random AMP Sampling
The optimizer builds an execution plan for each SQL statement that enters the parsing engine.
When no statistics have been collected, the system default is for the optimizer to make a rough
estimate of a table’s demographics by using dynamic samples from one or more AMPs (one
being the default). These samples are collected automatically each time the table header is
accessed from disk and are embedded in the table header when it is placed in the dictionary
cache.
By default, the optimizer performs single-AMP sampling to produce random AMP sample
demographics, with some exceptions (volatile tables, sparse single-table join indexes, and aggregate join
indexes). By changing an internal field in the dbscontrol record called RandomAMPSampling, it
can be requested that sampling be performed on 2 AMPs, 5 AMPs, all AMPs on a node, or all
AMPs in the system.
When using these options, random sampling uses the same techniques as single-AMP random
AMP sampling, but more AMPs participate. Touching more AMPs may improve the quality of
the statistical information available during plan generation, particularly if rows are not evenly
distributed.
In Teradata Database 12.0 and higher releases, all-AMP sampling was enhanced to use a more
efficient technique (the "last done channel" mechanism), which considerably reduces the
messaging overhead. This technique is used when all-AMP sampling is enabled in the dbscontrol record or cost
profile but the dbscontrol internal flag RowsSampling5 is set to 0 (which is the default). If set to
greater than 0, this flag causes the sampling logic to read the specified percentage of rows to
determine the number of distinct values for the primary index.
Random AMP Sample Characteristics
• Estimates fewer statistics than COLLECT STATISTICS does.
Statistics estimated include the following for all columns.
• Cardinalities
• Average rows per value
For indexes only, the following additional statistics are estimated.
• Average rows per index
• Average size of the index per AMP
• Number of distinct values
• Extremely fast collection time, so the overhead is essentially undetectable.
• Stored in the file system data block descriptor for the table, not in interval histograms in the
Data Dictionary.
• Occurs automatically. Cannot be invoked by user.
• Automatically refreshed when batch table
INSERT … DELETE operations exceed a threshold of 10% of table cardinality.
Cardinality is not refreshed by individual INSERT or DELETE requests even if the sum
of their updates exceeds the 10% threshold.
• Cached with the data block descriptor.
• Not used for non-indexed selection criteria or indexed selection with non-equality conditions.
Best Use
• Good for cardinality estimates when there is little or no skew and the table has significantly
more rows than the number of AMPs in the system.
• Collects reliable statistics for NUSI columns when there is limited skew and the table has
significantly more rows than the number of AMPs in the system.
• Useful as a temporary fallback measure for columns and indexes on which you have not yet
decided whether to collect statistics or not. Dynamic AMP sampling provides a reasonable
fallback mechanism for supporting the optimization of newly devised ad hoc queries until you
understand where collected statistics are needed to support query plans for them. Teradata
Database stores cardinality estimates from dynamic AMP samples in the interval histogram for
estimating table growth even when complete, fresh statistics are available.
Pros and cons of Random AMP sampling:

Pros:
 Provides row count information of all indexes including the Primary Index.
 The row count of the Primary Index is the total table rows.
 The row count of a NUSI subtable is the number of distinct values of the NUSI columns.
 The estimated number of distinct values is used for single-table equality predicates, join cardinality, aggregate estimations, costing, etc.
 Can potentially eliminate the need to collect statistics on the indexes.
 Up-to-date information (usually the freshest available).
 This operation is performed automatically.

Cons:
 Works only with indexed columns.
 The single-AMP sampling may not be good enough for small tables and tables with non-uniform distribution on the primary index.
 Does not provide the following information: number of nulls, skew information, or value range.
 For NUSIs, the estimated number of distinct values on a single AMP is assumed to be the total distinct values. This is true for highly non-unique columns but can cause distinct value underestimation for fairly unique columns. On the other hand, it can cause overestimation for highly non-unique columns because of rowid spill over.
 Cannot estimate the number of distinct values for non-unique primary indexes.
 Single-table estimations can use this information only for equality conditions, assuming uniform distribution.
It is strongly recommended to contact Teradata Global Support Center (GSC) to assess the
impact of enabling all-AMP sampling on your configuration and to help change the
internal dbscontrol settings.
2.8.2 Full Statistics Collection
Generically defined, a histogram is a count of the number of occurrences, or cardinality, of a
particular category of data that fall into defined disjunct value range categories. These categories
are typically referred to as bins or buckets.
Issuing a COLLECT STATISTICS statement is the most complete method of gathering
demographic information about a column or an index. Teradata Database uses equal-height,
high-biased, and history interval histograms (a representation of a frequency distribution)
to represent the cardinalities and other statistical values and demographics of columns and
indexes for all-AMPs sampled statistics and for full-table statistics.
The greater the number of intervals in a histogram, the more accurately it can describe the
distribution of data by characterizing a smaller percentage of its composition per each interval.
Each interval histogram in the system is composed of a number of intervals (the default is 250
and the maximum is 500). A histogram with 500 intervals permits each interval to
characterize roughly 0.2% of the data.
Because these statistics are kept in a persistent state, it is up to the administrator to keep collected
statistics fresh. It is common for many Teradata Warehouse sites to re-collect statistics on the
majority of their tables weekly, and on particularly volatile tables daily, if deemed necessary.
Full statistics Characteristics
• Collects all statistics for the data.
• Time consuming.
• Most accurate of the three methods of collecting statistics.
• Stored in interval histograms in the Data Dictionary.
Best Use
• Best choice for columns or indexes with highly skewed data values.
• Recommended for tables with fewer than 1,000 rows per AMP.
• Recommended for selection columns having a moderate to low number of distinct values.
• Recommended for most NUSIs, PARTITION columns, and other selection columns because
collection time on NUSIs is very fast.
• Recommended for all column sets or indexes where full statistics add value, and where
sampling does not provide satisfactory statistical estimates.
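
For example (object names are illustrative), full statistics can be collected on a column, the system-derived PARTITION column, or a multicolumn set:

COLLECT STATISTICS COLUMN (customer_id) ON sales.orders;
COLLECT STATISTICS COLUMN (PARTITION) ON sales.orders;
-- Multicolumn statistics on a column pair.
COLLECT STATISTICS COLUMN (store_id, sale_date) ON sales.orders;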
2.8.3 Collection with the USING SAMPLE option
Collecting full statistics involves scanning the base table and performing a sort, sometimes a sort
on a large volume of data, to compute the number of occurrences for each distinct value. The
time and resources required to adequately collect statistics and keep them fresh can be
problematic, particularly with large data volumes.
Collecting statistics on a sample of the data reduces the resources required and the time to
perform statistics collection. However, the USING SAMPLE alternative was certainly not
designed to replace full statistics collection. It requires some careful analysis and planning to
determine under which conditions it will add benefit.
The quality of the statistics collected with full-table sampling is not guaranteed to be as good as
the quality of statistics collected on an entire table without sampling. Do not think of sampled
statistics as an alternative to collecting full-table statistics, but as an alternative to never, or rarely,
collecting statistics.
When you use sampled statistics rather than full-table statistics, you are trading time in exchange
for what are likely to be less accurate statistics. The underlying premise for using sampled
statistics is usually that sampled statistics are better than no statistics.
Do not confuse statistical sampling with the dynamic AMP samples (system default) that the
Optimizer collects when it has no statistics on which to base a query plan. Statistical samples
taken across all AMPs are likely to be much more accurate than dynamic AMP samples.
Sampled statistics are different from dynamic AMP samples in that you specify the percentage of
rows you want to sample explicitly in a COLLECT STATISTICS (Optimizer Form) request to
collect sampled statistics, while the number of AMPs from which dynamic AMP samples are
collected and the time when those samples are collected is determined by Teradata Database, not
by user choice. Sampled statistics produce a full set of collected statistics, while dynamic AMP
samples collect only a subset of the statistics that are stored in interval histograms.
Sampled Statistics Characteristics
• Collects all statistics for the data, but not by accessing all rows in the table.
• Significantly faster collection time than full statistics.
• Stored in interval histograms in the Data Dictionary.
Best Use
• Acceptable for columns or indexes that are highly singular; meaning that their number of
distinct values approaches the cardinality of the table.
• Recommended for unique columns, unique indexes, and for columns or indexes that are
highly singular. Experience suggests that sampled statistics are useful for very large
tables; meaning tables with tens of billions of rows.
• Not recommended for tables whose cardinality is less than 20 times the number of AMPs in the
system.
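
A sketch of a sampled collection using the Teradata 14.0 statement-level syntax described later in this section (object names are illustrative; on earlier releases the sampling percentage is controlled at the system level):

COLLECT STATISTICS
    USING SAMPLE 10 PERCENT
    COLUMN (order_key)
ON sales.orders;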
2.8.4 Collect Statistics Summary Option
New in Teradata 14.0, table-level statistics known as “summary statistics” are collected
alongside of the column or index statistics you request. Summary statistics do not cause their
own histogram to be built, but rather they create a short listing of facts about the table
undergoing collection that are held in the new DBC.StatsTbl. It is a very fast
operation. Summary stats report on things such as the table’s row count, average block size, and
some metrics around block level compression and (in the future) temperature. An example of
actual execution times is shown below, comparing regular column statistics
collection against summary statistics collection for the same large table. Time is reported in
MM:SS:
COLLECT STATISTICS ON Items COLUMN I_ProductID;
Elapsed time (mm:ss): 9:55
COLLECT SUMMARY STATISTICS ON Items;
Elapsed time (mm:ss): 00:01
You can request summary statistics for a table, but even if you never do that, each individual
statistics collection statement causes summary stats to be gathered. For this reason, it is
recommended that you group your statistics collections against the same table into one statement,
in order to avoid even the small overhead involved in building summary stats repeatedly for the
same table within the same script.
There are several benefits in having summary statistics. One critical advantage is that the
optimizer now uses summary stats to get the most up-to-date row count from the table in order
to provide more accurate extrapolations. It no longer needs to depend on primary index or
PARTITION stats, as was the case in earlier releases, to perform good extrapolations when it
finds statistics on a table to be stale.
Here’s an example of what the most recent summary statistic for the Items table looks like:
SHOW SUMMARY STATISTICS VALUES ON Items;

COLLECT SUMMARY STATISTICS
    ON CAB.Items
VALUES
(
 /** TableLevelSummary **/
 /* Version           */ 5,
 /* NumOfRecords      */ 50,
 /* Reserved1         */ 0.000000,
 /* Reserved2         */ 0.000000,
 /* SummaryRecord[1]  */
 /* Temperature       */ 0,
 /* TimeStamp         */ TIMESTAMP '2011-12-29 13:30:46',
 /* NumOfAMPs         */ 160,
 /* OneAMPSampleEst   */ 5761783680,
 /* AllAMPSampleEst   */ 5759927040,
 /* RowCount          */ 5759985050,
 /* DelRowCount       */ 0,
 /* PhyRowCount       */ 5759927040,
 /* AvgRowsPerBlock   */ 81921.871617,
 /* AvgBlockSize      */ 65024.000000,
 /* BLCPctCompressed  */ 0.00,
 /* BLCBlkUcpuCost    */ 0.000000,
 /* BLCBlkURatio      */ 0.000000,
 /* RowSizeSampleEst  */ 148.000000,
 /* Reserved2         */ 0.000000,
 /* Reserved3         */ 0.000000,
 /* Reserved4         */ 0.000000
);
2.8.5 Summary: Teradata Statistics Collection
The decision between full-table and all-AMPs sampled statistics seems to be a simple one:
always collect full-table statistics, because they provide the best opportunity for producing
optimal query plans.
While the above statement may be true, the decision is not so easily made in a production
environment. Other factors must be taken into consideration, including the length of time
required to collect the statistics and the resource consumption the collection of full-table
statistics incurs while running other workloads on the system.
To resolve this, the benefits and drawbacks of each method must be considered. An excellent
table comparing the three methods (Full Statistics, Sampled Statistics, Dynamic
AMP Samples) is provided in Chapter 2 of the SQL Request and Transaction Processing Release
14.0 manual, under the heading Relative Benefits of Collecting Full-Table and Sampled Statistics.
2.8.6 New opportunities for statistics collection in Teradata 14.0[1]
Teradata 14.0 offers some very helpful enhancements to the statistics collection process. This
section discusses a few of the key ones, with an explanation of how these enhancements can be
used to streamline your statistics collection process and help your statistics be more effective.
For more detail on these and other statistics collection enhancements, please read the orange
book titled Teradata 14.0 Statistics Enhancements, authored by Rama Korlapati, Teradata Labs.
New USING options add greater flexibility
In Teradata 14.0 you may optionally specify a USING clause within the collect statistics
statement. As an example, here are the 3 new USING options that are available in 14.0 with
parameters you might use:
. . . USING MAXINTERVALS 300
. . . USING MAXVALUELENGTH 50
. . . USING SAMPLE 10 PERCENT
MAXINTERVALS allows you to increase or decrease the number of intervals one statistic at a
time in the new version 5 statistics histogram. The default maximum number of intervals is
250. The valid range is 0 to 500. A larger number of intervals can be useful if you have
widespread skew on a column or index you are collecting statistics on, and you want
more individual high-row-count values to be represented in the histogram. Each statistics
interval highlights its single most popular value, which it designates as its “mode value,” and lists
the number of rows that carry that value. By increasing the number of intervals, you will be
providing the optimizer an accurate row count for a greater number of popular values.
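For illustration, a hedged sketch of how MAXINTERVALS might be applied to a heavily skewed column; the table and column names are illustrative, following the pattern of the examples in this section:
COLLECT STATISTICS
USING MAXINTERVALS 500
COLUMN ( O_CustomerID )
ON CAB.Orders;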
MAXVALUELENGTH lets you expand the length of the values contained in the histogram for
that statistic. The new default length is 25 bytes, when previously it was 16. If needed, you can
specify well over 1000 bytes for a maximum value length. No padding is done to the values in
the histogram, so only values that actually need that length will incur the space (which is why the
parameter is named MAXVALUELENGTH instead of VALUELENGTH). The 16-byte limit
on value sizes in earlier releases was always padded to full size. Even if your statistics value was
only one character, the full 16 bytes were used to represent it.
Another improvement around value lengths stored in the histogram has to do with multicolumn
statistics. In earlier releases the 16 byte limit for values in the intervals was taken from the
beginning of the combined value string. In 14.0 each column within the statistic will be able to
represent its first 25 bytes in the histogram as the default, so no column will go without
representation in a multicolumn statistics histogram.
SAMPLE n PERCENT allows you to specify sampling at the individual statistics collection
level, rather than at the system level. This allows you to easily apply different levels of statistics
sampling to different columns and indexes.
Here's an example of how this USING syntax might look:
COLLECT STATISTICS
USING MAXVALUELENGTH 50
COLUMN ( P_NAME )
ON CAB.product;
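In the same way, different sampling levels can be applied to different statistics on the same table. A hedged sketch, again with illustrative table and column names:
COLLECT STATISTICS
USING SAMPLE 10 PERCENT
COLUMN ( O_OrderID )
ON CAB.Orders;

COLLECT STATISTICS
USING SAMPLE 2 PERCENT
COLUMN ( O_CustomerID )
ON CAB.Orders;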
Combining multiple collections in one statement
Statistic collection statements for the same table that share the same USING options, and that
request full statistics (as opposed to sampled), can now be grouped syntactically. In fact, once
you are on 14.0, it is recommended that you collect all such statistics on a table as one
group. The optimizer will then look for opportunities to overlap the collections, wherever
possible, reducing the time to perform the statistics collection and the resources it uses.
Here is an example:
The old way:
COLLECT STATISTICS COLUMN
(o_orderdatetime,o_orderID)
ON Orders;
COLLECT STATISTICS COLUMN
(o_orderdatetime)
ON Orders;
COLLECT STATISTICS COLUMN
(o_orderID)
ON Orders;
The new, recommended way:
COLLECT STATISTICS
COLUMN (o_orderdatetime,o_orderID)
, COLUMN (o_orderdatetime)
, COLUMN (o_orderID)
ON Orders;
This is particularly useful when the same column appears in single and also multicolumn
statistics, as in the example above. In those cases the optimizer will perform the most inclusive
collection first (o_orderdatetime,o_orderID), and then re-use the spool built for that step to
derive the statistics for the other two columns. Only a single table scan is required, instead of 3
table scans using the old approach.
Sometimes the optimizer will choose to perform separate collections (scans of the table) the first
time it sees a set of bundled statistics. But based on demographics it has available from the first
collection, it may come to understand that it can group future collections and use pre-aggregation
and rollup enhancements to satisfy them all in one scan.
Remember, though, to re-code your statistics collection statements when you get on 14.0
in order to realize these savings.
Automated statistics management (Teradata release 14.10 and above)
Description
• Automates and provides intelligence for DBA tasks related to Optimizer statistics
collection, where such tasks include:
• Identify and collect missing statistics needed for query optimization.
• Detect stale statistics and promptly refresh them.
• Identify and remove unused statistics from routine maintenance.
• Prioritize the list of pending collections such that important and stale statistics are given
precedence.
• Execute needed collections in the background during scheduled time periods.
• NEW Statistics Management Viewpoint portlet.
Benefits
• Automation of statistics collection/re-collection improves query and system performance.
• Automation of tasks greatly reduces the burden of statistics management on the DBA.
Note: With Teradata Software Release 15.0 and above, the Teradata Statistics Wizard is no
longer supported.
2.8.7 Recommended Reading
The subject of Teradata database statistics is far too complex and detailed to be summarily
defined or exhausted in this guidebook. There are many new statistics collection options with
Teradata Release 14.0, and also improvements to existing options. For example, one of the new
options in 14.0 is called SUMMARY. This is used to collect only the table-level statistical
information such as row count, average block size, average row size, etc. without the histogram
detail. This option can be used to provide up-to-date summary information to the optimizer in a
quick and efficient way. When the SUMMARY option is specified in a COLLECT STATISTICS statement,
no column or index specification is allowed. The following resources are recommended reading
to further your knowledge of statistics as they pertain to the Teradata Database.
SQL Request and Transaction Processing Release 14.0 manual. Excellent, technically detailed
information on different statistics collection strategies is provided in Chapter 2, along with
good explanations of how the optimizer uses statistics.
The following Teradata Orange Books:
Optimizer Cardinality Estimation Improvements Teradata Database 12.0 by Rama Korlapati
Teradata 14.0 Statistics Enhancements by Rama Korlapati
Statistics Extrapolations by Rama Korlapati
Collecting Statistics by Carrie Ballinger (written for Teradata release V2R6.2, but still a valuable
resource)
Anything written by Carrie Ballinger on the subject of Teradata statistics. Check out her
contributions to the Teradata Developers Exchange at: http://developer.teradata.com/
Including - New opportunities for statistics collection in Teradata 14.0 on Carrie’s Blog.
http://developer.teradata.com/blog/carrie/2012/08/new-opportunities-for-statistics-collection-in-teradata-14-0
Also on http://developer.teradata.com/
When is the right time to refresh statistics? - Part I (and Part II) by Marcio Moura
http://developer.teradata.com/blog/mtmoura/2009/12/when-is-the-right-time-to-refresh-statistics-part-i
Others
http://developer.teradata.com/tools/articles/easy-statistics-recommendations-statistics-wizard-feature
http://developer.teradata.com/database/articles/statistics-collection-recommendations-for-teradata-12
http://developer.teradata.com/blog/carrie/2012/04/teradata-13-10-statistics-collection-recommendations
Optimizer article by Alan Greenspan:
http://www.teradatamagazine.com/Article.aspx?id=12639
Section 2.9 Stored Procedures
Stored procedures are available in Teradata and allow for procedural manipulation of set or table
data. Advantages to using stored procedures include:
1) One set of code can be used many times by many users/clients
2) Stored procedures are stored as compiled object code, eliminating the need to
process raw SQL and SPL (Stored Procedure Language) for each request
3) Enforcement of business rules and standards
Stored procedures can be internal (SQL and/or SPL) or external (C, C++, Java in Teradata 12
and beyond) and are considered database objects. Internal and protected external stored
procedures are run by Parsing Engines (in other words, governed internally by Teradata).
Internal stored procedures are written in SQL and SPL whereas external stored procedures
cannot execute SQL Statements. External Stored procedures, however, can execute other stored
procedures providing an indirect method of executing SQL statements.
External stored procedures can also execute as a separate process/thread (outside of a Teradata
Parsing Engine), or as a function depending on the protection mode used when the stored
procedure was created. Protected mode runs the procedure in a separate server process that is
isolated from the database, while unprotected mode runs the procedure directly within the
database process. The tradeoff is that protected mode ensures memory and other resources do not
conflict with Teradata, but it can negatively affect performance. Running in unprotected mode
can provide better performance, but there is
risk of a potential resource conflict (memory usage/fault, using processing resources that would
be used by Teradata). If you are attempting to run a stored procedure and it’s very slow, one of
the first items to check is the protection mode that was selected when the procedure was created.
Example (Creating/Replacing a Stored Procedure)
REPLACE PROCEDURE sp_db.test_sp
(
IN in_parm INTEGER,
OUT return_parm CHAR(4)
)
BEGIN
SELECT return_code
INTO return_parm FROM
sp_db.table1
WHERE table1.key_column = in_parm ;
END;
REPLACE PROCEDURE creates sp_db.test_sp if it does not already exist and replaces it if it does;
CREATE PROCEDURE can be used instead, but it fails if the procedure already exists.
Note: If creating stored procedures using SQL Assistant, make sure that ‘Allow the Use of
ODBC SQL Extensions’ is checked (Menu – Tools/Options, Query tab). SQL Assistant will not
recognize the CREATE/REPLACE commands if this option is not checked.
Example (Altering a Stored Procedure)
ALTER PROCEDURE sp_db.test_sp LANGUAGE C EXECUTE NOT PROTECTED;
This modifies an existing external stored procedure (written in C) to run unprotected.
Example (Calling a Stored Procedure)
Stored procedures can be invoked with a CALL statement as part of a macro.
CREATE MACRO test_macro (returned_value CHAR(4)) AS
( CALL sp_db.test_sp(15467, :returned_value); );
Starting with Teradata V12, stored procedures can return result sets. Prior to V12, result sets had
to be stored in a table (permanent or temporary) for access outside the stored procedure.
Error handling is also built into Teradata stored procedures through messaging facilities (Signal,
Resignal) and a host of available standard diagnostic variables. External stored procedures also
can use a debug trace facility that provides a means to store tracing information in a global
temporary trace table.
It is important to find a balance when using stored procedures, especially when porting existing
stored procedures from another database. Teradata’s strength lies in its ability to process large
sets of data rather quickly. Using row-at-a-time processing, such as cursors, can cause slower
performance. In Teradata 13, recursive queries are allowed in stored procedures, enabling a
set-based approach to many of the problems that cursors have traditionally been used to solve.
Elements of XSP Body (with example)
#define SQL_TEXT Latin_Text
#include <sqltypes_td.h>
#include <string.h>
void xsp_getregion( VARCHAR_LATIN *region, char sqlstate[6])
{
char tmp_string[64];
if (strlen((const char *)region) > 4)
{
/* Strip off the first four characters */
strcpy(tmp_string, (char *)region);
strcpy((char *)region, &tmp_string[4]);
}
}
SP Commands
• SHOW PROCEDURE – Display procedure statements and comments (see the examples below).
• HELP PROCEDURE – Show procedure parameter names and types (see the examples below).
• ALTER PROCEDURE – Change attributes such as protected mode / storing of SPL; COMPILE / RECOMPILE stored procedures.
• EXECUTE PROTECTED / EXECUTE NOT PROTECTED – Provides an execution option for fault isolation and performance.
• ATTRIBUTES clause – Display the transaction mode and platform the SP was created on.
• DROP PROCEDURE – Removes unwanted SPs. For an XSP, it is removed from the available SP library with a relink of the library.
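For example, using the procedure created earlier in this section (the output will vary with the procedure definition):
SHOW PROCEDURE sp_db.test_sp;
HELP PROCEDURE sp_db.test_sp;
HELP PROCEDURE sp_db.test_sp ATTRIBUTES;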
Handling SPs during migration
• Prior to Migration
> The Pre_upgrade_prep.pl script identifies SPs that will not recompile automatically.
• During Migration
> Qualified SPs and XSPs are recompiled automatically
> SPs with NO SPL and those that fail recompile are identified
– Will have to be manually recreated/recompiled respectively
• SPs must be recompiled
> on new Major releases of TD
> after cross-platform migration
Section 2.10 User Defined Functions (UDF)
User Defined Functions are programs or routines written in C/C++ or Java (Teradata V13 and later)
that allow users to add extensions to the Teradata SQL language. They are classified by their input
and output parameter types:
Input Parameter Type    Output Parameter Type: Scalar       Output Parameter Type: Set
Scalar                  User Defined Scalar Functions       User Defined Table Functions
Set                     User Defined Aggregate Functions    User Defined Table Operator
Scalar functions accept inputs from an argument list and return a single result value. Some
examples of built-in Teradata scalar functions include: SUBSTR, ABS, and SQRT.
Scalar UDFs can also be written using SQL constructs. These are called SQL UDFs and they are
very limited. SQL commands cannot be issued in "SQL UDFs", so they are basically limited to
single statements using SQL functions. Two main advantages: they can simplify SQL DML
statements that use the function call instead of long convoluted logic, and they run faster than C
language UDFs.
Example:
REPLACE FUNCTION Power( M Float, E Float )
RETURNS FLOAT
LANGUAGE SQL
CONTAINS SQL
RETURNS NULL ON NULL INPUT
DETERMINISTIC
SQL SECURITY DEFINER
COLLATION INVOKER
INLINE TYPE 1
RETURN
CASE M WHEN 0 then 0 ELSE EXP ( LN ( M ) * E ) END
;
SELECT POWER(cast(2 as decimal(17,0))-cast(1 as decimal(17,0)),2);
SELECT POWER(cast(1234567890098765432 as bigint)-1,2);
These are not the equivalent of PL/SQL Functions because they can't really do SQL. For ideas
on translating PL/SQL Functions to Teradata see
http://developer.teradata.com/blog/georgecoleman/2014/01/ordered-analytical-functions-translating-sql-functions-to-set-sql
Aggregate functions are similar to scalar functions except that they work on sets of data (created
by GROUP BY clauses in a SQL statement), processing one row at a time and returning a single
result per group. SUM, MIN, MAX, and AVG are examples of Teradata built-in aggregate functions.
Table functions return a table one row at a time and, unlike scalar and aggregate UDF’s, they
cannot be called in the places that system functions are called. Table functions require a
different syntax and are invoked similarly to a derived table:
INSERT INTO Sales_Table SELECT S.Store, S.Item, S.Quantity
FROM TABLE (Sales_Retrieve(9005)) As S;
The table function, Sales_Retrieve is passed the parameter of 9005. The results of
Sales_Retrieve will be packaged to match the 3 columns in the SELECT clause.
UDF’s can be an optimal solution for:
1. Additional SQL functions to enhance existing Teradata supplied functions. For example,
certain string manipulations may be common for a given application. It may make sense
to create UDF’s for those string manipulations; the rule is created once and can be used
by many. UDF’s can make porting from different databases easier (i.e. the Oracle
DECODE function) by coding a UDF to match the other database function. Recreating
the function (one code change) reduces the amount of SQL rewrite for an application that
may use the function many times (many code changes).
2. Complex algorithms for data mining, forecasting, business rules, and encryption
3. Analysis of non-traditional data types (i.e. image, text, audio, etc.)
4. XML string processing
5. ELT (Extract, Load, Transform) data validation
UDF’s are invoked by qualifying the database name where they are stored, e.g.
DBName.UDFname(), or, if stored in the special database called SYSLIB, without database name
qualification, e.g. UDFname().
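For example, assuming a scalar UDF named MyTrim (an illustrative name) exists in database UDFDB and another copy in SYSLIB:
SELECT UDFDB.MyTrim(C_Name) FROM CAB.Customer;   -- UDF stored in a user database
SELECT MyTrim(C_Name) FROM CAB.Customer;         -- UDF stored in SYSLIB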
UDF’s versus Stored Procedures
• UDF’s are invoked in a SQL DML statement whereas Stored Procedures must be invoked with an explicit CALL statement.
• UDF’s cannot modify data with INSERT, UPDATE, or DELETE statements and can only work with data that is passed in as parameters.
• UDF’s are written in C/C++ (Java in Teradata V13); Internal Stored Procedures are written in SPL (Stored Procedure Language). External Stored Procedures are similar to UDF’s in that they are written in C/C++ and Java. However, a UDF cannot CALL a stored procedure, whereas a stored procedure can invoke a UDF as part of a DML statement.
• UDF’s run on the AMPs while Stored Procedures run under control of the Parsing Engines (PEs). Starting with Teradata V2R6.0, some UDF’s can run as part of the PEs.
• UDF’s can only return a single value (except for Table UDF’s).
• Stored procedures can handle multiple SQL exceptions whereas UDF’s can only catch and pass one value to the caller.
Protected versus Unprotected Mode
UDF’s can run in either protected or unprotected mode. When a UDF is first created, it is in
protected mode. An ALTER statement is used to switch the UDF from protected to unprotected
mode. In protected mode, the UDF is run by an AMP UDF Server Task (by default there are two
per AMP). Running under the Server Task creates overhead and can result in slower execution
times. To increase performance, the UDF can be run in unprotected mode which means it is run
directly by the AMP itself. Unprotected mode should be used only when the UDF has been fully
tested and deemed fail-safe. If the function fails in unprotected mode, a database restart is
possible since the AMP is no longer insulated from the function via the Server Task process.
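For example, after a UDF (function name illustrative) has been fully tested, it can be switched to unprotected mode, and back again if needed:
ALTER FUNCTION UDFDB.MyTrim EXECUTE NOT PROTECTED;
ALTER FUNCTION UDFDB.MyTrim EXECUTE PROTECTED;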
Developing and distributing UDF’s: UDF’s support the concept of packages which allow
developers to create function suites or libraries (i.e. .DLL’s in Windows and .SO’s on UNIX)
that are easily deployable across systems.
Identifying Overloaded Functions
Teradata has the concept of the “FunctionName” and “SpecificName” in which the specific
name is qualified by the parameters. You can see the detail with:
select functionname (format 'x(20)'),
specificname (format 'x(20)'),
numparameters,
parameterdatatypes
from dbc.functions
where databasename='xxx'
and functionname like 'FL%'
order by 1;
Standard Teradata Functions can be found in database: TD_SYSFNLIB
The following is a query which can be used to identify functions and a date of last modification.
SELECT
dbase.databasename (Format 'x(20)') (Title 'DB Name')
, UDFInfo.FunctionName (Format 'x(20)') (Title 'Function')
,case
when UDFInfo.FunctionType='A' then 'Aggregate'
when UDFInfo.FunctionType='B' then 'Aggr Ordered Anal'
when UDFInfo.FunctionType='C' then 'Contract Func'
when UDFInfo.FunctionType='E' then 'Ext Stored Proc'
when UDFInfo.FunctionType='F' then 'Scalar'
when UDFInfo.FunctionType='H' then 'Method'
when UDFInfo.FunctionType='I' then 'Internal'
when UDFInfo.FunctionType='L' then 'Table Op'
when UDFInfo.FunctionType='R' then 'Table Function'
when UDFInfo.FunctionType='S' then 'Ordered Anal'
else 'Unknown'
end (varchar(17), Title 'Function//Type')
, CAST(TVM.LastAlterTimeStamp AS DATE) (FORMAT 'MMM-DD',Title 'Altered')
FROM DBC.UDFInfo, DBC.DBase, DBC.TVM
WHERE DBC.UDFInfo.DatabaseId = DBC.DBase.DatabaseId
AND DBC.UDFInfo.FunctionId = DBC.TVM.TVMId
ORDER BY 1,2,3,4;
Archive/Restore/Copy/Migrating Considerations
The following list BAR related operations and describes how they act on UDFs. It is very
important to understand that only database level operations will act on UDF. There is no way
of selectively archive or restore a UDF (by the way, the same is true for stored procedures):
Dictionary Archive
ALL AMP Archive
Cluster Archive
Dictionary Restore/Copy
ALL AMP Restore/Copy
Cluster Restore/Copy
UDF Dictionary information (summary) is
saved but not UDF source
UDF Dictionary information and UDF source
code are archived
UDF Source code is archived
UDF Dictionary Information is restored (no
UDF source code)
UDF Dictionary and source code are
restored/copied
UDF Source code restored
Section 2.11 Table Operators
A Teradata release 14.10 table operator is a type of function that can accept an arbitrary row
format and based on operation and input row type can produce an arbitrary output row format.
The name is derived from the concept of a database physical operator. A physical operator takes
as input one or more data streams and produces an output data stream. Examples of physical
operators are join, aggregate, scan, etc. The notion is that, from the perspective of the function
implementer (the programmer), a new physical operator can be implemented that has complete
control of the Teradata parallel "step" processing. From the perspective of the function user (the
SQL writer), it is very analogous to the concept of a "from clause" table function.
Differences Between Table Functions and Table Operators
• The inputs and outputs for table operators are a set of rows (a table) and not columns. The
default format of a row is IndicData.
• In a table function, the row iterator is outside of the function and the iterator calls the function
for each input row. In the table operators, the operator writer is responsible for iterating over the
input and producing the output rows for further consumption. The table operator itself is called
only once. This reduces per row costs and provides more flexible read/write patterns.
System Defined Table Operators
A table operator can be system defined or user defined. Teradata release 14.10 introduces three
new system defined table operators:
• TD_UNPIVOT, which transforms columns into rows based on the syntax of the unpivot expression (a sketch follows this list).
• CalcMatrix, which calculates a Correlation, Covariance, or Sums of Squares and Cross Products matrix.
• LOAD_FROM_HCATALOG, which is used for accessing the Hadoop file system.
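As a hedged sketch of the first of these, the following TD_UNPIVOT call turns three monthly sales columns into rows. The table and column names are illustrative, and the full option list is documented in the SQL Functions manual:
SELECT store_id, sales_month, sales_amt
FROM TD_UNPIVOT (
ON ( SELECT store_id, jan_sales, feb_sales, mar_sales FROM CAB.Monthly_Sales )
USING
VALUE_COLUMNS ('sales_amt')
UNPIVOT_COLUMN ('sales_month')
COLUMN_LIST ('jan_sales', 'feb_sales', 'mar_sales')
) AS dt;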
Use Options
The table operator is always executed on the AMP within a return step (stpret in DBQL). This
implies that it can read from spool, base table, PPI partition, index structure, etc. and will always
write its output to spool. Some concepts related to operator execution:
If a HASH BY and/or a LOCAL ORDER BY is specified, the input data will always be spooled to
enforce the HASH BY geography and the LOCAL ORDER BY ordering within the AMP.
HASH BY can be used to assign rows to an AMP and LOCAL ORDER BY can be used to order
the rows within an AMP. You can specify either or both of the clauses independently.
If a PARTITION BY and ORDER BY is specified, the input data will always be spooled to
enforce the PARTITION BY grouping and ORDER BY ordering within the partition. You
can specify a PARTITION BY without an ORDER BY, but you cannot have an ORDER BY
without a PARTITION BY. Further, the table operator will be called once for each partition, and
the row iterator will only be for the rows within the partition. In summary, a PARTITION is a
logical concept and one or more partitions may be assigned to an AMP, the same behavior as
ordered analytic partitions.
The USING clause values are modeled as key-value pairs. You can define multiple key-value pairs,
and a single key can have multiple values. The USING clause takes literal values and cannot contain
any expressions, DML, etc. Further, the values are handled by the syntaxer in a similar manner to
regular SQL literals. For example {1 , 1.0 ,'1'} are respectively passed to the table operator as
byteint, decimal(2,1) and VARCHAR(1) CHARACTER SET UNICODE values.
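A hedged sketch of how these clauses might appear when invoking a user-defined table operator follows; the operator name, table, and USING keys are illustrative:
SELECT *
FROM my_operator (
ON ( SELECT * FROM CAB.Sales )
HASH BY store_id
LOCAL ORDER BY sale_date
USING threshold(100) labels('Q1','Q2')
) AS dt;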
Section 2.12. QueryGrid
Teradata Database provides a means to connect to a remote system and retrieve or insert data
using SQL. This enables easy access to Hadoop data for the SQL user, without replicating the
data in the warehouse.
Note: With Teradata Software Release 15.0, SQL-H has been rebranded as Teradata QueryGrid:
Teradata Database to Hadoop. The existing connectors on Teradata 14.10 to TDH/HDP will
continue to be called SQL-H. The 14.10 SQL-H was released with Hortonworks Hadoop and is
certified to work with TDH 1.1.0.17, TDH/HDP 1.3.2, and TDH/HDP 2.1 (TDH = Teradata
Distribution for Hadoop, HDP = Hortonworks Data Platform).
The goal and vision of Teradata® QueryGrid™ is to make specialized processing engines,
including those in the Teradata Unified Data Architecture™ act as one solution from the user’s
perspective. Teradata QueryGrid is the core enabling software, engineered to tightly link with
these processing engines to provide intelligent, transparent and seamless access to data and
processing. This family of intelligent connectors delivers bi-directional data movement and push-down processing to enable the Teradata Database or the Teradata Aster Database systems to
work as a powerful orchestration layer.
As the role of analytics within organizations continues to grow, along with the number and types
of data sources and processing requirements, companies face increasing IT complexity. Much of
the complexity arises from the proliferation of non-integrated systems from different vendors,
each of which is designed for a specific analytic task.
This challenge is best addressed by the Teradata® Unified Data Architecture™, which enables
businesses to take advantage of new data sources, data types and processing requirements across
the Teradata Database, Teradata Aster Database, and open-source Apache™ Hadoop®. Teradata
QueryGrid™ optimizes and simplifies access to the systems and data within the Unified Data
Architecture and beyond to other source systems, including Oracle Database; delivering seamless
multi-system analytics to end-users.
This enabling solution orchestrates processing to present a unified analytical environment to the
business. It also provides fast, intelligent links between the systems to enhance processing and
data movement while leveraging the unique capabilities of each platform. Teradata Database
15.0 brings new capabilities to enable this virtual computing, building on existing features and
laying the groundwork for future enhancements.
Teradata QueryGrid is a powerful enabler of technologies within and beyond the Unified Data
Architecture that delivers seamless data access and localized processing. The QueryGrid adds a
single execution layer that orchestrates analyses across Teradata, Teradata Aster, Hadoop, and in
the future other databases and platforms. The analysis options include SQL queries, as well as
graph, MapReduce, R-based analytics, and other applications. Offering two-way, Infiniband
connectivity among data sources, the QueryGrid can execute sophisticated, multi-part analyses.
It empowers users to immediately and automatically access and benefit from all their data along
with a wide range of processing capabilities, all without IT intervention. This solution raises the
bar for enterprise analytics and gives companies a clear competitive advantage.
The vision, simply said, is that a business person connected to the Teradata Database or Aster
Database can submit a single SQL query that joins data together from one or more systems for
analysis. There’s no need to depend upon IT to extract data and load it into another machine. The
business person doesn’t have to care where the data is – they can simply combine relational
tables in Teradata with tables or flat files found in Hadoop on demand.
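For illustration only, a hedged sketch of such a query on Teradata Database 15.0, assuming the DBA has already defined a foreign server object named sales_hdp for the Hadoop system (all object names are illustrative):
SELECT t.product_id, t.total_sales, h.web_clicks
FROM CAB.Product_Sales t
JOIN web_clicks_log@sales_hdp h   -- Hive table accessed through the foreign server
ON t.product_id = h.product_id;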
Teradata QueryGrid delivers some important benefits:
• It’s easy to use, using existing SQL skills
• Allows standard ANSI SQL access to Hadoop data
• Low DBA labor moving and managing data between systems
• High performance: reducing hours to minutes means more accuracy and faster turnaround for demanding users
• Cross-system analytics
• Leverage Teradata/Aster strengths: security, workload management, system management
• Minimum data movement improves performance and reduces network use
• Move the processing to the data
• Leverages existing BI tools and enables self-service
• Dynamically reads data from other database systems without requiring replication of the data into the local Teradata system
• Uses intelligent data access and filtering mechanisms that reduce the number of rows transported over the wire to Teradata
• Provides scalable architecture to move and process foreign data
• Eliminates the details of connecting to foreign databases
• Provides data type mapping and data conversion
The Teradata approach to fabric-based computing also leverages these elements wherever
possible for seamlessly accessing data across the Teradata® Unified Data Architecture™:
• Teradata BYNET® V5 is built on InfiniBand technology. Perfected over 20 years of massively parallel processing experience, BYNET V5 provides low-latency messaging capabilities for maximum data access. This is accomplished by leveraging the inherent scalability and integrity of InfiniBand to load-balance multiple fabrics, seamlessly handling failover in the event of an interconnect failure.
• InfiniBand technology, a Teradata fabric, gains much of its resiliency from the Mellanox-supplied InfiniBand switches, adapters, and cables that are recognized as industry-leading products for high-quality, fully interoperable enterprise switching systems.
Section 2.13 DBQL
Teradata Database Query Log (DBQL) is the primary data source for query monitoring and
evaluation of tuning techniques. Teradata DBQL provides a series of predefined tables that can
store, based on rules you specify, historical records of queries and their duration, performance,
and target activity. The DBQL data is accessed with SQL queries written against the DBQL
tables or views. DBQL is documented in the Database Administration manual for each release
of the Teradata database. A Database Administrator (DBA) typically performs the DBQL
logging configuration and runs the DBQL logging statements. The DBA grants access on DBQL
tables or views to application analysts so that analysts can measure and quantify their database
workloads.
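For example, a DBA might grant read access to the main DBQL log view to an analyst (the user name is illustrative):
GRANT SELECT ON DBC.QryLog TO analyst_user;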
The recommended leading-practice DBQL configuration for Teradata environments is that all usage
should be logged at the Detail level with SQL and Objects. The only exception is database usage
consisting of strictly subsecond, known work, i.e. tactical applications. This subsecond, known
work is logged at the Summary level.
Logging with DBQL is best accomplished by one logging statement for each “accountstring” on
a Teradata system. A database user session is always associated with an account. The account
information is set with an “accountstring.” Accountstrings typically carry a workload group
name (“$M1$”), identifying information for an application (“WXYZ”), and expansion variables
for activity tracking (“&S&D&H” for session number, date, and hour).
An example of the recommended DBQL Detail logging statement using this example
accountstring is:
BEGIN QUERY LOGGING with SQL, OBJECTS LIMIT sqltext=0 ON ALL ACCOUNT =
'$M1$WXYZ&S&D&H';
This statement writes rows to the DBQLogTbl table. This table contains detailed information
including but not limited to CPU seconds used, I/O count, result row count, system clock times
for various portions of a query, and other query identifying information. This logging statement
writes query SQL to the DBQLSqlTbl table with no SQL text kept in the DBQLogTbl. Database
object access counts for a query are written to the DBQLObjTbl table.
An example of the recommended DBQL Summary logging statement for tactical applications
(only known subsecond work) using the example accountstring on Teradata V2R6 and later is:
BEGIN QUERY LOGGING LIMIT SUMMARY=10,50,100 CPUTIME ON ALL ACCOUNT =
'$M1$WXYZ&S&D&H';
This statement results in up to four rows written to the DBQLSummaryTbl in a 10 minute DBQL
logging interval. These rows summarize the query logging information for queries in hundredths
of a CPU second between 0 and 0.10 CPU second, 0.10 - 0.50 CPU second, 0.50 - 1 CPU
second, and over 1 CPU second.
For a well-behaved workload with occasional performance outliers, DBQL threshold logging can
be used. Threshold logging is not typically recommended but is available. An example of a
DBQL Threshold logging statement using the example accountstring is:
BEGIN QUERY LOGGING LIMIT THRESHOLD=100 CPUTIME AND SQLTEXT=10000
ON ALL ACCOUNT = '$M1$WXYZ&S&D&H';
The CPUTIME threshold is expressed in hundredths of a CPU second. This statement logs all
queries over 1 CPU second to the DBQL Detail table with the first 10,000 characters of the SQL
statement also logged to the DBQL Detail table. For queries less than 1 CPU second, this DBQL
Threshold logging statement writes a query cumulative count by CPU seconds consumed for
each session as a separate row in DBQLSummaryTbl every 10 minutes. The
DBQLSummaryTbl will also contain I/O use and other query identifying information.
IMPORTANT NOTE: DBQL Threshold and Summary logging cause you to lose the ability to
log SQL and OBJECTS. Threshold and Summary logging are recommended only after the
workload is appropriately profiled using DBQL Detail logging data. Further, threshold logging
should be used in limited circumstances. Detail logging data, with SQL and OBJECTS, is
typically desired to ensure a full picture is gathered for analysis of queries performing outside
expected norms.
DBQL can be enabled on Teradata user names in place of accountstrings. This is typically done
when measuring performance tests run under a specific set of known user logins. DBQL logging
by accountstrings is the more flexible, production-oriented approach.
If a maintenance process is required for storing DBQL data historically, the Download Center on
Teradata.com provides a DBQL Setup and Maintenance document with the table definitions,
macros, etc. to accomplish the data copy and the historical storage process. This is generally
implemented when it is decided that DBQL data should not be in the main or root Teradata DBC
database for more than a day. The DBQL data maintenance process from Teradata.com can be
implemented in under an hour.
Two additional logging statements are recommended in addition to the recommended Detail and
Summary logging statements previously mentioned. These additional DBQL logging statements
are used temporarily on a database user such as SYSTEMFE to dump the DBQL data buffers
before query analysis or during the maintenance process. Examples of this statement are:
BEGIN QUERY LOGGING ON SYSTEMFE;
END QUERY LOGGING ON SYSTEMFE;
DBQL buffers can retain data up to 10 minutes after a workload runs. On a lightly used Teradata
system, DBQL buffers may not flush for hours or days. Any DBQL configuration change causes
all DBQL data to be written to the DBQL tables. Use of these two additional statements ensures
that all data is available before an analysis of DBQL data is performed.
DBQL enhancements in Teradata version 14.10
• Descriptions
> Enhancements to provide more accurate and complete performance data. i.e.
collecting resource usage data for AMP steps
> Include CPU & IO usage all the way through an aborted query step (top requested
item from the PAC).
> Enhance DBQL logging for parallel AMP steps.
> Provide a new DBQL table to enable the logging of Database Lock information
– Enabler for capturing and storing database Lock information in a DBQL
table
• Benefit
> Provides a more powerful query analysis tool for developing, debugging, and
optimizing queries.
A new Database Query Logging (DBQL) option, called UTILITYINFO, captures information
for load, export, and Data Stream Architecture (DSA) utilities at the job level.
• New DBC DBQL table - DBQLUtilityTbl
• Supported utilities:
> FastLoad Protocol: FastLoad, TPT Load, and JDBC FastLoad
> MLOAD Protocol: MultiLoad and TPT Update
> MLOADX Protocol: TPT Update operator
> FastExport Protocol: FastExport, TPT Export, and JDBC FastExport
> DSA
> Example:
> BEGIN QUERY LOGGING WITH UTILITYINFO ON USER1;
– The UTILITYINFO option is turned off by default
DBQL enhancements in Teradata version 15.00
DBQL’s Parameterized Query Logging is a new logging feature added in Teradata release 15.0.
With earlier versions of Teradata, DBQL did not have the capability to collect the values of the
parameters for a parameterized query. With the PQL feature, parameter information along with the
values is captured into a new DBQL table, DBC.DBQLParamTbl. This table holds all the
necessary parameter data in a BLOB column. This BLOB column can then be converted into a
JSON document using the internal fast path system function TD_SYSFNLIB.TD_DBQLParam().
This feature can be enabled for all users with the following statement.
Begin Query Logging with ParamInfo on all;
Now that the parameter values are captured, the same query can be replayed with the values in
place of the parameters. This helps customers isolate problem values supplied to parameters.
DBQL Queries
User Activity Query :
SELECT UserName (TITLE 'User', FORMAT 'X(8)'),
SessionID (TITLE 'Session', FORMAT 'ZZZZZZZ9'),
QueryID (TITLE 'Query//ID', FORMAT 'ZZZZZ9'),
StartTime (TITLE 'Start//Time', FORMAT 'HH:MI:SS'),
FirstRespTime (TITLE 'Response//Time', FORMAT 'HH:MI:SS'),
TotalIOCount (TITLE 'IO//Count', FORMAT 'ZZ9'),
AMPCPUTime (TITLE 'CPU', FORMAT 'ZZ9.99')
FROM DBC.DBQLogTbl
WHERE UserName='Tester'
ORDER BY 1,2,3;
Query Analysis:
select
queryid,
FirstStepTime,
EstResultRows (format 'zzz,zzz,zzz,zz9'),
NumResultRows(format 'zzz,zzz,zzz,zz9'),
EstProcTime (format 'zzz,zzz,zz9.999'),
AMPCPUTime (format 'zzz,zzz,zz9.999'),
TotalIOCount(format 'zzz,zzz,zzz,zz9'),
NumOfActiveAMPs(format 'zzz,zz9'),
MaxAmpCPUTime(format 'zzz,zzz,zz9.999'),
MinAmpCPUTime(format 'zzz,zzz,zz9.999'),
MaxAmpIO (format 'zzz,zzz,zzz,zz9'),
MinAmpIO (format 'zzz,zzz,zzz,zz9')
from dbc.dbqlogtbl
where username ='xxxx'
and cast(FirstStepTime as date) = '2013-02-13'
and NumOfActiveAMPs <> 0
and AMPCPUTime > 100
order by Firststeptime;
Errors that result in an abort and possibly a rollback:
sel distinct errorcode (format 'zzzzz9'),
errortext
from dbc.dbqlogtbl
where cast (collecttimestamp as date) = date
and errorcode in (2631, 2646, 2801, 3110, 3134, 3514, 3535, 3577) order by 1;
Top 10 (CPU):
SELECT StartTime
,QueryID
,Username
,StatementType
,AMPCPUTime
,rank () over (
order by AMPCPUTime DESC) as ranking
from DBC.QryLog
where cast(CollectTimeStamp as date) = date
qualify ranking <=10;
Section 2.14 Administrative Tips
The DBC.* views whose names end with "X" are all designed to show only what the current user
is allowed to see.
To list all databases that have tables that you can access:
select distinct databasename from dbc.tablesvX order by 1;
To list all the tables for these databases that you can access:
select databasename,tablename from dbc.tablesX order by 1,2;
Collations can be defined at the USER (modify user collation=x) or SESSION (set session
collation x) levels. Set session takes precedence. Each type of collation can be discovered by using
a query such as:
SELECT CharType FROM DBC.ColumnsX WHERE ...
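For example, the collation itself could be set at either level as follows (the user name is illustrative; check the collation names supported on your release):
MODIFY USER sample_user AS COLLATION = ASCII;
SET SESSION COLLATION MULTINATIONAL;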
If Referential Integrity is defined on the tables, you can get a list of the relationships with a query
like this:
select
trim(ParentDB) || '.' || trim(ParentTable) || '.' || trim(ParentKeyColumn) (char(32)) "Parent"
, trim(ChildDB) || '.' || trim(ChildTable) || '.' || trim(ChildKeyColumn) (char(32)) "Child"
from DBC.All_RI_ParentsX
order by IndexName
;
If you want to start with a particular parent table and build a hierarchical list, you might try this
recursive query:
with recursive RI_Hier( ParentDB, ParentTable ,ChildDB, ChildTable ,level
) as (
select ParentDB, ParentTable, ChildDB, ChildTable, 1
from DBC.All_RI_ParentsX
where ParentDB = <Parent-Database-Name> and ParentTable = <Parent-Table-Name>
union all
select child.ParentDB, child.ParentTable, child.ChildDB, child.ChildTable, RI_Hier.level+1
from RI_Hier
,DBC.All_RI_ParentsX child
where RI_Hier.ChildDB = child.ParentDB
and RI_Hier.ChildTable = child.ParentTable
)
select
trim(ParentDB) || '.' || trim(ParentTable) "Parent"
,trim(ChildDB) || '.' || trim(ChildTable) "Child"
,level
from RI_Hier
order by level, Parent, Child
;
There are a number of session-specific diagnostic features which can be very helpful under
specific situations. Use these by executing the statement(s) below in a query window prior to
diagnosing the query of interest. Note that these settings are only active for the current session.
When the session is terminated, the session parameter is cleared.
a)
DIAGNOSTIC HELPSTATS ON (NOT ON) FOR SESSION;
Using the EXPLAIN feature on a query in conjunction with the above session parameter
provides the user with statistics recommendations for the given query. While the list can be very
beneficial in helping identify missing statistics, not all of the recommendations may be
appropriate. Each of the recommended statistics should be evaluated individually for usefulness.
Using the example below, note that the optimizer is suggesting that statistics on columns
DatabaseName and TVMName would likely result in higher confidence factors in the query
plan. This is due to those columns being used in the WHERE condition of the query.
Query:
SELECT DatabaseName,TVMName,TableKind
FROM dbc.TVM T
,dbc.dbase D
WHERE
D.DatabaseId=T.DatabaseId
AND DatabaseName='?DBName'
AND TVMName='?TableName'
ORDER
BY 1,2;
Results (truncated):
BEGIN RECOMMENDED STATS ->
• "COLLECT STATISTICS dbc.dbase COLUMN (DATABASENAME)". (HighConf)
• "COLLECT STATISTICS dbc.TVM COLUMN (TVMNAME)". (HighConf)
<- END RECOMMENDED STATS
b)
DIAGNOSTIC VERBOSEEXPLAIN ON (NOT ON) FOR SESSION;
Teradata’s Verbose Explain feature provides additional query plan information above and
beyond that shown when using the regular Explain function. Specifically, more detailed
information regarding spool usage, hash distribution and join criteria are presented. Use the
above session parameter in conjunction with the EXPLAIN feature on a query of interest.
3. Workload Management
Section 3.1 Workload Administration
Workload Management in Teradata is used to control system resource allocation to the various
workloads on the system. After installation of a Teradata partner application at a customer site,
Teradata has a number of tools used for workload administration. Teradata Active System
Management (TASM) is a grouping of products, including system tables and logs, that interact
with each other and a common data source. TASM consists of a number of products: Teradata
Workload Analyzer, Viewpoint Workload Monitor/Health, and Viewpoint Workload Designer.
TASM also includes features to capture and analyze Resource Usage and Teradata Database
Query Log (DBQL) statistics. There are a number of orange books, Teradata magazine articles,
and white papers addressing the capabilities of TASM.
Perhaps the best source of information for TASM is the Teradata University website at
https://university.teradata.com. There are a number of online courses and webcasts available on
the Teradata University site which offer a wealth of information on TASM and its component
products.
The Teradata Viewpoint portal is a framework where Web-based applications, known as
portlets, are displayed. IT professionals and business users can customize their portlets to
manage and monitor their Teradata systems using a Web browser.
Portlets enable users across an enterprise to customize tasks and display options to their
specific business needs. You can view current data, run queries, and make timely business
decisions, reducing the database administrator workload by allowing you to manage your
work independently. Portlets are added to a portal page using the Add Content screen. The
Teradata Viewpoint Administrator configures access to portlets based on your role.
4. Migrating to Teradata
Section 4.1 Utilities and Client Access
Teradata offers a complete set of tools and utilities that exploit the power of Teradata for
building, accessing, managing, and protecting the data warehouse. Teradata’s data acquisition
and integration (load and unload) tools are typically used by partners in the ETL space while
partners in the Business Intelligence and EAI spaces use the connectivity and interface tools
(ODBC, JDBC, CLIv2, .NET, OLE DB). For this discussion, we will focus primarily on the
“Load & Unload tools” and the “Connectivity and Interface Tools.”
One common reason that Partners want to integrate their products with Teradata is Teradata’s
ability to more efficiently work with, and process, larger amounts of data than the other
databases that the Partner is accustomed to working with. With this in mind, Partners should
consider the advantages and flexibility offered by the Teradata Parallel Transporter (and the TPT
API), which provides the greatest flexibility and throughput capability of all the Load/Unload
products.
Another relatively new option to consider for an ELT architecture is the ANSI MERGE in
combination with NoPI tables, available as of Teradata Version 13.0. The ANSI MERGE offers many
of the capabilities of the Teradata utilities (error tables, block-at-a-time optimization, etc.), and the
FastLoad into the NoPI target table is up to 50% faster than the FastLoad into a target table with a
Primary Index (a sketch of this pattern follows below).
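A hedged sketch of this ELT pattern, with illustrative table and column names; the staging table is assumed to be loaded with FastLoad (or the TPT Load operator), and the Orders target table is assumed to have its Primary Index on o_orderID:
CREATE TABLE stg_orders
( o_orderID       INTEGER,
  o_orderdatetime TIMESTAMP(0),
  o_amount        DECIMAL(18,2)
) NO PRIMARY INDEX;   -- NoPI staging table, FastLoaded from the source feed

MERGE INTO Orders AS tgt
USING stg_orders AS src
ON (tgt.o_orderID = src.o_orderID)
WHEN MATCHED THEN UPDATE SET
  o_orderdatetime = src.o_orderdatetime,
  o_amount = src.o_amount
WHEN NOT MATCHED THEN INSERT
  (o_orderID, o_orderdatetime, o_amount)
  VALUES (src.o_orderID, src.o_orderdatetime, src.o_amount);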
Both of these options are discussed in this section.
Many other options are presented here, and it is important to consider all of them to find the
methods that are best for a particular Partner’s needs.
Section 4.1.1 Teradata Load/Unload Protocols & Products
•
SQL Protocol - Used for small amounts of data or continuous stream feeds.
• Use SQL protocol for
> DDL operations
– Use industry standard open APIs, BTEQ.
> Loading/unloading small amounts of data
– Use open APIs, BTEQ, and/or TPump protocol.
> Sending Insert/Select or mass update/delete/merge for an ELT scenario
– Use open APIs or BTEQ.
> Continuous data loading – use TPump protocol.
• Industry standard open APIs for SQL protocol
> Teradata ODBC Driver
> Teradata OLE DB Provider
> Teradata .Net Data Provider
> Teradata JDBC Driver
• Teradata tools that use SQL protocol
> TPump protocol
– TPump product or Parallel Transporter Stream Operator.
– TPump product - script-driven batch tool.
– Parallel Transporter Stream – use with TPT API or scripts.
– Optimizations for statement and data buffering along with reduced
row-locking contention and SQL statement cache reuse on the
Teradata Database.
– Indexes are retained and updated
> BTEQ
– Batch SQL and report writer tool that has basic import/export.
> Preprocessor2
– Used for embedding SQL in C application programs.
> CLIv2
– Lowest level API but not recommended due to added complexity
without much performance gain over using higher level interface
– Note: CLIv2 is used by Teradata tools like BTEQ, Teradata load tools
etc. One can run different protocols using CLIv2 (e.g., SQL, ARC for
backup, FastLoad, MultiLoad, FastExport, etc.) Only SQL protocol is
published.
• Bulk Data Load/Unload Protocols
FastLoad, MultiLoad, FastExport, and TPump are load protocols that can be executed
by stand-alone tools or Teradata Parallel Transporter.
– FastLoad Protocol – Bulk loading of empty tables.
– MultiLoad Protocol – Bulk Insert, Update, Upsert, & Delete.
– FastExport Protocol – Export data out of Teradata DB.
– TPump – SQL application for continuous loading uses a pure SQL interface.
TPump includes the best knowledge of how to write a high-performance load tool
using SQL Inserts – including multi-statement requests, buffering of data,
checkpoint/restart, etc. The Teradata Database does not know that the SQL Inserts
are coming from TPump, they are just like any other SQL requests.
> Load/Unload Products which use the load protocols
– Stand-alone Utilities (original tools)
• FastLoad, MultiLoad, TPump, FastExport.
• Separate tools & languages, script interface only.
– JDBC
• FastLoad & FastExport protocol for pure Java applications. This is
for applications that are pure Java and don’t want to wrap the TPT
API C++ interface with JNI wrappers for performance and
portability reasons. JDBC Parameter Arrays can map to Teradata
iterated requests/FastLoad. Given an unconstrained network, JDBC
FastLoad may be three to 10 times faster than the corresponding
SQL PreparedStatement batched insert. FastLoad can be
transparently enabled via connection URL setting.
– Teradata Parallel Transporter (improved tools).
• Execute all load/unload protocols in one product with one scripting
language.
• Plug-in Operators: Load, Update, Stream, Export.
• Provides C++ API to protocols for ISV partners.
Section 4.1.2 Input Data Sources with Scripting Tools
• Flat files on disk and tape.
• Named Pipes.
• Access Modules
> Plug-ins to Teradata Load Tools to access various data sources.
> OLE DB Access Module to read data from any OLE DB Provider (e.g., all
databases).
> JMS Access Module to read data from JMS queues.
> WebSphere MQ Module to read data from MQ queues.
> Named Pipe Access Module to add checkpoint/restart to pipes.
> Custom – write your own to get data from anywhere.
• Teradata Parallel Transporter Operators
> ODBC Operator - reads data from any ODBC Driver (e.g., all databases).
Access Modules Example
(Diagram: a plug-in OLE DB Access Module reads data from any database; for example, an Oracle database feeds the OLE DB Access Module, which feeds a Teradata load tool, which loads the Teradata Database.)
Section 4.1.3 Teradata Parallel Transporter
What is Teradata Parallel Transporter?
• Parallel Transporter is the new generation of Teradata Load/Unload utilities. It is the new
version of FastLoad, MultiLoad, TPump, and FastExport. Those load protocols were
combined into one tool and new features added.
• The FastLoad, FastExport, and MultiLoad protocols are client/server: there is code that
runs on the client and code that runs on the server. The Teradata Database server code is
unchanged, and the client part of the code was re-written.
• If you already know the legacy Stand-alone load tools, everything you have learned about
the four load tools still applies as far as features, parameters, limitations (e.g., number of
concurrent load jobs), etc. There is no learning curve for the protocols, just learn the new
language and the new features.
• Why was the client code rewritten? Benefits are one tool with one language, performance
improvement on large loads with parallel load streams when I/O is a bottleneck, ease of
use, & the TPT API (for better integration with partner load tools, no landing of
data, and parallel load streams).
• Most everything about load tools still applies
> Similar basic features, parameters, limitations (e.g., number of concurrent load
jobs), when to use, etc.
• Parallel Transporter performs the tasks of FastLoad, MultiLoad, TPump, and FastExport. In
addition to these functions it also provides:
– Common scripting language across all processes. This simplifies the writing of scripts and
makes it easier to do tasks such as “Export from this database, load into this database” in a
single script.
– Full parallelism on the client side (where the FastLoad and MultiLoad run). We now create
and use multiple threads on the client side allowing ETL tools to fan out in parallel on the
multi-CPU SMP box they run on.
– API for connecting ETL tools straight to the TPT Load/Unload Operators. This will simplify
integration with these tools and improve performance in some cases.
• Wizard that generates scripts from prompts for learning the script language –
supports only small subset of Parallel Transporter features.
• Command line interface (known as Easy Loader interface) that allows one to
create load jobs with a single command line. Supports a subset of features.
• With Teradata release 14.10, the MultiLoad protocol on the Teradata Database has been
extended. This extension is known as Extended MLOAD or MLOADX. The new
extension is implemented only for the Parallel Transporter Update Operator. At
execution time, the extension converts the TPT Update Operator script into an ELT
process if the target table of the utility has any of the following objects: Unique
Secondary Indexes, Join Indexes, Referential Integrity, a Hash Index, or a trigger; it also
supports temporal tables. This extension eliminates the need for the system administrator
to Drop/Create the aforementioned objects just so that MultiLoad may be used.
Conversion from ETL to ELT will happen automatically; no change to the utility or its
script is necessary.
Note: The names -- FastLoad, MultiLoad, FastExport, TPump – refer to load protocols and
not products. Anywhere those load protocols are mentioned, Teradata Parallel Transporter
(TPT) can be substituted to run the load/unload protocols.
How does it work?
Parallel Transporter Architecture
(Diagram: data sources such as databases, files, and message queues are read on behalf of user-written scripts, ETL tools, and custom programs; these drive the TPT infrastructure either through the script parser or through the direct API, and the Load, Update, Export, and Stream operators move the data into the Teradata Database.)
Increased Throughput
(Diagram: a traditional utility job reads one source at a time, or runs one job per source, through InMods or access modules into a single load utility and then into the Teradata Database. A Parallel Transporter job reads Source1, Source2, and Source3 in parallel through multiple read operators, optional transform operators, and multiple load operators into the Teradata Database.)
• Traditional utilities on the left and Parallel Transporter on the right.
• On left, must concatenate the 3 files into 1 or run 3 jobs.
• Terminology of Producer (Read) Operator, Consumer (Write) Operator, Transform (user-written), & independent ones like the DDL Operator.
• Operators flow data through data streams.
• If I/O is bottleneck, can read all three files in parallel – generally benefits FastLoad and MultiLoad protocols.
• If the load utility is the bottleneck and pegged out the CPU, can scale it.
• Optional user transform Operators (write C++ for simple transforms).
• Picture on right is one job (all in the rectangle noted by the thin line).
• Still always more throughput to run multiple jobs. TPT makes one job run faster by internally leveraging parallel processes. Looks like one load job to the DBS.
• Uses more resources (memory, CPU, etc.) to gain throughput.
• If I/O or CPU is not bottleneck, then scaling can reduce throughput by having to manage multiple processes for no gain.
• All processes run asynchronously and in parallel (overlapped I/O & loading of DBS).
• Note the bottleneck with a TPump/Stream Operator job is usually Teradata (use Priority Scheduler, etc.) and not the I/O.
• TPump protocol can benefit in a scenario of sending multiple input files to a single TPump job versus multiple TPump jobs, reducing table locking.
• Best performance is scalable performance. This is a scalable solution on the client to match the scalability on the Teradata Database. Parallel processes read parallel input streams, can have parallel load processes that communicate with the Teradata Database through parallel sessions (scale across network bandwidth), & parallel PEs read data while parallel AMPs apply data to target tables.
TPT Operators
Much of the functionality is provided by scalable, re-usable components called Operators:
• Operators can be combined to perform desired operations (load, update, filter, etc.).
• Operators & infrastructure are controlled through metadata.
• Scalable Data Connector – reads/writes data in parallel from/to files and other sources
(e.g. Named Pipes, TPT Filter Operator, etc.).
• Export, SQL Selector - read data from Teradata tables (Export uses Fast Export protocol
& Selector is similar to BTEQ Export).
• Load - inserts data into empty Teradata tables (FastLoad protocol).
• SQL Inserter - inserts data into existing tables (similar to BTEQ Import using SQL).
• Update - inserts, updates, deletes data in existing tables (uses MultiLoad protocol).
Starting with Teradata release 15.0, the TPT Update Operator can load LOBs.
• Stream - inserts, updates, deletes data in existing tables in continuous mode.
• Infrastructure reads script, parses, creates parallel plan, optimizes, launches, monitors,
and handles checkpoint/restart.
TPT Operators include:
> Load – FastLoad protocol.
> Update – MultiLoad protocol.
> Stream – TPump protocol.
> Export – FastExport protocol.
> SQL Inserter. Loads data, including large objects (LOBs), into a new or an
existing table using a single SQL protocol session.
> SQL Selector. Extracts data, including LOBs, from an existing table using a
single SQL protocol session.
> Open Database Connectivity (ODBC). Extracts data from external third-party
ODBC sources.
> Data Connector. Supports simultaneous, parallel reading of multiple data sources,
such as various types of files or queuing systems; also allows writing to external
data sources.
FastLoad, MultiLoad, and FastExport are client/server protocols: a program that runs on the client box talks over a proprietary interface to a program on the Teradata Database. These closed, undocumented interfaces have been opened up with the Teradata Parallel Transporter API.
With the FastLoad & MultiLoad protocols, there are two phases: acquisition and apply. In the acquisition phase the client reads data as fast as it can and sends it to the database, which puts the data into temporary tables. When the data read is exhausted, the client signals the database to start the apply phase, where the data in the temporary tables is redistributed to the target tables in parallel.
Teradata Active System Management tools can throttle the Teradata load tools (ML, FL, & FE protocols); TPump must be treated like any other SQL application.
TPT API
TPT API is the interface for programmatically running the Teradata load protocols.
Specifically: FastLoad, MultiLoad, FastExport, and TPump. It enhances partnering with 3rd party tools:
• Proprietary load protocols become open
• Partners integrate faster and easier
• Partner tool has more control over the entire load process
• Increased performance
TPT API is used by more than ETL vendors – BI vendors use Export Operator to pull data
from TDAT into their tool.
Not all vendors use the API. The script interface has its place (e.g., TDE), and non-parallel tools can create parallel streams using named pipes and the script version of TPT, which has a parallel infrastructure to read the multiple input files in parallel.
In addition, TPT API has the following characteristics:
• Opens up previously closed, proprietary load protocols.
• Partners integrate faster & easier with C++ API than generating script language.
• Can flow parallel streams of data into load tools.
• No landing of data before loading.
• Partner tool gets complete control over load process.
Parallel Transporter - Using API
Figure: an application or ETL program reads from data sources (Oracle, flat files, etc.) and calls the Load, Update, Export, and Stream operators directly through the API to move data into and out of the Teradata Database.
Integration with API
Before API – Integration with Scripting
Figure: the ETL tool generates a FastLoad script file and writes the data to a named pipe or intermediate file; FastLoad reads both, loads the Teradata Database, and writes a message and statistics file that the ETL tool then reads.
1. Vendor ETL tool creates Teradata utility script & writes to file.
2. Vendor ETL tool reads source data & writes to intermediate file (lowers performance to
land data in intermediate file).
3. Vendor invokes Teradata utility (Teradata tool doesn’t know vendor tool called it).
4. Teradata tool reads script, reads file, loads data, writes messages to file.
5. Vendor ETL tool reads messages file and searches for errors, etc.
• Before the API, it was necessary to generate multiple script languages depending on the tool, land data in a file or named pipe, and post-process error messages from a file.
With API - Integration Example
Figure: the ETL tool reads the data source and passes data and FastLoad parameters directly to the FastLoad protocol functions, which load the data into the Teradata Database and pass return codes and error messages back to the ETL tool.
• Vendor ETL passes script parameters to API
• Vendor ETL tool reads source data & passes data buffer to API
• Teradata tool loads data and passes return codes and messages back to caller
> Using TPT API, ETL tool reads data into a buffer and passes buffer to Operator thru API (no landing of the data).
> Get return codes and statistics through function calls in the API.
> No generating multiple script languages.
> ETL tool has control over load process (e.g., ETL tool determines when a checkpoint is kicked off instead of being specified in the load script).
> ETL tool can dynamically choose an operator (e.g. UPDATE vs. STREAM).
> Simpler, faster, higher performance.
Features and Benefits of TPT
Benefits
Three main benefits are performance (parallel input load streams), various ease of use
features, and the TPT API for ETL vendor integration.
• Performance – Improved Throughput
> Scalable performance on client load server with parallel instances and parallel
load streams.
> ETL vendors can scale load job across load servers.
> No landing the data with TPT API – in memory buffers.
• Ease of Use – When Using Scripts
Ease of use features apply when you are writing scripts. If you use an ETL vendor, then
the ETL tool will either call the TPT API or generate the appropriate script.
> Less time
– One tool & one scripting language.
– Easier to switch between load protocols.
– Unlimited symbol substitution.
– Load multiple input sources & combine data from dissimilar sources.
– Multiple job steps.
> Fewer scripts required
– Automatically load all files in a directory.
– Reduction of number of scripts.
– Teradata to Teradata export and load scenario.
– Don’t have to write code to generate scripts in multiple languages for
multiple tools – just pass buffers in memory
> Wizard to aid first-time script building.
• Improved ISV partner tool integration via Direct API
> Proprietary load protocols become open.
> 3rd Party partners integrate faster and easier.
> Partner-written programs can directly call load protocols.
In addition, TPT also has the following benefits:
• Less training is involved to learn one scripting language.
• Symbol substitution use cases:
1. Supply test names at run time during testing and supply production names at run time
in production – the script doesn’t change.
2. Define multiple load operators in the script and specify at run time which load protocol
to use (easier to switch between protocols).
• Can load multiple input sources & sources can be completely different media and
completely different file sizes (selected data must be union compatible).
• Multiple job steps (e.g., DDL Operator to delete/allocate table followed by load step).
• Load all files in a directory and export from Teradata and load Teradata in one job.
• The Wizard is for simple scripts only; not all features are included in the Wizard, and it is not meant to be, or to replace, a 3rd party ETL tool.
• When you install TPT there is a "samples" directory; go there, grab a script, modify it, and run it.
• TPT Operators run in memory as part of the ETL tool’s address space rather than
asynchronously without each other’s knowledge.
• Checkpoints are initiated by ETL tool and not by Teradata load tool.
• ETL tool gets return codes and operational metadata from function calls rather than
parsing output files.
• Application program (e.g., ETL tool) reads the source data and calls the API to pass:
> 1. Parameters based on the load protocol
> 2. Data
> 3. Get messages
• Only the four load/unload protocols are available through TPT API. No access modules
are available since ETL vendors have the functionality of the Teradata access modules in
their products.
The following two diagrams depict the advantages of TPT over Stand-alone tools:
Stand-Alone Tools: No Parallel Input Streams
Figure: the ETL tool launches multiple instances that read and transform Oracle data in parallel, but it must bring the parallel streams back to a single file or pipe because stand-alone FastLoad can only read one input stream into Teradata.
• Stand-alone load tools can only read one input source
Parallel Input Streams Using API
Figure: the ETL tool launches multiple instances that read and transform Oracle data in parallel, and Parallel Transporter reads the parallel streams with multiple TPT Load instances launched through the TPT API into Teradata.
• Application program (e.g., ETL tool) reads the source data and calls the API to pass:
> 1. Parameters to the load Operator based on the load protocol (e.g., number of sessions to connect to Teradata, etc.)
> 2. Data
> 3. Function calls to get messages and statistics
The ETL tool can flow parallel streams of data through the TPT API to gain throughput for large data loads.
Section 4.1.4 Restrictions & Other Techniques
• Some bulk loading restrictions (see reference manuals)
> FastLoad & MultiLoad – no join indexes, no foreign key references, no LOBs
> FastLoad – only primary indexes
> MultiLoad – no USIs
> Due to these restrictions, the engineering emphasis is on ELT and improving Insert/Select, ANSI Merge, etc.
> ARC can't move data from a newer release to an older release
• Other data movement techniques
> Use of UDFs (e.g., table functions), stored procedures, triggers, etc.
> Data Mover product is a shell on top of TPT API, ARC, and JDBC
Other data movement protocols
• Teradata Unity Data Mover - Data Mover enables you to define and edit jobs that copy specified database objects, such as tables, users, views, macros, stored procedures, and statistics, from one Teradata Database system to another Teradata Database system. Tables can be copied between Teradata systems and Aster or Hadoop systems.
• Teradata Migration Accelerator (TMA) – TMA is a professional services tool that’s
designed to assist with migration projects by automating many of the common tasks that
are associated with the task. Currently TMA supports Oracle, DB2 and SQL Server
migrations.
• ARC – Archive and restore.
> Moves data at the internal block level.
> Format only understood by the Teradata dump and restore tool so no data
transformation can be done.
> In addition to data protection, it is the fastest protocol for moving data from Teradata to Teradata (e.g., upgrading machines).
> ARC Products:
– Backup with ISV products (NetVault, NetBackup, Tivoli).
– The NPARC service uses ARC for system upgrades.
• Teradata Replication – changed data capture API.
Note: Teradata Replication Services using Golden Gate are not supported with Teradata
Software Release 15.0 and above.
> Currently undocumented, under development, may change.
> Intent is to document the API when it is finalized.
> GoldenGate is only company that currently uses CDC API.
> Lower data volumes than bulk loading protocols.
Summary: Products & Protocols
The products BTEQ, the ODBC driver, the JDBC driver, the OLE DB provider, the .NET Data Provider, TPT, TPump/TPT Stream, FastLoad/TPT Load, MultiLoad/TPT Update, FastExport/TPT Export, BAR & ARC scripts, Replication Services (GoldenGate), TDM (which calls the TPT API & ARC), and Teradata Unity map onto the available load protocols: SQL (Insert, Update, etc.), FastLoad, MultiLoad, TPump, FastExport, ARC, and CDC (the changed data capture API, which is for pulling data out only).
Section 4.2 Load Strategies & Architectural Options
Section 4.2.1 ETL Architectural Options
• Run utilities where source data is located
> Utilities run on mainframe, Windows, and UNIX.
> Mainframe is direct channel attached to Teradata.
• Load Server
> Dedicated server with high-speed connect to Teradata Database.
> Pull data from various platforms & sources
– ODBC databases, named pipes, JMS queues, Websphere MQ, etc.
– Use ETL tool to source data.
• Do ETL with:
> Teradata Utilities and/or Teradata Parallel Transporter.
> 3rd Party Products
– Vendor products work in conjunction with Teradata client software
(CLIv2, ODBC, MultiLoad, FastLoad, Teradata Parallel Transporter, etc.).
> Custom applications
– Utilize CLIv2, ODBC, JDBC, etc.
> Sometimes, a combination of the above.
• Extract, Load, & Transform (ELT) and ANSI Merge
> Load raw data (FastLoad protocol) to staging table and use SQL Insert/Select to
do transforms & load target table
– Some ISV ETL tools generate post-load SQL
– Recommended when pulling large amounts of data from Teradata &
reloading Teradata
– ANSI Merge is an option as of Teradata Version 12.0 and is much more
efficient than insert/update scenario
– ANSI Merge in combination with NoPI tables is an option as of
Teradata Version 13.0
The ELT approach can also take advantage of the SQL bulk load operations that are available
within the Teradata Database. These operations not only support MERGE-INTO but also
enhance INSERT-SELECT and UPDATE-FROM. This enables primary, fallback and index data
processing with block-at-a-time optimization.
The Teradata bulk load operations also allow users to define their own error tables to handle
errors from operations on target tables. These are separate and different from the Update
operator’s error tables. Furthermore, the no primary index (NoPI) table feature also extends the
bulk load capabilities. By allowing NoPI tables, Teradata can load a staging table faster and
more efficiently.
Merge is ANSI-standard SQL syntax that can perform bulk operations on tables as part of the extract, load and transform (ELT) approach. These operations merge data from one source table into a target table, performing massive inserts, updates and upserts. So why use the merge function instead of an insert-select, or when an update join will suffice? Better performance and the added functionality of executing a bulk, SQL-based upsert.
Beginning with Teradata 13, the FastLoad target table can be a “No-PI” table. This type of table
will load data faster because it avoids the redistribution and sorting steps, but this only postpones
what eventually must be done during the merge process. The “merge target table” can have a
predefined error table assigned to it for trapping certain kinds of failures during the merge
process. There can be up to a 50% performance improvement when NoPI tables are used in a
FastLoad loading scenario.
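As a minimal sketch of this ELT pattern (the table names, columns, and error table below are hypothetical, and the exact error-logging options can vary by release), the staged load followed by an ANSI MERGE might look like this:

CREATE TABLE stg_sales (
  sale_id   INTEGER,
  cust_id   INTEGER,
  sale_amt  DECIMAL(12,2),
  sale_date DATE)
NO PRIMARY INDEX;                               -- NoPI staging table filled by the FastLoad protocol (TPT Load)

CREATE ERROR TABLE sales_merge_err FOR sales;   -- user-defined error table for the bulk SQL operation

MERGE INTO sales t
USING stg_sales s
  ON t.sale_id = s.sale_id
WHEN MATCHED THEN UPDATE
  SET sale_amt = s.sale_amt, sale_date = s.sale_date
WHEN NOT MATCHED THEN INSERT (sale_id, cust_id, sale_amt, sale_date)
  VALUES (s.sale_id, s.cust_id, s.sale_amt, s.sale_date)
LOGGING ERRORS WITH NO LIMIT;

When rows only need to be appended, a plain INSERT ... SELECT from the staging table (also eligible for error logging) is the simpler alternative.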
• Raw Data Movement for System Migrations
> Use the NPARC (Named Pipe ARC) service, which uses the ARC protocol in custom shell scripts
> System upgrades require one-time movement of entire system
Section 4.2.2 ISV ETL Tool Advantages vs. Teradata Tools
• GUI interface
> Data workflows designed easily.
• Numerous data sources supported
> Drop down windows show multiple choices for data sources.
• Numerous data transformations on client box
> Numerous data sorting options.
> Data cleansing.
• Some generate SQL for ELT
> Generates SQL used to transform data in parallel Teradata Database engine.
• Central metadata repository for entire ETL process.
Section 4.2.3 Load Strategies
• Periodic Batch
> Overnight or other reasonable time slot
– ELT - FastLoad to staging table then Insert/Select to target.
– MultiLoad/TPump target table using Insert, Update, Upserts
– TPump protocol – row/hash lock.
– MultiLoad – Insert, Update, Upsert, Delete.
> Frequent ‘Mini’ Batch
– Stage for seconds or Minutes, then do Insert/Select.
– Use rotating batch jobs.
• Continuous – TPump protocol
> Data available immediately - row/hash lock.
• Changed data capture
> GoldenGate CDC interface to capture data from Teradata & copy to Teradata or
non-Teradata system.
> ISV tools - CDC data from other databases & load Teradata.
Section 4.2.4 ETL Tool Integration
1. Generate scripts for Teradata Stand-alone tools
> FastLoad, MultiLoad, TPump, and FastExport.
2. Generate scripts for Parallel Transporter (pre-API)
> Operators: Load, Update, Stream, and Export.
3. Integrate with Parallel Transporter thru API
> C++ calls directly to Operators.
• All three methods run same load protocols.
• Advantage of #2 & #3
> Parallel Transporter allows parallel input streams.
• Advantage of #3 over #2
> Vendor can integrate faster and easier.
> No landing of the data.
> Gives vendor tool more control over load process.
• All major ETL tool vendors integrate well with Parallel Transporter. Create a data flow
diagram with the ETL tool GUI and the ETL tool will internally generate the appropriate
Parallel Transporter calls.
• Not all tools will benefit with the API. Non-parallel vendors can get parallelism by
writing to multiple files/pipes and invoking the script interface where the TPT
infrastructure can read the data in parallel.
• ISV tools that don’t have the ability to access various data sources can take advantage of
the plug-in Access Modules (ODBC, Named Pipes, JMS, etc.) by using the TPT script
interface since the API only supports the load/unload Operators.
• Teradata Decision Experts uses the script version of TPT and leverages the ODBC
Operator to pull data from Oracle while using the Load Operator to load the Teradata
Database.
• Advantages of TPT API are:
> Parallel load streams.
> Not landing the data in a file or pipe.
> Increased command and control for ISV product (e.g., ISV product controls
the checkpoint intervals, etc.).
Section 4.3 Concurrency of Load and Unload Jobs
If you do not use the Teradata Viewpoint Workload Designer portlet to throttle concurrent load
and unload jobs, the MaxLoadTasks and the MaxLoadAWT fields of the DBSControl determine
the combined number of FastLoad, MultiLoad, and FastExport jobs that the system allows to run
concurrently. The default is 5 concurrent jobs.
If you have the System Throttle (Category 2) rule enabled, even if there is no Utility Throttle
defined, the maximum number of jobs is controlled by the Teradata dynamic workload
management software and the value in MaxLoadTasks field is ignored.
For more information on changing the concurrent job limit from the default value, see
"MaxLoadTasks" and "MaxLoadAWT" in the chapter on DBS Control in the Utilities manual.
Section 4.4 Load comparisons
Teradata recommends using one of the Teradata TPT load operators rather than ODBC or JDBC
for loading and/or extracting more than a few thousand rows of data from Teradata. As
mentioned in the previous section, the Teradata TPT load operators – specifically Load and Update – were designed for loading large volumes of data into Teradata as rapidly as possible, while ODBC and JDBC were not. This is simply due to the fact that Teradata, although it scales very effectively across all dimensions (data, users, query volumes, etc.), is orders of magnitude different in performance from other RDBMSs for row-at-a-time processing, and any ISV that wants to leverage the strengths of Teradata needs to understand this key point.
There can be as much as a 10x difference between TPT Load/Update and ODBC parameter array
inserts. Not utilizing ODBC parameter arrays (single row at a time) can make this as much as a
100x difference.
5. References
Section 5.1 SQL Examples
Preferred processing architecture
Teradata is a powerful relational database engine that can perform complex processing against
large volumes of data. The preferred data processing architecture for a Teradata solution is one
that would have business questions/problems passed to the database via complex SQL statements
as opposed to selecting data from the database for processing elsewhere.
The following examples are meant to provide food for thought on how an ISV might approach integration with Teradata.
Derived Tables
Description
A derived table is obtained from one or more other tables through the results of a query. How derived tables are used determines whether performance is enhanced. For example, one way of optimizing a query is to use derived tables to control how data from different tables is accessed in joins. The use of a derived table in a SELECT forces the subquery to create a spool file, which then becomes the derived table. Derived tables can then be treated in the same way as base tables. Using derived tables avoids CREATE and DROP TABLE statements for storing retrieved information and can assist in optimizing joins. A derived table is visible only to the SELECT statement that contains the defining subquery.
Example (Simple – Derived Table w/ AVG function)
The following SELECT statement displays those employees whose salary is below their
departmental average. 'WORKERS' in this example is the derived table. (result: 119 rows)
SELECT EMPNO, NAME, DEPTNO, SALARY
FROM (SELECT AVG(SALARY), DEPTNO
FROM EMPLOYEE
GROUP BY DEPTNO) AS WORKERS (AVERAGE_SALARY, DEPTNUM),
EMPLOYEE
WHERE SALARY < AVERAGE_SALARY AND
DEPTNUM = DEPTNO
ORDER BY DEPTNO, SALARY DESC;
Recursive SQL
Description
Recursive queries are used for hierarchies of data, such as Bill of Materials, organizational
structures (department, sub-department, etc.), routes, forums of discussions (posting, response
and response to response) and document hierarchies.
Example
The following selects a row for each child-parent, child-grandparent, etc. relationship in a
recursive table
WITH RECURSIVE TEMP_TABLE (CHILD, PARENT) AS
( SELECT ROOT.CHILD, ROOT.CHILD
FROM CHILD_PARENT_HIERARCHY ROOT
UNION ALL
SELECT H1.CHILD, H2.PARENT
FROM TEMP_TABLE H2, CHILD_PARENT_HIERARCHY H1
WHERE H2.CHILD = H1.PARENT
)
SELECT CHILD, PARENT FROM TEMP_TABLE ORDER BY 1,2;
Sub Queries
Description
Subqueries permit a more sophisticated and detailed query of a database through the use of nested SELECT statements, eliminating the need to return intermediate result sets to the client. There are subqueries used in search conditions and correlated subqueries, in which the inner query references columns of tables in the enclosing, or containing, outer query. The expression 'correlated subquery' comes from the explicit requirement for the use of correlation names (table aliases) in any correlated subquery in which the same table is referenced in both the inner and outer query.
Subqueries in Search Conditions
Example (Simple - w/ IN statement)
The following SELECT statement displays all the transactions which include the partkeys
involved in orderkey=1. (result: 184 rows)
SELECT T1.P_NAME, T1.P_MFGR
FROM PRODUCT T1, ITEM T2
WHERE T1.P_PARTKEY = T2.L_PARTKEY
AND T1.P_PARTKEY IN
(SELECT L_PARTKEY
FROM ITEM
WHERE L_ORDERKEY = 1);
Example (Simple - w/ 'operators')
The following SELECT statement displays the employees with the highest salary and the most
years of experience. (result: 1 row)
SELECT EMPNO, NAME, DEPTNO, SALARY, YRSEXP
FROM EMPLOYEE
WHERE (SALARY, YRSEXP) >= ALL
(SELECT SALARY, YRSEXP
FROM EMPLOYEE);
Example (Simple – w/ ‘operator’ and AVG function)
The following SELECT statement displays every employee in the Employee table with a salary
that is greater than the average salary of all employees in the table. (result: 497 rows)
SELECT NAME, DEPTNO, SALARY
FROM EMPLOYEE
WHERE SALARY >
(SELECT AVG(SALARY)
FROM EMPLOYEE)
ORDER BY NAME;
Example (Complex – w/ OLAP function and Having clause)
The following SELECT statement uses a nested OLAP function and a HAVING clause to display those Partkeys that appear in the top 10 percent of profitability in 10 or more Orderkeys. (result: 486 rows)
SELECT L_PARTKEY, COUNT(L_ORDERKEY)
FROM
(SELECT L_ORDERKEY, L_PARTKEY, PROFIT, (QUANTILE(10, PROFIT))
AS PERCENTILE
FROM
(SELECT L_ORDERKEY, L_PARTKEY, (SUM(L_EXTENDEDPRICE)
- COUNT(L_QUANTITY) * 5) AS PROFIT
FROM CONTRACT, ITEM
WHERE CONTRACT.O_ORDERKEY = ITEM.L_ORDERKEY
GROUP BY 1,2) AS ITEMPROFIT
GROUP BY L_ORDERKEY
QUALIFY PERCENTILE = 0) AS TOPTENPERCENT
GROUP BY L_PARTKEY
HAVING COUNT(L_ORDERKEY) >= 10;
Correlated Subqueries
Example (Simple – Correlated Subquery)
The following SELECT statement displays employees who have highest salary in each
department.
(result: 876 rows)
SELECT *
FROM EMPLOYEE AS T1
WHERE SALARY =
(SELECT MAX(SALARY)
FROM EMPLOYEE AS T2
WHERE T1.DEPTNO = T2.DEPTNO);
Case Statement
Description
The CASE expression is used to return alternative values based on search conditions. There are two forms of the CASE expression:
• Valued CASE Expression
- Specify a SINGLE expression to test (equality).
- List the possible values for the test expression that return different results.
- CASE value_expression_1 WHEN value_expression_n THEN scalar_expression_n ELSE scalar_expression_m END (the result is either scalar_expression_n or scalar_expression_m).
• Searched CASE Expression
- You do not specify an expression to test. You specify multiple, arbitrary search conditions that can return different results.
- CASE WHEN search_condition_n THEN scalar_expression_n ELSE scalar_expression_m END
Example (Simple - Valued CASE)
The following SELECT statement displays only total Manufacture #2 Retail Price. (result: 1 row)
SELECT SUM(CASE P_MFGR WHEN 'Manufacturer#2' THEN P_RETAILPRICE
ELSE 0 END)
FROM PRODUCT
Example (Complex - Searched CASE)
The following SELECT statement displays the top product type(s) that has a 25% or higher
return rate. (result: 66 rows)
SELECT T1.P_TYPE,
  SUM(CASE WHEN (T2.L_RETURNFLAG = 'R')
      THEN (T2.L_QUANTITY) ELSE (0) END) AS "RETURNED",
  SUM(CASE WHEN (T2.L_RETURNFLAG <> 'R')
      THEN (T2.L_QUANTITY) ELSE (0) END) AS "NOT RETURNED",
  (SUM(CASE WHEN (T2.L_RETURNFLAG = 'R')
      THEN (T2.L_QUANTITY) ELSE (0) END)) /
  (
   (SUM(CASE WHEN (T2.L_RETURNFLAG = 'R')
       THEN (T2.L_QUANTITY) ELSE (0) END)) +
   (SUM(CASE WHEN (T2.L_RETURNFLAG <> 'R')
       THEN (T2.L_QUANTITY) ELSE (0) END))
  ) * 100 AS "% RETURNED"
FROM PRODUCT T1, ITEM T2
WHERE T2.L_PARTKEY = T1.P_PARTKEY
GROUP BY 1
HAVING ("RETURNED" / ("RETURNED" + "NOT RETURNED")) >= .25
ORDER BY 4 DESC, 1 ASC
Example (Advanced - Searched CASE)
The following SELECT statement displays a report to show Month-To-Date, Year-To-Date,
Rolling 365 day, and Inception to Date Metrics. (result: 1 row)
SELECT CURRENT_DATE,
  SUM(CASE WHEN L_SHIPDATE >
      (((((CURRENT_DATE (FORMAT 'YYYY'))(CHAR(4))) || '-' ||
      TRIM((EXTRACT(MONTH FROM CURRENT_DATE)) (FORMAT '99')) || '-01')))(DATE)
      AND L_SHIPDATE < CURRENT_DATE
      THEN L_EXTENDEDPRICE ELSE 0 END) AS MTD,
  SUM(CASE WHEN L_SHIPDATE >
      (((((CURRENT_DATE (FORMAT 'YYYY'))(CHAR(4))) || (('-01-01'))) (DATE)))
      AND L_SHIPDATE < CURRENT_DATE
      THEN L_EXTENDEDPRICE ELSE 0 END) AS YTD,
  SUM(CASE WHEN L_SHIPDATE > CURRENT_DATE - 365
      THEN L_EXTENDEDPRICE ELSE 0 END) AS ROLLING365,
  SUM(CASE WHEN L_SHIPDATE < CURRENT_DATE
      THEN L_EXTENDEDPRICE ELSE 0 END) AS ITD
FROM ITEM;
Sum (SQL-99 Window Function)
Description
Returns the cumulative, group, or moving sum of a value_expression, depending on how the
aggregation group in the SUM function is specified. Cumulative Sum, Group Sum and Moving
Sum syntax is similar, but slight differences are necessary to modify the type of sum desired.
SUM (cumulative): SUM(value_expression) OVER (PARTITION BY value_expression ORDER BY value_expression ASC|DESC ROWS UNBOUNDED PRECEDING)
SUM (group): SUM(value_expression) OVER (PARTITION BY value_expression ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
SUM (moving): SUM(value_expression) OVER (PARTITION BY value_expression ORDER BY value_expression ASC|DESC ROWS width PRECEDING)
Example 1 (Simple – Cumulative Sum)
The following SELECT statement displays the cumulative balance per Orderkey by Shipdate.
(result: 60175 rows)
SELECT
L_ORDERKEY, L_SHIPDATE, L_EXTENDEDPRICE,
SUM(L_EXTENDEDPRICE) OVER (PARTITION BY L_ORDERKEY
ORDER BY L_SHIPDATE ROWS
UNBOUNDED PRECEDING) AS BALANCE
FROM ITEM
ORDER BY L_ORDERKEY, L_SHIPDATE
Example 2 (Complex - Cumulative Sum)
The following SELECT statement displays projected Monthly and YTD Sums for the Extended
Price attribute based on Shipdate. (result: 84 rows)
SELECT T1.L_SHIPDATE (FORMAT 'YYYY-MM')(CHAR(7)), T1.MTD, T1.YTD
FROM (SELECT L_SHIPDATE,
        SUM(L_EXTENDEDPRICE) OVER (PARTITION BY L_SHIPDATE (FORMAT 'YYYY/MM')(CHAR(7))
            ORDER BY L_SHIPDATE ROWS UNBOUNDED PRECEDING) AS MTD,
        SUM(L_EXTENDEDPRICE) OVER (PARTITION BY L_SHIPDATE (FORMAT 'YYYY')(CHAR(4))
            ORDER BY L_SHIPDATE ROWS UNBOUNDED PRECEDING) AS YTD
      FROM (SELECT L_SHIPDATE, SUM(ITEM.L_EXTENDEDPRICE) AS L_EXTENDEDPRICE
            FROM ITEM
            GROUP BY 1) AS T3
     ) AS T1,
     (SELECT MAX(L_SHIPDATE) AS L_SHIPDATE
      FROM ITEM
      GROUP BY L_SHIPDATE (FORMAT 'YYYY-MM') (CHAR(7))
     ) AS T2
WHERE T1.L_SHIPDATE (FORMAT 'YYYY-MM') = T2.L_SHIPDATE (FORMAT 'YYYY-MM')
ORDER BY 1
Rank (SQL-99 Window Function)
Description
Returns an ordered ranking of rows for the value_expression.
Example 1 (Simple - Rank)
The following SELECT statement ranks Clerks by Order Status based on Total Price. (result:
15000 rows)
SELECT O_CLERK, O_ORDERSTATUS, O_TOTALPRICE,
RANK() OVER (PARTITION BY O_ORDERSTATUS ORDER BY O_TOTALPRICE
DESC)
FROM CONTRACT
Example 2 (Complex - Rank)
The following SELECT statement ranks the Monthly Extended Price in descending order within
each year using Monthly and YTD Sums for the Extended Price attribute query above. (result: 84
rows)
SELECT T1.L_SHIPDATE (FORMAT 'YYYY-MM')(CHAR(7)), T1.MTD, T1.YTD,
  RANK() OVER (PARTITION BY T1.L_SHIPDATE (FORMAT 'YYYY')(CHAR(4)) ORDER BY T1.MTD DESC)
FROM (SELECT L_SHIPDATE,
        SUM(L_EXTENDEDPRICE) OVER (PARTITION BY L_SHIPDATE (FORMAT 'YYYY/MM')(CHAR(7))
            ORDER BY L_SHIPDATE ROWS UNBOUNDED PRECEDING) AS MTD,
        SUM(L_EXTENDEDPRICE) OVER (PARTITION BY L_SHIPDATE (FORMAT 'YYYY')(CHAR(4))
            ORDER BY L_SHIPDATE ROWS UNBOUNDED PRECEDING) AS YTD
      FROM (SELECT L_SHIPDATE, SUM(L_EXTENDEDPRICE) AS L_EXTENDEDPRICE
            FROM ITEM
            GROUP BY 1) AS T3
     ) AS T1,
     (SELECT MAX(L_SHIPDATE) AS L_SHIPDATE
      FROM ITEM
      GROUP BY L_SHIPDATE (FORMAT 'YYYY-MM') (CHAR(7))
     ) AS T2
WHERE T1.L_SHIPDATE (FORMAT 'YYYY-MM') = T2.L_SHIPDATE (FORMAT 'YYYY-MM')
ORDER BY T1.L_SHIPDATE (FORMAT 'YYYY')(CHAR(4)), 4
Fast Path Insert
Description
Inserting rows into an empty table requires only one entry in the transient journal for the entire step. Rows are inserted at the block level rather than as individual rows. By "stacking" multiple insert statements into one request, all insert steps in the request participate in the single fast path insert. The BTEQ convention is to place the semicolon (;) in front of each subsequent insert statement.
Example
The following multi-statement request allows all of the inserts to participate in the fast path insert for an empty target table (the semicolon at the start of the second INSERT keeps both statements in one request):
INSERT INTO TABLE_A
SELECT *
FROM TABLE_A1
;INSERT INTO TABLE_A
SELECT *
FROM TABLE_A2
;
Fast Path Delete
A fast path delete occurs when the optimizer recognizes an unqualified DELETE that is not inside an uncommitted unit of work. In this scenario, it does not journal the deleted rows in the transient journal. The logic behind this decision is that, even if the system restarts, the transaction has already been authorized to run to completion and eliminate all the rows; since the deleted rows will never need to be "restored", the system skips the overhead of copying them into the transient journal.
In fact, all that a DELETE ALL does is re-chain the internal data blocks onto the "data block free chain", an operation that is not affected by the number of rows in the table.
If a transaction is created with succession of delete statements, it will not be able to take advantage of
this feature except (possibly) for the last delete. The reason for this is that, by definition, all statements
within a transaction must either succeed or fail. Teradata must be able to recover or undo the changes.
To do this, the transient journal must record the "before image" of all data modified by the transaction.
A transaction can be created in several ways:
o) All SQL statements inside a macro run as a single transaction.
o) A multi-statement request is also a transaction.
o) Statements executed in parallel (F9) within SQL Assistant operate as a transaction. Note that this is
not the case for the other execute, F5: "Execute the statements one step at a time".
Within a transaction, you can take advantage of the fast path delete if the following requirements are met:
o) In ANSI mode, the (unqualified) DELETE must be immediately followed by a COMMIT.
o) In Teradata mode (IMPLICIT), the (unqualified) DELETE must be the last (or only) statement in the
transaction.
o) In Teradata mode (EXPLICIT), the (unqualified) DELETE must be immediately followed by the END
TRANSACTION.
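As a minimal sketch (Teradata mode with an explicit transaction; the table names are hypothetical), the unqualified DELETE below is eligible for the fast path because it is the last statement before the END TRANSACTION; in ANSI mode the equivalent is an unqualified DELETE immediately followed by COMMIT:

BT;                                  -- explicit Teradata-mode transaction
UPDATE purge_audit
SET purged_date = CURRENT_DATE
WHERE table_name = 'BIG_STAGE';      -- earlier work in the transaction is journaled as usual
DELETE FROM big_stage;               -- unqualified DELETE as the final statement of the transaction
ET;                                  -- fast path: the deleted rows are not copied to the transient journal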
Section 5.2 Set vs. Procedural
There are two ways to state an operation on data:
– Set: “Find the Seven of Spades.”
– Procedural: “Look at each card; if one of them is the Seven of Spades then show it
to me (and, optionally, stop looking).”
Teradata can process set expressions (SQL) in parallel, but procedural statements result in serial, row-at-a-time processing.
• Think in SETs of data -- not one row at a time.
• A single insert/update/delete on Teradata is painfully slow compared to its parallel capabilities.
• Use TPT to get data into the database.
• SQL is the only way to make Teradata use its parallelism, and most things can be done in SQL.
So, how would one approach this?
1 Locate an update DML statement
– Insert/Update/Delete
2 Trace each column involved back to its source in a table
– Change IF … ELSIF … ELSE ... ENDIF to
CASE […] WHEN … WHEN … ELSE … END
3 Locate the conditions under which the update occurs
– Change IF … to WHERE …
4 Repeat for each update DML statement
As an example, the following is what we would typically encounter in working with an Oracle
based application:
Cursor: select order.nr, order.amt into :order_nr, :order_amount from order;
while there_is_data do
fetch from Cursor;
if :rundate < current_date then
if :order_amount < 0 then
:OrdAmt = 0
else
:OrdAmt = :order_amount
endif;
insert into Orders values ( :order_nr, :OrdAmt );
endif;
...
enddo;
This is how you would actually want this executed in Teradata:
insert into Orders
select order.nr
      ,case when order.amt < 0 then 0
            else order.amt
       end
from order
where :rundate < current_date
;
A nested loop is usually a join, so why not do it in SQL? Sub-processes or nested processes must be analyzed as nested loops and brought into SQL. Examples in code are multiple subroutines within a package or program, functions in C or PL/SQL, and nested loops.
Another example:
cursor TXN_CUR: select Amt, AcctNum from Acct_Transactions
                where AcctNum = acct_num;
cursor ACCT_CUR: select AcctNum, acct_balance from Accts;
for acct_rec in ACCT_CUR
loop
    acct_num := acct_rec.AcctNum;
    acct_bal := acct_rec.acct_balance;
    for txn_rec in TXN_CUR
    loop
        acct_bal := acct_bal + txn_rec.Amt;
    end loop;
    UPDATE Accts
    set acct_balance = acct_bal
    where AcctNum = acct_num;
end loop;
And how it should actually be written:
UPDATE Accts
FROM (
    SELECT AcctNum, SUM(Amt) FROM Acct_Transactions
    GROUP BY AcctNum
) Txn_Sum ( acct, total_Amt )
SET acct_balance = acct_balance + Txn_Sum.total_Amt
WHERE Accts.AcctNum = Txn_Sum.acct ;
What to do about other complications?
• PL/SQL Functions and similar features
– If it accesses the database, you might be able to treat it like a sub-procedure and/or turn it into a derived table.
– If it just does logic, then it can be a CASE operation or a UDF.
• Processes nested several layers deep
– Will take a long time to sort out.
– Try to start from the original specs.
• When all else fails,
– You might be able to use Teradata SPL (not parallel).
– Or create an aggregate table.
An excellent source of information on this subject is George Coleman's blog at http://developer.teradata.com/blog/georgecoleman, a series of articles that goes into more depth on this topic.
Section 5.3 Statistics collection “cheat sheet”
The following statistics collection recommendations are intended for sites that are on any of the Teradata 13.10 software release levels. Most of these recommendations apply to releases earlier than Teradata 13.10; however, some may be specific to Teradata 13.10 only.
Collect Full Statistics
• Non-indexed columns used in predicates
• Single-column join constraints, if the column is not unique
• All NUSIs (but drop NUSIs that aren't needed/used)
• USIs/UPIs only if used in non-equality predicates (range constraints)
• Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
• Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
• PARTITION for all tables, whether partitioned or not. Collecting PARTITION statistics on tables that have grown supports more accurate statistics extrapolations. Collecting on PARTITION is an extremely quick operation (as long as the table is not over-partitioned).
Can Rely on Dynamic AMP Sampling
• This is sometimes referred to as a random AMP sample since, in early releases of Teradata, a random AMP was picked for obtaining the sample; however, in current releases, the term dynamic AMP sample is more appropriate.
• USIs or UPIs if only used with equality predicates
• NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under "Other Considerations" below)
• See "Other Considerations" for additional points related to dynamic AMP sampling
Option to use USING SAMPLE
• Unique index columns
• Nearly-unique columns or indexes (any column that is over 95% unique is considered nearly unique)
Collect Multicolumn Statistics
• Groups of columns that often appear together with equality predicates, if the first 16 bytes of the concatenated column values are sufficiently distinct. These statistics are used for single-table estimates.
• Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them. With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more the combination of actual values is correlated, the greater the value of collecting multicolumn statistics.
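As a minimal sketch of how these recommendations translate into COLLECT STATISTICS statements (the table, column, and index names below are hypothetical), a partitioned transaction table might be covered like this:

COLLECT STATISTICS ON sales_txn COLUMN PARTITION;            -- PARTITION statistics, cheap and recommended for all tables
COLLECT STATISTICS ON sales_txn COLUMN txn_date;             -- the partitioning column
COLLECT STATISTICS ON sales_txn COLUMN store_id;             -- non-indexed column used in predicates
COLLECT STATISTICS ON sales_txn INDEX (cust_id);             -- a NUSI
COLLECT STATISTICS ON sales_txn COLUMN (cust_id, txn_date);  -- multicolumn statistics for a correlated join/aggregation pair
COLLECT STATISTICS USING SAMPLE ON sales_txn COLUMN txn_id;  -- nearly-unique column, sampled statistics

Re-collection after significant table growth is simply a matter of re-running the same statements.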
Other Considerations
• Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination are not chosen unless statistics have been collected on the relevant columns.
• NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far off from being 75% unique (for example, it is 90% unique, or on the other side, 60% unique or less) benefits from having statistics collected, including a NUPI composed of multiple columns, regardless of the length of the concatenated values. However, if it is close to being 75% unique, dynamic AMP samples are adequate. To determine the uniqueness of a NUPI before collecting statistics, you can issue this SQL statement:
EXPLAIN SELECT DISTINCT nupi-column FROM table;
• For a partitioned primary index table, it is recommended that you always collect statistics on:
o PARTITION. This tells the optimizer how many partitions are empty, and how many rows are in each partition. This statistic is used for optimizer costing.
o The partitioning column. This provides cardinality estimates to the optimizer when the partitioning column is part of a query's selection criteria.
• For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table's primary index (PI):
o (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
o (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey-based merge join costing.
• Dynamic AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). For small tables, with less than 25 rows per AMP, all-AMP sampling is done automatically. It is also the default for volatile tables and sparse join indexes. All-AMP sampling comes with these tradeoffs:
o Dynamic all-AMP sampling provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large) and the NUPI has an uneven distribution of values.
o Statistics extrapolation for any column in a table is triggered only when the optimizer detects that the table has grown. The growth is computed by comparing the current row count with the last row count known to the optimizer. If the default single-AMP dynamic sampling estimate of the current row count is not accurate (which can happen if the primary index is skewed), it is recommended to enable all-AMP sampling or to re-collect PARTITION statistics.
o Parsing times for queries may increase when all AMPs are involved, as the queries that perform dynamic AMP sampling have slightly more work to do. Note that dynamic AMP samples stay in the dictionary cache until the periodic cache flush, or until they are purged from the cache for some reason. Because they can be retrieved once and re-used multiple times, dynamic all-AMP sampling is not expected to cause additional overhead for all query executions.
• For temporal tables, follow all collection recommendations made above. However, statistics are currently not supported on BEGIN and END period types. That capability is planned for a future release.
These recommendations were compiled by: Carrie Ballinger, Rama Krishna Korlapati, Paul
Sinclair
Section 5.4 Reserved words
Please refer to Appendix B of the Teradata Database SQL Fundamentals manual for a list of restricted words.
Section 5.5 Orange Books and Suggested Reading
The Teradata documentation manuals supplied with the installation and available at
www.info.teradata.com are quite readable. Of particular interest are the “Teradata Database
Design” and “Teradata Performance Management” manuals.
In addition to the white papers and Teradata Magazine articles available at www.teradata.com, Teradata Partners have access to the Teradata orange book library via Teradata at Your Service at http://www.teradata.com/services-support/teradata-at-your-service. To access the orange books, enter "orange book" in the text box under "Search Knowledge Repositories" and then click "Search".
The Teradata Developer Exchange website, http://developer.teradata.com/, offers a wide variety
of blogs, articles, and information related to Teradata. It also offers a download section.
Here is a list of orange books that are of particular interest for migrating and tuning applications on the Teradata database. In addition, there are a number of orange books, not listed here, that address controlling and administering mixed workload environments and other subjects.
“Understanding Oracle and Teradata Transactions and Isolation Levels for Oracle Migrations.”
When migrating applications from Oracle to Teradata, the reduced isolation levels used by the
Oracle applications need to be understood before the applications can be ported or redesigned to
run on Teradata. This Orange Book will describe transaction boundaries, scheduling, and the
isolation levels available in Oracle and in Teradata. It will suggest possible solutions for coping
with incompatible isolation levels when migrating from Oracle to Teradata.
“ANSI MERGE Enhancements.” This is an overview of the support of full ANSI MERGE
syntax for set inserts, updates and upserts into tables. This includes an overview of the batch
error handling capabilities.
“Single-Level and Multilevel Partitioning.” Usage considerations, examples, recommendations,
and technical details for tables and noncompressed join indexes with single-level and multilevel
partitioned primary indexes.
“Collecting Statistics Teradata Database V2R62.” Having statistical information available is
critical to optimal query plans, but collecting statistics can involve time and resources. By
combining several different statistics gathering strategies, users of the Teradata database can find
the correct balance between good query plans and the time required to ensure adequate statistical
information is always available. Note: Although the title specifies Teradata V2R6.2, this orange
book is also applicable to Teradata V12 and beyond.
“Implementing Tactical Queries the Basics Teradata Database V2R61.” Tactical queries support
decision-making of an immediate nature within an active data warehouse environment. These
response time-sensitive queries often come with clear service level expectations. This orange book
addresses supporting tactical queries with the Teradata database. Note: Although the title
specifies Teradata V2R6.1, this orange book is also applicable to Teradata V12 and beyond.
“Feeding the Active Data Warehouse.” Active ingest is one of the first steps that needs to be
considered in evolving towards an active data warehouse. There are several proven approaches to
active ingest into a Teradata database. This orange book reviews the approaches, their pros and cons,
and implementation guidelines.
“Introduction to Materialized Views.” Materialized views are implemented as join indexes in
Teradata. Join indexes can be used to improve the performance of queries at the expense of
update performance and increased storage requirements.
“Reserved QueryBand Names for Use by Teradata, Customer and Partner Applications.” The
Teradata 12 feature known as Query Bands provides a means to set name/value pairs across individual database connections at a session or transaction level to provide the database with significant information about the connection's originating source. This provides a mechanism for applications to collaborate with the underlying Teradata Database in order to provide for better workload management, prioritization and accounting.
“Teradata Active System Management.” As DBAs and other support engineers attempt to
analyze, tune and manage their environment's performance, these new features will greatly ease that effort by centralizing management tasks under one domain, providing automation of certain management tasks, improving visibility into management-related details, and by introducing management and monitoring by business-driven, workload-centric goals.
“Stored Procedures Guide.” This Orange Book provides an overview of stored procedures and some basic examples for getting started.
“User Defined Functions” and “Teradata Java User Defined Functions User’s Guide” are two
Orange Books for digging deeper into C/C++ and Java UDFs. Both guides explain the UDF
architecture and how a UDF is created and packaged for use by Teradata.
K-means clustering and Teradata 14.10 table operators. Article by Watzke on 17 Sep 2013. Teradata Developer Exchange. http://developer.teradata.com/extensibility/articles/k-means-clustering-and-teradata-14-10-table-operators-0
Here are a few white papers and Teradata manuals that are of particular interest for migrating and tuning applications on the Teradata database. A list of available white papers can be found at http://www.teradata.com/t/resources.aspx?TaxonomyID=4533. The Teradata manuals are found in the latest documentation set available at www.info.teradata.com.
“Implementing AJIs for ROLAP.” This white paper describes how to build and implement
ROLAP cubes on the Teradata database.
http://www.teradata.com/t/article.aspx?id=1644
“Teradata Database Queue Tables.” This white paper describes the Queue Tables feature that
was introduced in Teradata® Database V2R6. This feature enables new and improved ways of
building applications to support event processing use cases.
http://www.teradata.com/t/article.aspx?id=1660
“Oracle to Teradata Migration Technical Info.” This document is based on a true practical
experience in migrating data between Oracle and Teradata. This document contains a description
of the technicalities involved in the actual porting of the software and data, as well as some templates and tools that are useful for projects of this nature. This is available on the
Teradata Partner Portal.
[1] Blog entry by carrie on 27 Aug 2012.