A Sample Event Consumer Program For Updating Netezza with DB2 Changed Data
Introduction
This article introduces a sample program that retrieves DB2 changed data in a delimited values format from
WebSphere MQ and puts the changed data into Netezza. The sample is based on the event publishing function
found in the IBM InfoSphere Data Replication (IIDR) Q Replication technology.
Event Publishing captures changed data from a database transaction log and makes it available for applications
to consume. The consuming applications may range from in-house programs to off-the-shelf ETL tools, such as
InfoSphere DataStage, that perform complex transformations on the data.
Netezza appliances deliver advanced complex analytics functions to enterprise data warehousing and business
intelligence applications. As more applications are processed by the appliance, the need to move data to the
Netezza appliance increases.
Sample Scenario
Use cases for the sample event consumer program include active-active and data warehousing environments
maintained by the Q Replication technology of the IIDR product. The sample scenario used in this article is a
data warehousing scenario with site 1 as the DB2 for z/OS production environment and site 2 as a DB2 for z/OS
server maintained by Q Replication. Site 2 also runs a Netezza appliance. In the picture below, unidirectional
replication is used to maintain the DB2 for z/OS on site 2. A capture program on site 2 feeds DB2 changed data
into the Netezza Appliance. In active-active scenarios with bidirectional replication between the two sites, a
capture would already be running on site 2.
A benefit of scenarios that rely on a Q Apply program to maintain the DB2 for z/OS on site 2 is the ability to
perform the initial load of the Netezza database without impacting the DB2 for z/OS production environment
on site 1. The procedure is described in more detail in a section below.
The solution at a glance
To feed DB2 changed data into a Netezza Appliance, you need a Q Capture program (which provides the event
publishing function) and the sample event consumer program called asnconsumer introduced in this article.
The sample program runs on USS (Unix System Services for z/OS), UNIX, Linux or Windows.
The high-level solution includes three main stages. In the first stage, you identify the tables for which changed
data is to be captured. The Q Capture program's event publishing function puts committed transactions into a
local WebSphere MQ queue.
The second and third stages run independently of each other and run continuously. They are delivered by the
main threads of the event consumer sample program.
The three stages are shown in the diagram below:
In the second stage, the event consumer sample program reads the queue and writes data to the local file
system (USS, UNIX, etc.). One file is created for each table with changed data. The program reads data from
the queue for a user-defined amount of time (default is five minutes). At the end of that time, reading of
WebSphere MQ data stops temporarily, the files are moved to a “refresh” directory, and the reading of MQ data
starts again.
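To make the second stage concrete, the sketch below shows one way such a batch window could be implemented with the IBM MQ classes for Java. It is a minimal illustration, not the actual asnconsumer code: the queue manager name, queue name, directory paths, and the assumption that the first delimited field identifies the source table are all placeholders.

import com.ibm.mq.MQException;
import com.ibm.mq.MQGetMessageOptions;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

public class BatchWindowSketch {
    static final Path WORK_DIR = Paths.get("/u/consumer/work");        // hypothetical working directory
    static final Path REFRESH_DIR = Paths.get("/u/consumer/refresh");  // hypothetical refresh directory
    static final long WINDOW_MS = 5L * 60 * 1000;                      // default batch window: five minutes

    public static void main(String[] args) throws Exception {
        MQQueueManager qmgr = new MQQueueManager("QM1");                // assumed queue manager name
        MQQueue queue = qmgr.accessQueue("ASN.PUBQ",                    // assumed publishing queue name
                CMQC.MQOO_INPUT_AS_Q_DEF | CMQC.MQOO_FAIL_IF_QUIESCING);

        while (true) {
            long windowEnd = System.currentTimeMillis() + WINDOW_MS;
            Map<String, FileWriter> files = new HashMap<>();

            // Read delimited messages until the batch window expires.
            while (System.currentTimeMillis() < windowEnd) {
                MQMessage msg = new MQMessage();
                MQGetMessageOptions gmo = new MQGetMessageOptions();
                gmo.options = CMQC.MQGMO_WAIT | CMQC.MQGMO_CONVERT;
                gmo.waitInterval = 5000;                                // poll in 5-second slices
                try {
                    queue.get(msg, gmo);
                } catch (MQException e) {
                    if (e.reasonCode == CMQC.MQRC_NO_MSG_AVAILABLE) {
                        continue;                                       // nothing to read yet
                    }
                    throw e;
                }
                byte[] data = new byte[msg.getDataLength()];
                msg.readFully(data);
                String row = new String(data, "UTF-8");
                // Assumption for this sketch: the first delimited field names the source table.
                String table = row.substring(0, row.indexOf(','));
                FileWriter out = files.computeIfAbsent(table,
                        t -> openWriter(WORK_DIR.resolve(t + ".dat")));
                out.write(row);
                out.write(System.lineSeparator());
            }

            // Close each per-table file and move it to the refresh directory for stage three.
            for (Map.Entry<String, FileWriter> entry : files.entrySet()) {
                entry.getValue().close();
                Path dataFile = WORK_DIR.resolve(entry.getKey() + ".dat");
                Files.move(dataFile, REFRESH_DIR.resolve(dataFile.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    private static FileWriter openWriter(Path path) {
        try {
            return new FileWriter(path.toFile(), true);                 // append mode
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}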
In the third stage, the event consumer sample program relies on JDBC to load the refresh directory data files
into changed data tables in Netezza (the TMPCD_XXX tables in the picture) and to call a SQL procedure to
perform the necessary delete and insert statements for the target tables (REFRESH_SQL procedure). The event
consumer sample program does not issue update statements against the Netezza target database.
If there is no data to process, the sample program simply waits for data to be available. The event consumer
sample program runs continually until you stop it with a Ctrl-C user interrupt.
Nothing runs on Netezza other than two SQL procedures: the REFRESH_SQL procedure mentioned above and
the ADDSTMTS procedure that defines a Netezza end-user table to the event consumer program. Instructions
on how to create the SQL procedures (and required SQL tables) are provided in the last section of this article.
The event consumer program creates all its files on the local file system (USS) of system 2, not in Netezza.
Initial Load of the Netezza Target Database
Before you start the asnconsumer sample, you need to perform an initial load of the Netezza end-user tables with
data from the DB2 source using your own load procedures. The sample scenario described above delivers a
simple solution to coordinate the load process with concurrent updates being made to the DB2 tables without
impacting the DB2 source production environment.
When the Q Apply program maintains the DB2 for z/OS at site 2, you can find a reliable point in time in DB2
from which to load Netezza. Stop the Q Apply program at a particular point in time, load the data into Netezza
using your own load procedure and restart the Q Apply process when the load is completed for all tables. Since
the Q Apply program delivers transaction consistency, the procedure guarantees that Netezza is initially loaded
with DB2 data that preserves transaction consistency without impacting DB2 on site 1. Changes will be
accumulating in the WebSphere MQ queue on site 2 while Q Apply is stopped but they will be picked up by Q
Apply when it comes back up.
Updating Netezza (more on the third stage)
The event consumer sample program runs in the same system where DB2 and WebSphere MQ run (site 2 in
the diagrams of this article) and issues JDBC calls to access the (remote) Netezza target database.
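For reference, a JDBC connection to the Netezza target could be obtained roughly as follows; the host name, port, database, and credentials are placeholders, and the Netezza JDBC driver (nzjdbc.jar) is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;

public class NetezzaConnectSketch {
    public static Connection connect() throws Exception {
        Class.forName("org.netezza.Driver");                    // Netezza JDBC driver class
        String url = "jdbc:netezza://nzhost:5480/TARGETDB";     // assumed host, port, and database
        Connection con = DriverManager.getConnection(url, "nzuser", "nzpassword");
        con.setAutoCommit(false);                               // commit once per refresh cycle
        return con;
    }
}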
Data file names in the local file system are hardcoded as XXX.dat and the changed data table names are
hardcoded as TMPCD_XXX in Netezza, where XXX is the original DB2 table name (without the schema name).
Changed data tables are only created when there is changed data for a table and are defined as temporary
tables in the Netezza database.
The sample program uses Netezza “external” tables to load the Netezza changed data tables from data files
(step 3a in the diagram below). External tables process bulk loads of data quite efficiently and the sample
program loads each changed data table one at a time.
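As a rough illustration of this load step (not the sample's actual code), the snippet below bulk-inserts one delimited data file into a changed data table through a transient Netezza external table. The table name, file path, and delimiter are assumptions, and the real changed data tables also carry columns, such as the operation type and before/after images, that are not shown here.

import java.sql.Connection;
import java.sql.Statement;

public class ExternalTableLoadSketch {
    public static void loadChangedData(Connection con, String table, String dataFile)
            throws Exception {
        // REMOTESOURCE 'JDBC' tells Netezza that the data file lives on the JDBC client
        // (the local USS/UNIX file system), not on the appliance host.
        String sql = "INSERT INTO TMPCD_" + table
                   + " SELECT * FROM EXTERNAL '" + dataFile + "'"
                   + " USING (DELIMITER ',' REMOTESOURCE 'JDBC')";
        try (Statement stmt = con.createStatement()) {
            stmt.executeUpdate(sql);
        }
    }
}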
Once the changed data tables are loaded, the sample program calls a SQL procedure named REFRESH_SQL to
refresh the Netezza end-user tables. The procedure issues a single delete and a single insert SQL statement for
each table (step 3b in the diagram below) and each statement includes its own subquery that identifies the
correct set of changed data rows to apply.
After delete and insert statements are executed for the end-user tables, a SQL commit is executed, data files
that were processed are deleted and temporary changed data tables are dropped.
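A sketch of that sequence over JDBC is shown below. The single-parameter signature of REFRESH_SQL used here is an assumption for illustration; the actual procedure is created by the setup scripts shipped in asnconsumer.zip.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.Statement;

public class RefreshCycleSketch {
    public static void refreshTable(Connection con, String table, String dataFile)
            throws Exception {
        // Assumed signature: REFRESH_SQL(table name); the real parameter list may differ.
        try (CallableStatement cs = con.prepareCall("{call REFRESH_SQL(?)}")) {
            cs.setString(1, table);
            cs.execute();                                        // one delete and one insert per table
        }
        con.commit();                                            // commit the refresh as a unit
        Files.deleteIfExists(Paths.get(dataFile));               // remove the processed data file
        try (Statement stmt = con.createStatement()) {
            stmt.executeUpdate("DROP TABLE TMPCD_" + table);     // drop the temporary changed data table
        }
    }
}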
The diagram below shows how a Netezza table is updated by populating a temporary staging table (step 3a)
and by issuing one delete and one insert SQL statement against the table (step 3b):
The delete statement is always executed before the insert statement and a DB2 update statement is processed
as a delete/insert pair (a Netezza database optimization).
The delete subquery ('deleteSubquery' in the diagram above) associated with the delete statement derives the
set of rows to delete from the end-user table based on the data present in the temporary TMPCD_XXX table
and the Netezza end-user table. Likewise, the insert subquery ('insertSubquery') derives the set of rows to
insert into the end-user table.
The subqueries rely on the existence of a primary key for each Netezza end-user table. The Netezza database
does not enforce primary key constraints, but it does allow them to be defined.
The table below explains the logic used by the delete and insert subqueries to find the qualifying rows in the
temporary TMPCD_XXX table; the delete subquery produces two answer sets. The qualifying rows may be a
subset of the initial TMPCD_XXX table rows, because an optimization minimizes the set of rows that needs to
be processed:
Subquery: Delete
Qualifying rows from the temporary TMPCD_XXX table: Insert or Update changed data rows whose after-image
key values match the key values of a row in the Netezza end-user table (that is, they exist in the Netezza
end-user table).
Rationale: Remove the record from the Netezza end-user table so that the Insert or Update statement succeeds
when it is executed after the delete.

Subquery: Delete
Qualifying rows from the temporary TMPCD_XXX table: Update or Delete changed data rows whose before-image
key values match the key values of a row in the Netezza end-user table.
Rationale: Remove the “old” record value of an Update statement and remove a regular Delete record from the
Netezza end-user table.

Subquery: Insert
Qualifying rows from the temporary TMPCD_XXX table: Insert or Update changed data rows whose after-image
key values do not already exist in the Netezza end-user table. Optimization: if the same after-image key values
exist in another row of the TMPCD_XXX table, the key values are those of the most recent row.
Rationale: Add a regular Insert record and add the “new” record value of an Update statement. Optimization:
when multiple rows exist for the same key values in the TMPCD_XXX table, only the most current row is
retained. Netezza analytic functions are used to build an intermediate answer set that filters redundant Insert
or Update rows before the qualifying-row condition is applied.
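As an illustration of the shapes these subqueries can take, consider a hypothetical end-user table MYTABLE with key column KEYCOL and data column DATACOL, and a changed data table TMPCD_MYTABLE with assumed columns OPTYPE (I, U, or D), SEQ (a log sequence value), KEYCOL (after-image key), and B_KEYCOL (before-image key). The statements below are only a sketch; the actual statements are generated by the scripts shipped with the sample.

public class RefreshStatementShapes {
    // Illustrative only: the column names OPTYPE, SEQ, KEYCOL, and B_KEYCOL are assumptions.

    // Delete rows whose keys are touched by the changed data (after-image keys of
    // Insert/Update rows, before-image keys of Update/Delete rows).
    static final String DELETE_STMT =
        "DELETE FROM MYTABLE"
      + " WHERE EXISTS (SELECT 1 FROM TMPCD_MYTABLE C"
      + "                WHERE (C.OPTYPE IN ('I','U') AND C.KEYCOL   = MYTABLE.KEYCOL)"
      + "                   OR (C.OPTYPE IN ('U','D') AND C.B_KEYCOL = MYTABLE.KEYCOL))";

    // Insert the most recent Insert/Update row per key (ROW_NUMBER is a Netezza
    // analytic function) whose key no longer exists in the end-user table.
    static final String INSERT_STMT =
        "INSERT INTO MYTABLE (KEYCOL, DATACOL)"
      + " SELECT X.KEYCOL, X.DATACOL"
      + "   FROM (SELECT C.*, ROW_NUMBER() OVER (PARTITION BY C.KEYCOL"
      + "                                        ORDER BY C.SEQ DESC) AS RN"
      + "           FROM TMPCD_MYTABLE C"
      + "          WHERE C.OPTYPE IN ('I','U')) X"
      + "  WHERE X.RN = 1"
      + "    AND NOT EXISTS (SELECT 1 FROM MYTABLE T WHERE T.KEYCOL = X.KEYCOL)";
}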
The delete and insert statements with their subqueries are issued by the Netezza SQL procedure called
REFRESH_SQL. The next section shows you how to download the asnconsumer zip file, which includes the script
you need to run to create the REFRESH_SQL procedure, and highlights the other tasks you need to complete
before you run the event consumer sample.
Putting it all together: the asnconsumer.zip file
Before you run the event consumer sample program, a Q Replication environment must be up and running for
a DB2 source database. Instructions on how to complete the prerequisite steps and configure your environment are provided in
a README file included in the asnconsumer zip file that you first need to download from the Samples section of
the Q+SQL Replication Forum on developerWorks.
To configure your environment after downloading the asnconsumer.zip file, follow the steps below:
1. Unzip the asnconsumer.zip file by running the command: unzip asnconsumer.zip
2. Follow the instructions provided in the README file, which covers the following topics for the event consumer
sample program:
◦ Prerequisites
◦ Contents of the asnconsumer.zip file
◦ Install Tasks
◦ One-time Setup
◦ Per Source Table Configuration (required for each Netezza end-user table maintained by the asnconsumer sample)
◦ Logic Overview (includes a description of the asnconsumer sample threads)
◦ Sample Invocation
◦ Parameters
The asnconsumer.zip file includes all the script files you need to run the asnconsumer sample program in your
environment. It also includes the asnconsumer.jar file (executable) and the asnconsumer.java file (source code).
If you have a non-Netezza database and would like to change the asnconsumer sample logic to support it,
follow the instructions in the Install Tasks section of the README on how to build a new asnconsumer.jar file
after making changes to the original asnconsumer.java source.