Populating the Data Warehouse (ETL)

Extract, Transform, Load
Agenda
• Review
  - Analysis (Bus Matrix, Info Package)
  - Logical Design (Dimensional Modeling)
  - Physical Design (Spreadsheet)
  - Implementation (Data Mart Relational Tables)
• ETL Process Overview
• ETL Components
  - Staging Area
  - Extraction
  - Transformation
  - Loading
• Documenting High-Level ETL Requirements
• Documenting Detailed ETL Flows
• Example ETL

Review: Dimensional Modeling
Review: DM Implementation
DimStudent
CREATE TABLE DimStudent(
    student_sk         int identity(1,1),
    student_id         varchar(9),
    firstname          varchar(30),
    lastname           varchar(30),
    city               varchar(20),
    state              varchar(2),
    major              varchar(6),
    classification     varchar(25),
    gpa                numeric(3, 2),
    club_name          varchar(25),
    undergrad_school   varchar(25),
    gmat               int,
    undergrad_or_grad  varchar(10),
    CONSTRAINT dimstudent_pk PRIMARY KEY (student_sk));
GO

FactEnrollment
CREATE TABLE FactEnrollment(
    student_sk    int,
    class_sk      int,
    date_sk       int,
    professor_sk  int,
    course_grade  numeric(2, 1),
    CONSTRAINT factenrollment_pk PRIMARY KEY (student_sk, class_sk, date_sk, professor_sk),
    CONSTRAINT factenrollment_student_fk FOREIGN KEY (student_sk) REFERENCES dimstudent (student_sk),
    CONSTRAINT factenrollment_class_fk FOREIGN KEY (class_sk) REFERENCES dimclass (class_sk),
    CONSTRAINT factenrollment_date_fk FOREIGN KEY (date_sk) REFERENCES dimtime (date_sk),
    CONSTRAINT factenrollment_professor_fk FOREIGN KEY (professor_sk) REFERENCES dimprofessor (professor_sk));
GO

Review: Physical DW Design
ETL Overview
• Reshaping relevant data from source systems into useful information stored in the DW
• Extract
  - Copying and integrating data from OLTP and other data sources in preparation for cleansing and loading into the DW
• Transform
  - Cleaning and converting data to prepare it for loading into the DW
• Load
  - Putting cleansed and converted data into the DW

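The three stages can be sketched end to end in a few lines. This is a minimal illustration in Python, with an in-memory SQLite database standing in for the DW. The table name dim_professor_demo, the sample rows, and the cleansing rules are assumptions made for the example, not part of the course scripts.

```python
import sqlite3

def run_etl(source_rows):
    """Extract, transform, and load hypothetical professor rows into an in-memory DW table."""
    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE dim_professor_demo (professor_id TEXT, lastname TEXT, rank TEXT)")

    # Extract: copy the rows out of the (simulated) source system.
    extracted = list(source_rows)

    # Transform: trim stray whitespace, standardize name casing, default missing ranks.
    transformed = [
        (prof_id.strip(), lastname.strip().title(), (rank or "unknown").strip().lower())
        for prof_id, lastname, rank in extracted
    ]

    # Load: write the cleansed rows into the warehouse table.
    dw.executemany("INSERT INTO dim_professor_demo VALUES (?, ?, ?)", transformed)
    dw.commit()
    return dw

# Hypothetical source rows: (prof_id, lastname, rank)
source = [(" p1 ", "green", "ASST"), ("p2", "  lee", None)]
dw = run_etl(source)
loaded = dw.execute("SELECT * FROM dim_professor_demo ORDER BY professor_id").fetchall()
print(loaded)
```

In a real warehouse each stage would usually write to its own staging tables so that a failed step can be rerun without going back to the source system.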
ETL Process
• Not really new, BUT…
  - Much more data
  - Includes rearranging, summarizing
  - Data used for strategic decision-making
• Characteristics:
  - Process AND technology
  - Detailed, highly dependent tasks
  - Consumes, on average, 75% of DW development effort
  - An ongoing process for the life of the DW
• Requirements:
  - Well-documented
  - Automated
  - Flexible

ETL Process
1. Determine target data
2. Determine data sources
3. Prepare data mapping
4. Organize data staging area
5. Establish data extraction rules
6. Establish data transformation rules
7. Plan aggregate tables
8. Establish data load procedures
9. Load dimension tables
10. Load fact tables
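ETL Process Flow
[Figure: ETL process flow diagram; each numbered ETL process step is annotated with the artifact or tool used for it]
• Step 1: Dim Model
• Steps 2, 3: Spreadsheet
• Step 4: (no annotation)
• Step 5: SSIS
• Steps 6, 7: Map & SSIS
• Steps 8, 9, 10: SSIS
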
ETL Staging Area
• Information hub, facilitating the enriching stages that data goes through to populate a DW
• Advantages:
  - Separates source systems and DW
  - Minimizes ETL impact on source AND DW systems
• Can consist of multiple “hubs”
  - “upload” area
  - “staging” area
  - “DW load images”

ETL Staging Area, cont…
High Level Design of ETL Process
• Initial documentation of:
  - What data do we need and where is it coming from?
      - Physical DW Design Spreadsheet shown previously
  - What are the major transformation/cleansing needs?
      - “Extend” Physical DW Design Spreadsheet OR ETL Map
  - What’s the sequence of activities for ETL?
      - ETL Map

Common Transformations
• Format Revisions
• Key Restructuring, Lookup
• Handling of Null Values
• Decoding fields
• Calculated, Derived values
• Merging of Data

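A few of these transformations can be sketched on a single source record. The example below is illustrative Python only. The classification codes, the zero-padded ID format, the "Unknown" default, and the field names on the source record are assumptions made for the sketch.

```python
# Hypothetical decode table; the real codes would come from the source system.
CLASSIFICATION_CODES = {"FR": "Freshman", "SO": "Sophomore", "JR": "Junior", "SR": "Senior"}

def transform_student(src):
    """Apply several common transformations to one source student record (a dict)."""
    return {
        # Format revision: fixed-width, zero-padded identifier.
        "student_id": str(src["stud_id"]).zfill(9),
        # Handling of null values: replace a missing city with an explicit default.
        "city": src.get("city") or "Unknown",
        # Decoding fields: expand a cryptic code into a readable value.
        "classification": CLASSIFICATION_CODES.get(src.get("class_code"), "Unknown"),
        # Calculated/derived value: build a full name from its parts.
        "full_name": f'{src["firstname"].strip()} {src["lastname"].strip()}',
    }

record = {"stud_id": 1234, "firstname": " Ann", "lastname": "Lee ", "city": None, "class_code": "JR"}
result = transform_student(record)
print(result)
```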
Common Transformations, cont…
• Splitting of single fields
• Character set conversion
• Units of measurement conversion
• Date/time conversion
• Summarization
• Deduplication

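Several of the transformations in this list are small, self-contained functions. The helpers below are a sketch in Python under stated assumptions: names are split on the first space, dates arrive as mm/dd/yyyy, deduplication keeps the first row seen for each key, and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def split_name(full_name):
    """Splitting of single fields: 'First Last' into (first, last)."""
    first, _, last = full_name.strip().partition(" ")
    return first, last.strip()

def to_iso_date(us_date):
    """Date/time conversion: mm/dd/yyyy text into ISO yyyy-mm-dd text."""
    return datetime.strptime(us_date, "%m/%d/%Y").date().isoformat()

def pounds_to_kg(pounds):
    """Units of measurement conversion, rounded to two decimals."""
    return round(pounds * 0.45359237, 2)

def deduplicate(rows, key):
    """Deduplication: keep the first row seen for each key value."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())

def summarize(rows, group_key, value_key):
    """Summarization: total a numeric field per group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)

print(split_name("Jon Yang"), to_iso_date("09/15/2008"), pounds_to_kg(10))
```

Which duplicate to keep (first, last, most complete) is a business rule; keeping the first is only the simplest choice for the sketch.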
Common Transformations, cont…
• Other Data Quality Issues
  - Standardize values
  - Validate values
  - Identifying mismatches, misspellings
  - Etc…
• Suggestions:
  - Appoint “Data Stewards”
  - Ensure ETL programs have control checks
  - Data Profiling…

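One simple form of control check is a row-count reconciliation: every extracted row must end up either loaded or explicitly rejected. The sketch below is illustrative Python. The GPA range rule, the field names, and both function names are assumptions made for the example.

```python
def split_valid_rows(rows):
    """Validate values: keep rows whose gpa is between 0.0 and 4.0, reject the rest."""
    valid, rejected = [], []
    for row in rows:
        gpa = row.get("gpa")
        if gpa is not None and 0.0 <= gpa <= 4.0:
            valid.append(row)
        else:
            rejected.append(row)
    return valid, rejected

def control_check(extracted_count, loaded_count, rejected_count):
    """Control check: fail loudly if rows were lost or duplicated along the way."""
    if extracted_count != loaded_count + rejected_count:
        raise ValueError(
            f"row count mismatch: extracted={extracted_count}, "
            f"loaded={loaded_count}, rejected={rejected_count}"
        )
    return True

rows = [
    {"student_id": "s1", "gpa": 3.2},
    {"student_id": "s2", "gpa": 41.0},   # out of range, likely a typo for 4.1 or 4.0
    {"student_id": "s3", "gpa": None},   # missing value
]
valid, rejected = split_valid_rows(rows)
check_passed = control_check(len(rows), len(valid), len(rejected))
print(len(valid), len(rejected), check_passed)
```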
Comparison of Models
Transformations Example
• DimTime
  - Create table
  - Insert row w/SK = -1
• DimProfessor
  - Generate SK
  - Insert row w/SK = -1
  - Expand rank values (use SQL case)
  - Expand department values (join prof to departments)
• DimClass
  - Generate SK
  - Insert row w/SK = -1
  - Get coursename & cred hrs from section tbl (join section to course)
• DimStudent
  - Generate SK
  - Insert row w/SK = -1
  - Expand classification values (use SQL case)
  - Expand state values (needs lookup table but use SQL case instead)
  - Get gmat, undergrad school from grad table (join student to grad)
  - Get club name from club (join student to undergrad; left join them to club)
  - Create undergrad_or_grad values (if stud_id in undergrad or stud_id in grad)
• FactEnrollment
  - Add SKs: student, section, prof (join registration to student, time, and section dims; left join them to prof)

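The dimension steps (generate SK, expand coded values with CASE, insert the SK = -1 row) can be tried without a SQL Server instance. The sketch below uses Python's sqlite3 module, so the SQL is SQLite dialect rather than T-SQL: AUTOINCREMENT stands in for IDENTITY, and the prof table with its three rows is hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table standing in for the OLTP prof table.
    CREATE TABLE prof (prof_id TEXT, lastname TEXT, rank TEXT);
    INSERT INTO prof VALUES ('p1', 'Green', 'asst'), ('p2', 'Lee', 'prof'), ('p3', 'Diaz', 'visiting');

    -- Generate SK: SQLite assigns professor_sk automatically (IDENTITY in SQL Server).
    CREATE TABLE dimprofessor (
        professor_sk INTEGER PRIMARY KEY AUTOINCREMENT,
        professor_id TEXT,
        lastname     TEXT,
        rank         TEXT);

    -- Expand rank values with CASE; unrecognized codes pass through unchanged.
    INSERT INTO dimprofessor (professor_id, lastname, rank)
    SELECT prof_id, lastname,
           CASE rank WHEN 'asst' THEN 'assistant prof'
                     WHEN 'assc' THEN 'associate prof'
                     WHEN 'prof' THEN 'full prof'
                     ELSE rank END
    FROM prof
    ORDER BY prof_id;

    -- Insert the "unknown" member with SK = -1.
    INSERT INTO dimprofessor (professor_sk, professor_id, lastname, rank)
    VALUES (-1, '-1', 'unknown', 'unknown');
""")
dim_rows = conn.execute(
    "SELECT professor_sk, professor_id, rank FROM dimprofessor ORDER BY professor_sk"
).fetchall()
print(dim_rows)
```

The SK = -1 row gives fact rows something to point at when a lookup finds no matching dimension member.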
Data Profiling
• Systematic analysis of the content of a data source
• Goals:
  - Anticipate potential data quality issues upfront
  - Build quality corrections and controls into ETL process
• Manual and/or Tool-assisted

Profiling Example: Manual
CustID | Account Number | Customer Type | Title | First Name | Last Name | Gender | Email | Phone | Address Line1 | Address Line2 | State | Postal Code | Country
11000 | AW00011000 | I | Mr. | Jon | Yang | F | jon24@adventureworks.com. | 1(11) 500 5550162 | 3761 N. 14th St | | Queensland | 4700 | AU
11001 | AW00011001 | I | | Eugene | Huang | F | eugene10@adventureworks.com. | 500-555-0110 | 2243 W St. | | Victoria | 3198 | AU
11002 | AW00011002 | I | | Ruben | Torres | F | ruben35@advantureworks.com. | 1(11) 500 5550184 | 5844 Linden Dr | | New South Wales | 7001 | AU
11003 | AW00011003 | I | | Christy | Zhu | F | christy12@adventureworks.com. | 1(11) 500 5550162 | 1825 Village Pl. | | Queensland | 2113 |
11004 | AW00011004 | I | Mrs. | Elizabeth | Johnson | F | elizabeth5@adventureworks.com. | (500) 555-0131 | 7553 Harness Circle | | | 4169 | OZ
11005 | AW00011005 | I | | Julio | Ruiz | M | julio1@adventureworks.com. | 1(11) 500 5550151 | 7305 Humphrey Drive | | New South Wales | 2500 | AU

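The same kind of inspection can be partly automated. The sketch below is a minimal, hypothetical column profiler in Python. It reduces each value to a "shape" (digits become 9, letters become A) so that inconsistent formats stand out. The phone values are taken from the example above, and the function name profile_column is an assumption.

```python
import re
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, nulls, distinct values, and value-shape patterns."""
    non_null = [v for v in values if v not in (None, "")]
    # Reduce each value to a shape: digits -> 9, letters -> A, other characters kept.
    shapes = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in non_null
    )
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "patterns": dict(shapes),
    }

phones = ["1(11) 500 5550162", "500-555-0110", "1(11) 500 5550184", "(500) 555-0131", None]
phone_profile = profile_column(phones)
print(phone_profile)
```

Three different shapes across four non-null phone numbers is the kind of finding that would lead to a format-revision rule in the ETL design.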
Profiling Example: SSIS
Documenting ETL High Level Design
• Add to existing DW Physical Design Spreadsheet

Low Level Design of ETL Process
• Detailed documentation of:
  - What data do we need and where is it coming from?
  - What are the major transformation/cleansing needs?
  - What’s the sequence of activities for ETL?
• Can use a tool like SSIS

Extracting Source Data
• Two forms:
  1. Static Data Capture
      - Point-in-time snapshot
      - Initial loads and periodic refreshes
  2. Revised Data Capture
      - Only data that has been added, updated, or deleted since the last load
      - Ongoing incremental loads
      - Two timeframes
          - Immediate
          - Deferred

Static Data Capture
• (T)SQL Scripts
  - e.g., small number of tables/rows
• Export/Import Tables
  - e.g., database or non-database sources
• Backup/Restore Database
  - e.g., copying a SQL Server source database for initial load ETL
• Detach/Attach Database
  - e.g., copying a database from an older SQL Server version to a newer SQL Server version for initial load ETL

Revised Data Capture
• Immediate / Real-time
  - ETL side: procs get changed data from the log in real time and update ETL staging tables
  - OLTP side: triggers update ETL staging tables
  - OLTP side: apps write to OLTP AND ETL staging tables
• Deferred
  - ETL side: procs get changed data from OLTP tables based on timestamps
  - ETL side: procs do file comparison
  - OLTP side: Change Data Capture (SQL Server 2008)

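The deferred "file comparison" approach amounts to diffing the previous extract against the current one. The sketch below is illustrative Python, assuming each snapshot is a dict keyed by primary key. The function name diff_snapshots and the sample customer rows are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key; report added, updated, and deleted keys."""
    added = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(k for k in current if k in previous and current[k] != previous[k])
    return {"added": added, "updated": updated, "deleted": deleted}

# Yesterday's and today's extracts of a hypothetical customer table.
yesterday = {11000: ("Yang", "Queensland"), 11001: ("Huang", "Victoria"), 11002: ("Torres", "Tasmania")}
today = {11000: ("Yang", "Queensland"), 11001: ("Huang", "New South Wales"), 11003: ("Zhu", "Queensland")}

changes = diff_snapshots(yesterday, today)
print(changes)
```

Comparison needs no cooperation from the OLTP system, at the cost of reading the full source on every run.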
Documenting ETL Low Level Design: SSIS
• Comes with SQL Server
• Helps document and automate ETL process
• Based on defining
  - Packages
  - Tasks
• One approach
  - A package for each target table
  - A "master" package

SSIS Package Examples: Master
SSIS Package Examples: Extract All
SSIS Package Examples: Extract Changed using CDC
e.g., SELECT * FROM cdccustomer WHERE cdc_chg_date > etl_last_capture_date;

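The query above depends on remembering the last capture date between runs. The sketch below shows that high-water-mark pattern in Python with SQLite. It illustrates timestamp-based polling only, not SQL Server's Change Data Capture feature itself. The cdccustomer and cdc_chg_date names come from the query; the other columns, the sample rows, and the function name extract_changes are assumptions.

```python
import sqlite3

def extract_changes(conn, last_capture):
    """Return rows changed after last_capture, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT cust_id, name, cdc_chg_date FROM cdccustomer "
        "WHERE cdc_chg_date > ? ORDER BY cdc_chg_date",
        (last_capture,),
    ).fetchall()
    # Advance the mark to the latest change seen; keep the old mark if nothing changed.
    new_mark = rows[-1][2] if rows else last_capture
    return rows, new_mark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdccustomer (cust_id INTEGER, name TEXT, cdc_chg_date TEXT)")
conn.executemany(
    "INSERT INTO cdccustomer VALUES (?, ?, ?)",
    [(1, "Yang", "2008-01-10"), (2, "Huang", "2008-01-20"), (3, "Torres", "2008-02-01")],
)

# ISO-formatted date strings compare correctly as text.
changed, watermark = extract_changes(conn, "2008-01-15")
print(changed, watermark)
```

Storing the returned mark (for example in an ETL control table) and passing it to the next run keeps each incremental load from re-reading rows it has already captured.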
SSIS Package Examples: Transforms
SSIS Package Examples: Load
Class Performance DW Example
• Create ClassPerformanceDW database
• Using ClassPerformanceDW database…
  - Create ClassPerformanceDW tables using SQL script
      http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables/create_class_performance_dw_tables.sql

ETL Example using SQL Scripts
• One "Master Script"
  - Calls five "table" scripts

"Master" Script
-- Enable SQLCMD mode (Query > SQLCMD Mode in SSMS) before running this script;
-- the :r commands below are only understood in that mode.
-- Dimension tables are loaded first so the fact load can look up their surrogate keys.
USE ClassPerformanceDW
PRINT 'loading dimclass table'
GO
:r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimclass.sql"
PRINT 'loading dimprofessor table'
GO
:r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimprofessor.sql"
PRINT 'loading dimstudent table'
GO
:r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimstudent.sql"
PRINT 'loading dimtime table'
GO
:r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimtime.sql"
PRINT 'loading factenrollment table'
GO
:r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_factenrollment.sql"
PRINT 'class performance DW data transformation and loading is complete'
GO

Load "DimProfessor" Script (pg. 1 of 3)
set nocount on
-- Full refresh: clear the dimension before reloading it
print 'remove existing data from dimprofessor'
delete from dimprofessor;
go
-- After a DELETE on a previously loaded table, reseeding to 0 makes the next generated professor_sk 1
print 'reseeding SK identity value back to 1'
dbcc checkident ('dimprofessor', reseed, 0);
go
print 'adding oltp prof data to dimprofessor'
print 'professor_sk will be automatically inserted'
insert into dimprofessor (professor_id, firstname, lastname, rank, department)
select prof_id, firstname, lastname, rank, dept
from regnOLTP.dbo.prof;
go

Load "DimProfessor" Script (pg. 2 of 3)
print 'decoding rank field'
UPDATE dimprofessor
SET dimprofessor.rank = CASE dimprofessor.rank
        WHEN 'asst' THEN 'assistant prof'
        WHEN 'assc' THEN 'associate prof'
        WHEN 'prof' THEN 'full prof'
        ELSE dimprofessor.rank  -- leave unrecognized codes unchanged instead of setting them to NULL
    END
;
GO
print 'decoding department field using imported excel spreadsheet'
-- Replace each department prefix with the full department name from the lookup table
UPDATE d
SET d.department = dep.department
FROM dimprofessor AS d
    INNER JOIN regnOLTP.dbo.departments AS dep ON d.department = dep.prefix
;
GO

Load "DimProfessor" Script (pg. 3 of 3)
print 'adding SK -1 row'
-- identity_insert must be on to supply an explicit value for the identity column
set identity_insert dimprofessor on
GO
insert into dimprofessor (professor_sk, professor_id, firstname, lastname, rank, department)
values (-1, -1, 'unknown', 'unknown', 'unknown', 'unknown');
GO
set identity_insert dimprofessor off
GO
set nocount off

Load "FactEnrollment" Script
print 'adding oltp registration data to factenrollment'
INSERT INTO factenrollment (student_sk, class_sk, date_sk, professor_sk, course_grade)
SELECT
    st.student_sk,
    c.class_sk,
    datekey,
    -- professor_sk is part of the primary key and cannot be NULL;
    -- sections with no matching professor point at the SK = -1 "unknown" row
    COALESCE(p.professor_sk, -1),
    final_grade
FROM regnOLTP.dbo.registration AS r
    INNER JOIN dimstudent AS st ON r.stud_id = st.student_id
    INNER JOIN dimclass AS c ON r.callno = c.crn
    INNER JOIN dimtime ON CONVERT(varchar(10), r.regn_date, 101) = actualdatekey
    INNER JOIN regnOLTP.dbo.section AS s ON c.crn = s.callno
    LEFT JOIN dimprofessor AS p ON s.prof_id = p.professor_id
;
GO

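The surrogate-key lookup with a fallback to the SK = -1 row can be tried on a small scale. The sketch below is a simplified illustration in Python with SQLite, not the course script: the class and time dimensions are left out, the fact table has only two keys, and all rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE registration (stud_id TEXT, callno TEXT, final_grade REAL);
    CREATE TABLE section (callno TEXT, prof_id TEXT);
    CREATE TABLE dimstudent (student_sk INTEGER PRIMARY KEY, student_id TEXT);
    CREATE TABLE dimprofessor (professor_sk INTEGER PRIMARY KEY, professor_id TEXT);
    CREATE TABLE factenrollment (
        student_sk   INTEGER NOT NULL,
        professor_sk INTEGER NOT NULL,
        course_grade REAL,
        PRIMARY KEY (student_sk, professor_sk));

    INSERT INTO registration VALUES ('s1', 'c100', 3.5), ('s2', 'c200', 4.0);
    INSERT INTO section VALUES ('c100', 'p1'), ('c200', NULL);  -- c200 has no professor assigned
    INSERT INTO dimstudent VALUES (1, 's1'), (2, 's2');
    INSERT INTO dimprofessor VALUES (-1, '-1'), (1, 'p1');

    -- Swap natural keys for surrogate keys; a missing professor maps to SK = -1.
    INSERT INTO factenrollment (student_sk, professor_sk, course_grade)
    SELECT st.student_sk, COALESCE(p.professor_sk, -1), r.final_grade
    FROM registration AS r
    JOIN dimstudent AS st ON st.student_id = r.stud_id
    JOIN section AS s ON s.callno = r.callno
    LEFT JOIN dimprofessor AS p ON p.professor_id = s.prof_id;
""")
fact_rows = conn.execute(
    "SELECT student_sk, professor_sk, course_grade FROM factenrollment ORDER BY student_sk"
).fetchall()
print(fact_rows)
```

Without the COALESCE, the second registration would carry a NULL professor_sk and be rejected by the NOT NULL primary key column.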
Entire Transform/Load "Package"
http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables.zip