Populating the Data Warehouse (ETL)
Extract, Transform, Load

Agenda
  - Review
      - Analysis (Bus Matrix, Info Package)
      - Logical Design (Dimensional Modeling)
      - Physical Design (Spreadsheet)
      - Implementation (Data Mart Relational Tables)
  - ETL Process Overview
  - ETL Components
      - Staging Area
      - Extraction
      - Transformation
      - Loading
  - Documenting High-Level ETL Requirements
  - Documenting Detailed ETL Flows
  - Example ETL

Review: Dimensional Modeling

Review: DM Implementation

DimStudent

    CREATE TABLE DimStudent(
        student_sk int identity(1,1),
        student_id varchar(9),
        firstname varchar(30),
        lastname varchar(30),
        city varchar(20),
        state varchar(2),
        major varchar(6),
        classification varchar(25),
        gpa numeric(3, 2),
        club_name varchar(25),
        undergrad_school varchar(25),
        gmat int,
        undergrad_or_grad varchar(10),
        CONSTRAINT dimstudent_pk PRIMARY KEY (student_sk));
    GO

FactEnrollment

    CREATE TABLE FactEnrollment(
        student_sk int,
        class_sk int,
        date_sk int,
        professor_sk int,
        course_grade numeric(2, 1),
        CONSTRAINT factenrollment_pk
            PRIMARY KEY (student_sk, class_sk, date_sk, professor_sk),
        CONSTRAINT factenrollment_student_fk
            FOREIGN KEY (student_sk) REFERENCES dimstudent (student_sk),
        CONSTRAINT factenrollment_class_fk
            FOREIGN KEY (class_sk) REFERENCES dimclass (class_sk),
        CONSTRAINT factenrollment_date_fk
            FOREIGN KEY (date_sk) REFERENCES dimtime (date_sk),
        CONSTRAINT factenrollment_professor_fk
            FOREIGN KEY (professor_sk) REFERENCES dimprofessor (professor_sk));
    GO

Review: Physical DW Design

ETL Overview
  Reshaping relevant data from source systems into useful information stored in the DW
  - Extract: copying and integrating data from OLTP and other data sources in preparation for cleansing and loading into the DW
  - Transform: cleaning and converting data to prepare it for loading into the DW
  - Load: putting cleansed and converted data into the DW

ETL Process
  - Not really new, BUT…
      - Much more data
      - Includes rearranging, summarizing
      - Data used for strategic decision-making
  - Characteristics:
      - Process AND technology
      - Detailed, highly dependent tasks
      - Consumes an average of 75% of DW development
      - An ongoing process for the life of the DW
  - Requirements:
      - Well-documented
      - Automated
      - Flexible

ETL Process
  1. Determine target data
  2. Determine data sources
  3. Prepare data mapping
  4. Organize data staging area
  5. Establish data extraction rules
  6. Establish data transformation rules
  7. Plan aggregate tables
  8. Establish data load procedures
  9. Load dimension tables
  10. Load fact tables

ETL Process Flow
  (Diagram labels, by step number from the ETL Process list: 1, Dim Model; 2, Spreadsheet; 3, Spreadsheet; 4; 5, SSIS; 6, 7, Map & SSIS; 8, 9, 10, SSIS)

ETL Staging Area
  Information hub, facilitating the enriching stages that data goes through to populate a DW
  - Advantages:
      - Separates source systems and DW
      - Minimizes ETL impact on source AND DW systems
  - Can consist of multiple "hubs":
      - "upload" area
      - "staging" area
      - "DW load images"

ETL Staging Area, cont…

High Level Design of ETL Process
  Initial documentation of:
  - What data do we need and where is it coming from?
      - Physical DW Design Spreadsheet shown previously
  - What are the major transformation/cleansing needs?
      - "Extend" Physical DW Design Spreadsheet OR ETL Map
  - What's the sequence of activities for ETL?
      - ETL Map

Common Transformations
  - Format revisions
  - Key restructuring, lookup
  - Handling of null values
  - Decoding fields
  - Calculated, derived values
  - Merging of data

Common Transformations, cont…
  - Splitting of single fields
  - Character set conversion
  - Units of measurement conversion
  - Date/time conversion
  - Summarization
  - Deduplication

Common Transformations, cont…
  - Other data quality issues:
      - Standardize values
      - Validate values
      - Identify mismatches, misspellings
      - Etc…
  - Suggestions:
      - Appoint "Data Stewards"
      - Ensure ETL programs have control checks
      - Data profiling…

Comparison of Models

Transformations Example
  DimTime
      - Create table
      - Insert row w/SK = -1
  DimProfessor
      - Generate SK
      - Insert row w/SK = -1
      - Expand rank values (use SQL case)
      - Expand department values (join prof to departments)
  DimClass
      - Generate SK
      - Insert row w/SK = -1
      - Get coursename & cred hrs from section tbl (join section to course)
  DimStudent
      - Generate SK
      - Insert row w/SK = -1
      - Expand classification values (use SQL case)
      - Expand state values (needs lookup table but use SQL case instead)
      - Get gmat, undergrad school from grad table (join student to grad)
      - Get club name from club (join student to undergrad; left join them to club)
      - Create undergrad_or_grad values (if stud_id in undergrad or stud_id in grad)
  FactEnrollment
      - Add SKs: student, section, prof (join registration to student, time, and section dims; left join them to prof)

Data Profiling
  Systematic analysis of the content of a data source
  - Goals:
      - Anticipate potential data quality issues upfront
      - Build quality corrections and controls into ETL process
  - Manual and/or tool-assisted

Profiling Example: Manual

    CustID | Account Number | Customer Type | Title | First Name | Last Name | Gender | Email | Phone | Address Line1 | Address Line2 | State | Postal Code | Country
    11000 | AW00011000 | I | Mr. | Jon | Yang | F | jon24@adventureworks.com. | 1(11) 500 5550162 | 3761 N. 14th St | | Queensland | 4700 | AU
    11001 | AW00011001 | I | | Eugene | Huang | F | eugene10@adventureworks.com. | 500-555-0110 | 2243 W St. | | Victoria | 3198 | AU
    11002 | AW00011002 | I | | Ruben | Torres | F | ruben35@advantureworks.com. | 1(11) 500 5550184 | 5844 Linden Dr | | New South Wales | 7001 | AU
    11003 | AW00011003 | I | | Christy | Zhu | F | christy12@adventureworks.com. | 1(11) 500 5550162 | 1825 Village Pl. | | Queensland | 2113 |
    11004 | AW00011004 | I | Mrs. | Elizabeth | Johnson | F | elizabeth5@adventureworks.com. | (500) 555-0131 | 7553 Harness Circle | | New South Wales | 2500 | AU
    11005 | AW00011005 | I | | Julio | Ruiz | M | julio1@adventureworks.com. | 1(11) 500 5550151 | 7305 Humphrey Drive | | | 4169 | OZ

Profiling Example: SSIS

Documenting ETL High Level Design
  - Add to existing DW Physical Design Spreadsheet

Documenting ETL High Level Design

Low Level Design of ETL Process
  Detailed documentation of:
  - What data do we need and where is it coming from?
  - What are the major transformation/cleansing needs?
  - What's the sequence of activities for ETL?
  Can use a tool like SSIS

Extracting Source Data
  Two forms:
  1. Static Data Capture
      - Point-in-time snapshot
      - Initial loads and periodic refreshes
  2. Revised Data Capture
      - Only data that has been added, updated, or deleted since the last load
      - Ongoing incremental loads
      - Two timeframes:
          - Immediate
          - Deferred

Static Data Capture
  - (T)SQL scripts
      - e.g., small number of tables/rows
  - Export/Import tables
      - e.g., database or non-database sources
  - Backup/Restore database
      - e.g., copying SQL Server source database for initial load ETL
  - Detach/Attach database
      - e.g., copying older SQL Server version to newer SQL Server version for initial load ETL

Revised Data Capture
  - Immediate / Real-time
      - ETL side: procs get changed data from log in real time and update ETL staging tables
      - OLTP side: triggers update ETL staging tables
      - OLTP side: apps write to OLTP AND ETL staging tables
  - Deferred
      - ETL side: procs get changed data from OLTP tables based on timestamps
      - ETL side: procs do file comparison
      - OLTP side: changed data capture (SQL Server 2008)

Documenting ETL Low Level Design: SSIS
  - Comes with SQL Server
  - Helps document and automate ETL process
  - Based on defining:
      - Packages
      - Tasks
  - One approach:
      - A package
        for each target table
      - A "master" package

SSIS Package Examples: Master

SSIS Package Examples: Extract All

SSIS Package Examples: Extract Changed using CDC
  e.g.,

    SELECT * FROM cdccustomer
    WHERE cdc_chg_date > etl_last_capture_date;

SSIS Package Examples: Transforms

SSIS Package Examples: Load

Class Performance DW Example
  - Create ClassPerformanceDW database
  - Using ClassPerformanceDW database…
      - Create ClassPerformanceDW tables using SQL script:
        http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables/create_class_performance_dw_tables.sql

ETL Example using SQL Scripts
  - One "Master Script"
  - Calls five "table" scripts

"Master" Script

    -- Enable SQLCMD mode (Query > SQLCMD Mode) before running this script;
    -- the :r commands below are SQLCMD directives, not T-SQL.
    Use ClassPerformanceDW
    print 'loading dimclass table'
    Go
    :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimclass.sql"
    print 'loading dimprofessor table'
    Go
    :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimprofessor.sql"
    print 'loading dimstudent table'
    Go
    :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimstudent.sql"
    print 'loading dimtime table'
    Go
    :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimtime.sql"
    print 'loading factenrollment table'
    Go
    :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_factenrollment.sql"
    Print 'class performance DW data transformation and loading is complete'
    Go

Load "DimProfessor" Script (pg. 1 of 3)

    set nocount on
    print 'remove existing data from dimprofessor'
    delete from dimprofessor;
    go
    print 'reseeding SK identity value back to 1'
    dbcc checkident ('dimprofessor', reseed, 0);
    go
    print 'adding oltp prof data to dimprofessor'
    print 'professor_sk will be automatically inserted'
    insert into dimprofessor (
        professor_id, firstname, lastname, rank, department)
    select prof_id, firstname, lastname, rank, dept
    from regnOLTP.dbo.prof ;
    go

Load "DimProfessor" Script (pg. 2 of 3)

    print 'decoding rank field'
    UPDATE dimprofessor
    SET dimprofessor.rank =
        case dimprofessor.rank
            when 'asst' then 'assistant prof'
            when 'assc' then 'associate prof'
            when 'prof' then 'full prof'
            else dimprofessor.rank  -- keep unrecognized codes instead of setting them to NULL
        end ;
    Go
    print 'decoding department field using imported excel spreadsheet'
    UPDATE dimprofessor
    SET dimprofessor.department = d.department
    FROM dimprofessor
        INNER JOIN regnOLTP.dbo.departments AS d
            ON dimprofessor.department = d.prefix ;
    Go

Load "DimProfessor" Script (pg. 3 of 3)

    print 'adding SK -1 row'
    set identity_insert dimprofessor on
    Go
    insert into dimprofessor (
        professor_sk, professor_id, firstname, lastname, rank, department)
    Values (-1, -1, 'unknown', 'unknown', 'unknown', 'unknown');
    GO
    set identity_insert dimprofessor off
    Go
    Set nocount off

Load "FactEnrollment" Script

    print 'adding oltp registration data to fact_enrollment'
    INSERT INTO factenrollment (
        student_sk, class_sk, date_sk, professor_sk, course_grade)
    SELECT
        student_sk, class_sk, datekey,
        -- professor_sk is part of the primary key and cannot be NULL, so a
        -- section with no matching professor maps to the "unknown" row (SK = -1)
        COALESCE(professor_sk, -1),
        final_grade
    FROM regnOLTP.dbo.registration AS r
        INNER JOIN dimstudent AS s ON r.stud_id = s.student_id
        INNER JOIN dimclass AS c ON r.callno = c.crn
        INNER JOIN dimtime AS t ON CONVERT(varchar(10), r.regn_date, 101) = actualdatekey
        INNER JOIN regnOLTP.dbo.section AS sec ON c.crn = sec.callno
        LEFT JOIN dimprofessor AS p ON sec.prof_id = p.professor_id ;
    Go

Entire Transform/Load "Package"
  http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables.zip
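
Sketch: Deferred Revised Data Capture using timestamps
  The deferred, ETL-side option listed under Revised Data Capture (procs get changed data from OLTP tables based on timestamps) is only named in the slides, so the following is a minimal sketch of one way it might look for the registration table. It is not taken from the course scripts. The names etl_capture_log, stg_registration, and last_modified are hypothetical, and the sketch assumes the OLTP registration table carries a last_modified datetime column and that the staging table already exists.

```sql
-- Hypothetical objects: etl_capture_log, stg_registration, last_modified.
USE ClassPerformanceDW;
GO

-- One row per source table, holding the time of the last successful capture.
IF OBJECT_ID('dbo.etl_capture_log') IS NULL
    CREATE TABLE dbo.etl_capture_log (
        source_table      varchar(50) NOT NULL PRIMARY KEY,
        last_capture_date datetime    NOT NULL
    );
GO

DECLARE @last_capture datetime, @this_capture datetime;

-- Fix the upper bound first, so rows changed while this batch runs
-- fall into the next capture window instead of being skipped.
SET @this_capture = GETDATE();

SELECT @last_capture = last_capture_date
FROM dbo.etl_capture_log
WHERE source_table = 'registration';

-- First run: no log row yet, so capture everything.
SET @last_capture = COALESCE(@last_capture, '19000101');

-- Copy only the rows added or updated since the last capture.
INSERT INTO dbo.stg_registration (stud_id, callno, regn_date, final_grade)
SELECT r.stud_id, r.callno, r.regn_date, r.final_grade
FROM regnOLTP.dbo.registration AS r
WHERE r.last_modified >  @last_capture
  AND r.last_modified <= @this_capture;

-- Advance the stored capture time (insert the log row on the first run).
UPDATE dbo.etl_capture_log
SET last_capture_date = @this_capture
WHERE source_table = 'registration';

IF @@ROWCOUNT = 0
    INSERT INTO dbo.etl_capture_log (source_table, last_capture_date)
    VALUES ('registration', @this_capture);
GO
```

  A timestamp comparison like this sees inserts and updates but not deleted rows, which is one reason the slides also list file comparison and changed data capture as deferred options.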