ETL Architecture Guide
The purpose of this document is to present architectural guidelines for developing a common-sense approach to supply the best possible quality of data attainable for the data mart.
ETL – an industry standard abbreviation for the phrase: “extract, transform and load” which favors the use of applications to perform data transformations outside of a database on a row-by-row basis. Traditionally, referred to as the “hub and spokes” practice.
ELT – an industry standard abbreviation for the phrase: “extract, load and transform” which favors the use of relational databases as the staging point before performing any transformations of source data into target data.
- When functional information requirements have been defined for the end products of the data mart, each of these requirements will be traced back to the appropriate source systems of record. Best practices will be used to determine the most expedient approach to obtain the highest quality data in terms of most accurate, complete and valid data from each source. Flat (non-hiearchical, non-relational) file extracts may be either pulled by an application such as ftp or other mechanism (tape, disk, etc…) or pushed by a source system extract application to a staging area. Relational data may also be queried to produce extract files or to load directly into relational database staging tables. An ETL metadata reference table will be defined (data_source_type) to uniquely identify each type of data source (flat file, spreadsheet, hierarchical database, relational database, multi-valued database, comma-separated variable length, fixed record length, etc…) An ETL metadata reference table will be defined (data_source) to uniquely identify all data sources. All data sources will be required to have event metadata either embedded with the source data itself in standardized header/trailer records, or in a separate file or table which must be accessible by the ETL tools or applications at the same time as the source data itself. At a minimum this will include file or table name, date of creation, unique creator/author identifier, checksum, size in bytes, character set, number of records/rows, data source type (see above) and data source identifier (see above). Additional items may also include: cyclic redundancy check (“CRC”), hash totals, public key encryption token. Where no secure data channel exists, corporate data must be encrypted by a standard application before transmission to a staging location. All data mart staging areas must be kept secure with only those users responsible for operational management and the applications themselves having the necessary permissions. Corporate data is extremely valuable and must be kept secure at all times as it is moved from source systems toward the data mart. Where large quantities of data may need to be moved across a wide area network, the ability to compress the data at the source prior to transmission and expand the data upon receipt into the staging area is required. The ability to archive data with an associated time context naming convention is frequently useful and often required. Preferably the naming convention for files will include the following format: YYYYMMDD_HHmmSS where YYYY = century and year (ie 2007), MM = Gregorian/calendar month with leading zeroes (ie. 03 for March); DD = day of month; HH=hour of day (using zulu, universal or Greenwich meridian timezone; ie. 6:30AM Central Time = 12:30PM Universal Time); mm=minute of hour; SS=seconds of minute Any and all database objects developed to support the ETL process such as: tables, views, functions, stored procedures, dictionary table population scripts must have individual drop and create/replace scripts created for each object to facilitate rapid deployment of an ETL environment to a new database or server. Any and all scripts associated with the ETL process must be placed into a standard version control system to support change management and rollback in case of failures to assist in root cause determination. The version control system will be the single source used to produce a continuous integration environment for unit testing all ETL scripts. Business rule-driven unit tests will be designed for each ETL component and executed after any change which touches either the data processed by the component or any component immediately up/down stream from the component. Continuous integration testing of the ETL system will require the creation of a separate ETL data work area within the development environment which can be completely dropped, rebuilt and unit tests run on static datasets to produce expected results. These continuous integration tests must produce easily determined pass/fail results and be self-reporting to the ETL development team. All required client and server database, tool or other software (including upgrades, patches, service packs or hot-fixes) must be kept in a separate escrow to be able to reproduce an entire ETL environment exactly as it was prior to any upgrade to any component. All database, ftp and other connectivity information for each environment (development, testing/quality assurance and production) must be kept in a centralized, single location accessible by the development team and the ETL applications themselves. Appropriate access controls must be maintained on this information to only allow authorized individuals the ability to update this information as needed, and to allow a single point of maintenance which may be applied to all environments. Any component upgrade must be tested in the continuous integration environment prior to deployment to quality assurance and must never be applied to production without passing the continuous integration and quality assurance testing. There will be at minimum three (3) separate environments created to support ETL: 1) Development, 2) test/quality assurance and 3) production. All new development and unit testing activities will begin in the ETL development environment. Prior to migration of the ETL component to the
test/QA environment, a test plan must be written and communicated to the quality assurance staff. This test plan must include: all setup instructions, process execution instructions, expected results and/or success criteria and any restart instructions. All ETL processes will make use of parameters to provide the maximum amount of reusability without re-coding. This also reduces the number of modules to maintain, while increasing the number of tests required per process / module. All ETL metadata will reside in a central repository accessible via a generic search engine by all stakeholders in the process. ETL metadata must be accessible and updateable by all users (ie. Wiki-style) to keep the data current and add additional perspectives. The metadata should be periodically peer reviewed by metadata data stewards for content accuracy and currency. Every relational database table whether source, interim or final will have a minimum of the following attributes defined: name, classification (process control, process log, dimensional, master/reference, transaction, fact, associative, reject, transformation); primary key; business key; foreign keys; surrogate key; estimated row count; anticipated growth factor per month; average row length and indexing recommendations. Every file whether source, interim or final will have a minimum of the following attributes defined: creation date, identify file creator name, file classification (process control, log, reject, source, final) file type (fixed length, comma-separated variable, other delimited (with delimiter), structural layout containing: field name, data type (numeric, date, date/time, alpha, GUID), maximum length, required Y/N. identity, domain of values, and range of values. This information must supply enough information to construct data edit scripts to assess the validity and general quality of information contained in the file itself. Every relational database table column will have the following metadata requirements: (see file immediately above) in addition, foreign key columns must be associated with their parent table names where applicable to facilitate further automation of edit checking. Any additional business rules must be defined where applicable to a given column. This will support the addition of data-driven validation of business rules. Each column may have zero or many business rules applied to its validation process. Each business rule for a given column must have a gate, event (which may be empty) or state of relevant data required in order to fire the business rule edit test. As an example, if address line 1, city and state are supplied we may fire the business rule to validate the ZIP code for addresses in the United States . Each multi-dimensional cube must have the following items defined in the ETL metadata: by dimension: name, expected number of entries in each dimension, percent of sparse data anticipated in the dimension, OLAP query to produce meaningful control totals for the dimension; then for the cube one or more OLAP queries to produce a validation of the cube being updated (may be run before building the cube and again after in order to validate new data added had the desired impact) Any ETL process that must execute a third-party product must log all parameters sent to the third-party application and the resulting return code received from the product along with a timestamp for each and a mapping of the return code into a simple pass/fail status. A centralized dashboard must be developed to monitor and communicate the status of each cycle of the ETL process. All appropriate ETL processes must accurately update the appropriate ETL logs upon completion of their cycle to reflect both total data processed and total data rejected. This dashboard must support establishing performance thresholds for:
- Elapsed time required to process a given quantity of data; Data quality assessment as a percent of total new and as a percent of total overall for master/reference, transaction, dimension and fact data
The dashboard must also support drilling down from the dashboard to the details of the ETL logs to provide immediate insight into the root causes of any out of threshold aggregates.
· All data mart surrogate keys will be assigned via a programmatic (not via database auto assignment / identity) approach defined for each table as appropriate to provide an added measure of control and support portability of data between environments.
· All data added to the data mart will be identified by a unique data source identifier, which will have been previously established in the data mart source system control table.
· All data loaded into the data mart during the same ETL processing cycle will be identified by a unique identifier assigned at the beginning of the cycle. This identifier will be generated at the time of the ETL processing cycle and will have additional metadata about the ETL cycle associated with it including: description of the reason for the cycle, ETL start date and time, ETL environment identifier.
· When an ETL processing cycle successfully completes and the resulting data in the data mart has passed both automated and any manual quality assurance checks, the staging data (successfully loaded) will be associated with the unique ETL processing session identifier, it will be archived in either a flat file or database backup.
· Prior to the start of any ETL processing cycle the destination data mart, any associated OLAP cubes, and any staging are to be affected during the process must be backed up. This is necessary to supply a recovery position should a catastrophic failure occur during the process. For performance reasons if the backups are done at the end of the previous ETL process, these backups must be verified to have been successful and still viable.