Data migration is a field that is often dismissed as a simple step between retiring an existing platform and bringing a new one to life. The reality is that we are constantly migrating data between systems: implementing new applications, setting up coexistence scenarios, and sharing information in ways that amount to data migration.

The actual task of data migration is a complex one, requiring experience and a detailed understanding of the steps to be taken in terms of both business process and technical implementation. This includes a clear delineation of responsibility between IT and the business, since far too often the entire job of manipulating data is relegated to IT, well outside its core competency.

Enabling the migration requires a team experienced in abstracting the technical details so that the business team can implement its migration rules without ever writing a single line of code; the coding itself remains the responsibility of IT.

This document focuses on technical architecture rather than process. We describe an architecture that we’ve found to be ideal in dozens of large data migration scenarios at organizations ranging from global Fortune 500 companies to small local businesses. The architecture is portable between software toolsets; we’ve implemented it in hand-coded systems as well as with open-source and best-of-breed ETL/DQ solutions.

This document is aimed at anyone who’s about to embark on a data migration, or who is trying to rescue an existing project that is “off-plan”. To gain the most from it, you need only a rudimentary understanding of data-handling terminology, such as ETL (Extract, Transform, and Load) and Reference Data (e.g. ISO currency codes or lists of valid postcodes).

Data Migration ETL Architecture

Stage 1: Data is read directly from source systems in the native code page and placed into database staging areas in a format and ER model that exactly match the source systems, whether they are ERPs, bespoke systems, flat files, or semi-structured data. Metadata is added to aid in tracking records: a surrogate ID (SID), plus a timestamp, source identifier, and other metadata as appropriate for the system.
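
As an illustration only, the following sketch (in Python with SQLite, using hypothetical table and column names such as stg1_customer, source_system, and load_ts) shows how a Stage 1 landing table might add the SID and metadata described above. It is a minimal sketch under those assumptions, not a prescription for any particular toolset.

    # Minimal sketch of a Stage 1 landing step, assuming a hypothetical source
    # extract already read into Python dicts and an in-memory SQLite staging area.
    # Table and column names (stg1_customer, load_ts, source_system) are illustrative.
    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE stg1_customer (
            sid           INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate ID (SID) for back-tracking
            source_system TEXT,                               -- which source fed this row
            load_ts       TEXT,                               -- when the row landed
            cust_no       TEXT,                               -- source columns kept exactly as-is
            name          TEXT,
            country       TEXT
        )
    """)

    source_rows = [  # stand-in for rows read from the source in its native code page
        {"cust_no": "0001", "name": "Acme GmbH", "country": "DEU"},
        {"cust_no": "0002", "name": "Globex Ltd", "country": "GBR"},
    ]

    load_ts = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO stg1_customer (source_system, load_ts, cust_no, name, country) "
        "VALUES (?, ?, ?, ?, ?)",
        [("SOURCE_A", load_ts, r["cust_no"], r["name"], r["country"]) for r in source_rows],
    )
    conn.commit()
    print(conn.execute("SELECT sid, source_system, cust_no FROM stg1_customer").fetchall())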

Stage 2: Data is moved from the Stage 1 area to the Stage 2 area. Only the subset of columns identified in the data map to the target systems is moved, including any fields needed to perform calculations but not necessarily destined to be loaded to the target, as well as all metadata. SIDs are maintained to provide easy “back tracking”. Any filtering rules to remove data known to be irrelevant to the migration are applied. Duplicates are removed where appropriate (e.g. duplicate customer master records might be removed while transactions are persisted). Data is still maintained in the entity-relationship model of the source systems, less the columns deemed extraneous. All actions are logged in the audit and error data model, and error records are excluded from the target.
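
To make the Stage 2 rules concrete, here is a minimal Python sketch. The customer extract, the status filter, and the duplicate key (source system plus customer number) are assumptions for illustration, not rules mandated by the architecture.

    # Minimal sketch of a Stage 2 pass: keep only mapped columns, filter rows known
    # to be out of scope, drop duplicates, and log every exclusion. Column names,
    # the "status" filter, and the duplicate key are illustrative assumptions.
    MAPPED_COLUMNS = ["sid", "source_system", "cust_no", "name", "country"]

    stage1_rows = [
        {"sid": 1, "source_system": "SOURCE_A", "cust_no": "0001", "name": "Acme GmbH",
         "country": "DEU", "status": "ACTIVE", "fax": "n/a"},
        {"sid": 2, "source_system": "SOURCE_A", "cust_no": "0001", "name": "Acme GmbH",
         "country": "DEU", "status": "ACTIVE", "fax": ""},
        {"sid": 3, "source_system": "SOURCE_A", "cust_no": "0002", "name": "Globex Ltd",
         "country": "GBR", "status": "DELETED", "fax": ""},
    ]

    audit_log, stage2_rows, seen_keys = [], [], set()
    for row in stage1_rows:
        if row["status"] == "DELETED":                   # filtering rule: out of scope for the migration
            audit_log.append({"sid": row["sid"], "action": "EXCLUDED", "reason": "status=DELETED"})
            continue
        key = (row["source_system"], row["cust_no"])     # business-defined duplicate key
        if key in seen_keys:
            audit_log.append({"sid": row["sid"], "action": "EXCLUDED", "reason": "duplicate of earlier SID"})
            continue
        seen_keys.add(key)
        stage2_rows.append({col: row[col] for col in MAPPED_COLUMNS})  # subset of columns only, SID preserved
        audit_log.append({"sid": row["sid"], "action": "MOVED_TO_STAGE2", "reason": None})

    print(stage2_rows)
    print(audit_log)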

Stage 3: Data is mapped from the old entity-relationship model to the new one, metadata is transferred, SIDs are maintained, and all actions are logged while error records are excluded from the target. It is important to note that no data translations or calculations have taken place yet: the data is in the target ER model but not necessarily in the target data formats. All data types are defined as VARCHAR (MAX LENGTH) to avoid rejecting any source data.
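
The sketch below illustrates the re-shaping-only nature of Stage 3, assuming a hypothetical target “business partner” structure; every value is carried as a string, mirroring the VARCHAR (MAX LENGTH) columns, and nothing is translated yet.

    # Minimal sketch of a Stage 3 re-shaping step: rows move from the source ER
    # model into the target ER model, but every value is carried as text and no
    # translation happens yet. The source and target structures shown are hypothetical.
    def to_target_business_partner(stage2_customer):
        """Map a Stage 2 customer row onto the (assumed) target 'business partner'
        shape. All values are kept as strings, mirroring VARCHAR (MAX LENGTH) columns."""
        return {
            "sid": str(stage2_customer["sid"]),            # SID maintained for back-tracking
            "source_system": stage2_customer["source_system"],
            "partner_id": str(stage2_customer["cust_no"]), # renamed and relocated, not translated
            "partner_name": str(stage2_customer["name"]),
            "country": str(stage2_customer["country"]),    # still the 3-character source code
        }

    stage2_rows = [
        {"sid": 1, "source_system": "SOURCE_A", "cust_no": "0001", "name": "Acme GmbH", "country": "DEU"},
    ]
    print([to_target_business_partner(r) for r in stage2_rows])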

Stage 4: Data is translated to meet target system rules and implement functional requirements, and is checked against target system reference data. All metadata and SIDs are maintained and all actions are logged, both audit and error information. A second duplicate check is performed here to account for “new” duplicates created by data translation; of course, the concept of a duplicate is one requiring both a business and a technical definition. The target data model, including data types, is now in place, and a column-by-column check is enforced to ensure that no data violates the datatype constraints in the target system. Error records are excluded from the target and logged.

“Dictionary” translations, those where a given input from a given source system translates to a particular output (e.g. “Source System A” uses a 3-character country code while “Target System X” uses a 2-character one), have the code abstracted so that the translation is stored in a value mapping table. These tables can be easily maintained by business users via a spreadsheet that is loaded into the value mapping table. This ensures that the business retains responsibility for the data translation while the development team is responsible for correctly coding the abstract rule.
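
As a sketch of this separation of duties, the following Python fragment loads a hypothetical value mapping table from a CSV export of the business-maintained spreadsheet and applies it through a single abstract lookup; the file layout and field names are assumptions for illustration.

    # Minimal sketch of a dictionary translation driven by a business-maintained
    # value mapping table. Here the table is a CSV export of the spreadsheet the
    # business edits; the columns and sample values are illustrative assumptions.
    import csv, io

    value_mapping_csv = io.StringIO(
        "source_system,field,source_value,target_value\n"
        "SOURCE_A,country,DEU,DE\n"
        "SOURCE_A,country,GBR,GB\n"
    )
    mapping = {
        (r["source_system"], r["field"], r["source_value"]): r["target_value"]
        for r in csv.DictReader(value_mapping_csv)
    }

    def translate(source_system, field, value, errors):
        """Generic lookup coded by IT; the actual value pairs live in the mapping table."""
        try:
            return mapping[(source_system, field, value)]
        except KeyError:
            errors.append({"source_system": source_system, "field": field,
                           "value": value, "reason": "no value mapping"})
            return None

    errors = []
    print(translate("SOURCE_A", "country", "DEU", errors))  # -> 'DE'
    print(translate("SOURCE_A", "country", "XXX", errors))  # -> None, exclusion logged
    print(errors)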

Stage 5: Referential integrity checks are performed to ensure that the data structures are completely ready to be written to the target system. Audit and error data is logged and metadata is maintained. At this point the data should be error-free: any “child” records that did not pass error checks have been excluded, and any parent records that failed error checks have been excluded along with their related child records, so that orphan records cannot be written to the target system. Alteration of the parent-child exclusion rules is possible but heavily dependent on the nature of the target systems (e.g. SAP will not allow the creation of a parent-child hierarchy if the parent and child legal entities don’t already exist in the system).
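
The orphan-exclusion logic of Stage 5 can be sketched as follows, using hypothetical parent (business partner) and child (order) rows; the keys and flags are illustrative.

    # Minimal sketch of a Stage 5 referential-integrity check: a child row is
    # excluded whenever its parent was excluded earlier (or never existed), so no
    # orphans can reach the target. Table names and keys are illustrative.
    parents = [
        {"sid": 10, "partner_id": "DE01", "passed_checks": True},
        {"sid": 11, "partner_id": "GB01", "passed_checks": False},  # failed an earlier error check
    ]
    children = [
        {"sid": 20, "partner_id": "DE01", "order_no": "A-100"},
        {"sid": 21, "partner_id": "GB01", "order_no": "A-101"},     # parent excluded -> must be excluded too
        {"sid": 22, "partner_id": "FR01", "order_no": "A-102"},     # no parent at all -> orphan
    ]

    valid_parent_ids = {p["partner_id"] for p in parents if p["passed_checks"]}
    error_log, loadable_children = [], []
    for child in children:
        if child["partner_id"] in valid_parent_ids:
            loadable_children.append(child)
        else:
            error_log.append({"sid": child["sid"], "action": "EXCLUDED",
                              "reason": f"parent {child['partner_id']} missing or excluded"})

    print(loadable_children)   # only order A-100 survives
    print(error_log)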

Stage 6: This stage involves invoking any APIs necessary and passing data to those methods in the structure they require. It also includes writing data directly in cases that require a direct write rather than API, IDoc, or similar calls.
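
A minimal sketch of a Stage 6 dispatcher is shown below. The wrapper functions only print, standing in for the real API client, IDoc interface, or direct database write, and the per-entity configuration is an assumption for illustration.

    # Minimal sketch of a Stage 6 dispatcher: each entity is either handed to a
    # target-system API wrapper or written directly, depending on configuration.
    # The wrappers here only print; in a real run they would call the target's
    # actual load API (or IDoc interface) or execute direct inserts.
    def post_via_api(entity, payload):
        print(f"API call for {entity}: {payload}")       # placeholder for the real API client

    def direct_write(entity, payload):
        print(f"Direct write for {entity}: {payload}")   # placeholder for a direct INSERT

    LOAD_METHOD = {"business_partner": post_via_api, "sales_order": direct_write}  # assumed configuration

    stage5_batches = {
        "business_partner": [{"partner_id": "DE01", "partner_name": "Acme GmbH", "country": "DE"}],
        "sales_order": [{"order_no": "A-100", "partner_id": "DE01"}],
    }

    for entity, rows in stage5_batches.items():
        loader = LOAD_METHOD[entity]
        for row in rows:
            loader(entity, row)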

Reconciliation: Stage 6 reconciliation is performed by reading the data structures back from the target system and comparing them against the Stage 5 data, performing logical DELETEs against exact matches.
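
The reconciliation step might be sketched as follows, assuming a simple exact-match key of partner ID and name and a “reconciled” flag acting as the logical DELETE; both are illustrative choices.

    # Minimal sketch of Stage 6 reconciliation: rows read back from the target are
    # compared to Stage 5 rows, and exact matches are logically deleted (flagged as
    # reconciled) rather than physically removed. Keys and flags are illustrative.
    stage5_rows = [
        {"sid": 10, "partner_id": "DE01", "partner_name": "Acme GmbH", "reconciled": False},
        {"sid": 12, "partner_id": "FR02", "partner_name": "Initech SA", "reconciled": False},
    ]
    target_rows = [  # read back from the target system after the load
        {"partner_id": "DE01", "partner_name": "Acme GmbH"},
    ]

    target_keys = {(t["partner_id"], t["partner_name"]) for t in target_rows}
    for row in stage5_rows:
        if (row["partner_id"], row["partner_name"]) in target_keys:
            row["reconciled"] = True    # logical DELETE: exact match found in the target

    unreconciled = [r for r in stage5_rows if not r["reconciled"]]
    print(unreconciled)                 # anything left here needs investigation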

Fault, fix, and follow-through: Using our internally developed audit and error data model along with the technical steps detailed here provides a single view of all excluded data, why and where it was excluded, and also exposes a simple interface through which a business user can change the data (with further audit information logged, of course). The system can be configured to log exclusions based on minor differences (e.g. variations of ‘-‘ vs. ‘/’ in dates) or to flag only major failures (e.g. variations in transaction values). Ultimately it allows for “fixing” data in situ and “re-migrating” the data to the new system.
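
A minimal sketch of this fix-and-re-migrate loop is shown below, assuming a hypothetical error table and a date-format failure; the field names and the fix itself are illustrative.

    # Minimal sketch of "fault, fix, and follow-through": excluded records and their
    # reasons sit in one error table, and a business user's fix is applied in situ
    # with its own audit entry so the record can be re-migrated. The field names
    # and the example fix are illustrative assumptions.
    from datetime import datetime, timezone

    error_records = [
        {"sid": 21, "entity": "sales_order", "field": "order_date",
         "value": "2023/06/30", "reason": "date format not accepted by target"},
    ]
    audit_log = []

    def apply_fix(record, new_value, user):
        """Apply a business user's correction and log who changed what, and when."""
        audit_log.append({
            "sid": record["sid"], "field": record["field"],
            "old_value": record["value"], "new_value": new_value,
            "changed_by": user, "changed_at": datetime.now(timezone.utc).isoformat(),
        })
        record["value"] = new_value
        record["reason"] = None         # cleared so the record is picked up on re-migration

    apply_fix(error_records[0], "2023-06-30", user="finance.analyst")
    print(error_records)
    print(audit_log)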

This system is meant to support repeated test data migrations so that all rules can be verified and enforced, with the goal of ZERO errors in the stages by the time of the final cutover.

Added value

The occasion of a data migration often involves work that can be reused to support other initiatives, bringing new value to the enterprise.

Master Data Management

As part of the data and application migration, it is often necessary to capture and validate various “master data” including:

  • Customer data
  • Product data
  • Reference data
  • Value mappings for translation: “Dictionary translations” as described in Stage 4.

Having developed the data model and captured the requisite information, the implementation of a full-featured Master Data Hub can be considered. Master data could be created, maintained, or simply exposed via this central repository, with all systems synchronized in real-time or batch modes to this Single Version of Truth.

Co-existence

At the time of writing the translations and transformations of data and hierarchies in the migration workstream, “reverse functions” are written for those transformations that are easily reversed (e.g. VAT-added calculations can be easily reversed). Transformations that are not easily reversible (e.g. concatenations) are “reversed” by writing the rule as though it were to support a migration from the target system(s) back to the source system(s), with the added benefit of being able to examine the audit trails programmatically. Because all functions are written in an object-oriented way, they are easily reused to support a co-existence scenario in which data is replicated between the legacy and new systems on a real-time or batch basis, as appropriate. These transformations can be configured to be called via an Enterprise Service Bus (ESB) or other methods, as required.
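
Following the VAT example above, a forward/reverse transformation pair might look like the sketch below; the 20% rate and the function names are assumptions for illustration.

    # Minimal sketch of a forward/reverse transformation pair reused for
    # co-existence, following the VAT example in the text. The 20% rate and the
    # function names are illustrative assumptions.
    VAT_RATE = 0.20

    def add_vat(net_amount: float) -> float:
        """Forward rule used in the migration: legacy net amounts become gross."""
        return round(net_amount * (1 + VAT_RATE), 2)

    def remove_vat(gross_amount: float) -> float:
        """Reverse rule used for co-existence: replicate target data back to legacy."""
        return round(gross_amount / (1 + VAT_RATE), 2)

    gross = add_vat(100.00)          # 120.0 flowing legacy -> new system
    print(gross, remove_vat(gross))  # 120.0 100.0 flowing new system -> legacy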

Archiving

During any data migration, business requirements should be captured around the availability of data: which data should be available under which SLAs in order to ensure that the operational system continues to run at maximum efficiency. A typical plan might envision segregating former customers to a secondary system outside of the main physical database (for performance reasons) 12 months after they terminate a client relationship. The same could be true of older transactions and other data. The ideal technical system would be able to treat data in an “object-oriented” form, archiving off such “objects” as a customer, automatically capturing all information related to that customer, and moving it to the secondary system.
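
A minimal sketch of such object-oriented archiving, assuming the 12-month rule above and hypothetical customer and order structures, is shown below; related records follow the customer to the secondary store.

    # Minimal sketch of "object-oriented" archiving: a customer and every record
    # that references it move together to a secondary store once the retention
    # rule is met. The 12-month rule and the table shapes are assumptions.
    from datetime import date, timedelta

    customers = [
        {"cust_no": "0001", "name": "Acme GmbH", "terminated_on": date(2022, 1, 31)},
        {"cust_no": "0002", "name": "Globex Ltd", "terminated_on": None},
    ]
    orders = [
        {"order_no": "A-100", "cust_no": "0001"},
        {"order_no": "A-101", "cust_no": "0002"},
    ]

    cutoff = date.today() - timedelta(days=365)     # terminated roughly 12 months ago or more
    archivable = {c["cust_no"] for c in customers
                  if c["terminated_on"] and c["terminated_on"] < cutoff}

    archive = {
        "customers": [c for c in customers if c["cust_no"] in archivable],
        "orders": [o for o in orders if o["cust_no"] in archivable],  # related records follow the customer
    }
    customers = [c for c in customers if c["cust_no"] not in archivable]
    orders = [o for o in orders if o["cust_no"] not in archivable]

    print(archive)              # everything related to customer 0001 moved together
    print(customers, orders)    # the operational system keeps only active data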

The data can still be interrogated but is no longer a cause of performance degradation as data volumes expand. This tactic makes the sizing of databases, and therefore software and hardware requirements, easy to predict; the remaining variables are changes in operational behaviour and growth of the client base, at which point additional hardware or software can become necessary.

This function can also support a full Information Lifecycle Management system that includes the creation of data for system testing with features such as object movement (with referential integrity maintained) and data masking for security. Additional work required to support this would include the creation of a security model that defines how datasets can be created, who can view unmasked data (at the branch/column/customer/etc. level), and how the datasets can be extracted or moved.
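
Column-level masking for test datasets might be sketched as follows, assuming a simple security model with a single “view_unmasked” entitlement; the masked columns and the entitlement name are illustrative.

    # Minimal sketch of column-level masking for test datasets, assuming a simple
    # security model where certain columns are masked for users without the
    # 'view_unmasked' entitlement. Column choices and the entitlement name are illustrative.
    MASKED_COLUMNS = {"name", "iban"}

    def mask_row(row: dict, user_entitlements: set) -> dict:
        """Return a copy of the row, masking sensitive columns for unentitled users."""
        if "view_unmasked" in user_entitlements:
            return dict(row)
        return {k: ("****" if k in MASKED_COLUMNS else v) for k, v in row.items()}

    row = {"cust_no": "0001", "name": "Acme GmbH", "iban": "DE89370400440532013000", "country": "DE"}
    print(mask_row(row, {"view_unmasked"}))   # full data for entitled users
    print(mask_row(row, set()))               # masked copy for everyone else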

There are existing software products that can leverage the work already performed here to deliver the Information Lifecycle Management system.

Getting Started

Our experience shows that if a well-defined business process is implemented along with the technical steps described above, you can significantly reduce the cost of your data migration and eliminate many sources of risk, especially when the migration is repetitive in nature (for example, multiple countries migrated independently to a new system over a number of months or years). A typical cost curve levels out at around 30% of the initial cost per migration after the second migration. Furthermore, you can leverage the artifacts created to gain new value, as described in “Added value” above.