Exploring Change Data Capture (CDC)

The Change Data Capture (CDC) technique tracks and stores incremental changes in the source database so that they can be replicated to other databases in near real time. CDC has become a preferred solution for low-latency, efficient database migration and synchronization between relational on-premises or cloud databases in high-load environments. This whitepaper explores the three most popular methods of Change Data Capture.

Traditionally, companies and organizations have used the snapshot approach to migrate or synchronize data. With this technique, the state of the source database is captured as a snapshot and then applied to the target database, which means that the data is transferred in a single operation. Modifications of the source database are not allowed during snapshot replication, which causes database downtime. This approach is becoming unacceptable in modern business environments, where data flows continuously.

When migrating to the cloud, ensuring that time-sensitive information is replicated is crucial, especially when the data is frequently changing and interrupting connections to online databases is not possible.

CDC has three key benefits compared to snapshot replication:

reduces the volume of data transferred over the network by sending only incremental changes
delivers the most up-to-date data faster, which is important for real-time applications and business flows
minimizes system downtime and disruption of workloads

Approaches to Change Data Capture

This whitepaper covers the three most popular approaches to Change Data Capture. Each method is described briefly below, together with its strengths and weaknesses.

1. Timestamp tracking. With this method, every table included in CDC must have a service column that holds the timestamp of the last change. The CDC algorithm treats any row whose timestamp is later than the time of the last data capture as modified.

Pros: simple implementation
Cons: does not track deleted rows, so deletions are not replicated to the target database. This approach also requires extra CPU resources to scan tables for modified data.
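
The behaviour described above can be illustrated with a minimal Python sketch. It uses an in-memory SQLite database as a stand-in for the source; the `customers` table, the `updated_at` service column and the `capture_changes` helper are hypothetical names chosen for this example, and integer values stand in for real timestamps to keep the result deterministic.

```python
import sqlite3


def capture_changes(conn, last_capture):
    """Return rows whose service timestamp is later than the last capture."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY id",
        (last_capture,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
# updated_at is the assumed service column holding the time of the last change.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", 100), (2, "Bob", 100), (3, "Cy", 100)],
)

last_capture = 100  # time of the previous capture run

# Changes made in the source after the previous capture.
conn.execute("UPDATE customers SET name = 'Robert', updated_at = 150 WHERE id = 2")
conn.execute("INSERT INTO customers VALUES (4, 'Dee', 160)")
conn.execute("DELETE FROM customers WHERE id = 3")

changes = capture_changes(conn, last_capture)
print(changes)  # the updated and inserted rows; the deleted row 3 is absent
```

The update and the insert are picked up, but nothing in the result indicates that row 3 was deleted, which is the limitation listed under Cons.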

2. Trigger-based Change Data Capture. This is one of the popular methods for capturing data changes in corporate-scale databases. With this approach, insert, update and delete triggers are created for each table involved in CDC. Each trigger stores the necessary information about the data change event in a special 'history' table created in the source database. All changes are then replicated to the target database based on the records in the 'history' table.

For example, Intelligent Converters software creates a 'history' table with the following structure (for MySQL):

  CREATE TABLE `__history__`(
    table_name varchar(255) NOT NULL,
    pk_data text NOT NULL,
    state int NOT NULL,
    ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY(table_name,pk_data(255))
  );

where 'pk_data' is the string representation of the primary key or unique index of the captured row.

Pros: very reliable and detailed; the 'history' table provides an easy-to-use journal of all changes.
Cons: degrades performance because every data update causes additional writes to the database. It is important to test the performance of the triggers and evaluate whether the extra overhead is tolerable.
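
As a rough illustration of the trigger mechanism, the following Python sketch builds a similar journal in an in-memory SQLite database. It is not the Intelligent Converters implementation: the `customers` table, the trigger names and the state codes (1 = inserted, 2 = updated, 3 = deleted) are assumptions made for this example, and SQLite syntax differs slightly from MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');

-- Journal modelled on the structure shown above (SQLite types).
CREATE TABLE history (
    table_name TEXT NOT NULL,
    pk_data    TEXT NOT NULL,
    state      INTEGER NOT NULL,
    ts         TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (table_name, pk_data)
);

-- One trigger per event type; each keeps only the latest state of a row.
-- State codes 1/2/3 are hypothetical.
CREATE TRIGGER customers_ai AFTER INSERT ON customers BEGIN
    INSERT OR REPLACE INTO history (table_name, pk_data, state)
    VALUES ('customers', CAST(NEW.id AS TEXT), 1);
END;

CREATE TRIGGER customers_au AFTER UPDATE ON customers BEGIN
    INSERT OR REPLACE INTO history (table_name, pk_data, state)
    VALUES ('customers', CAST(NEW.id AS TEXT), 2);
END;

CREATE TRIGGER customers_ad AFTER DELETE ON customers BEGIN
    INSERT OR REPLACE INTO history (table_name, pk_data, state)
    VALUES ('customers', CAST(OLD.id AS TEXT), 3);
END;
""")

# Ordinary application writes; the triggers record them automatically.
conn.execute("UPDATE customers SET name = 'Anna' WHERE id = 1")
conn.execute("INSERT INTO customers VALUES (3, 'Cy')")
conn.execute("DELETE FROM customers WHERE id = 2")

journal = conn.execute(
    "SELECT table_name, pk_data, state FROM history ORDER BY pk_data"
).fetchall()
print(journal)
```

Unlike timestamp tracking, the journal contains an entry for the deleted row. Each application write also produces a second write to the journal, which is the overhead mentioned under Cons.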

3. Transaction Log CDC. Advanced database management systems (such as Oracle, MySQL, PostgreSQL, MS SQL) use transaction logs for backup and recovery purposes. However, those logs can also be used to track and replicate changes into the target database.

Pros: does not affect the source database system, since no additional transactions are executed for each data update. It also does not require changing the structure of the source tables or creating a service table.
Cons: it is hard to extract updated data from the transaction log because many DBMS vendors do not publish its format. The parsing algorithm may also stop working with new versions of the database management system if the log format changes. In addition, the DBMS regularly archives log files, so a CDC tool may lose some changes if it does not read the transaction log before it is archived.
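
The archiving risk can be shown with a toy model in Python. This is not a parser for any real transaction log format: the `LogEntry` record, its log sequence number (`lsn`) field and the `read_new_entries` helper are hypothetical, and the online log is represented as a plain list from which the DBMS is assumed to remove archived entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    lsn: int          # position in the log, strictly increasing
    operation: str    # "insert", "update" or "delete"
    row_key: int


class LogGapError(Exception):
    """Raised when unread entries are no longer in the online log."""


def read_new_entries(online_log, last_lsn):
    """Return entries after last_lsn, or fail if some were archived unread."""
    if online_log and online_log[0].lsn > last_lsn + 1:
        missing_to = online_log[0].lsn - 1
        raise LogGapError(f"entries {last_lsn + 1}..{missing_to} are gone")
    return [entry for entry in online_log if entry.lsn > last_lsn]


online_log = [
    LogEntry(1, "insert", 10),
    LogEntry(2, "update", 10),
    LogEntry(3, "insert", 11),
    LogEntry(4, "delete", 10),
]

# The reader had already processed entries 1 and 2.
first_batch = read_new_entries(online_log, last_lsn=2)
last_lsn = first_batch[-1].lsn

# Later the DBMS writes entries 5 and 6 and archives everything up to 5
# before the reader runs again.
online_log = [LogEntry(6, "update", 11)]

try:
    read_new_entries(online_log, last_lsn)
    gap_detected = False
except LogGapError as exc:
    gap_detected = True
    print("change lost:", exc)
```

In this model the reader can at least detect that entry 5 was archived before it was read. Recovering the change would require access to the archived log, which is outside the scope of this sketch.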

Trigger-based vs Transaction Log CDC

The Transaction Log CDC technique does not require permanent storage of the data stream. An event streaming platform such as Kafka can be used to capture changes and load them into the target database. Unlike the trigger-based approach, transaction log CDC extracts information about data updates from storage maintained by the DBMS instead of using its own.

Native transaction logs, also known as redo logs, are used by the database engine to record all database activity, which allows the database to be recovered after a failure. The trigger-based method creates its own event journal and has full control over it, while transaction log CDC relies on the underlying database transaction log. That is why the second approach requires neither modifications at the application level nor scanning of a 'history' table. On the other hand, the lack of control over the transaction log creates some challenges for this technique, which are explored in the next section of this whitepaper.

With the trigger-based CDC technique, triggers take part in every transaction on the captured tables: they track insert, update and delete events as they occur and store the corresponding changes in the 'history' table. In contrast, transaction log CDC operates independently of transactions and reads changes from the redo log file. This results in better performance, since CDC operations are not tied to every transaction executed in the database.

Why Intelligent Converters Uses Trigger-based CDC

1. Most database management systems do not document the format of their transaction logs. This makes it hard to develop an algorithm for parsing them.

2. DBAs are reluctant to change the configuration of a production database, for example to enable transaction logging on a database server where it is disabled. It is risky to rely on a feature that may be disabled in some configurations.

3. If the connection to the target database is lost due to a technical issue while Transaction Log CDC is sending data, this may cause data loss or duplication in the next replication run.
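
To illustrate the third point, the Python sketch below shows one way a 'history' journal can make an interrupted replication run safe to repeat. It is an assumption about how such a journal might be consumed, not a description of the Intelligent Converters product: the `apply_pending` and `clear_history` helpers, the `customers` table and the state code 3 for deleted rows are hypothetical, SQLite stands in for both databases, and concurrent writes during a run are ignored.

```python
import sqlite3


def apply_pending(source, target):
    """Copy every change listed in the source journal to the target."""
    pending = source.execute(
        "SELECT pk_data, state FROM history "
        "WHERE table_name = 'customers' ORDER BY pk_data"
    ).fetchall()
    for pk_data, state in pending:
        row_id = int(pk_data)
        if state == 3:  # assumed code for a deleted row
            target.execute("DELETE FROM customers WHERE id = ?", (row_id,))
            continue
        row = source.execute(
            "SELECT id, name FROM customers WHERE id = ?", (row_id,)
        ).fetchone()
        if row is not None:
            # Upsert, so applying the same change twice gives the same result.
            target.execute(
                "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)", row
            )
    target.commit()
    return [pk_data for pk_data, _ in pending]


def clear_history(source, keys):
    """Remove journal entries only after the target has committed them."""
    source.executemany(
        "DELETE FROM history WHERE table_name = 'customers' AND pk_data = ?",
        [(key,) for key in keys],
    )
    source.commit()


source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO customers VALUES (1, 'Anna'), (3, 'Cy');
CREATE TABLE history (
    table_name TEXT NOT NULL, pk_data TEXT NOT NULL, state INTEGER NOT NULL,
    PRIMARY KEY (table_name, pk_data)
);
INSERT INTO history VALUES
    ('customers', '1', 2), ('customers', '2', 3), ('customers', '3', 1);
""")

target = sqlite3.connect(":memory:")
target.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
""")

query = "SELECT id, name FROM customers ORDER BY id"

# First run: the changes reach the target, but assume the connection drops
# before the journal is cleared.
apply_pending(source, target)
after_first = target.execute(query).fetchall()

# Next run: the same journal entries are applied again.
keys = apply_pending(source, target)
after_retry = target.execute(query).fetchall()

clear_history(source, keys)
remaining = source.execute("SELECT COUNT(*) FROM history").fetchone()[0]
```

Because journal entries are removed only after the target has committed, an interrupted run leaves them in place, and repeating the run neither loses nor duplicates rows in this sketch.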

Conclusion

Change Data Capture (CDC) is an efficient technique for database migration and synchronization between relational databases, especially in high-load environments. CDC tracks and stores incremental changes in the source database and replicates them to other databases in near real time with a minimal volume of data transferred.

The three popular methods of CDC are timestamp tracking, trigger-based CDC and transaction log CDC, each with its strengths and weaknesses. Intelligent Converters prefers the trigger-based approach for its reliability and easy-to-use journal of all changes. Transaction log CDC, on the other hand, is challenging because of the lack of documentation about the transaction log format and the risk of data loss or duplication if the connection to the target database is lost.
