From:       To:      
Home > Documentation > Miscellaneous

Exploring Change Data Capture (CDC)

Change Data Capture (CDC) technique tracks and stores incremental changes of the source database to replicate those changes to other databases in near real-time. CDC has become the ideal solution for low-latency efficient database migration and synchronization between relational on-premises or cloud databases in high-load environments. This whitepaper explores three the most popular methods of Change Data Capture.

Traditionally, companies and organizations have used snapshot approach to migrate or synchronize data. According to this technique, source database state is captured as a snapshot and then it is applied to the target database, that means the data is transferred in a single operation. During the snapshot replication any modifications of the source database are not allowed, that causes the database downtime. Obviously, this approach is becoming unacceptable in circumstances of permanent data flow of the modern business environments.

When migrating to the cloud, ensuring that time-sensitive information is replicated is crucial, especially when the data is frequently changing and interrupting connections to online databases is not possible.

CDC has three key benefits compared to snapshot replication:

Approaches to Change Data Capture

There are three the most popular approaches to Change Data Capture. The following are the brief description of each method, their strength and weak sides.

1. Timestamp tracking. Following this method, tables included in CDC should have a service column representing the timestamp of last change. CDC algorithm considers any row having timestamp after the time of the last data capture as modified.

2. Trigger-based Change Data Capture. This is one of popular methods for capturing data changes for corporate-scale databases. Following this approach, triggers for insert, updates and delete are created for each table involved in CDC. Each of those triggers stores all necessary information about data changing event into special 'history' table created in the source database. Then, all changes are replicated to the target database based on the records from the 'history' table.

For example, Intelligent Converters software creates 'history' table having the following structure (for MySQL):

  CREATE TABLE `__history__`(
    table_name varchar(255) NOT NULL,
    pk_data text NOT NULL,
    state int NOT NULL,
    ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY(table_name,pk_data_src(255))
  );

where 'pk_data' is the string representation of primary key/unique index of the captured row.

3. Transaction Log CDC. Advanced database management systems (such as Oracle, MySQL, PostgreSQL, MS SQL) use transaction logs for backup and recovery purposes. However, those logs can also be used to track and replicate changes into the target database.

Trigger-based vs Transaction Log CDC

The Transaction Log CDC technique does not imply permanent storing the data stream. Kafka event streaming platform can be used to capture and load changes into the target database. Unlike trigger-based CDC approach, transaction log CDC extracts information about data updates from the storage created by DBMS instead using its own.

Native transaction logs, also known as redo logs, are utilized by a database engine to store all database activities, allowing for database recovery in the event of a failure. Trigger-based method creates its own events journal to track changes with full control over it, while transaction log CDC uses the underlying database transaction log. That is why no modifications at the application level or scanning of the 'history' table are required by the second approach. On the other hand, lack of control over the transaction log may cause some challenges for the corresponding technique explored in the next section of this whitepaper.

Following the trigger-based CDC technique, triggers are involved in every transaction over the captured tables to instantly track insert, update, or delete events as they occur and store the corresponding changes in 'history' table. In contrast, transaction log CDC operates autonomously from transactions, using a redo log file to record changes. This results in better performance since CDC operations are not directly connected to every transaction as it occurs in the database.

Why Intelligent Converters Use Trigger-based CDC?

1. Most database management systems have no documentation about the format of transaction logs. This makes hard to develop algorithm of parsing transaction logs.

2. DBAs do not like to make any changes into configuration of working database, for example enable transaction logs on the database server if it is disabled. It is risky to rely on the feature that may be disabled in some configurations.

3. If connection to the target database is lost due to a technical issue when Transaction Log CDC is sending data, this may cause data loss or duplication since in the next replication run.

Conclusion

Change Data Capture (CDC) is an efficient technique for database migration and synchronization between relational databases, especially in high-load environments. CDC involves tracking and storing incremental changes in the source database and replicating them in near real-time to other databases with minimal volume of data transferred.

Three popular methods of CDC are: timestamp tracking, trigger-based CDC and transaction log CDC, each with its strengths and weaknesses. The Trigger-based CDC approach is preferred by Intelligent Converters due to its reliability and easy-to-use journal of all changes. On the other hand, Transaction Log CDC is challenging due to the lack of documentation about the format of transaction logs, the potential risk of data loss, and duplication in case of a lost connection to the target database.

Have questions? Contact us