Sunday, June 29, 2025

Tuning and Troubleshooting Synchronous Redo Transport (Part 2)

Tuning and Troubleshooting Synchronous Redo Transport (Part 2)

Alireza Kamrani (06/29/2025)

 

 

Understanding What Causes Outliers

Any disruption to the I/O on the primary or standby databases, or spikes in network latency, can cause high log file sync outliers with synchronous redo transport. 

You can see this effect when the standby system's I/O subsytemis inferior to that of the primary system.

Often administrators host multiple databases such as dev and test on standby systems, which can impair I/O response. 

It is important to monitor I/O using iostat to determine if the disks reach maximum IOPS, because this affects the performance of SYNC writes.

Frequent log switches are significant cause of outliers. Consider what occurs on the standby when a log switch on the primary occurs, as follows.

1. Remote file server (RFS) process on the standby must finish updates to the standby redo log header.
2. RFS then switches into a new standby redo log with additional header updates.
3. Switching logs forces a full checkpoint on the standby.

This causes all dirty buffers in the buffer cache to be written to disk, causing a spike in write I/O.

In a non-symmetric configuration where the standby storage subsystem does not have the same performance as the primary database, this results in higher I/O latency.

4. The previous standby redo log must be archived, increasing both read and write I/O.

Effects of Synchronous Redo Transport Remote Writes

When you enable synchronous redo transport (SYNC), you introduce a remote write (remote file server (RFS) write to a standby redo log) in addition to the normal local write for commit processing.

This remote write, depending on network latency and remote I/O bandwidth, can make commit processing time increase. 

Because commit processing takes longer, you observe more sessions waiting on the Log Writer Process (LGWR) to finish its work and begin work on the commit request, that is, application concurrency has increased. 

You can observe increased application concurrency by analyzing database statistics and wait events.

Consider the example in the following table.

Affect of Sync Transport Increasing Application Concurrency 


SYNC

Redo Rate

Network Latency

TPS from AWR

log file sync average (ms)

log file parallel write average (ms)

RFS random I/O

SYNC remote write average (ms)

Redo write size (KB)

Redo writes

Defer

25MB

0

5,514.94

0.74

0.47

NA

NA

10.58

2,246,356

Yes

25MB

0

5,280.20

2.6

.51

.65

.95

20.50

989,791

Impact

0

-

-4%

+251%

+8.5%

NA

NA

+93.8%

-55.9%

 

In the above example, enabling SYNC reduced the number of redo writes, but increased the size of each redo write. 

Because the size of the redo write increased, you can expect the time spent doing the I/O (both local and remote) to increase. 

The log file sync wait time is higher because there is more work per wait.

However, at the application level, the impact on the transaction rate or the transaction response time might change very little as more sessions are serviced for each commit. 

This is why it is important to measure the impact of SYNC at the application level, and not depend entirely on database wait events.

It is also a perfect example of why log file sync wait event is a misleading indicator of the actual impact SYNC has on the application.


Example of Synchronous Redo Transport Performance Troubleshooting

To look at synchronous redo transport performance, calculate the time spent for local redo writes latency, average redo write size for each write, and overall redo write latency, as shown here.

Use the following wait events to do the calculations.

• local redo write latency = 'log file parallel write'
• remote write latency = ‘SYNC remote write’
• average redo write size per write = ‘redo size’ / ‘redo writes’
• average commit latency seen by foregrounds = 'log file sync'

Statistics from an Automatic Work Repository (AWR) report on an Oracle database are provided in the following table. 

Synchronous redo transport (SYNC) was enabled to a local standby with a 1ms network latency to compare the performance impact to a baseline with SYNC disabled.

 

Assessing Synchronous Redo Transport Performance with Oracle Database

Metric

Baseline (No SYNC)

SYNC

Impact

redo rate (MB/s)

25

25

no change

log file sync

0.68

4.60

+576%

log file parallel write average (ms)

0.57

0.62

+8.8%

TPS

7,814.92

6224.03

-20.3%

RFS random I/O

NA

2.89

NA

SYNC remote write average (ms)

NA

3.45

NA

redo writes

2,312,366

897,751

-61,2%

redo write size (KB)

10.58

20.50

+93.8%

 

In the above example observe that log file sync waits averages increased dramatically after enabling SYNC. 

While the local writes remained fairly constant, the biggest factor in increasing log file sync was the addition of the SYNC remote write. 

Of the SYNC remote write the network latency is zero, so focusing on the remote write into the standby redo log shows an average time of 2.89ms. 

This is an immediate red flag given that the primary and standby were using the same hardware, and the SYNC remote write average time should be similar to the primary's log file parallel write average time.

In the above example, the standby redo logs have multiple members, and they are placed in a slower performing disk group. 

After reducing the standby redo logs to a single member, and placing them in a fast disk group, you can see results such as those shown in the following table.

 

SYNC Performance After Reducing Standby Redo Logs to a Single Member and Placing on a Fast Disk Group


Metric

Baseline (No SYNC)

SYNC

Impact

redo rate (MB/s)

25

25

no change

log file sync

0.67

1.60

+139%

log file parallel write

0.51

0.63

+23.5%

TPS

7714.36

7458.08

-3.3%

RFS random I/O

NA

.89

NA

SYNC remote write average (ms)

NA

1.45

NA

redo writes

2,364,388

996,532

-57.9%

redo write size (KB)

10.61

20.32

+91.5%

 

Wednesday, June 25, 2025

An Overview of Oracle Data Guard Capabilities

  

An Overview of Oracle Data Guard Capabilities:

Oracle Data Guard ensures high availability, data protection, and disaster recovery for enterprise data.

Oracle Data Guard provides a comprehensive set of services that create, maintain, manage, and monitor one or more standby databases to enable production Oracle databases to survive disasters and data corruptions.

Oracle Data Guard maintains these standby databases as copies of the production database.

Then, if the production database becomes unavailable because of a planned or an unplanned outage, Oracle Data Guard can switch any standby database to the production role, minimizing the downtime associated with the outage.

Oracle Data Guard can be used with traditional backup, restoration, and cluster techniques to provide a high level of data protection and data availability.

Oracle Data Guard transport services are also used by other Oracle features such as Oracle Streams and Oracle GoldenGate for efficient and reliable transmission of redo from a source database to one or more remote destinations.

With Oracle Data Guard, administrators can optionally improve production database performance by offloading resource-intensive backup and reporting operations to standby systems.

Oracle Database with Oracle Data Guard

Oracle Data Guard is a high availability and disaster-recovery solution that provides very fast automatic failover (referred to as fast-start failover) in database failures, node failures, corruption, and media failures. Furthermore, the standby databases can be used for read-only access and subsequently for reader farms, for reporting, and for testing and development.

Although traditional solutions (such as backup and recovery from tape, storage-based remote mirroring, and database log shipping) can deliver some level of high availability, Oracle Data Guard provides the most comprehensive high availability and disaster recovery solution for Oracle databases.

Oracle Data Guard Advantages Over Traditional Solutions

Oracle Data Guard provides a number of advantages over traditional solutions, including the following:

  • Fast, automatic or automated database failover for data corruptions, lost writes, and database and site failures
  • Automatic corruption repair automatically replaces a corrupted block on the primary or physical standby by copying a good block from a physical standby or primary database
  • Most comprehensive protection against data corruptions and lost writes on the primary database
  • Reduced downtime for storage, Oracle ASM, Oracle RAC, system migrations and some platform migrations, and changes using Data Guard switchover
  • Reduced downtime with Oracle Data Guard rolling upgrade capabilities
  • Ability to off-load primary database activities—such as backups, queries, or reporting—without sacrificing the RTO and RPO ability to use the standby database as a read-only resource using the real-time query apply lag capability
  • Ability to integrate non-database files using Oracle Database File System (DBFS) as part of the full site failover operations
  • No need for instance restart, storage remastering, or application reconnections after site failures
  • Transparency to applications
  • Transparent and integrated support for application failover
  • Effective network utilization

For data resident in Oracle databases, Oracle Data Guard, with its built-in zero-data-loss capability, is more efficient, less expensive, and better optimized for data protection and disaster recovery than traditional remote mirroring solutions.

Oracle Data Guard provides a compelling set of technical and business reasons that justify its adoption as the disaster recovery and data protection technology of choice, over traditional remote mirroring solutions.

The types of standby databases are as follows:

  • Physical standby database

Provides a physically identical copy of the primary database, with on-disk database structures that are identical to the primary database on a block-for-block basis. The database schema, including indexes, are the same. A physical standby database is kept synchronized with the primary database, through Redo Apply, which recovers the redo data received from the primary database and applies the redo to the physical standby database.

Physical standby database can receive and apply redo while it is open for read-only access. A physical standby database can therefore be used concurrently for data protection and reporting.

Additionally, a physical standby database can be used to install eligible one-off patches, patch set updates (PSUs), and critical patch updates (CPUs), in rolling fashion.

  • Logical standby database

Contains the same logical information as the production database, although the physical organization and structure of the data can be different.

The logical standby database is kept synchronized with the primary database through SQL Apply, which transforms the data in the redo received from the primary database into SQL statements and then executes the SQL statements on the standby database.

The flexibility of a logical standby database lets you upgrade Oracle Database software (patch sets and new Oracle Database releases) and perform other database maintenance in rolling fashion with almost no downtime.

 From Oracle Database 11g onward, the transient logical database rolling upgrade process can also be used with existing physical standby databases.

  • Snapshot Standby Database

A snapshot standby database is a fully updatable standby database.

Like a physical or logical standby database, a snapshot standby database receives and archives redo data from a primary database. Unlike a physical or logical standby database, a snapshot standby database does not apply the redo data that it receives.

The redo data received by a snapshot standby database is not applied until the snapshot standby is converted back into a physical standby database, after first discarding any local updates made to the snapshot standby database.

A snapshot standby database is best used in scenarios that require a temporary, updatable snapshot of a physical standby database.

For example, you can use the Oracle Real Application Testing option to capture the database workload on a primary and then replay it for test purposes on the snapshot standby.

Because redo data received by a snapshot standby database is not applied until it is converted back into a physical standby, the time needed to recover from a primary database                                                                                                                                         failure is directly proportional to the amount of redo data that needs to be applied.

Oracle Data Guard Advantages Compared to Remote Mirroring Solutions

The following list summarizes the advantages of using Oracle Data Guard compared to using remote mirroring solutions:

  • Better network efficiency—With Oracle Data Guard, only the redo data needs to be sent to the remote site and the redo data can be compressed to provide even greater network efficiency. However, if a remote mirroring solution is used for data protection, typically you must mirror the database files, the online redo log, the archived redo logs, and the control file. If the fast recovery area is on the source volume that is remotely mirrored, then you must also remotely mirror the flashback logs. Thus, compared to Oracle Data Guard, a remote mirroring solution must transmit each change many more times to the remote site.
  • Better performance—Oracle Data Guard only transmits write I/O`s to the redo log files of the primary database, whereas remote mirroring solutions must transmit these writes and every write I/O to data files, additional members of online log file groups, archived redo log files, and control files.

Oracle Data Guard is designed so that it does not affect the Oracle database writer (DBWR) process that writes to data files, because anything that slows down the DBWR process affects database performance. However, remote mirroring solutions affect DBWR process performance because they subject all DBWR process write I/O`s to network and disk I/O induced delays inherent to synchronous, zero-data-loss configurations.

Compared to mirroring, Oracle Data Guard provides better performance and is more efficient, Oracle Data Guard always verifies the state of the standby database and validates the data before applying redo data, and Oracle Data Guard enables you to use the standby database for updates while it protects the primary database.

  • Better suited for WANs—Remote mirroring solutions based on storage systems often have a distance limitation due to the underlying communication technology (Fibre Channel or ESCON (Enterprise Systems Connection)) used by the storage systems.        In a typical example, the maximum distance between the systems connected in a point-to-point fashion and running synchronously can be only 10 kilometers. By using specialized devices, this distance can be extended to 66 kilometers. However, when the data centers are located more than 66 kilometers apart, you must use a series of repeaters and converters from third-party vendors. These devices convert ESCON or Fibre Channel to the appropriate IP, ATM, or SONET networks.

 Although in new version Oracle have some of new features and technologies to minimize Redo transfer between long distances such as Oracle cascading Standby or Far-Sync technology. Far sync instances are part of the Oracle Active Data Guard Far Sync feature, which requires an Oracle Active Data Guard license. A far sync instance consumes very little disk and processing resources, yet provides the ability to failover to a terminal destination with zero data loss, as well as offload the primary database of other types of overhead (for example, redo transport). A far sync instance manages a control file, receives redo into standby redo logs (SRLs), and archives those SRLs to local archived redo logs, but that is where the similarity with standbys ends. A far sync instance does not have user data files, cannot be opened for access, cannot run redo apply, and can never function in the primary role or be converted to any type of standby database. All redo transport options available to a primary when servicing a typical standby destination are also available to it when servicing a far sync instance. And all redo transport options are available to a far sync instance when servicing terminal destinations (for example, performing redo transport compression, if you have a license for the Oracle Advanced Compression option).

 

 

  • Better resilience and data protection—Oracle Data Guard ensures much better data protection and data resilience than remote mirroring solutions. This is because corruptions introduced on the production database probably can be mirrored by remote mirroring solutions to the standby site, but corruptions are eliminated by Oracle Data Guard.

For example, if a stray write occurs to a disk, or there is a corruption in the file system, or the host bus adaptor corrupts a block as it is written to disk, then a remote mirroring solution may propagate this corruption to the disaster-recovery site. Because Oracle Data Guard only propagates the redo data in the logs, and the log file consistency is checked before it is applied, all such external corruptions are eliminated by Oracle Data Guard. Automatic block repair may be possible, thus eliminating any downtime in an Oracle Data Guard configuration.

  • Higher flexibility—Oracle Data Guard is implemented on pure commodity hardware. It requires only a standard TCP/IP-based network link between the two computers. There is no fancy or expensive hardware required. It also allows the storage to be laid out in a different fashion from the primary computer. For example, you can put the files on different disks, volumes, file systems, and so on.
  • Better functionality—Oracle Data Guard provides full suite of data protection features that provide a much more comprehensive and effective solution optimized for data protection and disaster recovery than remote mirroring solutions. For example: Active Data Guard, Redo Apply for physical standby databases, and SQL Apply for logical standby databases, multiple protection modes, push-button automated switchover and failover capabilities, automatic gap detection and resolution, GUI-driven management and monitoring framework, cascaded redo log destinations.
  • Higher ROI—Businesses must obtain maximum value from their IT investments, and ensure that no IT infrastructure is sitting idle. Oracle Data Guard is designed to allow businesses get something useful out of their expensive investment in a disaster-recovery site. Typically, this is not possible with remote mirroring solutions.

Tuning and Troubleshooting Synchronous Redo Transport (Part 2)

Tuning and Troubleshooting Synchronous Redo Transport (Part  2 ) Alireza Kamrani (06 /29/ 2025)     Understanding What Causes Outliers Any d...