Sunday, April 20, 2025

Tune and Troubleshoot Oracle Data Guard (Part 7 of 8)

Alireza Kamrani 

04/20/2025


To improve apply performance holistically, leverage the data gathering and troubleshooting methodology described in the topics below.


Understanding Redo Apply and Redo Apply Performance Expectations


Standby database recovery is the process of replaying all DML and DDL operations. 


The high level process is:

  1. Redo is received from the primary database and written into standby redo logs (SRLs). When the database is an Oracle RAC database, each thread (instance) is stored in its assigned SRLs.
  2. The log merger process, sometimes known as the recovery coordinator, merges the threads of redo and places the resulting change vectors into memory buffers.
  3. Recovery worker processes identify which data blocks are required and read them into the buffer cache if they are not already present. Then the worker processes apply the change vectors to the blocks in the buffer cache.
  4. At checkpoint time, database writer processes write the validated buffer changes to data files, advancing the database's checkpoint time stamp, called the System Commit Number (SCN). Checkpoint can be the most extensive I/O load in the recovery process.
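
To see which processes are participating in recovery on the standby, the V$DATAGUARD_PROCESS view can be queried. This is a quick sketch, assuming Oracle Database 12.2 or later where the view is available:

SQL> select name, role, action, thread#, sequence#
     from v$dataguard_process
     order by role, name;

The recovery coordinator appears with the name PR00 and the recovery workers as PRnn.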

Redo Apply Performance Expectations

Performance, and the resulting apply rate, mainly depend on the type of workload that is being recovered and the system resources allocated to and available for recovery.

Oracle recommends that the primary and standby database systems are symmetric, including equivalent I/O subsystems, memory, and CPU resources. The primary reason for this recommendation is so that the application performs at the same level, no matter which database is the primary database; however, redo apply performance also benefits greatly from symmetric primary and standby databases.

Data protection features such as DB_BLOCK_CHECKING, DB_BLOCK_CHECKSUM, and DB_LOST_WRITE_PROTECT require CPU and I/O resources, as does reporting on the standby database using Oracle Active Data Guard.


For the most part, redo apply performance should keep up with the redo generation rate, resulting in a near-zero apply lag when system resources are symmetric. During peak workloads, there may be a slight redo apply lag, which should naturally reduce to near zero once workloads return to normal levels.


OLTP Workloads

Recovering Online Transaction Processing (OLTP) workloads can be very I/O intensive because an OLTP workload performs small changes to many different blocks. This results in large numbers of small random block reads into the buffer cache during recovery. Subsequently, the database writers run large batches of write I/Os to maintain the buffer cache and to checkpoint the database periodically. 


Therefore, recovery of OLTP workloads requires the storage subsystem to handle a high number of I/Os Per Second (IOPS) in order to achieve optimal rates. This is another reason for recommending that the primary and standby database systems are symmetric.

Recovery testing of OLTP workloads, generated by swingbench on Oracle Exadata Database Machine quarter rack systems with no resource bottlenecks, achieved approximately 150 MB/sec apply rates. Rates of 200+ MB/s with single instance redo apply have been observed by customers on larger Exadata systems. These rates are more challenging to achieve in non-Exadata systems since the I/O and network throughput are lower. 


Batch Workloads

In contrast to OLTP workload recovery, recovering batch workloads is more efficient because batch workloads consist of large sequential reads and writes. Far more redo changes are applied while reading and modifying significantly fewer data blocks, resulting in much faster redo apply rates than for OLTP workloads. In addition, batch direct load operation recovery optimizations result in greater efficiency and even higher recovery rates.


Using batch load or parallel DML (PDML) workloads with no impeding system resource bottleneck, internal redo apply testing on small Exadata Database Machine quarter rack systems resulted in approximately 200-300 MB/sec apply rates. Customers have observed 600+ MB/sec apply rates with single instance redo apply for their batch workloads for larger Exadata systems. These rates can be achieved by non-Exadata systems, but system resource capacity and scalable network and I/O subsystems are required to handle these demanding workloads.


Mixed Workloads

The difference between OLTP and batch recovery performance profiles and different system shapes explains why applications with variation in their mixtures of OLTP and batch workloads can have different recovery rates at a standby database, even if the primary database redo generation rates are similar. Customers have achieved 100-1100 MB/sec redo apply rates with various mixed workloads for various Exadata systems. These rates can be achieved by non-Exadata systems, but system resource capacity and scalable database compute, network, and I/O subsystems are required to handle these demanding workloads. These extreme redo apply rates are rarely achieved on non-Exadata systems.


Catch Up Redo Apply Performance Expectations

Compared to real-time redo apply, redo apply during a "catch up" period may require even more system resources. If there is a large redo gap, consider the recommendations below.


If the apply lag is larger than 24 hours, consider using a standby roll forward method to skip over the gap rather than apply all of the redo. See How to Roll Forward a Standby Database Using Recover Database From Service (12.2 and higher) (Doc ID 2850185.1).

In cases where there is a large redo gap between the primary and standby, or when there is an unresolvable gap of missing logs, the standby database can be 'rolled forward' using the RECOVER DATABASE FROM SERVICE command introduced in Oracle Database 12c. This process recovers the blocks that have changed since the standby database's SCN from the primary to the standby. It is more efficient for catching up the standby than applying every individual change to the database, as is done with managed recovery.


This approach pulls changed Oracle data blocks directly from the primary database, and can potentially mitigate a large redo gap in half the time required to apply all of the redo. 

The disadvantages of this approach are:

  • Logical corruption and lost write detection checks and balances that are inherent to redo apply and standby databases are skipped
  • Manual intervention is required to issue these commands and to restart redo apply once the roll forward has completed.

Data blocks are still verified for physical corruptions.
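
As a rough illustration of the roll forward, the following is a minimal sketch. It assumes Oracle Database 18c or later (where RMAN automates the refresh in a single RECOVER STANDBY DATABASE FROM SERVICE command), a non-broker configuration, and a TNS alias prim_svc (a placeholder) pointing to the primary; on 12.2 the MOS note above describes a multi-step variant:

-- On the standby: stop redo apply before the roll forward
SQL> alter database recover managed standby database cancel;

-- From RMAN connected to the standby instance, pull changed blocks from the primary
RMAN> recover standby database from service prim_svc;

-- Restart redo apply once the roll forward completes
SQL> alter database recover managed standby database disconnect from session;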


Verify Apply Lag

Recovery performance can vary with the workload type and the redo generation rate of the primary database. A lower apply rate does not necessarily indicate a recovery performance issue. However, a persistent or increasing apply lag, without an accompanying transport lag, is the best indication of a recovery performance bottleneck.

To identify and quantify apply lags and transport lags, query the V$DATAGUARD_STATS view in the standby database.

 

SQL> select name, value, time_computed, datum_time from v$dataguard_stats where name like '%lag';


The DATUM_TIME column is the local time on the standby database when the datum used to compute the metric was received. The lag metrics are computed based on data that is periodically received from the primary database. An unchanging value in this column across multiple queries indicates that the standby database is not receiving data from the primary database. The potential data loss in this scenario would be from the last datum time from V$DATAGUARD_STATS to the current time on the standby.
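
The width of that window can be estimated directly on the standby. A minimal sketch, assuming the DATUM_TIME string uses the usual MM/DD/YYYY HH24:MI:SS format (adjust the mask if it differs):

SQL> select name, datum_time,
            round((sysdate - to_date(datum_time,'MM/DD/YYYY HH24:MI:SS'))*86400) as seconds_since_datum
     from v$dataguard_stats
     where name = 'apply lag';

A steadily growing SECONDS_SINCE_DATUM across repeated queries confirms that no new data is arriving from the primary.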

To obtain a histogram that shows the history of transport or apply lag values since the standby instance was last started, query the V$STANDBY_EVENT_HISTOGRAM view.


 SQL> select * from v$standby_event_histogram where name like '%lag' and count >0;


To evaluate the transport or apply lag over a time period, take a snapshot of V$STANDBY_EVENT_HISTOGRAM in the standby database at the beginning of the time period, and compare that snapshot with one taken at the end of the time period.


SQL> col NAME format a10

SQL> select NAME, TIME, UNIT, COUNT, LAST_TIME_UPDATED
     from V$STANDBY_EVENT_HISTOGRAM
     where name like '%lag' and count > 0
     order by LAST_TIME_UPDATED;


Example output:

 

NAME       TIME   UNIT       COUNT LAST_TIME_UPDATED
---------- ------ ---------- ----- -------------------
apply lag      23 seconds        3 02/05/2022 16:30:59
apply lag     135 seconds        1 02/05/2022 16:31:02
apply lag     173 seconds        2 02/05/2022 16:32:03
apply lag     295 seconds        2 02/05/2022 16:34:04
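
One lightweight way to capture the two snapshots on a (possibly read-only) standby is to spool the query output to files and compare them afterward. This is just a sketch, and the file names are placeholders:

SQL> spool /tmp/lag_histogram_start.lst
SQL> select name, time, unit, count, last_time_updated
     from v$standby_event_histogram
     where name like '%lag' and count > 0
     order by name, time;
SQL> spool off
-- Repeat with a second spool file at the end of the measurement period,
-- then compare the COUNT values per TIME bucket between the two files.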


A transport lag can cause an apply lag. If a high apply lag is observed with a near-zero transport lag, continue with the redo apply investigation in the Gather Information section below.


If a high transport lag is observed, first address the transport lag using the methodology in Redo Transport Troubleshooting and Tuning (Part 1).


Gather Information

Gather the following information when an unacceptable apply lag is occurring:


When did the apply lag occur?
Record the V$DATAGUARD_STATS and V$STANDBY_EVENT_HISTOGRAM data every 15 to 30 minutes to identify when the lag started and how the lag changed over time during the last 24 hours.

SQL> select name, value, time_computed, datum_time from v$dataguard_stats where name like '%lag';

SQL> select * from v$standby_event_histogram where name like '%lag' and count >0;

Does the apply lag occur during a certain time period, such as daily at midnight for daily batch operations, monthly during large batch operations, or quarterly during quarter-end processing?


Gather data from the standby Automatic Workload Repository (AWR) report and the V$RECOVERY_PROGRESS view, and take multiple standby AWR snapshots at 30-minute intervals before and during the apply lag.

For example:

SQL> set lines 120 pages 99

SQL> alter session set nls_date_format='YYYY/MM/DD HH24:MI:SS';

SQL> select START_TIME, ITEM, SOFAR, UNITS from gv$recovery_progress;

Sample output:

START_TIME          ITEM                      SOFAR UNITS
------------------- ----------------------- ------- ----------
2022/02/28 23:02:36 Log Files                     8 Files
2022/02/28 23:02:36 Active Apply Rate         54385 KB/sec
2022/02/28 23:02:36 Average Apply Rate        12753 KB/sec
2022/02/28 23:02:36 Maximum Apply Rate        65977 KB/sec
2022/02/28 23:02:36 Redo Applied               2092 Megabytes
2022/02/28 23:02:36 Last Applied Redo             0 SCN+Time
2022/02/28 23:02:36 Active Time                  41 Seconds
2022/02/28 23:02:36 Apply Time per Log            1 Seconds
2022/02/28 23:02:36 Checkpoint Time per Log       0 Seconds
2022/02/28 23:02:36 Elapsed Time                168 Seconds
2022/02/28 23:02:36 Standby Apply Lag             2 Seconds


The simplest way to determine application throughput in terms of redo volume is to collect Automatic Workload Repository (AWR) reports on the primary database during normal and peak workloads, and determine the number of bytes per second of redo data the production database is producing. Then compare the speed at which redo is being generated with the Active Apply Rate columns in the V$RECOVERY_PROGRESS view to determine if the standby database is able to maintain the pace.
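
The redo generation rate can also be pulled directly from the AWR repository on the primary rather than reading individual reports. The following is a sketch that assumes an Oracle Diagnostics Pack license and contiguous AWR snapshots (the cumulative 'redo size' statistic resets after an instance restart, which would produce negative deltas):

SQL> select s.snap_id, s.instance_number, s.end_interval_time,
            round( (st.value - lag(st.value) over
                      (partition by st.instance_number order by s.snap_id))
                   / ((cast(s.end_interval_time as date) - cast(s.begin_interval_time as date)) * 86400)
                   / 1024 / 1024, 1) as redo_mb_per_sec
     from dba_hist_snapshot s
     join dba_hist_sysstat st
       on st.dbid = s.dbid
      and st.snap_id = s.snap_id
      and st.instance_number = s.instance_number
     where st.stat_name = 'redo size'
     order by s.instance_number, s.snap_id;

Compare the resulting MB/sec figures with the Active Apply Rate reported by V$RECOVERY_PROGRESS on the standby.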

If the apply lag is above your expectations, then evaluate redo apply performance by querying the V$RECOVERY_PROGRESS view.

The most useful statistic is the Active Apply Rate, because the Average Apply Rate includes idle time spent waiting for redo to arrive, making it less indicative of apply performance.


Active Session History

In cases where standby AWR is not available, or the standby database is not in open read-only mode, the top waits can be gathered using the V$ACTIVE_SESSION_HISTORY view. 


Standby AWR is strongly recommended due to the additional information and detail provided, but these queries are useful in some cases.

To select the top 10 waits over the last 30 minutes (replace 30 with some other number of minutes before the current time):


select * from (
  select a.event_id, e.name, sum(a.time_waited) total_time_waited
  from v$active_session_history a, v$event_name e
  where a.event_id = e.event_id
  and a.sample_time >= (sysdate - 30/(24*60))
  group by a.event_id, e.name
  order by 3 desc)
where rownum < 11;


To select the waits between two timestamps (the example shows a 3-hour period between 2021/01/01 00:00:00 and 2021/01/01 03:00:00):


select * from (
  select a.event_id, e.name, sum(a.time_waited) total_time_waited
  from v$active_session_history a, v$event_name e
  where a.event_id = e.event_id
  and a.sample_time between to_date('2021/01/01 00:00:00','YYYY/MM/DD HH24:MI:SS')
                        and to_date('2021/01/01 03:00:00','YYYY/MM/DD HH24:MI:SS')
  group by a.event_id, e.name
  order by 3 desc)
where rownum < 11
/


Compare Redo Generation Rate History on the Primary

There are cases where the primary database's redo generation rate is exceptionally high for a short period of time, such as during large batch jobs, data loads, Data Pump operations, create table as select (CTAS) or PDML operations, or end-of-month, quarter, or year batch updates.

Obtain the redo generation history from the primary database and compare that to when the redo transport or redo apply lag started. Check if the redo generation rate is exceptionally high due to additional workloads, such as adding new pluggable databases (PDBs) or new application services. Additional tuning may be required to accommodate this additional load. 

As part of troubleshooting, gather the following information or address the following questions:


Gather daily history of the primary database's redo generation rate using this query.


SQL> select trunc(completion_time) as "DATE", count(*) as "LOG SWITCHES",
            round(sum(blocks*block_size)/1024/1024) as "REDO PER DAY (MB)"
     from v$archived_log
     where dest_id=1
     group by trunc(completion_time)
     order by 1;

Gather the per-log redo generation rate, starting 6 hours before the start of any redo apply or transport lag.

SQL> alter session set nls_date_format='YYYY/MM/DD HH24:MI:SS';


SQL> select thread#, sequence#, blocks*block_size/1024/1024 MB,
            (next_time-first_time)*86400 sec,
            (blocks*block_size/1024/1024)/((next_time-first_time)*86400) "MB/s"
     from v$archived_log
     where ((next_time-first_time)*86400 <> 0)
     and first_time between to_date('2015/01/15 08:00:00','YYYY/MM/DD HH24:MI:SS')
                        and to_date('2015/01/15 11:00:00','YYYY/MM/DD HH24:MI:SS')
     and dest_id=1
     order by first_time;

Is this primary redo generation rate exceptionally high compared to prior history?


If possible, determine the workload that corresponds to the high redo generation rate, and evaluate if it's transient or if it can be tuned. 
For example, for large purge operations, consider truncate or drop partition operations to reduce the redo generation volumes.
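
For illustration, a partition-level purge generates only a small amount of metadata redo, whereas deleting the same rows generates undo and redo for every row touched. The table and partition names below are hypothetical:

-- Hypothetical example: purge by dropping an aged partition instead of running a bulk DELETE
SQL> alter table sales drop partition sales_q1_2019 update global indexes;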


Tune Single Instance Redo Apply

Single instance redo apply (SIRA) tuning is an iterative process and a mandatory prerequisite before even evaluating multi-instance redo apply (MIRA). 


The iterative process consists of:

  1. Evaluating and addressing system resource bottlenecks
  2. Tuning based on top standby database wait events
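
For a quick, point-in-time view of what the recovery processes are currently waiting on, a query such as the following can be used (a sketch only; standby AWR and ASH remain the primary tools for identifying the top wait events):

SQL> select p.pname, s.event, s.state, s.seconds_in_wait
     from v$session s
     join v$process p on p.addr = s.paddr
     where p.pname like 'PR%'   -- recovery coordinator (PR00) and workers (PRnn)
     order by p.pname;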

Evaluate System Resource Bottlenecks

First, evaluate system resources such as CPU utilization and I/O subsystem. Use utilities such as top and iostat or statistics from OSwatcher or ExaWatcher to determine if there is contention for those resources. 

Addressing any resource bottlenecks to free up resources required for redo apply can improve apply performance.


Redo apply can be impacted if:

  • The managed recovery node is completely CPU bound
  • The standby database's I/O system is saturated
  • The standby database SGA, specifically the buffer cache, is not at least the same size as (or larger than) that of the primary database


For optimal recovery performance the standby database system requires: 

  • Sufficient CPU utilization for Recovery Coordinator (PR00) and recovery workers (PRnn)
  • Sufficient I/O bandwidth to maintain low I/O latency during peak rates
  • A network interface that can receive the peak redo rate volumes, in addition to any other network activity across the same interface
  • Sufficient memory to accommodate a symmetric SGA and buffer cache; the size of the log buffer and buffer cache generally have the biggest impact on redo apply performance
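
Because memory symmetry matters, it can be useful to run the same sizing query on both the primary and the standby and compare the results (a simple sketch):

SQL> select name, round(bytes/1024/1024) as mb
     from v$sgainfo
     where name in ('Maximum SGA Size','Buffer Cache Size','Shared Pool Size','Redo Buffers');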

What to gather and how?

  • Gather standby Automatic Workload Repository (AWR) reports with intervals of 30 minutes or less. 
  • Gather Active Session History (ASH) data for more real time granular waits. 
  • Gather Oracle Linux OSwatcher or Oracle Exadata ExaWatcher data to analyze system resources. 
  • Gather top process information to check if the recovery coordinator (PR00) is CPU bound by using top or ps commands.
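
To map the recovery coordinator and workers to their operating system process IDs before checking them with top or ps, a query such as the following can be used (a sketch):

SQL> select p.pname, p.spid
     from v$process p
     where p.pname like 'PR%'
     order by p.pname;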


Some common indicators and causes of resource bottlenecks include:

  • Low CPU idle time may indicate the system is CPU bound
  • Long disk or flash service times or high IOPS may indicate I/O contention or saturation
  • Undersized systems and shared systems with many active databases may cause contention for these resources
  • Reporting workloads in an Active Data Guard standby can also cause contention.
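
As a rough gauge of how much concurrent reporting activity an Active Data Guard standby is serving alongside recovery, count the active user sessions (a sketch):

SQL> select count(*) as active_user_sessions
     from v$session
     where type = 'USER' and status = 'ACTIVE';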

