Tuning and
Troubleshooting Synchronous Redo Transport (Part 1)
Alireza Kamrani
(06/08/2025)
Introduction:
At the heart of this process lies the Log Writer (LGWR)
process, responsible for writing redo entries from the log buffer to the online
redo log files. When a session issues a COMMIT, the server process triggers a
log flush, signaling LGWR to initiate an I/O submit operation to persist redo
records. In a synchronous configuration, LGWR must also wait for an ACK from
the Remote File Server (RFS) on the standby system after it has written the
redo data to its standby redo logs.
This entire operation chain — from commit call to I/O
submit, network round-trip, and acknowledgment — is highly sensitive to
latencies at each point. Tuning and troubleshooting synchronous redo transport,
therefore, requires a deep understanding of internal wait events (such as log
file sync, log file parallel write, and SYNC RFS write), network behavior, redo
generation rates, and LGWR performance.
This guide delves into the internal mechanisms that govern
synchronous redo transport, offers diagnostic techniques to pinpoint
bottlenecks, and provides tuning strategies to ensure optimal transaction
throughput and data protection.
Understanding How Synchronous
Transport Ensures Data Integrity
The following algorithms ensure data consistency in an
Oracle Data Guard synchronous redo transport configuration.
- Log
Writer Process (LGWR) redo write on the primary database online redo log
and the Data Guard Network Services Server (NSS) redo write to
standby redo log are identical.
- The
Data Guard Managed Recovery Process (MRP) at the standby database cannot
apply redo unless the redo has been written to the primary database online
redo log, with the only exception being during a Data Guard failover
operation (when the primary is gone).
Finding NSS
processes:
DGMGRL>
host ps -edf | grep --color=auto ora_nss[0-9]
Executing
operating system command(s):" ps -edf | grep --color=auto ora_nss[0-9]"
oracle 2356 1 0 19:15
? 00:00:00
ora_nss3_ORCL
oracle 8971 1 0 19:07
? 00:00:00
ora_nss2_ORCL
In addition to shipping redo synchronously, NSS and LGWR
exchange information regarding the safe redo block boundary that standby
recovery can apply up to from its standby redo logs (SRLs).
This prevents the standby from applying redo it may have
received, but which the primary has not yet acknowledged as committed to its
own online redo logs.
The possible failure scenarios include:
- If
primary database LGWR cannot write to online redo log, then LGWR and the
instance crash. Instance or crash recovery will recover to the last
committed transaction in the online redo log and roll back any uncommitted
transactions.
The current log will be
completed and archived.
- On
the standby, the partial standby redo log completes with the correct value
for the size to match the corresponding online redo log. If any redo
blocks are missing from the standby redo log, those are shipped over
(without reshipping the entire redo log).
- If
the primary database crashes resulting in an automatic or manual zero data
loss failover, then part of the Data Guard failover operation will do
"terminal recovery" and read and recover the current standby
redo log.
Once recovery finishes applying all of the redo in the
standby redo logs, the new primary database comes up and archives the newly
completed log group. All new and existing standby databases discard any redo in
the online redo logs, flashback to a consistent system change number (SCN), and
only apply the archives coming from the new primary database. Once again, the
Data Guard environment is in sync with the (new) primary database.
Assessing
Performance in a Synchronous Redo Transport Environment
When assessing performance in an Oracle Data Guard
synchronous redo transport environment (SYNC) it is important that you know how
the different wait events relate to each other.
The impact of enabling synchronous redo transport varies
between applications.
To understand why, consider the following description of
work the Log Writer Process (LGWR) performs when a commit is issued.
- Foreground
process posts LGWR for commit ("log file sync" starts). If there
are concurrent commit requests queued, LGWR will batch all outstanding
commit requests together resulting in a continuous strand of redo.
- LGWR
waits for CPU.
- LGWR
starts redo write ("redo write time" starts).
- For
Oracle RAC database, LGWR broadcasts the current write to other instances.
- After
preprocessing, if there is a SYNC standby, LGWR starts the remote write
(“SYNC remote write” starts).
- LGWR
issues local write ("log file parallel write").
- If
there is a SYNC standby, LGWR waits for the remote write to complete.
- After
checking the I/O status, LGWR ends "redo write time / SYNC remote
write".
- For
Oracle RAC database, LGWR waits for the broadcast ack.
- LGWR
updates the on-disk SCN.
- LGWR
posts the foregrounds.
- Foregrounds
wait for CPU.
- Foregrounds
ends "log file sync".
Use the
following approaches to assess performance.
- For
batch loads, the most important factor is to monitor the elapsed time,
because most of these processes must be completed in a fixed period of
time.
The database workloads for these
operations are very different than the normal OLTP workloads. For example, the
size of the writes can be significantly larger, so using log file sync averages
does not give you an accurate view or comparison.
- For
OLTP workloads, monitor the volume of transactions per second (from
Automatic Workload Repository (AWR)) and the redo rate (redo size per
second) from the AWR report.
This information gives you a
clear picture of the application throughput and how it is impacted by enabling
synchronous redo transport.
·
Why the Log File Sync
Wait Event is Misleading
Typically,
the "log file sync" wait event on the primary database is the first-place
administrators look when they want to assess the impact of enabling synchronous
redo transport (SYNC).
If the
average log file sync waits before enabling SYNC was 3ms, and after enabling
SYNC was 6ms, then the assumption is that SYNC impacted performance by one
hundred percent.
Oracle
does not recommend using log file sync wait times to measure the impact of SYNC
because the averages can be very deceiving, and the actual impact of SYNC on
response time and throughput may be much lower than the event indicates.
***When a user session commits, the Log Writer Process
(LGWR) will go through the process of getting on the CPU, submitting the I/O,
waiting for the I/O to complete, and then getting back on the CPU to post
foreground processes that the commit has completed. This whole time period is
covered by the log file sync wait event. While LGWR is performing its work
there are, in most cases, other sessions committing that must wait for LGWR to
finish before processing their commits. The size and number of sessions waiting
are determined by how many sessions an application has, and how frequently
those sessions commit. This batching up of commits is generally referred to
as application concurrency.
For
example, assume that it normally takes 0.5ms to perform log writes (log file
parallel write), 1ms to service commits (log file sync), and on average you are
servicing 100 sessions for each commit. If there was an anomaly in the storage
tier, and the log write I/O for one commit took 20ms to complete, then you
could have up to 2,000 sessions waiting on log file sync, while there would
only be 1 long wait attributed to log file parallel write. Having a large
number of sessions waiting on one long outlier can greatly skew the log file
sync averages.
The output from V$EVENT_HISTOGRAM for the log file
sync wait event for a particular period in time is shown in the following
table.
V$EVENT_HISTOGRAM
Output for the Log File Sync Wait Event
Milliseconds |
Number of Waits |
Percent of Total Waits |
1 |
17610 |
21.83% |
2 |
43670 |
54.14% |
4 |
8394 |
10.41% |
8 |
4072 |
5.05% |
16 |
4344 |
5.39% |
32 |
2109 |
2.61% |
64 |
460 |
0.57% |
128 |
6 |
0.01% |
The
output shows that 92% of the log file sync wait times are less than 8ms, with
the vast majority less than 4ms (86%). Waits over 8ms are outliers and only
make up 8% of wait times overall, but because of the number of sessions waiting
on those outliers (because of batching of commits) the averages get skewed. The
skewed averages are misleading when log file sync average waits times are used
as a metric for assessing the impact of SYNC!!
No comments:
Post a Comment