Database-Box: September 2024

💢Oracle Standby troubleshooting💢

Written by: Alireza Kamrani

In this post I represented an essential methods to evaluating , configuring and discovering issues and troubleshooting Network problems for transferring redo files and apply in Standby database. Also I mentioned some best practices to available needed resources and configuration for having a standby environment without any problematic issues.

Also in the next post I will explain a fully technical solution to investigate and tuning network issues for data guard.

Tune to Meet Data Guard Resource Requirements:

Redo transport can be impacted if:

Primary or standby database is completely CPU bound
Primary or standby database I/O system is saturated
Network topology can't support the redo generation rates

Evaluate whether the primary database system has:

Sufficient CPU utilization for Log Writer Process (LGWR) to post foregrounds efficiently
Sufficient I/O bandwidth so local log writes maintain low I/O latency during peak rates
Network interfaces that can handle peak redo rate volumes combined with any other network activity across the same interface
Automatic Workload Repository (AWR), Active Session History (ASH), and OSwatcher or Exawatcher data gathered from the primary database for tuning and troubleshooting

Evaluate whether the standby database system has:

Sufficient CPU utilization for the remote file server (RFS), the Oracle Data Guard process that receives redo at the standby database, to efficiently write to standby redo logs
Sufficient I/O bandwidth to enable local log writes to maintain low I/O latency during peak rates
A network interface that can receive the peak redo rate volumes combined with any other network activity across the same interface
AWR, ASH, and OSwatcher or Exawatcher data gathered from the standby database for tuning and troubleshooting

Note:The top issue encountered with the standby database is poor standby log write latency because of insufficient I/O bandwidth.

This problem can be mitigated by using Data Guard Fast Sync.

If system configuration is tuned and the above resource constraints are removed, evaluate if the transport lag (refer to Verify Transport Lag and Understand Redo Transport Configuration) is reducing to acceptable levels. If that's the case, you have met your goals.

Assessing and Optimizing Network Performance

Oracle Data Guard relies on the underlying network to send redo from the primary database to standby databases. Ensuring that the network is healthy and capable of supporting peak redo generation rates helps avoid future transport lags.

A transport lag forms when the primary database cannot ship redo to the standby faster than primary instance's redo generation rate.

A transport lag can lead to potential data loss if a primary database failure occurs.

Network assessment consists of evaluating

Network reliability
Network bandwidth to accommodate peak redo generation rates

Note: Each instance of the primary database instance generates its own redo and ships redo to the standby database in a single network stream.

Therefore, maximizing single process network throughput for each node is critical for redo transport.

Historically there are areas that can reduce network and redo transport throughput resulting in potential transport lags:

Network firewalls or network encryption:
Network firewalls and network (not Oracle Net) encryption can reduce overall throughput significantly. Verify throughput with the existing tool (will describe in next post - next week), with and without encryption, and tune accordingly.
At times reducing the encryption level can increase throughput significantly. A balance is required to meet security needs with your performance and data loss requirements.
Redo transport compression:
When database initialization parameter has LOG_ARCHIVE_DEST_N attribute COMPRESSION=ENABLE, Oracle background processes have to compress the redo before sending network message, and uncompress the redo before processing the redo. This reduces the overall redo and network throughput. Compression is only recommended if network bandwidth is insufficient between the primary and standby destinations.
Oracle Net encryption:
Depending on the Oracle Net encryption level, this will have varying redo throughput impact, because Oracle Net messages containing redo have to be encrypted before sending and then unencrypted before redo processing.
Note that if database encryption is already enabled with Transparent Data Encryption (TDE), redo is already encrypted, although Oracle Net encryption can also encrypt the message headers.
Untuned network for redo transport:

Increasing maximum operating system socket buffer size can increase single process throughput by 2-8 times. Test with different socket buffer sizes to see what value yields positive results, and ensure throughput is greater than the peak redo throughput.
Compare performance with various MTU settings.
If average redo write size is less than 1500 bytes, then try various MTU settings including MTU=9000 (for example, Jumbo Frames) for network interface that sends or receives redo on your system. This may reduce some unnecessary network round trips which will increase overall throughput.
Also note that for SYNC transport, Oracle's average redo write size (for example, Oracle message send) increases significantly as determined by v$sysstats or AWR statistics "redo size / redo writes".
When sending redo across geographical regions, experiments have shown that using MTU=9000 can also benefit in some network topologies. Conduct performance tests with existing tools and compare the results with default MTU and MTU=9000 settings.

Advanced Oracle Standby Troubleshooting:

Determining Network Time with Asynchronous Redo Transport:

Given enough resources, especially network bandwidth, asynchronous redo transport can maintain pace with very high workloads.

In cases where resources are constrained, asynchronous redo transport can begin to fall behind resulting in a growing transport lag on the standby database.

Asynchronous redo transport (ASYNC) transmits redo data asynchronously with respect to transaction commitment.

A transaction can commit without waiting for an acknowledgment that the redo generated by that transaction was successfully transmitted to a remote standby database.

With ASYNC, the primary database Log Writer Process (LGWR) continues to acknowledge commit success even if limited bandwidth prevents the redo of previous transactions from being sent to the standby database immediately (picture a sink filling with water faster than it can drain).

ASYNC uses a TT00 process to transmit redo directly from the log buffer of the primary database.

If the TT00 process is unable to keep pace, and the log buffer is recycled before the redo can be transmitted to the standby database, then the TT00 process automatically transitions to reading and sending from the online redo log file (ORL) on disk.

Once TT00 transmission has caught up with current redo generation, it automatically transitions back to reading and sending directly from the log buffer.

In cases in which there are two or more log switches before the TT00 has completed sending the original ORL, the TT00 will still transition back to reading the contents of the current online log file.

Any ORLs that were archived in between the original ORL and the current ORL are automatically transmitted using Oracle Data Guard’s redo gap resolution process.

Sufficient resources, such as network bandwidth, CPU, memory, and log file I/O on both the primary and standby databases are critical to the performance of an asynchronous Data Guard configuration.

To determine which resource is constraining asynchronous transport, use krsb stats which can be enabled by setting event 16421 on both the primary and standby databases:

SQL> alter session set events ‘16421 trace name context forever, level 3’;

This event is very lightweight and won't affect performance of the primary or standby database.

This dynamic event should be set on all primary and standby instances, and it will write statistics into the TT00 or remote file server (RFS) trace file when shipping for a given sequence has completed.

Looking in the trace file, you will see the krsb_end stats at the beginning and end of the file.

The stats at the end of the file will provide insight into where asynchronous shipping was spending time.

For example:(Content of Trace file)

krsb_end: Begin stats dump for T-1.S-593

max number of buffers in use 10

Operation elapsed time (micro seconds) 474051333

File transfer time (micro seconds) 474051326

Network Statistics

LOG_ARCHIVE_DEST_2 : OCI REQUEST

Total count : OCI REQUEST 2748

Total time : OCI REQUEST 81374

Average time : OCI REQUEST 29

LOG_ARCHIVE_DEST_2 : NETWORK SEND

Total count : NETWORK SEND 2748

Total time : NETWORK SEND 286554724

Average time : NETWORK SEND 104277

Total data buffers queued 9644

Total data buffers completed 9644

Total bytes written 9885272064

Total bytes completed synchronously 9885272064

Average network send size (blocks) 7025

Average network send buffers 3.51

Average buffer turnaround time 240889

Throughput (MB/s) 19.89

Total network layer time 286636098

Percentage of time in network 60.47

Disk Statistics

Total count : DISK READ 11531

Total time : DISK READ 12335132

Average time : DISK READ 1069

Read-ahead blocks 14151680

Log buffer blocks 266915

Disk stall blocks 4888576

Total count : BUFFER RELEASE 9643

Total time : BUFFER RELEASE 7229

Average time : BUFFER RELEASE 0

Total disk layer time 12342361

Percentage of time in disk layer 2.60

Data Guard Processing Statistics

Total count : SLEEP 198

Total time : SLEEP 172351312

Average time : SLEEP 870461

Total DG layer time 175072867

Percentage of time in DG layer 36.93

Remote Server-Side Network Statistics

LOG_ARCHIVE_DEST_2 : NETWORK GET

Total count : NETWORK GET 8242

Total bytes : NETWORK GET 9885272064

Total time : NETWORK GET 453233790

Average time : NETWORK GET 54990 Total server-side network layer time 453233790

Percentage of time in network 95.61

Remote Server-Side Disk Statistics

LOG_ARCHIVE_DEST_2 : DISK WRITE

Total count : DISK WRITE 9644

Total time : DISK WRITE 8731303

Average time : DISK WRITE 905

LOG_ARCHIVE_DEST_2 : DISK NOSTALL REAP

Total count : DISK NOSTALL REAP 9644

Total time : DISK NOSTALL REAP 579066

Average time : DISK NOSTALL REAP 60

LOG_ARCHIVE_DEST_2 : BUFFER GET

Total count : BUFFER GET 9644

Total time : BUFFER GET 3607

Average time : BUFFER GET 0

Total server-side disk layer time 9313976

Percentage of time in disk layer 1.96

Remote Server-Side Data Guard Processing Statistics LOG_ARCHIVE_DEST_2 : PUBLISH RTA BOUNDARY

Total count : PUBLISH RTA BOUNDARY 8948

Total time : PUBLISH RTA BOUNDARY 3665841

Average time : PUBLISH RTA BOUNDARY 409

LOG_ARCHIVE_DEST_2 : VALIDATE BUFFER

Total count : VALIDATE BUFFER 9644

Total time : VALIDATE BUFFER 1403088

Average time : VALIDATE BUFFER 145

Total Server-Side DG layer time 11503560

Percentage of time in DG layer 2.43

krsb_end: End stats dump

The above output comes from a test run where a transport lag is just beginning to occur.

You can observe a lag due to network congestion increase, and the time waiting on the network layer increases above 50%.

If a transport lag is the result of either compression or encryption, the percentage of time spent in the Data Guard layer would become the majority.

To disable krsb stats set event 16421 to level 1:

SQL>alter session set events ‘16421 trace name context forever, level 1’;

Gather Topology Information:

Understanding the topology of the Oracle Data Guard configuration, and its relevance to Data Guard performance, helps eliminate infrastructure weaknesses that are often incorrectly attributed to the Data Guard architecture.

Oracle recommends that you outline the following high-level architecture information.

Describe the primary and standby database system (number of nodes in Oracle RAC cluster, CPUs and memory per database node, storage I/O system)
Describe network topology connecting the primary and standby systems

Network switches and firewalls in between primary and standby
Network bandwidth and latency

For standby databases with symmetric hardware and configuration, and with a well tuned network configuration that can support peak redo generation rates, the transport lag should be less than 1 second.

Alireza Kamrani

Oracle ACE / Oracle Database Consultant.

Database-Box

Wednesday, September 25, 2024

Oracle Standby troubleshooting

Oracle Hot Block Concept, Detection and Resolving scenarios

Report Abuse

Labels