Alireza Kamrani
04/16/2025
Advanced Troubleshooting: Determining Network Time with Asynchronous Redo Transport
Given enough resources, especially network bandwidth, asynchronous redo transport can keep pace with very high workloads. When resources are constrained, asynchronous redo transport can begin to fall behind, resulting in a growing transport lag on the standby database.
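Before digging into trace-level statistics, the current size of a transport lag can be checked directly on the standby database. The query below is a minimal sketch using the standard V$DATAGUARD_STATS view; the values are reported as day-to-second interval strings.

```sql
-- Run on the standby database.
-- VALUE is an interval string, e.g. +00 00:01:30 for a 90-second lag.
SELECT name, value, time_computed
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag');
```

A steadily increasing transport lag value over successive samples indicates that redo is being generated faster than it can be shipped.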
Asynchronous redo transport (ASYNC) transmits redo data asynchronously with respect to transaction commitment. A transaction can commit without waiting for an acknowledgment that the redo generated by that transaction was successfully transmitted to a remote standby database.
With ASYNC, the primary database Log Writer Process (LGWR) continues to acknowledge commit success even if limited bandwidth prevents the redo of previous transactions from being sent to the standby database immediately (picture a sink filling with water faster than it can drain).
ASYNC uses a TT00 process to transmit redo directly from the log buffer of the primary database. If the TT00 process is unable to keep pace, and the log buffer is recycled before the redo can be transmitted to the standby database, then the TT00 process automatically transitions to reading and sending from the online redo log file (ORL) on disk.
Once TT00 transmission has caught up with current redo generation, it automatically transitions back to reading and sending directly from the log buffer.
If two or more log switches occur before the TT00 process has finished sending the original ORL, the TT00 process still transitions back to reading the contents of the current online redo log file. Any ORLs that were archived between the original ORL and the current ORL are automatically transmitted using Oracle Data Guard’s redo gap resolution process.
Sufficient resources, such as network bandwidth, CPU, memory, and log file I/O on both the primary and standby databases are critical to the performance of an asynchronous Data Guard configuration.
To determine which resource is constraining asynchronous transport, use krsb stats, which can be enabled by setting event 16421 on both the primary and standby databases:
alter session set events '16421 trace name context forever, level 3';
This event is very lightweight and does not measurably affect the performance of the primary or standby database.
This dynamic event should be set on all primary and standby instances. It writes statistics into the TT00 or remote file server (RFS) trace file when shipping for a given sequence has completed. In the trace file you will see krsb_end stats at both the beginning and the end of each dump; the stats at the end provide insight into where asynchronous shipping was spending time. For example:
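The trace files are written to the instance's diagnostic trace directory, which can be located with a quick query. (The file name patterns shown are an assumption based on the usual &lt;sid&gt;_&lt;process&gt;_&lt;ospid&gt;.trc convention; verify against your environment.)

```sql
-- Find the trace directory on each instance
SELECT value FROM v$diag_info WHERE name = 'Diag Trace';

-- The TT00 trace file on the primary is typically named like
--   <ORACLE_SID>_tt00_<ospid>.trc
-- and the RFS trace file on the standby like
--   <ORACLE_SID>_rfs_<ospid>.trc
```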
krsb_end: Begin stats dump for T-1.S-593
max number of buffers in use 10
Operation elapsed time (micro seconds) 474051333
File transfer time (micro seconds) 474051326
Network Statistics
LOG_ARCHIVE_DEST_2 : OCI REQUEST
Total count : OCI REQUEST 2748
Total time : OCI REQUEST 81374
Average time : OCI REQUEST 29
LOG_ARCHIVE_DEST_2 : NETWORK SEND
Total count : NETWORK SEND 2748
Total time : NETWORK SEND 286554724
Average time : NETWORK SEND 104277
Total data buffers queued 9644
Total data buffers completed 9644
Total bytes written 9885272064
Total bytes completed synchronously 9885272064
Average network send size (blocks) 7025
Average network send buffers 3.51
Average buffer turnaround time 240889
Throughput (MB/s) 19.89
Total network layer time 286636098
Percentage of time in network 60.47
Disk Statistics
Total count : DISK READ 11531
Total time : DISK READ 12335132
Average time : DISK READ 1069
Read-ahead blocks 14151680
Log buffer blocks 266915
Disk stall blocks 4888576
Total count : BUFFER RELEASE 9643
Total time : BUFFER RELEASE 7229
Average time : BUFFER RELEASE 0
Total disk layer time 12342361
Percentage of time in disk layer 2.60
Data Guard Processing Statistics
Total count : SLEEP 198
Total time : SLEEP 172351312
Average time : SLEEP 870461
Total DG layer time 175072867
Percentage of time in DG layer 36.93
Remote Server-Side Network Statistics
LOG_ARCHIVE_DEST_2 : NETWORK GET
Total count : NETWORK GET 8242
Total bytes : NETWORK GET 9885272064
Total time : NETWORK GET 453233790
Average time : NETWORK GET 54990
Total server-side network layer time 453233790
Percentage of time in network 95.61
Remote Server-Side Disk Statistics
LOG_ARCHIVE_DEST_2 : DISK WRITE
Total count : DISK WRITE 9644
Total time : DISK WRITE 8731303
Average time : DISK WRITE 905
LOG_ARCHIVE_DEST_2 : DISK NOSTALL REAP
Total count : DISK NOSTALL REAP 9644
Total time : DISK NOSTALL REAP 579066
Average time : DISK NOSTALL REAP 60
LOG_ARCHIVE_DEST_2 : BUFFER GET
Total count : BUFFER GET 9644
Total time : BUFFER GET 3607
Average time : BUFFER GET 0
Total server-side disk layer time 9313976
Percentage of time in disk layer 1.96
Remote Server-Side Data Guard Processing Statistics
LOG_ARCHIVE_DEST_2 : PUBLISH RTA BOUNDARY
Total count : PUBLISH RTA BOUNDARY 8948
Total time : PUBLISH RTA BOUNDARY 3665841
Average time : PUBLISH RTA BOUNDARY 409
LOG_ARCHIVE_DEST_2 : VALIDATE BUFFER
Total count : VALIDATE BUFFER 9644
Total time : VALIDATE BUFFER 1403088
Average time : VALIDATE BUFFER 145
Total Server-Side DG layer time 11503560
Percentage of time in DG layer 2.43
krsb_end: End stats dump
The above output comes from a test run where a transport lag is just beginning to occur.
As the lag grows due to network congestion, the percentage of time spent waiting on the network layer rises above 50%. If a transport lag were instead the result of compression or encryption, the majority of the time would be spent in the Data Guard layer.
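The reported percentages are simply each layer's total time divided by the operation elapsed time, so on the sending side they should sum to roughly 100%. As an illustration, the figures from the dump above can be cross-checked with a quick query (the numbers are copied from the sample output; DUAL is used only for the arithmetic):

```sql
SELECT ROUND(286636098 / 474051326 * 100, 2) AS pct_network,          -- 60.47
       ROUND(12342361  / 474051326 * 100, 2) AS pct_disk,             -- 2.60
       ROUND(175072867 / 474051326 * 100, 2) AS pct_dg,               -- 36.93
       ROUND(9885272064 / 474051326 * 1e6 / 1048576, 2) AS mb_per_sec -- 19.89
FROM dual;
```

Here 60.47 + 2.60 + 36.93 = 100.00, and the throughput figure (total bytes written divided by file transfer time) matches the 19.89 MB/s reported in the dump.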
To disable krsb stats, set event 16421 to level 1:
alter session set events '16421 trace name context forever, level 1';