Tuesday, April 2, 2024

interconnect network and Cache Fusion in Oracle RAC

Enhancing the interconnect network and Cache Fusion in Oracle RAC: Exadata vs. Non-Exadata.

Visit my LinkedIn group to find more:

https://www.linkedin.com/groups/8151826


Dear Database-Box members, my name is Alireza Kamrani, and in this post you will get acquainted with the following concepts:


-An overview of the interconnect.

-A review of the main processes and concepts in RAC.

-A comparison between Non-Exadata and Exadata interconnects.

-LMS process enhancements.

-In-Memory Commit Cache.

-Cache Fusion enhancements.

-Minimizing index contention for right-growing indexes in RAC with Fast Index Split.

-Wait event name changes for the interconnect.

-Enhancing block transfers with Zero Copy Block Sends.

-The locking mechanism used to share blocks between nodes.

-The impact of RDMA on RAC performance.

-A performance comparison between RAC on Exadata and non-Exadata.

-Persistent Memory Commit Accelerator.


Inside RAC, important operations such as inter-node communication (language and protocol), the locking mechanism, and the handling of shared block requests take place over an isolated network called the interconnect. Many processes work together to keep data consistent when many concurrent sessions and nodes try to get blocks from other members. Sending and receiving internal messages over this network is therefore heavy, and managing this contention is a big challenge that must be controlled by a powerful service: the Global Cache Service (GCS), operating within Cache Fusion. Other important components are the Global Resource Directory (GRD) and the Global Enqueue Service (GES), which manage resource sharing.


Oracle RAC Architecture Overview

Oracle's Real Application Clusters (RAC) architecture harnesses the processing power of multiple interconnected computers (nodes) to access a shared Oracle database. It does so via the following components:

  • A cluster of Oracle instances on the multiple nodes.
  • A high-speed, high-bandwidth communication facility known as the cluster interconnect that connects those nodes.
  • A mechanism known as Cache Fusion, which allows the Oracle instances in the cluster to share database blocks between their caches.

[Figure: basic components of an Oracle RAC cluster, adapted from the Oracle Real Application Clusters documentation]


What is the LMS process in an Oracle RAC database?

The LMS process, the Lock Manager Server process, is also called the GCS (Global Cache Services) process. Its main job is to transport blocks across the nodes of an Oracle RAC cluster in support of Cache Fusion. When a consistent-read request arrives from another connected node, the LMS process builds a consistent-read image of the block and ships it across the nodes to satisfy the client's consistent-read request. These images are transported to and from remote nodes over the high-speed interconnect.


LMSx: Global Cache Service Processes

The LMSx are the processes that handle remote Global Cache Service (GCS) messages. Real Application Clusters software provides for up to 10 Global Cache Service Processes. The number of LMSx varies depending on the amount of messaging traffic among nodes in the cluster.
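As a quick check, you can see how many LMS processes each instance is actually running and the setting that governs the count; the sketch below only assumes you can query the GV$ views and uses the documented GCS_SERVER_PROCESSES parameter:

-- Count the LMS* background processes on each instance
SELECT inst_id, COUNT(*) AS lms_count
  FROM gv$process
 WHERE pname LIKE 'LMS%'
 GROUP BY inst_id;

-- Show the parameter that controls how many LMS processes are started
SELECT inst_id, value
  FROM gv$parameter
 WHERE name = 'gcs_server_processes';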


This process maintains the status of data files and of each cached block by recording information in the Global Resource Directory (GRD). It also controls the flow of messages to remote instances, manages global data block access, and transmits block images between the buffer caches of different instances. This processing is part of the Cache Fusion feature.


The LMSx handles the acquisition interrupt and blocking interrupt requests from the remote instances for Global Cache Service resources. For cross-instance consistent read requests, the LMSx will create a consistent read version of the block and send it to the requesting instance. The LMSx also controls the flow of messages to remote instances.


The LMSn processes handle the blocking interrupts from the remote instance for the Global Cache Service resources by:


  • Managing the resource requests and cross-instance call operations for the shared resources.
  • Building a list of invalid lock elements and validating the lock elements during recovery.
  • Handling global lock deadlock detection and monitoring for lock conversion timeouts.


Overview of Cache Fusion Processing

By default, a resource is allocated for each data block that resides in the cache of an instance. Due to Cache Fusion and the elimination of disk writes that occur when other instances request blocks for modifications, the performance overhead to manage shared data between instances is greatly diminished. Not only do Cache Fusion's concurrency controls greatly improve performance, but they also reduce the administrative effort for Real Application Clusters environments.

Cache Fusion addresses several types of concurrency as described under the following headings:

  • Concurrent Reads on Multiple Nodes
  • Concurrent Reads and Writes on Different Nodes
  • Concurrent Writes on Different Nodes


Concurrent Reads on Multiple Nodes

Concurrent reads on multiple nodes occur when two instances need to read the same data block. Real Application Clusters resolves this situation without synchronization because multiple instances can share data blocks for read access without cache coherency conflicts.


Concurrent Reads and Writes on Different Nodes

A read request from an instance for a block that was modified by another instance and not yet written to disk can be a request for either the current version of the block or for a read-consistent version. In either case, the Global Cache Service Processes (LMSn) transfer the block from the holding instance's cache to the requesting instance's cache over the interconnect.


Concurrent Writes on Different Nodes

Concurrent writes on different nodes occur when the same data block is modified frequently by different instances. In such cases, the holding instance completes its work on the data block after receiving a request for the block. The GCS then converts the resources on the block to be globally managed and the LMSn processes transfer a copy of the block to the cache of the requesting instance. The main features of this processing are:

  • The Global Cache Service (GCS) tracks each version of a data block, and each version is referred to as a past image (PI). In the event of a failure, Oracle can reconstruct the current version of a block by using the information in a PI.
  • The cache-to-cache data transfer is done through the high-speed IPC interconnect, thus eliminating disk I/O.
  • Cache Fusion limits the number of context switches because of the reduced sequence of round-trip messages. Reducing the number of context switches enables greater cache coherency protocol efficiency. The database writer (DBWn) processes are not involved in Cache Fusion block transfers (this traffic is visible through the "gc" wait events, as shown in the query below).
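Because all of this traffic surfaces as "gc …" wait events, a simple, illustrative way to gauge how much Cache Fusion work an instance is doing is to look at the top global cache waits since startup:

-- Top global cache wait events since instance startup (12c+ syntax)
SELECT event,
       total_waits,
       ROUND(time_waited_micro / 1000) AS time_waited_ms,
       ROUND(time_waited_micro / NULLIF(total_waits, 0)) AS avg_wait_us
  FROM v$system_event
 WHERE event LIKE 'gc %'
 ORDER BY time_waited_micro DESC
 FETCH FIRST 10 ROWS ONLY;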

Write Protocol and Past Image Tracking

When an instance requests a block for modification, the Global Cache Service Processes (LMSn) send the block from the instance that last modified it to the requesting instance. In addition, the LMSn process retains a PI of the block in the instance that originally held it.

Writes to disks are only triggered by cache replacements and during checkpoints. For example, consider a situation where an instance initiates a write of a data block and the block's resource has a global role. However, the instance only has the PI of the block and not the most current buffer. Under these circumstances, the instance informs the GCS and the GCS forwards the write request to the instance where the most recent version of the block is held. The holder then sends a completion message to the GCS. Finally, all other instances with PIs of the block delete them.
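Past images can be observed directly in the buffer cache: V$BH reports a status of 'pi' for past-image buffers. A minimal, illustrative check is:

-- Buffer cache breakdown by status on each instance:
-- 'xcur' = exclusive current, 'scur' = shared current,
-- 'cr' = consistent read copy, 'pi' = past image kept for Cache Fusion
SELECT inst_id, status, COUNT(*) AS buffers
  FROM gv$bh
 GROUP BY inst_id, status
 ORDER BY inst_id, buffers DESC;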


Oracle RAC Object Remastering

In RAC, every data block is mastered by an instance. Mastering a block simply means that the master instance keeps track of the state of the block until the next reconfiguration event. When one instance departs the cluster, the GRD portion of that instance needs to be redistributed to the surviving nodes. Similarly, when a new instance enters the cluster, the GRD portions of the existing instances must be redistributed to create the GRD portion of the new instance. This is called dynamic resource reconfiguration.

In addition to dynamic resource reconfiguration, individual resources can be moved to the instance that uses them most; this is called dynamic remastering. The basic idea is to master a buffer cache resource on the instance where it is accessed most. To determine whether dynamic remastering is necessary, the GCS keeps track of the number of GCS requests on a per-instance and per-object basis. If one instance, compared to the others, is heavily accessing blocks from the same object, the GCS can decide to dynamically migrate all of that object's resources to the instance that accesses the object most. The LMON, LMD, and LMS processes are responsible for dynamic remastering.


Remastering can be triggered as a result of:

    – Manual remastering

    – Resource affinity

    – Instance crash
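To see which instance currently masters a given object's resources, and to force remastering by hand, the commonly used interfaces are V$GCSPFMASTER_INFO and the oradebug lkdebug command; both are undocumented and release-dependent, the owner and object name below are placeholders, and this sketch should be verified on your own release and used only on test systems or under Oracle Support guidance:

-- Which instance masters this object's resources (columns vary by release)
SELECT data_object_id, current_master, previous_master, remaster_cnt
  FROM v$gcspfmaster_info
 WHERE data_object_id = (SELECT data_object_id
                           FROM dba_objects
                          WHERE owner = 'APP' AND object_name = 'ORDERS');

-- Manual remastering of that object to the local instance (SYSDBA only):
-- SQL> oradebug lkdebug -m pkey <data_object_id>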


Overview of the Interconnect

Traditionally, Oracle RAC messaging was implemented using the common socket-based networking model. In this model, all communications (sends and receives) go through the OS kernel, requiring context switches and memory copies between user space and the OS kernel for every RAC message exchanged. Exafusion is the next-generation networking protocol available on Exadata since 12c (on both RoCE and InfiniBand), which allows direct-to-wire messaging from user space, completely bypassing the OS kernel. By eliminating the context switches and OS kernel overhead, Exafusion enables Oracle to process round-trip messages in less than 50 µs (microseconds), which is 3x faster than a traditional socket-based implementation, and a further 33% improvement compared to the first generation of Exadata, which used the RDS protocol for messaging. Additionally, the CPU cost associated with sending and receiving messages is lower with Exafusion, allowing for higher block transfer throughput and more headroom in the LMS processes before they become saturated.

Faster messaging not only benefits runtime application performance, it also makes every Oracle RAC operation faster; this includes dynamic lock remastering (DRM), Oracle RAC reconfiguration (associated with instance or PDB membership changes), and instance recovery.


The adoption of Exafusion is the foundation of subsequent performance optimizations for RAC on Exadata, including zero copy transfers and adoption of RDMA. 

Exafusion and the subsequent optimizations described in this document do not require extra OS resources to operate. When Exafusion is enabled, you may notice that the IPC0 background process shows high RSS memory usage in "ps"; however, this is because the Oracle instance registers (pins) all IPC buffers with the Host Channel Adapter (HCA) on behalf of all processes running in the instance, and it does not indicate excessive memory usage or memory leaks. Further details can be found in MOS note 2407743.1.
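On Exadata, you can confirm whether Exafusion is in effect via the documented EXAFUSION_ENABLED initialization parameter (1 = enabled, 0 = disabled); for example:

-- Check the Exafusion setting on every instance
SELECT inst_id, name, value, isdefault
  FROM gv$parameter
 WHERE name = 'exafusion_enabled';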


Zero Copy Block Sends 

RoCE and InfiniBand network adapters support Zero Copy messaging. User space buffers are registered with the HCA and the HCA directly places the contents of user space buffers on the wire, unlike traditional messaging protocols where the OS kernel first makes a copy of the user space buffer and then places it on the wire. Since Oracle RAC 12c, we use this feature on Exadata for inter-instance communications. Elimination of the CPU cycles required for copying buffers improved transfer latencies by up to 5% compared to Exafusion without Zero Copy sends.


Smart Fusion Block Transfer 

Traditionally, an Oracle RAC instance would have to wait for the redo log flush to complete before sending a dirty block to another instance. This is a common access pattern in OLTP systems with frequent DMLs. The redo flush is done to ensure database consistency in the event of an instance failure. This means that the inter-instance transfer latency for frequently modified blocks with pending redo was always dependent on redo flush I/O latency, and was subject to outliers caused by intermittent spikes in I/O performance.

Oracle RAC 12c utilizes the Smart Fusion Block Transfer optimization, which allows an Oracle RAC instance to send the block once the redo I/O is in flight to the Exadata storage server. The Oracle RAC LMS process is permitted to initiate a block transfer before receiving the I/O completion acknowledgment, allowing sessions on the requestor instance to start accessing that block while the redo I/O may still be pending. The requestor instance checks for I/O completion before it commits further changes to the same block. The committing process has to wait on the "remote log force - commit" wait event if the I/O has not yet completed. This is a rare occurrence, only seen when there are extreme I/O outliers; such outliers are mostly eliminated on Exadata by the Smart Flash Logging feature. The Smart Fusion Block Transfer optimization allows for improved concurrency across Oracle RAC instances, improving overall application performance. It reduces "gc current block busy" wait times by 3x for workloads that update hot blocks concurrently.
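To gauge the effect on hot-block workloads, one illustrative check is to compare the busy-transfer waits with the (rare) remote log force waits; exact event names can vary slightly between releases:

-- Waits most relevant to Smart Fusion Block Transfer
SELECT event,
       total_waits,
       ROUND(time_waited_micro / NULLIF(total_waits, 0)) AS avg_wait_us
  FROM v$system_event
 WHERE event IN ('gc current block busy',
                 'remote log force - commit');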


Undo Block RDMA Reads 

Undo blocks need to be fetched from other Oracle RAC instances when, for example, transactions are rolled back. In Oracle RAC 18c, undo block transfers have been optimized to use an RDMA-based transfer protocol, replacing the traditional messaging-based protocol. By leveraging RDMA, foreground processes are able to read the undo blocks directly from the remote instance's SGA. Undo block reads no longer invoke processes on the remote instance, removing the server-side CPU and context switch overheads that were always part of traditional Oracle RAC communications. Additionally, transfer latencies are no longer affected by OS process or overall system CPU load on the remote instance, which helps sustain deterministic read latencies even in the case of a load spike on the remote instance. An RDMA read of a remote block typically completes in less than 10 µs, which is a 5x improvement over the best latencies obtainable with the traditional message-based protocol using Exafusion.


In-Memory Commit Cache 

Applications that have long-running batch jobs and concurrent queries may exhibit high volumes of "undo header" CR block transfers. In Oracle 18c, an in-memory commit cache has been added on Exadata. Each instance maintains a cache of local transactions and their respective states (committed or not) in the SGA, and the cache can be looked up remotely. This is faster than transferring the undo header blocks, each sized 8 KB, to the remote instance. The state of multiple transaction IDs (XIDs) can be looked up in a single message, which helps reduce the number of round-trip messages in Oracle RAC, as well as the CPU overhead in the LMS processes, which are responsible for responding to remote lookup requests. With the in-memory commit cache, up to 30 XID lookups can be batched into a single round-trip message, which would have required 30 separate 8 KB block transfers prior to this optimization.

With the commit cache optimization, we can expect many of the "gc cr block 2-way" waits corresponding to "undo header" transfers to be replaced with a smaller number of "gc transaction table 2-way" waits. A single "gc transaction table 2-way" wait represents a remote lookup of multiple XIDs in one round trip.
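Because the event name differs between releases (see the note further below), a quick way to check which transaction-table waits exist on your release is:

-- Transaction-table wait events defined on this release
SELECT name, wait_class
  FROM v$event_name
 WHERE name LIKE 'gc transaction table%';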


Fast Index Split 

When a B-tree index leaf block splits (frequently seen in OLTP workloads with right-growing indexes), applications accessing the splitting leaf and branch blocks on all Oracle RAC instances have to wait for the split operation to complete. This may cause intermittent hiccups (periods of almost zero activity) in application performance. Traditionally, these waits were implemented under a TX enqueue ("enq: TX - index contention" waits). These split waits have been optimized on Exadata in Oracle 19c to use a less expensive Cache Fusion based mechanism in lieu of global enqueues. The fast index split waits appear under the new "gc index operation" wait event ("index split completion" in 21c onwards), which replaces the traditional TX enqueue waits.
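One illustrative way to see whether index-split waits have moved from the TX enqueue to the Cache Fusion mechanism is to sample ASH for the old and new event names (querying GV$ACTIVE_SESSION_HISTORY requires the Diagnostics Pack license):

-- Recent session samples spent on index-split related waits
SELECT inst_id, event, COUNT(*) AS samples
  FROM gv$active_session_history
 WHERE event IN ('enq: TX - index contention',
                 'gc index operation',
                 'index split completion')
 GROUP BY inst_id, event
 ORDER BY samples DESC;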

NOTE: The “gc transaction table 2-way” wait is used in releases starting with Oracle 21c. Earlier releases (Oracle 18c and 19c) would use the “gc transaction table” wait event instead. 


Persistent Memory Commit Accelerator 

Exadata X8M introduces the Persistent Memory Commit Accelerator, which implements redo log I/O with RDMA writes to persistent memory on the storage servers. This optimization significantly improves redo flush I/O performance, which further improves inter-instance concurrency on systems experiencing high volumes of dirty buffer sharing (see Smart Fusion Block Transfer).


Shared Data Block and Undo Header RDMA Reads 

In Oracle 21c, RDMA support for Cache Fusion has been extended to support reads of data blocks, space blocks, and undo header blocks. Similar to the Undo Block RDMA Read optimization in 18c, this contributes to faster reads of data cached in remote instances, and to a further reduction in LMS CPU, since LMS is not invoked when data is read via RDMA. Traditionally, a foreground process would send a block read request to the master instance, the master instance would forward the request to the holder instance, and the request would be fulfilled by a 3-way Cache Fusion transfer ("gc current block 3-way"). This is a common access pattern in read-intensive OLTP workloads running on large clusters of 3+ nodes. In large clusters, the size of each instance is typically small, which means it is less likely that data is cached on the local instance, but the chances are higher that it is cached on another instance. With data and space block RDMA, the master instance responds to the requestor with a lock grant (permission to read the data), along with information about the holder instance for the requested block. The requesting client can then RDMA-read the block directly from the holder instance. This removes the master-holder messaging, which helps improve read latency and reduces LMS CPU on the holder instance (which traditionally had to send the block back to the requestor).


In this case, the foreground will see the following sequence of wait events instead of the traditional “gc current block 3-way” wait: 

  • “gc current grant 2-way” wait, followed by, 
  • A short “gc current block direct read” wait event 

The "gc current block direct read" waits are typically less than 10 µs, and the combined wait time for the grant and read is usually shorter than the traditional 3-way transfer latency.

If the requestor is also the master instance, the "gc current grant 2-way" in the example above can be eliminated, because the instance can grant itself permission to read data without involving any messaging. In this case, the request can be quickly fulfilled by a single "gc current block direct read". This replaces some "gc current block 2-way" waits that were traditionally seen in Oracle RAC, including on 2-node clusters.

Additionally, if a remote master instance is also the holder instance, LMS responds with a grant message, and the requestor then RDMA-reads the data from the holder (which is also the master). This is similar to the 3-way scenario described above, except that the master and holder instances are the same. In this case, the traditional "gc current block 2-way" waits are replaced by a "gc current grant 2-way" and a "gc current block direct read". While read latencies will not improve much in this case, the cost for LMS to grant a lock is lower than the cost of sending back a data block, so the RDMA optimization helps reduce LMS CPU usage.
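The shift toward grants plus direct reads can be observed in the average wait times; a rough, illustrative comparison of the new and traditional events is:

-- Average latencies of the grant/direct-read pair versus
-- the traditional 2-way and 3-way block transfers
SELECT event,
       total_waits,
       ROUND(time_waited_micro / NULLIF(total_waits, 0)) AS avg_wait_us
  FROM v$system_event
 WHERE event IN ('gc current grant 2-way',
                 'gc current block direct read',
                 'gc current block 2-way',
                 'gc current block 3-way');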


Broadcast-on-Commit over RDMA 

Before committing a transaction, the Broadcast-on-Commit protocol ensures that the system change number (SCN) on all the instances in a cluster is at least as high as the commit SCN. This is required to ensure the Consistent Read (CR) property of Oracle transactions. Traditionally, the Broadcast-on-Commit protocol used messages to broadcast the SCN to all the instances in a cluster. The LGWR process sends the SCN in a message to the LMS process on all instances. The LMS process, upon receiving an SCN message, updates its instance's SCN and sends an SCN ACK message back to the LMS process on the initiating instance. Once the redo I/O completes, LGWR checks whether the redo SCN has been acknowledged by all instances. If so, LGWR notifies the foreground processes waiting for the transaction that the commit operation has completed. If the redo SCN has not been acknowledged by the time the redo I/O completes, the commit will not complete until all SCN ACKs have been received. Clients will see high "log file sync" wait times in this case.

In Oracle 21c, Broadcast-on-Commit has been optimized to use RDMA for the following reasons: 


  • RDMA latency is lower than messaging: 

As I/O latency improves on Exadata, broadcasting SCNs using messaging could potentially become a bottleneck.

  • Reducing load on LMS processes: 

Running OLTP applications, we see that SCN messages account for a measurable portion of messaging traffic, especially on clusters with a large number of instances. Although these messages are rarely on the critical path latency-wise (because the actual I/O would typically take longer), reducing them has the benefit of reducing LMS load, giving more headroom so that the system can better tolerate load spikes.


For example, running a large CRM (OLTP) workload on a 3 instance cluster, we saw that 12% of overall RAC messages were for SCN broadcasts. With RDMA, these messages will no longer invoke the LMS process. 


In the Broadcast-on-Commit over RDMA mode, the LGWR process directly updates the SCN on each remote instance in the cluster using remote atomic operations. This makes the commit protocol faster, as it is not affected by the remote LMS process's context switch latency or the CPU load on the remote instances.


I hope this article has been of interest to you.


Regards,

Alireza Kamrani

Senior RDBMS Consultant.


Ar.kamrani9@gmail.com
