Thursday, February 27, 2025

Oracle SGA vs PGA Usages and Concepts

Why Oracle Uses PGA Instead of SGA for Large Table Scans and How to Optimize It


Alireza Kamrani
02/27/2025

When selecting from large tables, Oracle might use PGA instead of SGA, depending on memory settings, query complexity, and table size. Understanding how full table scans, SGA caching, and PGA memory allocation interact can help optimize performance.

Direct path read:
The session is waiting for a direct read to complete. A direct read is a physical I/O from a data file that bypasses the buffer cache and reads the data block directly into process-private memory.

If asynchronous I/O is supported (and in use), then Oracle can submit I/O requests and continue processing. Oracle can then pick up the results of the I/O request later and wait on "direct path read" until the required I/O completes.

If asynchronous I/O is not being used, then the I/O requests block until completed but these do not show as waits at the time the I/O is issued. The session returns later to pick up the completed I/O data but can then show a wait on "direct path read" even though this wait will return immediately.

Hence this wait event is very misleading because:

• The total number of waits does not reflect the number of I/O requests

• The total time spent in "direct path read" does not always reflect the true wait time.

This style of read request is typically used for:

• Sort I/O (when a sort does not fit in memory)

• Parallel Query slaves

• Read ahead (where a process may issue an I/O request for a block it expects to need in the near future)
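To gauge how much a system waits on this event (keeping the caveats above in mind), you can query the cumulative wait statistics. This is a sketch assuming access to the V$SYSTEM_EVENT view:

```sql
-- Cumulative waits for direct path reads since instance startup.
-- Remember: under asynchronous I/O these counts understate the real I/O time.
SELECT event,
       total_waits,
       ROUND(time_waited_micro / 1e6, 1) AS seconds_waited
FROM   v$system_event
WHERE  event LIKE 'direct path read%';
```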


Why Oracle Uses PGA Instead of SGA in Large Table Queries

1. Full Table Scans and Direct Path Reads (Bypassing SGA)

• When performing a full table scan on a very large table, Oracle may choose direct path reads, bypassing the SGA (Buffer Cache) and reading data directly into PGA.

• This happens when:

• The table is too large to fit into DB_CACHE_SIZE.

• The query requests a parallel execution plan.

• The system is under memory pressure, and Oracle determines that using PGA is more efficient than filling up SGA with temporary data.

• Result: Data is not cached in SGA, meaning future queries on the same table will not benefit from cache reuse, leading to repeated disk I/O.

2. Impact of DB_BIG_TABLE_CACHE_PERCENT_TARGET

Oracle introduced the DB_BIG_TABLE_CACHE_PERCENT_TARGET parameter to control large table caching in SGA.

• By default, large table scans use direct path reads (bypassing the SGA).

• If DB_BIG_TABLE_CACHE_PERCENT_TARGET is set, Oracle will attempt to cache full table scan results in a special area of the buffer cache, reducing repeated disk I/O.

Example Configuration:

ALTER SYSTEM SET DB_BIG_TABLE_CACHE_PERCENT_TARGET = 30;

• Effect: Reserves 30% of the buffer cache for full table scans, reducing reliance on direct path reads and keeping frequently scanned large tables in memory.
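Once the parameter is set, the big table cache can be observed through the V$BT_SCAN_CACHE and V$BT_SCAN_OBJ_TEMPS views (a sketch; exact columns may vary by release):

```sql
-- Overall state of the big table cache
SELECT bt_cache_target, object_count, memory_buf_alloc
FROM   v$bt_scan_cache;

-- Per-object "temperature" (how hot each cached big table is)
SELECT dataobj#, size_in_blks, temperature
FROM   v$bt_scan_obj_temps;
```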


3. When PGA is Used More Than SGA

Oracle moves sorting and temporary operations to PGA instead of SGA in the following cases:

(a) Sort and Hash Joins on Large Datasets

• If a query involves sorting, aggregations, or hash joins, Oracle may store intermediate results in PGA rather than SGA.

• If the PGA_AGGREGATE_TARGET is too small, Oracle will spill temporary data to TEMP tablespace, slowing down performance.


Optimization:

ALTER SYSTEM SET PGA_AGGREGATE_TARGET = 4G;  -- adjust based on your workload
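To judge whether the current PGA_AGGREGATE_TARGET is adequate, check how many work areas completed in memory (optimal) versus spilling to TEMP (one-pass or multi-pass). A sketch using V$SQL_WORKAREA_HISTOGRAM:

```sql
-- Work areas that spilled to TEMP show up as one-pass or multi-pass executions
SELECT low_optimal_size / 1024        AS low_kb,
       (high_optimal_size + 1) / 1024 AS high_kb,
       optimal_executions,
       onepass_executions,
       multipasses_executions
FROM   v$sql_workarea_histogram
WHERE  total_executions > 0;
```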

(b) Large Full Table Scans Without Proper Caching

• If DB_BIG_TABLE_CACHE_PERCENT_TARGET is not set, Oracle bypasses the buffer cache for large tables, forcing data to be processed in PGA or TEMP tablespace.

• This is inefficient for frequently accessed large tables.

Solution: Enable Large Table Caching

ALTER SYSTEM SET DB_BIG_TABLE_CACHE_PERCENT_TARGET = 25;

• Now, Oracle will try to keep large table scans in memory, avoiding repeated disk reads.

4. Should You Keep Large Tables in SGA or Let Oracle Use PGA?

• Frequently scanned large tables

Recommendation: Enable DB_BIG_TABLE_CACHE_PERCENT_TARGET.

• One-time full table scans (ad hoc queries)

Recommendation: Allow PGA direct path reads.

• Large table joins with sorting/hashing

Recommendation: Increase PGA_AGGREGATE_TARGET.

• System has low memory

Recommendation: Use direct path reads to reduce SGA pressure.

• Keeping large tables in SGA (DB_BIG_TABLE_CACHE_PERCENT_TARGET) is beneficial only if they are frequently accessed.

• For one-time scans or infrequent queries, direct path reads using PGA are better, as they prevent SGA pollution (filling cache with unnecessary data).

5. Monitoring Full Table Scan Behavior

To check if Oracle is using direct path reads (PGA) instead of SGA:

SELECT name, value
FROM   v$sysstat
WHERE  name IN ('table scans (direct read)',
                'table scans (long tables)',
                'physical reads direct');

• High "physical reads direct" → Indicates full table scans are bypassing the buffer cache (using PGA).

• High "table scans (long tables)" → Oracle is scanning large tables frequently.

If direct path reads are too high and performance is suffering, consider increasing DB_BIG_TABLE_CACHE_PERCENT_TARGET.

Conclusion

• Oracle uses PGA instead of SGA for large table scans when full table scans are performed via direct path reads.
• Setting DB_BIG_TABLE_CACHE_PERCENT_TARGET helps keep frequently accessed large tables in memory to avoid repeated I/O.
• For queries involving sorting, hashing, or aggregations, PGA tuning is critical (PGA_AGGREGATE_TARGET should be increased to prevent excessive disk usage).
• Use SGA for frequently accessed tables and PGA for complex sorting and temporary operations.

Alireza Kamrani, Database Consultant/ACE

Tuesday, February 18, 2025

An Overview of Oracle database buffer cache behavior and keep blocks into memory solutions



Alireza Kamrani

18 Feb 2025

We can run commands to cache a table's blocks in the Oracle buffer cache, but is there any guarantee that all blocks of that table will stay in the buffer cache at all times?

In Oracle, if you want a table to always be cached in memory to improve performance, you can use the KEEP buffer pool or the Result Cache. However, the CACHE clause of the ALTER TABLE command is primarily a hint for keeping table blocks in the default buffer cache.

Options to Keep a Table in Memory:

1. Using the KEEP Buffer Pool

Oracle allows you to keep frequently accessed tables in the KEEP buffer pool to avoid them being aged out of memory.

• Enable the KEEP pool in your SGA:

ALTER SYSTEM SET DB_KEEP_CACHE_SIZE = nG;

n: adjust it based on your available memory.

• Assign the table to the KEEP pool:

ALTER TABLE my_table STORAGE (BUFFER_POOL KEEP);

Or, alternatively:

ALTER TABLE my_table CACHE;

2. Using the CACHE Clause

You can mark a table so that blocks read by full table scans are placed at the most recently used end of the buffer cache LRU list, instead of being aged out quickly:

ALTER TABLE my_table CACHE;

• This helps in keeping the table’s data in the buffer cache for longer.

• Useful for small lookup tables that are frequently accessed.

3. Using the Result Cache (For Query Caching)

If the table is frequently queried and doesn't change often, you can enable query result caching:

ALTER TABLE my_table RESULT_CACHE (MODE FORCE);

• This keeps the query results in the Result Cache rather than just buffering the table blocks.


Which One Should You Use?

• Use KEEP Buffer Pool for frequently accessed small tables that should always stay in memory.

• Use the CACHE clause for tables where you want their data blocks to stay in the buffer cache longer.

• Use RESULT CACHE if the queries on the table return the same results frequently and the table doesn't change often.
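To verify which buffer pool a segment is currently assigned to, the data dictionary can be queried (a sketch; MY_TABLE is a placeholder):

```sql
SELECT segment_name, segment_type, buffer_pool
FROM   user_segments
WHERE  segment_name = 'MY_TABLE';
```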


Is there any solution that guarantees keeping all of a table's blocks resident in memory?

No. Oracle does not force a full cache or guarantee that all blocks of a table (or of the whole database) will remain in memory.
After any of the caching options above, Oracle does not load the blocks immediately; the command simply completes.
These options only raise the priority of the table's blocks in the cache. There is no guarantee: the actual behavior depends heavily on memory pressure and workload.

While Oracle uses a sophisticated caching mechanism, the database will not guarantee that all blocks will stay in memory at all times. Oracle uses the System Global Area (SGA) and the Buffer Cache to cache frequently accessed data blocks, but the caching behavior is influenced by various factors, such as:

• Memory Availability: The SGA has a limited amount of memory, and the database will keep as many blocks in memory as possible, but once it exceeds its capacity, older or less frequently used blocks will be flushed to disk.

• Database Cache Management: Oracle dynamically manages the buffer cache to balance memory between different cache types (like the Database Buffer Cache, Shared Pool, etc.), based on usage patterns.

• LRU (Least Recently Used) Algorithm: Blocks that are not frequently accessed are more likely to be replaced by other blocks. This means that not all blocks are guaranteed to remain in memory.


• Memory Parameters: Settings like DB_CACHE_SIZE and PGA_AGGREGATE_TARGET determine how much memory is allocated to caching. However, the actual memory used can fluctuate based on system activity and load.

• Large Tables or Objects: Very large objects or tables might not fit entirely in memory, and Oracle will load and evict blocks based on access patterns.

In summary, Oracle will try to keep as much data in memory as it can based on available resources and usage patterns, but it does not guarantee that all blocks will remain cached in memory all the time.
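One way to see how much of a table actually sits in the buffer cache at a given moment is to count its buffers in V$BH (a sketch; requires access to the V$ views, and MY_TABLE is a placeholder):

```sql
SELECT o.object_name, COUNT(*) AS cached_blocks
FROM   v$bh b
JOIN   dba_objects o ON o.data_object_id = b.objd
WHERE  o.object_name = 'MY_TABLE'
GROUP  BY o.object_name;
```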


What is usage of Force Full Caching mechanism?

If you are using the ALTER DATABASE FORCE FULL DATABASE CACHING command, the behavior changes slightly.

When you run this command, Oracle will indeed attempt to keep all data blocks in memory. Specifically:

• Force Full Database Caching: This command forces Oracle to cache all database blocks for tables and indexes into memory (specifically into the Database Buffer Cache). This means that Oracle will attempt to keep all blocks that are accessed in memory, regardless of the usual cache replacement mechanisms.

However, there are still some important nuances to understand:

• Memory Constraints: While Oracle will attempt to cache all blocks, it doesn't mean the entire database will fit into memory if the available memory is not enough. If the system doesn't have enough memory to store every block, the database will still perform evictions based on available resources. This command doesn't increase physical memory; it just prioritizes caching.

• Increased Memory Usage: This command will increase memory consumption because the cache size will grow, and more memory will be allocated to caching the blocks. You need to have sufficient memory for this to be effective.

• Specific to Tables and Indexes: This command will affect all user tables and indexes. It does not automatically apply to other structures like the redo log buffer or shared pool, which will still be managed according to normal memory rules.


• System Behavior: Even with the FORCE FULL DATABASE CACHING command, blocks are still subject to eviction when there are memory pressures, but the priority of eviction is adjusted. Oracle will avoid evicting blocks from the cache unless absolutely necessary.

In summary:

Yes, using ALTER DATABASE FORCE FULL DATABASE CACHING will attempt to keep all database blocks in memory, but it's still dependent on available system memory. It doesn't guarantee that all blocks will always stay in memory if there are memory limitations or if the system faces memory pressure, but Oracle will prioritize caching them more aggressively compared to the default behavior.
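You can confirm whether the mode is active from V$DATABASE; note that the ALTER DATABASE FORCE FULL DATABASE CACHING command itself must be issued while the database is mounted but not open:

```sql
-- YES means force full database caching is enabled
SELECT force_full_db_caching FROM v$database;
```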


Alireza Kamrani.

Friday, February 14, 2025

Tuning Oracle Database Using some changes in internal layers of OS

Modifying the Default CFS Scheduler in Oracle Linux


Alireza Kamrani
14- Feb- 2025

In the previous post, I reviewed the scheduling capabilities available in Oracle Linux. In this post I will make changes in this area in order to tune both the Oracle database and the operating system itself.

Note:
♨️ Nothing in this post is a step-by-step recipe for increasing the performance of your production server. Use it to learn what can be changed in the different layers of the operating system, and only customize and apply selected observations to your own environment after testing.

Normally you won't need these changes, but they can deepen your understanding of how to apply tuning at different layers, and in some cases you may improve performance by customizing some of the scripts.

Linux Scheduling Mechanisms:

Oracle Linux, like most modern Linux distributions, uses the Completely Fair Scheduler (CFS) as the default CPU scheduler.
You can tune and modify CFS behavior using kernel parameters and system utilities.

1. Check the Current Scheduler Configuration

Before making changes, check the existing CFS settings:

cat /proc/sys/kernel/sched_latency_ns
cat /proc/sys/kernel/sched_min_granularity_ns
cat /proc/sys/kernel/sched_wakeup_granularity_ns

These parameters control how the scheduler handles task latencies and fairness.

2. Modify CFS Scheduling Parameters

a) Change CFS Latency and Granularity

The default CFS behavior can be adjusted using sysctl or directly modifying /proc/sys/kernel values.

Example: Reduce Latency for Faster Response

echo 5000000 | sudo tee /proc/sys/kernel/sched_latency_ns

# Default: 24ms, change to 5ms

echo 1000000 | sudo tee /proc/sys/kernel/sched_min_granularity_ns

# Default: 6ms, change to 1ms

echo 2000000 | sudo tee /proc/sys/kernel/sched_wakeup_granularity_ns

# Default: 4ms, change to 2ms

What This Does:

• Reduces task switch latency, making the system more responsive.

• Increases CPU time slices for interactive processes.

Persistent Configuration (After Reboot)

To make these changes permanent, add them to /etc/sysctl.conf:

echo "kernel.sched_latency_ns = 5000000" | sudo tee -a /etc/sysctl.conf

echo "kernel.sched_min_granularity_ns = 1000000" | sudo tee -a /etc/sysctl.conf

echo "kernel.sched_wakeup_granularity_ns = 2000000" | sudo tee -a /etc/sysctl.conf

Then apply the changes:

sudo sysctl -p


3. Set CPU Affinity for CFS Tasks

CFS allows binding tasks to specific CPU cores for better performance. Use taskset to manually set affinity:

taskset -c 0,1 ./my_app
# Binds process to CPU cores 0 and 1

To apply the constraint to a group of processes, use a cpuset cgroup (cgroup v1 layout shown; create the group before writing to it):

sudo mkdir /sys/fs/cgroup/cpuset/my_group
echo "0-1" | sudo tee /sys/fs/cgroup/cpuset/my_group/cpuset.cpus

4. Modify Per-Process CFS Nice Levels

CFS uses "nice values" (priority levels from -20 to 19) to determine CPU allocation. Lower values give higher priority.

nice -n -10 ./my_task
# Run task with higher priority

renice -n -5 -p 1234
# Change priority of running process 1234

5. Use CGroups to Control CPU Allocation

You can limit CPU usage for specific processes using cgroups:

Create a CGroup

sudo mkdir /sys/fs/cgroup/cpu/my_group

echo 50000 | sudo tee /sys/fs/cgroup/cpu/my_group/cpu.cfs_quota_us

# Allow 50 ms of CPU time per 100 ms period (50% of one CPU)

echo $$ | sudo tee /sys/fs/cgroup/cpu/my_group/tasks
# Add current process to group

6. Changing the Scheduling Policy (FIFO, RR, DEADLINE)

There is no supported boot parameter that replaces CFS wholesale; FIFO, RR, and DEADLINE are per-process policies, set with chrt:

sudo chrt -r -p 20 1234
# Give PID 1234 the Round Robin policy with priority 20

To inspect the scheduler features currently enabled:

cat /sys/kernel/debug/sched_features

Conclusion

Oracle Linux allows fine-tuning of the CFS scheduler for different workloads. By adjusting latency, granularity, CPU affinity, and priority, you can optimize performance for desktop, real-time, or server environments.

Optimizing CFS Scheduler for Oracle Database on Oracle Linux

Oracle databases are highly I/O-intensive and require careful CPU and memory scheduling to optimize performance.
Since Oracle Linux uses the Completely Fair Scheduler (CFS) by default, fine-tuning it can significantly improve database responsiveness, especially for OLTP (Online Transaction Processing) and OLAP (Analytical Processing) workloads.

1. Key Performance Factors for Oracle Database

When tuning CFS for Oracle DB, consider:
• CPU Scheduling – Ensure Oracle DB processes get priority.
• I/O Scheduling – Optimize disk access for high throughput.
• NUMA Optimization – Improve memory locality for large databases.
• HugePages – Reduce memory fragmentation and TLB misses.

2. Modify CFS Scheduler for Oracle DB Performance

a) Set Oracle Processes to Higher CPU Priority

Oracle DB runs multiple processes (oracle, pmon, dbwr, etc.). These should get more CPU time.

Check Oracle DB Process IDs:

pgrep -u oracle

Increase CPU Priority Using chrt

sudo chrt -f -p 99 $(pgrep pmon)
# Give highest FIFO priority to PMON

sudo chrt -f -p 98 $(pgrep dbwr)
# Higher priority for the DBWR process

sudo chrt -f -p 90 $(pgrep lgwr)
# High priority for the log writer

Why?

• PMON (Process Monitor) needs fast response for crash recovery.

• DBWR (Database Writer) should not be delayed, as it writes dirty pages to disk.

• LGWR (Log Writer) must flush redo logs with minimal delay.

Make it Persistent Using systemd
# if an Oracle service unit exists:
sudo systemctl edit oracle-db.service

Add:

[Service]
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=99

Then reload:

sudo systemctl daemon-reload

b) Optimize I/O Scheduler for Oracle Database

Set the Deadline Scheduler (Recommended)

echo deadline | sudo tee /sys/block/sdX/queue/scheduler

# Replace sdX with the database disk

(On blk-mq kernels, which Oracle Linux 8 uses by default, the scheduler names are mq-deadline and none; cat the scheduler file to see the values valid on your system.)

Why?

• Favors low-latency writes, critical for Oracle redo logs.

• Minimizes starvation of small transactions in OLTP workloads.

• Provides predictable performance for large queries in OLAP.

For SSDs, Use noop Instead

echo noop | sudo tee /sys/block/nvme0n1/queue/scheduler

(On blk-mq kernels the equivalent policy is none.)

Why?

• SSDs have no seek time, so simple FIFO scheduling (noop/none) is best.

Make it Persistent

sudo grubby --update-kernel=ALL --args="elevator=deadline"
sudo reboot

c) Enable CPU Affinity for Oracle DB

To prevent CPU contention, bind Oracle DB to dedicated CPU cores.

Find Available CPUs:

lscpu | grep "CPU(s):"

Bind Oracle to Specific Cores (e.g., 2-5)

for pid in $(pgrep -u oracle); do
    sudo taskset -cp 2-5 $pid
done

Persistent Configuration with cgroups

sudo mkdir /sys/fs/cgroup/cpuset/oracle

echo 2-5 | sudo tee /sys/fs/cgroup/cpuset/oracle/cpuset.cpus

echo 0 | sudo tee /sys/fs/cgroup/cpuset/oracle/cpuset.mems
# cpuset.mems must be set before tasks can join a cpuset (cgroup v1)

echo $$ | sudo tee /sys/fs/cgroup/cpuset/oracle/tasks

d) NUMA Optimization for Oracle Database

On NUMA systems, Oracle DB can suffer from cross-node memory latency.


Check NUMA Nodes:

numactl --hardware

Force Oracle to Use Local Memory Node:

numactl --membind=0 --cpunodebind=0 ./oracle

• This binds Oracle DB to NUMA node 0 for better memory locality.

Persistent NUMA Binding

sudo systemctl edit oracle-db

Add:

[Service]
ExecStart=/usr/bin/numactl --membind=0 --cpunodebind=0 /u01/app/oracle/product/19c/bin/oracle

e) Enable HugePages for Better Memory Management

HugePages reduce memory fragmentation and improve TLB efficiency for Oracle DB.

Check Current HugePages:

grep Huge /proc/meminfo

Calculate HugePages for Oracle (Assume 64GB SGA)

echo $((64 * 1024 / 2))
# 64GB / 2MB HugePage size = 32768

Set HugePages Count

echo 32768 | sudo tee /proc/sys/vm/nr_hugepages

Make It Persistent

echo "vm.nr_hugepages=32768" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
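The calculation above generalizes to a small sketch (assuming 2 MB hugepages and the SGA expressed in whole GB; adjust sga_gb for your instance):

```shell
# Compute vm.nr_hugepages for a given SGA size with 2 MB hugepages
sga_gb=64          # SGA size in GB (assumption; set to your SGA_MAX_SIZE)
hugepage_mb=2      # hugepage size in MB (check Hugepagesize in /proc/meminfo)
pages=$(( sga_gb * 1024 / hugepage_mb ))
echo "vm.nr_hugepages=$pages"
```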

3. Monitor Performance After Changes

After applying these optimizations, monitor Oracle DB performance.

Check CPU and Scheduling Statistics

ps -eo pid,comm,policy,pri,nice | grep oracle

Monitor I/O Performance

iostat -x 1

Check HugePages Usage

grep Huge /proc/meminfo

Verify NUMA Binding

numastat -p $(pgrep -u oracle | head -n 1)


Summary: Oracle DB Tuning on Oracle Linux

By tuning the CFS scheduler, I/O scheduler, CPU affinity, NUMA settings, and HugePages, Oracle Database performance on Oracle Linux can be significantly improved, reducing query latency and improving transaction throughput.

Finally, here is a script for tuning Oracle Database processes:

Oracle Database Performance Optimization Script for Oracle Linux

This script automates CPU, I/O, NUMA, and memory tuning for Oracle Database on Oracle Linux.

📌 Features:

• Sets Oracle DB processes to real-time priority (FIFO)
• Configures the I/O scheduler to deadline (or noop for SSDs)
• Binds Oracle DB to specific CPU cores for optimal performance
• Enables NUMA awareness for memory locality
• Configures HugePages for efficient memory management

Script: oracle_tune.sh

#!/bin/bash

# Oracle Database Tuning Script for Oracle Linux
# Oracle Linux 8/9

echo "Starting Oracle Database Performance Optimization..."

# Variables
ORACLE_USER="oracle"
ORACLE_DISK="sda"
# Change to the correct database disk (block device name, e.g. sda)
CPU_CORES="2-5"
# Adjust CPU affinity as needed
NUMA_NODE="0"
# Adjust based on `numactl --hardware`
HUGEPAGES_COUNT=32768
# Adjust based on Oracle SGA size

echo "Setting Real-Time Priority for Oracle Processes..."
for pid in $(pgrep -u $ORACLE_USER); do
    sudo chrt -f -p 99 $pid
done

echo "Optimizing I/O Scheduler..."
if [[ -e /sys/block/$ORACLE_DISK/queue/scheduler ]]; then
    echo "deadline" | sudo tee /sys/block/$ORACLE_DISK/queue/scheduler
fi

echo "Binding Oracle to CPUs: $CPU_CORES..."
for pid in $(pgrep -u $ORACLE_USER); do
    sudo taskset -cp $CPU_CORES $pid
done

echo "Enabling NUMA Optimization..."
# numactl cannot re-bind a running process; migratepages (from the numactl
# package) moves a process's already-allocated pages to the target node
for pid in $(pgrep -u $ORACLE_USER); do
    sudo migratepages $pid all $NUMA_NODE
done

echo " Configuring HugePages..."
echo $HUGEPAGES_COUNT | sudo tee /proc/sys/vm/nr_hugepages
echo "vm.nr_hugepages=$HUGEPAGES_COUNT" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

echo "Oracle Database Optimization Completed!"

How to Use the Script

Make the Script Executable

chmod +x oracle_tune.sh

Run as Root (or with sudo)

sudo ./oracle_tune.sh

Verify the Optimizations

• Check CPU Priority:
ps -eo pid,comm,policy,pri | grep oracle

• Check I/O Scheduler:
cat /sys/block/sda/queue/scheduler

• Check NUMA Binding:
numastat -p $(pgrep -u oracle | head -n 1)

• Check HugePages:
grep Huge /proc/meminfo

📌 Notes

• Adjust ORACLE_DISK, CPU_CORES, and HUGEPAGES_COUNT based on your system.

• Run lsblk to find the correct disk.

• Use numactl --hardware to determine NUMA nodes.

• Set SGA size properly when configuring HugePages.

Bonus: Run at Startup (Optional)

To apply these optimizations at every reboot:

sudo cp oracle_tune.sh /etc/init.d/
sudo chmod +x /etc/init.d/oracle_tune.sh
sudo ln -s /etc/init.d/oracle_tune.sh /etc/rc.d/



Conclusion

This script automates CPU scheduling, I/O tuning, NUMA optimization, and HugePages configuration for Oracle Database, improving performance on Oracle Linux.

Automatically Running Script:

In Oracle Linux 8 (OEL8), simply placing the script in /etc/rc.d/ will not automatically execute it at boot. Since OEL8 uses systemd, you need to create a systemd service to ensure the script runs at startup.

Steps to Automatically Run the Script at Boot

Move the Script to a System Directory

First, place the script in a proper location:

sudo mv oracle_tune.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/oracle_tune.sh

Create a systemd Service File

Now, create a service file for systemd:

sudo nano /etc/systemd/system/oracle_tune.service

Paste the Following Configuration:

[Unit]
Description=Oracle Database Performance Optimization
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/oracle_tune.sh
RemainAfterExit=yes
User=root

[Install]
WantedBy=multi-user.target

Reload systemd and Enable the Service

systemctl daemon-reload
systemctl enable oracle_tune.service
systemctl start oracle_tune.service

Verify That the Service Runs at Boot

Check the status:

systemctl status oracle_tune.service

Test a reboot:

sudo reboot

After the system restarts, verify:

systemctl status oracle_tune.service

Summary

Now, Oracle Linux will automatically optimize your database performance at every reboot.

Alireza Kamrani. 
Database Consultant, ACE.

Thursday, February 13, 2025

Scheduling Mechanisms in Oracle Linux

Linux provides various synchronization and scheduling mechanisms to handle concurrency and avoid race conditions in multi-threaded or multi-process environments. Here are some key methods:


1. Spinlock


• A spinlock is a busy-wait lock that continuously checks for the lock's availability.


• It is useful in cases where the lock is expected to be held for a very short time.


• Since it does not put the thread to sleep, it avoids context switching overhead.


• Example: Used in kernel space for short critical sections.


DEFINE_SPINLOCK(lock); /* statically initialized spinlock */

spin_lock(&lock); // Acquire the lock

// Critical section

spin_unlock(&lock); // Release the lock


2. Mutex (Mutual Exclusion)


• A mutex is a sleep-based locking mechanism.


• If a thread cannot acquire the lock, it goes into a sleep state instead of spinning.


• Only the thread that locked it can unlock it.


• Used in both user-space (pthread mutex) and kernel-space.


Example in user space (POSIX pthreads):


pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&lock); // Acquire lock

// Critical section

pthread_mutex_unlock(&lock); // Release lock


Example in kernel space:


struct mutex my_mutex;

mutex_init(&my_mutex);

mutex_lock(&my_mutex); // Acquire lock

// Critical section

mutex_unlock(&my_mutex); // Release lock


3. Semaphore


• A semaphore is a signaling mechanism, allowing multiple threads to access a resource but limiting concurrency.


• It has a counter that determines the number of threads that can access the resource at the same time.


• Two types: 


• Binary Semaphore (similar to mutex, but any thread can unlock).


• Counting Semaphore (allows multiple resources to be acquired).


Example in user space:


sem_t sem;

sem_init(&sem, 0, 1); // Initialize semaphore with 1 resource

sem_wait(&sem); // Acquire resource (decrements counter)

// Critical section

sem_post(&sem); // Release resource (increments counter)


Example in kernel space:


struct semaphore my_sem;

sema_init(&my_sem, 1);

down(&my_sem); // Acquire semaphore

// Critical section

up(&my_sem); // Release semaphore


4. Read-Write Lock


• Allows multiple readers or a single writer at a time.


• Prevents multiple writers from accessing simultaneously.


Example in kernel space:


DEFINE_RWLOCK(my_rwlock);

read_lock(&my_rwlock);   // Multiple readers can acquire concurrently
// Read operation
read_unlock(&my_rwlock);

write_lock(&my_rwlock);  // Writers acquire an exclusive lock
// Write operation
write_unlock(&my_rwlock);


5. RCU (Read-Copy-Update)


• A specialized synchronization mechanism optimized for read-heavy workloads.


• Allows readers to access data without blocking while writers update a copy.


Example in kernel space:


rcu_read_lock();
// Read critical section
rcu_read_unlock();

synchronize_rcu(); // Writer side: ensures a safe update


6. Futex (Fast User-Space Mutex)


• A hybrid mechanism that primarily works in user space but can use the kernel if contention occurs.


• Used for efficient thread synchronization with minimal kernel interaction.


Example in user space:


syscall(SYS_futex, &futex_var, FUTEX_WAIT, expected_value, NULL); 


Each of these methods is suited for different use cases depending on performance and synchronization needs.



How Oracle Linux handles scheduling:


Oracle Linux, like other Linux distributions, uses the Completely Fair Scheduler (CFS) as its primary process scheduler. It is the default scheduler in the Linux kernel and is responsible for managing task execution efficiently.


Key Scheduling Methods in Oracle Linux


1. Completely Fair Scheduler (CFS) – Default CPU Scheduler


• Used for scheduling processes in the system.


• Assigns each task a virtual runtime based on the weight of the task (priority).


• Ensures fair distribution of CPU time across processes.


• Implements an O(log N) scheduling complexity using a red-black tree.


CFS Key Features:


• Uses a red-black tree to manage processes.


• Balances CPU time dynamically based on process weight (priority).


• Supports CPU affinity to optimize multi-core scheduling.


• Can be tuned using /proc/sys/kernel/sched_* parameters.


2. Real-Time Scheduling – FIFO and Round Robin


• For real-time tasks, Oracle Linux supports SCHED_FIFO (First In, First Out) and SCHED_RR (Round Robin).


• These are priority-based scheduling policies where high-priority tasks run until completion or voluntarily yield CPU time.



Example: Setting a real-time priority for a process:


chrt -f 10 ./realtime_task # FIFO priority 10

chrt -r 20 ./realtime_task # Round Robin priority 20 


3. Deadline Scheduler – For Real-Time and Latency-Sensitive Workloads


• Uses Earliest Deadline First (EDF) scheduling.


• Ideal for applications requiring guaranteed execution within a specific timeframe.


• Controlled using SCHED_DEADLINE.


4. Task and Thread Synchronization Mechanisms in Oracle Linux


Oracle Linux, like other enterprise Linux distributions, uses several synchronization methods to manage concurrency and avoid race conditions. These include:


• Spinlocks: Used in the kernel for short critical sections.


• Mutexes: Used in both user space (via pthread_mutex_t) and kernel space (struct mutex).


• Semaphores: Used for counting and controlling access to shared resources.


• Read-Write Locks: Used for allowing multiple readers but a single writer.


• Futex (Fast User-Space Mutex): Used by the Linux kernel to optimize user-space thread synchronization.


5. CPU Affinity and NUMA Optimization


• Oracle Linux optimizes scheduling for multi-core and NUMA (Non-Uniform Memory Access) systems.


• CPU affinity can be set using taskset and numactl to bind processes to specific cores or memory nodes.


Example:


taskset -c 0,1 ./app 

# Bind to CPU cores 0 and 1 

numactl --membind=0 --cpunodebind=0 ./app 

# Bind to NUMA node 0 


Conclusion


Oracle Linux primarily uses the Completely Fair Scheduler (CFS) for general scheduling and supports FIFO, Round Robin, and Deadline Scheduling for real-time applications. It also employs various synchronization methods like mutexes, semaphores, and spinlocks to manage concurrency efficiently.




Monday, February 10, 2025

Index Prefix Compression in Oracle With Introduction of a New Advanced Index Compression Capability with Oracle 23c


Alireza Kamrani 

  02/10/2025


Index Key Compression, also referred to as Index Prefix Compression, is a compression feature included with Oracle Database Enterprise Edition that can reduce the overall size of indexes; it helps both multi-column unique indexes and non-unique indexes alike.


As a result, it is one of the most critical index optimization features available to DBAs for effectively managing the space used by the indexes.

Index Key Compression allows for compressing portions of the key values in an index segment (or Index Organized Table (IOT)), by reducing the storage inefficiencies of storing repeating values multiple times. 

It compresses the data by splitting the index key into two parts:

  • Prefix Entries: the leading group of columns, which are potentially shared across multiple key values
  • Suffix Entries: the suffix columns, which are unique to every index key.

As the prefixes are potentially shared across multiple keys in a block, these can be stored more optimally (only once per block) and shared across multiple suffix entries, resulting in the index data being compressed.

Index Key compression is performed in the leaf blocks of a B-Tree index. 


The keys are compressed locally within an index leaf block, meaning that both the prefix and suffix entries are stored within the same block.


Suffix entries make up the compressed representation of the index key. Each one of these compressed rows refers to the corresponding prefix, which is stored in the same block. 

By storing the prefixes and suffixes locally in the same block, each index block is self-contained, and it is possible to construct the complete key without incurring any additional block I/O. Reconstructing the key is a very inexpensive, memory-only operation. 


For new indexes and index partitions, enabling Index Key Compression is easy - simply CREATE the index or index partition and specify the index compression clause. 


See the example below:


CREATE INDEX idxname ON tabname(col1, col2, col3) COMPRESS;


An existing index or index partition can be REBUILT compressed using the syntax shown below:


ALTER INDEX idxname REBUILD COMPRESS;


By default, the prefix consists of all indexed columns for non-unique indexes, and all indexed columns excluding the last one for unique indexes. 

Alternatively, it is possible to specify the prefix length as part of the index compression clause, which is the number of columns in the prefix entries:


CREATE INDEX idxname ON tabname(col1, col2, col3) COMPRESS 2;


The number after the COMPRESS keyword (the prefix column length) denotes how many leading columns to compress. 


The maximum prefix length for a non-unique index is the number of columns in the index key, and for a unique index is the number of key columns minus one.
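
As a hedged illustration of the unique-index limit described above (the table and column names are hypothetical), a two-column unique index can only share its leading column:

```sql
-- Hypothetical unique index on (order_id, line_no).
-- The maximum prefix length for a unique index is the number of key
-- columns minus one, so COMPRESS 1 is the highest (and only) valid
-- prefix length here; an order_id repeated across many lines is then
-- stored once per leaf block as a prefix entry.
CREATE UNIQUE INDEX order_lines_uk
  ON order_lines (order_id, line_no) COMPRESS 1;
```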

Prefix entries are written to the index block only if the index block does not already contain that prefix. 

They are available for sharing across multiple suffix entries immediately after being written and remain available until the last referencing suffix entry is deleted from the block. 


Although key compression reduces the storage requirements of an index by sharing parts of keys across multiple entries, there is a small CPU overhead to reconstruct the key column values during index lookup or scans, which is minimized by keeping the prefixes locally in the block.


Index Key Compression achieves a more optimal representation of an index, and ensures that it stays permanently compressed without any subsequent overhead on the maintenance operations. 

As a result, it has a positive impact on the storage and space savings, but also achieves secondary benefits such as better cache efficiency, fewer leaf blocks and less deep tree resulting in potentially fewer logical IOs and cheaper execution plans. 


In many cases, the overhead of reconstructing the complete key is offset by the more efficient block representation, the ability to fit many more rows in a given block, the reduction in I/O required to read the index rows, and better buffer cache efficiency, such that applications see an improvement in overall performance.


Index Key compression can be extremely useful in many different scenarios, a few of which are listed below:

  • Index Key Compression can be used with a non-unique index where ROWID is appended to make the key unique. If such an index is compressed using key compression, the duplicate key is stored only once as a prefix entry in the index block without the ROWID. The remaining rows become suffix entries consisting of only the ROWID
  • Index Key Compression can be used with a unique multicolumn index (key compression is not possible for unique single column index because there is a unique piece but there are no prefix grouping pieces to share)
  • Index Key Compression can be used with Index Organized Tables. The same considerations as unique multicolumn indexes apply
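
The first scenario above can be sketched as follows (the table and index names are hypothetical). For a non-unique index, plain COMPRESS uses all indexed columns as the prefix:

```sql
-- Hypothetical non-unique index on a low-cardinality column.
-- With COMPRESS (default prefix = all columns for a non-unique index),
-- each distinct status value is stored once per leaf block as a prefix
-- entry; the suffix entries then carry only the ROWIDs.
CREATE INDEX orders_status_ix ON orders (status) COMPRESS;
```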


The key to getting good index compression is identifying which indexes will benefit from it and correctly specifying the prefix column length for those indexes. This requires an understanding of the data in order to choose the most optimal prefix column count. 

If you want to estimate the ideal compression ratio and the percentage of leaf block space that could be saved, you need to look at INDEX_STATS view after ANALYZING the index, for example:


ANALYZE INDEX indexname VALIDATE STRUCTURE; 


SELECT name, height, blocks, opt_cmpr_count, opt_cmpr_pctsave FROM index_stats 
WHERE name = 'INDEXNAME'; 


“OPT_CMPR_COUNT” indicates the number of columns to compress in the index to get maximum space savings in the leaf blocks (prefix column length).


“OPT_CMPR_PCTSAVE” indicates the percentage reduction in leaf block space used if the index is compressed using this prefix length.
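
Putting the advisor output to use might look like the following sketch, where the index name and the recommended count of 2 are hypothetical:

```sql
-- Suppose the query against INDEX_STATS reported OPT_CMPR_COUNT = 2
-- and a meaningful OPT_CMPR_PCTSAVE; rebuild with that prefix length:
ALTER INDEX idxname REBUILD COMPRESS 2;
```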

Compression can be very beneficial when the prefix columns of an index are repeated many times within a leaf block. 


However, if the leading columns are very selective, or if there are not many repeated values for the prefix columns, then index prefix compression may not be the best solution. 


In these scenarios, Oracle still creates prefix entries storing all unique combinations of compressed column values within a leaf block. The index rows will refer to prefix entries that are rarely (if at all) shared by other index rows. 

Thus, it is possible that compression in these cases is not beneficial, and could end up increasing the index size due to the overhead of storing all of the prefix entries.


For index compression to be beneficial, ensure that low cardinality columns are the leading columns in a concatenated index. 


Otherwise, there is a risk of getting negative compression such that leaf blocks can no longer store as many keys as their non-compressed counterparts. 


Additionally, there is no point in compressing a single column unique index or compressing every column in a concatenated, multi-column unique index. 

In these cases, compression will result in an index structure that increases in size rather than decreasing (negative compression) due to all the overhead associated with having prefix entries for every index row.


The key to getting good index compression is identifying which indexes will benefit from it and correctly specifying the prefix column length. 


The discussion above on how to figure out the optimal Prefix Column Length can help, but also please note the following:

  • Implementing compression requires a deep understanding of the data in order to choose the most optimal prefix column count
  • Specified prefix column count may not be optimal to produce the best compression ratio for every block in the index
  • Requires running ANALYZE INDEX to obtain an optimal prefix column count, which produces the optimal count for the index as a whole. This is not at the granularity of a block, so it may not yield the best compression ratio. Additionally, running ANALYZE INDEX takes an exclusive lock on the table, effectively making the table “offline” for this period
  • Possible to get negative compression, as pointed out earlier, such as in the case where the specified prefix columns are unique in a block

Application developers and DBAs need to be very selective on which indexes to compress and correctly set the prefix column count for these indexes. 

Oracle protects you under certain obvious conditions, but it is your responsibility to compress the indexes in the right manner.


For more information about Index Organized Tables (IOTs) and compression, as well as how to find the optimal index key compress level for indexes, please see these MOS notes:

Doc ID 1555637.1 – Index Organized Tables (IOTs) and Compression
Doc ID 601690.1 – How to find the optimal Index Key COMPRESS level for indexes


New Advanced Index Compression Capability with Oracle Database 23c

Advanced Index Compression, a feature of Advanced Compression, simplifies index compression. 

Advanced Index Compression enables the highest levels of data compression and provides enterprises with storage savings and query performance improvements due to reduced I/O. 


Advanced Index Compression supports multiple compression levels: LOW and HIGH. 

This discussion will focus on the LOW level of index compression. 


Advanced Index Compression LOW computes the prefix column count for compressed indexes. Rather than using a static prefix count for all index leaf blocks, it aims to compute an optimal prefix count for every index leaf block. 


The correct and most optimal numbers of prefix columns are calculated automatically on a block-by-block basis and thus produce the best compression ratio possible. 

It is possible to have different index leaf blocks compressed with different prefix column counts or not be compressed at all if there are no repeating prefixes.


So, what is new with Advanced Index Compression LOW with Oracle Database 23c? 

Advanced Index Compression LOW for Index-Organized Tables (IOTs)

An index-organized table is a table stored in a variation of a B-tree index structure. 

In contrast, a heap-organized table inserts rows where they fit.

In an index-organized table, rows are stored in an index defined on the primary key for the table. Each index entry in the B-tree also stores the non-key column values. 


Thus, the index is the data, and the data is the index. 


Applications manipulate index-organized tables just like heap-organized tables, using SQL statements.


IOTs are popular because they provide fast random access by primary key without duplicating primary key columns in two structures – a heap table and an index. 


Index-Organized Tables can now be compressed with Advanced Index Compression (LOW). 


Advanced Index Compression LOW can be enabled easily by specifying the COMPRESS option for indexes. 


For example:


SQL> create table tiot (c1 number, c2 number, c3 number, c4 number, primary key (c1, c2)) organization index compress advanced low;
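
For ordinary B-tree indexes, the same feature can be enabled at creation time or via a rebuild; a minimal sketch follows (the index, table, and column names are placeholders):

```sql
-- Enable Advanced Index Compression LOW on a new index ...
CREATE INDEX idxname ON tabname (col1, col2, col3) COMPRESS ADVANCED LOW;

-- ... or on an existing index via a rebuild:
ALTER INDEX idxname REBUILD COMPRESS ADVANCED LOW;
```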


In earlier releases, IOTs only supported Oracle’s prefix key compression for index compression. 


Usage of prefix key compression required user analysis and had the possibility of negative compression (where the overhead of compression outweighed the compression benefits). 

This new feature extends Advanced Index Compression (LOW) to IOTs, allowing users to enable compression for all IOTs without the possibility of negative compression and without any user analysis required. 


So, what does this mean for your organization?

Average IOT storage reduction can range from 2x to 5x. 

Using 2x as an example, this means that the amount of space consumed by uncompressed data will be two times larger than that of the compressed data. 


By reducing their IOT storage requirements, IT managers can reduce, and sometimes eliminate their need to purchase new storage. 


The cost of decompressing a block compressed with Advanced Index Compression LOW is compensated by the fact that in most scenarios, the database would be scanning a smaller number of blocks. 


So, in general, IOT compression typically won’t compromise query performance (no significant degradation).


Use the Free Compression Advisor to see how well your indexes will compress

The “DBMS_COMPRESSION” PL/SQL package (commonly called compression advisor) is included with Oracle Database Enterprise Edition and gathers compression-related information within a database environment. 


The output of running compression advisor is an estimation of the compression ratio for the specific table or partition that was the target of compression advisor. 

Compression advisor provides organizations with the storage reduction information needed to make compression-related usage decisions.
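
A hedged PL/SQL sketch of invoking the advisor for an index is shown below. The owner, index name, and scratch tablespace are placeholders, and the exact parameter list of DBMS_COMPRESSION.GET_COMPRESSION_RATIO should be checked against your release's documentation (the index-aware OBJTYPE parameter appeared in 12.2):

```sql
-- Hedged sketch: estimate the Advanced LOW compression ratio for a
-- hypothetical index SCOTT.EMP_IDX, using USERS as the scratch
-- tablespace for the advisor's temporary objects.
SET SERVEROUTPUT ON
DECLARE
  l_blkcnt_cmp    PLS_INTEGER;
  l_blkcnt_uncmp  PLS_INTEGER;
  l_row_cmp       PLS_INTEGER;
  l_row_uncmp     PLS_INTEGER;
  l_cmp_ratio     NUMBER;
  l_comptype_str  VARCHAR2(100);
BEGIN
  DBMS_COMPRESSION.GET_COMPRESSION_RATIO(
    scratchtbsname => 'USERS',
    ownname        => 'SCOTT',
    objname        => 'EMP_IDX',
    subobjname     => NULL,
    comptype       => DBMS_COMPRESSION.COMP_INDEX_ADVANCED_LOW,
    blkcnt_cmp     => l_blkcnt_cmp,
    blkcnt_uncmp   => l_blkcnt_uncmp,
    row_cmp        => l_row_cmp,
    row_uncmp      => l_row_uncmp,
    cmp_ratio      => l_cmp_ratio,
    comptype_str   => l_comptype_str,
    objtype        => DBMS_COMPRESSION.OBJTYPE_INDEX);
  DBMS_OUTPUT.PUT_LINE('Estimated compression ratio: ' || l_cmp_ratio);
END;
/
```

The returned CMP_RATIO is the estimated ratio of uncompressed to compressed blocks, which maps directly onto the 2x to 5x storage-reduction figures discussed above.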


Alireza Kamrani 
