Friday, March 29, 2024

🔴♨️Dynamic CPU Scaling Oracle 19c♨️🔴

New parameter CPU_MIN_COUNT in Oracle Database 19c


Starting with Oracle Database 19.4, a new feature called “Dynamic CPU Scaling” was introduced. 

The basic idea is that with the Multitenant architecture, all pluggable databases (PDBs) share the CPUs defined by CPU_COUNT in the container database (CDB). 

By default, every PDB has the same CPU_COUNT value as the CDB, which is effectively an over-allocation of CPU resources. 


That’s why Oracle introduced a new parameter called CPU_MIN_COUNT, which guarantees that a minimum amount of CPU capacity is available to a specific PDB at any time. 


In the end this means, at least for me, that CPU scaling only works as intended if the sum of all CPU_MIN_COUNT values over all PDBs is lower than the actual CPU_COUNT defined in the CDB. 

Or at least, that is how it worked in the cases I tested. Nevertheless, it is a handy feature to limit CPU resources and mitigate noisy-neighbor issues in a multitenant environment.


CPU_MIN_COUNT specifies the minimum number of CPUs required by a pluggable database (PDB) at any given time. For multi-threaded CPUs, this number corresponds to CPU threads, not CPU cores. 


You can set this parameter at the CDB level, and for each individual PDB. This enables you to control each PDB's minimum share of CPU utilization within the CDB. 
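
As a minimal sketch of how this looks in practice (container names and values below are only illustrative):

-- In the root: the CPU budget of the whole CDB
ALTER SESSION SET CONTAINER = CDB$ROOT;
ALTER SYSTEM SET CPU_COUNT = 16 SCOPE = BOTH;

-- In a PDB: cap its CPU usage and guarantee it a minimum share
ALTER SESSION SET CONTAINER = PDB1;
ALTER SYSTEM SET CPU_COUNT = 8 SCOPE = BOTH;
ALTER SYSTEM SET CPU_MIN_COUNT = 2 SCOPE = BOTH;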


If the sum of the CPU_MIN_COUNT values across all open PDBs in a CDB is equal to the value of CPU_MIN_COUNT for the CDB, then the CDB instance is considered full. 


If the sum exceeds the value of CPU_MIN_COUNT for the CDB, then the CDB instance is over-provisioned. 

Oracle does not prevent you from over-provisioning a CDB. 


Resource Manager is enabled at the CDB level by setting the RESOURCE_MANAGER_PLAN at the root level to the name of a CDB resource plan. If the CDB resource plan has no configured CPU directives, that is, the SHARES and UTILIZATION_LIMIT directives are unset, then Resource Manager uses the CPU_COUNT and CPU_MIN_COUNT settings for the PDB to manage CPU utilization.

So review all existing resource plans on the CDB before testing this parameter.
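
A quick way to review what is already in place, using the standard dictionary views:

-- Which CDB resource plan is active, and what CPU directives does it carry?
SHOW PARAMETER resource_manager_plan

SELECT plan, pluggable_database, shares, utilization_limit
  FROM dba_cdb_rsrc_plan_directives
 ORDER BY plan, pluggable_database;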


Regards,

Alireza Kamrani

Senior RDBMS Consultant

Friday, March 15, 2024

All-in-one techniques for DBAs: storage options, I/O concepts, hardware models, and disk types



Optimizing for Random I/O and Sequential I/O

This section explains the differences between random I/O and sequential I/O, and DiskSuite strategies for optimizing your particular configuration.


Random I/O

What is random I/O?
Databases and general-purpose file servers are examples of random I/O environments. 

In random I/O, the time spent waiting for disk seeks and rotational latency dominates I/O service time.

Why do I need to know about random I/O?
You can optimize the performance of your configuration to take advantage of a random I/O environment.

What is the general strategy for configuring for a random I/O environment?
You want all disk spindles to be busy most of the time servicing I/O requests. Random I/O requests are small (typically 2-8 Kbytes), so it's not efficient to split an individual request of this kind onto multiple disk drives.
The interlace size doesn't matter, because you just want to spread the data across all the disks. Any interlace value greater than the typical I/O request will do.
For example, assume you have 4.2 Gbytes DBMS table space. If you stripe across four 1.05-Gbyte disk spindles, and if the I/O load is truly random and evenly dispersed across the entire range of the table space, then each of the four spindles will tend to be equally busy. 
The target for maximum random I/O performance on a disk is 35 percent or lower as reported by DiskSuite Tool's performance monitor, or by iostat(1M). Disk use in excess of 65 percent on a typical basis is a problem. Disk use in excess of 90 percent is a major problem. 
If you have a disk running at 100 percent and you stripe the data across four disks, you might expect the result to be four disks each running at 25 percent (100/4 = 25 percent). However, you will probably get all four disks running at greater than 35 percent since there won't be an artificial limitation to the throughput (of 100 percent of one disk).

Sequential Access I/O

What is sequential I/O?
Although most people think of disk I/O in terms of sequential performance figures, only a few systems -- DBMS servers dominated by full table scans, and NFS servers in very data-intensive environments -- will normally experience sequential I/O.


Why do I need to know about sequential I/O?
You can optimize the performance of your configuration to take advantage of a sequential I/O environment.
The goal in this case is to get greater sequential performance than you can get from a single disk. To achieve this, the stripe width should be "small" relative to the typical I/O request size. This will ensure that the typical I/O request is spread across multiple disk spindles, thus increasing the sequential bandwidth. 


What is the general strategy for configuring for a sequential I/O environment?
You want to get greater sequential performance from an array than you can get from a single disk by setting the interlace value small relative to the size of the typical I/O request.


max-io-size / #-disks-in-stripe 


Example:
Assume a typical I/O request size of 256 Kbyte and striping across 4 spindles. A good choice for stripe unit size in this example would be:
256 Kbyte / 4 = 64 Kbyte, or smaller


Note -

Seek and rotation time are practically non-existent in the sequential case. When optimizing sequential I/O, the internal transfer rate of a disk is most important.


The most useful recommendation is: 

max-io-size / #-disks. 


Striping Trade-offs

  • Striping cannot be used to encapsulate existing file systems.
  • Striping performs well for large sequential I/O and for uneven I/O distributions.
  • Striping uses more CPU cycles than concatenation, but the trade-off is usually worth it.
  • Striping does not provide any redundancy of data.

To summarize the trade-offs: Striping delivers good performance, particularly for large sequential I/O and for uneven I/O distributions, but it does not provide any redundancy of data.


Write-intensive applications: Because of the read-modify-write nature of RAID5, metadevices with greater than about 20 percent writes should probably not be RAID5. 

If data protection is required, consider mirroring.

RAID5 writes will never be as fast as mirrored writes, which in turn will never be as fast as unprotected writes. The NVRAM cache on the SPARCstorage Array closes the gap between RAID5 and mirrored configurations.


Full Stripe Writes: RAID5 read performance is always good (unless the metadevice has suffered a disk failure and is operating in degraded mode), but write performance suffers because of the read-modify-write nature of RAID5.


What is the best RAID configuration when setting up a new system for an Oracle database?

The answer depends on the database, and on which part of the database, but here are some general recommendations:


Here is what I like (RAID 0 = stripes, RAID 1 = mirrors, RAID 5 = striping + parity):


1- No RAID, RAID 0 or RAID 0+1 for online redo logs AND control files. You should still let Oracle multiplex them itself even if you mirror them. There are more opportunities for failure if the RAID subsystem reports a "warning" back to Oracle; as long as Oracle has multiplexed them, that is OK.



2- No RAID or RAID 0 for temporary datafiles (used with temporary tablespaces); no RAID/RAID 0 is sufficient. If you lose these, who cares? You want speed on these, not reliability. If a disk fails, drop and recreate temp elsewhere (a sketch of this is shown after the list).


3- No RAID, RAID 0 or RAID 0+1 for archive logs. Again, let Oracle multiplex them if you use no RAID or RAID 0; let the OS/array do it (different from the online redo logs here) if you use RAID 0+1.


4- RAID 0+1 for rollback (undo). It gets written to a lot, and it is important to have it protected. Oracle cannot multiplex it, so let the OS/array do it. Use this also for datafiles you believe will be HEAVILY written. Bear in mind that Oracle buffers writes to datafiles and they happen in the background, so the poor write performance of RAID 5 is usually acceptable except for heavily written files (such as rollback).


5- RAID 5 (unless you can do RAID 0+1 for everything, of course) for datafiles that experience what you determine to be "medium" or "moderate" write activity. Since these writes typically happen in the background (not with direct path loads and such), RAID 5 can usually be used safely for them. As these files represent the BULK of your database, while the files above represent the smaller part, you achieve most of the cost savings without impacting performance too much.


Try to dedicate specific devices to 


o online redo

o archive

o temp


they should not have to share their devices with others in a "perfect" world (not even with each other).
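
For point 2 above, losing a temp device is easy to recover from. A minimal sketch (the tablespace name and file path are hypothetical):

-- Create a replacement temporary tablespace on a surviving device
CREATE TEMPORARY TABLESPACE temp2
  TEMPFILE '/u02/oradata/ORCL/temp2_01.dbf' SIZE 8G AUTOEXTEND ON;

-- Make it the database default, then drop the damaged one
ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp2;
DROP TABLESPACE temp INCLUDING CONTENTS AND DATAFILES;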


My experience:

Usually RAID 10 is the best choice for all types of files in an Oracle database.

Try to place heavily loaded datafiles on separate LUNs, and keep the following separated as far as possible:

  • redo log files
  • datafiles
  • undo and temp

 

For redo logs, because of their sequential write behavior, you can use SAS 15k disks; if you run into trouble and see commit-related wait events, try adding disks (spindles) to the RAID group to achieve more throughput.

On heavily loaded systems you may also need SSDs for the redo logs, but if the disk pool/RAID group has its I/O separated from other files, this is normally not necessary.
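
If you suspect that commits are waiting on redo I/O, a quick look at the redo-related wait events in the standard V$ views is a reasonable first check:

-- Cumulative waits since instance startup; high average 'log file sync' and
-- 'log file parallel write' times point at slow redo storage
SELECT event, total_waits,
       ROUND(time_waited_micro / 1000) AS time_waited_ms,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 2) AS avg_wait_ms
  FROM v$system_event
 WHERE event IN ('log file sync', 'log file parallel write')
 ORDER BY time_waited_micro DESC;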


Random versus Sequential I/O

Every time you need to access a block on a disk drive, the disk actuator arm has to move the head to the correct track (the seek time), then the disk platter has to rotate to locate the correct sector (the rotational latency). This mechanical action takes time.


Obviously the amount of time depends on where the head was previously located and how fortunate you are with the location of the sector on the platter: if it’s directly under the head you do not need to wait, but if it just passed the head you have to wait for a complete revolution. 


Even on the fastest 15k RPM disk that takes 4 milliseconds (15,000 rotations per minute = 250 rotations per second, which means one rotation is 1/250th of a second or 4ms). 


What about the next block? 

Well, if that next block is somewhere else on the disk, you will need to incur the same penalties of seek time and rotational latency. We call this type of operation a random I/O. But if the next block happened to be located directly after the previous one on the same track, the disk head would encounter it immediately afterwards, incurring no wait time (i.e. no latency). This, of course, is a sequential I/O.

Sequential I/O is therefore a very important lever for I/O optimization on databases that reside on SAS (spinning) disks.
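
On the Oracle side the naming is a little counter-intuitive: the wait event 'db file sequential read' covers single-block (typically random) reads, while 'db file scattered read' covers multiblock reads. A quick way to see which pattern dominates an instance:

-- Which read pattern dominates this instance?
SELECT event, total_waits,
       ROUND(time_waited_micro / 1000) AS time_waited_ms
  FROM v$system_event
 WHERE event IN ('db file sequential read', 'db file scattered read', 'direct path read')
 ORDER BY time_waited_micro DESC;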


About Multiple Striped Volumes for Sequential and Random Access

If there are enough physical disks in OS/Server, create at least two striped volumes (in addition to a standalone hard disk or striped volume for the operating system).

One striped volume can be used for sequential data access, and the other can be used for random data access.

Oracle Database redo logs and archived redo logs, for example, are written in sequential order. Because of a reduced head movement, hard disks perform best when reading or writing sequential data.

Oracle Database data files, however, are usually accessed in random order. Random access in a hard disk results in significant head movement, translating to slower data access.

Unless redo logs are separated from data files (at the physical device level), file I/O contention may result, increasing access times for both types of files.
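
A simple way to verify this separation from inside the database is to list where the redo log members and datafiles actually live (standard dictionary views):

-- Redo log members and their locations
SELECT group#, member FROM v$logfile ORDER BY group#;

-- Datafile locations
SELECT tablespace_name, file_name FROM dba_data_files ORDER BY tablespace_name;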


Size Matters

I assume that readers have a basic knowledge of latency, IOPS, and bandwidth (or throughput). As a reminder, IOPS stands for I/Os Per Second and indicates the number of distinct input/output operations (i.e. reads or writes) that can take place within one second. You might use an IOPS figure to describe the amount of I/O created by a database, or you might use it when defining the maximum performance of a storage system. 

One is a real-world value and the other a theoretical maximum, but they both use the term IOPS.


When describing volumes of data, things are slightly different. 

Bandwidth is usually used to describe the maximum theoretical limit of data transfer, while throughput is used to describe a real-world measurement. 

You might say that the bandwidth is the maximum possible throughput. 

Bandwidth and throughput figures are usually given in units of size over units of time, e.g. Mb/sec or GB/sec. 


IOPS and throughput are related by the following relationship:

Throughput   =   IOPS   x   I/O size


It’s time to start thinking about that I/O size now. If we read or write a single random block in one second then the number of IOPS is 1 and the I/O size is also 1 (I’m using a unit of “blocks” to keep things simple). 

The Throughput can therefore be calculated as (1 x 1) = 1 block / second.


Alternatively, if we wanted to read or write eight contiguous blocks from disk as a sequential operation then this again would only result in the number of IOPS being 1, but this time the I/O size is 8. The throughput is therefore calculated as (1 x 8) = 8 blocks / second.


Hopefully you can see from this example the great benefit of sequential I/O on disk systems: it allows increased throughput. 


Every time you increase the I/O size you get a corresponding increase in throughput, while the IOPS figure remains resolutely fixed. 
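
On a running Oracle instance you can observe both sides of this relationship directly. The metric names below are standard V$SYSMETRIC names, and GROUP_ID = 2 denotes the 60-second interval:

-- Observed read IOPS and read throughput
SELECT metric_name, value, metric_unit
  FROM v$sysmetric
 WHERE group_id = 2
   AND metric_name IN ('Physical Read Total IO Requests Per Sec',
                       'Physical Read Total Bytes Per Sec');

Dividing bytes per second by requests per second gives the average I/O size, which ties the two figures together exactly as in the formula above.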


But what happens if you increase the number of IOPS?


Latency Kills Disk Performance


In the example above I described a single-threaded process reading or writing a single random block on a disk. 


That I/O results in a certain amount of latency, as described earlier on (the seek time and rotational latency). 

We know that the average rotational latency of a 15k RPM disk is 4ms, so let’s add another millisecond for the disk head seek time and call the average I/O latency 5ms.


How many (single-threaded) random IOPS can we perform if each operation incurs an average of 5ms wait? 

The answer is 1 second / 5 ms = 200 IOPS. 


Our process is hitting a physical limit of 200 IOPS on this disk.


What do you do if you need more IOPS? 


With a disk system you only really have one choice: add more disks.


If each spindle can drive 200 IOPS and you require 80,000 IOPS then you need (80,000 / 200) = 400 spindles. 


♨️Adding disks to the RAID group is one of the most efficient ways to minimize commit waits and latency caused by slow disks when the redo files reside there. This can often remove the requirement for SSD disks, especially when you are working on a limited infrastructure😀.


🧩It is not the art of a DBA to bypass performance issues by adding more and more memory, expensive disks, or CPUs to the machine instead of really investigating and analyzing the issue technically.


Better clear some space in that data centre, eh?


On the other hand, if you can perform the I/O sequentially you may be able to reduce the IOPS requirement and increase the throughput, allowing the disk system to deliver more data. 


I know of Oracle customers who spend large amounts of time and resources carving up and re-ordering their data in order to allow queries to perform sequential I/O. 


They figure that the penalty incurred from all of this preparation is worth it in the long run, as subsequent queries perform better. 


That’s no surprise when the alternative was to add an extra wing to the data centre to house another bunch of disk arrays, plus more power and cooling to run them. 

This sort of “no pain, no gain” mentality used to be commonplace because there really weren’t any other options. Until now.


Flash Offers Another Way

The idea of sequential I/O doesn’t exist with flash memory, because there is no physical concept of blocks being adjacent or contiguous.


Logically, two blocks may have consecutive block addresses, but this has no bearing on where the actual information is electronically stored. You might therefore say that all flash I/O is random, but in truth the principles of random I/O versus sequential I/O are disk concepts so don’t really apply. And since the latency of flash is sub-millisecond, it should be possible to see that, even for a single-threaded process, a much larger number of IOPS is possible. When we start considering concurrent operations things get even more interesting.


NVMe disks:


NVMe SSD drives

NVMe SSD drives can deliver higher data transfer rates than either Serial AT Attachment (SATA) or SAS interfaces because they offer more available channels (lanes) through which data can flow. NVMe uses the PCIe bus and an optimized protocol that reduces overhead, further boosting performance over older technologies. Optimizations in the NVMe protocol include deep I/O queues and support for 64 K commands per queue. 

SSDs based on NVMe are more responsive for heavy workloads and less prone to performance degradation due to large numbers of I/O requests. 

These factors are often an important benefit for database workloads where both high Input Output Operations per Second (IOPS) requirements and increased bandwidth demand scenarios need consistent low latency response.

Organizations are increasingly choosing NVMe SSD as the most cost-effective storage for high-performance database platforms and similar business critical applications. Consider the benefits of using NVMe SSDs for production databases where enterprise performance is critical.


Protecting Data on NVMe Devices

Some compute instance shapes  in Oracle Cloud Infrastructure include locally attached NVMe devices. These devices provide extremely low latency, high performance block storage that is ideal for big data, OLTP, and any other workload that can benefit from high-performance block storage.

Caution

NVMe devices are not protected in any way; they are individual devices locally installed on your instance. Oracle Cloud Infrastructure does not take images, back up, or use RAID or any other methods to protect the data on NVMe devices. It is your responsibility to protect and manage the durability of the data on these devices.

Oracle Cloud Infrastructure offers high-performance remote block (iSCSI) LUNs that are redundant and can be backed up using an API call.


Ceph storages


What is Ceph Distributed Storage Cluster?

Ceph is a widely used open source storage platform and it provides high performance, reliability, and scalability. The Ceph free distributed storage system provides an interface for object, block, and file-level storage.

In this tutorial, you will install and build a Ceph cluster on Oracle Linux 7.X with the following components:

  • Ceph OSDs (ceph-osd) — Handle the data store, data replication, and recovery. A Ceph cluster needs at least two Ceph OSD servers, which here are based on Oracle Linux.
  • Ceph Monitor (ceph-mon) — Monitors the cluster state, OSD map and CRUSH map.
  • Ceph Meta Data Server (ceph-mds) — This is needed to use Ceph as a File System.


Ceph is a distributed, network-based storage system, so its behavior will depend on your network infrastructure. From the perspective of an Oracle database it should perform similarly to NFS. 


However, don't expect high performance from Ethernet and TCP/IP, which were simply not designed to replace local storage solutions and will always introduce higher latency. 

It is certainly not something I would consider for database redo logs and system tablespaces, and it can slow down performance quite drastically. 


Can I use Ceph for an Oracle database?

In test environments and for small databases it may be considered, but overall this technology is not suitable for mid-level and enterprise environments.

A DBA needs stable disks with low delay, maximum throughput, and minimum latency, whereas Ceph mainly provides availability and scalability of disks, and it has failed many performance tests in real environments running Oracle Database or other RDBMSs.

Again, it depends on what you would like to accomplish. I doubt you will find someone who tried and can give you a hands-on experience. 


People who invest into Oracle Database are typically using it in a production or enterprise environment and are very likely more interested to eliminate the odds and not use it in a non-supported configuration.


It can be acceptable, for example, for transferring large chunks of data, such as RMAN backups, when the database size is small to medium.


Does RAC Database Support CEPH Storage?
 

CEPH has not been certified and we are unlikely to certify new cluster file systems at this point in time.

Per current policies, using the CEPH Block Device Service may be permitted and hence, would be supported when used for data files in an Oracle RAC system.
The storage and server vendors of the storage and the servers, on which the CEPH Block Device Service is hosted, must support the Block Device represented to the RAC system as a shared accessible block device in this case.

CEPH Object Storage is not suitable for RAC and hence will not be supported.


Oracle Support will address RAC database specific issues only on a best effort basis. Support will be unable to get development help as this is not certified. Any issues with the CEPH cluster or issues stemming from CEPH should be taken care of by the respective vendor.


Oracle Redo logfile block size 512byte vs 4kb


Business-critical Oracle databases would benefit from using a 4k redo block size to drive I/O-intensive workloads, as opposed to using a 512-byte redo block size, for all the obvious reasons. Yes, it comes at the cost of redo wastage, but there are pros and cons to everything.

This section explains the advantages of using Oracle redo logs with a 4k block size (the default is 512 bytes, with block size choices of 512 bytes, 1k, and 4k) on VMware platforms, together with the current challenges and roadmap guidance.


Oracle Redo Log blocksize – Implications and Tradeoffs

 

The advent of the new Advanced Format drives brought larger capacities and increased the sector size (i.e., the minimum possible I/O size) of these drives from 512 bytes to the more efficient value of 4096 bytes (4KB).

By default, database online redo log files have a block size of 512 bytes. From Oracle Database version 11.2 this can be changed with the BLOCKSIZE clause to values of 512, 1024 or 4096. 

Linux OS requires kernel version 2.6.32 or later for 4K device support.
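
To see which block size your existing online redo logs currently use, the BLOCKSIZE column of V$LOG (available since 11.2) can be queried:

-- Current redo log groups and their block size
SELECT group#, thread#, bytes / 1024 / 1024 AS size_mb, blocksize, status
  FROM v$log
 ORDER BY group#;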


I/O requests sent to the redo log files are in increments of the redo block size. This is the blocking factor Oracle uses within redo log files and with the default redo log block set to 512bytes, more redo blocks would have to be written for any given workload as opposed to the same workload using a larger redo log blocksize.

However, with a block size of 4K, there is increased redo wastage. In fact, the amount of redo wastage in 4K blocks versus 512B blocks is significant.

To avoid the additional redo wastage, if you are using emulation-mode disks—4K sector size disk drives that emulate a 512B sector size at the disk interface—you can override the default 4K block size for redo logs by specifying a 512B block size or, for some platforms, a 1K block size. However, you will incur a significant performance degradation when a redo log write is not aligned with the beginning of the 4K physical sector. Because seven out of eight 512B slots in a 4K physical sector are not aligned, performance degradation typically does occur.

Thus, you must evaluate the trade-off between performance and disk wastage when planning the redo log block size on 4K sector size emulation-mode disks.

 

Oracle Redo Log with 4k blocksize


Beginning with Oracle Database version 11.2, redo log blocksize can be changed with the BLOCKSIZE clause to values of 512, 1024 or 4096. For example:


SQL> ALTER DATABASE ADD LOGFILE GROUP 5 SIZE 100M BLOCKSIZE 4096;


For a 512e emulation device, the logical sector size is 512 bytes, and the physical sector size is 4096 bytes (4K). 

Hence, the above command will fail unless the sector size check is overridden by setting the underscore parameter _disk_sector_size_override to true in the database instance (do not set this parameter in the ASM instance) using the following command:

ALTER SYSTEM SET "_disk_sector_size_override" = TRUE;

At this point, online redo logs can be created with the desired 4k block size.

 

VMware vSphere and vSAN support for 512e and 4K Native (4Kn) drives

 

vSphere/vSAN 6.0 and earlier versions have not been designed to use 4Kn/512e direct attached disk drives. 512e drives are supported only in version 6.5, 6.7, 7.x and later.

vSphere 6.7 released 4kn local storage support. vSphere 6.5/6.7/7.x and vSAN 6.5/6.7/7.x and later support 512e drives as direct attached drives.

Currently, vSphere and vSAN will expose 512n (both logical and physical sector sizes are 512 bytes) to the guest OS as part of this support as per KB 2091600.


Oracle Redo Log 4k blocksize on VMware vSphere / vSAN

 

As stated above, currently, vSphere and vSAN will expose 512n (both logical and physical sector sizes are 512 bytes) to the guest OS as part of this support.

VMware Engineering is aware of this and end to end 4k sector support is scheduled on the roadmap, stay tuned.

 

Oracle Redo Log 4k blocksize on Physical Mode RDMs mapped to direct attached 512e drives.

 

VMware recommends using VMDK (s) for provisioning storage for ALL Oracle environments. 

In some cases, RDM (Physical / Virtual) can also be used as VM storage for Oracle databases.

Given the current restriction of using redo logs with 4k blocksize on vmdk’s , we ran our tests using 512e RDM’s to showcase the performance benefit of using redo logs with 4k blocksize compared to 512 byte blocksize.

You can see the complete test results at:

https://blogs.vmware.com/apps/2022/01/oracle-redo-blocksize-512b-4k.html


Oracle redo log considerations

With all-flash NVMe SSD storage systems, the performance of database workloads increased significantly when compared to storage systems using spinning disks (hard disk drives, or HDDs). As a result, the sizing recommendations for redo log files have also changed to avoid unnecessary database delays. Such delays can occur if the redo log files are not sized correctly. 
For example, if the redo log files are sized too small, Oracle will write updates to each file, fill it entirely, then move to the next file. When the last file has been filled, Oracle will return to the first. If the archive log process has not finished archiving the first file, Oracle will have to wait before writing to the log, creating a delay in the workload.
To avoid unnecessary database delays in high-performance databases, it is recommended to size each redo log file (member) at least 10 GB and to have at least four files for each database instance (redo log thread). Since each database has a different workload profile, the DBA should monitor the rate of log switches. The database should not need to switch redo logs more than a few times per hour. There should also be enough logs to allow the archive process to complete its work before the database needs to reuse that log file.
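
One simple way to monitor the log switch rate is to count switches per hour from V$LOG_HISTORY, for example:

-- Redo log switches per hour over the last 24 hours
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*) AS log_switches
  FROM v$log_history
 WHERE first_time > SYSDATE - 1
 GROUP BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
 ORDER BY hour;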


4 KB redo log sector size


Oracle 11gR2 introduced the ability to change the redo log block size from its default 512 bytes to 4 KB. One reason for this change was certain drives used 4 KB as their native block size (for example, SSD drives). Another reason was to reduce the metadata overhead that was associated with high density drives by increasing the block size from its legacy 512 bytes to 4 KB.

Note:

Although changing the redo log block size from 512 bytes to 4k is possible, it needs to be tested carefully in a test environment; I have never seen a significant, impressive performance improvement from this change.


Regards

Alireza Kamrani

Senior RDBMS Consultant 


LinkedIn:

https://www.linkedin.com/groups/8151826

