Monday, January 22, 2024

The Fundamental Characteristics of Storage: Concepts for DBAs.

Conceptual challenges between DBAs, storage technicians, and developers.


Visit my LinkedIn group to find more:

https://www.linkedin.com/groups/8151826


Storage for DBAs: 

As a rule of thumb, pretty much any storage system can be characterised by three fundamental properties: latency, IOPS, and bandwidth (throughput).


Latency is a measurement of delay in a system; so in the case of storage it is the time taken to respond to an I/O request. It’s a term which is frequently misused – more on this later – but when found in the context of a storage system’s data sheet it often means the average latency of a single I/O. 

Latency figures for disk are usually measured in milliseconds; for flash a more common unit of measurement would be microseconds.


Latency describes the time required for a subsystem to process a single data request or transaction. With flash storage, read latency includes the time it takes to traverse the various network connections. Once that is done, latency also includes the time it takes to find the required data blocks and prepare to transfer the data.


IOPS (which stands for I/Os Per Second) represents the number of individual I/O operations taking place in a second. 

IOPS figures can be very useful, but only when you know a little about the nature of the I/O, such as its size and randomness.

If you look at the data sheet for a storage product you will usually see a Max IOPS figure somewhere, with a footnote indicating the I/O size and nature.


Bandwidth (also known as throughput) is a measure of data volume over time – in other words, the amount of data that can be pushed or pulled through a system per second.

Throughput figures are therefore usually given in units of MB/sec or GB/sec.

These three properties are all related.

It’s worth understanding how and why, because you will invariably need all three in the real world. It’s no good buying a storage system which can deliver massive numbers of IOPS, for example, if the latency will be terrible as a result.


The throughput is simply a product of the number of IOPS and the I/O size:


Throughput   =   IOPS   x   I/O size


So 2,048 IOPS with an 8k block size is (2,048 x 8k) = 16,384 KB/sec, which is a throughput of 16 MB/sec.
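To make the arithmetic concrete, here is a quick sanity check (an illustrative Python snippet; the function name is my own):

```python
def throughput_mb_per_sec(iops: int, io_size_kb: int) -> float:
    """Return throughput in MB/sec, given IOPS and I/O size in KB."""
    return iops * io_size_kb / 1024  # 1024 KB per MB

print(throughput_mb_per_sec(2048, 8))  # -> 16.0 MB/sec, matching the example
```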


The latency is also related, although not in such a strict mathematical sense. Simply put, the latency of a storage system will rise as it gets busier. We can measure how busy the system is by looking at either the IOPS or Throughput figures, but throughput unnecessarily introduces the variable of block size so let’s stick with IOPS. 


We can therefore say that the latency is proportional to the IOPS:


Latency   ∝   IOPS

The proportional (∝) symbol suggests a direct relationship, but actually the latency of a system usually increases exponentially as it nears saturation point.


IOPS is not the same thing as latency, but the two are related.


IOPS is the number of operations per second; 

as the number of IOPS requested from the device increases, the latency will also increase.


Imagine your disk can sustain 2 IOPS.

If you send 2 requests simultaneously, they will take about 1 second to complete.

If you send 20 requests near-simultaneously, they will take 10 seconds to complete, so the last requests will experience a latency of about 10 seconds, because they are waiting for the others to finish.
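A minimal sketch of this toy example in Python, assuming all requests arrive at once and the device serves them one at a time at a fixed rate:

```python
def completion_times(num_requests: int, iops: float) -> list:
    """Completion time (seconds) of each request if all arrive at time zero
    and the device serves them one at a time at a fixed rate."""
    service_time = 1.0 / iops           # 2 IOPS -> 0.5 s per I/O
    return [(i + 1) * service_time for i in range(num_requests)]

for n in (2, 20):
    times = completion_times(n, iops=2)
    print(f"{n} requests: last finishes at {times[-1]:.1f}s, "
          f"average latency {sum(times) / n:.2f}s")
# 2 requests: last finishes at 1.0s, average latency 0.75s
# 20 requests: last finishes at 10.0s, average latency 5.25s
```

Note how the average latency grows much faster than the request count once the device is saturated: that is the "wall" in the graph below.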


We can see this if we plot a graph of latency versus IOPS – a common way of visualising performance characteristics in the storage world. The graph shows the SPC benchmark results for an HP 3PAR storage disk system.


See how the response time seems to hit a wall of maximum IOPS? 


Beyond this point, latency increases rapidly without the number of IOPS increasing. 


Even though there are only six data points on the graph it’s pretty easy to visualise where the limit of performance for this particular system is.

As mentioned earlier, the term latency is frequently misused.

The SPC performance graph is actually plotting response time and not latency. 

These two terms, along with variations of the phrase I/O wait time, are often used interchangeably when they perhaps should not be.


According to Wikipedia, “Latency is a measure of time delay experienced in a system”.


If your database needs, for example, to read a block from disk then that action requires a certain amount of time. 


And if the database wants to read more than one block, and those blocks are not stored sequentially (i.e. the related blocks are fragmented), this time grows larger and larger.

Also consider concurrent sessions that all need to read a shared block, plus the database locking mechanisms layered on top...


The time taken for the action to complete is the response time.


If your user session is subsequently waiting for that I/O before it can continue (a blocking wait), then it experiences I/O wait time, which Oracle will attribute to one of the regular wait events such as db file sequential read.


The latency is the amount of time taken until the device is ready to start reading the block, i.e. not including the time taken to complete the read.

In the disk world this includes things like the seek time (moving the actuator arm to the correct track) and the rotational latency (spinning the platter to the correct sector), both of which are mechanical processes (and therefore slow).
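As a rough back-of-the-envelope illustration, assuming typical figures for a 7200 RPM drive (around 9 ms average seek time; these numbers are assumptions, not from any particular data sheet):

```python
RPM = 7200                           # assumed spindle speed
avg_seek_ms = 9.0                    # assumed average seek time

rotation_ms = 60_000 / RPM           # one full revolution ~ 8.33 ms
avg_rotational_ms = rotation_ms / 2  # on average, wait half a revolution

print(f"mechanical latency ≈ {avg_seek_ms + avg_rotational_ms:.1f} ms per random I/O")
# ≈ 13.2 ms -- milliseconds for disk, versus microseconds for flash
```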


You can review my previous post in the Database Box LinkedIn group about disk types and I/O concepts if you need more information.


When I first began working with a storage vendor, I found the intricacies of the terminology confusing, and I suppose it's no different for people entering the database world for the first time.

I began to realise that there is often a language barrier in IT, as people with different technical specialties use different vocabularies to describe the same underlying phenomena.


For example, a storage person might say that the array is experiencing “high latency”, while the database admin says that there is “high User I/O wait time”.


The OS admin might look at the server statistics and comment on the “high levels of IOWAIT”, yet the poor user trying to use the application is only able to describe it as “slow”.


Therefore, it’s the application and its users that matter most, since without them there would be no need for the infrastructure. 

So with that in mind, let’s finish off this post by attempting to translate the terms above into the language of applications.

Translating Storage Into Application

Earlier we defined the three fundamental characteristics of storage. Now let’s attempt to translate them into the language of applications:


Latency is about application acceleration. 

If you are looking to improve user experience, if you want screens on your ERP system to refresh quicker, if you want release notes to come out of the warehouse printer faster…then latency is critical. 


It is extremely important for highly transactional (OLTP) applications which require fast response times. 

Examples include call centre systems, CRM, trading systems, e-Business, core banking systems, etc., where real-time data is critical and the high latency of spinning disk has a direct negative impact on revenue.


IOPS is for application scalability. 

IOPS are required for scaling applications and increasing the workload, which most commonly means one of three things: 

  • in the OLTP space, increasing the number of concurrent users; 
  • in the data warehouse space, increasing the parallelism of batch processes; 
  • in the consolidation / virtualisation space, increasing the number of database instances located on a single physical platform (i.e. the density). 


This last example is becoming ever more important as more and more enterprises consolidate their database estates to save on operational and licensing costs.


Bandwidth / Throughput is effectively the amount of data you can push or pull through your system. 

Obviously that makes it a critical requirement for batch jobs and data warehouse-type workloads, where massive amounts of data need to be processed in order to aggregate, report, or identify trends. 

Increased bandwidth allows for batch processes to complete in reduced amounts of time or for Extract Transform Load (ETL) jobs to run faster. 

And every DBA that ever lived at some point had to deal with a batch process that was taking longer and longer until it started to overrun the window in which it was designed to fit…


Finally, 

The above “explanations” are just a rough guide to understanding the essential concepts.

The real message is to remember that I/O is driven by applications. 

Data sheets tell you the maximum performance of a product in ideal conditions, but the reality is that your applications are unique to your organization so only you will know what they need. 

If you can understand what your I/O patterns look like using the three terms above, you are halfway to knowing what the best storage solution is for you.


☑️My experiences:

When working as a DBA alongside a storage support team, you may face disk slowness while the IOPS graphs provided by the storage administrator look perfectly healthy, with no visible delay or pressure on the disks. Yet the database is still slow at times, especially on the redo logs. Because redo vectors are written sequentially, this type of database object needs a separate, fully isolated set of disks with high speed and minimal latency. In this situation your problem is not the availability of high-IOPS disks; what you actually need is sustained sequential throughput for a heavy stream of small redo writes. You may need to convince the storage administrator to replace the high-IOPS configuration with one offering more throughput and minimal latency, for example by adding disks (spindles) to the RAID group: use 4 disks in RAID 10 instead of 2.
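A rough sizing sketch of why adding spindles helps, assuming (hypothetically) that one spindle sustains about 150 MB/s sequentially and that RAID 10 write throughput scales with the number of mirrored pairs:

```python
PER_DISK_SEQ_MB_S = 150            # assumed sequential MB/s for one spindle

def raid10_seq_write_mb_s(disks: int) -> float:
    """Approximate sequential write throughput of a RAID 10 group:
    each mirrored pair writes the data once, and pairs are striped."""
    return (disks / 2) * PER_DISK_SEQ_MB_S

print(raid10_seq_write_mb_s(2))    # 150.0 MB/s with one mirrored pair
print(raid10_seq_write_mb_s(4))    # 300.0 MB/s with two pairs, roughly double
```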


IOPS vs. Throughput

To summarize the difference between throughput and IOPS: IOPS is a count of the read/write operations per second, while throughput is the actual measurement of read/write bits per second transferred over a network. To visualize this in a typing context, IOPS would be similar to words per second (where words can be different lengths, but only whole words are being measured), and throughput would be characters per second (where characters are the indivisible component). You can also think about this contrast as IOPS vs. read/write speed.


IOPS =Count of read/write operations per second


Throughput = Count of read/write bits per second (bps)
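Note the unit trap here: storage data sheets usually quote throughput in bytes (MB/sec), while network figures use bits (Mbit/sec). A small illustrative helper (the function name is my own) shows both for the earlier example:

```python
def throughput_units(iops: int, io_size_kb: int) -> tuple:
    """Return (MB/sec, Mbit/sec) for a given IOPS rate and I/O size."""
    mb_per_sec = iops * io_size_kb / 1024
    mbit_per_sec = mb_per_sec * 8   # 8 bits per byte
    return mb_per_sec, mbit_per_sec

mb, mbit = throughput_units(2048, 8)
print(f"{mb:.0f} MB/sec = {mbit:.0f} Mbit/sec")  # 16 MB/sec = 128 Mbit/sec
```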


What’s The Best Storage Performance Metric?

This can be a tricky question to answer, because storage requirements vary between environments and can even differ per application. A high-performing storage environment can mean many different things. One thing we can confidently say (based on the content of this article) is that the difference between IOPS and throughput is frequently asked about.

High throughput or high IOPS sometimes indicate success, but they may not tell the full story. That is why we recommend using latency in addition to IOPS and throughput.


🟥 I wrote three sequential posts about hardware concepts and challenges for DBAs in LinkedIn (the Database Box group). As DBAs we are always collaborating with the storage team and, concurrently, with application developers, so we need deep knowledge of hardware concepts to achieve better performance and to discover issues in every layer; that is part of a DBA's responsibilities.


Best regards,

Alireza Kamrani 

Senior RDBMS Consultant.

Thursday, January 18, 2024

Improve database performance with connection pooling and load balancing



We tend to rely on caching solutions to improve database performance. Caching frequently-accessed queries in memory or via a database can optimize write/read performance and reduce network latency, especially for heavy-workload applications, such as gaming services and Q&A portals. But you can further improve performance by pooling users' connections to a database.

Client users need to create a connection to a web service before they can perform CRUD operations. Most web services are backed by relational database servers such as Postgres or MySQL. With PostgreSQL, each new connection can take up to 1.3MB in memory. In a production environment where we expect to receive thousands or millions of concurrent connections to the backend service, this can quickly exceed your memory resources (or if you have a scalable cloud, it can get very expensive very quickly).

Each time a client accesses a backend service, it consumes OS resources to create, maintain, and close connections to the datastore. This creates a large amount of overhead, causing database performance to deteriorate.

Consumers of your service expect fast response times. If that performance deteriorates, it can lead to poor user experiences, revenue losses, and even unscheduled downtime. If you expose your backend service as an API, repeated slowdowns and failures could cause cascading problems and lose you customers.

Instead of opening and closing connections for every request, connection pooling uses a cache of database connections that can be reused when future requests to the database are required. It lets your database scale effectively as the data stored there and the number of clients accessing it grow. Traffic is never constant, so pooling can better manage traffic peaks without causing outages. Your production database shouldn’t be your bottleneck.
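As a minimal sketch of the idea using PostgreSQL and psycopg2's built-in pool (the connection parameters below are placeholders):

```python
from psycopg2 import pool  # pip install psycopg2-binary

# Keep between 2 and 10 open connections, reused across requests
# (host/dbname/user/password are placeholder credentials).
db_pool = pool.SimpleConnectionPool(
    2, 10,
    host="localhost", dbname="appdb", user="app", password="secret",
)

conn = db_pool.getconn()            # borrow a pooled connection
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
finally:
    db_pool.putconn(conn)           # return it to the pool instead of closing

db_pool.closeall()                  # close all connections at shutdown
```

The key point is the getconn()/putconn() pair: the TCP connection and backend process survive across requests instead of being created and torn down each time.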


Oracle load balancing & Pooling solutions:

Oracle Real Application Clusters (RAC)

When using RAC you can configure SCAN IPs to achieve load balancing. But because of some applications' behavior and existing limitations, you cannot always use SCAN IPs, so you have to create services to minimize traffic on the interconnect; in this case you lose the load-balancing feature of RAC and keep only its scalability and high availability.


1- Oracle Traffic Director (OTD)

Oracle Traffic Director is a fast, reliable, and scalable layer-7 software load balancer. You can set up Oracle Traffic Director to serve as the reliable entry point for all HTTP, HTTPS and TCP traffic to application servers and web servers in the back end. Depending on the needs of your IT environment, you can configure Oracle Traffic Director to apply multiple, complex rules when distributing requests to the back-end servers and when forwarding responses to clients.

Oracle Traffic Director distributes the requests that it receives from clients to servers in the back end based on the specified load-balancing method, routes the requests based on specified rules, caches frequently accessed data, prioritizes traffic, and controls the quality of service.


On engineered systems platforms, you can set up pairs of Oracle Traffic Director instances and leverage its built-in high availability capability to set up either active-passive or active-active failover. As the volume of traffic to your network grows, you can easily scale the environment by reconfiguring Oracle Traffic Director with additional back-end servers to which it can route requests.

Oracle Traffic Director provides the following features:

  • Advanced methods for load distribution
    Configure Oracle Traffic Director to distribute client requests to servers in the back-end using one of these methods:
    • Round robin
    • Least connection count
    • Least response time
    • Weighted round robin
    • Weighted least connection count
  • Flexible routing and load control on back-end servers
    • Request-based routing
    • Content-based routing
    • Request rate acceleration
    • Connection limiting
  • Controlling the request load and quality of service
    • Request rate limiting
    • Quality of service tuning
  • Support for WebSocket connections
  • Integration with Oracle Fusion Middleware
  • Easy-to-use administration interfaces
  • Security
    Oracle Traffic Director enables and enhances security for your IT infrastructure in the following ways:
    • Reverse proxy
    • Support for TLS 1.0, 1.1, and 1.2
    • Web Application Firewall
    • HTTP Forward Proxy Support in Origin Server Pools
  • High availability
    Oracle Traffic Director provides high availability for your enterprise applications and services through the following mechanisms:
    • Health checks for the back end
    • Backup servers in the back end
    • Failover for load balancing
    • Dynamic reconfiguration
  • Monitoring statistics
    Administrators can monitor a wide range of statistics pertaining to the performance of Oracle Traffic Director instances through several methods: the administration console, the command-line interface, and a report in XML format.
  • High performance
    • SSL/TLS offloading
    • Content caching
    • HTTP compression

Oracle Traffic Director Terminology

An Oracle Traffic Director configuration is a collection of elements that define the run-time behavior of an Oracle Traffic Director instance. An Oracle Traffic Director configuration contains information about various elements of an Oracle Traffic Director instance such as listeners, origin servers, failover groups, and logs.


2- Oracle Connection Manager (CMAN)

CMAN is a useful piece of software that helps you manage connectivity easily.


Oracle Connection Manager (CMAN) is a multi-purpose database connection proxy server used in Oracle deployments. CMAN is available as part of the Oracle Database Enterprise Edition (EE) client package.

CMAN in Traffic Director Mode (TDM), or CMAN-TDM, brings in intelligence to manage incoming client connections through session multiplexing and dynamic load balancing that can optimize the usage of database resources.

Session multiplexing can be handled efficiently through the Proxy Resident Connection Pooling (PRCP) feature in CMAN-TDM. When PRCP is enabled on CMAN-TDM, connection pool(s) are created on the CMAN-TDM server that enable mid-tier clients to share connections to the database.

Oracle Database 21c and beyond supports the multi-tenant architecture model with multiple Pluggable Databases (PDBs) in a single Container Database (CDB). Until recently, PRCP pools could only be created per database service: if a PDB instance exposes multiple services, a separate PRCP pool is created for each of them.

Starting with Oracle EE 23c, CMAN-TDM also provides a per-PDB PRCP option. With this feature, a PRCP pool is created at the PDB level and shared across the multiple services associated with the same PDB.


Sample per-PDB PRCP architecture

To enable per-PDB PRCP, the user has to set the PRCP connection factor parameter (TDM_PERPDB_PRCP_CONNFACTOR) to a non-zero positive value in cman.ora (the configuration file for CMAN).

The per-PDB PRCP feature dynamically configures the PRCP pool, by adjusting the maximum size based on the Oracle Compute Unit (OCPU) — an internal parameter — and the PRCP connection factor parameter at the PDB level. CMAN-TDM periodically checks the OCPU value of each PDB and reconfigures the per-PDB PRCP pool size, if required.
The user can further tune the pool sizing by tweaking the PRCP connection factor to a higher or lower value.
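As a purely illustrative reading of that sizing rule: if the per-PDB maximum were simply the product of the OCPU value and the connection factor (an assumption based on the description above, not on Oracle documentation), the periodic resizing would look like this:

```python
def prcp_max_pool_size(ocpu: int, conn_factor: int) -> int:
    """Hypothetical per-PDB pool ceiling: OCPU x TDM_PERPDB_PRCP_CONNFACTOR.
    The exact formula is an assumption for illustration only."""
    return ocpu * conn_factor

# CMAN-TDM re-checks the OCPU value periodically; a PDB scaling from
# 2 to 8 OCPUs with a connection factor of 10 would resize like this:
for ocpu in (2, 4, 8):
    print(ocpu, "OCPU ->", prcp_max_pool_size(ocpu, conn_factor=10), "sessions")
```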

You can also view the per-PDB PRCP performance statistics by setting another cman.ora parameter, TDM_STATS_FREQUENCY, to a non-zero positive value. This adds the statistics to the newly introduced V$TDM_STATS dynamic view.

In essence, consolidating PRCP pools per PDB minimizes the fragmentation of session pools across multiple services and automatically sizes the session pools for the whole PDB in an effective manner, leading to better resource management on CMAN-TDM.

What’s more, you can monitor connection performance with the per-PDB PRCP statistics available in dynamic views. Pretty cool!


For advanced connection pooling and security enhancements you can use an F5 appliance, which provides hardware load balancing with many security options.


For web applications you can use WebLogic, which offers HA and scalability features, and you can also use Oracle Traffic Director (OTD).


Conclusion

Database performance can be improved beyond connection pooling: replication, load balancing, and in-memory caching all contribute to efficient database performance.

If a web service is designed to make a lot of read and write queries to a database, then you can place multiple instances of a Postgres database behind a load balancer such as Pgpool-II to take care of write queries from clients, while in-memory caching can be used to optimize read queries.

Although Pgpool-II is able to function as a load balancer and connection pooler, PgBouncer is the preferred middleware solution for connection pooling because it is easy to set up, not too difficult to manage, and primarily serves as a connection pooler without any other functions.


When using Oracle you have these solutions:

Set up Oracle RAC with SCAN IPs configured, and use newer features such as TAF, FAN, and especially TAC (Transparent Application Continuity).

Another integrated solution is using CMAN.

If these options cannot meet your load-balancing requirements, then use F5 appliances, which give you hardware load balancing and security options.

For Oracle environments where the challenge lies in web applications, use WebLogic or Oracle Traffic Director to achieve connection pooling and load balancing.


Best Regards,


Alireza Kamrani 

Senior RDBMS Consultant.
