About Quarantined SQL Plans
You can configure Oracle Database to automatically quarantine
the plans for SQL statements terminated by Oracle Database Resource Manager
(the Resource Manager) for exceeding resource limits.
How SQL Quarantine Works
The Resource Manager can set a maximum estimated execution
time for a SQL statement, for example, 20 minutes. If a statement execution
exceeds this limit, then the Resource Manager terminates the statement.
However, the statement may run repeatedly before being terminated, wasting 20
minutes of resources each time it is executed.
The SQL Quarantine infrastructure (SQL Quarantine) solves
the problem of repeatedly wasting resources. If a statement exceeds the
specified resource limit, then the Resource Manager terminates the execution
and “quarantines” the plan. To quarantine the plan means to put it on a
blocklist of plans that the database will not execute for this statement. Note
that the plan for a terminated statement is quarantined, not
the statement itself.
The query in our example runs for 20 minutes only once, and
then never again—unless the resource limit increases or the plan changes. If
the limit is increased to 25 minutes, then the Resource Manager permits the
statement to run again with the quarantined plan. If the statement runs for 23
minutes, which is below the new threshold, then the Resource Manager removes
the plan from quarantine. If the statement runs for 26 minutes, which is above
the new threshold, then the plan remains quarantined unless the limit is
increased.
SQL Quarantine User Interface
The DBMS_SQLQ PL/SQL package enables you to
manually create quarantine configurations for execution plans by specifying
thresholds for consuming system resources. For example, you can enable a
quarantine threshold of 10 seconds for CPU time or drop the threshold for I/O
requests. You can also immediately save the quarantine information to disk or
drop configurations.
To enable SQL Quarantine to create configurations
automatically after the Resource Manager terminates a query, set
the OPTIMIZER_CAPTURE_SQL_QUARANTINE initialization parameter
to true (the default is false). To disable the use of existing
SQL Quarantine configurations,
set OPTIMIZER_USE_SQL_QUARANTINE to false (the default
is true).
The V$SQL.SQL_QUARANTINE column indicates whether
a plan was quarantined for a statement after the Resource Manager canceled
execution.
The AVOIDED_EXECUTIONS column
indicates how often Oracle Database prevented the statement from running with
the quarantined plan.
About
the Expression Statistics Store (ESS)
The Expression Statistics Store (ESS) is a
repository maintained by the optimizer to store statistics about expression
evaluation.
When an IM column store is enabled, the database leverages
the ESS for its In-Memory Expressions (IM expressions) feature. However, the
ESS is independent of the IM column store. The ESS is a permanent component of
the database and cannot be disabled.
The database uses the ESS to determine whether an expression
is “hot” (frequently accessed), and thus a candidate for an IM expression.
During a hard parse of a query, the ESS looks for active expressions in
the SELECT list, WHERE clause, GROUP BY clause,
and so on.
For each segment, the ESS maintains expression statistics
such as the following:
- Frequency of execution
- Cost of evaluation
- Timestamp evaluation
The optimizer assigns each expression a weighted score based
on cost and the number of times it was evaluated. The values are approximate
rather than exact. More active expressions have higher scores. The ESS
maintains an internal list of the most frequently accessed expressions.
The ESS resides in the SGA and also persists on disk. The
database saves the statistics to disk every 15 minutes, or immediately using
the DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO procedure. The ESS
statistics are visible in the DBA_EXPRESSION_STATISTICS view.
SQL Quarantine
SQL Quarantine in Oracle Database 23c (introduce in 19c
above) is a feature that automatically prevents runaway or resource-intensive
SQL statements from consuming excessive system resources by quarantining their
execution plans. When a SQL statement exceeds predefined resource limits
(e.g., CPU time, elapsed time) as set in Oracle Resource Manager, its execution
plan can be automatically quarantined.
- Automatic Creation:
If OPTIMIZER_CAPTURE_SQL_QUARANTINE is enabled,
the database automatically creates a SQL Quarantine configuration for the
execution plan of a SQL statement that is terminated by Resource Manager due to
exceeding resource limits.
- Preventing Future Executions:
Once an execution plan is quarantined, subsequent attempts
to execute the same SQL statement using that specific quarantined plan will be
prevented or terminated, depending on the configuration. This helps to
maintain system stability and prevent performance degradation.
- Configuration
with DBMS_SQLQ: such as DBMS_SQLQ.CREATE_QUARANTINE_BY_SQL_ID
The DBMS_SQLQ package provides an interface for
manually configuring and managing SQL Quarantine. This allows for
fine-grained control over which SQL plans are quarantined and under what
conditions. You can specify thresholds for resources like CPU time or
elapsed time that, if exceeded, will trigger the quarantine.
- Integration with Resource
Manager:
SQL Quarantine works in conjunction with Oracle Resource
Manager. Resource Manager defines the consumer groups and resource limits,
while SQL Quarantine enforces these limits by preventing problematic SQL plans
from running.
- Monitoring:
The V$SQL view includes
a QUARANTINED column, indicating if a SQL statement's plan has been
quarantined. The AVOIDED_EXECUTION column tracks the number of
times execution attempts were made after a plan was quarantined.
- Scope:
The quarantine applies to a specific execution plan, not the
SQL statement itself. If the SQL statement can be executed with a
different, non-quarantined plan, it will be allowed to run.
- Behavior with Increased Limits:
If the resource limits in the Resource Manager plan are
increased, a previously quarantined SQL statement might be allowed to run again
with its quarantined plan. If it then runs within the new limits, the
quarantine can be removed.
Managing Transactions
Managing
transactions include tasks such as setting transaction priority and
automatically rolling back transactions.
- Priority
Transactions
The Oracle database allows transactions to be automatically rolled back and includes parameters to control this behavior. - Automatic
Transaction Quarantine
Oracle Database quarantines, or isolates, the recovery of transactions that could potentially cause a system crash. These transactions must be manually resolved by the DBA so that row locks are released.
Priority Transactions
The Oracle
database allows transactions to be automatically rolled back and includes
parameters to control this behavior.
A row lock
is a lock on a single row of a table. A transaction acquires a row lock for
each row modified by one of the following
statements: INSERT, UPDATE, DELETE, MERGE, and SELECT
... FOR UPDATE. The row lock exists until the transaction commits or rolls
back. Transactions can hold row locks for a long duration in certain cases. For
example, the application modifies some rows but doesn't commit or terminate the
transaction because of an exception in the application. Traditionally, when a
transaction is blocked on a rowlock by another transaction for a long time, it
required the database administrator to manually terminate the blocking
transaction by using the ALTER SYSTEM KILL SESSION command.
Starting
with Oracle Database 26ai, the database provides parameters to control when and
which transactions holding rowlocks can be automatically rolled back. Oracle
database rolls back the transaction but the session stays alive. The
application must acknowledge the automatic rollback of the transaction by
issuing a ROLLBACK SQL statement.
Applications
can specify the priority of their transactions. If a low priority transaction
blocks a high priority transaction on rowlocks, Oracle database will
automatically roll back the low priority transaction to let the high priority
transaction(s) progress. The
database
administrator can configure the time after which the low priority transaction
is rolled back.
Automatic Transaction Quarantine
Oracle
Database quarantines, or isolates, the recovery of transactions that could
potentially cause a system crash. These transactions must be manually resolved
by the DBA so that row locks are released.
About Redo
Application
Database
buffers in the buffer cache in the SGA are written to disk only when necessary,
using a least-recently-used (LRU) algorithm. Because of the way that the
database writer process uses this algorithm to write database buffers to
datafiles, datafiles may contain some data blocks modified by uncommitted
transactions and some data blocks missing changes from committed transactions.
Crash Recovery and Instance Recovery
Crash
recovery is used to recover from a failure either when a single-instance
database crashes or all instances of an Oracle Real Application Clusters
database crashes. Instance recovery refers to the case where a surviving
instance recovers a failed instance in an Oracle Real Application Clusters
database.
The goal of
crash and instance recovery is to restore the data block changes located in the
cache of the dead instance and to close the redo thread that was left open.
Instance and crash recovery use only online redo log files and current online
datafiles.
Two
potential problems can result if an instance failure occurs:
- Data blocks modified by a
transaction might not be written to the datafiles at commit time and may
only appear in the redo log. Therefore, the redo log contains changes that
must be reapplied to the database during recovery.
- After the roll forward phase, the datafiles may contain changes that had not been committed at the time of the failure. These uncommitted changes must be rolled back to ensure transactional consistency. These changes were either saved to the datafiles before the failure or introduced during the roll forward phase.
To solve
this dilemma, two separate steps are generally used by Oracle for a successful
recovery of a system failure: rolling forward with the redo log (cache
recovery) and rolling back with the rollback or undo segments (transaction
recovery).
Cache Recovery
The online
redo log is a set of operating system files that record all changes made to any
database buffer, including data, index, and rollback segments, whether the
changes are committed or uncommitted. All changes to Oracle blocks are recorded
in the online log.
The first
step of recovery from an instance or disk failure is called cache recovery or
rolling forward and involves reapplying all of the changes recorded in the redo
log to the datafiles. Because rollback data is also recorded in the redo log,
rolling forward also regenerates the corresponding undo segments.
Rolling
forward proceeds through as many redo log files as necessary to bring the
database forward in time. Rolling forward usually includes online redo log
files (instance recovery or media recovery) and may include archived redo log
files (media recovery only).
After
rolling forward, the data blocks contain all committed changes. They may also
contain uncommitted changes that were either saved to the datafiles before the
failure or were recorded in the redo log and introduced during cache recovery.
Transaction Recovery
Undo
tablespaces (in automatic undo management mode) contain undo segments that
record the before-image of changes to the database. In database recovery, the
undo blocks inside the undo segments roll back the effects of uncommitted
transactions previously applied by the rolling forward phase.
After the roll forward, any changes that were not committed must be undone. Oracle applies undo blocks to roll back uncommitted changes in data blocks that were either written before the crash or introduced by redo application during cache recovery. This process of rolling back uncommitted transactions in the database is called transaction recovery.
The
following figure illustrates rolling forward and rolling back, the two steps
necessary to recover from any type of system failure.
Failure During Transaction Recovery
Transaction recovery can fail due to the following reasons:
- Physical
data corruption of database blocks (ORA-01578, ORA-28304)
- Logical
data corruption (ORA-00600)
- Memory
corruption (ORA-00602, ORA-07445)
- State
Corruptions (ORA-00600)
A failure during transaction recovery can be irrecoverable
to the entire database instance and bring down the entire container database
(CDB) including its pluggable databases. Inability to recover all the
transactions in the system leads to rowlocks being held by unrecovered
transactions for longer. This severely impacts critical business operations.
Starting with Oracle Database 26ai, transactions that fail
to recover are quarantined and left un-recovered until the DBA can resolve the
issue. This increases the availability of the database. The Database Developer
is notified about the quarantined transaction and must take immediate action so
that the row locks held by quarantined transactions can be released.
Transaction quarantines are maintained in a persistent data
dictionary table inside the database. Therefore, you can manage quarantines
from any RAC instance in the database.
When a DML operation tries to access rows locked by a quarantined transaction error ORA-60451 will be raised as the DML operation cannot be executed while the rows are still locked.
Quarantined Transaction and Replication
Since Oracle Data Guard uses logical replication, quarantine
metadata is not replicated to the standby server when using Oracle Data
Guard. Therefore, contents of transaction quarantine views, such
as DBA_QUARANTINED_TRANSACTIONS, on the standby server may be different
than the entries on the primary server.
When running with Active Data Guard (ADG), the replication
is physical which means that for the transaction quarantine feature, both the
dead transaction and the catalog representation of the quarantine will be
replicated to the standby database.
Monitoring Quarantined Transactions
Alerts and data dictionary views warn the database developer
of quarantined transactions.
Oracle Database warns DBAs of quarantined transactions in
several ways, which include:
- ALERT_QUE -
the transaction quarantine alert is sent to the persistent alert
queue SYS.ALERT_QUE. This alert is automatically displayed in the
data dictionary
views DBA_OUTSTANDING_ALERTS and DBA_ALERT_HISTORY, as well
as Enterprise Manager Cloud Control and the AWR report.
- Attention
log - introduced in Oracle 21c, the attention log contains
information about critical and highly visible database events. Starting
with Oracle Database 26ai, it includes the transaction quarantine
information as well.
- Alert
log - an incident will be generated for the internal error and
traced in the alert log. The DBA can monitor the quarantine incident
in V$DIAG_ALERT_EXT.
Views named DBA_QUARANTINED_TRANSACTIONS and CDB_QUARANTINED_TRANSACTIONS monitor
all active quarantined transactions. These views provide all the necessary
information to resolve the quarantine.
Resolving Quarantined Transactions
The database developer will be alerted when a transaction
quarantine is generated. Quarantines should be monitored and resolved quickly
to prevent row locks from being held for a long time.
Quarantines can be monitored
using DBA_QUARANTINED_TRANSACTIONS. The REASON column of the
view shows why the transaction was quarantined. For example:
SQL> select usn, slt, sqn, reason, undo_record_objn
from dba_quarantined_transactions;
USN SLT
SQN REASON UNDO_RECORD_OBJN
------ ------ ------ ----------------------
-------------------
6 18
10 ORA-00600[ktubko_1] 73646
7 20
13 ORA-28304 73650
Once the reason for the transaction quarantines has been
identified (ORA-00600[ktubko_1] and ORA-28304 in the example
above), then refer to the Primary MOS note for Automatic Transaction Quarantine
(Doc ID 3005962.1) where detailed instructions are provided
for how to resolve the different causes of transaction quarantines.
Dropping Quarantined Transactions
After the issue related to the quarantine has been fixed,
the quarantine must be dropped.
Transaction Recovery cannot be retried until the issue
concerning the transaction quarantine is fixed. Therefore, after fixing the
quarantine with the corrective action, the quarantine must be manually dropped
for transaction recovery to restart for the quarantined transaction. The
following DDL syntax can be used to drop the quarantine:
ALTER DATABASE DROP TRANSACTION QUARANTINE
<xid_undo_seg_no> <xid_slot_no>
<xid_sequence_no>;
where
- xid_undo_seg_no is
the undo segment number of the quarantined transaction (USN column of
view DBA_QUARANTINED_TRANSACTIONS)
- xid_slot_no is
the slot number of the quarantined transaction (SLT column of
view DBA_QUARANTINED_TRANSACTIONS)
- xid_sequence_no is
the sequence number of the quarantined transaction (SQN column of
view DBA_QUARANTINED_TRANSACTIONS)
For example, to drop the quarantine for xid 8.20.275, use
command:
ALTER DATABASE DROP TRANSACTION QUARANTINE 8 20 275;
When the transaction quarantine limit is reached (default of
3) for a PDB, it is automatically shut down on all RAC instances so that the
database developer can resolve the issue. The other PDBs in the CDB are not
affected.
Transaction quarantine is designed to help in cases when the
failure, such as memory, data, or state corruption, is confined to a single
transaction. That is, the inactive transaction that fails to recover is
quarantined, other inactive transactions can be recovered, and there's no need
to shut down the PDB or the CDB.
When failures happen across multiple transactions or span
the entire PDB, such as physical corruption of multiple blocks, a PDB SGA
corruption, or a logical data corruption due to an internal error, quarantining
the failed inactive transaction recovery may or may not help. It depends on
whether the root cause for those failures is the same or not, because
recovering other inactive transactions might run into the same issue. The
system keeps on running in an inconsistent state even after quarantining a few
transactions. It can be dangerous when the failure is due to logical data
corruption, because it spreads over time. To prevent this from happening, there
is a transaction quarantine limit of three (3), after which the quarantine is
escalated to the database level and the PDB will be terminated
using shutdown abort if archive
When an escalation occurs, perform the following steps:
- Open
the PDB.
- Query
the view DBA_QUARANTINED_TRANSACTIONS to get information about the
quarantined transactions.
- For
each quarantined transaction in the database, resolve the cause of the
transaction quarantine (Resolving Quarantined Transactions) and then drop
the transaction quarantine (see Dropping Quarantined Transactions).
- Enable
transaction recovery for the PDB.
To enable transaction recovery, use the command:
ALTER SYSTEM SET TRANSACTION_RECOVERY=ENABLED sid='*';
The SCOPE clause is not necessarily required. The
default values for SCOPE are:
- For
PDBs, the default value is SCOPE=BOTH.
- For
CDB$ROOT, if a server parameter file was used to start the database, then
the default is SCOPE=BOTH. If a parameter file was used to start the
database, then the default is SCOPE=MEMORY.
These default values for SCOPE will re-enable
transaction recovery for automatic transaction quarantine.
To determine if transaction quarantines were escalated to
the PDB, alerts are published to all the alert channels described
in Monitoring Quarantined Transactions (SYS.ALERT_QUE, Attention log,
and Alert log).
Alireza Kamrani
No comments:
Post a Comment