Where Are My Snapshots?

I’m sharing this in the hope that it will provide others with all the relevant information to understand this behaviour, should they hit it, rather than have to spend time searching MOS and performing their own tests. The information is there on MOS, but it took me a little time to find it (note IDs listed later).

“AWR is broken! We don’t have any snapshots over the weekend.”

That’s what started this investigation. I took a look and sure enough we didn’t have any snapshots for some instances over a few recent time periods. I checked the alert log and found trace files like the one shown below:

Trace file /data/<db os owner>/<DB NAME>/admin/diag/rdbms/<db name>/<DB NAME>8/trace/<DB NAME>8_m001_23365.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /data/<db os owner>/<DB NAME>/product/11.2.0.2
System name:    Linux
Node name:      <hostname removed>
Release:        2.6.18-194.3.1.0.3.el5
Version:        #1 SMP Tue Aug 31 22:41:13 EDT 2010
Machine:        x86_64
Instance name: <DB NAME>8
Redo thread mounted by this instance: 8
Oracle process number: 87
Unix process pid: 23365, image: <db os owner>@<hostname removed> (M001)

*** 2011-04-06 18:48:50.018
*** SESSION ID:(704.3391) 2011-04-06 18:48:50.018
*** CLIENT ID:() 2011-04-06 18:48:50.018
*** SERVICE NAME:(SYS$BACKGROUND) 2011-04-06 18:48:50.018
*** MODULE NAME:(MMON_SLAVE) 2011-04-06 18:48:50.018
*** ACTION NAME:(Remote-Flush Slave Action) 2011-04-06 18:48:50.018

DDE rules only execution for: ORA 12751
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
Executing ASYNC actions
----- START DDE Action: 'ORA_12751_DUMP' (Sync) -----
Runtime exceeded 900 seconds

The key thing to be aware of is that there has been a change in behaviour from 10g to 11g that limits the amount of time MMON (and its slaves) are allowed to run. MOS 761298.1 “ORA-12751 cpu time or run time policy violation” covers this. The purpose of the change is to stop MMON (and its slaves) contending with application sessions that are already fighting for resources.

In this case the instance was getting hit very hard by the application during testing, and the team reporting the problem were trying to squeeze every last drop of performance out of the system. Up until this point AWR reports had been used to identify where bottlenecks were occurring and to propose approaches for avoiding them. Without AWR snapshots there would be no AWR reports, which in turn stopped them using the same approach to improve throughput…
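
As an aside, confirming exactly which periods are missing snapshots is straightforward from dba_hist_snapshot. Something along these lines (purely illustrative; trim it to the time range of interest) lists the gap between consecutive snapshots for each instance, and anything noticeably larger than the snapshot interval points at a missing snapshot:

select instance_number,
       snap_id,
       end_interval_time,
       end_interval_time
         - lag(end_interval_time)
             over (partition by instance_number order by snap_id) as gap
from   dba_hist_snapshot
order  by instance_number, snap_id;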

Having highlighted the MOS note, I found there was still some question over whether this was actually the reason for the missing snapshots, so I had to come up with a way of validating the situation. I ran a manual snapshot with SQL trace enabled and then ran tkprof on the trace file.
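
For reference, the test was along these lines (the tracefile identifier and output file name below are just illustrative values):

alter session set tracefile_identifier = 'manual_awr_snap';
alter session set sql_trace = true;

exec dbms_workload_repository.create_snapshot

alter session set sql_trace = false;

-- then, from the OS, format the resulting trace file, e.g.:
-- tkprof <trace file> manual_awr_snap.prf sort=exeela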

There are 2 elements of the trace file/tkprof output that I feel are relevant.

1) The top section of the tkprof output, produced using the option “sort=exeela”, shows the following:

SQL ID: 1uk5m5qbzj1vt Plan Hash: 0

BEGIN dbms_workload_repository.create_snapshot; END;

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.09     901.15          0          0         22           1
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      0.09     901.15          0          0         22           1

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: SYS

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  enq: WF - contention                           19        0.00          0.00
  row cache lock                                  3        0.00          0.00
  gc current grant busy                           2        0.00          0.00
  reliable message                              900        1.00        900.85
  log file sync                                   1        0.00          0.00
  SQL*Net message to client                       1        0.00          0.00
  SQL*Net message from client                     1        0.00          0.00
********************************************************************************

900 seconds (15 mins) waiting for “reliable message”. Unfortunately “reliable message” doesn’t make it onto the list of “wait events of greatest interest” in the Oracle documentation and information is sparse on MOS. However, I believe this represents the session waiting for a response from the other instances in the cluster.
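
If you catch a snapshot while it is still in flight, a quick look at gv$session shows the session sitting on this wait. A minimal sketch:

select inst_id, sid, program, event, seconds_in_wait
from   gv$session
where  event = 'reliable message';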

2) Looking at the raw trace file, the section where the wait on “reliable message” ends contains the following message:

KEWRGRFR: failed waiting for remote flushers, rcode=-125

This was accepted as confirmation that we were hitting the issue in MOS 761298.1, but focus then turned to how to live with this behaviour… And, obviously, hitting the instance(s) that they were testing on a little less aggressively wasn’t an option, even though (in my opinion) they were unlikely to be getting optimal performance at this point (something I didn’t have time to investigate).

The options I offered were:

1) Manually take an AWR snapshot right before the testing starts, accept that there are unlikely to be any snapshots during the application testing, manually take another AWR snapshot as soon as the testing completes, and then generate an AWR report covering the period between the 2 manual snapshots (sketched after this list).

2) Trace the MMON slaves while at least one of the database servers is heavily loaded, perform a manual snapshot, examine the trace files to pinpoint the exact cause of the delay in creating the snapshot, and raise an SR to establish whether we can disable the part of the snapshot activity that is responsible for the delay – assuming there is a specific part of the snapshot activity that is being slowed down by the load on the server(s), rather than all of it.
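
To illustrate option 1, bracketing a test run with manual snapshots is nothing more complicated than the following sketch:

-- immediately before the test run
exec dbms_workload_repository.create_snapshot

-- ... application testing ...

-- immediately after the test run
exec dbms_workload_repository.create_snapshot

-- then generate a report spanning the two manual snapshots
@?/rdbms/admin/awrrpt.sql

awrrpt.sql prompts for the begin and end snapshot IDs, so you simply pick the two snapshots taken above.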

It was clear to me that option 1 was the logical choice. Option 2 was suggested on the basis of the hidden parameter “_awr_disabled_flush_tables”, which is covered in MOS 1301503.1 “How To Diagnose AWR Snapshots”. It would be useful if searching MOS for “ORA-12751” found this note.
