Mohamed Houri’s Oracle Notes

June 7, 2011

Trouble shooting read by other sessions wait event

Filed under: Oracle — hourim @ 6:02 pm

Have you ever been asked how you trouble shoot an Oracle performance problem? The first answer to this question that should pop up immediately  is: “is this performance problem instance wide or localized in a SQL or PL/SQL piece of code?”  For the former you need (if licensed to) an AWR report and in order to diagnose where time is spent. For the later you should investigate where time is spent by having an explain plan taken from the memory or by investigating the design and/or the statistics of related data base objects.

But,  bear in mind that in case of instance wide performance problem you should collect an AWR report with a snapshot that corresponds to the time interval when the application was performing badly.  In other words, if you are asked to investigate an instance wide performance problem you should ask the end user how long he has been suffering from this performance problem? If the end user answer is 15 minutes then you should ask to generate an AWR report for a period of 15 minutes and during a period where the application is suffering.

Several weeks ago I received a phone call from one of my customer relating an instance wide performance problem (he said that from time to time the application hangs during 15-20 minutes). Below is the first AWR report I have been sent for investigation

Top 5 Timed Events

Event                          Waits   Time(s)    Avg Wait(ms)    % Total Call Time     Wait Class</span>
enq: TX - row lock contention  8,468    25,344       2,993            78.7             Application
read by other session          714,628  4,604         6               14.3             User I/O
db file sequential read        323,264  1,977         6               6.1              User I/O
CPU time                                171                           .5
db file scattered read          1,885   49           26               .2               User I/O

When I saw that 78% of the response time has been spent on a transaction row lock contention I immediately verified that there are no bitmap indexes nor unindexed foreign keys (using Tom Kyte script http://tkyte.blogspot.com/2009/10/httpasktomoraclecomtkyteunindex.html) on this OLTP 10gR2 data base.

So what else then?

In the process of trouble shooting you should remember; remember what have you read in the past few months or sometime in the past couple of years ago?  Oh yes, I did remember that Jonathan Lewis has blogged on a similar case (http://jonathanlewis.wordpress.com/2008/02/06/trouble-shooting-2/#more-320) .  Then I did understand that the TX – row lock contention wait event could be just a side effect of this read by other session wait event on which I should concentrate my efforts. Typical scenario is that an end user starts an update and locks a couple of rows. This update is followed by a critical query which, in normal situation completes in a fraction of seconds, starts doing an extreme amount of physical I/O. In the meantime other sessions will request the same data from the same block and start waiting for the physical I/O of the first session to complete putting data in the buffer cache. This is the read by other session wait event.

The second day, I was still thinking about the first AWR report when I received a second AWR corresponding to a fresh problem that occurred right an hour ago. And here below is that new report:

Top 5 Timed Events                                         Avg %Total
54     ~~~~~~~~~~~~~~~~~~                                        wait   Call
55     Event                                 Waits    Time (s)   (ms)   Time Wait Class
56     ------------------------------ ------------ ----------- ------ ------ ----------
57     read by other session               896,860       5,670      6   67.7   User I/O
58     db file sequential read             417,699       2,551      6   30.4   User I/O
59     CPU time                                            184           2.2
60     control file parallel write           2,407           3      1    0.0 System I/O
61     log file parallel write               2,921           2      1    0.0 System I/O

All this seems now very coherent. There is no TX-row lock contention wait event in this new report. There is more than 67% of the response time have been spent on the read by other session wait event while the remaining 30% has been allocated to the db file sequential read wait event.

A “read by other session” is a wait event that occurs when one session wants to read a block another session is reading from disk and loading into the memory (buffer cache). In short, I want to read a block from disk and someone else is reading it for me; hence I have to wait. But in my actual situation I have many sessions trying to read the same block from the disk. This means that the object those sessions are trying to read can’t be kept in the buffer cache or the cache is filled by other process so that our hot object can’t stay in the buffer cache generating this “read by other session wait” event and consequently more physical I/O.

So what else then?

Generally, there is a correlation between physical I/O, read by other session and “Sql Ordered by read” part of the AWR. This is why I’ve drilled down to this part to observe this:

SQL ordered by Reads           DB/Inst: xxxx/xxxx  Snaps: 13491-13492
-> Total Disk Reads:         506,742
-> Captured SQL account for      2.0% of Total
Reads              CPU     Elapsed
Physical Reads  Executions    per Exec   %Total Time (s)  Time (s)    SQL Id
-------------- ----------- ------------- ------ -------- --------- -------------
5,203           7         743.3    1.0    10.48     39.39 a6hkd1d84h84p
Module: JDBC Thin Client
SELECT  tab1.ID
      , tab1.col2
      , tab1.id2
      , tab1.col1
      , tab1.col3
      , tab1.col4
      , tab1.col5
      , tab1.col6
      , tab1.col7
      , tab1.col8
      , tab1.col9
      , tab1.col10
      , tab1.col11
      , tab2.ID
      , tab3.col3
     , tab4.col4
FROM tab1 LEFT OUTER JOIN tab2
    ON tab1.acl_id = tab2.ID
LEFT OUTER JOIN tab3
     ON tab1.ID = tab3.node_id
LEFT OUTER JOIN tab4
     ON tab1.ID = tab4.node_id
WHERE tab1.ID = :1
AND tab1.id2 IN (:2, :3, :4, :5, :6, :7, :8);

Ah we’ve narrowed the problem now up to the above query.

So what else then?

Since we’ve the SQL responsible of the important part of physical I/O, let’s drill down to the ‘Segment per physical read’ part of the AWR in order to localize the ‘hot’ object:

Segments by Physical Reads      DB/Inst: xxxx/xxxx  Snaps: 13491-13492
-> Total Physical Reads:         506,742
-> Captured Segments account for    98.8% of Total

Tablespace                      Subobject  Obj.      Physical
Owner         Name    Object Name            Name     Type         Reads  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
xxxx        xxxx       SYS_C0010893                    INDEX      472,853   93.31
xxxx        xxxx       TABLE_T3                        TABLE        9,051    1.79
xxxx        xxxx       FK_TABLE_T3                     INDEX        7,036    1.39
xxxx        xxxx       SYS_C0010896                    INDEX        2,918     .58
xxxx        xxxx       TABLE_T1                        TABLE        2,252     .44
-------------------------------------------------------------

Oh yes; there is an index there (SYS_C0010893 ) which is consuming 93% of the whole physical I/O.

Finally, I asked to have the explain plan of the above SQL query which I have reproduced here below (explain plan has been taken with the help of DBMS_DISPLAY_AWR package):

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                         |       |       |  4891 (100)|          |
|   1 |  NESTED LOOPS OUTER             |                         |  8487 |  2320K|  4891   (1)| 00:00:59 |
|   2 |   NESTED LOOPS OUTER            |                         |   758 |   167K|  1313   (1)| 00:00:16 |
|   3 |    NESTED LOOPS OUTER           |                         |   187 | 40392 |   751   (0)| 00:00:10 |
|   4 |     INLIST ITERATOR             |                         |       |       |            |          |
|   5 |      TABLE ACCESS BY INDEX ROWID| TABLE_T1                |   187 | 26367 |   377   (0)| 00:00:05 |
|   6 |       INDEX UNIQUE SCAN         | SYS_C0010893            |   187 |       |   193   (0)| 00:00:03 |
|   7 |     TABLE ACCESS BY INDEX ROWID | TABLE_T2                |     1 |    75 |     2   (0)| 00:00:01 |
|   8 |      INDEX UNIQUE SCAN          | SYS_C0010774            |     1 |       |     1   (0)| 00:00:01 |
|   9 |    INDEX RANGE SCAN             | SYS_C0010896            |     4 |    40 |     3   (0)| 00:00:01 |
|  10 |   TABLE ACCESS BY INDEX ROWID   | TABLE_T3                |    11 |   594 |     5   (0)| 00:00:01 |
|  11 |    INDEX RANGE SCAN             | FK_TABLE_T3             |    11 |       |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------

Where I see the presence of this hot SYS_C0010893 index.
Oh I will not finish without displaying the buffer pool advisory part of the problematic AWR:

            Snap Id      Snap Time         Sessions Curs/Sess
            --------- ------------------- --------- ----------
Begin Snap:   13491     25-Mei-11 09:00:56        36       6.1
End Snap:     13492     25-Mei-11 10:00:30        36       6.1
Elapsed:                59.56 (mins)
DB Time:                109.69 (mins)

 

Cache Sizes
~~~~~~~~~~~                       Begin        End
                                 ---------- ----------
               Buffer Cache:       776M       776M  Std Block Size:         8K
               Shared Pool Size:   96M        96M      Log Buffer:     6,924K

Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                        ---------------  ---------------
                Redo size:              1,432.57                159.56
                Logical reads:            984.03                109.60
                Block changes:              9.01                  1.00
                Physical reads:           141.79                 15.79
                Physical writes:            1.09                  0.12
                User calls:                47.98                  5.34
                Parses:                     7.37                  0.82
                Hard parses:                0.63                  0.07
                Sorts:                      3.66                  0.41
                Logons:                     0.05                  0.01
                Executes:                  14.46                  1.61
                Transactions:               8.98

% Blocks changed per Read:    0.92    Recursive Call %:    75.36
Rollback per transaction %:   97.99       Rows per Sort:     8.40

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer Nowait             %:   76.32       Redo NoWait %:  100.00
Buffer  Hit               %:   85.59    In-memory Sort %:  100.00
Library Hit               %:   88.79        Soft Parse %:   91.43
Execute to Parse          %:   49.02         Latch Hit %:   91.43
Parse CPU to Parse Elapsd %:   44.68     % Non-Parse CPU:   98.06

Shared Pool Statistics        Begin    End
                             ------  ------
            Memory Usage %:   78.60   77.65
   % SQL with executions>1:   80.50   85.53
 % Memory for SQL w/exec>1:   76.65   82.61

Buffer Pool Advisory                   DB/Inst: xxxx/xxxx  Snap: 13492
-> Only rows with estimated physical reads >0 are displayed
-> ordered by Block Size, Buffers For Estimate
Est
Phys
Size for   Size      Buffers for   Read          Estimated
P    Est (M) Factor         Estimate Factor     Physical Reads
--- -------- ------ ---------------- ------ ------------------
D         72     .1            8,991    1.5          1,955,905
D        144     .2           17,982    1.5          1,919,097
D        216     .3           26,973    1.5          1,905,107
D        288     .4           35,964    1.4          1,869,577
D        360     .5           44,955    1.4          1,852,229
D        432     .6           53,946    1.3          1,743,851
D        504     .6           62,937    1.3          1,738,194
D        576     .7           71,928    1.3          1,727,611
D        648     .8           80,919    1.1          1,367,322
D        720     .9           89,910    1.0          1,295,649
D        776    1.0           96,903    1.0          1,293,144   ----- actual size
D        792    1.0           98,901    1.0          1,292,411
D        864    1.1          107,892    1.0          1,242,738
D        936    1.2          116,883    0.9          1,212,406
D      1,008    1.3          125,874    0.9          1,132,996
D      1,080    1.4          134,865    0.8          1,029,577
D      1,152    1.5          143,856    0.7            929,565
D      1,224    1.6          152,847    0.6            732,968  ----- suggested size
D      1,296    1.7          161,838    0.6            720,034
D      1,368    1.8          170,829    0.6            714,496
D      1,440    1.9          179,820    0.6            712,110
-------------------------------------------------------------

Knowing the limitation I have that is  (a) I can’t change the query (b) I can’t create a new index and as far as it is a problem of physical I/O perpetuated by a “read by other sessions” wait event, I have suggested  the client  to increase the buffer cache size from 776MB to 1224MB.

I am still waiting for their feedback.

12 Comments »

  1. how you come to this number 1224MB for buffer pool

    Comment by nilesh — August 7, 2012 @ 9:20 pm | Reply

  2. Nilesh,

    The buffer Pool advisory is easy to interpret. The actual configuration is:

     776    1.0           96,903    1.0          1,293,144   
    

    which means that with a buffer cache size of 776MB there have been 1,293,144 physical I/O generated

    Multiply this size by a factor of 1,6 (776*1,6 = 1241,6 MB ~ 1224MB) and you will reduce the physical I/O to 732,968 as shown within the same buffer pool advisory

     1,224    1.6          152,847    0.6            732,968  
    

    Double the buffer cache size (776*1,9 = 1474,4 MB ) and you will reduce the physical I/O to 712,110

     1,440    1.9          179,820    0.6            712,110 
    

    Halve the the buffer cache size (776 *0.5 =388MB) and you will have a jump in the physical I/O to 1,852,229 (559,085 extra physical reads)

    360     .5           44,955    1.4          1,852,229 
    

    Best Regards

    Comment by hourim — August 8, 2012 @ 6:51 am | Reply

  3. No feedback?

    Comment by rogerdpack — March 6, 2013 @ 9:40 pm | Reply

    • Ok so where you say “suggested size” that was *your* suggested size, to the client, right?

      Comment by rogerdpack — March 7, 2013 @ 12:26 am | Reply

  4. if you have the option to modify the query or the index, What could be the change ?

    Comment by Ahmed Mousaafa — August 29, 2013 @ 1:07 am | Reply

  5. Ahmed,

    It has been two years since I tried to trouble shoot this problem from home at a distant client. If I will be confronted to the same issue today, I would have tried to ask more question like (a) how much data you are after (b) what kind of index you have on the concerned tables (c) how fresh the statistics are (d) how data are scattered and then get an execution plan of the same query from the memory and eventually get a trace (tkprof) of it. It is quite sure that enhancing the query will make the application behaving as if there were no locks because this query follows and update before the commit and as far as this query will run very quickly other user will not be locked when they will try to update the same set of data.
    Best regards

    Comment by hourim — August 29, 2013 @ 1:01 pm | Reply

  6. Hi Muhammad,

    Could you please explain what metrics to see in SQL ordered by Elapsed Time,SQL ordered by CPU Time when figuring high consuming SQL.

    Please throw some light on this.

    Regards,
    Shadab

    Comment by Shadab — September 11, 2014 @ 7:32 am | Reply

  7. Hi Shadab,

    There are few wait events that correlate very well with few of the SQL ordered by AWR information. For example, db file sequential read correlates very well with SQL ordered by Gets. Db file scattered read correlates very well with SQL ordered by reads (or by physical reads). When I have high waits on direct path read I usually try to find the underlying table (or index) by looking at the Segment by Direct physical read (an 11g Option I think)

    I always start by looking at the number of execution and ask myself if this is a normal behaviour to have a great number of executions during the AWR snapshot. Elapsed time per execution is also a good indication. If this information exceeds 7 seconds I know that the corresponding sql_id has been monitored which allows me to generate the very important Real time SQL monitoring report.

    I would advise you to read the Jonathan Lewis collection of article about AWR and Statspack

    AWR / Statspack

    Best regards

    Comment by hourim — September 11, 2014 @ 8:06 am | Reply

  8. You can also add this hot object to keep pool right , if it is very small?

    Comment by Jyothish — November 16, 2016 @ 1:31 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Create a free website or blog at WordPress.com.

Tony's Oracle Tips

Tony Hasler's light hearted approach to learning about Oracle

Richard Foote's Oracle Blog

Focusing Specifically On Oracle Indexes, Database Administration and Some Great Music

Hatem Mahmoud's blog

Just another blog : Databases, Linux and other stuffs

Mohamed Houri’s Oracle Notes

Qui se conçoit bien s’énonce clairement

Oracle Diagnostician

Performance troubleshooting as exact science

Raheel's Blog

Things I have learnt as Oracle DBA

Coskan's Approach to Oracle

What I learned about Oracle

So Many Oracle Manuals, So Little Time

“Books to the ceiling, Books to the sky, My pile of books is a mile high. How I love them! How I need them! I'll have a long beard by the time I read them”—Lobel, Arnold. Whiskers and Rhymes. William Morrow & Co, 1988.

Carlos Sierra's Tools and Tips

Tools and Tips for Oracle Performance and SQL Tuning

Oracle Scratchpad

Just another Oracle weblog

OraStory

Dominic Brooks on Oracle Performance, Tuning, Data Quality & Sensible Design ... (Now with added Sets Appeal)