Have you ever been asked how you trouble shoot an Oracle performance problem? The first answer to this question that should pop up immediately is: “is this performance problem instance wide or localized in a SQL or PL/SQL piece of code?” For the former you need (if licensed to) an AWR report and in order to diagnose where time is spent. For the later you should investigate where time is spent by having an explain plan taken from the memory or by investigating the design and/or the statistics of related data base objects.
But, bear in mind that in case of instance wide performance problem you should collect an AWR report with a snapshot that corresponds to the time interval when the application was performing badly. In other words, if you are asked to investigate an instance wide performance problem you should ask the end user how long he has been suffering from this performance problem? If the end user answer is 15 minutes then you should ask to generate an AWR report for a period of 15 minutes and during a period where the application is suffering.
Several weeks ago I received a phone call from one of my customer relating an instance wide performance problem (he said that from time to time the application hangs during 15-20 minutes). Below is the first AWR report I have been sent for investigation
Top 5 Timed Events Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class</span> enq: TX - row lock contention 8,468 25,344 2,993 78.7 Application read by other session 714,628 4,604 6 14.3 User I/O db file sequential read 323,264 1,977 6 6.1 User I/O CPU time 171 .5 db file scattered read 1,885 49 26 .2 User I/O
When I saw that 78% of the response time has been spent on a transaction row lock contention I immediately verified that there are no bitmap indexes nor unindexed foreign keys (using Tom Kyte script http://tkyte.blogspot.com/2009/10/httpasktomoraclecomtkyteunindex.html) on this OLTP 10gR2 data base.
So what else then?
In the process of trouble shooting you should remember; remember what have you read in the past few months or sometime in the past couple of years ago? Oh yes, I did remember that Jonathan Lewis has blogged on a similar case (http://jonathanlewis.wordpress.com/2008/02/06/trouble-shooting-2/#more-320) . Then I did understand that the TX – row lock contention wait event could be just a side effect of this read by other session wait event on which I should concentrate my efforts. Typical scenario is that an end user starts an update and locks a couple of rows. This update is followed by a critical query which, in normal situation completes in a fraction of seconds, starts doing an extreme amount of physical I/O. In the meantime other sessions will request the same data from the same block and start waiting for the physical I/O of the first session to complete putting data in the buffer cache. This is the read by other session wait event.
The second day, I was still thinking about the first AWR report when I received a second AWR corresponding to a fresh problem that occurred right an hour ago. And here below is that new report:
Top 5 Timed Events Avg %Total 54 ~~~~~~~~~~~~~~~~~~ wait Call 55 Event Waits Time (s) (ms) Time Wait Class 56 ------------------------------ ------------ ----------- ------ ------ ---------- 57 read by other session 896,860 5,670 6 67.7 User I/O 58 db file sequential read 417,699 2,551 6 30.4 User I/O 59 CPU time 184 2.2 60 control file parallel write 2,407 3 1 0.0 System I/O 61 log file parallel write 2,921 2 1 0.0 System I/O
All this seems now very coherent. There is no TX-row lock contention wait event in this new report. There is more than 67% of the response time have been spent on the read by other session wait event while the remaining 30% has been allocated to the db file sequential read wait event.
A “read by other session” is a wait event that occurs when one session wants to read a block another session is reading from disk and loading into the memory (buffer cache). In short, I want to read a block from disk and someone else is reading it for me; hence I have to wait. But in my actual situation I have many sessions trying to read the same block from the disk. This means that the object those sessions are trying to read can’t be kept in the buffer cache or the cache is filled by other process so that our hot object can’t stay in the buffer cache generating this “read by other session wait” event and consequently more physical I/O.
So what else then?
Generally, there is a correlation between physical I/O, read by other session and “Sql Ordered by read” part of the AWR. This is why I’ve drilled down to this part to observe this:
SQL ordered by Reads DB/Inst: xxxx/xxxx Snaps: 13491-13492 -> Total Disk Reads: 506,742 -> Captured SQL account for 2.0% of Total Reads CPU Elapsed Physical Reads Executions per Exec %Total Time (s) Time (s) SQL Id -------------- ----------- ------------- ------ -------- --------- ------------- 5,203 7 743.3 1.0 10.48 39.39 a6hkd1d84h84p Module: JDBC Thin Client SELECT tab1.ID , tab1.col2 , tab1.id2 , tab1.col1 , tab1.col3 , tab1.col4 , tab1.col5 , tab1.col6 , tab1.col7 , tab1.col8 , tab1.col9 , tab1.col10 , tab1.col11 , tab2.ID , tab3.col3 , tab4.col4 FROM tab1 LEFT OUTER JOIN tab2 ON tab1.acl_id = tab2.ID LEFT OUTER JOIN tab3 ON tab1.ID = tab3.node_id LEFT OUTER JOIN tab4 ON tab1.ID = tab4.node_id WHERE tab1.ID = :1 AND tab1.id2 IN (:2, :3, :4, :5, :6, :7, :8);
Ah we’ve narrowed the problem now up to the above query.
So what else then?
Since we’ve the SQL responsible of the important part of physical I/O, let’s drill down to the ‘Segment per physical read’ part of the AWR in order to localize the ‘hot’ object:
Segments by Physical Reads DB/Inst: xxxx/xxxx Snaps: 13491-13492 -> Total Physical Reads: 506,742 -> Captured Segments account for 98.8% of Total Tablespace Subobject Obj. Physical Owner Name Object Name Name Type Reads %Total ---------- ---------- -------------------- ---------- ----- ------------ ------- xxxx xxxx SYS_C0010893 INDEX 472,853 93.31 xxxx xxxx TABLE_T3 TABLE 9,051 1.79 xxxx xxxx FK_TABLE_T3 INDEX 7,036 1.39 xxxx xxxx SYS_C0010896 INDEX 2,918 .58 xxxx xxxx TABLE_T1 TABLE 2,252 .44 -------------------------------------------------------------
Oh yes; there is an index there (SYS_C0010893 ) which is consuming 93% of the whole physical I/O.
Finally, I asked to have the explain plan of the above SQL query which I have reproduced here below (explain plan has been taken with the help of DBMS_DISPLAY_AWR package):
----------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | | 4891 (100)| | | 1 | NESTED LOOPS OUTER | | 8487 | 2320K| 4891 (1)| 00:00:59 | | 2 | NESTED LOOPS OUTER | | 758 | 167K| 1313 (1)| 00:00:16 | | 3 | NESTED LOOPS OUTER | | 187 | 40392 | 751 (0)| 00:00:10 | | 4 | INLIST ITERATOR | | | | | | | 5 | TABLE ACCESS BY INDEX ROWID| TABLE_T1 | 187 | 26367 | 377 (0)| 00:00:05 | | 6 | INDEX UNIQUE SCAN | SYS_C0010893 | 187 | | 193 (0)| 00:00:03 | | 7 | TABLE ACCESS BY INDEX ROWID | TABLE_T2 | 1 | 75 | 2 (0)| 00:00:01 | | 8 | INDEX UNIQUE SCAN | SYS_C0010774 | 1 | | 1 (0)| 00:00:01 | | 9 | INDEX RANGE SCAN | SYS_C0010896 | 4 | 40 | 3 (0)| 00:00:01 | | 10 | TABLE ACCESS BY INDEX ROWID | TABLE_T3 | 11 | 594 | 5 (0)| 00:00:01 | | 11 | INDEX RANGE SCAN | FK_TABLE_T3 | 11 | | 3 (0)| 00:00:01 | -----------------------------------------------------------------------------------------------------------
Where I see the presence of this hot SYS_C0010893 index.
Oh I will not finish without displaying the buffer pool advisory part of the problematic AWR:
Snap Id Snap Time Sessions Curs/Sess --------- ------------------- --------- ---------- Begin Snap: 13491 25-Mei-11 09:00:56 36 6.1 End Snap: 13492 25-Mei-11 10:00:30 36 6.1 Elapsed: 59.56 (mins) DB Time: 109.69 (mins)
Cache Sizes ~~~~~~~~~~~ Begin End ---------- ---------- Buffer Cache: 776M 776M Std Block Size: 8K Shared Pool Size: 96M 96M Log Buffer: 6,924K Load Profile ~~~~~~~~~~~~ Per Second Per Transaction --------------- --------------- Redo size: 1,432.57 159.56 Logical reads: 984.03 109.60 Block changes: 9.01 1.00 Physical reads: 141.79 15.79 Physical writes: 1.09 0.12 User calls: 47.98 5.34 Parses: 7.37 0.82 Hard parses: 0.63 0.07 Sorts: 3.66 0.41 Logons: 0.05 0.01 Executes: 14.46 1.61 Transactions: 8.98 % Blocks changed per Read: 0.92 Recursive Call %: 75.36 Rollback per transaction %: 97.99 Rows per Sort: 8.40 Instance Efficiency Percentages (Target 100%) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Buffer Nowait %: 76.32 Redo NoWait %: 100.00 Buffer Hit %: 85.59 In-memory Sort %: 100.00 Library Hit %: 88.79 Soft Parse %: 91.43 Execute to Parse %: 49.02 Latch Hit %: 91.43 Parse CPU to Parse Elapsd %: 44.68 % Non-Parse CPU: 98.06 Shared Pool Statistics Begin End ------ ------ Memory Usage %: 78.60 77.65 % SQL with executions>1: 80.50 85.53 % Memory for SQL w/exec>1: 76.65 82.61 Buffer Pool Advisory DB/Inst: xxxx/xxxx Snap: 13492 -> Only rows with estimated physical reads >0 are displayed -> ordered by Block Size, Buffers For Estimate Est Phys Size for Size Buffers for Read Estimated P Est (M) Factor Estimate Factor Physical Reads --- -------- ------ ---------------- ------ ------------------ D 72 .1 8,991 1.5 1,955,905 D 144 .2 17,982 1.5 1,919,097 D 216 .3 26,973 1.5 1,905,107 D 288 .4 35,964 1.4 1,869,577 D 360 .5 44,955 1.4 1,852,229 D 432 .6 53,946 1.3 1,743,851 D 504 .6 62,937 1.3 1,738,194 D 576 .7 71,928 1.3 1,727,611 D 648 .8 80,919 1.1 1,367,322 D 720 .9 89,910 1.0 1,295,649 D 776 1.0 96,903 1.0 1,293,144 ----- actual size D 792 1.0 98,901 1.0 1,292,411 D 864 1.1 107,892 1.0 1,242,738 D 936 1.2 116,883 0.9 1,212,406 D 1,008 1.3 125,874 0.9 1,132,996 D 1,080 1.4 134,865 0.8 1,029,577 D 1,152 1.5 143,856 0.7 929,565 D 1,224 1.6 152,847 0.6 732,968 ----- suggested size D 1,296 1.7 161,838 0.6 720,034 D 1,368 1.8 170,829 0.6 714,496 D 1,440 1.9 179,820 0.6 712,110 -------------------------------------------------------------
Knowing the limitation I have that is (a) I can’t change the query (b) I can’t create a new index and as far as it is a problem of physical I/O perpetuated by a “read by other sessions” wait event, I have suggested the client to increase the buffer cache size from 776MB to 1224MB.
I am still waiting for their feedback.
how you come to this number 1224MB for buffer pool
Comment by nilesh — August 7, 2012 @ 9:20 pm |
Nilesh,
The buffer Pool advisory is easy to interpret. The actual configuration is:
which means that with a buffer cache size of 776MB there have been 1,293,144 physical I/O generated
Multiply this size by a factor of 1,6 (776*1,6 = 1241,6 MB ~ 1224MB) and you will reduce the physical I/O to 732,968 as shown within the same buffer pool advisory
Double the buffer cache size (776*1,9 = 1474,4 MB ) and you will reduce the physical I/O to 712,110
Halve the the buffer cache size (776 *0.5 =388MB) and you will have a jump in the physical I/O to 1,852,229 (559,085 extra physical reads)
Best Regards
Comment by hourim — August 8, 2012 @ 6:51 am |
No feedback?
Comment by rogerdpack — March 6, 2013 @ 9:40 pm |
Ok so where you say “suggested size” that was *your* suggested size, to the client, right?
Comment by rogerdpack — March 7, 2013 @ 12:26 am |
if you have the option to modify the query or the index, What could be the change ?
Comment by Ahmed Mousaafa — August 29, 2013 @ 1:07 am |
Ahmed,
It has been two years since I tried to trouble shoot this problem from home at a distant client. If I will be confronted to the same issue today, I would have tried to ask more question like (a) how much data you are after (b) what kind of index you have on the concerned tables (c) how fresh the statistics are (d) how data are scattered and then get an execution plan of the same query from the memory and eventually get a trace (tkprof) of it. It is quite sure that enhancing the query will make the application behaving as if there were no locks because this query follows and update before the commit and as far as this query will run very quickly other user will not be locked when they will try to update the same set of data.
Best regards
Comment by hourim — August 29, 2013 @ 1:01 pm |
Hi Muhammad,
Could you please explain what metrics to see in SQL ordered by Elapsed Time,SQL ordered by CPU Time when figuring high consuming SQL.
Please throw some light on this.
Regards,
Shadab
Comment by Shadab — September 11, 2014 @ 7:32 am |
Hi Shadab,
There are few wait events that correlate very well with few of the SQL ordered by AWR information. For example, db file sequential read correlates very well with SQL ordered by Gets. Db file scattered read correlates very well with SQL ordered by reads (or by physical reads). When I have high waits on direct path read I usually try to find the underlying table (or index) by looking at the Segment by Direct physical read (an 11g Option I think)
I always start by looking at the number of execution and ask myself if this is a normal behaviour to have a great number of executions during the AWR snapshot. Elapsed time per execution is also a good indication. If this information exceeds 7 seconds I know that the corresponding sql_id has been monitored which allows me to generate the very important Real time SQL monitoring report.
I would advise you to read the Jonathan Lewis collection of article about AWR and Statspack
Best regards
Comment by hourim — September 11, 2014 @ 8:06 am |
[…] https://hourim.wordpress.com/2011/06/07/trouble-shooting-read-by-other-sessions-wait-event/ http://tkyte.blogspot.jp/2009/10/httpasktomoraclecomtkyteunindex.html […]
Pingback by Trouble-shooting wait event – About Oracle and more — July 12, 2016 @ 2:59 am |
You can also add this hot object to keep pool right , if it is very small?
Comment by Jyothish — November 16, 2016 @ 1:31 am |
[…] https://hourim.wordpress.com/2011/06/07/trouble-shooting-read-by-other-sessions-wait-event/ […]
Pingback by Tuning issues – LEARNING DBA — February 9, 2020 @ 9:42 am |
[…] Ref: https://hourim.wordpress.com/2011/06/07/trouble-shooting-read-by-other-sessions-wait-event/ […]
Pingback by Solution for wait event ” Read by other session” | orasolution — July 29, 2020 @ 5:12 am |