April 30, 2012

Trouble Shooting: CRS-4534: Cannot communicate with Event Manager

There is a three nodes Rac cluster.
All DBs are up and running fine on all three nodes.
oracle $ ps -ef|grep smon|wc -l
43
oracle $ ps -ef|grep tns|wc -l
50
oracle $ ps -ef|grep d.bin
root      8177     1  0 Apr28 ?        00:18:29 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle    8550     1 22 Apr28 ?        08:52:59 /stage/grid/11.2.0/grid/bin/oraagent.bin
root     26480     1  0 Apr28 ?        00:03:26 /stage/grid/11.2.0/grid/bin/ohasd.bin reboot
oracle   29210     1  0 Apr28 ?        00:00:00 /stage/grid/11.2.0/grid/bin/mdnsd.bin
oracle   29232     1  0 Apr28 ?        00:01:09 /stage/grid/11.2.0/grid/bin/gpnpd.bin
oracle   29257     1  0 Apr28 ?        00:05:32 /stage/grid/11.2.0/grid/bin/gipcd.bin
root     29307     1  0 Apr28 ?        00:00:13 /stage/grid/11.2.0/grid/bin/cssdmonitor
root     29334     1  0 Apr28 ?        00:00:10 /stage/grid/11.2.0/grid/bin/cssdagent
oracle   29351     1  8 Apr28 ?        03:25:51 /stage/grid/11.2.0/grid/bin/ocssd.bin
root     29540     1  0 Apr28 ?        00:20:57 /stage/grid/11.2.0/grid/bin/orarootagent.bin
oracle   29556     1  0 Apr28 ?        00:00:18 /stage/grid/11.2.0/grid/bin/diskmon.bin -d -f
root     30160     1  0 Apr28 ?        00:00:33 /stage/grid/11.2.0/grid/bin/octssd.bin reboot
oracle   30350     1  0 Apr28 ?        00:00:04 /stage/grid/11.2.0/grid/bin/evmd.bin
root     31947     1  0 Apr28 ?        00:03:50 /stage/grid/11.2.0/grid/bin/crsd.bin reboot

From above output red part we aslo can see evmd.bin process is up.
But when use crsctl command, it shows that Event Manger not work on node 2 and node 3.
oracle $ crsctl check cluster -all
**************************************************************
node1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
node2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************
node3:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************

Checked cluster alert.log, it also indicate EVMD started normal:
2012-04-28 18:29:48.154
[evmd(30579)]CRS-1401:EVMD started on node node2.

Below is from evmd.log:
2012-04-28 18:29:17.518: [ EVMD][1928129552] EVMD Starting
2012-04-28 18:29:17.518: [ EVMD][1928129552] Initializing OCR
2012-04-28 18:29:17.518: [ EVMD][1928129552] EVMD running in cluster mode
2012-04-28 18:29:18.524: [ EVMD][1928129552] Waiting for OCR. Seconds elapsed: 0
2012-04-28 18:29:48.136: [ EVMD][1928129552] Get OCR context succeeded
2012-04-28 18:29:48.153: [ EVMD][1928129552] Initializing Diagnostics Settings
2012-04-28 18:29:48.224: [ EVMD][1928129552] Authorization database built successfully.
2012-04-28 18:29:49.105: [ EVMAPP][1928129552] EVMD Started
2012-04-28 18:29:49.123: [ EVMD][1928129552] Authorization database built successfully.

All seems fine. Then why "CRS-4534: Cannot communicate with Event Manager"?

Let's check the current status for evmd:
oracle $ crsctl stat res -t -init
-------------------------------------------------------
NAME             TARGET          STATE       SERVER   
-------------------------------------------------------
ora.evmd      1  ONLINE   INTERMEDIATE node3

The current status INTERMEDIATE means the evmd did starting but not totally completed.

Below recorded in oracle_agent.logfile:
2012-04-28 18:45:50.936: [    AGFW][1113385280] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: OFFLINE
.................

2012-04-28 18:46:23.441: [    AGFW][1114609984] {0:0:2} ora.evmd 1 1 state changed from: UNKNOWN to: STARTING
.................

2012-04-28 20:41:27.768: [    AGFW][1116698944] {0:0:2} ora.evmd 1 1 state changed from: STARTING to: PARTIAL
.................

We  can see the evmd was starting in progress, status changed from UNKNOWN --> OFFLINE --> STARTING --> PARTIAL.
And next the status should change from PARTIAL to ONLINE.
But from the logfile, we don't see the final step, it never happen.

It failed at the final step due to below error message:
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort {
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] abort {
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] abort command: check
2012-04-28 20:43:27.787: [ora.evmd][1112496448] {0:0:2} [check] tryActionLock {
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] did not get lock
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] tryActionLock }
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort: Exception LockAcquireTimeoutException
2012-04-28 20:44:27.816: [ora.evmd][1112496448] {0:0:2} [check] clsn_agent::abort, agent exiting }

This is very similar to bug 11807012: oraagent.bin Exits After Check Timed Out

But this bug should have been fixed in 11.2.0.2.3,  and our server already patched 11gR2 Patch Set Update 4.

It looks like that PSU don't real fully solved that bug.


Raised SR and will check with Oracle Support further.


More......

April 24, 2012

rowid paging benefit from 11g new feature

Consider below request:
We need to get the 900 to 1000 records after ordered by time.

We call this kind of request as "paging".  Because in most case we encounter this kind of request in web page's redirect buttons:  page 1, page 2, page 3..... 
We can jump to any page via clicking the conrrespoding buttion.

There are two kinds of paging method in oracle: Rownum paging and Rowid paging.
Almost all the documents which i read all announce that rowid is a better method then rownum for paging.
They said:
For the first a few pages, two methods have similiar performance.But as long as the page number become bigger and bigger, rowid paging perform better and better then rownum paging.



Pay attention to the red part. Is that really true?  Let's make a experiment.

First test in 11g. Create test table as below:
11gR2 >create table test(id number,status VARCHAR2(7),type VARCHAR2(19),created date);
Table created.

11gR2 >insert into test select OBJECT_ID,STATUS,OBJECT_TYPE,CREATED from dba_objects;
12926 rows created.

11gR2 >alter table test modify created not null;
Table altered.

11gR2 >create index test_ind1 on test(CREATED);
Index created.

11gR2 >ANALYZE TABLE TEST compute statistics;
Table analyzed.


We are going to read row 900-1000 from table test order by created.

Test 11g paging with rownum method:
11gR2 >select *
       from (
             select rownum rn,t.*
             from
             (select  id,status,type,created from test order by created) t
             where rownum<1000)
       where rn >900; 
99 rows selected.
--------------------------------------------------------------------------------------------
| Id  | Operation                      | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |           |   999 | 50949 |     9   (0)| 00:00:01 |
|*  1 |  VIEW                          |           |   999 | 50949 |     9   (0)| 00:00:01 |
|*  2 |   COUNT STOPKEY                |           |       |       |            |          |
|   3 |    VIEW                        |           |   999 | 37962 |     9   (0)| 00:00:01 |
|   4 |     TABLE ACCESS BY INDEX ROWID| TEST      | 12926 |   277K|     9   (0)| 00:00:01 |
|   5 |      INDEX FULL SCAN           | TEST_IND1 |   999 |       |     4   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         22  consistent gets
          0  physical reads
          0  redo size
       3703  bytes sent via SQL*Net to client
        590  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         99  rows processed


Test 11g paging with rowid method:
11gR2 >select /*+ ordered use_nl(p s) */ *
      from (
            select rownum rn,rd
            from (select  rowid rd from test order by created)
            t where rownum<1000) p,
            test s
      where rn>900 and p.rd=s.rowid;  2    3    4    5    6    7
99 rows selected.
-----------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |   999 | 52947 |  1003   (0)| 00:00:13 |
|   1 |  NESTED LOOPS               |           |   999 | 52947 |  1003   (0)| 00:00:13 |
|*  2 |   VIEW                      |           |   999 | 24975 |     4   (0)| 00:00:01 |
|*  3 |    COUNT STOPKEY            |           |       |       |            |          |
|   4 |     VIEW                    |           |   999 | 11988 |     4   (0)| 00:00:01 |
|   5 |      INDEX FULL SCAN        | TEST_IND1 | 12926 |   239K|     4   (0)| 00:00:01 |
|   6 |   TABLE ACCESS BY USER ROWID| TEST      |     1 |    28 |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         19  consistent gets
          0  physical reads
          0  redo size
       5450  bytes sent via SQL*Net to client
        590  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         99  rows processed


From the red part we can see rowid paging(19 logic read)  do  better then  rownum paging(22 logic read).
The reason is rowid method reduce the "TABLE ACCESS BY ROWID" for first 900 rows.


Let's check in 10g:
Build the Test table as excatly the same with 11g, created the index and gathered stats.

Test 11g paging with rownum method:
10gR2 >select *
       from (
             select rownum rn,t.*
             from
             (select  id,status,type,created from test order by created) t
             where rownum<1000)
       where rn >900; 
99 rows selected.
--------------------------------------------------------------------------------------------
| Id  | Operation                      | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |           |   999 | 50949 |    17   (0)| 00:00:01 |
|*  1 |  VIEW                          |           |   999 | 50949 |    17   (0)| 00:00:01 |
|*  2 |   COUNT STOPKEY                |           |       |       |            |          |
|   3 |    VIEW                        |           |   999 | 37962 |    17   (0)| 00:00:01 |
|   4 |     TABLE ACCESS BY INDEX ROWID| TEST      | 45620 |  1113K|    17   (0)| 00:00:01 |
|   5 |      INDEX FULL SCAN           | TEST_IND1 |   999 |       |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         19  consistent gets
          0  physical reads
          0  redo size
       3842  bytes sent via SQL*Net to client
        558  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         99  rows processed


19 logic read, very close to 11g. It is a normal value and close to our expected.

Test 11g paging with rowid method:
10gR2 >select /*+ ordered use_nl(p s) */ *
      from (
            select rownum rn,rd
            from (select  rowid rd from test order by created)
            t where rownum<200) p,
            test s
      where rn>100 and p.rd=s.rowid;
99 rows selected.
-----------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |   199 | 11144 |   261   (1)| 00:00:04 |
|   1 |  NESTED LOOPS               |           |   199 | 11144 |   261   (1)| 00:00:04 |
|*  2 |   VIEW                      |           |   199 |  4975 |    62   (2)| 00:00:01 |
|*  3 |    COUNT STOPKEY            |           |       |       |            |          |
|   4 |     VIEW                    |           | 45620 |   534K|    62   (2)| 00:00:01 |
|   5 |      INDEX FULL SCAN        | TEST_IND1 | 45620 |   846K|    62   (2)| 00:00:01 |
|   6 |   TABLE ACCESS BY USER ROWID| TEST      |     1 |    31 |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        108  consistent gets
          0  physical reads
          0  redo size
       5252  bytes sent via SQL*Net to client
        558  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         99  rows processed


wow, why logic turn to 108? The execution plan is same with 11g, but why so big difference in logic read?
      rownum    rowid
11g:      22       19
10g:      19      108


Why?
Below is the Reason:
11gR2 >select rowid,id from test where rowid in ('AAADSHAABAAAH3hAAA','AAADSHAABAAAH3hAAB','AAADSHAABAAAH3hAAC','AAADSHAABAAAH3hAAD','AAADSHAABAAAH3hAAE','AAADSHAABAAAH3hAAF','AAADSHAABAAAH3hAAG','AAADSHAABAAAH3hAAH','AAADSHAABAAAH3hAAI','AAADSHAABAAAH3hAAJ');
ROWID                      ID
------------------ ----------
AAADSHAABAAAH3hAAA         20
AAADSHAABAAAH3hAAB         46
AAADSHAABAAAH3hAAC         28
AAADSHAABAAAH3hAAD         15
AAADSHAABAAAH3hAAE         29
AAADSHAABAAAH3hAAF          3
AAADSHAABAAAH3hAAG         25
AAADSHAABAAAH3hAAH         41
AAADSHAABAAAH3hAAI         54
AAADSHAABAAAH3hAAJ         40
10 rows selected.
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          2  consistent gets
          0  physical reads
          0  redo size
        875  bytes sent via SQL*Net to client
        524  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         10  rows processed


10gR2 >select rowid,id from test where rowid in ('AAAzo5AABAAAKnWAAA','AAAzo5AABAAAKnWAAB','AAAzo5AABAAAKnWAAC','AAAzo5AABAAAKnWAAD','AAAzo5AABAAAKnWAAE','AAAzo5AABAAAKnWAAF','AAAzo5AABAAAKnWAAG','AAAzo5AABAAAKnWAAH','AAAzo5AABAAAKnWAAI','AAAzo5AABAAAKnWAAJ');
ROWID                      ID
------------------ ----------
AAAzo5AABAAAKnWAAA         30
AAAzo5AABAAAKnWAAB          8
AAAzo5AABAAAKnWAAC         14
AAAzo5AABAAAKnWAAD         34
AAAzo5AABAAAKnWAAE         45
AAAzo5AABAAAKnWAAF         39
AAAzo5AABAAAKnWAAG         47
AAAzo5AABAAAKnWAAH         51
AAAzo5AABAAAKnWAAI         11
AAAzo5AABAAAKnWAAJ         48
10 rows selected.

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         10  consistent gets
          0  physical reads
          0  redo size
        861  bytes sent via SQL*Net to client
        492  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         10  rows processed


A Rowid is made by: OOOOOOFFFBBBBBBRRR
O is object id,F is file id,B is block id,R is row number.

From above we can see, read 10 rows from same block in 11g cost 2 block read while in 10g cost 10 block read.

Further research also confirmed in 10g, every rowid will gernerate an logic read, even all the rowid are come from same block.
But in 11g some new machanism are introduced, if next rowid is in the same block with current  row, then current block can be reuse.
More......

April 9, 2012

MV refresh cost long time due to internal recursive call

We have a job defined to do mutiple MV refresh. Most MVs can refresh normally, but there are two MVs cost very long time to complete.
No locks/latch contention found during the refresh.

Below is the log:
START_TIME          OBJECT_NAME                    TIME_COST(s)
------------------- ------------------------------ -------------
2012-01-06 00:45:44 MV_APP_CURR_ST_1_HR_YLD_AVAIL  17.65
2012-01-06 00:46:02 MV_APP_CURR_STATE_YIELD_AVAIL  2.7
2012-01-06 00:46:05 MV_APP_CURRENT_STATE           220.34

2012-01-06 00:49:45 MV_APP_CURR_POLE_1_HR_Y_AVAIL  22.93
2012-01-06 00:50:08 MV_APP_CURR_STATE_POLE_Y_AVAIL 3.62
2012-01-06 00:50:12 MV_APP_CURRENT_STATE_POLE      
125.59

From above we can see MV_APP_CURRENT_STATE and MV_APP_CURRENT_STATE_POLE cost very long time.
When check these MVs, found their definition are simply. I tried to run MVs' SELECT part, both can return all record in 1 second.

But they will take 100+ seconds when refresh the MV with below sentence:
dbms_mview.refresh('"GAPC"."MV_APP_CURRENT_STATE_POLE"', atomic_refresh ='FALSE');

After make an 10046 trace, found during the MV refresh, below internal recursive SQL cost most of time:
********************************************************************************
SQL ID: 6v8ygt9zbxsnh Plan Hash: 2840274064
select count(*), max(scn)
from
( select d.scn scn from sys.sumdelta$ d, sys.sumdep$ o
where o.sumobj# = :1 and o.p_obj# = d.tableobj# and d.spare1 =
0 and d.scn > :2 union all select m.commit_scn scn
from sys.sumdelta$ d, sys.sumdep$ o, sys.snap_xcmt$ m where
o.sumobj# = :1 and o.p_obj# = d.tableobj# and d.spare1 = 1 and
d.xid = m.xid and m.commit_scn > :2 )
call    count  cpu      elapsed    disk       query      current    rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse   7      0.00     0.00       0          1          0          0
Execute 7      0.00     0.00       0          0          0          0
Fetch   7      31.25    367.18     99567      127778     0          7
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total   21     31.26    367.19     99567      127779     0          7


Notice red parts, above internel recursive SQL cost 367 seconds while total MV refresh time is 370 seconds.

After checking above SQL, find its slowness mainly caused by table sumdelta$ which is 1.5GB big:
SQL> select TABLEOBJ#,count(*) from sys.sumdelta$ group by  TABLEOBJ#;
TABLEOBJ#    COUNT(*)
---------- ----------
  11784739    1536910
   2857695     449018
   7506186    8808703


This table contains 10 millions records. And most record are last year's.

SQL> select owner,object_id,object_type,object_name,status from dba_objects where object_id in (11784739,2857695,7506186);
OWNER       OBJECT_ID OBJECT_TYPE         OBJECT_NAME                    STATUS
---------- ---------- ------------------- ------------------------------ -------
GAPC         11784739 TABLE               MV_EVENT_METER_ONE_DAY         VALID
GAPC          2857695 TABLE               MV_APP_CURRENT_ALERT_STATE     VALID


These records in sys.sumdelta$ do related to our issue MVs.

Till now the issue is clearly. Due to some bug, expired data not purged normally in sys.sumdelta$.
This table grew bigger and bigger, and make the SQL related to it become rather slow.


After truncate this table sys.sumdelta$.
Those issue MVs finished in a few seconds.
More......

April 6, 2012

buffer busy waits, resmgr:cpu quantum, and Suspending MMON action ..

When checking on one DB, below queries hang:
bdr1 >select s.sid,p.SPID,s.serial# from v$session s, v$process p where s.PADDR=p.ADDR and s.sid= (select sid from v$mystat where rownum<2);
SID SPID  SERIAL#
--- ----- ------
591 10422 25701

bdr1 >SELECT INST_ID,DECODE(request,0,'Holder: ','Waiter: ')||sid sess, id1, id2, lmode, request, type,ctime
FROM gV$LOCK WHERE (id1, id2, type) IN (SELECT id1, id2, type from gV$LOCK WHERE request>0)
ORDER BY id1, request;

---Hang there

After waited 5 minutes but no lucky, i decide to have a check.
First check the wait event:
bdr1 >select event from v$session where sid=591;
EVENT
------------------------------------------
buffer busy waits

bdr1 >select P1,P2,P3 from v$session where sid=591;
P1   P2      P3
---- ------- ----------
81   5925    21


Wait for 10 seconds, then query again:
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1   P2      P3
---- ------- ----------
81   5925    21


Above output means the session is waiting buffer busy waits on file 81 block 5925 for more then ten seconds.
It is so wired, i definitely haven't encounter this before. As we all known, buffer busy waits occurs when another session pining one block during an DML, and it is a very light operation. How can a session wait one block for ten seconds?

Let's check what the block is:
bdr1 >alter system dump datafile 81 block 5925;
System altered.


The trace file shows this block is an undo segment header block:
index state cflags wrap# uel scn dba parent-xid nub stmt_num cmt
-------------------------------------------------------------

0x00 9 0x00 0x30998 0x0013 0x0b4f.57013a1b 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332741051
0x01 9 0x00 0x30999 0x002f 0x0b4f.57013c9b 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332741600
0x02 9 0x00 0x30999 0x001d 0x0b4f.5701397a 0x198009af 0x0000.000.00000000 0x00000001 0x00000000 1332740819
0x03 9 0x00 0x3099a 0x0021 0x0b4f.57013d32 0x198009b2 0x0000.000.00000000 0x00000001 0x00000000 1332741601
0x04 9 0x00 0x30999 0x0020 0x0b4f.57013d05 0x198009b2 0x0000.000.00000000 0x00000004 0x00000000 1332741600
0x05 10 0x00 0x3099a 0x001d 0x0b4f.57015ee9 0x198009ab 0x0000.000.00000000 0x00000001 0x00000000 0
......

bdr1 >SELECT SEGMENT_TYPE,OWNER||'.'||SEGMENT_NAME FROM DBA_EXTENTS where file_id= 81 AND block_id BETWEEN 5925 AND 5925+BLOCKS -1;
SEGMENT_TYPE       OWNER||'.'||SEGMENT_NAME
------------------ ---------------------------------------------
TYPE2 UNDO         SYS._SYSSMU3_424634979$
TYPE2 UNDO         SYS._SYSSMU9_2057384446$


Tried to flush shared_pool and buffer_cache, but appear useless:
bdr1 >alter system flush shared_pool;
System altered.
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1   P2      P3
---- ------- ----------
81   5925    21


bdr1 >alter system flush buffer_cache;
System altered.
bdr1 >select P1,P2,P3 from v$session where sid=591;
P1   P2      P3
---- ------- ----------
81   5925    21


Let's check which session is holding that undo segment header block:
bdr1 >select BLOCKING_INSTANCE,BLOCKING_SESSION from v$session where sid=591;
BLOCKING_INSTANCE BLOCKING_SESSION
----------------- ----------------
1                 55

The block is holding by SID 55. Let's check the session 55:
SSID SERIAL# LOGON_TIME      STATE    STATUS  EVENT  PROGRAM
----- ---------- ----------- -------- -----   ------ --------
55   18167   26-MAR-12 02:00 WAITING  ACTIVE  resmgr:cpu quantum
oracle@ora03.globaltax.ge.com (M000)

Pay attention to the red part's event: resmgr:cpu quantum. It is related to resouce manager.

This session is MMON slave process.
Watched some time and confirmed this session is hang.

Let's check the alert.log and see weather we can find some clue:
Thu Mar 29 19:00:40 2012
Suspending MMON action 'AWR Auto CPU USAGE Task' for 82800 seconds
Thu Mar 29 19:10:42 2012
Suspending MMON slave action kewfmadsa_ for 82800 seconds
Thu Mar 29 23:05:58 2012
Suspending MMON action 'undo usage' for 82800 seconds

From the alert.log we can see that the MMON suspended, and unluckily at that moment it was pining the undo segment header block, and therefore blocking other sessions.

Suspending MMON is a new feature in 11G. If the system is so over-loaded that it takes over 15 minutes to gather statistics or other MMON tasks, this error is expected. It is a functionality enhancement in 11g, as it prevents MMON from locking resources those other processes might be waiting for. In 10g , mmon slaves are allowed to run indefinitely. 


Let's check the OS load:
bdr1 >ho free -m
             total       used       free     shared    buffers     cached
Mem:        128874     128107        766          0        302     117527
-/+ buffers/cache:      10277     118596
Swap:        66463       4350      62113

Cpu(s): 12.8%us,  2.4%sy,  0.0%ni, 84.4%id,  0.3%wa,  0.0%hi,  0.1%si,  0.0%st



The load is quite low, then why MMON suspend? 
The MMON is charge of AWR, let's check the AWR report:
Enter value for num_days: 3
Listing the last 3 days of Completed Snapshots
                                                        Snap
Instance     DB Name        Snap Id    Snap Started    Level
------------ ------------ --------- ------------------ -----
bdr1         BDR1             31208 24 Mar 2012 00:00      1
................
................
                              31251 25 Mar 2012 19:00      1
                              31252 25 Mar 2012 20:00      1
                              31253 25 Mar 2012 21:00      1
                              31254 25 Mar 2012 22:00      1
                              31255 25 Mar 2012 23:00      1
                              31256 26 Mar 2012 00:00      1
                              31257 26 Mar 2012 01:00      1


Specify the Begin and End Snapshot Ids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enter value for begin_snap:
bdr1 >
bdr1 >ho date
Fri Mar 30 05:41:54 EDT 2012

We can see currently is Mar30th, but the latest AWR report is gernerated at Mar26th. MMON facing the issue since then. 

Let's check the alert.log what happend at Mar26th:
Sun Mar 25 22:03:08 2012
DBMS_STATS: GATHER_STATS_JOB encountered errors.  Check the trace file.
ORA-20011: Approximate NDV failed: ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-04040: file outdata_to_show.txt in STADM_DIR not found
Mon Mar 26 02:00:00 2012
Closing scheduler window
Closing Resource Manager plan via scheduler window
Clearing Resource Manager plan via parameter
Setting Resource Manager plan SCHEDULER[0xD9D6EFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Mon Mar 26 02:28:21 2012
Suspending MMON action 'undo usage' for 82800 seconds
Mon Mar 26 03:10:25 2012
Process 0x0xdcbda2a8 appears to be hung in Auto SQL Tuning task
Current time = 1332745824, process death time = 1332745800
Attempting to kill process 0x0xdcbda2a8 with OS pid = 11769
OSD kill succeeded for process 0xdcbda2a8


Till now, we can obviously consider this issue somehow related to resouce manager.
 After created an SR, Oracle Support confirmed this is an resource manager's bug, the solution is to disable resource manager:
1. alter system set resource_manager_plan='';
2. For each window_name:
execute dbms_scheduler.set_attribute('WEEKNIGHT_WINDOW','RESOURCE_PLAN',''); 

More......