December 20, 2012

Performance Troubleshooting: How much can a SEQUENCE slow down a SQL

How much can a SEQUENCE drag down the speed of a SQL statement?
10%? 20%? The working case below shows a SEQUENCE slowing one SQL down by 600%.

While recently troubleshooting a job's performance issue, I noticed this SQL in the job block.
It is quite a simple SQL, a two-table join; the only special thing is that it uses a sequence, GGGC_SSBA.SEQ_MIM_CCCC_PPPP_SSS.NEXTVAL:

dwxxxx2 > SELECT /*+ USE_HASH(S1 TM)*/    GGGC_SSBA.SEQ_MIM_CCCC_PPPP_SSS.NEXTVAL        ,330404
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2100-12-31', 'YYYY-MM-DD')    ,S1.MATNR AS MATNR      ,S1.VKORG AS
  VKORG   ,S1.VTWEG AS VTWEG      ,S1.KONDM AS KONDM      ,S1.DWERK AS DWERK
  ,S1.VMSTA AS VMSTA      ,S1.VMSTD AS VMSTD      ,S1.LVORM AS LVORM
  ,S1.DT_TM_XTR AS DT_TM_XTR      ,S1.CDHDR_UDATE AS CDHDR_UDATE  
  ,S1.PROVG AS PROVG FROM    GEPC_TEMP_TABLE.KKK_MIM_XXXX_RRRR_SSS TM,
  GEPC_SOURCE.EEEIE_ZZZZZ_MMML_SLS S1 WHERE      TM.SOURCE_ROWID = S1.ROWID;

643418 rows selected.
Elapsed: 00:01:00.21

Execution Plan
----------------------------------------------------------
Plan hash value: 3276368232
--------------------------------------------------------------------------------------------
| Id  | Operation           | Name                  | Rows  | Bytes | Cost (%CPU)| Time    |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|   1 |  SEQUENCE           | SEQ_MIM_CCCC_PPPP_SSS |       |       |            |         |
|*  2 |   HASH JOIN         |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|   3 |    TABLE ACCESS FULL| KKK_MIM_XXXX_RRRR_SSS | 36837 |   359K|    29   (0)| 00:00:01|
|   4 |    TABLE ACCESS FULL| EEEIE_ZZZZZ_MMML_SLS  |   643K|    38M|   724   (3)| 00:00:14|
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("TM"."SOURCE_ROWID"="S1".ROWID)

Statistics
----------------------------------------------------------
     128804  recursive calls
      96793  db block gets
      77309  consistent gets
        661  physical reads
   21141944  redo size
   25998756  bytes sent via SQL*Net to client
     472358  bytes received via SQL*Net from client
      42896  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     643418  rows processed

From the output above we can see the SQL's elapsed time is one minute.
That is too long considering the data volume of the query's base tables; it should not take that much time.

We also notice the very large "recursive calls" and "db block gets" statistics.
It is very clear that they are caused by the sequence in the SQL.

So let's remove the sequence from the query and check again:
dwxxxx2 > SELECT /*+ USE_HASH(S1 TM)*/    330404
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2100-12-31', 'YYYY-MM-DD')    ,S1.MATNR AS MATNR      ,S1.VKORG AS
  VKORG   ,S1.VTWEG AS VTWEG      ,S1.KONDM AS KONDM      ,S1.DWERK AS DWERK
  ,S1.VMSTA AS VMSTA      ,S1.VMSTD AS VMSTD      ,S1.LVORM AS LVORM
  ,S1.DT_TM_XTR AS DT_TM_XTR      ,S1.CDHDR_UDATE AS CDHDR_UDATE  
  ,S1.PROVG AS PROVG FROM    GEPC_TEMP_TABLE.KKK_MIM_XXXX_RRRR_SSS TM,
  GEPC_SOURCE.EEEIE_ZZZZZ_MMML_SLS S1 WHERE      TM.SOURCE_ROWID = S1.ROWID;

643418 rows selected.
Elapsed: 00:00:10.23

Execution Plan
----------------------------------------------------------
Plan hash value: 2074598172
-------------------------------------------------------------------------------------------
| Id  | Operation          | Name                  | Rows  | Bytes | Cost (%CPU)| Time    |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|*  1 |  HASH JOIN         |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|   2 |   TABLE ACCESS FULL| KKK_MIM_XXXX_RRRR_SSS | 36837 |   359K|    29   (0)| 00:00:01|
|   3 |   TABLE ACCESS FULL| EEEIE_ZZZZZ_MMML_SLS  |   643K|    38M|   724   (3)| 00:00:14|
-------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("TM"."SOURCE_ROWID"="S1".ROWID)

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      44998  consistent gets
          0  physical reads
          0  redo size
   21501260  bytes sent via SQL*Net to client
     472358  bytes received via SQL*Net from client
      42896  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     643418  rows processed

Without the sequence the query completes in 10 seconds.
In other words, the sequence slowed this SQL down by 600%.

Let's check the sequence attributes:
dwxxxx2 >select * from dba_sequences where SEQUENCE_NAME='SEQ_MIM_CCCC_PPPP_SSS';
SEQUENCE_OWNER SEQUENCE_NAME         INCREMENT_BY C O CACHE_SIZE LAST_NUMBER
-------------- --------------------- ------------ - - ---------- -----------
GGGC_SSBA      SEQ_MIM_CCCC_PPPP_SSS            1 N N         20   109594869

The CACHE_SIZE is 20.
20 is the default value, and it is not suitable in our case.
I increased the cache to 500 and re-executed the SQL.
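Increasing the cache does not require recreating the sequence; a simple ALTER is enough (a sketch using the sequence and the value from this case):
ALTER SEQUENCE GGGC_SSBA.SEQ_MIM_CCCC_PPPP_SSS CACHE 500;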

This time:
dwxxxx2 > SELECT /*+ USE_HASH(S1 TM)*/    GGGC_SSBA.SEQ_MIM_CCCC_PPPP_SSS.NEXTVAL        ,330404
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2012-12-18 06:45:18', 'YYYY-MM-DD HH24:MI:SS')
  ,TO_DATE('2100-12-31', 'YYYY-MM-DD')    ,S1.MATNR AS MATNR      ,S1.VKORG AS
  VKORG   ,S1.VTWEG AS VTWEG      ,S1.KONDM AS KONDM      ,S1.DWERK AS DWERK
  ,S1.VMSTA AS VMSTA      ,S1.VMSTD AS VMSTD      ,S1.LVORM AS LVORM
  ,S1.DT_TM_XTR AS DT_TM_XTR      ,S1.CDHDR_UDATE AS CDHDR_UDATE  
  ,S1.PROVG AS PROVG FROM    GEPC_TEMP_TABLE.KKK_MIM_XXXX_RRRR_SSS TM,
  GEPC_SOURCE.EEEIE_ZZZZZ_MMML_SLS S1 WHERE      TM.SOURCE_ROWID = S1.ROWID;

643418 rows selected.
Elapsed: 00:00:29.60

Execution Plan
----------------------------------------------------------
Plan hash value: 3276368232
--------------------------------------------------------------------------------------------
| Id  | Operation           | Name                  | Rows  | Bytes | Cost (%CPU)| Time    |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|   1 |  SEQUENCE           | SEQ_MIM_CCCC_PPPP_SSS |       |       |            |         |
|*  2 |   HASH JOIN         |                       | 36837 |  2590K|   759   (3)| 00:00:14|
|   3 |    TABLE ACCESS FULL| KKK_MIM_XXXX_RRRR_SSS | 36837 |   359K|    29   (0)| 00:00:01|
|   4 |    TABLE ACCESS FULL| EEEIE_ZZZZZ_MMML_SLS  |   643K|    38M|   724   (3)| 00:00:14|
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("TM"."SOURCE_ROWID"="S1".ROWID)

Statistics
----------------------------------------------------------
       5153  recursive calls
       3870  db block gets
      46287  consistent gets
          0  physical reads
     844460  redo size
   25998756  bytes sent via SQL*Net to client
     472358  bytes received via SQL*Net from client
      42896  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     643418  rows processed

The SQL now completes in 30 seconds. Also, "db block gets" and "recursive calls" are much smaller than in the first scenario, where the cache was 20.

Still, 30 seconds means the sequence is slowing the SQL down by a factor of three.
That is still not reasonable.
The DB has multiple similar queries, and I don't have enough free time to look deeper into this. I may raise a ticket with Oracle Support to check it once I get some free time.

December 7, 2012

MMON Slave Process Reports ORA-7445 [kglic0] Error, Plus Database Hangs

A DB suddenly hung during a load, and the user came to me.
When I logged in to the server, I could see the affected DB consuming 100% of the CPU.

Trying "sqlplus / as sysdba" also hung.
I generated a systemstate dump and noticed that PMON was blocked:

PROCESS 2: PMON
    O/S info: user: oracle, term: UNKNOWN, ospid: 1193 
    OSD pid info: Unix process pid: 1193, image: oracle@alpcisddb484.corporate.ge.com (PMON)
      waiting for 60108b48 Child shared pool level=7 child#=3 
        Location from where latch is held: kgh.h LINE:6387 ID:kghalo: 
        Context saved from call: 0
        state=busy [holder orapid=89] wlstate=free [value=0]
          waiters [orapid (seconds since: put on list, posted, alive check)]:
possible holder pid = 89 ospid=10478

Let's check which session is holding latch 60108b48, and what that session is waiting for:
PROCESS 89: 
    O/S info: user: oracle, term: UNKNOWN, ospid: 10478 
    OSD pid info: Unix process pid: 10478, image: oracle@alpcisddb484.corporate.ge.com
      holding    (efd=8) 60108b48 Child shared pool level=7 child#=3 
    Current Wait Stack:
     0: waiting for 'library cache: mutex X'
        idn=0xfc57b2b7, value=0xd100000000, where=0x4f
        wait_id=57 seq_num=60 snap_id=1
        wait times: snap=45 min 11 sec, exc=45 min 11 sec, total=45 min 11 sec
        wait times: max=infinite, heur=45 min 11 sec
        wait counts: calls=0 os=0
        in_wait=1 iflags=0x5a2

Let's see which session is holding mutex 0xfc57b2b7 and what that session is waiting for:
PROCESS 42: M000
    SO: 0x19f40dda8, type: 4, owner: 0x19f2de3f8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x19f2de3f8, name=session, file=ksu.h LINE:12459, pg=0
    (session) sid: 209 ser: 521 trans: (nil), creator: 0x19f2de3f8
    Current Wait Stack:
      Not in wait; last wait ended 45 min 10 sec ago 

OK, so far we have found that the root blocker is M000.
It is holding the mutex:
PROCESS 42: M000
    SO: 0x19f40dda8, type: 4, owner: 0x19f2de3f8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
     proc=0x19f2de3f8, name=session, file=ksu.h LINE:12459, pg=0
    (session) sid: 209 ser: 521 trans: (nil), creator: 0x19f2de3f8
    Current Wait Stack:
      Not in wait; last wait ended 45 min 10 sec ago 
      KGL-UOL SO Cache(total=240, free=0)
      KGX Atomic Operation Log 0x17d569fc0
       Mutex 0x19eb2b4c8(209, 0) idn 1fc57b2b7 oper EXCL
       Library Cache uid 209 efd 7 whr 49 slp 0
       oper=0 pt1=(nil) pt2=(nil) pt3=(nil)

But the above also shows that M000 itself has not been in a wait for the last 45 minutes. What is M000 doing, and why has it not released that mutex all this time?
That is unreasonable.

After reviewing the process state of M000, I noticed the process is already dead:
SO: 0x19f2de3f8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x19f2de3f8, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:42, ser:16, calls cur/top: 0x1908c1eb8/0x1906a28f8
            flags : (0x3) DEAD
            flags2: (0x8030),  flags3: (0x0) 
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
    Cleanup details:
      Found dead = 44 min 27 sec ago
      Total Cleanup attempts = 10, Cleanup time = 15 min 49 sec, Cleanup timer = 15 min 45 sec
      Last attempt (full) ended 54 sec ago, Length = 2 min 30 sec, Cleanup timer = 2 min 30 sec

OK. Now it is clear to us that M000's death led to the DB hang.
But why did M000 suddenly die?
After checking the alert.log I found the following:
Thu Dec 06 22:00:48 2012
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x1060] [PC:0x90ED147, kglic0()+1075] [flags: 0x0, count: 1]
Errors in file /x484/dump01/oracle/hrxt/diag/rdbms/hrxdt/hrxdt_m000_10618.trc  (incident=6337):
ORA-07445: exception encountered: core dump [kglic0()+1075] [SIGSEGV] [ADDR:0x1060] [PC:0x90ED147] [Address not mapped to object] []

Now everything is clear.
After searching MOS (Metalink) for the above symbol, our scenario exactly matches this bug:
MMON Slave Process Reports ORA-7445 [kglic0] Error, Plus Database Hangs [ID 1487720.1]

December 2, 2012

Solving ORA-1658 via errorstack

Sometimes an issue itself is not complicated, but it becomes hard for other reasons, such as a lack of necessary information.

For example, one day a user came to us and complained that their application job got the error below:
F|Nov 29 20:47:34|UNXSERVER|E|27451|GXXS01|conv|ABORTED-SQLCODE -1658 AT 1223 ErrText = |
F|Nov 29 20:47:34|UNXSERVER|E|27451|GXXS01|conv|ENDING DUE TO SERIOUS SQL ERROR|
F|Nov 29 20:47:34|UNXSERVER|E|27462|GXXS02|conv|ABORTED-SQLCODE -1658 AT 1223 ErrText = |
F|Nov 29 20:47:34|UNXSERVER|E|27462|GXXS02|conv|ENDING DUE TO SERIOUS SQL ERROR|


All they could offer us was the above error message and a DB name.
How should a DBA look into it?

The error above is not a standard DB error message, which would begin with "ORA-".
But from "SQLCODE -1658" we know it should be a DB error; it is not in the standard format because the application catches the error and re-throws it with its own exception definition.

Let's see what 1658 is:
server003[oracle]_db058> oerr ora 1658
01658, 00000, "unable to create INITIAL extent for segment in tablespace %s"
// *Cause: Failed to find sufficient contiguous space to allocate INITIAL
// extent for segment being created.
// *Action: Use ALTER TABLESPACE ADD DATAFILE to add additional space to the
// tablespace or retry with a smaller value for INITIAL


OK, it seems to be a simple tablespace free space issue.
But after checking the DB, we found no error message in the alert.log.
The database has 30+ similar, very tiny tablespaces.
So which tablespace caused the issue?

Tablespace                    Free MB   Total MB        PCT
------------------------- ----------- ---------- ----------
ICDM_20140801                       5         10         50
ICDM_20130901                       5         10         50
ICDM_20131201                       5         10         50
ICDM_20130501                       5         10         50
ICDM_20130701                       5         10         50
ICDM_20140101                       5         10         50
ICDM_20130801                       5         10         50
ICDM_20131001                       5         10         50
ICDM_20130301                       5         10         50
ICDM_20131101                       5         10         50
ICDM_20140201                       5         10         50
ICDM_20140501                       5         10         50
ICDM_20140601                       5         10         50
ICDM_20130401                       5         10         50
ICDM_20140301                       5         10         50
ICDM_20130601                       5         10         50
ICDM_20140401                       5         10         50
ICDM_20140701                       5         10         50
.............

The user still doesn't know anything: they got the error when running a job, but they don't know the current SQL, which table got the error, or which tablespace it is on.

Then I enabled an errorstack:
alter system set events '1658 trace name errorstack level 1';
The command above asks the database to generate an errorstack trace for error ORA-1658 at the DB level.

Then I asked the user to re-run their job.
Of course it failed again with ORA-1658, but with the help of the errorstack, this time we got the details in the alert.log:
Errors in file /server/backup/oracle/dump/getsstm2/diag/rdbms/db058/db058/trace/db058_ora_13006.trc:
ORA-01658: unable to create INITIAL extent for segment in tablespace FILE_20130101


OK, now we have the tablespace name: FILE_20130101.
After adding 4gb to FILE_20130101, the issue was solved.
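For completeness, the fix amounts to the standard commands below (a sketch; the datafile path is hypothetical, and the second line switches the errorstack event off again):
alter tablespace FILE_20130101 add datafile '/some/path/file_20130101_02.dbf' size 4g;
alter system set events '1658 trace name errorstack off';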



November 5, 2012

DEBUG RMAN: "RMAN-20052: invalid datafile create SCN" due to PLUGIN_SCN out of SYNC

The error below persisted for one database during a RESYNC operation:

RMAN> resync catalog;

starting full resync of recovery catalog
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of resync command on default channel at 11/05/2012 07:10:03
RMAN-20052: invalid datafile create SCN

Target DB version:
11.2.0.2.6

Catalog DB version:
11.2.0.2.6

Catalog schema version:
catalog > Select * from rcver ;
VERSION
------------
11.02.00.03


After comparing the target and the catalog, we found the create SCNs of all datafiles and tempfiles are in sync in both databases:
CATALOG DB:
catalog > select file#,CREATION_CHANGE# from RC_DATAFILE where DB_NAME='ATPPSS' order by 1;
     FILE# CREATION_CHANGE#
---------- ----------------
         1                7
         2             2178
         3           969548
         4            18151
         5          1002153
         6          1053956
         7          1054333
         8          1501572
         9          1502015
        10          1506616
        11          1515561
        12          1516921
        13          5361004
        14          5779799
14 rows selected.

TARGET DB:
target > select file#,CREATION_CHANGE# from v$datafile;
     FILE# CREATION_CHANGE#
---------- ----------------
         1                7
         2             2178
         3           969548
         4            18151
         5          1002153
         6          1053956
         7          1054333
         8          1501572
         9          1502015
        10          1506616
        11          1515561
        12          1516921
        13          5361004
        14          5779799
14 rows selected.

After debugging the RMAN session, we can see the error is reported while DBMS_RCVCAT.CHECKDATAFILE is run for file# 13:
DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=rcvcat proc=RMANADMIN.DBMS_RCVCAT.CHECKDATAFILE excl: 0
   DBGRCVCAT: checkDatafile - size changed for file#: 12 from 131072 to 153600
   DBGRCVCAT: incResynActions - Not debugging
   DBGRCVCAT: setCloneName: file#=12, creation_fname=1516921, plugin_change#=0, old_clone_fname=, new_clone_fname=NONE
DBGRESYNC:     channel default: Resyncing datafile +DATA/atppss/datafile/example.3079.785280031 [23:38:41.110] (resync)
DBGRESYNC:     channel default: Calling checkDataFile for fileno 13 size 131072 [23:38:41.110] (resync)
DBGRPC:        krmxrpc - channel default kpurpc2 err=20052 db=rcvcat proc=RMANADMIN.DBMS_RCVCAT.CHECKDATAFILE excl: 97
DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=rcvcat proc=RMANADMIN.DBMS_RCVCAT.CANCELCKPT excl: 0
   DBGRCVCAT: cancelCkpt - rollback, released all locks
DBGRPC:        krmxrpc - channel default kpurpc2 err=0 db=target proc=SYS.DBMS_BACKUP_RESTORE.CFILEUSECURRENT excl: 0
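For reference, a debug trace like the one above can be produced by starting RMAN with the debug options (a sketch; the trace and log file names here are only illustrative):
rman target / catalog rmanadmin/rman_psswd@rmancatalog debug trace=resync_debug.trc log=resync_debug.log
RMAN> resync catalog;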

Below is the definition of error 20052 in DBMS_RCVCAT.CHECKDATAFILE:
PROCEDURE CHECKDATAFILE()
IF (plugged_readonly = 'NO' AND create_scn > this_ckp_scn) THEN              
   raise_application_error(-20052, 'Invalid datafile create SCN');           
ELSIF (plugged_readonly = 'YES' AND plugin_scn > this_ckp_scn) THEN          
   raise_application_error(-20055, 'Invalid datafile plugin SCN');           
END IF;                                                                      
...........
   IF (create_scn = dfRec.create_scn AND                                     
       plugin_scn = dfRec.plugin_scn) THEN                                   
       ...........
   ELSIF ((case when plugged_readonly = 'NO' then                            
           create_scn else plugin_scn end) >                                 
          (case when dfRec.plugged_readonly = 'NO' then                      
           dfRec.create_scn else dfRec.plugin_scn end)) THEN
        ...........                 
   ELSE -- (create_scn < dfRec.create_scn)                                   
      -- The client passed us a create SCN for this datafile that is         
      -- less than the SCN in the rcvcat.  I.e., the target database         
      -- controlfile now contains a previous incarnation of this datafile.   
      -- This can happen only if the user has some old controlfile, or       
      -- has done a resetlogs and not told us about it.                      
      IF (plugged_readonly = 'NO') THEN                                      
         raise_application_error(-20052, 'Invalid datafile create SCN');     
      ELSE                                                                   
         raise_application_error(-20055, 'Invalid datafile plugin SCN');     
      END IF;
    END IF;  

All of the judgement is done on three values: plugin_scn, plugged_readonly and create_scn. Let's check those three values in both the target DB and the catalog DB:
CATALOG DB:
catalog > select CREATE_SCN,PLUGGED_READONLY,PLUGIN_SCN from df where DBINC_KEY='3199420' and FILE# =13;
CREATE_SCN PLU PLUGIN_SCN
---------- --- ----------
   5361004 NO      992188

TARGET DB:
target > select CREATION_CHANGE#,PLUGGED_READONLY,PLUGIN_CHANGE# from v$datafile where file#=13;
CREATION_CHANGE# PLU PLUGIN_CHANGE#
---------------- --- --------------
         5361004 NO               0

Now we can clearly see that this PLUGIN_SCN difference is the reason we got the error.
Since the pluggable-database feature is new in 12c and not released in any current version yet,
every datafile's PLUGIN_CHANGE# in every current database should be 0.

The PLUGIN_SCN of 992188 in the CATALOG DB is obviously not correct. Un-registering and re-registering the DB in the catalog may or may not help to solve this issue.
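If we do try the re-registration route, the catalog-side steps are the standard ones (a sketch only; as noted above, this may or may not clear the stale PLUGIN_SCN):
RMAN> unregister database;
RMAN> register database;
Both commands are run while connected to the target and the recovery catalog; REGISTER performs a full resync afterwards.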

October 29, 2012

LMON (ospid: 11863): terminating the instance due to error 481

In this two-node RAC, a reconfiguration suddenly began and very soon the first instance terminated with the error:
LMON (ospid: 11863): terminating the instance due to error 481

So we have two questions:
1. Why did the reconfiguration suddenly begin?
2. Why did the first instance terminate during the reconfiguration?

1. Why did the reconfiguration suddenly begin?
From the first node's LMON trace file:

* DRM RCFG called (swin 0) 

*** 2012-10-22 01:23:17.277
  CGS recovery timeout = 85 sec
Begin DRM(27911) (swin 0)
* drm quiesce

*** 2012-10-22 01:25:28.038
* Request pseudo reconfig due to drm quiesce hang 

*** 2012-10-22 01:25:28.051
kjxgmrcfg: Reconfiguration started, type 6
CGS/IMR TIMEOUTS:
  CSS recovery timeout = 31 sec (Total CSS waittime = 65)
  IMR Reconfig timeout = 75 sec
  CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 94 0.
...............

From the above we can see it was due to a DRM (dynamic block re-mastering) quiesce timeout, hence the reconfiguration began.

We can also see the reconfiguration type is 6.
We usually see types 1, 2 and 3 as reconfiguration reasons; type 6 is not a common one.
After checking with Oracle Support, they confirmed that type 6 means DRM.

OK. So now the cause of the reconfiguration is clear: a DRM quiesce timeout.

Next, let's see why the first instance terminated during the reconfiguration step.
Since this reconfiguration was triggered neither by an interconnect failure nor by a controlfile heartbeat issue, instance eviction was not absolutely required.

From the LMON trace below we can see why LMON terminated the instance:
*** 2012-10-22 01:25:29.650
 All grantable enqueues granted
2012-10-22 01:25:29.650271 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY 

*** 2012-10-22 01:30:56.000
2012-10-22 01:30:56.000338 : * kjfclmsync: waited 327 secs for lmses to finish parallel rcfg work, terminating instance 
kjzduptcctx: Notifying DIAG for crash event
...............

From the trace it is clear that during the reconfiguration, the LMS processes failed to finish their parallel work (maybe hung, blocked on some resource, or hitting a bug) and timed out, hence LMON terminated the instance.

Next we should check the LMS trace file. However, since max_dump_file_size is set for this DB, the LMS trace file had already reached its maximum size and had not been updated for a long time, so there is no information in the LMS trace for us to refer to.

The issue can be avoided by setting the parameter below to disable DRM:
_gc_read_mostly_locking=FALSE
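A minimal sketch of setting that underscore parameter (note the double quotes required around hidden parameter names; it takes effect after an instance restart):
ALTER SYSTEM SET "_gc_read_mostly_locking"=FALSE SCOPE=SPFILE SID='*';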


October 11, 2012

A few troubles when doing PSU patching on AIX RAC

While helping a friend patch an AIX RAC cluster, I ran into a few problems; I am recording them here.

1. opatch failed, reporting:

Required amount of space(23428.225MB) is not available.
UtilSession failed: 
Prerequisite check "CheckSystemSpace" failed.
Log file location: /oracle/app/11.2.0/grid/cfgtoollogs/opatch/opatch2012-10-07_18-26-07PM_1.log
OPatch failed with error code 73
The mount point had 20gb of free space but we still got the error.
For this bug we can use the parameter below to skip the space check when running opatch:
OPatch.SKIP_VERIFY_SPACE=true
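The property is passed directly on the opatch command line, for example (a sketch; add your usual apply options):
opatch apply OPatch.SKIP_VERIFY_SPACE=true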

2. opatch failed, reporting:
Patching component oracle.rdbms, 11.2.0.3.0...

Copy failed from '/oracle/11.2.0.3.3PSU/13919095/custom/server/13919095/files/lib/libclsra11.so' to '/oracle/app/11.2.0/grid/lib/libclsra11.so'...  
Please verify all applications associated with the Oracle Home '/oracle/app/11.2.0/grid' are shut down. If this is AIX, please perform solution documented in Note 739963.1 on https://myoraclesupport.oracle.com.

Do you want to retry copying the file? [y|n]

opatch reported a file as in use, while the fuser command showed no process holding the file.
It is a known issue on AIX, and the fix is to run the command below as root:
/usr/sbin/slibclean

4. opatch failed, reporting:
.storage is locked by another opatch session.
This is because another opatch session had just terminated due to the third issue and left its lock file there.
The solution is to remove the *lock* file under $ORACLE_HOME/.storage.

5. Finally the patch succeeded, but starting the grid services on the second node failed with the error below:
2012-10-07 13:23:18.069: [    AGFW][5912] {0:0:31} Created alert :  (:CRSAGF00123:) :  Failed to start the agent process:  /oracle/grid/11.2.0.3/bin/oraagent Category: -2 Operation: setpcred Loc:  spawnproc26 OS error: 2 Other : child couldn't setpcred

On MOS there is only one bug report with the same error message and the same scenario as ours: CRS failed to start up after patching.

No workaround.




September 22, 2012

Cluster on first node failed to start up after reboot due to GPNP hang on third node

In a PSU activity on a 3-node RAC, the cluster on the first node failed to start up after patching.

Below is the error in alertnode1.log:

2012-09-22 03:01:39.894
[ohasd(28563)]CRS-2112:The OLR service started on node cintrnpdb001.
2012-09-22 03:01:39.917
[ohasd(28563)]CRS-1301:Oracle High Availability Service started on node cintrnpdb001.
2012-09-22 03:01:39.918
[ohasd(28563)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2012-09-22 03:01:53.765
[gpnpd(28771)]CRS-2328:GPNPD started on node cintrnpdb001.
2012-09-22 03:01:56.228
[cssd(28831)]CRS-1713:CSSD daemon is started in clustered mode
2012-09-22 03:11:55.298
[/prod/product/oracle/11.2.0.2/grid/bin/cssdagent(28819)]CRS-5818:Aborted command 'start for resource: ora.cssd 1 1' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /prod/product/oracle/11.2.0.2/grid/log/cintrnpdb001/agent/ohasd/oracssdagent_root/oracssdagent_root.log.
2012-09-22 03:11:55.299
[cssd(28831)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /prod/product/oracle/11.2.0.2/grid/log/cintrnpdb001/cssd/ocssd.log
2012-09-22 03:11:55.299
[cssd(28831)]CRS-1603:CSSD on node cintrnpdb001 shutdown by user.


During the whole time the ocssd.log kept throwing the error below:
2012-09-22 03:00:00.546: [    GPNP][2692574496]clsgpnpm_newWiredMsg: [at clsgpnpm.c:741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http://www.grid-pnp.org/2005/12/gpnp-errors#"]
2012-09-22 03:00:02.558: [    GPNP][2692574496]clsgpnpm_newWiredMsg: [at clsgpnpm.c:741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http://www.grid-pnp.org/2005/12/gpnp-errors#"]
2012-09-22 03:00:04.568: [    GPNP][2692574496]clsgpnpm_newWiredMsg: [at clsgpnpm.c:741] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http://www.grid-pnp.org/2005/12/gpnp-errors#"]

The information above shows that cssd failed to get the GPnP profile.

Let's check gpnpd to see what's wrong:
2012-09-22 02:46:51.594: [    GPNP][1088584000]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2104] put-profile call to url "tcp://cintrnpdb003:33612" disco "mdns:service:gpnp._tcp.local.://cintrnpdb003:33612/agent=gpnpd,cname=crs_transp,host=cintrnpdb003,pid=20890/gpnpd h:cintrnpdb003 c:crs_transp" [f=0 claimed- host:cintrnpdb001 cname:crs_transp seq:6 auth:CN=GPnP_peer]
2012-09-22 02:47:05.424: [  OCRMSG][1085512000]GIPC error [29] msg [gipcretConnectionRefused]
2012-09-22 02:47:26.446: [  OCRMSG][1085512000]GIPC error [29] msg [gipcretConnectionRefused]

From the above we can see that when the first node's gpnpd started up, it sent a call to the third node's gpnpd but got no response.

So at this point we can suspect that the issue is a problem with gpnpd on the third node.

After I killed gpnpd.bin on the third node and let it restart, the cluster on the first node started up successfully immediately.
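A minimal sketch of that step on the third node (the pid comes from the process list; as described above, gpnpd was simply allowed to restart after the kill):
ps -ef | grep gpnpd.bin
kill -9 <gpnpd.bin pid>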

September 12, 2012

CSSD terminated from clssnmvDiskPingMonitorThread without disk timeout countdown

The clusterware on one node suddenly terminated.
Below is the message in the cluster alert.log:

2012-09-11 11:41:30.328
[ctssd(22428)]CRS-2409:The clock on host node8 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchroniza
tion Service is running in observer mode.
2012-09-11 12:30:03.122
[cssd(21061)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to
ensure data integrity; details at (:CSSNM00018:) in /prod/grid/11.2.0/grid/log/node8/cssd/ocssd.log
2012-09-11 12:30:03.123
[cssd(21061)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /prod/grid/11.2.0/grid/log/node8/cssd/ocssd.log 
2012-09-11 12:30:03.233 [cssd(21061)]CRS-1652:Starting clean up of CRSD resources.

Let's check ocssd.log:
2012-09-11 12:30:02.543: [    CSSD][1113450816]clssnmSendingThread: sending status msg to all nodes
2012-09-11 12:30:02.543: [    CSSD][1113450816]clssnmSendingThread: sent 5 status msgs to all nodes
2012-09-11 12:30:03.122: [    CSSD][1082489152](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1 configured voting disks available, need 1
2012-09-11 12:30:03.123: [    CSSD][1082489152]###################################
2012-09-11 12:30:03.123: [    CSSD][1082489152]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
2012-09-11 12:30:03.123: [    CSSD][1082489152]###################################
2012-09-11 12:30:03.123: [    CSSD][1082489152](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

So it looks like an I/O issue against the voting disk.
But after checking, we did not find any abnormal or error message at either the OS level or the storage side.
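As a quick cross-check from the cluster side, the voting disk configuration and state can be listed with the standard command:
crsctl query css votedisk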

Then I reviewed the ocssd.log and, to my surprise, found that there is no long-disk-timeout countdown in the log.

We know that if ocssd fails to reach the voting disk, it starts a 200-second countdown (the long disk timeout). Only if the voting disk is still unavailable after 200 seconds does ocssd terminate itself.
The countdown information should look like:
clssscMonitorThreads clssnmvDiskPingThread not scheduled for 16020 msecs

But in our case there is no such countdown; ocssd just terminated from clssnmvDiskPingMonitorThread all of a sudden.
It should be a bug rather than a voting disk I/O issue. I will raise an SR with Oracle Support for further checking.

September 5, 2012

RAC: "CRS-0184: Cannot communicate with the CRS daemon“ due to EVMD

One node of a Rac server get some trouble, its CRSD keep terminating and rebooting. When try to query CRSD will get error “Cannot communicate with the CRS daemon”.

The error below repeats in alertnode1.log:

2012-08-28 19:32:32.995
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-28 19:32:38.319
[crsd(13117)]CRS-1012:The OCR service started on node racnode001.
2012-08-28 19:42:54.134
[crsd(13117)]CRS-0810:Cluster Ready Service aborted due to failure to communicate 
with Event Management Service with error [1]. 
Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-28 19:42:54.804
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-28 19:42:59.175
[crsd(10695)]CRS-1012:The OCR service started on node cihcispdb001.
2012-08-28 19:53:15.226
[crsd(10695)]CRS-0810:Cluster Ready Service aborted due to failure to communicate 
with Event Management Service with error [1]. 
Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-28 19:53:15.575
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.

From the log above we can see CRSD keeps failing and restarting.

Let's check crsd.log for the details:

2012-08-29 18:07:25.462: [ default][3503137264] CRS Daemon Starting
2012-08-29 18:07:25.464: [ default][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:07:25.465: [ default][3503137264] ENV Logging level for Module: AGENT  1
2012-08-29 18:07:25.465: [ default][3503137264] ENV Logging level for Module: AGFW  0
.............
.............
2012-08-29 18:15:40.648: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:15:55.584: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:16:11.408: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:16:25.586: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:16:41.668: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:16:55.597: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:17:12.929: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:17:25.609: [ CRSMAIN][1106962752] Policy Engine is not initialized yet!
2012-08-29 18:17:43.190: [ COMMCRS][3503137264]Authorization failed, network error
2012-08-29 18:17:43.191: [ CRSMAIN][3503137264] Created alert : (:CRSD00120:) :  Could not init EVM, error: 1
2012-08-29 18:17:43.191: [    CRSD][3503137264][PANIC] CRSD exiting: Could not init EVMMgr.1
2012-08-29 18:17:43.191: [    CRSD][3503137264] Done.
2012-08-29 18:17:44.287: [ default][3316335088] First attempt: init CSS context succeeded.
2012-08-29 18:17:44.290: [  clsdmt][1078745408]PID for the Process [29342], connkey 1
2012-08-29 18:17:44.291: [  clsdmt][1078745408]Creating PID [29342] file for home /prod/grid/11.2.0.2/grid 
host racnode001 bin crs to /prod/grid/11.2.0.2/grid/crs/init/
2012-08-29 18:17:44.291: [  clsdmt][1078745408]Writing PID [29342] to the file 
[/prod/grid/11.2.0.2/grid/crs/init/racnode001.pid]
2012-08-29 18:17:45.182: [ default][1078745408] Policy Engine is not initialized yet!

We can see that when CRSD starts up, it fails at the "init EVM" (Event Manager) step. Let's check what's wrong with the Event Manager.
Below is from evmd.log:

2012-08-29 16:12:18.636: [ COMMCRS][54275088]clsc_send: waiting for space to write 
on icon (0x9d6fdc0). 0/88 sent
2012-08-29 16:12:26.660: [ COMMCRS][54275088]clsc_send: waiting for space to write 
on icon (0x9d6fdc0). 0/88 sent
2012-08-29 16:12:34.676: [ COMMCRS][54275088]clsc_send: waiting for space to write 
on icon (0x9d6fdc0). 0/88 sent
2012-08-29 16:12:42.711: [ COMMCRS][54275088]clsc_send: waiting for space to write 
on icon (0x9d6fdc0). 0/88 sent

EVMD keeps reporting the error message above. It means that when it tries to write the icon under "/prod/grid/11.2.0.2/grid/auth/evm", it fails due to lack of space.

But the mount point has enough space, so the possible reasons EVMD still keeps throwing the error are either:
1. a bug, or
2. at some moment in the past the mount point really was out of space and EVMD got stuck in that state; even though the space was released later, EVMD did not recover automatically.

The solution is simple: I reloaded EVMD (see the command sketch after the log excerpts below), and CRSD also became fine automatically:

2012-08-29 20:42:44.276
[crsd(6075)]CRS-1012:The OCR service started on node racnode001.
2012-08-29 20:52:58.684
[crsd(6075)]CRS-0810:Cluster Ready Service aborted due to failure to communicate with Event Management Service with error [1]. Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-29 20:52:58.850
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-29 20:53:03.131
[crsd(4699)]CRS-1012:The OCR service started on node racnode001.
2012-08-29 21:03:17.894
[crsd(4699)]CRS-0810:Cluster Ready Service aborted due to failure to communicate with Event Management Service with error [1]. Details at (:CRSD00120:) in /prod/grid/11.2.0.2/grid/log/racnode001/crsd/crsd.log.
2012-08-29 21:03:18.609
[ohasd(26038)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode001'.
2012-08-29 21:03:22.962
[crsd(28965)]CRS-1012:The OCR service started on node racnode001.
2012-08-29 21:11:48.965
[evmd(21764)]CRS-1401:EVMD started on node racnode001. 
2012-08-29 21:11:52.774
[crsd(28965)]CRS-1201:CRSD started on node racnode001.

Then CRSD became normal:
racnode001 [+ASM1] /prod/grid/11.2.0.2/grid/log/racnode001/evmd > crsctl check crs
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
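For reference, the EVMD reload can be done through its ohasd-managed init resource (a sketch for 11.2; the post above does not record the exact command that was used):
crsctl stop resource ora.evmd -init
crsctl start resource ora.evmd -init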


September 2, 2012

Blocker=Waiter: Self Dead Lock caused within one session

We all know about deadlocks: a deadlock happens when multiple sessions compete for resources.

In some particular cases, a deadlock can also be caused within a single session. Below is an example:

SQL> create index obj_ind4$ on obj$(name);
create index obj_ind4$ on obj$(name)
                          *
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-00060: deadlock detected while waiting for resource

Currently there is no lock on obj$, but when we try to create an index on it, it fails with a deadlock error.

Below is from the trace file:

*** 2012-07-02 09:13:45.155
*** SERVICE NAME:(SYS$USERS) 2012-07-02 09:13:45.154
*** SESSION ID:(135.46) 2012-07-02 09:13:45.154
DEADLOCK DETECTED
[Transaction Deadlock]
Current SQL statement for this session:
update obj$ set obj#=:6,type#=:7,ctime=:8,mtime=:9,stime=:10,status=:11,dataobj#=:13,flags=:14,oid$=:15,spare1=:16, spare2=:17 where owner#=:1 and name=:2 and namespace
=:3 and(remoteowner=:4 or remoteowner is null and :4 is null)and(linkname=:5 or linkname is null and :5 is null)and(subname=:12 or subname is null and :12 is null)
The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:
Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-00000012-00000000        21     135     S             21     135          SX
session 135: DID 0001-0015-0000001D     session 135: DID 0001-0015-0000001D
Rows waited on:
Session 135: obj - rowid = 00000025 - AAAAAlAABAAAADhAAA
  (dictionary objn - 37, file - 1, block - 225, slot - 0)
Information on the OTHER waiting sessions:
End of information on OTHER waiting sessions.
===================================================

From the trace above we can clearly see that the blocker and the waiter are the same session (process 21, session 135), which means the deadlock is caused by that session itself.

From the current SQL the reason is obvious:
to create an object, Oracle needs to maintain a row with the object's details in the obj$ table as recursive SQL. That is DML, which requires a mode 3 (row exclusive) TM lock on obj$.
And since we are creating the index on obj$ itself, the session already holds a share-mode table lock on obj$ (the "holds S" in the deadlock graph), which means no DML is allowed on obj$.
That is where the deadlock comes from.
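The TM locks and their modes can be watched from v$lock; a minimal sketch (lmode/request 3 = row exclusive (SX), 4 = share (S); id1 is the object_id of the locked table):
select sid, type, id1, lmode, request from v$lock where type = 'TM';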




August 7, 2012

SYSTEMSTATE DUMP: Troubleshooting DB hang, sqlplus hang, self-deadlock latch, PMON deadlock latch

One DB was hung. Trying to log in with "sqlplus / as sysdba" also failed; it just hung there.

I used the "-prelim" option to log in and took a systemstate dump.
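A minimal sketch of that step (the prelim connection skips normal session creation, so it works on a hung instance; 266 is a commonly used systemstate level):
sqlplus -prelim / as sysdba
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 266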

Below is the ass.awk output for systemstate dump:

System State 1
~~~~~~~~~~~~~~~~
1:
2:  0: waiting for 'pmon timer'    <---- PMON is fine at this moment
3:  0: waiting for 'rdbms ipc message'
4:  0: waiting for 'VKTM Logical Idle Wait'
5:  0: waiting for 'rdbms ipc message'
6:  0: waiting for 'DIAG idle wait'
7:  0: waiting for 'rdbms ipc message'
8:  0: waiting for 'DIAG idle wait'
9:  0: waiting for 'rdbms ipc message'
10: 0: waiting for 'rdbms ipc message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'rdbms ipc message'
14: 0: waiting for 'rdbms ipc message'
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'rdbms ipc message'
17: 0: waiting for 'rdbms ipc message'
18: 0: waiting for 'rdbms ipc message'
19: 0: waiting for 'rdbms ipc message'
20: 0: waiting for 'rdbms ipc message'
21: 0: waiting for 'rdbms ipc message'
22: 0: waiting for 'rdbms ipc message'
23: 0: waiting for 'rdbms ipc message'
24: 0: waiting for 'rdbms ipc message'
25: 0: waiting for 'rdbms ipc message'
26: 0: waiting for 'rdbms ipc message'
27: 0: waiting for 'rdbms ipc message'
28: 0: waiting for 'rdbms ipc message'
29: 0: waiting for 'rdbms ipc message'
30: 0: waiting for 'rdbms ipc message'
31: 0: waiting for 'rdbms ipc message'
32: 0: waiting for 'smon timer'
33: 0: waiting for 'rdbms ipc message'
34: 0: waiting for 'latch free'        [Latch 60009fc8]
35: 0: waiting for 'rdbms ipc message'
36:                                    [Latch 60009fc8]
37: 0: waiting for 'enq: PV - syncstart'[Enqueue PV-00000000-00000000]
38:                                    [Latch 60009fc8]
39: 0: waiting for 'SQL*Net message from client'
40: 0: waiting for 'VKRM Idle'
41: 0: waiting for 'rdbms ipc message'
42: 0: waiting for 'rdbms ipc message'
43: 0: waiting for 'rdbms ipc message'
44: 0: waiting for 'Streams AQ: qmn coordinator idle wait'
45: 0: waiting for 'enq: PR - contention'[Enqueue PR-00000000-00000000]
46:                                    [Latch 60009fc8]
47: 0: waiting for 'Streams AQ: qmn slave idle wait'
48:                                    [Latch 60009fc8]
49: 0: waiting for 'SQL*Net message from client'
50:                                    [Latch 60009fc8]
51: 9: waited for 'Streams AQ: waiting for time management or cleanup tasks'
52: 0: waiting for 'SQL*Net message from client'
Blockers
~~~~~~~~

        Above is a list of all the processes. If they are waiting for a resource
        then it will be given in square brackets. Below is a summary of the
        waited upon resources, together with the holder of that resource.
        Notes:
        ~~~~~
         o A process id of '???' implies that the holder was not found in the
           systemstate.

                    Resource Holder State
              Latch 60009fc8    ??? Blocker
Enqueue PV-00000000-00000000    34: 34: is waiting for Latch 60009fc8
Enqueue PR-00000000-00000000    34: 34: is waiting for Latch 60009fc8

Object Names
~~~~~~~~~~~~
Latch 60009fc8   process allocation
Enqueue PV-00000000-00000000
Enqueue PR-00000000-00000000

Obviously the root blocker is latch 60009fc8. But unluckily the systemstate dump does not contain the information about which session is holding the latch:
alpcispdb552[oracle]_pgbs> cat pgbs_diag_2254.trc|grep 60009fc8
      waiting for 60009fc8  process allocation level=7
        address=0x60009fc8, number=0x9, tries=0x0
          p1: 'address'=0x60009fc8
      waiting for 60009fc8  process allocation level=7
      waiting for 60009fc8  process allocation level=7
      waiting for 60009fc8  process allocation level=7
      waiting for 60009fc8  process allocation level=7
      waiting for 60009fc8  process allocation level=7

Then let's pick one of the blocked processes to check:
PROCESS 34: MMON
  ----------------------------------------
  SO: 0x1972e95a8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x1972e95a8, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:34, ser:1, calls cur/top: 0x189db28e8/0x19f52b748
            flags : (0x2) SYSTEM
            flags2: (0x20),  flags3: (0x0)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 68
              last post received-location: kso2.h LINE:467 ID:ksoreq_reply
              last process to post me: 19f2ef058 165 0
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:282 ID:ksasnd
              last process posted by me: 1982dec38 1 6
    (latch info) wait_event=0 bits=0
        Location from where call was made: ksu.h LINE:13713 ID:ksu_unreserve:
      waiting for 60009fc8  process allocation level=7
        Location from where latch is held: ksu.h LINE:13677 ID:ksuapc: proc addr
        Context saved from call: 6965578584
        state=busy [holder orapid=0] wlstate=free [value=0]
          gotten 57043 times wait, failed first 54 sleeps 0
          gotten 26253 times nowait, failed: 1
        possible holder pid = 0 ospid=26755
    Process Group: DEFAULT, pseudo proc: 0x19734a4b8
    O/S info: user: oracle, term: UNKNOWN, ospid: 2339
    OSD pid info: Unix process pid: 2339, image: oracle@server552(MMON)

The blocked process above is MMON. Note that it is waiting for latch 60009fc8, and the latch's possible holder is ospid 26755.

Let's check ospid 26755:

PROCESS 50:
  ----------------------------------------
  SO: 0x19f2ef058, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x19f2ef058, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:50, ser:165, calls cur/top: (nil)/(nil)
            flags : (0x0) -
            flags2: (0x8000),  flags3: (0x0)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 0
              last post received-location: No post
              last process to post me: none
              last post sent: 0 0 0
              last post sent-location: No post
              last process posted by me: none
    (latch info) wait_event=0 bits=0
        Location from where call was made: ksu.h LINE:13721 ID:ksudlp:
        Context saved from call: 6965620824
      waiting for 60009fc8  process allocation level=7
        Location from where latch is held: ksu.h LINE:13677 ID:ksuapc: proc addr
        Context saved from call: 6965578584
        state=busy [holder orapid=0] wlstate=free [value=0]
          gotten 57043 times wait, failed first 54 sleeps 0
          gotten 26253 times nowait, failed: 1
        possible holder pid = 0 ospid=26755
    O/S info: user: oracle, term: UNKNOWN, ospid: 26755
    OSD pid info: Unix process pid: 26755, image: oracle@alpcispdb552)

We can see that process 26755 is blocked by itself: a self-deadlock on the latch.

Then I killed process 26755 at the OS level with "kill -9" to get the latch released. But unluckily the DB was still hung.

I took a fresh systemstate dump and checked:

System State 1
~~~~~~~~~~~~~~~~
1:
2:  0: waiting for 'latch free'        [Latch 60009fc8]  ---- PMON is blocked
3:  0: waiting for 'rdbms ipc message'
4:  0: waiting for 'VKTM Logical Idle Wait'
5:  0: waiting for 'rdbms ipc message'
6:
7:  0: waiting for 'rdbms ipc message'
8:  0: waiting for 'DIAG idle wait'
9:  0: waiting for 'rdbms ipc message'
10: 0: waiting for 'rdbms ipc message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'rdbms ipc message'
14: 0: waiting for 'rdbms ipc message'
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'rdbms ipc message'
17: 0: waiting for 'rdbms ipc message'
18: 0: waiting for 'rdbms ipc message'
19: 0: waiting for 'rdbms ipc message'
20: 0: waiting for 'rdbms ipc message'
21: 0: waiting for 'rdbms ipc message'
22: 0: waiting for 'rdbms ipc message'
23: 0: waiting for 'rdbms ipc message'
24: 0: waiting for 'rdbms ipc message'
25: 0: waiting for 'rdbms ipc message'
26: 0: waiting for 'rdbms ipc message'
27: 0: waiting for 'rdbms ipc message'
28: 0: waiting for 'rdbms ipc message'
29: 0: waiting for 'rdbms ipc message'
30: 0: waiting for 'rdbms ipc message'
31: 0: waiting for 'rdbms ipc message'
32: 0: waiting for 'smon timer'
33: 0: waiting for 'rdbms ipc message'
34: 0: waiting for 'latch free'        [Latch 60009fc8]
35: 0: waiting for 'rdbms ipc message'
36:                                    [Latch 60009fc8]
37: 0: waiting for 'enq: PV - syncstart'[Enqueue PV-00000000-00000000]
38:                                    [Latch 60009fc8]
39: 0: waiting for 'SQL*Net message from client'
40: 0: waiting for 'VKRM Idle'
41: 0: waiting for 'rdbms ipc message'
42: 0: waiting for 'rdbms ipc message'
43: 0: waiting for 'rdbms ipc message'
44: 0: waiting for 'Streams AQ: qmn coordinator idle wait'
45: 0: waiting for 'enq: PR - contention'[Enqueue PR-00000000-00000000]
46:                                    [Latch 60009fc8]
47: 0: waiting for 'Streams AQ: qmn slave idle wait'
48:                                    [Latch 60009fc8]
49: 0: waiting for 'SQL*Net message from client'
50:                                    [Latch 60009fc8]
51: 9: waited for 'Streams AQ: waiting for time management or cleanup tasks'
52: 0: waiting for 'SQL*Net message from client'
Blockers
~~~~~~~~

        Above is a list of all the processes. If they are waiting for a resource
        then it will be given in square brackets. Below is a summary of the
        waited upon resources, together with the holder of that resource.
        Notes:
        ~~~~~
         o A process id of '???' implies that the holder was not found in the
           systemstate.

                    Resource Holder State
              Latch 60009fc8    ??? Blocker
Enqueue PV-00000000-00000000    34: 34: is waiting for Latch 60009fc8
Enqueue PR-00000000-00000000    34: 34: is waiting for Latch 60009fc8

Object Names
~~~~~~~~~~~~
Latch 60009fc8   process allocation
Enqueue PV-00000000-00000000
Enqueue PR-00000000-00000000

We can see PMON is also blocked on latch 60009fc8.

Below is the process dump for PMON:

PROCESS 2: PMON
  ----------------------------------------
  SO: 0x19935b370, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x19935b370, name=process, file=ksu.h LINE:12451, pg=0
  (process) Oracle pid:2, ser:1, calls cur/top: 0x19f525a88/0x19f525a88
            flags : (0xe) SYSTEM
            flags2: (0x0),  flags3: (0x0)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
    ksudlp FALSE at location: 0
  (post info) last post received: 0 0 68
              last post received-location: kso2.h LINE:467 ID:ksoreq_reply
              last process to post me: 19f2edfd8 2 2
              last post sent: 0 0 26
              last post sent-location: ksa2.h LINE:282 ID:ksasnd
              last process posted by me: 1982dec38 1 6
    (latch info) wait_event=0 bits=0
        Location from where call was made: ksu.h LINE:13721 ID:ksudlp:
        Context saved from call: 6814711808
      waiting for 60009fc8  process allocation level=7
        Location from where latch is held: ksu.h LINE:13677 ID:ksuapc: proc addr
        Context saved from call: 6965578584
        state=busy [holder orapid=0] wlstate=free [value=0]
          gotten 57043 times wait, failed first 54 sleeps 0
          gotten 26253 times nowait, failed: 1
        possible holder pid = 0 ospid=26755
    Process Group: DEFAULT, pseudo proc: 0x19734a4b8
    O/S info: user: oracle, term: UNKNOWN, ospid: 1311
    OSD pid info: Unix process pid: 1311, image: oracle@server552(PMON)

We can see PMON is waiting on latch 60009fc8, blocked by the process that we just killed.

It is clear to us now:
process 26755 was in a self-deadlock on latch 60009fc8 and blocked other processes. Later I killed process 26755 to release the latch, so PMON should then have cleaned up the process and released its resources.
But before PMON could start that work, PMON itself also required the same latch, 60009fc8, and was therefore blocked and could not make progress.
So it is a typical PMON deadlock.

We had to bounce the DB in the end.


July 30, 2012

RMAN scenario: duplicate database from tape without target DB, "RMAN-05548: The set of duplicated tablespaces is not self-contained"

Today a user asked us to restore a table from an old backup.
The table is small, a few thousand rows, but the DB is big, 200gb, and we don't have sufficient space to restore the whole DB.

We decided to restore only the table and its tablespace, using the "duplicate database ... skip tablespace" clause to skip the irrelevant tablespaces:

server001[oracle]_ORCLPROD> rman target / auxiliary  sys/oracle@temp2 catalog rmanadmin/rman_psswd@rmancatalog

Recovery Manager: Release 11.2.0.2.0 - Production on Sun Jul 29 04:23:34 2012

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

connected to target database: ORCLPROD (DBID=1650044020)
connected to recovery catalog database
connected to auxiliary database: TEMP2 (not mounted)

RMAN> run
{
set until time "to_date('03-JUL-201212:00:00','dd-mm-yyyyhh24:mi:ss')";
allocate auxiliary  channel ch1 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
allocate auxiliary  channel ch2 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
duplicate target database to temp2 nofilenamecheck skip tablespace USERS,TBS1,TBS2,TBS3,TBS4,TBS5,TBS6,TBS7;
}

executing command: SET until clause
..........
..........
Violation: ORA-39906: Constraint SYS_C0050693 between table GETS_SO.GETS_SO_ATS_PROJECT in tablespace OTD and table GETS_SO.GETS_SO_SWPT_SSO_SHOP_ORD in tablespace SWPT_TABLES.
Violation: ORA-39908: Index GETS_SO.GETS_SO_NRE_BUSINESS_MST_PK in tablespace OTX_INDEXES enforces primary constraints  of table GETS_SO.GETS_SO_NRE_BUSINESS_MST in tablespace OTD.
Violation: ORA-39908: Index GETS_SO.GETS_SO_NRE_CUSTOMER_MST_PK in tablespace OTX_INDEXES enforces primary constraints  of table GETS_SO.GETS_SO_NRE_CUSTOMER_MST in tablespace OTD.
Violation: ORA-39908: Index GETS_SO.GETS_SO_NRE_ESR_DETAIL_PK in tablespace OTX_INDEXES enforces primary constraints  of table GETS_SO.GETS_SO_NRE_ESR_DETAIL in tablespace OTD.
Violation: ORA-39907: Index GETS_SO.GETS_SO_NRE_ESR_DSGN_TSK_IE2 in tablespace OTX_INDEXES points to table GETS_SO.GETS_SO_NRE_ESR_DSGN_TSK in tablespace OTD.
Violation: ORA-39907: Index GETS_SO.GETS_SO_NRE_ESR_RPT_TSK_IE1 in tablespace OTX_INDEXES points to table GETS_SO.GETS_SO_NRE_ESR_RPT_TSK in tablespace OTD.
..........
..........

released channel: ch1
released channel: ch2
released channel: ch3
released channel: ch4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 07/29/2012 05:13:00
RMAN-05501: aborting duplication of target database
RMAN-05548: The set of duplicated tablespaces is not self-contained

The RMAN-05548 check is an enhancement in 11g: when we skip some tablespaces, Oracle checks the dependencies of the remaining tablespaces through the TTS transport check procedure. If the check fails, RMAN refuses to restore.

How do we get around that? The only method is a no-target duplicate.
Since the tablespace self-containment check relies on the target database, with this method Oracle cannot check the dependencies from the target DB:

No-target duplicate:
rman auxiliary / catalog rmanadmin/rman_psswd@rmancatalog
set dbid 21111;
run
{
set until time "to_date('03-JUL-201212:00:00','dd-mm-yyyyhh24:mi:ss')";
allocate auxiliary  channel ch1 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
allocate auxiliary  channel ch2 type 'sbt_tape' parms 'ENV=(NB_ORA_SERVER=netbacksev1_b,NB_ORA_CLIENT=server001_b)';
duplicate database to temp backup location  'sbt_tape' nofilenamecheck skip tablespace USERS,TBS1,TBS2,TBS3,TBS4,TBS5,TBS6,TBS7;
}


July 21, 2012

From 11.2.0.2.5, Oracle no longer supports ADJUST SCN

Today, after an incomplete recovery and opening the DB, I decided to increase the SCN of the DB.
I set "_allow_error_simulation=true" and used:

alter session set events 'IMMEDIATE trace name adjust_scn level 4490';
ALTER SESSION SET EVENTS '10015 TRACE NAME ADJUST_SCN LEVEL 4490';
at open/mount state several times, but the SCN didn't change!

The alert.log showed that Oracle didn't even try to advance the SCN.
Very weird.

Then I decided to try "_minimum_giga_scn": I added "_minimum_giga_scn=7000" into init.ora and then started up the DB:

SYS @ apci > startup
ORA-01078: failure in processing system parameters
LRM-00101: unknown parameter name '_minimum_giga_scn'

Kidding? I removed the parameter and the DB started up normally again.
Let's query the hidden parameter:

SYS @ apci > select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected

Let's check databases of different versions:
10.2.0.4, no problem, as expected:
SYS @ 10gDB> select ksppinm from x$ksppi where ksppinm like '%giga%';
KSPPINM
--------------------------------------------------------------------------------
_minimum_giga_scn

11.2.0.2.4, parameter still there:
SYS @ XXDB >select ksppinm from x$ksppi where ksppinm like '%giga%';
KSPPINM
--------------------------------------------------------------------------------
_minimum_giga_scn

11.2.0.2.5, from this version the parameter disappears:
SQL>  select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected

11.2.0.2.6:
SYS @ DB1 > select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected

11.2.0.3:
SYS @ DB2 > select ksppinm from x$ksppi where ksppinm like '%giga%';
no rows selected

We can see that Oracle removed the ADJUST SCN functionality in 11.2.0.2.5, so we can't use Oracle's own mechanism to adjust the SCN in the future.
Is there any alternative way?

Yes: we can first increase the checkpoint_change# value in the datafile header block to an appropriate value, and then re-create the controlfile.
The new controlfile will then read our new value as the current SCN.
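
Before and after the header edit it is worth checking what the datafile headers themselves report; v$datafile_header reads the headers directly, so it makes a handy sanity check:

SYS @ test8 > SELECT file#, checkpoint_change# FROM v$datafile_header;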

Below are the steps:

SYS @ test8 > select current_scn from v$database;
CURRENT_SCN
-----------
    4324729
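
The BBED session below works on block 1 (the file header) of the SYSTEM datafile; offset 484 is where the checkpoint SCN base sits in the header on this platform. The setup commands are not in the capture, but presumably looked something like:

BBED> set file 1 block 1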

BBED> set offset 484
        OFFSET          484

BBED> dump
 File: /d735/data02/oracle/test8/data/system01.dbf (1)
 Block: 1                Offsets:  484 to  995           Dba:0x00400001
------------------------------------------------------------------------
 3ffe4100 00000000 59d60a2f 01000000 01000000 62010000 1000db8a 02000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

BBED> modify /x dddddddd
Warning: contents of previous BIFILE will be lost. Proceed? (Y/N) Y
 File: /d735/data02/oracle/test8/data/system01.dbf (1)
 Block: 1                Offsets:  484 to  995           Dba:0x00400001
------------------------------------------------------------------------
 dddddddd 00000000 59d60a2f 01000000 01000000 62010000 1000db8a 02000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

BBED> sum apply;
Check value for File 1, Block 1:
current = 0x5036, required = 0x5036
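
For reference, the bytes are stored little-endian, so writing dddddddd sets the checkpoint SCN base to 0xDDDDDDDD. A quick conversion shows why the database comes up with a current SCN just over 3.7 billion:

SYS @ test8 > SELECT TO_NUMBER('DDDDDDDD','XXXXXXXX') FROM dual;   -- returns 3722304989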

SYS @ test8 > startup nomount;
ORACLE instance started.

Total System Global Area  730714112 bytes
Fixed Size                  2230080 bytes
Variable Size             318769344 bytes
Database Buffers          322961408 bytes
Redo Buffers               86753280 bytes
SYS @ test8 > CREATE CONTROLFILE REUSE DATABASE "TEST8" RESETLOGS  NOARCHIVELOG
  2      MAXLOGFILES 200
  3      MAXLOGMEMBERS 4
 ...........
 ...........
 29    '/d735/data02/oracle/test8/data/user03_new.dbf'
 30  CHARACTER SET UTF8
 31  ;
Control file created.

SYS @ test8 > alter database open resetlogs;
Database altered.

SYS @ test8 > select current_scn from v$database;
CURRENT_SCN
-----------
 3722305178

Done.

July 5, 2012

Instance evicted due to HAIP failure

RAC cluster: one server's instances were abruptly evicted and re-joined the cluster after 10 minutes. Below is from the DB's alert log:

Wed Jun 27 00:34:55 2012
Errors in file /q001/product/ora_base/diag/rdbms/ttpoltq/ttpoltq1/trace/ttpoltq1_ora_21759.trc  (incident=139278):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 169.254.25.210 not found. Check output from ifconfig command
.............
ORA-29740: evicted by instance number 3, group incarnation 60
LMON (ospid: 20361): terminating the instance due to error 29740
Instance terminated by LMON, pid = 20361

From the red part we know there must be something wrong with the interconnect.

After checking, we found the interfaces on the 10.0.0.* network were working fine.
The GPnP log file also showed no abnormal information.
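
Beyond ifconfig, the HAIP resource itself can be inspected directly; a few commands that are useful here (assuming the Grid Infrastructure binaries are on the PATH):

oracle $ crsctl stat res ora.cluster_interconnect.haip -init    # state of the HAIP resource
oracle $ oifcfg getif                                           # interfaces registered for the cluster interconnect
oracle $ /sbin/ifconfig -a | grep 169.254                       # link-local addresses HAIP plumbs on top of them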

So the issue must have been caused by HAIP.
Let's check CSSD's log file:

oracle $ cat ocssd.log|egrep -i 'haip|disk'
2012-06-27 00:31:44.417: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7910 msecs
2012-06-27 00:31:52.433: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15920 msecs
2012-06-27 00:32:00.449: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 23940 msecs
2012-06-27 00:32:08.170: [    CSSD][1105717568]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 31660 msecs
2012-06-27 00:32:16.481: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 39970 msecs
2012-06-27 00:32:24.497: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 47980 msecs
2012-06-27 00:32:32.513: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 56000 msecs
2012-06-27 00:32:40.945: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 64430 msecs
2012-06-27 00:32:49.236: [    CSSD][1105717568]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 72720 msecs
2012-06-27 00:32:58.565: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 82040 msecs
2012-06-27 00:33:06.581: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 90060 msecs
2012-06-27 00:33:14.597: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 98070 msecs
2012-06-27 00:33:15.539: [    CSSD][1109621056]clssnmvDiskCheck: (ORCL:ASM_DISK01) No I/O completed after 50% maximum time, 200000 ms, will be considered unusable in 99990 ms
2012-06-27 00:33:22.614: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 106090 msecs
2012-06-27 00:33:30.629: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 114100 msecs
2012-06-27 00:33:34.069: [    CSSD][1105717568]clssgmFenceClient: fencing client (0x19278b0), member 1 in group haip.cluster_interconnect, no share, death fence 1, SAGE fence 0
2012-06-27 00:33:34.069: [    CSSD][1105717568]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 1
2012-06-27 00:33:34.069: [    CSSD][1093089600]clssgmTermMember: Terminating member 1 (0x18afc30) in grock haip.cluster_interconnect
2012-06-27 00:33:34.069: [    CSSD][1093089600]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 0
2012-06-27 00:33:34.070: [    CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 1 (0x18afc30) node number 1 state 0x0 grock type 2
2012-06-27 00:33:34.070: [    CSSD][1111198016]clssgmGrantLocks: 1-> new master (3/3) group haip.cluster_interconnect
2012-06-27 00:33:34.072: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c30034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.072: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c30034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(109)
2012-06-27 00:33:34.242: [    CSSD][1105717568]clssgmJoinGrock: global grock haip.cluster_interconnect new client 0x2aaab0611a70 with con 0x12622491, requested num 1, flags 0x4000100
2012-06-27 00:33:34.242: [    CSSD][1105717568]clssgmAddGrockMember: adding member to grock haip.cluster_interconnect
2012-06-27 00:33:34.242: [    CSSD][1111198016]clssgmAddMember: Adding fencing for member 1, group haip.cluster_interconnect, death 1, SAGE 0
2012-06-27 00:33:34.242: [    CSSD][1111198016]clssgmAddMember: member (1/0x2aaab051c000) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:33:34.242: [    CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(1/Local) node(1) flags 0x0 0x423b6300
2012-06-27 00:33:34.245: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c70034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.245: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c70034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(110)
2012-06-27 00:33:34.562: [    CSSD][1105717568]clssgmExitGrock: client 1 (0x2aaab0611a70), grock haip.cluster_interconnect, member 1
2012-06-27 00:33:34.562: [    CSSD][1105717568]clssgmUnregisterPrimary: Unregistering member 1 (0x2aaab051c000) in global grock haip.cluster_interconnect
2012-06-27 00:33:34.562: [    CSSD][1105717568]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 0
2012-06-27 00:33:34.562: [    CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 1 (0x2aaab051c000) node number 1 state 0x4 grock type 2
2012-06-27 00:33:34.564: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c80034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.565: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6c80034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(111)
2012-06-27 00:33:34.627: [    CSSD][1105717568]clssgmJoinGrock: global grock haip.cluster_interconnect new client 0x2aaab04022b0 with con 0x1262256c, requested num 1, flags 0x4000100
2012-06-27 00:33:34.627: [    CSSD][1105717568]clssgmAddGrockMember: adding member to grock haip.cluster_interconnect
2012-06-27 00:33:34.627: [    CSSD][1111198016]clssgmAddMember: Adding fencing for member 1, group haip.cluster_interconnect, death 1, SAGE 0
2012-06-27 00:33:34.627: [    CSSD][1111198016]clssgmAddMember: member (1/0x2aaab04c0f80) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:33:34.627: [    CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(1/Local) node(1) flags 0x0 0x423b6300
2012-06-27 00:33:34.629: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6cb0034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:33:34.630: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x6cb0034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(112)
2012-06-27 00:33:39.647: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 123120 msecs
2012-06-27 00:33:47.663: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 131130 msecs
2012-06-27 00:33:55.680: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 139150 msecs
2012-06-27 00:34:04.115: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 147580 msecs
2012-06-27 00:34:05.658: [    CSSD][1109621056]clssnmvDiskCheck: (ORCL:ASM_DISK01) No I/O completed after 75% maximum time, 200000 ms, will be considered unusable in 49880 ms
2012-06-27 00:34:11.712: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 155180 msecs
2012-06-27 00:34:19.728: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 163190 msecs
2012-06-27 00:34:28.314: [    CSSD][1105717568]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 171770 msecs
2012-06-27 00:34:35.732: [    CSSD][1109621056]clssnmvDiskCheck: (ORCL:ASM_DISK01) No I/O completed after 90% maximum time, 200000 ms, will be considered unusable in 19810 ms
2012-06-27 00:34:36.762: [    CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 180220 msecs
2012-06-27 00:34:43.044: [    CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 3 (0x2aaaac27b800) node number 3 state 0x0 grock type 2
2012-06-27 00:34:43.044: [    CSSD][1111198016]clssgmGrantLocks: 3-> new master (2/2) group haip.cluster_interconnect
2012-06-27 00:34:43.046: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7200034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:43.047: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7200034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(113)
2012-06-27 00:34:43.291: [    CSSD][1111198016]clssgmAddMember: member (3/0x2aaaac5844a0) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:34:43.291: [    CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(3/Remote) node(3) flags 0x0 0xac5844a0
2012-06-27 00:34:43.293: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7400034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:43.293: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7400034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(114)
2012-06-27 00:34:43.636: [    CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 3 (0x2aaaac5844a0) node number 3 state 0x0 grock type 2
2012-06-27 00:34:43.638: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74f0034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(3)
2012-06-27 00:34:43.639: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74f0034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(115)
2012-06-27 00:34:43.730: [    CSSD][1111198016]clssgmAddMember: member (3/0x1dba9d0) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:34:43.731: [    CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(3/Remote) node(3) flags 0x0 0x1dba9d0
2012-06-27 00:34:43.733: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74a0034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:43.733: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x74a0034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(116)
2012-06-27 00:34:45.457: [    CSSD][1105717568]clssgmExitGrock: client 1 (0x2aaab04022b0), grock haip.cluster_interconnect, member 1
2012-06-27 00:34:45.457: [    CSSD][1105717568]clssgmUnregisterPrimary: Unregistering member 1 (0x2aaab04c0f80) in global grock haip.cluster_interconnect
2012-06-27 00:34:45.457: [    CSSD][1105717568]clssgmUnreferenceMember: global grock haip.cluster_interconnect member 1 refcount is 0
2012-06-27 00:34:45.457: [    CSSD][1111198016]clssgmRemoveMember: grock haip.cluster_interconnect, member number 1 (0x2aaab04c0f80) node number 1 state 0x4 grock type 2
2012-06-27 00:34:45.459: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7640034) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:34:45.459: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x7640034) of grock(haip.cluster_interconnect) received all acks, grock update sequence(117)
2012-06-27 00:43:28.566: [    CSSD][1105717568]clssgmJoinGrock: global grock haip.cluster_interconnect new client 0x2aaab43103e0 with con 0x126435d1, requested num 1, flags 0x4000100
2012-06-27 00:43:28.566: [    CSSD][1105717568]clssgmAddGrockMember: adding member to grock haip.cluster_interconnect
2012-06-27 00:43:28.566: [    CSSD][1111198016]clssgmAddMember: Adding fencing for member 1, group haip.cluster_interconnect, death 1, SAGE 0
2012-06-27 00:43:28.566: [    CSSD][1111198016]clssgmAddMember: member (1/0x2aaab46fbd20) added. pbsz(0) prsz(0) flags 0x0 to grock (0x2aaab00e3190/haip.cluster_interconnect)
2012-06-27 00:43:28.566: [    CSSD][1111198016]clssgmCommonAddMember: global group grock haip.cluster_interconnect member(1/Local) node(1) flags 0x0 0x423b6300
2012-06-27 00:43:28.568: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x2aa0035) of grock(haip.cluster_interconnect) is waiting for at least another ack from node(2)
2012-06-27 00:43:28.569: [    CSSD][1111198016]clssgmBroadcastGrockRcfgCmpl: RPC(0x2aa0035) of grock(haip.cluster_interconnect) received all acks, grock update sequence(118)
2012-06-27 00:43:44.470: [    CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 32 (0x2aaab4269a70)
2012-06-27 00:43:45.958: [    CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 32 (0x2aaab4269a70)
2012-06-27 00:43:48.770: [    CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 34 (0x2aaab05c4340)
2012-06-27 00:43:48.771: [    CSSD][1105717568]clssgmExecuteClientRequest: VOTEDISKQUERY recvd from proc 34 (0x2aaab05c4340)

From the above we can see that the cluster started counting down the votedisk timeout:
2012-06-27 00:31:44.417: [ CSSD][1117505856]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7910 msecs

The long disk timeout threshold is 200 seconds in 11gR2.
During the countdown, the cluster checked on HAIP and believed HAIP had already died:
2012-06-27 00:33:34.069: [ CSSD][1105717568]clssgmFenceClient: fencing client (0x19278b0), member 1 in group haip.cluster_interconnect, no share, death fence 1, SAGE fence 0

The cluster then started to clean up HAIP, and tried multiple times to restart it.

Later, the votedisk came back online, so the votedisk timeout countdown stopped; HAIP was also brought back online, all instances were brought back up, and everything returned to normal.
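
As a side note, the 200-second long disk I/O timeout mentioned above can be read from the cluster itself (a quick check):

oracle $ crsctl get css disktimeout      # typically reports 200 (seconds) on 11gR2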
