早上刚来的时候,收到zabbix 数据库连接数增长的告警,同时应用负责人也说查询很慢、很卡
查看该时间段 最多的等待事件
SELECT event,COUNT(1) num
FROM V$ACTIVE_SESSION_HISTORY A
WHERE A.SAMPLE_TIME BETWEEN
TO_DATE('2025-01-02 09:00:00', 'YYYY-MM-DD HH24:MI:SS') AND
TO_DATE('2025-01-02 11:00:00', 'YYYY-MM-DD HH24:MI:SS')
GROUP BY event order by 2 desc;
Latch属于System Lock, 用于保护SGA区中共享数据结构的一种串行化锁定机制。Latch的实现是与操作系统相关的,尤其和一个进程是否需要等待一个latch、需要等待多长时间有关。
Latch是Oracle提供的轻量级锁资源,是一种能够极快地被获取和释放的锁,能快速,短时间的锁定资源,防止多个并发进程同时修改访问某个共享资源,它只工作在SGA中,
通常用于保护描述buffer cache中block的数据结构。
Row cache objects latch的争夺也就意味着数据字典的争夺。如果有latch: row cache objects等待事件,那么会导致CPU很高,主机和数据库响应都很慢。
因为Row cache objects latch是用来保护数据字典缓冲区的,进程在装载、引用或者清除数据字典缓冲区中的对象时必须获得该latch。Latch不会造成阻塞,只会导致等待。 阻塞是一种系统设计上的问题,等待是一种系统资源争用的问题。
当依赖于公共同义词的SQL 被过度解析时,可能会导致这种问题。增加shared pool 通常可以解决这种latch。
在MOS上搜了一下,与latch:row cache objects 相关的等待,大部分与Oracle 的bug 相关和Rowcache及其相关Latch的视图主要有:v$latch_children,v$rowcache,v$latch_misses。
这个latch是对象的数据字典。也就是说有对象数据字典的争用,根据oracle文档介绍一般都有如下的数据字典:
DC_SEQUENCES
Caused by using sequences in simultaneous insert operations. =>
Consider caching sequences using the cache option. Especially important on RAC instances!
Bug 6027068 – Contention on ORA_TQ_BASE sequence -fixed in 10.2.0.5 and 11.2.0.1
DC_OBJECTS
Look for any object compilation activity which might require an exclusive lock, blocking other activities. If object compiles are occurring this can require an exclusive lock which will block other activity. Tune by examining invalid objects and dependencies with following SQL:
select * from dba_objects order by last_ddl_time desc;
select * from dba_objects where status = 'INVALID';
Can be a bug like the following ones: Bug 11070004 – High row cache objects latch contention w/ oracle text queries Bug 11693365 – Concurrent Drop table and Select on Reference constraint table hangs(deadlock) – fixed in 12.1 DC_SEGMENTS This is most likely due to segment allocation. Identify what the session holding the enqueue is doing and use errorstacks to diagnose.
DC_USERS
– This may occur if a session issues a GRANT to a user, and that user is in the process of logging on to the database.
– Excessive calls to dc_users can be a symptom of “set role XXXX”
– You can check the presents of massive login attempts, even the failed ones by analyzing listener.log (use OEM 12c-> All Metrics or by checking database AUDIT if available or using own tools).
– Bug 7715339 – Logon failures causes “row cache lock” waits – Allow disable of logon delay
DC_TABLESPACES
Probably the most likely cause is the allocation of new extents. If extent sizes are set low then the application may constantly be requesting new extents and causing contention. Do you have objects with small extent sizes that are rapidly growing? (You may be able to spot these by looking for objects with large numbers of extents). Check the trace for insert/update activity, check the objects inserted into for number of extents.
DC_USED_EXTENTS and DC_FREE_EXTENTS
This row cache lock wait may occur similar during space management operations where tablespaces are fragmented or have inadequate extent sizes. Tune by checking whether tablespaces are fragmented, extent sizes are too small, or tablespaces are managed manually.
DC_ROLLBACK_SEGMENTS
– This is due to rollback segment allocation. Just like dc_segments, identify what is holding the enqueue and also generate errorstacks.
Possible Bugs:
– Bug 7313166 Startup hang with self deadlock on dc_rollback_segments (Versions BELOW 11.2)
– Bug 7291739 Contention Under Auto-Tuned Undo Retention (Doc ID 742035.1)
DC_TABLE_SCNS
Bug 5756769 – Deadlock between Create MVIEW and DML – fixed in 10.2.0.5 ,11.1.07 and 11.2.0.1
DC_AWR_CONTROL
This enqueue is related to control of the Automatic Workload Repository. As such any operation manipulating the repository may hold this so look for processes blocking these.
II.解决方案:
通过如下分析是由于故障时段有大量硬解析,硬解析需要去获取数据字典资源,这需要获得latch,若硬解析量太大,会造成严重的latch争用,占用非常多的资源,导致CPU使用过高,从而表现出数据库ROW CACHE OBJECT等待事件,最终导致数据库系统缓慢。
根据分析,有如下2点建议:
1. 对业务侧进行SQL整改,将未使用绑定变量改成绑定变量,成为软解析,减少获取数据字典的次数,从而降低LATCH争用次数,从根本解决问题(推荐)
2. 对数据库参数 cursor_sharing 进行调整,可部分缓解这个问题,但不能从根本上解决问题。
3.优化查询语句,本次是对OGG同步的表添加组合索引,缓解了争用
III.分析过程:
1.数据库存在LATCH等待情况,看到这几个等待事件,这个数据库的内存很大,我怀疑是硬解析或者热块争用有关;
2.SGA使用率,Shared pool还有18G,排除内存不足
select POOL,round(BYTES/1024/1024) FREE_MB from v$sgastat a where a.NAME like 'free%';
3. 通过AWR发现Parse CPU to Parse Elapsed为4.61%,Non-Parse CPU为62.18%,正常情况下98%以上,该值表示SQL解析时间占比,越高越好,如果过低说明解析中等待资源时间太长。软解析比例也过低,正常在98%以上。应该就是硬解析的问题了
4. 通过分析数据库内部资源,发现数据字典的各种LATCH争用过高,(GETS表示请求该资源的次数,MISSES表示请求失败重新请求的次数,SLEEPS表示请求失败进入睡眠队列),成功率最底12%,正常应为98%。
查看row cache objects 的具体信息,查询LATCH 的GET量,以及成为率
col LATCH_NAME for a20
SELECT a.addr,a.latch#,a.child#,a.level#,a.name LATCH_NAME,a.gets,a.misses,round((1-a.misses/a.gets)*100,2) SUCESS_PCT,a.sleeps
FROM v$latch_children a
WHERE a.name='row cache objects' AND a.gets <>0
ORDER BY a.gets desc;
使用如下SQL检查row cache 是否高效。 pct_succ_gets 接近与100最好,如果接近于0,就说明需要优化,
如果pct_succ_gets 的值不接近与100,或者数据库出现大量的“Row Cache Objects” latch等待,可以考虑增加SHARE_POOL_SIZE 来缓解这种现象
SELECT parameter,
SUM(gets),
SUM(getmisses),
100 * SUM (gets-getmisses)/sum(gets)pct_succ_gets,
sum(modifications) updates
FROM V$ROWCACHE
WHERE gets >0
GROUP BY parameter;
5.通过以下查询语句,查看引起该争用的SQL语句
select sq.sql_text,sql_fulltext from (
select * from (
SELECT sql_id,event,COUNT(1) num
FROM V$ACTIVE_SESSION_HISTORY A
WHERE A.SAMPLE_TIME BETWEEN
TO_DATE('2025-01-02 09:00:00', 'YYYY-MM-DD HH24:MI:SS') AND
TO_DATE('2025-01-02 11:00:00', 'YYYY-MM-DD HH24:MI:SS')
AND event in( 'latch: row cache objects' ,'cursor: pin S wait on X') and user_id=141
GROUP BY sql_id,event
ORDER BY 3 DESC) where rownum<=15 ) a left join v$sql sq on a.sql_id=sq.sql_id where sql_text is not null order by num desc;
发现是对一个视图的查询,查看执行计划,发现联合查询中,有一个是对OGG投递过来的表进行全表扫描,评估了下,在源端的表添加组合索引,目标表自动同步了索引(OGG开启了DDL同步),争用没有了,告警恢复
查看该语句的执行计划
select * from table(dbms_xplan.display_cursor('sql_id_number',null,'advanced'))
或者
explain plan for select_sql;
select * from table(dbms_xplan.display('',null,'advanced'));
MOS参考:
Slow Performance with High Waits for ‘row cache lock’ With Possible Database Hang (Doc ID 2189126.1)
Bug 13902396 - Hash joins cause “row cache objects” latch gets and “shared pool” latch gets (disabled fix) (Doc ID 13902396.8)