[ORA-4031]large pool PX msg pool

2017-02-05 胖熊猫l

0. Summary

1. Symptoms
2. Analysis
.   2.1 SGA parameter settings
.   2.2 Large pool size and automatic resizing
.   2.3 Parallel execution parameters
.   2.4 Detailed alert log analysis
.   2.5 Shared pool size
3. Recommendations

1. Symptoms

#### alert log ####

Sat Feb 04 02:08:41 2017
Memory Notification: Library Cache Object loaded into SGA
Heap size 51201K exceeds notification threshold (51200K)
Details in trace file /app/oracle/diag/rdbms/noap/noap/trace/noap_j000_4031.trc
KGL object name :SZ1X.IN_JS_CDR_HW_AC_TI 
Memory Notification: Library Cache Object loaded into SGA
Heap size 331660K exceeds notification threshold (51200K)
Details in trace file /app/oracle/diag/rdbms/noap/noap/trace/noap_j000_4031.trc
KGL object name :alter table MOD_JS_CDR_HW drop partition SYS_P1404186
Sat Feb 04 02:09:20 2017
TABLE SZ1X.MOD_CDR_HW: ADDED INTERVAL PARTITION SYS_P1404388 (47883) VALUES LESS THAN (TO_DATE(' 2017-02-04 03:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
Sat Feb 04 02:14:22 2017 
Thread 1 advanced to log sequence 603901 (LGWR switch)
  Current log# 1 seq# 603901 mem# 0: /app/oracle/oradata/noap/redo01.log
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc  (incident=109601):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109601/noap_p085_5172_i109601.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p034_5070.trc  (incident=101439):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_101439/noap_p034_5070_i101439.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p092_5186.trc  (incident=109832):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109832/noap_p092_5186_i109832.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p051_5104.trc  (incident=107372):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_107372/noap_p051_5104_i107372.trc
......

#### noap_p085_5172_i109601.trc ####

Dump continued from file: /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
  
========= Dump for incident 109601 (ORA 4031) ========
  
*** 2017-02-04 02:32:34.673
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=385pbhfh4g7rn) -----
 insert /*+append*/ into c_cdr_railway_huning 
  select /*+full(t) parallel(64)*/
  RELEASE_CAUSE AS (Chinese column alias, unreadable due to encoding damage)
  ACCESS_TIME AS (Chinese column alias, unreadable due to encoding damage)
......

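The alert log shows ORA-04031 errors. Judging from the error text, the direct cause is that parallel execution (the "PX msg pool" allocation) ran the large pool out of memory.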

2. Analysis

2.1 SGA parameter settings

SQL> show parameter sga

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 32G
sga_target                           big integer 32G
SQL> show parameter db_cache

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_cache_advice                      string      ON
db_cache_size                        big integer 22G
SQL> show parameter shared_pool

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
shared_pool_reserved_size            big integer 510027366
shared_pool_size                     big integer 8G

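The SGA is currently under ASMM (Automatic Shared Memory Management): sga_target is set to 32G, and the db_cache_size (22G) and shared_pool_size (8G) values act as minimums for those components.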
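
The management mode can also be confirmed by listing the relevant parameters in one place. The query below is a minimal sketch that uses only v$parameter. Note that memory_target does not appear in the output above, so its value on this system is an assumption to verify rather than a known fact.

```sql
-- List the parameters that decide how the SGA is managed.
--   sga_target > 0 and memory_target = 0  -> ASMM
--   memory_target > 0                     -> AMM
--   both 0                                -> manual SGA sizing
select name, display_value
  from v$parameter
 where name in ('memory_target',
                'sga_target',
                'sga_max_size',
                'db_cache_size',
                'shared_pool_size',
                'large_pool_size',
                'java_pool_size',
                'streams_pool_size')
 order by name;
```

Under ASMM, any pool parameter with a nonzero value is treated as a floor for that pool, not as a fixed size.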

2.2 Large pool size and automatic resizing

SQL> show parameter large

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
large_pool_size                      big integer 0
use_large_pages                      string      TRUE

SQL> select t.*
  2    from (select name,
  3                 bytes / (1024 * 1024) "MB",
  4                 round(bytes / (select value
  5                                  from v$parameter t
  6                                 where t.name = 'shared_pool_size') * 100,
  7                       2) || '%' "USED%"
  8            from v$sgastat
  9           where pool = 'large pool'
 10           order by 2 desc) t
 11   where rownum < 20;

NAME                               MB USED%
-------------------------- ---------- -----------------------------------------
free memory                  119.8125 1.46%
PX msg pool                    7.8125 .1%
ASM map operations hashta        .375 0%

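The SGA is under ASMM and no minimum is set for the large pool (large_pool_size = 0). Current usage looks normal, with about 120 MB free. Note that the USED% column above is computed against shared_pool_size, so only the MB column is meaningful here. Because sizing is automatic, the large pool can also be squeezed whenever other components are resized.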
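
A variant of this check that reports each allocation as a share of the large pool itself, and shows the range ASMM has moved the pool through since startup, could look like the following. This is a sketch, not the query used in the original investigation; the column aliases are made up for illustration.

```sql
-- Share of each large pool allocation relative to the large pool total.
select name,
       round(bytes / 1024 / 1024, 2)                  "MB",
       round(100 * ratio_to_report(bytes) over (), 2) "PCT_OF_POOL"
  from v$sgastat
 where pool = 'large pool'
 order by bytes desc;

-- Current, smallest and largest size of the large pool since instance
-- startup, plus any user-specified floor and the granule size.
select component,
       current_size        / 1024 / 1024 "CURRENT_MB",
       min_size            / 1024 / 1024 "MIN_MB",
       max_size            / 1024 / 1024 "MAX_MB",
       user_specified_size / 1024 / 1024 "USER_SPECIFIED_MB",
       granule_size        / 1024 / 1024 "GRANULE_MB"
  from v$sga_dynamic_components
 where component = 'large pool';
```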

SQL> select start_time,
  2         component,
  3         oper_type,
  4         oper_mode,
  5         initial_size / 1024 / 1024 "INITIAL",
  6         final_size / 1024 / 1024 "FINAL",
  7         end_time
  8    from v$sga_resize_ops
  9   where component in ('large pool')
 10   order by start_time, component;

START_TIME          COMPONENT                 OPER_TYPE     OPER_MODE              INITIAL                FINAL END_TIME
------------------- ------------------------- ------------- --------- -------------------- -------------------- -------------------
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:03
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:03
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
30/01/2017 03:02:01 large pool                GROW          IMMEDIATE                  192                  256 30/01/2017 03:02:02
......
04/02/2017 02:32:32 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 02:32:33
04/02/2017 02:32:32 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 02:32:33
04/02/2017 02:35:47 large pool                SHRINK        DEFERRED                   384                  128 04/02/2017 02:35:47
......
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:01:56 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 03:01:57
04/02/2017 03:04:24 large pool                SHRINK        DEFERRED                   384                  128 04/02/2017 03:04:24

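The large pool is growing and shrinking quite frequently: immediate GROW operations in 64 MB steps up to 384 MB, each followed a few minutes later by a deferred SHRINK back to 128 MB.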
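
To judge how frequent the resizing really is, the same view can be aggregated instead of read row by row. The sketch below groups by hour; the bucket size and the aliases are arbitrary choices, and v$sga_resize_ops only retains a limited history of completed operations.

```sql
-- Count large pool resize operations per hour, split by direction and mode.
select trunc(start_time, 'HH24')        "HOUR",
       oper_type,
       oper_mode,
       count(*)                         "OPS",
       min(initial_size) / 1024 / 1024  "MIN_INITIAL_MB",
       max(final_size)   / 1024 / 1024  "MAX_FINAL_MB"
  from v$sga_resize_ops
 where component = 'large pool'
 group by trunc(start_time, 'HH24'), oper_type, oper_mode
 order by 1, 2;
```

Hours with many IMMEDIATE GROW rows are the ones in which a session had to wait for memory, which is where ORA-04031 is most likely to surface.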

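2.3 Parallel execution parameters

The failing trace shows the SQL running with a fairly high degree of parallelism (64). Check the parallel-related parameters: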

SQL> show parameter cpu_count

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     16
SQL> show parameter parallel_max

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
parallel_max_servers                 integer     640

A degree of parallelism of 64 is high for a host whose cpu_count is only 16; lowering it is advisable.
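
Beyond cpu_count and parallel_max_servers, a few more settings and counters help estimate how much PX message buffer memory the instance needs. The queries below are a sketch: the parameter list is one reasonable selection, not an exhaustive one.

```sql
-- Parallel execution settings that influence PX message buffer demand.
select name, display_value
  from v$parameter
 where name in ('parallel_max_servers',
                'parallel_min_servers',
                'parallel_threads_per_cpu',
                'parallel_execution_message_size',
                'parallel_degree_policy')
 order by name;

-- Server and message buffer counters (including high-water marks)
-- accumulated since instance startup.
select statistic, value
  from v$px_process_sysstat
 where statistic like 'Servers%'
    or statistic like 'Buffers%';
```

A buffer high-water mark far above the current figure suggests that a few very wide statements, such as the parallel(64) insert in the trace, drive the peak demand on the large pool.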

2.4 Detailed alert log analysis

Memory Notification: Library Cache Object loaded into SGA
Heap size 51201K exceeds notification threshold (51200K)

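This message means that the memory requested for a library cache object exceeded a threshold, which is controlled by _kgl_large_heap_warning_threshold. The feature was introduced in 10gR2. On its own the message does not indicate a problem; what matters is whether ORA-4031 errors follow.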
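
The current threshold can be read from the X$ fixed tables. This is a sketch that assumes a SYS connection; X$ tables are undocumented internals, so treat the query as a diagnostic aid only.

```sql
-- Run as SYS: X$ fixed tables are not visible to ordinary users.
-- The value is expressed in bytes (52428800 bytes = 51200K).
select i.ksppinm  "NAME",
       v.ksppstvl "VALUE",
       v.ksppstdf "IS_DEFAULT"
  from x$ksppi  i,
       x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm = '_kgl_large_heap_warning_threshold';
```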

Reference:

Memory Notification: Library Cache Object loaded into SGA / ORA-600 [KGL-heap-size-exceeded] (Doc ID 330239.1)

#### noap_j000_4031.trc ####

Memory Notification: Library Cache Object loaded into SGA
Heap size 73935K exceeds notification threshold (51200K)
            
LibraryHandle:  Address=0x855ade650 Hash=70548654 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
  ObjectName:  Name=alter table MOD_CDR_HW drop partition SYS_P1399962 
    FullHashValue=3aa1433897dd4d6fc458246c70548654 Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=1884587604 OwnerIdn=83
  Statistics:  InvalidationCount=0 ExecutionCount=0 LoadCount=2 ActiveLocks=1 TotalLockCount=1 TotalPinCount=1
  Counters:  BrokenCount=1 RevocablePointer=1 KeepDependency=1 BucketInUse=0 HandleInUse=0 HandleReferenceCount=0
  Concurrency:  DependencyMutex=0x855ade700(0, 1, 0, 0) Mutex=0x855ade780(1011, 21, 0, 6)
  Flags=RON/PIN/TIM/PN0/DBN/[10012841] 
  WaitersLists:  
    Lock=0x855ade6e0[0x855ade6e0,0x855ade6e0] 
    Pin=0x855ade6c0[0x855ade6c0,0x855ade6c0] 
  Timestamp:  Current=02-04-2017 02:00:34 
  HandleReference:  Address=0x855ade820 Handle=(nil) Flags=[00]

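The trace file that accompanies the notification records the statement involved. It is a DROP PARTITION DDL of the same kind as the one printed in the alert log:

KGL object name :alter table MOD_JS_CDR_HW drop partition SYS_P1404186

Next, the large pool errors: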

Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p085_5172.trc  (incident=109601):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109601/noap_p085_5172_i109601.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p034_5070.trc  (incident=101439):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_101439/noap_p034_5070_i101439.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p092_5186.trc  (incident=109832):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_109832/noap_p092_5186_i109832.trc
Sat Feb 04 02:32:34 2017
Errors in file /app/oracle/diag/rdbms/noap/noap/trace/noap_p051_5104.trc  (incident=107372):
ORA-04031: unable to allocate 2048024 bytes of shared memory ("large pool","unknown object","large pool","PX msg pool")
Incident details in: /app/oracle/diag/rdbms/noap/noap/incident/incdir_107372/noap_p051_5104_i107372.trc

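The timing of these errors (02:32:34) lines up with the large pool resize activity found by the earlier query: an immediate GROW at 02:32:32, followed by a deferred SHRINK at 02:35:47.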

04/02/2017 02:32:32 large pool                GROW          IMMEDIATE                  320                  384 04/02/2017 02:32:33
04/02/2017 02:35:47 large pool                SHRINK        DEFERRED                   384                  128 04/02/2017 02:35:47

Reference:

Multiple ORA-4031 Errors Of Reducing Sizes For "PX msg pool" In The Large Pool (Doc ID 1515877.1)

This is related to Bug:13072654 - ORA-4031 CANT ALLOC 14MB IN LARGE POOL, PX MSG POOL. A one-off patch exists for 11.2.0.2 and can be applied; the alternatives are to set a minimum size for the large pool or to switch the SGA to manual management.

2.5 Shared pool size

Because the alert log also contains library cache object (LCO) notifications, check how the shared pool is currently being used:

SQL> select t.*
  2    from (select name,
  3                 bytes / (1024 * 1024) "MB",
  4                 round(bytes / (select value
  5                                  from v$parameter t
  6                                 where t.name = 'shared_pool_size') * 100,
  7                       2) || '%' "USED%"
  8            from v$sgastat
  9           where pool = 'shared pool'
 10           order by 2 desc) t
 11   where rownum < 20;

NAME                                         MB USED%
-------------------------- -------------------- -----------------------------------------
free memory                2971.667167663574219 36.28%
PRTMV                      2858.977592468261719 34.9%
SQLA                       1862.341529846191406 22.73%
PRTDS                      614.0089035034179688 7.5%
KQR M PO                   315.4219131469726563 3.85%
KGLH0                      199.7758560180664063 2.44%
dbktb: trace buffer                    81.90625 1%
FileOpenBlock                60.796417236328125 .74%
ASM extent pointer array   52.86400604248046875 .65%
db_block_hash_buckets               44.50390625 .54%
dbwriter coalesce buffer               32.03125 .39%
ASH buffers                                  32 .39%
KGLHD                      29.06992340087890625 .35%
kglsim object batch        19.32781219482421875 .24%
private strands                   17.5341796875 .21%
Checkpoint queue                     15.6328125 .19%
event statistics per sess           15.33984375 .19%
write state object          14.6377716064453125 .18%
ksunfy : SSO free list           14.32470703125 .17%

PRTMV is an unfamiliar component, and it occupies about 2.8 GB here. For comparison, the same query on another database returns:

NAME                               MB USED%
-------------------------- ---------- -----------------------------------------
free memory                6021.38947 58.8%
SQLA                       1472.31024 14.38%
KGLH0                      1264.46631 12.35%
PRTMV                       219.83268 2.15%
KGLHD                      199.816628 1.95%
db_block_hash_buckets      178.003906 1.74%
dbktb: trace buffer        102.390625 1%
ASH buffers                        96 .94%
dbwriter coalesce buffer    80.078125 .78%
FileOpenBlock              71.1162643 .69%
KGLDA                       65.826004 .64%
Checkpoint queue           46.8984375 .46%
KKSSP                      40.4567947 .4%
private strands            25.9765625 .25%
dirty object counts array          24 .23%
event statistics per sess  22.8779297 .22%
ksunfy : SSO free list     21.7646484 .21%
parameter table block      19.9453812 .19%
KGLS                       19.5513763 .19%

The comparison suggests that this value may be abnormal. A search on MOS does turn up related bugs:

Bug 19461270 - high PRTMV allocations in shared pool executing concurrent DML and DDLs on interval partitioned tables (Doc ID 19461270.8)

Description

Concurrent DDLs and DMLs happening on interval partitioned table that was created with deferred segment creation clause may do high PRTMV allocations.

Workaround

Do not run DDLs concurrently.

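This bug can be triggered when interval partitioning is in use, which matches the current symptoms fairly well: the alert log shows interval partitions being added and partitions being dropped on the same tables.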
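
Which tables are exposed can be checked from the dictionary. The sketch below restricts the search to the SZ1X schema that appears in the alert log; whether other schemas should be included is an assumption to adjust.

```sql
-- Interval-partitioned tables in the schema seen in the alert log.
select t.owner,
       t.table_name,
       t.partitioning_type,
       t.interval
  from dba_part_tables t
 where t.owner = 'SZ1X'
   and t.interval is not null
 order by t.table_name;
```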

Bug 17037130 - Excess shared pool "PRTMV" memory use / ORA-4031 with partitioned tables (Doc ID 17037130.8)

Description

This bug is only relevant when using Partitioned Tables
SQL on a partitioned table may cause excess shared pool usage and
ultimately fail with ORA-4031.

Rediscovery Notes:
ORA-4031 with child cursor(s) having dependency table entries
referencing obsolete (OBS) multi-versioned objects.

Workaround
Flushing the shared_pool and avoiding DDLs during high load time
can help to avoid this issue.

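3. Recommendations

Based on the analysis above, the large pool ORA-4031 errors are very likely related to the shrinking of the large pool. The shared pool has a separate problem of its own.

For the large pool bug: this database is 11.2.0.2 with no PSU applied. A one-off patch exists for 11.2.0.2. If the patch is not applied, the following workaround can be used.

Set a minimum size for the large pool to avoid frequent shrinking. The database is under ASMM, and db_cache (22G) and shared_pool (8G) already have minimums set; the suggested value for large_pool is 200M.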

alter system set large_pool_size=200M scope=spfile sid='*';
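
With scope=spfile the new floor only takes effect after a restart. Since large_pool_size can be changed on a running instance, a variant that applies the floor immediately is sketched below, assuming an online change is acceptable on this system. The resize history above moves in 64 MB steps, which suggests a 64 MB granule; if so, a 200M floor would be rounded up to 256M.

```sql
-- Apply the floor immediately and persist it (alternative to scope=spfile).
alter system set large_pool_size = 200M scope = both sid = '*';

-- Verify: USER_SPECIFIED_MB shows the floor Oracle recorded (it may be
-- rounded to a whole number of granules), and CURRENT_MB should no longer
-- drop below it.
select component,
       current_size        / 1024 / 1024 "CURRENT_MB",
       user_specified_size / 1024 / 1024 "USER_SPECIFIED_MB",
       granule_size        / 1024 / 1024 "GRANULE_MB"
  from v$sga_dynamic_components
 where component = 'large pool';
```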

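If parallel jobs continue to be affected frequently, apply the one-off patch or switch to manual memory management.

The degree of parallelism of 64 used by the parallel job is high for a host with a cpu_count of 16; lower it as appropriate.

For the shared pool problem: the database is base 11.2.0.2 with no PSU, and neither of the two bugs has a one-off patch for 11.2.0.2 on Linux. If an immediate upgrade to 11.2.0.3 or later is not possible, the recommendations are as follows.

According to its description, bug 19461270 involves not only interval partitioning but also the 11g new feature deferred segment creation, so disabling that feature is recommended.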

alter system set deferred_segment_creation=false scope=spfile sid='*';
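
The parameter can also be changed on a running instance, and it only affects objects created afterwards. The sketch below shows that variant together with a check for partitions that were already created without a segment; whether those existing partitions still matter for the bug is not established here.

```sql
-- Takes effect for newly created objects without a restart.
alter system set deferred_segment_creation = false scope = both sid = '*';

-- Partitions created while the feature was on keep their segment-less state.
select table_owner,
       table_name,
       count(*) "SEGMENTLESS_PARTITIONS"
  from dba_tab_partitions
 where segment_created = 'NO'
 group by table_owner, table_name
 order by 3 desc;
```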

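The other bug, 17037130, is by its description unrelated to deferred segment creation. After applying the first step, keep monitoring; the temporary fix is to flush the shared pool or to avoid DDL during high-load periods.
To release the PRTMV memory that is already allocated, flush the shared pool manually during a quiet business period.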
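
A before-and-after check around the flush makes it easy to see how much PRTMV memory is actually released. This is a minimal sketch of that procedure, reusing the v$sgastat view from section 2.5.

```sql
-- Before: size of the PRTMV allocation and shared pool free memory.
select name, round(bytes / 1024 / 1024) "MB"
  from v$sgastat
 where pool = 'shared pool'
   and name in ('PRTMV', 'free memory');

-- Flush during a quiet period: cached cursors are discarded and must be
-- hard-parsed again on their next execution.
alter system flush shared_pool;

-- After: rerun the same query and compare the two results.
select name, round(bytes / 1024 / 1024) "MB"
  from v$sgastat
 where pool = 'shared pool'
   and name in ('PRTMV', 'free memory');
```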