ORA-12500: A Case of Memory Exhaustion

On the afternoon of March 8 we found that host 130.31.1.234 could not be logged into. Attempts to connect to Oracle returned an ORA-12500 error (TNS: listener failed to start a dedicated server process). Several causes can trigger this error, including the following:

The number of sessions or processes used by Oracle server processes has reached the limit set by the corresponding initialization parameters, so no new server process can be allocated (a query sketch for checking this follows this list).

System resources are exhausted, so the fork() system call that Oracle issues to start a new process fails for lack of resources.

On AIX, the maxuproc attribute of the sys0 object defines the maximum number of processes a user may run; if the user's process count approaches this value, Oracle may be unable to start new processes.
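
For the first cause, a quick check is to compare the current and high-water session/process usage against the configured limits in v$resource_limit (a minimal sketch, run as a privileged user):

-- Compare current and high-water usage of processes and sessions with their limits
select resource_name, current_utilization, max_utilization, limit_value
  from v$resource_limit
 where resource_name in ('processes', 'sessions');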

Because the host could not be reached remotely, it was rebooted in the evening. After the reboot, the Oracle alert log contained the entry "skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3", which means the skgpspawn module hit an error while forking a new process. From this we can conclude that the session or process limits were not the cause: had sessions or processes been exhausted, Oracle would have explicitly returned ORA-00018 (maximum number of sessions exceeded) or ORA-00020 (maximum number of processes (%s) exceeded).

Checking the attributes of the sys0 object on the host showed that maxuproc was set to 4096, so an inappropriate setting of this parameter can essentially be ruled out as the cause of the connection problem.

When system resources are exhausted, the Oracle listener cannot spawn new server processes for incoming connections, while memory held by existing server processes may never be released. Statspack, the performance reporting tool in Oracle 9i, runs in the background as a JOB and is currently configured to take a snapshot every two hours. Analysis of the snapshots showed that the last snapshot before the host reboot started at 14:15; after that the system entered a resource-starved state and Oracle could no longer fork new job (j00n) processes, so the final snapshot predates the onset of the problem.
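
As a sketch of how that schedule can be verified (assuming the snapshots are submitted through DBMS_JOB under the default PERFSTAT schema, which is the usual statspack setup):

-- Show the statspack snapshot job and its interval
select job, what, interval, last_date, next_date
  from dba_jobs
 where lower(what) like '%statspack.snap%';

-- List recent snapshots to see when the last one was taken
select snap_id, snap_time
  from perfstat.stats$snapshot
 order by snap_time desc;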

The snapshot contents show no record of database parameter changes between 12:06 and 14:15. sga_max_size was set to 10 GB and pga_aggregate_target to 4 GB; allowing for the fact that with RAC enabled the actual SGA memory usage can exceed the sga_max_size setting, Oracle's total memory consumption should still have been bounded by roughly 20 GB. The host has 64 GB of physical memory and the billing system had been running well all along, so misconfigured Oracle memory parameters can be ruled out as the cause of this incident.
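
For reference, the settings mentioned above can be confirmed with a simple query (a minimal sketch):

-- Check the configured SGA and PGA memory limits
select name, value
  from v$parameter
 where name in ('sga_max_size', 'pga_aggregate_target');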

Further analysis of the snapshot shows that during this interval the database buffer cache hit ratio was 55.90, far below its usual level, which suggests that the database was running operations different from the normal daily workload; these operations generated heavy physical reads, visible as the sharp drop in the buffer cache hit ratio. The wait-event analysis shows that db file scattered read (multiblock physical reads) was the dominant wait event in this period, which further supports that judgement.
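
Statspack derives these figures from snapshot deltas; as a rough sketch, the same indicators can also be approximated directly from the dynamic views (instance-lifetime values, so only indicative):

-- Classic buffer cache hit ratio from v$sysstat
select round(100 * (1 - phy.value / (db.value + con.value)), 2) as buffer_hit_pct
  from v$sysstat phy, v$sysstat db, v$sysstat con
 where phy.name = 'physical reads'
   and db.name  = 'db block gets'
   and con.name = 'consistent gets';

-- Top wait events by time waited (common idle events excluded by hand)
select event, total_waits, time_waited
  from v$system_event
 where event not like 'SQL*Net%'
   and event not in ('rdbms ipc message', 'pmon timer', 'smon timer')
 order by time_waited desc;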

The snapshot analysis confirms that shortly before the connection problem there were indeed operations on the P6702 instance that generated heavy physical reads. However, the memory Oracle actually uses is bounded by sga_max_size and pga_aggregate_target to roughly 20 GB, and the billing system database stores its datafiles on raw devices, so excessive use of the filesystem cache is not possible; the Oracle database can therefore be ruled out as the source of the resource exhaustion.

Because no log of system memory usage during that interval exists, we cannot determine how much memory the other services on the host were actually using at the time, but we can rule out Oracle as the cause of the problem.

Comments

  1. Applies to:
    Oracle Server – Enterprise Edition – Version: 9.2.0.1 to 10.2.0.5 – Release: 9.2 to 10.2
    Information in this document applies to any platform.
    ***Checked for relevance on 25-Jul-2010***
    Symptoms
    New connections are not allowed and the following errors appear in the alert log

    ERROR
    skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn5
    skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3

    Occurs intermittently.
    Cause
    Insufficient SWAP memory allocated.

    This problem has been investigated in the following bug, which was closed as not a bug:

    bug 4189338 INSTANCES FAILED WITH MULTIPLE ERROR MESSAGES
    Solution
    Set swap space to 1 to 2 times the size of the installed physical memory, depending on the Installation and Configuration Requirements clarified in the following note:

    Note 169706.1 Oracle Database on AIX,HP-UX,Linux,MacOSX,Solaris,Tru64 Operating Systems Installation and Configuration Requirements Quick Reference (8.0.5 to 10.2)

    Resolution of this same issue has also been seen by increasing the OS parameter for the maximum number of processes per user from its default value (e.g. the default on AIX is 128; increasing it to 1280 resolved the issue).

  2. Hdr: 4189338 9.2.0.6.0 RDBMS 9.2.0.6.0 PERFORMANCE PRODID-5 PORTID-212
    Abstract: INSTANCES FAILED WITH MULTIPLE ERROR MESSAGES

    *** 02/16/05 12:02 pm ***
    TAR:
    —-
    4180379.994

    PROBLEM:
    ——–
    2 instances on server (DDWP and MSRP) failed after background processes
    terminated. Alert log segments are below.
    DDWP alert log segment
    Wed Jan 19 05:38:16 2005

    skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3

    Wed Jan 19 05:38:38 2005

    LGWR: terminating instance due to error 472

    Instance terminated by LGWR, pid = 21221448

    MSRP alert log segment
    Wed Jan 19 05:38:18 2005

    found dead shared server ‘S000’, pid = (11, 1)

    skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3

    failed to start shared server ‘S000’, oer=27142

    found dead dispatcher ‘D000’, pid = (12, 1)

    skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3

    failed to start dispatcher ‘D000’ for network
    ‘(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))’, oer=27142

    Restarting dead background process CJQ0

    Wed Jan 19 05:38:20 2005

    Errors in file /ora01/app/oracle/admin/msrp/bdump/msrp_pmon_20840532.trc:

    ORA-443: background process “CJQ0” did not start

    ORA-27300: OS system dependent operation:fork failed with status: 12

    ORA-27301: OS failure message: Not enough space

    ORA-27302: failure occurred at: skgpspawn3

    Wed Jan 19 05:39:18 2005

    DBW1: terminating instance due to error 472

    Instance terminated by DBW1, pid = 20951098

    DIAGNOSTIC ANALYSIS:
    ——————–
    RFA 14438 filed – issue felt to be a bug
    IBM ticket logged to verify that this is not a hardware issue.
    – IBM verified the hardware is okay and there are no hardware problems

    Swap:
    System Configuration: lcpu=16 mem=128000MB

    kthr     memory               page                     faults          cpu
    ----- -------------- ------------------------- ----------------- -----------
     r  b       avm   fre re pi po  fr  sr cy    in    sy   cs us sy id wa
     3  9  13474399 15601  0  0  0 132 638  0  2842 39907 5010 14  2 63 21

    ALERT.LOG:
    ——————

    Wed Jan 19 05:38:16 2005
    skgpspawn failed:category = 27142, depinfo = 12, op = fork, loc = skgpspawn3

    The following errors were not shown in the RDA; it is not clear whether they
    appear in the other database's alert.log:
    ORA-443: background process “CJQ0” did not start

    ORA-27300: OS system dependent operation:fork failed with status: 12

    ORA-27301: OS failure message: Not enough space

    ORA-27302: failure occurred at: skgpspawn3

    – currently 128 GB of RAM on the server, and another 30 GB of swap space.

    WORKAROUND:
    ———–
    None

    RELATED BUGS:
    ————-

    REPRODUCIBILITY:
    —————-

    TEST CASE:
    ———-

    STACK TRACE:
    ————

    SUPPORTING INFORMATION:
    ———————–

    24 HOUR CONTACT INFORMATION FOR P1 BUGS:
    —————————————-

    DIAL-IN INFORMATION:
    ——————–

    IMPACT DATE:
    ————
    Affecting production att
    – I highly suggest reviewing the uploaded files on ESS30
    – Also highly suggest reviewing the SR for all of the relevant information
    about this issue. There have been many steps implemented to try and resolve
    this issue that have failed.

  3. teapot says

    “Allowing for the fact that with RAC enabled the actual SGA memory usage can exceed the sga_max_size setting”
    What does this mean? Is there a related note on Metalink?
    Thanks

    • admin says

      This write-up is not entirely clear; the database version is 9.2.0.6.

      We had previously diagnosed an SGA memory leak problem in this 9i RAC system. Bugs on MOS related to 9i RAC memory leaks:

      Bug 3910149 Very small SGA memory leak in RAC environments:
      Slow SGA memory leak in an internal KGL descriptor (indicated by the
      “Permanent space allocated for KGL handle dependents” entry in library
      cache dump).

      Bug 3420610 RAC “BAST” leak can lead to instance hang:
      There is a rare case where Oracle does not release the BAST context after
      rejecting a ping, and this can lead to a BAST context leak. If
      this situation happens frequently enough and over time it is possible to
      run out of BAST context entries in which case LMS cannot process
      any incoming pings leading to a hang scenario.

      Bug 5043858 Server accept handle leak can crash RAC instance:
      RAC instance may crash due to server accept handle leak as
      a close message may not be sent on disconnect.

  4. admin says

    What Is GCS_RESOURCES And How Do We Control Them?

    Applies to:
    Oracle Server – Enterprise Edition – Version: 9.2.0.1 to 10.2.0.1
    Information in this document applies to any platform.
    Goal
    Question 1: What is gcs_resources ?

    Question 2: What happens if it exceeds its limit?

    Question 3: How do we adjust the limit?
    Solution

    Answer 1: gcs_resources are global cache resources, which are essentially structures used to protect cache structures (buffers). Each instance “masters” cache resources, and this mastership is implemented via a gc resource structure.

    Answer 2: Nothing – this simply states how high the value has reached for a particular resource. If we need more gcs_resources for some reason, they are obtained from the shared pool.

    Answer 3: RAC-specific memory is allocated at the time of SGA creation, mostly in the shared pool. However, this memory does not count against the SHARED_POOL_SIZE defined in the init.ora parameter file. Therefore, when migrating Oracle from single instance to RAC, you do not need to adjust the SHARED_POOL_SIZE parameter to accommodate the additional memory that RAC uses; Oracle does that automatically. Note that the memory for the KCL global cache lock elements will be allocated in the buffer cache, and not in the shared pool.

    You can monitor the exact RAC memory resource usage of the shared pool by querying V$SGASTAT for GCS, GES and KCL related entries. Also, the current and maximum number of GCS resources / shadows (resource names gcs_resources and gcs_shadows) and GES resources / enqueues (resource names ges_ress and ges_locks) can be obtained from V$RESOURCE_LIMIT. In case we are running out of reserved memory for GCS / GES related components, Oracle will dynamically allocate memory from the free memory pool in the shared pool. However, it is recommended to carefully calculate the resources and enqueues as needed since dynamic memory allocation from the shared pool is an expensive operation.
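
    For example, a sketch of the queries described above (the v$sgastat entry names vary slightly by version, so the filters here are illustrative):

    -- RAC-related allocations in the shared pool
    select pool, name, bytes
      from v$sgastat
     where name like 'gcs%' or name like 'ges%' or name like 'KCL%';

    -- Current and maximum GCS/GES resource usage
    select resource_name, current_utilization, max_utilization, initial_allocation
      from v$resource_limit
     where resource_name in ('gcs_resources', 'gcs_shadows', 'ges_ress', 'ges_locks');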

    We recommend monitoring for a longer period to see whether this limit is reached consistently; if it is, you could consider increasing the value of gcs_resources per instance:

    SQL> alter system set "_gcs_resources"=[value] scope=spfile sid='[sid]';
    
  5. admin says

    LCK temporarily stuck waiting for latch ‘Child row cache objects’

    Applies to:
    Oracle Server – Enterprise Edition – Version: 10.2.0.1 to 11.1.0.7
    Information in this document applies to any platform.
    Symptoms

    One RAC instance hangs from time to time. The hang resolves itself after a while until it hangs again after some time.

    1. alert.log info

    => alert log of the hanging instance shows on one node that the LCK process is
    blocked on different enqueues and hanging messages are reported afterwards
    (e.g. ORA-3136 and WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK)

    LCK0 started with pid=10, OS id=7217
    Wed Feb 11 12:25:36 2009

    Mon Feb 16 19:01:24 2009
    GES: Potential blocker (pid=7217) on resource IV-0051D934-1013071D;
    enqueue info in file /u01/oracle/admin/SNM/bdump/snm2_lmd0_7006.trc and
    DIAG trace file
    Mon Feb 16 19:15:59 2009
    GES: Potential blocker (pid=7217) on resource CI-00000046-00000002;
    enqueue info in file /u01/oracle/admin/SNM/bdump/snm2_lmd0_7006.trc and
    DIAG trace file

    Mon Feb 16 19:26:16 2009
    >>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=25
    System State dumped to trace file
    /u01/oracle/admin/SNM/bdump/snm2_mmon_7032.trc

    Tue Feb 16 19:35:22 2009
    WARNING: inbound connection timed out (ORA-3136)

    => the alert.log on the other nodes shows blockers on CI (cross-instance call) resources

    Mon Feb 16 19:27:53 2009
    GES: Potential blocker (pid=13607) on resource CI-00000046-00000005;
    enqueue info in file /u01/oracle/admin/SNM/udump/snm1_ora_8418.trc and
    DIAG trace file

    2. Systemstate dumps show the LCK process is blocked in the ‘Child row cache objects’ latch
    and plenty of processes are waiting on it (waiter count high). The LCK stack contains
    terms like ‘sskgpwwait kslges kqrbip kqrbfr’.

    (latch info) wait_event=0 bits=0
    Location from where call was made: kqrbip:
    waiting for 5795df1a0 Child row cache objects level=4 child#=3
    Location from where latch is held: kghfrunp: clatch: wait:
    Context saved from call: 0
    state=busy, wlstate=free
    waiters [orapid (seconds since: put on list, posted, alive check)]:
    249 (1, 1234808779, 1)

    161 (1, 1234808779, 1)
    waiter count=34
    gotten 108711869 times wait, failed first 40623351 sleeps 1869257
    gotten 14091082 times nowait, failed: 43699893
    possible holder pid = 255 ospid=3663
    on wait list for 5795df1a0

    3. awrrpt info

    => After the hang, the awrrpt shows that plenty of ‘KQR L PO’ memory was freed during the hang (a query sketch for watching these pools directly follows this symptom list)

    Pool    Name           Begin MB    End MB   % Diff
    ------  ------------  ---------  --------  -------
    shared  KQR L PO        2,118.6   1,005.9   -52.52
    shared  free memory       932.9   2,054.4   120.21

    => The rowcache (dictionary cache, the dc_*) usage decreases, too

    dc_histogram_defs 332,935
    dc_object_ids 57,894
    dc_objects 160,639
    dc_segments 382,909

    => big cursors are entering the shared pool regularly
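
    The ‘KQR L PO’ and free-memory figures above can also be watched directly between reports (a minimal sketch):

    -- Dictionary-cache related shared pool components and free memory
    select pool, name, bytes
      from v$sgastat
     where name like 'KQR%' or name = 'free memory';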
    Cause
    The shared pool is stressed and memory needs to be freed for the new cursors. As a consequence, the dictionary cache is reduced in size by the LCK process, causing a temporary hang of the instance since LCK cannot do any other work during that time. Because the dictionary cache is a memory area protected clusterwide in RAC, LCK is responsible for freeing it in collaboration with the dictionary cache users (the sessions using cursors referenced in the dictionary cache). This process can be time consuming when the dictionary cache is big.
    Solution

    a. reduce the stress on the shared pool

    => by increasing it above the value automatically reached with dynamic SGA, e.g.
    when sga_target is set to 16G and shared_pool_size was 6G during the hang, set shared_pool_size to e.g. 8G.

    => by reducing the number of big cursors entering the shared pool, e.g. cursors using more than 1 MB of sharable_mem, e.g. by using bind variables:
    select sql_text from v$sqlarea where sharable_mem > 1048576;

    b. reduce the dictionary cache usage in order to reduce the size of the dictionary cache, e.g.

    => when dc_histogram_defs is too high, it can point towards histogram calculation on all columns
    of the tables; histograms should only be calculated on indexed columns
    => when dc_segments is high compared to dc_object_ids, it can point towards excessive partitioning usage. Reducing partition/subpartition usage will help reduce the dictionary cache memory needed to manage it (a v$rowcache sketch follows after item d).

    c. set _enable_shared_pool_durations = false to avoid that one duration (a memory area in the shared pool reserved for a specific usage) has to supply all the space required for that usage, i.e. when the duration containing the dictionary cache needs to free memory, that duration is under extra stress since memory from other durations cannot be used. Setting it to false means that any type of memory in the subpool can be used to free space. As a consequence, the number of subpools is reduced by the factor of the number of durations (4 in 10gR2); hence tuning _kghdsidx_count is advisable, e.g. increasing it to keep subpool sizes manageable (see note:396940.1). An illustrative sketch follows after item d.
    d. check that patch:8666117 has been applied. This patch speeds up the processing to free memory.
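
    For item b, the per-type dictionary cache usage (the dc_* figures quoted above) can be read from v$rowcache (a minimal sketch):

    -- Dictionary cache usage by type; high dc_histogram_defs / dc_segments counts
    -- point to the situations described in item b
    select parameter, gets, getmisses, "COUNT", usage
      from v$rowcache
     order by gets desc;

    For item c, an illustrative sketch only: both are underscore parameters that should normally be changed only under the guidance of Oracle Support, and the _kghdsidx_count value below is just an example, not a recommendation:

    -- Hypothetical example settings, not recommendations
    alter system set "_enable_shared_pool_durations"=false scope=spfile sid='*';
    alter system set "_kghdsidx_count"=2 scope=spfile sid='*';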

  6. admin says

    High memory allocation in GES resource Cache and ORA-4031 errors

    Applies to:
    Oracle Server – Enterprise Edition – Version: 11.1.0.6 to 11.2.0.2 – Release: 11.1 to 11.2
    Information in this document applies to any platform.
    Symptoms

    Under certain workload conditions, with high enqueue activity, the GES resource cache memory allocation in the shared pool can grow excessively.

    The usual resource cleanup mechanism for the GES cache by the LMON background process may not be sufficient in this case, and we might see ORA-4031 errors and possible instance crashes due to this.

    Cause

    This issue is corrected in the following bug fix.

    Bug 10042937 — HIGH MEMORY GROUP IN GES_CACHE_RESS AND ORA-4031 ERRORS

    You may want to consider implementing the solution given if you are seeing the following symptoms:

    o The memory of the ges_cache_ress could show high growth over a period of time. The ges_cache_ress captured in v$resource_limit would look as follows

    select resource_name, current_utilization, max_utilization, initial_allocation from v$resource_limit where resource_name like '%ges_ress%';

    RESOURCE_NAME    CURRENT_UTILIZATION  MAX_UTILIZATION  INITIAL_ALLOCATION
    ---------------  -------------------  ---------------  ------------------
    ges_ress                       17717            19354               29912
    ges_ress                      416646           416647               29912
    ges_ress                     1357968          1358021               29912
    ges_ress                     2171240          2171240               29912
    ges_ress                     2954514          2954518               29912
    ges_ress                     3698162          3698166               29912
    ges_ress                     4369513          4369535               29912

    o At the same time, we can see the SGA growth in v$sgastat:

    select pool, name, bytes from v$sgastat where name like '%ges resource%';
    POOL         NAME                BYTES
    -----------  -------------  ----------
    shared pool  ges resource     12585776
    shared pool  ges resource    120158976
    shared pool  ges resource    389778368
    shared pool  ges resource    624732080
    shared pool  ges resource    850151288
    shared pool  ges resource   1067068464
    shared pool  ges resource   1257971960

    o ORA-4031 errors in the ges resource cache.

    ORA-4031: unable to allocate 5792 bytes of shared memory ("shared pool","unknown object","sga heap(1,1)","ges resource ")
    Thu Jul 08 11:18:53 2010
    Errors in file /home/orabase/diag/rdbms/rac111/rac1111/trace/rac1111_ora_7828.trc (incident=1515626)

    Solution
    Bug 10042937 is fixed in 11.2.0.3; the solution is to apply patch 10042937. This bug fix makes the GES resource cleanup by the LMON process more aggressive, to handle workload conditions with high enqueue activity. With this fix, LMON cleans up more resources in each iteration.

    Please open a service request with Support to get the fix if you are seeing this issue on another platform.

    The workaround is to increase the "_lm_cache_res_cleanup" parameter to a higher value (the default is 25) and check whether the problem reproduces (an illustrative statement follows).
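
    As an illustrative sketch only (an underscore parameter, so change it only under the guidance of Oracle Support; the value 50 below is merely an example of a higher setting):

    -- Hypothetical example: raise the cleanup value from its default of 25
    alter system set "_lm_cache_res_cleanup"=50 scope=spfile sid='*';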

    References
    NOTE:10042937.8 – Bug 10042937 – High memory group in ges_cache_ress and ora-4031 errors.

  7. admin says

    Bug 9026008 “GES_RESOURCES” may increase over time with high enqueue activity

    Affects:

    Product (Component): Oracle Server (Rdbms)
    Range of versions believed to be affected: Versions BELOW 11.2.0.2
    Versions confirmed as being affected: 11.2.0.1, 11.1.0.7
    Platforms affected: Generic (all / most platforms affected)

    Fixed:

    This issue is fixed in

    12.1 (Future Release)
    11.2.0.2 (Server Patch Set)
    11.2.0.1 Bundle Patch 2 for Exadata Database

    Symptoms:

    Related To:

    Error May Occur
    Leak (Memory Leak / Growth)
    ORA-4031

    RAC (Real Application Clusters) / OPS
    _lm_cache_allocated_res_ratio
    _lm_cache_res_cleanup

    Description

    Memory used for “ges resource” may increase over time.

    Rediscovery Notes:
    From v$resource_limit, check ges_resources and ges_cached_resources.
    If these values continually increase when there is high enqueue activity
    you may be hitting this issue.

    Workaround
    Possibly increasing the _lm_cache_res_cleanup parameter (default is 25 on 11.2).
    If that does not help this fix may be required.

    NOTES:
    Most “ges resource” memory increases do not require this fix but in
    extreme cases where enqueue activity is very high this fix may be necessary.
    The fix introduces an underscore parameter _lm_cache_allocated_res_ratio
    (or a spare parameter if a one-off patch was applied). This parameter defines
    the percentage of cached resources being cleaned up. The default value of this
    parameter is 50, but if the fix is applied this value can be decreased to speed
    up the frequency of cached resource cleanup (an illustrative statement follows).
    If this fix (and parameter) does not improve the situation, the fix for
    bug 10042937 may be needed.
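
    As an illustrative sketch only (again an underscore parameter to be changed only with Oracle Support guidance; 25 is merely an example of a value below the default of 50):

    -- Hypothetical example: lower the cached-resource cleanup ratio once the fix is applied
    alter system set "_lm_cache_allocated_res_ratio"=25 scope=spfile sid='*';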

    Also see bug 10042937 which may be useful alongside this fix to help
    limit memory usage.
