Notes on Troubleshooting a Listener Failure

The customer complained that the listener had recently been failing frequently and could no longer register services, causing front-end processes to fail to connect.
Based on the approximate time of the last occurrence the customer mentioned, let's start with the listener log:

[.......]
21-JUN-2017 15:35:14 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVER=DEDICATED)(SERVICE_NAME=app2012)) * (ADDRESS=(PROTOCOL=tcp)
21-JUN-2017 15:35:14 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVER=DEDICATED)(SERVICE_NAME=app2012)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.85)(PORT=57408)) * establish * app2012 * 0
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.22)(PORT=1521)))      <<<<<<
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.21)(PORT=1521)))      <<<<<<
Listener completed notification to CRS on stop                    <<<<<
21-JUN-2017 15:35:14 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=TRFFAPP2)(USER=oracle))(COMMAND=stop)(ARGUMENTS=64)(SERVICE=LISTENER_TRFFAPP2)(VERSION=169870592)) * stop * 0
21-JUN-2017 15:35:14 * (CONNECT_DATA=(SID=app20122)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.58)(PORT=64878)) * establish * app20122 * 0

 

TNSLSNR for HPUX: Version 10.2.0.5.0 - Production on 21-JUN-2017 15:35:17

Copyright (c) 1991, 2010, Oracle.  All rights reserved.

System parameter file is /oracle/app/oracle/product/10.2.0/dbhome_1/network/admin/listener.ora
Log messages written to /oracle/app/oracle/product/10.2.0/dbhome_1/network/log/listener_trffapp2.log
Trace information written to /oracle/app/oracle/product/10.2.0/dbhome_1/network/trace/listener_trffapp2.trc
Trace level is currently 0

Started with pid=2260
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.22)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.21)(PORT=1521)))
Listener completed notification to CRS on start                  <<<<<<

TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
21-JUN-2017 15:35:17 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVER=DEDICATED)(SERVICE_NAME=app2012)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.94)(PORT=56715)) * establish * app2012 * 12514
TNS-12514: TNS:listener does not currently know of service requested in connect descriptor    <<<<<<<<<
21-JUN-2017 15:35:17 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVER=DEDICATED)(SERVICE_NAME=app2012)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.231.3.109)(PORT=60547)) * establish * app2012 * 12514
 
From the log above, the listener was working normally until 21-JUN-2017 15:35:14, when it suddenly logged "No longer listening on: ..." for both endpoints,
followed by the message that it had notified CRS of the stop:
Listener completed notification to CRS on stop.
Next comes the restart message: Started with pid=2260.
Finally there are TNS-12514 errors, meaning the listener did not yet know of the requested service (most likely because the instances had not yet re-registered with the freshly restarted listener); this matches the customer's report that front-end connections were failing.
However, the log gives no reason why the listener suddenly stopped: there were no errors before this point, and the alert log contains nothing either.
So the first step is to look at the volume of connection requests around the time of the failure.
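The per-minute counts below come from the listener log; they can be produced with something along these lines (a sketch, since the author's exact command is not shown; the log path is taken from the listener banner above):

# count "establish" requests per minute in listener_trffapp2.log
grep establish /oracle/app/oracle/product/10.2.0/dbhome_1/network/log/listener_trffapp2.log | \
  awk '{print $1" "$2}' | cut -c1-17 | uniq -c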
    147 21-JUN-2017 15:27
    127 21-JUN-2017 15:31
     95 21-JUN-2017 15:32
   1217 21-JUN-2017 15:33   <<< high
   1443 21-JUN-2017 15:34   <<< high
  12265 21-JUN-2017 15:35   <<< issue time, very high
  10607 21-JUN-2017 15:36
As the counts show, the request volume dipped shortly before the problem, then rose sharply, peaking right at the time of the failure.
This looks very much like a connection storm driven by the front-end processes. Since new connections were arriving at such a high rate, the next step was to check the OS-level picture;
however, the customer had not deployed OSWatcher at the time, and the issue had already been resolved manually, so we asked them to deploy it as soon as possible.
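For reference, OSWatcher (OSWatcher Black Box, MOS Doc ID 301137.1) is started from its install directory roughly as follows; the directory, the 30-second interval and the 48-hour retention are illustrative values, not the customer's actual settings:

cd /opt/oswbb                       # install directory is illustrative
nohup ./startOSWbb.sh 30 48 gzip &  # sample every 30 seconds, keep 48 hours of archives, compress with gzip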
A few days later the problem recurred. Here is what the same per-minute counts looked like this time:
   608 27-JUN-2017 09:23
   333 27-JUN-2017 09:24
   809 27-JUN-2017 09:25  <<<<
  1306 27-JUN-2017 09:26   <<<<<
  1965 27-JUN-2017 09:27   <<<<
     4 27-JUN-2017 09:28
  1070 27-JUN-2017 09:29   <<<<
  1055 27-JUN-2017 09:30   <<<

Again there was a surge of new connection requests around the time of the failure. This time OSWatcher data is available, so let's look at its vmstat output:
zzz ***Tue Jun 27 09:27:12 EAT 2017
        procs           memory                   page                              faults       cpu
   r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs  us sy id
  95    39     0  64868005  88012273    0    0     2    0     0    0     0  94556 519483 34750   8  3 89           <<<<RunQ 95, blockQ 39
  95    39     0  64868005  88000105    0    0     0    0     0    0     0  149539 1720220 79082  45  7 48
  95    39     0  64868005  87979642    0    0     0    0     0    0     0  147551 1853425 77468  47  5 47
zzz ***Tue Jun 27 09:27:29 EAT 2017
        procs           memory                   page                              faults       cpu
   r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs  us sy id
  86    61     0  66285766  87435149    0    0     2    0     0    0     0  94556 519489 34750   8  3 89
  86    61     0  66285766  87354916    0    0     0    0     0    0     0  141514 2691511 78742  33  8 59
  86    61     0  66285766  87287447    0    0     0    0     0    0     0  142313 2680929 80039  32  9 59
zzz ***Tue Jun 27 09:27:46 EAT 2017
        procs           memory                   page                              faults       cpu
   r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs  us sy id
 171    61     0  80397655  86770380    0    0     2    0     0    0     0  94556 519494 34750   8  3 89        <<<RunQ 171, blockQ 61
 171    61     0  80397655  86733359    0    0     0    0     0    0     0  153462 2252976 99530  62  8 30
 171    61     0  80397655  86709517    0    0     0    0     0    0     0  153881 2250919 97705  59  8 33
 
At the OS level the run queue and blocked queue were both very large, far exceeding the number of CPUs on the host, and the cluster logs (VCS engine_A.log) were repeatedly warning that CPU usage was too high:
++++++++++++++++engine_A.log
2017/06/27 09:00:30 VCS WARNING V-16-1-50086 CPU usage on trffapp2 is 83%
2017/06/27 09:12:00 VCS INFO V-16-1-50086 CPU usage on trffapp2 is 66%
2017/06/27 09:17:36 VCS INFO V-16-1-50086 CPU usage on trffapp2 is 62%
2017/06/27 09:18:06 VCS WARNING V-16-1-50086 CPU usage on trffapp2 is 80%
2017/06/27 09:20:36 VCS NOTICE V-16-1-50086 CPU usage on trffapp2 is 74%
2017/06/27 09:21:06 VCS CRITICAL V-16-1-50086 CPU usage on trffapp2 is 91%
2017/06/27 09:22:01 VCS ERROR V-16-2-13027 (trffapp2) Resource(CSSD_trapp) - monitor procedure did not complete within the expected time.
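For reference, the run-queue figures above can be compared against the number of CPUs on the host. On HP-UX the processor count can be checked with something like the following (commands and output vary by HP-UX release, so treat this as a sketch):

ioscan -fkC processor | grep -c processor   # count processor instances
machinfo | grep -i cpu                      # Itanium systems also report the CPU count here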

Clearly, a sudden spike in OS resource utilization is what triggered the listener failure. Next, let's look at the AWR report for the problem window.
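The "SQL ordered by Gets" section is excerpted below. The same figures can also be pulled directly from dba_hist_sqlstat for the snapshot range covering the incident; a sketch, with placeholder snapshot IDs:

sqlplus -s / as sysdba <<'EOF'
-- top 5 SQL by buffer gets between two AWR snapshots (1234/1236 are placeholders)
select *
  from (select sql_id,
               sum(buffer_gets_delta)                  buffer_gets,
               sum(executions_delta)                   executions,
               round(sum(cpu_time_delta)     / 1e6, 2) cpu_sec,
               round(sum(elapsed_time_delta) / 1e6, 2) elapsed_sec
          from dba_hist_sqlstat
         where snap_id between 1234 and 1236
         group by sql_id
         order by buffer_gets desc)
 where rownum <= 5;
EOF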
SQL ordered by Gets
Buffer Gets  Executions  Gets per Exec  %Total  CPU Time (s)  Elapsed Time (s)  SQL Id         SQL Module        SQL Text
947,020,215 43 22,023,725.93 14.71 33793.03 107433.92 7sgm3npmzvwvj JDBC Thin Client  SELECT * FROM (SELECT pagetabl...
946,702,135 74 12,793,272.09 14.70 37181.04 122613.98 5aqranh8b8hkf JDBC Thin Client  SELECT * FROM (SELECT pagetabl...
475,885,893 33 14,420,784.64 7.39 18047.45 54492.09 50uhrdj7uant0 JDBC Thin Client  SELECT * FROM (SELECT pagetabl...
371,382,132 24 15,474,255.50 5.77 15852.55 48278.23 863b3ba2bx6h2 JDBC Thin Client  SELECT * FROM (SELECT pagetabl...
317,142,383 24 13,214,265.96 4.93 13056.82 43185.59 f4q2kfd3q1uh6 JDBC Thin Client  SELECT * FROM (SELECT pagetabl...
 
The top statements are essentially all the same piece of logic, and the first two have abnormally high buffer gets: the top statement alone read roughly 7,225 GB from the buffer cache (about 947 million gets at 8 KB per block). ASH also showed a large number of sessions on CPU, so tuning or throttling the SQL at the top of the CPU rankings should relieve the pressure.
After calls to this SQL were limited, the situation eased and the problem has not recurred.
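The ASH observation about sessions on CPU can be reproduced with a query along these lines (a sketch; the one-hour window is illustrative, and for older data dba_hist_active_sess_history would be used instead):

sqlplus -s / as sysdba <<'EOF'
-- sessions sampled ON CPU over the last hour, grouped by SQL
select sql_id, count(*) samples
  from v$active_session_history
 where session_state = 'ON CPU'
   and sample_time > sysdate - 1/24
 group by sql_id
 order by samples desc;
EOF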


Source: www.cnblogs.com/DataArt/p/10018844.html