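Alert details
##### NBA Unified Monitoring and Alerting Platform - DB: [Triggered]
> `Alert cluster:` jq-prod-db-prometheus
> `Alert description:` Oracle database rollback (undo) space usage exceeds 10G
> `Environment:` prod
> `Alert type:` oracle_prod
> `Alert instance:` NBA-rac01
> `Alert IP:` 11.11.11.11
> `Instance alias:` NBA-rac01
> `Current value:` 11.91G
> `Severity:` Critical
> `Triggered at:` 2021-03-03 04:01:59
> `Duration:` 10s
> `Alert count:` 1
> `Alert application:` NBA_oracle
> `Application owner:`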
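Troubleshooting process
- Log in to `11.11.11.11`, switch to the oracle user, and set the column widths (not needed if you use a PL/SQL client instead of SQL*Plus)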
[bzops@nba01 ~]$ sudo -i
[root@nba01 ~]# su - oracle
[oracle@nba01 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Tue Mar 2 15:05:36 2021
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL>
SQL> col sample_time for a30
col event for a30
col program for a30
col machine for a40
set lines 200 pages 999
- Query the undo tablespace usage history. Usage was indeed very high in the interval beginning `2021/3/3 3:58:39`: active undo segments totaled more than 17G
select begin_time,end_time,undoblks*8/1024/1024 undog,maxqueryid,activeblks*8/1024/1024 activeundo,inst_id from gv$undostat;
BEGIN_TIME END_TIME UNDOG MAXQUERYID ACTIVEUNDO INST_ID
2021/3/3 4:48:39 2021/3/3 4:58:39 0.00018310546875 89w8y2pgn25yd 0.2481689453125 2
2021/3/3 4:38:39 2021/3/3 4:48:39 0.00014495849609375 0rc4km05kgzb9 0.2481689453125 2
2021/3/3 4:28:39 2021/3/3 4:38:39 0.001434326171875 89w8y2pgn25yd 0.2481689453125 2
2021/3/3 4:18:39 2021/3/3 4:28:39 9.1552734375E-5 0rc4km05kgzb9 0.2481689453125 2
2021/3/3 4:08:39 2021/3/3 4:18:39 0.00020599365234375 0rc4km05kgzb9 0.2481689453125 2
2021/3/3 3:58:39 2021/3/3 4:08:39 9.60930633544922 89w8y2pgn25yd 17.6923828125 2
2021/3/3 3:48:39 2021/3/3 3:58:39 8.96279907226563 89w8y2pgn25yd 4.6112060546875 2
2021/3/3 3:38:39 2021/3/3 3:48:39 0.00026702880859375 0rc4km05kgzb9 0.2481689453125 2
2021/3/3 3:28:39 2021/3/3 3:38:39 0.00115966796875 89w8y2pgn25yd 0.2481689453125 2
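
The `undog` and `activeundo` columns are block counts multiplied by 8 and divided by 1024 twice, which yields GiB only if the database block size is 8 KB. The following is a minimal Python sketch of that arithmetic. It assumes an 8 KB block size (not confirmed in this note; check `db_block_size`), the helper name `blocks_to_gib` is hypothetical, and the block count is back-computed from the peak row rather than read from the database.

```python
def blocks_to_gib(blocks: int, block_size_kb: int = 8) -> float:
    """Convert a block count (e.g. UNDOBLKS or ACTIVEBLKS) to GiB."""
    # blocks * KB-per-block gives KB; dividing by 1024 twice gives GiB.
    return blocks * block_size_kb / 1024 / 1024


# ACTIVEBLKS implied by the 17.6923828125 figure in the 3:58:39 row,
# back-computed under the 8 KB assumption (not queried from the database).
peak_active_blocks = 2_318_976
print(blocks_to_gib(peak_active_blocks))  # 17.6923828125
```

If the block size were 16 KB, the same block count would correspond to twice as much space, so the multiplier in the query should match the actual block size.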
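- Using the time window identified above, query the active session history for that period. The `XID` condition (`xid is not null`) filters out sessions that have no transaction. During this window the process `oracle@scm03db02 (J000)`, invoking the program with object ID `91522`, spent a long time executing SQL `7p9061n7t9800`, so this SQL is almost certainly what drove the high undo usage. In `oracle@scm03db02 (J000)`, a process name starting with J denotes an Oracle job process, so this is a running job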
SQL> select sample_time,session_id,program,sql_id,plsql_entry_object_id,event,blocking_session from dba_hist_active_sess_history
where xid is not null and sample_time between timestamp '2021-03-03 03:57:00' and timestamp '2021-03-03 04:01:00';
SAMPLE_TIME SESSION_ID PROGRAM SQL_ID PLSQL_ENTRY_OBJECT_ID EVENT BLOCKING_SESSION
------------------------------ ---------- ------------------------------ ------------- --------------------- ------------------------------ ----------------
03-MAR-21 04.00.14.255 AM 27888 JDBC Thin Client 4d4pmm4ayqw17 91533
03-MAR-21 04.00.04.125 AM 18788 JDBC Thin Client 95zdtpfm6x7h1 91526
03-MAR-21 04.00.14.255 AM 29133 JDBC Thin Client 6c4fa820c91hk
03-MAR-21 04.00.14.255 AM 3786 JDBC Thin Client 97a43gbatv8m9
03-MAR-21 04.00.44.671 AM 3786 JDBC Thin Client 4j071dbgbxcgv
03-MAR-21 04.00.27.486 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522 gc current block 2-way
03-MAR-21 04.00.58.066 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522 gc current block 2-way
03-MAR-21 03.58.46.227 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522 gc current block 2-way
03-MAR-21 03.59.16.527 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522 gc current block 2-way
03-MAR-21 03.59.26.626 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522 gc current block 2-way
03-MAR-21 03.59.36.726 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522 gc current block 2-way
03-MAR-21 03.58.56.327 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522 gc cr grant 2-way
03-MAR-21 03.59.57.216 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 04.00.07.296 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 04.00.17.396 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 04.00.37.576 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 04.00.47.956 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 03.57.34.997 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 03.58.05.557 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 03.58.15.647 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 03.58.25.747 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 03.58.35.837 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 03.59.06.427 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 03.59.47.116 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 03.57.10.818 AM 24730 JDBC Thin Client 6c4fa820c91hk
03-MAR-21 04.00.04.125 AM 3786 JDBC Thin Client aaxs4c2a89wja
03-MAR-21 03.57.41.218 AM 2202 JDBC Thin Client cfs24zcws20d2
03-MAR-21 03.57.14.806 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522
03-MAR-21 03.57.45.377 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522 db file sequential read
03-MAR-21 03.57.55.477 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522 db file sequential read
03-MAR-21 03.57.04.696 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522 gc current grant 2-way
03-MAR-21 03.57.24.906 AM 18781 oracle@scm03db02 (J000) 7p9061n7t9800 91522 gc current grant 2-way
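
One way to read the listing above is to count samples per `sql_id`. `dba_hist_active_sess_history` keeps roughly one sample every 10 seconds for each active session, so the sample count approximates active time. The sketch below hard-codes the `SQL_ID` column as transcribed from the listing (it is not re-queried), and the helper name `top_sql` is hypothetical.

```python
from collections import Counter

# SQL_ID values transcribed from the listing above: 24 rows for the job
# session (SID 18781) plus the 8 JDBC Thin Client rows.
sql_ids = ["7p9061n7t9800"] * 24 + [
    "4d4pmm4ayqw17", "95zdtpfm6x7h1", "6c4fa820c91hk", "97a43gbatv8m9",
    "4j071dbgbxcgv", "6c4fa820c91hk", "aaxs4c2a89wja", "cfs24zcws20d2",
]

SAMPLE_INTERVAL_S = 10  # approximate spacing of DBA_HIST ASH samples


def top_sql(ids):
    """Return (sql_id, sample_count, approx_active_seconds) for the busiest SQL."""
    sql_id, samples = Counter(ids).most_common(1)[0]
    return sql_id, samples, samples * SAMPLE_INTERVAL_S


print(top_sql(sql_ids))  # ('7p9061n7t9800', 24, 240)
```

About 240 seconds of samples inside a 4-minute query window means the job session shows up in essentially every sample, which is consistent with the conclusion that it ran continuously through the spike.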
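- Look up the specific `object_id` and `sql_id`. They resolve to the cleanup procedure `SP_DELETE_LOG`, which the scheduler job `JOB_DELETE_LOG` runs every day at 03:50. Its statement `DELETE FROM T1 L WHERE L.CREATE_TIME < TRUNC(SYSDATE) - 90` deletes too many rows in one transaction, which causes the excessive undo usage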
SQL> select object_name,object_type from dba_objects where object_id=91522;
OBJECT_NAME OBJECT_TYPE
------------------------------ -------------------
SP_DELETE_LOG PROCEDURE
SQL> set long 20000
select sql_text from dba_hist_sqltext where sql_id='7p9061n7t9800';
SQL>
SQL_TEXT
--------------------------------------------------------------------------------
DELETE FROM T1 L WHERE L.CREATE_TIME < TRUNC(SYSDATE) - 90
SQL> select job_name,repeat_interval,next_run_date from dba_scheduler_jobs where ENABLED='TRUE' and job_action like '%SP_DELETE_LOG%';
JOB_NAME REPEAT_INTERVAL NEXT_RUN_DATE
JOB_DELETE_LOG FREQ=DAILY;BYHOUR=3;BYMINUTE=50 06-MAR-21 03.50.00.700000 AM ASIA/SHANGHAI
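
The `REPEAT_INTERVAL` value consists of semicolon-separated `KEY=VALUE` clauses. The sketch below handles only this simple form, not the full scheduler calendaring grammar, and compares the daily start time with the trigger time from the alert at the top of this note. The helper name `parse_repeat_interval` is hypothetical.

```python
from datetime import datetime


def parse_repeat_interval(text: str) -> dict:
    """Split 'KEY=VALUE;KEY=VALUE' clauses into a dict (simple form only)."""
    return dict(clause.split("=", 1) for clause in text.split(";") if clause)


spec = parse_repeat_interval("FREQ=DAILY;BYHOUR=3;BYMINUTE=50")

# Trigger time copied from the alert text.
alert = datetime(2021, 3, 3, 4, 1, 59)
job_start = alert.replace(
    hour=int(spec["BYHOUR"]), minute=int(spec["BYMINUTE"]), second=0
)

print(spec["FREQ"], job_start.time(), alert - job_start)  # DAILY 03:50:00 0:11:59
```

The alert fired about 12 minutes after the scheduled start, which matches the two high-undo intervals (3:48:39 and 3:58:39) in the undo history.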
Solution