【故障】EXP-00056: ORACLE error 1466 encountered

EXP-00056: ORACLE error 1466 encountered

ORA-01466: unable to read data - table definition has changed

Sense:2016年4月创金合信基金DB-EXP导出表过程后台日志出现如下告警和错误：

EXP-00056: ORACLE error 1466 encountered
ORA-01466: unable to read data - table definition has changed
. . exporting table TO32_T0
. . exporting table TO32_T0_DETAIL 314 rows exported
. . exporting table TO32_T0_JYKY
. . exporting table TO32_T0_TMP 155 rows exported
. . exporting table TO32_T0_ZLZY
. . exporting table TO32_T1
. . exporting table TO32_T1_DETAIL
EXP-00056: ORACLE error 1466 encountered
ORA-01466: unable to read data - table definition has changed
. . exporting table TO32_T1_JYKY
. . exporting table TO32_T1_TMP
EXP-00056: ORACLE error 1466 encountered
ORA-01466: unable to read data - table definition has changed

why?
SMON进程维护SMON_SCN_TIME数据字典基表，导致在EXP导出数据的过程中发生针对数据字典基表的DDL操作。因为在EXP的过程中，export导出程序会首先读取部分数据字典基表来定位要导出的数据的概况。但是，很可能在导出过程的某个时间点SMON会去维护SMON_SCN_TIME数据字典基表，使得export程序读取的基表发生DML变化，由此引发ORACLE ORA-1466错误。

针对运维人员和部分DBA的疑问在于该错误时常出现，但是并不是规律性的常规性错误。原因在于SMON维护SMON_SCN_TIME等数据字典基表的时间点不是常规性的，而是根据一定的频率和数据库的统计状态（SCN）进行维护操作。故此，当频率和数据库状态两者完全符合，SMON就会发起维护数据字典基表的任务。所以DBA仅仅从EXP的导出日志查看相关表的DDL，无异于缘木求鱼。
在创金合信数据库版本Release 11.2.0.4.0 Production Version，我们可以通过Oracle内部错误guide工具来初步的判定出现故障或者问题的原因：

01466, 00000, "unable to read data - table definition has changed"

根据guide ORA-信息可以验证我们基于直观的EXP日志文件的故障诊断猜测，我们猜对了：）

SQL*Plus: Release 11.2.0.4.0 Production on Thu Mar 24 05:39:19 2016
Copyright (c) 1982, 2013, Oracle. All rights reserved.
SQL> conn /as sysdba
Connected.
SQL> !oerr ora 1466
01466, 00000, "unable to read data - table definition has changed"
// *Cause: Query parsed after tbl (or index) change, and executed
// w/old snapshot
// *Action: commit (or rollback) transaction, and re-execute

SMON（系统监视器）后台进程的作用还有不为人知的维护SMON_SCN_TIME数据字典基表

SMON_SCN_TIME基表用于记录过去时间段中SCN(system change number)与具体的时间戳(timestamp)之间的映射关系，因为Oracle是采样记录这种映射关系，所以SMON_SCN_TIME可以较为粗糙地(不精确地)定位某个SCN的时间信息。实际的SMON_SCN_TIME映射是一张cluster table簇表。
SMON_SCN_TIME时间映射表最大的用途是为闪回类型的查询(flashback type queries)提供一种将时间映射为SCN的途径(The SMON scn time mapping is mainly for flashback type queries to map a time to an SCN)。
Metalink文档<Error ORA-01466 while executing a flashback query. [ID 281510.1]>介绍了SMON更新SMON_SCN_TIME的规律：

在版本10g中SMON_SCN_TIME每6秒钟被更新一次(In Oracle Database 10g, smon_scn_time is updated every 6 seconds hence that is the minimum time that the flashback query time needs to be behind the timestamp of the first change to the table.)
在版本9.2中SMON_SCN_TIME每5分钟被更新一次(In Oracle Database 9.2, smon_scn_time is updated every 5 minutes hence the required delay between the flashback time and table properties change is at least 5 minutes.)

另外从10g开始SMON也会清理SMON_SCN_TIME中的记录，SMON后台进程会每5分钟被唤醒一次，检查SMON_SCN_TIME在磁盘上的映射记录总数，若总数超过144000条，则会使用以下语句删除最老的一条记录(time_mp最小): 若仅仅删除一条记录不足以获得足够的空间，那么SMON会反复多次执行以上DELETE语句。

delete from smon_scn_time
where thread = 0
and time_mp = (select min(time_mp) from smon_scn_time where thread = 0)

SMON 维护SMON_SCN_TIME的触发场景

虽然Metalink文档<Error ORA-01466 while executing a flashback query. [ID 281510.1]>指出了在10g中SMON会以每6秒一次的频率更新SMON_SCN_TIME基表，但是实际观测可以发现更新频率与SCN的增长速率相关，在较为繁忙的实例中SCN上升极快时SMON可能会以6秒一次的最短间隔频率更新，但是在空闲的实例中SCN增长较慢，但仍会以每5或10分钟一次的频率更新，例如：

[oracle@datar ~]$ ps -ef|grep smon|grep -v grep
oracle 3922 1 0 Mar23 ? 00:00:09 ora_smon_ora11g

[oracle@datar ~]$ ps -ef|grep smon|grep -v grep
oracle 3922 1 0 Mar23 ? 00:00:09 ora_smon_ora11g
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
PL/SQL Release 11.2.0.4.0 - Production
CORE 11.2.0.4.0 Production
TNS for Linux: Version 11.2.0.4.0 - Production
NLSRTL Version 11.2.0.4.0 - Production

SQL> select * from global_name;
GLOBAL_NAME
--------------------------------------------------------------------------------
ORA11G

SQL> oradebug setospid 3922
Oracle pid: 13, Unix process pid: 3922, image: oracle@datar (SMON)

SQL> oradebug event 10500 trace name context forever,level 10 : 10046 trace name context forever,level 12;
Statement processed.

SQL> oradebug tracefile_name;
/home/oracle/product/diag/rdbms/ora11g/ora11g/trace/ora11g_smon_3922.trc
SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

在ora11g_smon_3922.trc跟踪文件中找出SMON进程insert数据到SMON_SCN_TIME映射表的记录如下，我们可以发现其更新频率:

PARSING IN CURSOR #140375269773744 len=141 dep=1 uid=0 oct=2 lid=0 tim=1458763578296250 hv=973751600 ad='a9899260' sqlid='9wncfacx0nj9h'
insert into smon_scn_time (thread, time_mp, time_dp, scn, scn_wrp, scn_bas, num_mappings, tim_scn_map) values (0, :1, :2, :3, :4, :5, :6, :7)
END OF STMT
PARSE #140375269773744:c=999,e=847,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,plh=0,tim=1458763578296045

可以通过以上INSERT语句的TIME_DP绑定变量值发现其更新SMON_SCN_TIME的时间规律，一般为5或10分钟一次。这说明SMON_SCN_TIME的更细频率与数据库实例的负载有关，更为贴切的说是与SCN的生成速度有关，其最短的间隔是每6秒一次，最长的间隔为10分钟一次。具体请见附件trace文件。
SMON维护SMON_SCN_TIME时相关的Stack CALL，ktf_scn_time是更新SMON_SCN_TIME的主要函数,但是11G的11.2.0.4版本的SMON跟踪文件中并没有相关的函数记录，如果使用10046高级别的跟踪事件，我们会在相关的追踪文件中发现处理函数的调用状态。
根据My Oracle Support支持文档，可以发现因SMON_SCN_TIME更新而引起的数据库错误有十分经典的案例。例如以下的两则，来源于MOS官方支持：
由于SMON_SCN_TIME的更新频率问题可能引起ORA-01466错误，详见:Error ORA-01466 while executing a flashback query. [ID 281510.1]
由于SMON_SCN_TIME数据不一致可能引起ORA-00600[6711]或频繁地执行”delete from smon_scn_time”删除语句，详见：ORA-00600[6711]错误一例 High Executions Of Statement “delete from smon_scn_time…” [ID 375401.1]
SMON 还可能使用以下SQL语句维护SMON_SCN_TIME字典基表:

select smontabv.cnt,
         smontab.time_mp,
         smontab.scn,
         smontab.num_mappings,
         smontab.tim_scn_map,
         smontab.orig_thread
from smon_scn_time smontab,
        (select max(scn) scnmax,
                  count(*) + sum(NVL2(TIM_SCN_MAP, NUM_MAPPINGS, 0)) cnt
                  from smon_scn_time
        where thread = 0) smontabv
where smontab.scn = smontabv.scnmax
   and thread = 0

insert into smon_scn_time (thread,
                                        time_mp,
                                        time_dp,
                                        scn,
                                        scn_wrp,
                                        scn_bas,
                                        num_mappings,
                                        tim_scn_map)
                              values (0, :1, :2, :3, :4, :5, :6, :7)

查询映射表的统计状态如下所示：

update smon_scn_time
      set orig_thread = 0,
           time_mp = :1,
           time_dp = :2,
           scn = :3,
           scn_wrp = :4,
           scn_bas = :5,
           num_mappings = :6,
           tim_scn_map = :7
where thread = 0
   and scn = (select min(scn) from smon_scn_time where thread = 0)

delete from smon_scn_time
where thread = 0
and scn = (select min(scn) from smon_scn_time where thread = 0)

如何禁止SMON更新SMON_SCN_TIME基表

可以通过设置诊断事件events=’12500 trace name context forever, level 10’来禁止SMON更新SMON_SCN_TIME基表(Setting the 12500 event at system level should stop SMON from updating the SMON_SCN_TIME table.):

SQL> alter system set events ‘12500 trace name context forever,level 10’;
System altered.

一般我们不推荐禁止SMON更新SMON_SCN_TIME基表，因为这样会影响flashback Query闪回查询的正常使用，但是在某些异常恢复的场景中SMON_SCN_TIME数据讹误可能导致实例的Crash，那么可以利用以上12500事件做到不触发SMON_SCN_TIME被更新。
如何手动清除SMON_SCN_TIME的数据

因为SMON_SCN_TIME不是bootstrap自举核心对象，所以DBA可以手动更新该表上的数据、及重建其索引。尤其需要注意的是，当SMON_SCN_TIME表中的数据和其索引中的数据不一致时会造成一些比较严重的后果，所以其索引的状态是否一致也是需要我们关注的核心问题，例如delete语句无法删除表中的记录问题，此时可以通过重新创建索引来解决。

SQL> drop index smon_scn_time_bak_tim_idx;
索引已删除。

SQL> drop index smon_scn_time_bak_scn__idx;
索引已删除。

SQL> create unique index smon_scn_time_bak_tim_idx on smon_scn_time_bak(time_mp);
索引已创建。

SQL> create unique index smon_scn_time_bak_scn_idx on smon_scn_time_bak(scn);
索引已创建。

SQL> analyze table smon_scn_time_bak validate structure cascade;
表已分析。

可以在设置了12500事件后手动删除SMON_SCN_TIME上的记录，重启实例后SMON会继续正常更新SMON_SCN_TIME。除非是因为SMON_SCN_TIME表上的记录与索引smon_scn_time_tim_idx或smon_scn_time_scn_idx上的不一致造成DELETE语句无法有效删除该表上的记录:文档<LOCK ON SYS.SMON_SCN_TIME [ID 747745.1]>说明了该问题，否则我们没有必要手动清除SMON_SCN_TIME的数据。
具体的操作步骤如下所示：

--set oracle trace event 12500
SQL> alter system set events ‘12500 trace name context forever,level 10’;
--delete records in smon_scn_time
SQL> delete from smon_scn_time;
SQL> alter system set events ‘12500 trace name context forever off’;
SQL> commit;
SQL> shutdown immediate;
SQL> startup;

【故障】EXP-00056: ORACLE error 1466 encountered

猜你喜欢