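InnoDB Record Format

I. Preface

This article is based on Chapter 9 of 《MYSQL运维内参》. The source code is complex and I did not fully understand it on the first read, so I am writing up my notes first.

Earlier articles covered the index and page formats and how a physical record is located. This one covers the storage format of the physical record itself.

Running show table status like 'XX' shows the row storage format a table currently uses, in the row_format column.

InnoDB offers Compact and Redundant as its two basic formats for storing row data (Dynamic and Compressed, added later, build on Compact). Compact was introduced in MySQL 5.0.3 and was the default for many releases; Redundant is kept for compatibility with older versions. Because Redundant is gradually being phased out, this article assumes Compact throughout. It is a compact, space-saving format: with B+ tree storage, the more records fit in a page, the more records a single I/O can process, and the better the performance.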

II. Format

|---------------------extra_size-----------------------------------------|---------fields_data------------|

|--columns_lens---|---null lens----|------fixed_extrasize(5)------|--col1---|---col2---|---col3----|

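Variable-length field length list: one entry per non-NULL variable-length column, stored in reverse column order. If the column's maximum length is at most 255 bytes, the actual length takes 1 byte. If the maximum length exceeds 255 bytes, the actual length takes 1 byte when it is below 128 and 2 bytes otherwise.

NULL flags: one bit per nullable column, marking whether that column is NULL in this row. The flags occupy ceil(n_nullable / 8) bytes, which is 1 byte in the example below.

Record header: a fixed 5 bytes. The meaning of each bit is given in the table below: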

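Name	Size (bits)	Description
()	1	Unused
()	1	Unused
delete_flag	1	Whether the row has been deleted
min_rec_flag	1	Set to 1 if the record is predefined as the minimum record
n_owned	4	Number of records owned by the current slot; see the earlier article
heap_no	13	Sequence number of the record in the page's record heap
record_type	3	Record type: 000 ordinary, 001 B+ tree node pointer, 010 infimum (minimum), 011 supremum (maximum), 1xx reserved
next_record	16	Relative position of the next record in the page; links all records in the page into a singly linked list. Freed record space is also reused through this pointer.
total	40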
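
To make the bit layout concrete, here is a small Python sketch of my own (it is not from the book or from the MySQL source) that splits the 5 header bytes into the fields of the table. The function name decode_compact_header is made up for illustration, and I am assuming next_record is read as a signed big-endian 16-bit value. The two sample headers are copied from the hexdump analysed later in this article.

```python
import struct


def decode_compact_header(hdr: bytes) -> dict:
    """Split the 5-byte Compact record header into the fields of the table above."""
    if len(hdr) != 5:
        raise ValueError("the Compact record header is exactly 5 bytes")
    # 1 byte: flags + n_owned, 2 bytes: heap_no + record_type, 2 bytes: next_record
    flags_owned, heap_type, next_record = struct.unpack(">BHh", hdr)
    return {
        "delete_flag": bool(flags_owned & 0x20),
        "min_rec_flag": bool(flags_owned & 0x10),
        "n_owned": flags_owned & 0x0F,
        "heap_no": heap_type >> 3,        # upper 13 bits
        "record_type": heap_type & 0x07,  # lower 3 bits
        "next_record": next_record,       # assumed signed, relative offset
    }


row1 = decode_compact_header(bytes.fromhex("000010002d"))
row3 = decode_compact_header(bytes.fromhex("000020ff96"))
print(row1["heap_no"], row1["next_record"])  # 2 45
print(row3["heap_no"], row3["next_record"])  # 4 -106
```

The heap numbers 2 and 4 fit the usual picture in which infimum and supremum take heap numbers 0 and 1 and user records follow.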

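Within each column's stored data, NULL occupies no space in this part. There are also two hidden columns, the transaction ID and the roll pointer, of 6 and 7 bytes respectively. If the InnoDB table defines no primary key, every row also gets a 6-byte rowid column.

These row formats normally only become visible by inspecting the .ibd file, for example with the hexdump command after changing into the data directory, e.g. cd /home/soft/mysql/data/test/

The dump can then be viewed in vim; it is in hexadecimal. I do not have access to a test database, so I cannot post my own screenshots and can only rely on what others have posted online. To make things easier to follow, here is an example found online:

The following code is from: https://www.cnblogs.com/abclife/p/5121677.html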

mysql> create table yb1(
    -> t1 varchar(10),
    -> t2 varchar(10),
    -> t3 char(10),
    -> t4 varchar(10)
    -> ) row_format=compact;
Query OK, 0 rows affected (0.01 sec)
 
mysql> insert into yb1 values('a','bb','bb','ccc');
Query OK, 1 row affected (0.00 sec)
 
mysql> insert into yb1 values('d','ee','ee','fff');
Query OK, 1 row affected (0.00 sec)
 
mysql> insert into yb1 values('d',NULL,NULL,'fff');
Query OK, 1 row affected (0.01 sec)
 
mysql> select * from yb1\G
*************************** 1. row ***************************
t1: a
t2: bb
t3: bb
t4: ccc
*************************** 2. row ***************************
t1: d
t2: ee
t3: ee
t4: fff
*************************** 3. row ***************************
t1: d
t2: NULL
t3: NULL
t4: fff
3 rows in set (0.00 sec)

 hexdump -C -v yb1.ibd  > yb1.txt
Then analyze the output:
0000c070  73 75 70 72 65 6d 75 6d  03 0a 02 01 00 00 00 10  |supremum........|
0000c080  00 2d 00 00 01 fa dd d5  00 00 00 00 39 ea a3 00  |.-..........9...|
0000c090  00 01 e8 01 10 61 62 62  62 62 20 20 20 20 20 20  |.....abbbb      |
0000c0a0  20 20 63 63 63 03 0a 02  01 00 00 00 18 00 2b 00  |  ccc.........+.|
0000c0b0  00 01 fa dd d6 00 00 00  00 39 eb a4 00 00 01 e9  |.........9......|
0000c0c0  01 10 64 65 65 65 65 20  20 20 20 20 20 20 20 66  |..deeee        f|
0000c0d0  66 66 03 01 06 00 00 20  ff 96 00 00 01 fa dd d7  |ff..... ........|
0000c0e0  00 00 00 00 39 f0 a7 00  00 01 ea 01 10 64 66 66  |....9........dff|
0000c0f0  66 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |f...............|
0000c100  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0000c110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0000c120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
 
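The first row's record starts at 0000c078:
03 0a 02 01 /* variable-length field length list, in reverse order (03 is the length of t4's value ccc, 0a the length of t3, 02 the length of t2's value bb, 01 the length of t1's value a) */
00 /* NULL flags; the first row has no NULL values */
00 00 10 00 2d /* record header, fixed 5 bytes */
00 00 01 fa dd d5 /* rowid, created automatically by InnoDB, 6 bytes */
00 00 00 00 39 ea /* transaction ID */
a3 00 00 01 e8 01 10 /* roll pointer */
61 /* column 1 data 'a' */
62 62 /* column 2 data 'bb' */
62 62 20 20 20 20 20 20 20 20  /* column 3 data 'bb' (unused space in the fixed-length column is padded with 0x20) */
63 63 63 /* column 4 data 'ccc' */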
 
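The second row's record starts at 0000c0a5:
03 0a 02 01 /* variable-length field length list, in reverse order (03 is the length of t4's value fff, 0a the length of t3, 02 the length of t2's value ee, 01 the length of t1's value d) */
00 /* NULL flags; the second row has no NULL values */
00 00 18 00 2b /* record header, fixed 5 bytes */
00 00 01 fa dd d6 /* rowid, created automatically by InnoDB, 6 bytes */
00 00 00 00 39 eb /* transaction ID */
a4 00 00 01 e9 01 10 /* roll pointer */
64 /* column 1 data 'd' */
65 65 /* column 2 data 'ee' */
65 65 20 20 20 20 20 20 20 20  /* column 3 data 'ee' (unused space in the fixed-length column is padded with 0x20) */
66 66 66 /* column 4 data 'fff' */
The last two bytes of the record header give the offset of the next record; in this row it is 0x2b (43).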
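
The offsets can be checked with a bit of arithmetic. The sketch below is my own Python illustration, under two assumptions: next_record is relative to the record origin (the first byte after the 5-byte header), and a value with the high bit set is negative. The origins 0xC082, 0xC0AF and 0xC0DA are worked out from the hexdump above by adding each record's extra size (10, 10 and 8 bytes) to its start address; the names to_signed16 and walk are hypothetical.

```python
def to_signed16(value: int) -> int:
    """Interpret a raw 16-bit next_record field as a signed offset."""
    return value - 0x10000 if value & 0x8000 else value


# record origin (file offset) -> raw next_record field, read from the hexdump
records = {
    0xC082: 0x002D,  # row 1
    0xC0AF: 0x002B,  # row 2
    0xC0DA: 0xFF96,  # row 3
}


def walk(start: int) -> list:
    """Follow next_record from one origin until we leave the known user records."""
    chain = [start]
    while chain[-1] in records:
        chain.append(chain[-1] + to_signed16(records[chain[-1]]))
    return chain


chain = walk(0xC082)
print([hex(origin) for origin in chain])
# ['0xc082', '0xc0af', '0xc0da', '0xc070']
```

The last hop lands on 0xc070, exactly where the string "supremum" begins in the dump, so the third row is the last user record in the list.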
 
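The third row's record starts at 0000c0d2:
03 01 /* variable-length field length list, in reverse order (03 is the length of t4's value fff, 01 the length of t1's value d) */
06 /* NULL flags; the third row has NULL values. 06 in binary is 00000110, meaning columns 2 and 3 are NULL */
00 00 20 ff 96 /* record header, fixed 5 bytes */
00 00 01 fa dd d7 /* rowid, created automatically by InnoDB, 6 bytes */
00 00 00 00 39 f0 /* transaction ID */
a7 00 00 01 ea 01 10 /* roll pointer */
64 /* column 1 data 'd' */
66 66 66 /* column 4 data 'fff' */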
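
The manual breakdown above can be reproduced mechanically. Below is a minimal Python sketch of my own that parses a record of this particular table, reading the extra part backwards from the record origin and the data forwards. It hard-codes what the dump shows for yb1: four nullable columns, each with a one-byte length entry (t3 is CHAR(10) yet has a length entry here, presumably because the table uses a multi-byte character set), and a rowid because there is no primary key. The function name parse_yb1_record is hypothetical.

```python
def parse_yb1_record(buf: bytes, origin: int) -> dict:
    """Parse one yb1 record; buf[origin] is the first byte of the hidden columns."""
    n_cols = 4                    # t1..t4, all nullable
    null_bits = buf[origin - 6]   # 1 byte of NULL flags, just before the 5-byte header
    lens_pos = origin - 7         # the length list starts just before the NULL flags
    pos = origin + 6 + 6 + 7      # skip rowid, transaction ID and roll pointer
    row = {}
    for i in range(n_cols):
        name = f"t{i + 1}"
        if null_bits & (1 << i):
            row[name] = None      # NULL: no length entry and no data bytes
            continue
        length = buf[lens_pos]    # one-byte lengths are enough for these short values
        lens_pos -= 1             # the list runs backwards, away from the header
        row[name] = buf[pos:pos + length].decode("ascii")
        pos += length
    return row


row3 = bytes.fromhex(
    "0301" "06" "000020ff96"                         # lengths, NULL flags, header
    "000001faddd7" "0000000039f0" "a7000001ea0110"   # rowid, trx id, roll pointer
    "64" "666666"                                    # t1, t4
)
print(parse_yb1_record(row3, 8))
# {'t1': 'd', 't2': None, 't3': None, 't4': 'fff'}
```

Passing the origin rather than the start of the record is deliberate: the size of the extra part depends on how many columns are NULL, so it can only be read backwards from the origin.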

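III. Understanding the implementation from the source code

The section above traced InnoDB's record format through the most common operation, an insert. Below, the insert path is analyzed together with the source code. In MySQL a row exists in three formats:

Server-layer format: independent of the storage engine; it is also the format the binlog uses in row mode.
Index tuple format: an intermediate record format InnoDB uses while reading and writing records. It is the in-memory data structure in which InnoDB holds all column data. Within one table, different indexes have different tuples; the tuple format corresponds one-to-one with the index.
Physical storage format: the format in which a record is stored in a physical page, i.e. the Compact format this article focuses on. It corresponds one-to-one with the index tuple format. Every time data is saved, it is first converted from the server-layer format to the index tuple format, and then from the index tuple format to the physical storage format on the page.

On insert, what the system receives is the common MySQL record format, record. At this point no storage engine is involved, so the record format is the same regardless of which storage engine backs the table. For inserts the MySQL-level function is ha_write_row; for InnoDB the function actually called is ha_innobase::write_row. There InnoDB first converts the received record into its own tuple (the index tuple format). This is InnoDB's representation of the record: an in-memory, logical record. Until the system actually writes it to a page, the record exists as this tuple. The rest of this section studies, from the source code, how InnoDB converts a tuple into its physical storage record, focusing on the code logic and the record format. The book concentrates on the format conversion here and does not cover the details of the insert, so I put together a diagram for reference:

[Figure: call flow of an insert]

As the diagram shows, the insert process is fairly complex. The key steps are starting the transaction, positioning the cursor, writing the undo log, page_cur_tuple_insert (which covers both the format conversion and the actual insert), and writing the binlog. The commit flow that follows is not drawn. Many of these functions are themselves complicated.

The function that inserts a tuple (one record) into a page is page_cur_tuple_insert; its argument is a tuple of type dtuple_t*. It first has to allocate space for the physical record that the tuple will be converted into, so the size must be computed first, as follows:

1. Every record includes the following two parts: REC_N_NEW_EXTRA_BYTES + UT_BITS_IN_BYTES(n_null). The first is the fixed-length extra part of this format (what it stores is described later). The second records which fields are NULL. Only fields with the nullable attribute are stored here; a column declared NOT NULL at table creation has no entry. One bit represents the NULL state of one field. In the code this part is accumulated in the variable extra_size.

2. Add up the length of the data in each column. There are several cases, mainly fixed-length versus variable-length fields. For a fixed-length field the length is simply that of the data type, e.g. 4 bytes for int and 6 bytes for the rowid column, with no additional overhead. For a variable-length field, besides the data itself, the space that stores its length must be counted. If the actual data length is below 128 bytes, or the field's defined maximum length is at most 255 bytes and its type is not BLOB, the length is stored in 1 byte. In all other cases, including externally stored fields, it is stored in 2 bytes. InnoDB also counts the space for these length entries as part of extra_size.

So an InnoDB record consists of two parts: extra_size and the data content. Their combined length is the result computed above, referred to here as record_size.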
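
The two steps can be written down in a few lines. This is a simplified Python sketch of my own, not a port of the real rec_get_converted_size_comp: it ignores externally stored and virtual columns, and the tuple layout (data_len, fixed_len, max_len, is_blob) is something I made up for the illustration. The max_len of 30 for the yb1 columns is also an assumption; any value up to 255 gives the same result.

```python
REC_N_NEW_EXTRA_BYTES = 5  # fixed Compact header size


def bits_in_bytes(n_bits):
    """Bytes needed for n_bits flags (the idea behind UT_BITS_IN_BYTES)."""
    return (n_bits + 7) // 8


def converted_size(fields, n_nullable):
    """Return (extra_size, data_size) for one record.

    fields: list of (data_len, fixed_len, max_len, is_blob); data_len is None for NULL.
    """
    extra = REC_N_NEW_EXTRA_BYTES + bits_in_bytes(n_nullable)   # step 1
    data = 0
    for data_len, fixed_len, max_len, is_blob in fields:        # step 2
        if data_len is None:
            continue              # NULL costs only its bit in the NULL flags
        if not fixed_len:
            big = max_len > 255 or is_blob
            extra += 2 if (big and data_len >= 128) else 1
        data += data_len
    return extra, data


HIDDEN = [(6, 6, 6, False), (6, 6, 6, False), (7, 7, 7, False)]  # rowid, trx id, roll ptr


def var(n):
    return (n, 0, 30, False)      # variable-length column holding n bytes


row1 = HIDDEN + [var(1), var(2), var(10), var(3)]
row3 = HIDDEN + [var(1), (None, 0, 30, False), (None, 0, 30, False), var(3)]
print(converted_size(row1, 4))    # (10, 35)
print(converted_size(row3, 4))    # (8, 23)
```

For the first row of the earlier example this gives 10 + 35 = 45 bytes, and for the third row 8 + 23 = 31 bytes, which matches the number of bytes each record occupies in the hexdump.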

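Next, space is allocated and the tuple is converted into a record.

The conversion function is rec_convert_dtuple_to_rec_new; its parameters are the allocated record buffer buf, the tuple, and the in-memory index structure.

It first performs rec = buf + extra_size; the variable rec points to where the data content starts. extra_size is the sum computed above: the fixed part and NULL flags from step 1 plus the length entries from step 2.

The real conversion is done by rec_convert_dtuple_to_rec_comp, which is called next. First, here are the entry point rec_convert_dtuple_to_rec and rec_convert_dtuple_to_rec_new: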
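
Everything in the extra part is then addressed backwards from rec. As a quick check of that pointer arithmetic, here is a tiny Python sketch of my own; it mirrors the two expressions nulls = rec - (REC_N_NEW_EXTRA_BYTES + 1) and lens = nulls - UT_BITS_IN_BYTES(n_null) that appear in the rec_convert_dtuple_to_rec_comp listing further down, using plain integers in place of pointers. The function name extra_positions is hypothetical.

```python
REC_N_NEW_EXTRA_BYTES = 5  # fixed Compact header size


def extra_positions(rec: int, n_nullable: int):
    """Return (nulls, lens): where the NULL flags and the length list begin.

    Both areas grow towards lower addresses from these positions.
    """
    nulls = rec - (REC_N_NEW_EXTRA_BYTES + 1)
    lens = nulls - (n_nullable + 7) // 8
    return nulls, lens


# Row 1 of the example: origin at file offset 0xC082, four nullable columns.
nulls, lens = extra_positions(0xC082, 4)
print(hex(nulls), hex(lens))  # 0xc07c 0xc07b
```

In the hexdump, 0xc07c holds 00 (the NULL flags of row 1) and 0xc07b holds 01 (the length of t1), which is where the length list starts before running backwards to 02, 0a and 03.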

/*********************************************************//**
Builds a physical record out of a data tuple and
stores it beginning from the start of the given buffer.
@return pointer to the origin of physical record */
rec_t*
rec_convert_dtuple_to_rec(
/*======================*/
	byte*			buf,	/*!< in: start address of the
					physical record */
	const dict_index_t*	index,	/*!< in: record descriptor */
	const dtuple_t*		dtuple,	/*!< in: data tuple */
	ulint			n_ext)	/*!< in: number of
					externally stored columns */
{
	rec_t*	rec;

	ut_ad(buf != NULL);
	ut_ad(index != NULL);
	ut_ad(dtuple != NULL);
	ut_ad(dtuple_validate(dtuple));
	ut_ad(dtuple_check_typed(dtuple));

	if (dict_table_is_comp(index->table)) {
		rec = rec_convert_dtuple_to_rec_new(buf, index, dtuple);
	} else {
		rec = rec_convert_dtuple_to_rec_old(buf, dtuple, n_ext);
	}

#ifdef UNIV_DEBUG
	{
		mem_heap_t*	heap	= NULL;
		ulint		offsets_[REC_OFFS_NORMAL_SIZE];
		const ulint*	offsets;
		ulint		i;
		rec_offs_init(offsets_);

		offsets = rec_get_offsets(rec, index,
					  offsets_, ULINT_UNDEFINED, &heap);
		ut_ad(rec_validate(rec, offsets));
		ut_ad(dtuple_get_n_fields(dtuple)
		      == rec_offs_n_fields(offsets));

		for (i = 0; i < rec_offs_n_fields(offsets); i++) {
			ut_ad(!dfield_is_ext(dtuple_get_nth_field(dtuple, i))
			      == !rec_offs_nth_extern(offsets, i));
		}

		if (UNIV_LIKELY_NULL(heap)) {
			mem_heap_free(heap);
		}
	}
#endif /* UNIV_DEBUG */
	return(rec);
}

/*********************************************************//**
Builds a new-style physical record out of a data tuple and
stores it beginning from the start of the given buffer.
@return pointer to the origin of physical record */
static
rec_t*
rec_convert_dtuple_to_rec_new(
/*==========================*/
	byte*			buf,	/*!< in: start address of
					the physical record */
	const dict_index_t*	index,	/*!< in: record descriptor */
	const dtuple_t*		dtuple)	/*!< in: data tuple */
{
	ulint	extra_size;
	ulint	status;
	rec_t*	rec;
     
	status = dtuple_get_info_bits(dtuple) & REC_NEW_STATUS_MASK;
	/* Compute the size of the extra part: extra_size */
	rec_get_converted_size_comp(
		index, status, dtuple->fields, dtuple->n_fields, &extra_size);
	/* buf is the start of the whole record; buf + extra_size is where the first column is stored */
	rec = buf + extra_size;
	/* Convert the dtuple into the Compact format */
	rec_convert_dtuple_to_rec_comp(
		rec, index, dtuple->fields, dtuple->n_fields, NULL,
		status, false);

	/* Set the info bits of the record */
	rec_set_info_and_status_bits(rec, dtuple_get_info_bits(dtuple));

	return(rec);
}

The function that actually performs the conversion is rec_convert_dtuple_to_rec_comp:

/*********************************************************//**
Builds a ROW_FORMAT=COMPACT record out of a data tuple. */
UNIV_INLINE
void
rec_convert_dtuple_to_rec_comp(
/*===========================*/
	rec_t*			rec,	/*!< in: origin of record */
	const dict_index_t*	index,	/*!< in: record descriptor */
	const dfield_t*		fields,	/*!< in: array of data fields */
	ulint			n_fields,/*!< in: number of data fields */
	const dtuple_t*		v_entry,/*!< in: dtuple contains
					virtual column data */
	ulint			status,	/*!< in: status bits of the record */
	bool			temp)	/*!< in: whether to use the
					format for temporary files in
					index creation */
{
	const dfield_t*	field;
	const dtype_t*	type;
	byte*		end;
	byte*		nulls;
	byte*		lens;
	ulint		len;
	ulint		i;
	ulint		n_node_ptr_field;
	ulint		fixed_len;
	ulint		null_mask	= 1;
	ulint		n_null;
	ulint		num_v = v_entry ? dtuple_get_n_v_fields(v_entry) : 0;

	ut_ad(temp || dict_table_is_comp(index->table));

	if (temp) {
		ut_ad(status == REC_STATUS_ORDINARY);
		ut_ad(n_fields <= dict_index_get_n_fields(index));
		n_node_ptr_field = ULINT_UNDEFINED;
		nulls = rec - 1;
		if (dict_table_is_comp(index->table)) {
			/* No need to do adjust fixed_len=0. We only
			need to adjust it for ROW_FORMAT=REDUNDANT. */
			temp = false;
		}
	} else {
		ut_ad(v_entry == NULL);
		ut_ad(num_v == 0);
		/* Locate the NULL flags; REC_N_NEW_EXTRA_BYTES is a fixed 5 bytes */
		nulls = rec - (REC_N_NEW_EXTRA_BYTES + 1);

		switch (UNIV_EXPECT(status, REC_STATUS_ORDINARY)) {
		case REC_STATUS_ORDINARY:
			ut_ad(n_fields <= dict_index_get_n_fields(index));
			n_node_ptr_field = ULINT_UNDEFINED;
			break;
		case REC_STATUS_NODE_PTR:
			ut_ad(n_fields
			      == dict_index_get_n_unique_in_tree_nonleaf(index)
				 + 1);
			n_node_ptr_field = n_fields - 1;
			break;
		case REC_STATUS_INFIMUM:
		case REC_STATUS_SUPREMUM:
			ut_ad(n_fields == 1);
			n_node_ptr_field = ULINT_UNDEFINED;
			break;
		default:
			ut_error;
			return;
		}
	}

	end = rec;

	if (n_fields != 0) {
		n_null = index->n_nullable;
		/* Locate the variable-length list, derived from the nulls position */
		lens = nulls - UT_BITS_IN_BYTES(n_null);
		/* clear the SQL-null flags */
		/* lens + 1 is the lowest byte of the NULL flags; nulls - lens is their length in bytes */
		memset(lens + 1, 0, nulls - lens);
	}

	/* Store the data and the offsets */
	/* Store each column's data, plus its NULL flag and length, at the corresponding position */
	for (i = 0; i < n_fields; i++) {
		const dict_field_t*	ifield;
		dict_col_t*		col = NULL;

		field = &fields[i];

		type = dfield_get_type(field);
		len = dfield_get_len(field);
		/* Node pointer: stored in a fixed 4 bytes (REC_NODE_PTR_SIZE) */
		if (UNIV_UNLIKELY(i == n_node_ptr_field)) {
			ut_ad(dtype_get_prtype(type) & DATA_NOT_NULL);
			ut_ad(len == REC_NODE_PTR_SIZE);
			memcpy(end, dfield_get_data(field), len);
			end += REC_NODE_PTR_SIZE;
			break;
		}
		/* Record the NULL information */
		if (!(dtype_get_prtype(type) & DATA_NOT_NULL)) {
			/* nullable field */
			ut_ad(n_null--);

			if (UNIV_UNLIKELY(!(byte) null_mask)) {
				nulls--;
				null_mask = 1;
			}

			ut_ad(*nulls < null_mask);

			/* set the null flag if necessary */
			if (dfield_is_null(field)) {
				*nulls |= null_mask;
				null_mask <<= 1;
				continue;
			}

			null_mask <<= 1;
		}
		/* only nullable fields can be null */
		ut_ad(!dfield_is_null(field));

		ifield = dict_index_get_nth_field(index, i);
		fixed_len = ifield->fixed_len;
		col = ifield->col;
		if (temp && fixed_len
		    && !dict_col_get_fixed_size(col, temp)) {
			fixed_len = 0;
		}
		/* Choose the number of length bytes according to the column size */
		/* If the maximum length of a variable-length field
		is up to 255 bytes, the actual length is always stored
		in one byte. If the maximum length is more than 255
		bytes, the actual length is stored in one byte for
		0..127.  The length will be encoded in two bytes when
		it is 128 or more, or when the field is stored externally. */
		if (fixed_len) {
			/* Fixed length: no length entry is written */
#ifdef UNIV_DEBUG
			ulint	mbminlen = DATA_MBMINLEN(col->mbminmaxlen);
			ulint	mbmaxlen = DATA_MBMAXLEN(col->mbminmaxlen);

			ut_ad(len <= fixed_len);
			ut_ad(!mbmaxlen || len >= mbminlen
			      * (fixed_len / mbmaxlen));
			ut_ad(!dfield_is_ext(field));
#endif /* UNIV_DEBUG */
		} else if (dfield_is_ext(field)) {
			/* Stored off-page: 2-byte length */
			ut_ad(DATA_BIG_COL(col));
			ut_ad(len <= REC_ANTELOPE_MAX_INDEX_COL_LEN
			      + BTR_EXTERN_FIELD_REF_SIZE);
			*lens-- = (byte) (len >> 8) | 0xc0;
			*lens-- = (byte) len;
		} else {
			/* DATA_POINT would have a fixed_len */
			ut_ad(dtype_get_mtype(type) != DATA_POINT);
			ut_ad(len <= dtype_get_len(type)
			      || DATA_LARGE_MTYPE(dtype_get_mtype(type))
			      || !strcmp(index->name,
					 FTS_INDEX_TABLE_IND_NAME));
			/* Actual length < 128, or maximum length <= 255 and not a BLOB:
			the length is stored in 1 byte */
			if (len < 128 || !DATA_BIG_LEN_MTYPE(
				dtype_get_len(type), dtype_get_mtype(type))) {

				*lens-- = (byte) len;
			} else {
				/* Otherwise: 2-byte length */
				ut_ad(len < 16384);
				*lens-- = (byte) (len >> 8) | 0x80;
				*lens-- = (byte) len;
			}
		}
		/* Copy the tuple's column data into the Compact record; len is its length */
		memcpy(end, dfield_get_data(field), len);
		end += len;
	}

	if (!num_v) {
		return;
	}

	/* reserve 2 bytes for writing length */
	byte*	ptr = end;
	ptr += 2;

	/* Now log information on indexed virtual columns */
	for (ulint col_no = 0; col_no < num_v; col_no++) {
		dfield_t*       vfield;
		ulint		flen;

		const dict_v_col_t*     col
			= dict_table_get_nth_v_col(index->table, col_no);

		if (col->m_col.ord_part) {
			ulint   pos = col_no;

			pos += REC_MAX_N_FIELDS;

			ptr += mach_write_compressed(ptr, pos);

			vfield = dtuple_get_nth_v_field(
				v_entry, col->v_pos);

			flen = vfield->len;

			if (flen != UNIV_SQL_NULL) {
				/* The virtual column can only be in sec
				index, and index key length is bound by
				DICT_MAX_FIELD_LEN_BY_FORMAT */
				flen = ut_min(
					flen,
					static_cast<ulint>(
					DICT_MAX_FIELD_LEN_BY_FORMAT(
						index->table)));
			}

			ptr += mach_write_compressed(ptr, flen);

			if (flen != UNIV_SQL_NULL) {
				ut_memcpy(ptr, dfield_get_data(vfield), flen);
				ptr += flen;
			}
		}
	}

	mach_write_to_2(end, ptr - end);
}

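Parts of the code are commented. If it does not make sense at first, read it alongside the format diagram above. rec_convert_dtuple_to_rec_comp mainly uses rec to compute the positions of nulls and lens, then fills in the extra part from back to front; the earlier example helps here. Once memset(lens + 1, 0, nulls - lens); has cleared the NULL flags, initialization is done, and what follows fills in the record field by field.

First comes the NULL handling. It applies only to columns not declared NOT NULL, so the nulls area holds information for those fields only. For fixed-length data, the data is simply written into the record. The main work is for variable-length data. An externally stored field always uses two bytes for its length: (len >> 8) | 0xc0 is written first, at the higher address, and (byte) len, the low 8 bits, next, at the lower address. A field whose actual length is at least 128 and whose maximum length exceeds 255 bytes (or whose type is BLOB) is also stored in two bytes, with the flag 0x80 in place of 0xc0. Every other length fits in one byte.

After memcpy(end, dfield_get_data(field), len); the pointer end is advanced to mark the end of the data written so far. Fields are stored one after another in the order defined by the index. At this point the conversion of a record from logical to physical form is complete, and it also shows how InnoDB stores its physical records.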
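
To convince myself that I had read the loop correctly, I re-did it in Python. The sketch below is a heavily simplified imitation of rec_convert_dtuple_to_rec_comp and nothing more: it handles only ordinary leaf records, ignores externally stored and virtual columns, and takes the 5 header bytes as an argument (in the real code they are filled in separately). The dict keys data, nullable, fixed_len and big and the function name build_compact are my own inventions.

```python
def build_compact(fields, n_nullable, header=bytes(5)):
    """Return (record_bytes, origin_offset) for a list of field dicts."""
    null_bytes = bytearray((n_nullable + 7) // 8)
    lens = bytearray()   # in write order; reversed below because *lens-- moves backwards
    data = bytearray()
    bit = 0              # index of the next NULL flag
    for f in fields:
        value = f["data"]
        if f["nullable"]:
            if value is None:
                null_bytes[bit // 8] |= 1 << (bit % 8)
                bit += 1
                continue             # NULL: no length entry, no data
            bit += 1
        if not f["fixed_len"]:
            n = len(value)
            if n < 128 or not f["big"]:
                lens.append(n)
            else:
                lens.append((n >> 8) | 0x80)   # written first, ends up at the higher address
                lens.append(n & 0xFF)
        data += value
    # Memory order: length list, NULL flags, header, then the data from the origin on.
    extra = bytes(reversed(lens)) + bytes(reversed(null_bytes)) + header
    return extra + bytes(data), len(extra)


def hidden(hex_value):
    return {"data": bytes.fromhex(hex_value), "nullable": False, "fixed_len": True, "big": False}


def col(value):
    return {"data": value, "nullable": True, "fixed_len": False, "big": False}


row3_fields = [hidden("000001faddd7"), hidden("0000000039f0"), hidden("a7000001ea0110"),
               col(b"d"), col(None), col(None), col(b"fff")]
record, origin = build_compact(row3_fields, 4, bytes.fromhex("000020ff96"))
print(record[:origin].hex(), origin)  # 030106000020ff96 8
```

The extra part comes out as 03 01 06 00 00 20 ff 96, the same bytes the hexdump shows for the third row, which is at least consistent with the reading of the loop given above.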

*************************************************

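I honestly do not understand all of it; the source has to be read back and forth against the format. InnoDB is complicated.

This reminds me of something: when creating tables, our DBA insists that columns must not allow NULL. Besides the fact that lookups on such columns are harder to optimize, the row format gives another way to see it: nullable columns need extra storage, namely the NULL flag bits.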
