作者: weiyinghua 原文来源: https://tidb.net/blog/4c49ac0d
一、Pseudo 统计信息总体生成规则
TiDB 在表无统计信息时,不会进行动态采样,而是用静态的、预设规则以及经验假设来生成计划。用函数 PseudoTable
创建一个伪统计表对象,通过默认 1万行,并填充列和索引 pseudo 统计信息,提供一个粗略的估算依据。
二、 使用 pseudo 统计信息原因
- 表或索引刚创建还没收集完统计信息,或数据刚导入未收集统计信息;
- 修改过的行数 / 表总行数比值超过
pseudo-estimate-ratio
默认值 0.8 时会认为统计信息过期而用 pseudo 统计信息; - 在 TiDB Server 启动时,由于统计信息加载未完成,可能会使用 pseudo 统计信息。
三、Pseudo 统计信息生成过程
PseudoTable
函数创建一个伪表统计信息对象,用于无统计信息情况下生成统计信息对象。
初始化 pseudo 统计信息
- 创建一个
HistColl
对象,设置其属性,如行数RealtimeCount
、物理IDPhysicalID
、是否允许触发加载CanNotTriggerLoad
等。 - 初始化列和索引的映射表
columns
和indices
。
创建统计信息表对象
- 创建一个
Table
对象,并将其HistColl
属性设置为前面创建的HistColl
对象。 - 初始化列和索引的存在映射表
ColAndIdxExistenceMap
。
生成列信息
-
遍历表的列信息
tblInfo.Columns
。 -
对于状态为
StatePublic
且不是隐藏列的列:- 在存在映射表中插入列信息。
- 如果
allowFillHistMeta
为true
,则为该列创建一个Column
对象,并填充直方图元数据。
生成索引信息
-
遍历表的索引信息
tblInfo.Indices
。 -
对于状态为
StatePublic
的索引:- 在存在映射表中插入索引信息。
- 如果
allowFillHistMeta
为true
,则为该索引创建一个Index
对象,并填充直方图元数据。
返回统计信息表对象
最后返回创建好的 Table
对象。
源码地址
https://github.com/pingcap/tidb/blob/426ce3e57069afbd8f061d7ae39c79d3f9e2ff5d/pkg/statistics/table.go#L1004
四、Pseudo 估算规则示例
示例表结构如下,并删除统计信息:
CREATE TABLE t1 (id bigint not null auto_random primary key,k char(64),v varchar(255),update_time datetime,key idx_k_update_time(k, update_time),key idx_update_time(update_time)
);
drop stats t1;
索引等值查询
普通索引等值查询,基于经验的假设,固定估算为10行,源码中常量 pseudoEqualRate 控制:
mysql> explain select count(1) from t1 where k='tom';
+-----------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+
| id | estRows | task | access object | operator info |
+-----------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+
| StreamAgg_17 | 1.00 | root | | funcs:count(Column#10)->Column#5 |
| └─IndexReader_18 | 1.00 | root | | index:StreamAgg_9 |
| └─StreamAgg_9 | 1.00 | cop[tikv] | | funcs:count(1)->Column#10 |
| └─IndexRangeScan_16 | 10.00 | cop[tikv] | table:t1, index:idx_k_update_time(k, update_time) | range:["tom","tom"], keep order:false, stats:pseudo |
+-----------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+
如果是组合索引,第一个字段固定估算为全表的 1/1000 ,第二个字段固定估算为 1/100:
mysql> explain select count(1) from t1 where k='tom' and update_time='2024-11-25';
+---------------------------+---------+-----------+---------------------------------------------------+---------------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+---------------------------+---------+-----------+---------------------------------------------------+---------------------------------------------------------------------------------------------+
| StreamAgg_10 | 1.00 | root | | funcs:count(1)->Column#5 |
| └─IndexReader_15 | 0.10 | root | | index:IndexRangeScan_14 |
| └─IndexRangeScan_14 | 0.10 | cop[tikv] | table:t1, index:idx_k_update_time(k, update_time) | range:["tom" 2024-11-25 00:00:00,"tom" 2024-11-25 00:00:00], keep order:false, stats:pseudo |
+---------------------------+---------+-----------+---------------------------------------------------+---------------------------------------------------------------------------------------------+
索引大于或小于
对于大于或小于查询,固定估算为1万行的三分之一,由源码中常量 pseudoLessRate 控制:
mysql> explain select count(1) from t1 where update_time < '2024-11-25';
+-----------------------------+---------+-----------+----------------------------------------------+------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+-----------------------------+---------+-----------+----------------------------------------------+------------------------------------------------------------------+
| HashAgg_12 | 1.00 | root | | funcs:count(Column#6)->Column#5 |
| └─IndexReader_13 | 1.00 | root | | index:HashAgg_6 |
| └─HashAgg_6 | 1.00 | cop[tikv] | | funcs:count(1)->Column#6 |
| └─IndexRangeScan_11 | 3323.33 | cop[tikv] | table:t1, index:idx_update_time(update_time) | range:[-inf,2024-11-25 00:00:00), keep order:false, stats:pseudo |
+-----------------------------+---------+-----------+----------------------------------------------+------------------------------------------------------------------+
如果是组合索引,第一个字段估算为全表的 1/100,第二个字段再估算1/3,即:10000 * 1/100 * 1/3:
mysql> explain select count(1) from t1 where k='tom' and update_time>'2024-11-25';
+-----------------------------+---------+-----------+---------------------------------------------------+------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+-----------------------------+---------+-----------+---------------------------------------------------+------------------------------------------------------------------------------+
| StreamAgg_17 | 1.00 | root | | funcs:count(Column#7)->Column#5 |
| └─IndexReader_18 | 1.00 | root | | index:StreamAgg_9 |
| └─StreamAgg_9 | 1.00 | cop[tikv] | | funcs:count(1)->Column#7 |
| └─IndexRangeScan_16 | 33.33 | cop[tikv] | table:t1, index:idx_k_update_time(k, update_time) | range:("tom" 2024-11-25 00:00:00,"tom" +inf], keep order:false, stats:pseudo |
+-----------------------------+---------+-----------+---------------------------------------------------+------------------------------------------------------------------------------+
索引范围查询
对于 between 写法,固定估算为1万行的 1/40,由源码中常量 pseudoBetweenRate 控制:
mysql> explain select count(1) from t1 where update_time BETWEEN '2024-11-25' and '2024-11-26';
+-----------------------------+---------+-----------+----------------------------------------------+---------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+-----------------------------+---------+-----------+----------------------------------------------+---------------------------------------------------------------------------------+
| StreamAgg_17 | 1.00 | root | | funcs:count(Column#7)->Column#5 |
| └─IndexReader_18 | 1.00 | root | | index:StreamAgg_9 |
| └─StreamAgg_9 | 1.00 | cop[tikv] | | funcs:count(1)->Column#7 |
| └─IndexRangeScan_16 | 250.00 | cop[tikv] | table:t1, index:idx_update_time(update_time) | range:[2024-11-25 00:00:00,2024-11-26 00:00:00], keep order:false, stats:pseudo |
+-----------------------------+---------+-----------+----------------------------------------------+---------------------------------------------------------------------------------+
普通列
对于非索引的普通列,固定估算全表扫描1万行,并预估过滤10行:
mysql> explain select count(1) from t1 where v = 'happy';
+------------------------------+----------+-----------+---------------+---------------------------------+
| id | estRows | task | access object | operator info |
+------------------------------+----------+-----------+---------------+---------------------------------+
| StreamAgg_20 | 1.00 | root | | funcs:count(Column#7)->Column#5 |
| └─TableReader_21 | 1.00 | root | | data:StreamAgg_9 |
| └─StreamAgg_9 | 1.00 | cop[tikv] | | funcs:count(1)->Column#7 |
| └─Selection_19 | 10.00 | cop[tikv] | | eq(test.t1.v, "happy") |
| └─TableFullScan_18 | 10000.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo |
+------------------------------+----------+-----------+---------------+---------------------------------+
唯一键
对于唯一键,固定返回1行:
mysql> explain select count(1) from t1 where id = 100;
+--------------------+---------+------+---------------+--------------------------+
| id | estRows | task | access object | operator info |
+--------------------+---------+------+---------------+--------------------------+
| StreamAgg_9 | 1.00 | root | | funcs:count(1)->Column#5 |
| └─Point_Get_11 | 1.00 | root | table:t1 | handle:100 |
+--------------------+---------+------+---------------+--------------------------+
其它情形
对于索引第一个字段 k 估算10行,对于索引第二个字符型字段 v 传入数字进行范围查找,在执行计划中出现了 cast(test.t1.v, double BINARY) 表示有隐式类型转换,不属于任何一种预设规则,在10行基础上乘以 0.8 估算得到8行,源码中常量 selectionFactor 控制:
mysql> explain select count(1) from t1 where k='tom' and v >10;
+----------------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+
| id | estRows | task | access object | operator info |
+----------------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+
| StreamAgg_10 | 1.00 | root | | funcs:count(1)->Column#5 |
| └─IndexLookUp_37 | 8.00 | root | | |
| ├─IndexRangeScan_34(Build) | 10.00 | cop[tikv] | table:t1, index:idx_k_update_time(k, update_time) | range:["tom","tom"], keep order:false, stats:pseudo |
| └─Selection_36(Probe) | 8.00 | cop[tikv] | | gt(cast(test.t1.v, double BINARY), 10) |
| └─TableRowIDScan_35 | 10.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo |
+----------------------------------+---------+-----------+---------------------------------------------------+-----------------------------------------------------+
源码地址
https://github.com/pingcap/tidb/blob/426ce3e57069afbd8f061d7ae39c79d3f9e2ff5d/pkg/planner/cardinality/pseudo.go
五、总结
与 MySQL 8.0 类似,TiDB 7.1 在表缺乏统计信息时,并不会像 Oracle 那样动态采样生成执行计划。总的来说,无统计信息时 TiDB 生成的执行计划基于假设数据规模和经验规则,可能存在较大的误差。TiDB 8.2 将引入 统计信息并发加载功能 ,加载统计信息效率将大幅度提升,有效减少使用 Pseudo 统计信息的可能。