本文介绍云数据库 SelectDB 版中外表统计信息收集功能,包括使用方法和实现原理,为您进行查询性能优化提供参考。
外表统计信息的收集方式、内容基本与内表统计信息一致,更多信息请参见统计信息。
注意事项
目前支持收集Hive、Iceberg和Hudi等外部表的统计信息。
暂不支持收集如下外表统计信息:
直方图收集。
分区的增量收集和更新。
自动收集(with auto),您可以使用周期性收集(with period)来代替。
抽样收集。
使用方法
本文展示在SelectDB中,通过执行ANALYZE
命令收集外表统计信息的相关示例。
除了上文提到的外表暂不支持的4个功能外,其余和内表使用方式相同。下面以hive.tpch100
数据库为例进行展示,tpch100
数据库中包含lineitem、orders、region等8张表。
信息收集
外表支持手动收集和周期性收集两种方式。
手动收集
收集lineitem表的表信息以及全部列的信息。
ANALYZE TABLE hive.tpch100.lineitem; +--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+ | Catalog_Name | DB_Name | Table_Name | Columns | Job_Id | +--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+ | hive | default_cluster:tpch100 | lineitem | [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct] | 16990 | +--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+ 1 row in set (0.06 sec)
此操作是异步执行,会在后台创建收集任务,您可以通过Job_Id查看任务进度。
SHOW ANALYZE 16990; +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------------------------------------+---------------+ | job_id | catalog_name | db_name | tbl_name | col_name | job_type | analysis_type | message | last_exec_time_in_ms | state | progress | schedule_type | +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------------------------------------+---------------+ | 16990 | hive | default_cluster:tpch100 | lineitem | [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct] | MANUAL | FUNDAMENTALS | | 2023-07-27 16:01:52 | RUNNING | 2 Finished/0 Failed/15 In Progress/17 Total | ONCE | +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------------------------------------+---------------+ 1 row in set (0.00 sec)
以及查看每一列的Task状态。
SHOW ANALYZE TASK STATUS 16990; +---------+-----------------+---------+------------------------+-----------------+----------+ | task_id | col_name | message | last_state_change_time | time_cost_in_ms | state | +---------+-----------------+---------+------------------------+-----------------+----------+ | 16991 | l_receiptdate | | 2023-07-27 16:01:29 | 0 | PENDING | | 16992 | l_returnflag | | 2023-07-27 16:01:44 | 14394 | FINISHED | | 16993 | l_tax | | 2023-07-27 16:01:52 | 7975 | FINISHED | | 16994 | l_shipmode | | 2023-07-27 16:02:11 | 18961 | FINISHED | | 16995 | l_suppkey | | 2023-07-27 16:02:17 | 6684 | FINISHED | | 16996 | l_shipdate | | 2023-07-27 16:02:26 | 8518 | FINISHED | | 16997 | l_commitdate | | 2023-07-27 16:02:26 | 0 | RUNNING | | 16998 | l_partkey | | 2023-07-27 16:01:29 | 0 | PENDING | | 16999 | l_quantity | | 2023-07-27 16:01:29 | 0 | PENDING | | 17000 | l_orderkey | | 2023-07-27 16:01:29 | 0 | PENDING | | 17001 | l_comment | | 2023-07-27 16:01:29 | 0 | PENDING | | 17002 | l_linestatus | | 2023-07-27 16:01:29 | 0 | PENDING | | 17003 | l_extendedprice | | 2023-07-27 16:01:29 | 0 | PENDING | | 17004 | l_linenumber | | 2023-07-27 16:01:29 | 0 | PENDING | | 17005 | l_shipinstruct | | 2023-07-27 16:01:29 | 0 | PENDING | | 17006 | l_discount | | 2023-07-27 16:01:29 | 0 | PENDING | | 17007 | TableRowCount | | 2023-07-27 16:01:29 | 0 | PENDING | +---------+-----------------+---------+------------------------+-----------------+----------+ 17 rows in set (0.00 sec)
收集tpch100数据库所有表的信息。
ANALYZE DATABASE hive.tpch100; +--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+ | Catalog_Name | DB_Name | Table_Name | Columns | Job_Id | +--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+ | hive | tpch100 | supplier | [s_comment,s_phone,s_nationkey,s_name,s_address,s_acctbal,s_suppkey] | 17018 | | hive | tpch100 | nation | [n_comment,n_nationkey,n_regionkey,n_name] | 17027 | | hive | tpch100 | region | [r_regionkey,r_comment,r_name] | 17033 | | hive | tpch100 | partsupp | [ps_suppkey,ps_availqty,ps_comment,ps_partkey,ps_supplycost] | 17038 | | hive | tpch100 | orders | [o_orderstatus,o_clerk,o_orderdate,o_shippriority,o_custkey,o_totalprice,o_orderkey,o_comment,o_orderpriority] | 17045 | | hive | tpch100 | lineitem | [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct] | 17056 | | hive | tpch100 | part | [p_partkey,p_container,p_name,p_comment,p_brand,p_type,p_retailprice,p_mfgr,p_size] | 17074 | | hive | tpch100 | customer | [c_custkey,c_phone,c_acctbal,c_mktsegment,c_address,c_nationkey,c_name,c_comment] | 17085 | +--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+ 8 rows in set (0.29 sec)
此操作会批量提交
tpch100
数据库下所有表的收集任务,也是异步执行。该操作会给每个表创建一个Job_Id,您可以通过Job_Id查看每张表的任务进度。同步收集。
可以使用WITH sync同步收集表或数据库的统计信息。这时不会创建后台任务,客户端在收集完成之前会被阻塞,直到收集任务执行完成再返回。
ANALYZE TABLE hive.tpch100.orders WITH sync; Query OK, 0 rows affected (33.19 sec)
需要注意的是,同步收集受
query_timeout session
变量影响。如果超时失败,需要调大该变量后重试。比如:set query_timeout=3600
(超时时间设置为1小时)。
周期性收集
使用WITH period可以设置周期性的执行收集任务:
ANALYZE TABLE hive.tpch100.orders WITH period 86400;
上述语句将创建一个周期性收集的任务,周期是1天,每天自动收集和更新orders表的统计信息。
任务管理
外表统计信息的任务管理方式也和内表相同,主要包括查看Job、查看Task、删除Job等功能。更多信息请参见统计信息中的任务管理。
查看所有Job状态。
SHOW ANALYZE; +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------------------------------------+---------------+ | job_id | catalog_name | db_name | tbl_name | col_name | job_type | analysis_type | message | last_exec_time_in_ms | state | progress | schedule_type | +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------------------------------------+---------------+ | 16990 | hive | default_cluster:tpch100 | lineitem | [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct] | MANUAL | FUNDAMENTALS | | 2023-07-27 16:05:02 | FINISHED | 17 Finished/0 Failed/0 In Progress/17 Total | ONCE | +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------------------------------------+---------------+
查看一个Job的所有Task状态。
SHOW ANALYZE TASK STATUS 16990; +---------+-----------------+---------+------------------------+-----------------+----------+ | task_id | col_name | message | last_state_change_time | time_cost_in_ms | state | +---------+-----------------+---------+------------------------+-----------------+----------+ | 16991 | l_receiptdate | | 2023-07-27 16:05:02 | 9560 | FINISHED | | 16992 | l_returnflag | | 2023-07-27 16:01:44 | 14394 | FINISHED | | 16993 | l_tax | | 2023-07-27 16:01:52 | 7975 | FINISHED | | 16994 | l_shipmode | | 2023-07-27 16:02:11 | 18961 | FINISHED | | 16995 | l_suppkey | | 2023-07-27 16:02:17 | 6684 | FINISHED | | 16996 | l_shipdate | | 2023-07-27 16:02:26 | 8518 | FINISHED | | 16997 | l_commitdate | | 2023-07-27 16:02:34 | 8380 | FINISHED | | 16998 | l_partkey | | 2023-07-27 16:02:40 | 6060 | FINISHED | | 16999 | l_quantity | | 2023-07-27 16:02:50 | 9768 | FINISHED | | 17000 | l_orderkey | | 2023-07-27 16:02:57 | 7200 | FINISHED | | 17001 | l_comment | | 2023-07-27 16:03:36 | 38468 | FINISHED | | 17002 | l_linestatus | | 2023-07-27 16:03:51 | 15226 | FINISHED | | 17003 | l_extendedprice | | 2023-07-27 16:04:00 | 8713 | FINISHED | | 17004 | l_linenumber | | 2023-07-27 16:04:06 | 6659 | FINISHED | | 17005 | l_shipinstruct | | 2023-07-27 16:04:36 | 29777 | FINISHED | | 17006 | l_discount | | 2023-07-27 16:04:45 | 9212 | FINISHED | | 17007 | TableRowCount | | 2023-07-27 16:04:52 | 6974 | FINISHED | +---------+-----------------+---------+------------------------+-----------------+----------+
终止未完成的Job。
KILL ANALYZE [job_id]
删除周期性收集Job。
DROP ANALYZE JOB [JOB_ID]
信息查看
信息的查看包括表的统计信息(表的行数)查看和列统计信息查看,请参考统计信息查看统计信息部分。
表统计信息
SHOW TABLE [cached] stats TABLE_NAME;
查看statistics表中指定table的行数,如果指定cached参数,则展示的是指定表已加载到缓存中的行数信息,示例如下。
SHOW TABLE STATS hive.tpch100.orders;
+-----------+---------------------+---------------------+
| row_count | update_time | last_analyze_time |
+-----------+---------------------+---------------------+
| 150000000 | 2023-07-11 23:01:49 | 2023-07-11 23:01:44 |
+-----------+---------------------+---------------------+
列统计信息
SHOW COLUMN [cached] stats TABLE_NAME;
查看statistics表中指定table的列统计信息,如果指定cached参数,则展示的是指定表已加载到缓存中的列信息,示例如下。
SHOW COLUMN stats hive.tpch100.orders;
+-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
| column_name | count | ndv | num_null | data_size | avg_size_byte | min | max |
+-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
| o_orderstatus | 1.5E8 | 3.0 | 0.0 | 1.50000001E8 | 1.0 | 'F' | 'P' |
| o_clerk | 1.5E8 | 100836.0 | 0.0 | 2.250000015E9 | 15.0 | 'Clerk#000000001' | 'Clerk#000100000' |
| o_orderdate | 1.5E8 | 2417.0 | 0.0 | 6.00000004E8 | 4.0 | '1992-01-01' | '1998-08-02' |
| o_shippriority | 1.5E8 | 1.0 | 0.0 | 6.00000004E8 | 4.0 | 0 | 0 |
| o_custkey | 1.5E8 | 1.0023982E7 | 0.0 | 6.00000004E8 | 4.0 | 1 | 14999999 |
| o_totalprice | 1.5E8 | 3.4424096E7 | 0.0 | 1.200000008E9 | 8.0 | 811.73 | 591036.15 |
| o_orderkey | 1.5E8 | 1.51621184E8 | 0.0 | 1.200000008E9 | 8.0 | 1 | 600000000 |
| o_comment | 1.5E8 | 1.10204136E8 | 0.0 | 7.275038757500258E9 | 48.50025806 | ' Tiresias about the' | 'zzle? unusual requests w' |
| o_orderpriority | 1.5E8 | 5.0 | 0.0 | 1.2600248124001656E9 | 8.40016536 | '1-URGENT' | '5-LOW' |
+-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
信息修改
修改信息支持您手动修改列统计信息。可以修改指定列的row_count, ndv, num_nulls, min_value, max_value, data_size
等信息。 请参考统计信息修改统计信息部分。
ALTER TABLE hive.tpch100.orders MODIFY COLUMN o_orderstatus SET STATS ('row_count'='6001215');
Query OK, 0 rows affected (0.03 sec)
SHOW COLUMN stats hive.tpch100.orders;
+-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
| column_name | count | ndv | num_null | data_size | avg_size_byte | min | max |
+-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
| o_orderstatus | 6001215.0 | 0.0 | 0.0 | 0.0 | 0.0 | 'NULL' | 'NULL' |
| o_clerk | 1.5E8 | 100836.0 | 0.0 | 2.250000015E9 | 15.0 | 'Clerk#000000001' | 'Clerk#000100000' |
| o_orderdate | 1.5E8 | 2417.0 | 0.0 | 6.00000004E8 | 4.0 | '1992-01-01' | '1998-08-02' |
| o_shippriority | 1.5E8 | 1.0 | 0.0 | 6.00000004E8 | 4.0 | 0 | 0 |
| o_custkey | 1.5E8 | 1.0023982E7 | 0.0 | 6.00000004E8 | 4.0 | 1 | 14999999 |
| o_totalprice | 1.5E8 | 3.4424096E7 | 0.0 | 1.200000008E9 | 8.0 | 811.73 | 591036.15 |
| o_orderkey | 1.5E8 | 1.51621184E8 | 0.0 | 1.200000008E9 | 8.0 | 1 | 600000000 |
| o_comment | 1.5E8 | 1.10204136E8 | 0.0 | 7.275038757500258E9 | 48.50025806 | ' Tiresias about the' | 'zzle? unusual requests w' |
| o_orderpriority | 1.5E8 | 5.0 | 0.0 | 1.2600248124001656E9 | 8.40016536 | '1-URGENT' | '5-LOW' |
+-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
信息删除
外表统计信息支持您删除一张表的表行数信息和列统计信息。如果您指定了删除的列名,则只删除这些列的信息。如果不指定,则删除整张表所有列的统计信息以及表行数信息。 更多信息请参见统计信息中的删除统计信息部分。
删除整张表的信息。
DROP STATS hive.tpch100.orders
删除表中某几列的信息。
DROP STATS hive.tpch100.orders (o_orderkey, o_orderdate)
实现原理
统计信息数据来源
优化器(Nereids)通过Cache读取统计信息,Cache的数据来源为:
内部的Statistics表。Statistics表的数据通过您执行
ANALYZE
语句收集而来。这一部分的架构与内表相同,您可以像分析内表一样,对外表执行ANALYZE
语句来收集统计信息。Stats Collector。这部分来源只针对外表Cache。Stats Collector定义了一些接口,用来从外部数据源获取统计信息。比如目前已经支持的Hive Metastore和Iceberg两种数据源,这些接口可以获取外部数据源中已有的统计信息。
以Hive为例,如果您在Hive中执行过
ANALYZE
操作,那么在SelectDB中查询时,SelectDB可以直接从Hive Metastore中加载已有的统计信息到缓存中,包括表的行数、列的最大、最小值等。如果外部数据源也没有统计信息,Stats Collector会根据表中数据文件的大小和表的Schema,大致估算一个行数提供给优化器。在这种情况下,列的统计信息是缺失的,可能导致优化器生成比较低效的执行计划。Stats collector在Statistics表中无数据时自动执行,您无需执行命令或进行设置。