Star Schema Benchmark

云数据库 SelectDB 版致力于提供极致性能、简单易用的数据分析服务,在宽表聚合、多表关联和高并发点查等场景下均具有优异的性能表现。本文将为您介绍云数据库 SelectDB 版SSB标准测试上的测试方法和测试结果。

概述

星型模式基准测试(SSB)Star Schema Benchmark是一个轻量级的数仓场景下的性能测试集。SSB基于事务处理性能委员会(TPC-H)Transaction Processing Performance Council,提供了一个简化版的星型模型数据集,主要用于测试在星型模型下,多表关联查询的性能表现。此外,业界通常也将SSB称为宽表模型(简称SSB FLAT),以测试查询引擎的性能。本文档主要介绍SelectDBSSB 100 GB测试集上的性能表现。

说明

包括SSB在内的标准测试集通常和实际业务场景差距较大,并且部分测试会针对测试集进行参数调优。所以标准测试集的测试结果仅能反映数据库在特定场景下的性能表现。建议用户使用实际业务数据进行进一步的测试。

测试环境

  • 数据库环境。

    环境配置项

    配置说明

    地域和可用区

    华东1(杭州)地域,可用区K。

    规格

    64核 512 GB

    磁盘

    800 GB高性能云硬盘

    云数据库 SelectDB 版内核版本

    3.0.6

  • 客户端环境。

    环境配置项

    配置说明

    下载测试工具的设备

    云服务器ECS实例,详情请参见创建实例

    地域和可用区

    华东1(杭州)地域

    实例规格

    ecs.g7.2xlarge

    操作系统

    Ubuntu 22.04.1 LTS

    网络

    云数据库 SelectDB 版版实例为相同专有网络(VPC)。

测试数据集

整个测试数据集模拟生成SSB 100 GB500 GB的数据并导入SelectDB进行测试,下面是测试100 GB数据表的相关说明及数据量。

SSB表名

行数

备注

lineorder

600037902

商品订单明细表。

customer

3000000

客户信息表。

part

1400000

零件信息表。

supplier

200000

供应商信息表。

dates

2556

日期表。

lineorder_flat

600037902

数据展平后的宽表。

测试步骤

如下介绍进行测试所需要的前置工作和测试步骤。涉及的脚本获取请参见瑶池测试工具

步骤一:下载安装SSB数据生成工具

从上述脚本库中获取脚本后,解压脚本文件并进入对应目录,执行以下指令,下载并编译ssb-dbgen工具,示例如下。

tar -zxvf yaochi_performance_tool.tar.gz
cd ./yaochi_performance_tool/ssb-tools/bin
bash build-ssb-dbgen.sh

安装成功后,将在./yaochi_performance_tool/ssb-tools/bin/ssb-dbgen/目录下生成dbgen二进制文件。

步骤二:生成SSB测试集

在安装测试工具目录执行下脚本以生成SSB数据集,示例如下。

cd ./yaochi_performance_tool/ssb-tools/bin
bash gen-ssb-data.sh -s 100 -c 100

数据会以.tbl为后缀生成在ssb-data/目录下。

其中,-s 100表示测试集大小系数为100,-c 100表示并发100个线程生成lineorder表的数据。-c参数也决定了最终lineorder表的文件数量。参数越大,文件数越多,每个文件越小。

-s 100参数下,生成的数据集文件总大小约60GB。生成时间可能在数分钟到1小时不等。具体数据信息如下表。

Table

Rows

Size

File Number

lineorder

600037902

60GB

100

customer

3000000

277M

1

part

1400000

116M

1

supplier

200000

17M

1

dates

2556

228K

1

步骤三:建表

  1. 准备doris-cluster.conf文件

    在调用导入脚本前,需要将测试用信息写在doris-cluster.conf文件中。文件位置在ssb-tools/conf/目录下,内容包括连接集群的地址、HTTP端口、用户名、密码和待导入数据的DB。

    说明

    您可以在云数据库 SelectDB 版控制台的实例详情 > 网络信息中获取VPC地址(或公网地址)和HTTP协议端口

    export FE_HOST="xxx"
    export FE_HTTP_PORT="8080"
    export FE_QUERY_PORT="9030"
    export USER="root"
    export PASSWORD='xxx'
    export DB="ssb"
  2. 创建SSB

    ./yaochi_performance_tool/ssb-tools/bin/目录下执行脚本来自动创建测试用表。

    bash create-ssb-tables.sh

    可以考虑通过修改建表语句中的配置,以让性能测试更加贴合实际生产的环境。以lineorder_flat表建表语句为例,以下的建表语句中,设置了默认分桶为48个桶。您可以根据您的集群规模节点配置对这个分桶数进行调整,这样可以获取到更好的测试效果。

    CREATE TABLE `lineorder_flat` (
      `LO_ORDERDATE` date NOT NULL COMMENT "",
      `LO_ORDERKEY` int(11) NOT NULL COMMENT "",
      `LO_LINENUMBER` tinyint(4) NOT NULL COMMENT "",
      `LO_CUSTKEY` int(11) NOT NULL COMMENT "",
      `LO_PARTKEY` int(11) NOT NULL COMMENT "",
      `LO_SUPPKEY` int(11) NOT NULL COMMENT "",
      `LO_ORDERPRIORITY` varchar(100) NOT NULL COMMENT "",
      `LO_SHIPPRIORITY` tinyint(4) NOT NULL COMMENT "",
      `LO_QUANTITY` tinyint(4) NOT NULL COMMENT "",
      `LO_EXTENDEDPRICE` int(11) NOT NULL COMMENT "",
      `LO_ORDTOTALPRICE` int(11) NOT NULL COMMENT "",
      `LO_DISCOUNT` tinyint(4) NOT NULL COMMENT "",
      `LO_REVENUE` int(11) NOT NULL COMMENT "",
      `LO_SUPPLYCOST` int(11) NOT NULL COMMENT "",
      `LO_TAX` tinyint(4) NOT NULL COMMENT "",
      `LO_COMMITDATE` date NOT NULL COMMENT "",
      `LO_SHIPMODE` varchar(100) NOT NULL COMMENT "",
      `C_NAME` varchar(100) NOT NULL COMMENT "",
      `C_ADDRESS` varchar(100) NOT NULL COMMENT "",
      `C_CITY` varchar(100) NOT NULL COMMENT "",
      `C_NATION` varchar(100) NOT NULL COMMENT "",
      `C_REGION` varchar(100) NOT NULL COMMENT "",
      `C_PHONE` varchar(100) NOT NULL COMMENT "",
      `C_MKTSEGMENT` varchar(100) NOT NULL COMMENT "",
      `S_NAME` varchar(100) NOT NULL COMMENT "",
      `S_ADDRESS` varchar(100) NOT NULL COMMENT "",
      `S_CITY` varchar(100) NOT NULL COMMENT "",
      `S_NATION` varchar(100) NOT NULL COMMENT "",
      `S_REGION` varchar(100) NOT NULL COMMENT "",
      `S_PHONE` varchar(100) NOT NULL COMMENT "",
      `P_NAME` varchar(100) NOT NULL COMMENT "",
      `P_MFGR` varchar(100) NOT NULL COMMENT "",
      `P_CATEGORY` varchar(100) NOT NULL COMMENT "",
      `P_BRAND` varchar(100) NOT NULL COMMENT "",
      `P_COLOR` varchar(100) NOT NULL COMMENT "",
      `P_TYPE` varchar(100) NOT NULL COMMENT "",
      `P_SIZE` tinyint(4) NOT NULL COMMENT "",
      `P_CONTAINER` varchar(100) NOT NULL COMMENT ""
    ) ENGINE=OLAP
    DUPLICATE KEY(`LO_ORDERDATE`, `LO_ORDERKEY`)
    COMMENT "OLAP"
    PARTITION BY RANGE(`LO_ORDERDATE`)
    (PARTITION p1 VALUES [('0000-01-01'), ('1993-01-01')),
    PARTITION p2 VALUES [('1993-01-01'), ('1994-01-01')),
    PARTITION p3 VALUES [('1994-01-01'), ('1995-01-01')),
    PARTITION p4 VALUES [('1995-01-01'), ('1996-01-01')),
    PARTITION p5 VALUES [('1996-01-01'), ('1997-01-01')),
    PARTITION p6 VALUES [('1997-01-01'), ('1998-01-01')),
    PARTITION p7 VALUES [('1998-01-01'), ('1999-01-01')))
    DISTRIBUTED BY HASH(`LO_ORDERKEY`) BUCKETS 48
    PROPERTIES (
    "replication_num" = "1",
    "colocate_with" = "groupxx1",
    "in_memory" = "false",
    "storage_format" = "DEFAULT"
    );

步骤四:导入数据

./yaochi_performance_tool/ssb-tools/bin/目录下执行脚本完成SSB测试集所有数据的导入。

bash load-ssb-data.sh -c 5

其中,-c 5表示启动5个并发线程导入(默认为 5)。如果开启更多线程,可以加快导入速度,但会增加额外的内存开销。

说明

配置项flush_thread_num_per_store表示每个数据目录的写盘线程数,影响测试数据的导入速度,默认为6。较大的数据可以提升写数据吞吐,但可能会增加IO Util。

步骤五:检查导入数据

SELECT COUNT(*) FROM part;
SELECT COUNT(*) FROM customer;
SELECT COUNT(*) FROM supplier;
SELECT COUNT(*) FROM dates;
SELECT COUNT(*) FROM lineorder;
SELECT COUNT(*) FROM lineorder_flat;

按照上述流程和参数执行的场合,数据量应和以下表格给出的生成数据的行数一致。

Table

Rows

Origin Size

Compacted Size(1 Replica)

lineorder_flat

600037902

60 GB

59.709 GB

lineorder

600037902

60 GB

14.514 GB

customer

3000000

277 MB

138.247 MB

part

1400000

116 MB

12.759 MB

supplier

200000

17 MB

9.143 MB

dates

2556

228 KB

34.276 KB

步骤六:查询测试

标准SSB查询语句,示例如下。其他语句详情请参见ssb-queries

--Q1.1
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM lineorder, dates
WHERE
    lo_orderdate = d_datekey
    AND d_year = 1993
    AND lo_discount BETWEEN 1 AND 3
    AND lo_quantity < 25;
--Q1.2
SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM lineorder, dates
WHERE
    lo_orderdate = d_datekey
    AND d_yearmonth = 'Jan1994'
    AND lo_discount BETWEEN 4 AND 6
    AND lo_quantity BETWEEN 26 AND 35;
    
--Q1.3
SELECT
    SUM(lo_extendedprice * lo_discount) AS REVENUE
FROM lineorder, dates
WHERE
    lo_orderdate = d_datekey
    AND d_weeknuminyear = 6
    AND d_year = 1994
    AND lo_discount BETWEEN 5 AND 7
    AND lo_quantity BETWEEN 26 AND 35;
    
--Q2.1
SELECT SUM(lo_revenue), d_year, p_brand
FROM lineorder, dates, part, supplier
WHERE
    lo_orderdate = d_datekey
    AND lo_partkey = p_partkey
    AND lo_suppkey = s_suppkey
    AND p_category = 'MFGR#12'
    AND s_region = 'AMERICA'
GROUP BY d_year, p_brand
ORDER BY p_brand;

--Q2.2
SELECT SUM(lo_revenue), d_year, p_brand
FROM lineorder, dates, part, supplier
WHERE
    lo_orderdate = d_datekey
    AND lo_partkey = p_partkey
    AND lo_suppkey = s_suppkey
    AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228'
    AND s_region = 'ASIA'
GROUP BY d_year, p_brand
ORDER BY d_year, p_brand;

--Q2.3
SELECT SUM(lo_revenue), d_year, p_brand
FROM lineorder, dates, part, supplier
WHERE
    lo_orderdate = d_datekey
    AND lo_partkey = p_partkey
    AND lo_suppkey = s_suppkey
    AND p_brand = 'MFGR#2239'
    AND s_region = 'EUROPE'
GROUP BY d_year, p_brand
ORDER BY d_year, p_brand;

--Q3.1
SELECT
    c_nation,
    s_nation,
    d_year,
    SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_orderdate = d_datekey
    AND c_region = 'ASIA'
    AND s_region = 'ASIA'
    AND d_year >= 1992
    AND d_year <= 1997
GROUP BY c_nation, s_nation, d_year
ORDER BY d_year ASC, REVENUE DESC;

--Q3.2
SELECT
    c_city,
    s_city,
    d_year,
    SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_orderdate = d_datekey
    AND c_nation = 'UNITED STATES'
    AND s_nation = 'UNITED STATES'
    AND d_year >= 1992
    AND d_year <= 1997
GROUP BY c_city, s_city, d_year
ORDER BY d_year ASC, REVENUE DESC;

--Q3.3
SELECT
    c_city,
    s_city,
    d_year,
    SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_orderdate = d_datekey
    AND (
        c_city = 'UNITED KI1'
        OR c_city = 'UNITED KI5'
    )
    AND (
        s_city = 'UNITED KI1'
        OR s_city = 'UNITED KI5'
    )
    AND d_year >= 1992
    AND d_year <= 1997
GROUP BY c_city, s_city, d_year
ORDER BY d_year ASC, REVENUE DESC;

--Q3.4
SELECT
    c_city,
    s_city,
    d_year,
    SUM(lo_revenue) AS REVENUE
FROM customer, lineorder, supplier, dates
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_orderdate = d_datekey
    AND (
        c_city = 'UNITED KI1'
        OR c_city = 'UNITED KI5'
    )
    AND (
        s_city = 'UNITED KI1'
        OR s_city = 'UNITED KI5'
    )
    AND d_yearmonth = 'Dec1997'
GROUP BY c_city, s_city, d_year
ORDER BY d_year ASC, REVENUE DESC;

--Q4.1
SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=4, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */
    d_year,
    c_nation,
    SUM(lo_revenue - lo_supplycost) AS PROFIT
FROM dates, customer, supplier, part, lineorder
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_partkey = p_partkey
    AND lo_orderdate = d_datekey
    AND c_region = 'AMERICA'
    AND s_region = 'AMERICA'
    AND (
        p_mfgr = 'MFGR#1'
        OR p_mfgr = 'MFGR#2'
    )
GROUP BY d_year, c_nation
ORDER BY d_year, c_nation;

--Q4.2
SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */  
    d_year,
    s_nation,
    p_category,
    SUM(lo_revenue - lo_supplycost) AS PROFIT
FROM dates, customer, supplier, part, lineorder
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_partkey = p_partkey
    AND lo_orderdate = d_datekey
    AND c_region = 'AMERICA'
    AND s_region = 'AMERICA'
    AND (
        d_year = 1997
        OR d_year = 1998
    )
    AND (
        p_mfgr = 'MFGR#1'
        OR p_mfgr = 'MFGR#2'
    )
GROUP BY d_year, s_nation, p_category
ORDER BY d_year, s_nation, p_category;

--Q4.3
SELECT /*+SET_VAR(parallel_fragment_exec_instance_num=2, enable_vectorized_engine=true, batch_size=4096, enable_cost_based_join_reorder=true, enable_projection=true) */
    d_year,
    s_city,
    p_brand,
    SUM(lo_revenue - lo_supplycost) AS PROFIT
FROM dates, customer, supplier, part, lineorder
WHERE
    lo_custkey = c_custkey
    AND lo_suppkey = s_suppkey
    AND lo_partkey = p_partkey
    AND lo_orderdate = d_datekey
    AND s_nation = 'UNITED STATES'
    AND (
        d_year = 1997
        OR d_year = 1998
    )
    AND p_category = 'MFGR#14'
GROUP BY d_year, s_city, p_brand
ORDER BY d_year, s_city, p_brand;

步骤七:测试结果

以下是SSB 100GB500GB的测试结果。

Query

SSB 100GB(s)

SSB 500GB(s)

Q1.1

0.06

0.81

Q1.2

0.03

0.49

Q1.3

0.03

0.44

Q2.1

0.15

1.35

Q2.2

0.12

1

Q2.3

0.11

0.95

Q3.1

0.35

2.55

Q3.2

0.13

1.09

Q3.3

0.11

0.95

Q3.4

0.05

0.83

Q4.1

0.33

5.84

Q4.2

0.15

2.44

Q4.3

0.1

1.41

合计

1.72

20.15