多值子列是HybridDB for MySQL分析引擎的独特特性,用于表示一个列(某个Cell)有多个不确定的值,列中列的概念。一个表中可以定义一个或者多个多值列,多值列可以做筛选、分组、参与连接过滤等。以下是几个具体的例子。

建表

CREATE TABLE test1 (
    user_id BIGINT,
    city VARCHAR,
    tags MULTIVALUE delimiter_tokenizer ',' value_type 'varchar',
    brands MULTIVALUE delimiter_tokenizer ': ,' value_type 'varchar int'
)
DISTRIBUTE BY HASH (user_id)
ENGINE = 'CSTORE'
INDEX_ALL='Y';

说明

说明 业务意义
tags 只有一个多值子列,用`,`分隔 每个用户可以打多个标签
brands 第一个多值子列类型为varchar,用`:`分隔 表示每个用户偏好的品牌
第二个多值子列类型为int,用`,`分隔 表示每个用户偏好品牌的偏好度(分数)

写入数据

insert into test1 values(1, 'HZ', 'A,B,C', 'X:80,Y:70');
insert into test1 values(2, 'BJ', 'B,D', 'X:50,Z:90,W:60');
insert into test1 values(3, 'SH', 'A,C,D,F', 'X:90');

mysql> select * from test1 order by user_id;
+---------+------+---------+----------------+
| user_id | city | tags    | brands         |
+---------+------+---------+----------------+
|       1 | HZ   | A,B,C   | X:80,Y:70      |
|       2 | BJ   | B,D     | X:50,Z:90,W:60 |
|       3 | SH   | A,C,D,F | X:90           |
+---------+------+---------+----------------+
3 rows in set (0.01 sec)

多值子列查询

过滤查询

#查询包含A、B标签的用户数目

mysql> select count(*) from test1 where ref(tags,0) in ('A', 'B');
+----------+
| COUNT(*) |
+----------+
|        3 |
+----------+
1 row in set (0.10 sec)

#查询同时包含A和B标签的用户数目

mysql> select * from test1 where ref(tags,0) in ('A') and ref(tags,0) in ('B');
+---------+------+-------+-----------+
| user_id | city | tags  | brands    |
+---------+------+-------+-----------+
|       1 | HZ   | A,B,C | X:80,Y:70 |
+---------+------+-------+-----------+
1 row in set (0.01 sec)

#查询偏好品牌包含X和偏好度在80以上的用户

mysql> select * from test1 where ref(brands, 0) in ('X') and ref(brands, 1) > 80;
+---------+------+---------+----------------+
| user_id | city | tags    | brands         |
+---------+------+---------+----------------+
|       2 | BJ   | B,D     | X:50,Z:90,W:60 |
|       3 | SH   | A,C,D,F | X:90           |
+---------+------+---------+----------------+
2 rows in set (0.01 sec)

分组查询

#不在杭州且标签为A或者B的用户在A、B标签下的人数

mysql> select ref(tags,0), count(*) from test1
where ref(tags,0) in ('A', 'B') and city != 'HZ' group by ref(tags,0);
+--------------+----------+
| ref(tags, 0) | COUNT(*) |
+--------------+----------+
| A            |        1 |
| B            |        1 |
+--------------+----------+
2 rows in set (0.05 sec)

#查询A或者B标签和其对应城市的联合分组

mysql> select ref(tags,0), city, count(*) from test1
where ref(tags,0) in ('A', 'B') group by ref(tags,0), city;
+--------------+----------+
| ref(tags, 0) | COUNT(*) |
+--------------+----------+
| B            | BJ       |
| B            | HZ       |
| A            | SH       |
| A            | HZ       |
+--------------+----------+
4 rows in set (0.02 sec)

#多个多值子列的联合分组

mysql> select ref(tags,0), ref(brands,0), count(*) from test1
where ref(tags,0) in ('A', 'B') and ref(brands,0) in ('X')  
group by ref(tags,0), ref(brands,0) ;
+--------------+----------------+----------+
| ref(tags, 0) | ref(brands, 0) | COUNT(*) |
+--------------+----------------+----------+
| A            | X              |        2 |
| B            | X              |        2 |
+--------------+----------------+----------+
2 rows in set (0.02 sec)

连接查询

mysql> select count(*) from test1 join test
on test1.user_id = test.id where ref(tags,0) in ('A', 'B') ;
+----------+
| COUNT(*) |
+----------+
|        3 |
+----------+
1 row in set (0.16 sec)


mysql> select sum(test.int_test) from test1 join test
on test1.user_id = test.id where ref(tags,0) in ('A', 'B') ;
+--------------------+
| SUM(test.int_test) |
+--------------------+
|               -207 |
+--------------------+
1 row in set (0.03 sec)


mysql> select test.id, test1.* from test1 join test
on test1.user_id = test.id
where ref(tags,0) in ('A', 'B') and ref(brands, 0) in ('Y');
+----+---------+------+-------+-----------+
| id | user_id | city | tags  | brands    |
+----+---------+------+-------+-----------+
|  1 |       1 | HZ   | A,B,C | X:80,Y:70 |
+----+---------+------+-------+-----------+
1 row in set (0.02 sec)

注意事项

  • 多值列的GROUP BY一定要带WHERE条件,否则引擎底层需要遍历所有的value(默认只遍历1024个value)。
  • 多个子列之间的组合筛选是Cell内的或(OR)关系。