本文介绍如何在不移动日志文件的情况下,通过DLA分析OSS中的CDN日志数据,及时发现问题,并有针对性的解决问题,提升CDN的服务质量。
背景信息
阿里云内容分发网络(Content Delivery Network,CDN)是建立并覆盖在承载网之上,由分布在不同区域的边缘节点服务器群组成的分布式网络。CDN分担源站压力,避免网络拥塞,确保在不同区域、不同场景下加速网站内容的分发,提高资源访问速度。您可以将CDN日志转存到OSS,或者将CDN日志配置实时日志推送,然后进行日志分析。
前提条件
步骤一:将CDN日志转存到OSS
将CDN日志转存到OSS,如何转存,请参见通过函数计算转存离线日志。
步骤二:创建OSS Schema
假设OSS Bucket中存储了以下CDN日志文件。
您可以在DMS页面编写SQL创建OSS Schema、创建日志文件表以及读取日志文件数据,也可以通过MySQL客户端或者MySQL命令行工具连接DLA,然后编写SQL创建OSS Schema、创建日志文件表以及读取日志文件数据。
CREATE SCHEMA cdn_log_schema with DBPROPERTIES(
catalog='oss',
location = 'oss://bucket-name/cdn/'
);
步骤三:创建日志文件表
CDN日志文件数据如下:
[9/Jun/2015:01:58:09 +0800] 188.165.XX.XX - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "text/html"
日志字段含义如下:
字段 | 说明 |
[9/Jun/2015:01:58:09 +0800] |
时间。 |
188.165.XX.XX |
访问IP地址。 |
- |
代理IP。 |
1542 |
responsetime,单位为ms。 |
"-" |
referer。 |
"GET http://www.aliyun.com/index.html" |
request。 |
200 |
http code。 |
191 |
requestsize,单位为byte。 |
2830 |
responsesize,单位为byte。 |
MISS |
cache命中状态。 |
"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" |
UA头。 |
"text/html" |
文件类型。 |
在DLA中建表时,需要采用正则表达式对CDN日志进行解析,每个字段和对应的正则表达式说明如下:
字段 | 说明 |
[9/Jun/2015:01:58:09 +0800] |
(-|\\[[^\\]]*\\]) |
188.165.XX.XX |
([^ ]*) |
- |
([^ ]*) |
1542 |
([^ ]*) |
"-" |
([^ \"]*|\"[^\"]*\") |
"GET http://www.aliyun.com/index.html" |
([^ \"]*|\"[^\"]*\") |
200 |
(-|[0-9]*) |
191 |
(-|[0-9]*) |
2830 |
(-|[0-9]*) |
MISS |
([^ ]*) |
"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" |
([^ \"]*|\"[^\"]*\") |
"text/html" |
([^ \"]*|\"[^\"]*\") |
完整的正则表达式为:(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")
(-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")
。
DLA中创建日志文件表的SQL语句如下:
CREATE EXTERNAL TABLE cdn_log (
log_timestamp VARCHAR,
access_ip VARCHAR,
proxy_ip VARCHAR,
response_time VARCHAR,
referer VARCHAR,
request VARCHAR,
httpcode SMALLINT,
request_size BIGINT,
response_size BIGINT,
cache_hit_status VARCHAR,
ua_header VARCHAR,
file_type VARCHAR
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-name/cdn/'
TBLPROPERTIES ('recursive.directories' = 'true');
步骤四:查询、分析CDN日志
SELECT * FROM cdn_log;
返回结果如下:
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| log_timestamp | access_ip | proxy_ip | response_time | referer | request | httpcode | request_size | response_size | cache_hit_status | ua_header | file_type |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| [18/Jun/2019:05:08:33 +0800] | 47.92.XX.XX | - | 777 | "-" | | 200 | 201 | 7159 | MISS | "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))" | "text/html;charset=UTF-8" |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
您可以使用DLA系统函数,对数据进行分析。
SELECT date_parse(substring(log_timestamp, 2, length(log_timestamp) - 8), '%d/%b/%Y:%H:%i:%s') as log_timestamp,
access_ip,
proxy_ip,
response_time,
substring(referer, 2, length(referer) - 2) as referer,
substring(request, 2, length(request) - 2) as request,
httpcode,
request_size,
response_size,
cache_hit_status,
substring(ua_header, 2, length(ua_header) - 2) as ua_header,
substring(file_type, 2, length(file_type) - 2) as file_type
FROM cdn_log;
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| log_timestamp | access_ip | proxy_ip | response_time | referer | request | httpcode | request_size | response_size | cache_hit_status | ua_header | file_type |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| 2019-06-18 05:08:33.000 | 47.92.XX.XX | - | 777 | - | | 200 | 201 | 7159 | MISS | Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D)) | text/html;charset=UTF-8 |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+