Migrate data from OSS

This topic describes how to migrate data from Object Storage Service (OSS) to LindormDFS.

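Prerequisites

  1. Activate LindormDFS. For details, see the activation guide.

  2. Set up a Hadoop cluster. Hadoop 2.7.3 or later is recommended; this document uses Apache Hadoop 2.7.3. Modify the Hadoop configuration as described in Access with the open source HDFS client.

  3. Install the JDK on all nodes of the Hadoop cluster. JDK 1.8 or later is required.

  4. Install the OSS client, JindoFS SDK, on the Hadoop cluster. For details about JindoFS SDK, see JindoFS SDK.

    • Download jindofs-sdk.jar and copy it to the Hadoop HDFS library directory.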

    cp ./jindofs-sdk-*.jar  ${HADOOP_HOME}/share/hadoop/hdfs/lib/
    • Create the JindoFS SDK configuration file on all nodes of the Hadoop cluster.

      • Add the following environment variable to the /etc/profile file.

      export B2SDK_CONF_DIR=/etc/jindofs-sdk-conf
      • Create the OSS storage tool configuration file /etc/jindofs-sdk-conf/bigboot.cfg.

      [bigboot]
      logger.dir=/tmp/bigboot-log
      [bigboot-client]
      client.oss.retry=5
      client.oss.upload.threads=4
      client.oss.upload.queue.size=5
      client.oss.upload.max.parallelism=16
      client.oss.timeout.millisecond=30000
      client.oss.connection.timeout.millisecond=4000
      • Load the environment variable so that it takes effect.

      source /etc/profile
      • Verify that OSS can be accessed from the Hadoop cluster.

      ${HADOOP_HOME}/bin/hadoop fs -ls oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/
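
The configuration steps above can be collected into one small script. The following is a minimal sketch, not an official installer: write_bigboot_cfg is a hypothetical helper name, and the sketch assumes that [bigboot-client] begins its own section on a separate line. The target directory is passed as an argument so that the helper can be tried in a scratch directory before it is run as root against /etc/jindofs-sdk-conf.

```shell
# Sketch: write the JindoFS SDK client configuration into a given directory.
# write_bigboot_cfg is a hypothetical helper name, not part of JindoFS SDK.
write_bigboot_cfg() {
  conf_dir=$1
  mkdir -p "$conf_dir" || return 1
  # The quoted heredoc delimiter writes the content literally, with no expansion.
  cat > "$conf_dir/bigboot.cfg" <<'EOF'
[bigboot]
logger.dir=/tmp/bigboot-log
[bigboot-client]
client.oss.retry=5
client.oss.upload.threads=4
client.oss.upload.queue.size=5
client.oss.upload.max.parallelism=16
client.oss.timeout.millisecond=30000
client.oss.connection.timeout.millisecond=4000
EOF
}

# Example (run as root on every node of the Hadoop cluster):
#   write_bigboot_cfg /etc/jindofs-sdk-conf
#   echo 'export B2SDK_CONF_DIR=/etc/jindofs-sdk-conf' >> /etc/profile
#   . /etc/profile
```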

Migrate OSS data to LindormDFS

  1. Check the size of the data to be migrated.

    ${HADOOP_HOME}/bin/hadoop fs -du -h oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data
  2. Start a Hadoop MapReduce job (DistCp) to migrate the test data to LindormDFS.

    ${HADOOP_HOME}/bin/hadoop distcp  \
    oss://<accessKeyId>:<accessKeySecret>@<bucket-name>.<endpoint>/test_data.txt \
    hdfs://${实例Id}/

    Replace ${实例Id} with the ID of your LindormDFS instance.

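    The parameters are described in the following table:

    Parameter                      Description
    accessKeyId, accessKeySecret   The AccessKey pair used to call the OSS API. To obtain one, see Create an AccessKey.
    bucket-name.endpoint           The OSS access domain name, which consists of the bucket name and the endpoint of the region where the bucket resides.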

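  3. After the job completes, check the migration result.

    If the output contains information similar to the following, the migration succeeded.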

    20/09/29 12:23:59 INFO mapreduce.Job:  map 100% reduce 0%
    20/09/29 12:23:59 INFO mapreduce.Job: Job job_1601195105349_0015 completed successfully
    20/09/29 12:23:59 INFO mapreduce.Job: Counters: 38
     File System Counters
      FILE: Number of bytes read=0
      FILE: Number of bytes written=122343
      FILE: Number of read operations=0
      FILE: Number of large read operations=0
      FILE: Number of write operations=0
      HDFS: Number of bytes read=470
      HDFS: Number of bytes written=47047709
      HDFS: Number of read operations=15
      HDFS: Number of large read operations=0
      HDFS: Number of write operations=4
      OSS: Number of bytes read=0
      OSS: Number of bytes written=0
      OSS: Number of read operations=0
      OSS: Number of large read operations=0
      OSS: Number of write operations=0
     Job Counters
      Launched map tasks=1
      Other local map tasks=1
      Total time spent by all maps in occupied slots (ms)=5194
      Total time spent by all reduces in occupied slots (ms)=0
      Total time spent by all map tasks (ms)=5194
      Total vcore-milliseconds taken by all map tasks=5194
      Total megabyte-milliseconds taken by all map tasks=5318656
     Map-Reduce Framework
      Map input records=1
      Map output records=0
      Input split bytes=132
      Spilled Records=0
      Failed Shuffles=0
      Merged Map outputs=0
      GC time elapsed (ms)=64
      CPU time spent (ms)=2210
      Physical memory (bytes) snapshot=222294016
      Virtual memory (bytes) snapshot=2672074752
      Total committed heap usage (bytes)=110100480
     File Input Format Counters
      Bytes Read=338
     File Output Format Counters
      Bytes Written=0
     org.apache.hadoop.tools.mapred.CopyMapper$Counter
      BYTESCOPIED=47047709
      BYTESEXPECTED=47047709
      COPY=1
    20/09/29 12:23:59 INFO common.AbstractJindoFileSystem: Read total statistics: oss read average -1 us, cache read average -1 us, read oss percent 0%
  4. Verify the migration result.

    Check the size of the test data that was migrated to LindormDFS.

    ${HADOOP_HOME}/bin/hadoop fs -du -s -h hdfs://${实例Id}/
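
The size check before and after migration can also be scripted. The following is a rough sketch under stated assumptions: the helper names (oss_uri, first_du_field, du_bytes, sizes_match) are hypothetical and not part of Hadoop or JindoFS SDK, and the sketch relies on `hadoop fs -du -s` printing the size in bytes as the first field when -h is omitted. Matching byte counts suggest a complete copy but do not prove that the content is identical.

```shell
# Hypothetical helpers; the names are illustrative only.

# Build an OSS URI in the form used in this topic.
# $1=accessKeyId  $2=accessKeySecret  $3=bucket-name  $4=endpoint  $5=path
oss_uri() {
  printf 'oss://%s:%s@%s.%s/%s\n' "$1" "$2" "$3" "$4" "${5#/}"
}

# Read `hadoop fs -du -s` output on stdin and print the first field (bytes).
first_du_field() {
  awk 'NR == 1 { print $1 }'
}

# Print the total size of a path in bytes; requires a working Hadoop client.
du_bytes() {
  "${HADOOP_HOME}/bin/hadoop" fs -du -s "$1" | first_du_field
}

# Compare two byte counts; fail if either is empty or if they differ.
sizes_match() {
  if [ -n "$1" ] && [ "$1" = "$2" ]; then
    echo "OK: sizes match ($1 bytes)"
  else
    echo "MISMATCH: source=$1 bytes, destination=$2 bytes"
    return 1
  fi
}

# Example (fill in your own values; the destination is the path DistCp wrote to):
#   src=$(du_bytes "$(oss_uri "$KEY_ID" "$KEY_SECRET" "$BUCKET" "$ENDPOINT" test_data.txt)")
#   dst=$(du_bytes "hdfs://$INSTANCE_ID/test_data.txt")
#   sizes_match "$src" "$dst"
```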