
[GitHub] gaodayue opened a new pull request #6036: use S3 as a backup storage for hdfs deep storage

URL: https://github.com/apache/incubator-druid/pull/6036
   This PR improves the overall availability of hdfs-deep-storage by pushing data to S3 when HDFS is temporarily unavailable.
   # Motivation
   In many organizations, Hadoop and HDFS are typically used for offline data analysis, while Druid targets online data serving. As a result, the SLA provided by HDFS often can't meet Druid's needs, and users of hdfs-deep-storage frequently encounter task failures caused by temporary unavailability of HDFS. Task failures can lead to data re-processing or even data loss, depending on whether kafka-indexing-service or tranquility is used for realtime ingestion.
   # Goal
   Make segment handover continue to work even if HDFS is not available.
   # Approach taken by this PR
   We leverage the S3AFileSystem provided by the HDFS client library to support using S3 as a backup storage for HDFS. When we can't push segments or task logs to HDFS, we switch to S3 instead. By using S3 as a backup for HDFS, the overall availability of hdfs-deep-storage is increased.
   For segments pushed to S3, the loadSpec is changed to `{"type":"hdfs", "path":"s3a://..."}`. Since file access goes through the FileSystem abstraction, there is no need to change HdfsDataSegmentPuller.
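   The push-with-fallback behavior described above can be sketched roughly as follows (a minimal illustration, not Druid's actual Java code; `push_to_hdfs` and `push_to_s3` are hypothetical stand-ins for writes through Hadoop's FileSystem abstraction):

```python
def push_segment(segment_file, push_to_hdfs, push_to_s3, hdfs_path, s3a_path):
    """Try pushing to HDFS first; on failure, fall back to the S3 backup.

    Returns the loadSpec that should be written to segment metadata.
    """
    try:
        push_to_hdfs(segment_file, hdfs_path)
        # Normal case: the loadSpec points at the HDFS path.
        return {"type": "hdfs", "path": hdfs_path}
    except IOError:
        # HDFS is unavailable: push to S3 instead. The loadSpec type stays
        # "hdfs" because S3AFileSystem lets HdfsDataSegmentPuller read
        # s3a:// paths through the same FileSystem abstraction.
        push_to_s3(segment_file, s3a_path)
        return {"type": "hdfs", "path": s3a_path}
```

   The key point is that only the path changes; the loadSpec type remains `hdfs`, so no new puller is needed.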
   The following new configuration knobs are added to hdfs-deep-storage and hdfs task logs; please refer to the doc changes for details:
   * druid.storage.useS3Backup
   * druid.storage.backupS3Bucket
   * druid.storage.backupS3BaseKey
   * druid.indexer.logs.useS3Backup
   * druid.indexer.logs.backupS3Bucket
   * druid.indexer.logs.backupS3BaseKey
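   For illustration, a configuration enabling the backup for both segments and task logs might look like this (bucket names, base keys, and the pre-existing `directory`/`storageDirectory` values are hypothetical examples; only the `useS3Backup`/`backupS3Bucket`/`backupS3BaseKey` properties come from this PR):

```properties
# Deep storage: HDFS primary, S3 backup
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode:8020/druid/segments
druid.storage.useS3Backup=true
druid.storage.backupS3Bucket=my-druid-backup
druid.storage.backupS3BaseKey=druid/segments

# Task logs: HDFS primary, S3 backup
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=hdfs://namenode:8020/druid/indexing-logs
druid.indexer.logs.useS3Backup=true
druid.indexer.logs.backupS3Bucket=my-druid-backup
druid.indexer.logs.backupS3BaseKey=druid/indexing-logs
```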
   Besides what's included in this PR, I've also implemented a tool called `restore-hdfs-segment` to migrate segments temporarily pushed to S3 back to HDFS. This can free up space in S3 and ensure that all segments eventually reside on HDFS. If you like the idea, I can send another PR for the tool later.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.