Question:

How to use Data Pipeline to export a DynamoDB table that has on-demand capacity

濮升
2023-03-14

I used to export DynamoDB tables to files using the Data Pipeline template named "Export DynamoDB table to S3". I recently switched all of my DynamoDB tables to on-demand capacity, and the template no longer works. I am fairly sure this is because the old template specifies a percentage of DynamoDB throughput to consume, which is meaningless for on-demand tables.

I tried exporting the old template to JSON, removing the references to throughput percentage consumption, and creating a new pipeline from it. However, this was unsuccessful.

Can anyone suggest how to convert an old-style pipeline script with provisioned throughput into one that works with on-demand tables?

Here is my original, working script:

{
  "objects": [
    {
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.13.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB read throughput ratio",
      "id": "myDDBReadThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-east-1",
    "myDDBTableName": "LIVE_Invoices",
    "myDDBReadThroughputRatio": "0.25",
    "myOutputS3Loc": "s3://company-live-extracts/"
  }
}

Here is my failed update attempt:

{
  "objects": [
    {
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.13.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-east-1",
    "myDDBTableName": "LIVE_Invoices",
    "myOutputS3Loc": "s3://company-live-extracts/"
  }
}

And here is the error from the Data Pipeline execution:

at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:198)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java

2 Answers

裴星洲
2023-03-14

Earlier this year, the DDB export tool added support for on-demand tables: GitHub commit

I was able to put an updated build of the tool on S3 and update a few things in the pipeline to get it working:

{
  "objects": [
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://<your-tools-bucket>/emr-dynamodb-tools-4.11.0-SNAPSHOT.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://<your-log-bucket>/",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "readThroughputPercent": "#{myDDBReadThroughputRatio}",
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.26.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster",
      "terminateAfter": "1 Hour"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB read throughput ratio",
      "id": "myDDBReadThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-west-2",
    "myDDBTableName": "<your table name>",
    "myDDBReadThroughputRatio": "0.5",
    "myOutputS3Loc": "s3://<your-output-bucket>/"
  }
}

Key changes:

  • Updated the release label of EmrClusterForBackup to "emr-5.26.0". This is required to get version 1.11 of the AWS SDK for Java and version 4.11.0 of the DynamoDB connector (see the release matrix here: AWS documentation).
  • Updated the step of TableBackupActivity as shown above: pointed it at the new build of the *.jar and changed the tool's class name from DynamoDbExport to DynamoDBExport (a build-and-upload sketch follows after this list).
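
For reference, a minimal sketch of building and staging the updated tool (assuming the awslabs/emr-dynamodb-connector repository layout; the bucket name is a placeholder):

  # Build the DynamoDB connector tools (requires a JDK and Maven).
  git clone https://github.com/awslabs/emr-dynamodb-connector.git
  cd emr-dynamodb-connector
  mvn clean package -DskipTests

  # Stage the tools jar where the pipeline step can fetch it.
  aws s3 cp emr-dynamodb-tools/target/emr-dynamodb-tools-4.11.0-SNAPSHOT.jar s3://<your-tools-bucket>/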

Hopefully the default template gets updated as well, so that it works out of the box.

邢飞昂
2023-03-14

I opened a support ticket with AWS about this. Their response was quite comprehensive. I will paste it below:

Thank you for reaching out to us about this issue.

Unfortunately, Data Pipeline export/import jobs for DynamoDB do not support DynamoDB's new on-demand mode [1].

Tables that use on-demand capacity do not have defined capacities for read and write units. Data Pipeline relies on this defined capacity when calculating the pipeline's throughput.

For example, if you have 100 RCUs (read capacity units) and a pipeline throughput ratio of 0.25 (25%), the effective pipeline throughput would be 25 read units per second (100 * 0.25). With on-demand capacity, however, the RCUs and WCUs (write capacity units) are reported as 0, so whatever the pipeline throughput ratio, the calculated effective throughput is 0.

The pipeline will not execute when the effective throughput is less than 1.

Do you need to export the DynamoDB tables to S3?

If you are exporting these tables purely for backup purposes, I recommend using DynamoDB's On-Demand Backup and Restore feature (a confusingly similar name to on-demand capacity) [2].

Note that on-demand backups do not affect table throughput and complete in seconds. You only pay for the S3 storage costs associated with the backups. However, these table backups are not directly accessible to customers and can only be restored to the source table. This backup method is not suitable if you want to run analytics on the backup data, or import the data into other systems, accounts, or tables.
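
For example, a minimal sketch of taking such a backup from the CLI (this sketch is not part of the AWS response; the backup name is illustrative):

  # Take an on-demand backup (no impact on table throughput).
  aws dynamodb create-backup --table-name LIVE_Invoices --backup-name LIVE_Invoices-backup-1

  # Confirm the backup is available.
  aws dynamodb list-backups --table-name LIVE_Invoices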

If you need to export DynamoDB data using Data Pipeline, the only way is to set the table to provisioned capacity mode.

You can do this manually, or include it as an activity in the pipeline itself using AWS CLI commands [3].

For example (on-demand is also known as pay-per-request mode):

$ aws dynamodb update-table --table-name myTable --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100

and, to switch back to on-demand afterwards:

$ aws dynamodb update-table --table-name myTable --billing-mode PAY_PER_REQUEST

Please note that after disabling on-demand capacity mode, you will need to wait 24 hours before you can enable it again.
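
A minimal wrapper sketch of that sequence (this sketch is not part of the AWS response; the 100/100 capacity values, polling loop, and pipeline ID are illustrative):

  #!/bin/bash
  set -euo pipefail
  TABLE=LIVE_Invoices

  # 1. Switch the table to provisioned capacity so the pipeline can compute throughput.
  aws dynamodb update-table --table-name "$TABLE" \
      --billing-mode PROVISIONED \
      --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100

  # 2. Wait until the table is ACTIVE again before running the export.
  until [ "$(aws dynamodb describe-table --table-name "$TABLE" --query 'Table.TableStatus' --output text)" = "ACTIVE" ]; do
    sleep 10
  done

  # 3. Run the export, e.g. by activating the pipeline:
  # aws datapipeline activate-pipeline --pipeline-id <your-pipeline-id>

  # 4. Switch back to on-demand once the export has finished
  #    (subject to the once-per-24-hours limit noted above).
  aws dynamodb update-table --table-name "$TABLE" --billing-mode PAY_PER_REQUEST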

=== Reference links ===

[1] DynamoDB on-demand capacity (see also the note on unsupported services/tools): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html#HowItWorks.OnDemand

[2] DynamoDB On-Demand Backup and Restore: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BackupRestore.html

[3] AWS CLI reference for DynamoDB update-table: https://docs.aws.amazon.com/cli/latest/reference/dynamodb/update-table.html
