Using the AWS CLI to get started quickly with Amazon S3, EMR, ES, and other services

公羊俊

2023-12-01

Installing the AWS CLI

Requirements: Python 2 version 2.7+ or Python 3 version 3.4+

Command to install the AWS CLI

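pip3 install -U --user awscli aws_role_credentials oktaauth
# -U (--upgrade) upgrades the listed packages to their latest versions
# --user installs into the user directory, for example ~/.local
# If you are in mainland China and the network is slow, add -i https://pypi.tuna.tsinghua.edu.cn/simple before the package names to use the Tsinghua mirror

# Verify that the installation succeeded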
aws --version
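
If the shell reports that aws cannot be found after a --user install, the scripts directory is probably not on PATH. A minimal sketch, assuming pip placed the executables in ~/.local/bin (the usual location on Linux; other systems may use a different directory):

```shell
# Assumption: pip --user put the executables in ~/.local/bin.
export PATH="$HOME/.local/bin:$PATH"

# Print where aws was found, or say that it is still missing.
command -v aws || echo "aws is still not on PATH"
```

To make the change permanent, the export line can go into the shell's startup file.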

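The official documentation sets up credentials with aws configure, but you can also obtain CLI credentials through Okta (skip this step if you do not need it).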

# Replace the username, server, appid, and profile values with your own
oktaauth \
--username xxx@email.com \
--server yourcompany.okta.com \
--apptype amazon_aws \
--appid exxxxaaaWewefw | \
aws_role_credentials saml --profile profile_tw

See the installation guide on the official website.

Creating an S3 (Simple Storage Service) bucket

Creating S3 storage with the AWS CLI

$ aws s3api create-bucket \
     --bucket my-second-emr-bucket \
     --region us-east-2 \
     --create-bucket-configuration LocationConstraint=us-east-2

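Before using the S3 commands, it helps to get familiar with the built-in help (aws s3 help and aws s3api help).

References:

In short, there are two command sets. s3api is lower level and gives developers a richer, more fine-grained set of operations; s3 provides higher-level, easier-to-use wrappers. Pick whichever fits the task.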

Create a bucket with s3api. For the other parameters, see the command reference.

$ aws s3api create-bucket \
     --bucket your-bucket-name \
     --region ap-southeast-1 \
     --create-bucket-configuration LocationConstraint=ap-southeast-1

# --bucket your-bucket-name: name of the bucket to create (replace with your own)
# --region ap-southeast-1: region in which the bucket is created
# --create-bucket-configuration: additional bucket settings, given as key=value pairs
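
One detail worth knowing: us-east-1 is the default location, and create-bucket rejects an explicit LocationConstraint=us-east-1. The helper below is a sketch (the function name create_bucket is made up for illustration) that adds the option only for other regions:

```shell
# Hypothetical helper: create_bucket <bucket-name> <region>
create_bucket() {
  bucket="$1"
  region="$2"
  if [ "$region" = "us-east-1" ]; then
    # us-east-1 is the default location, so no LocationConstraint is passed.
    aws s3api create-bucket --bucket "$bucket" --region "$region"
  else
    aws s3api create-bucket --bucket "$bucket" --region "$region" \
      --create-bucket-configuration "LocationConstraint=$region"
  fi
}

# Usage: create_bucket your-bucket-name ap-southeast-1
```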

If you authenticated through oktaauth, add --profile your-profile-name to every command you run.
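
To avoid repeating the option, the AWS CLI also reads the profile name from the AWS_PROFILE environment variable. A small sketch, assuming the profile is called profile_tw as in the Okta command earlier:

```shell
# Assumption: profile_tw is the profile written by aws_role_credentials.
export AWS_PROFILE=profile_tw

# Commands run in this shell session now use that profile, for example:
#   aws s3 ls
```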

Use aws s3 sync to upload files to S3 (for example, to sync the Spark JAR files you want to run).
```
aws s3 sync s3-or-local-source-file-path/ s3://your-bucket-name/destination
```
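
Before syncing for real, it can help to preview what would be transferred. The sketch below wraps aws s3 sync in a hypothetical helper, sync_jars, that restricts the upload to .jar files and forwards any extra options such as --dryrun; the directory and bucket names in the usage lines are placeholders:

```shell
# Hypothetical helper: sync_jars <source> <destination> [extra options]
# --exclude "*" followed by --include "*.jar" limits the sync to JAR files.
sync_jars() {
  src="$1"
  dest="$2"
  shift 2
  aws s3 sync "$src" "$dest" --exclude "*" --include "*.jar" "$@"
}

# Preview only (nothing is uploaded):
#   sync_jars ./build/libs s3://your-bucket-name/data-project/libs --dryrun
# Real upload:
#   sync_jars ./build/libs s3://your-bucket-name/data-project/libs
```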

Creating an EMR cluster

How to create a Spark EMR cluster with the AWS CLI

Create the roles that the EMR cluster needs. If the roles already exist, the command returns [].
```
aws emr create-default-roles
```

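If you want to place the cluster in an existing EC2 subnet, add the subnet option when creating it (for example --ec2-attributes SubnetId=your-subnet-id). The subnets in an Availability Zone can be listed with:

aws ec2 describe-subnets \
     --filters "Name=availabilityZone,Values=ap-southeast-1a"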

Create the cluster and submit the Spark application

aws emr create-cluster \
  --name your-cluster-name \
  --release-label emr-5.29.0 \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles \
  --applications Name=Spark \
  --log-uri s3://your-bucket-name/logs \
  --steps '[{"Name":"your-project-name","Type":"Spark","Args":["--deploy-mode","cluster","--class","top.ilovestudy.data.GdeltProcessor","--conf","spark.es.nodes.discovery=false","--conf", "spark.es.nodes=https://search-your-data-project-amazon-es-endpoint.ap-southeast-1.es.amazonaws.com","--conf", "spark.es.port=443","s3://your-bucket-name/data-project/libs/processer-0.0.1-all.jar","s3://your-bucket-name/data/update/2020-02-21T00:00:00+00:00/"],"ActionOnFailure":"CONTINUE"}]' \
  --auto-terminate \
  --region ap-southeast-1

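For the full set of options, see the create-cluster reference. A note on the --steps parameter: it describes the series of actions to run once the EMR cluster has been created. Its value is an array of key-value structures.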

[
    {
        "Name": "Micro-Project (replace this)",
        "Type": "CUSTOM_JAR"|"STREAMING"|"HIVE"|"PIG"|"IMPALA"|"Spark",
        "Args": [
            "--deploy-mode", "cluster",
            "--master", "yarn",
            "--class", "top.ilovestudy.data.GdeltProcessor",
            "--conf", "spark.es.index.auto.create=true"
        ],
        "Jar": "s3://jinghui-s3/data-project/libs/processer-0.0.1-all.jar",
        "ActionOnFailure": "TERMINATE_CLUSTER"|"CANCEL_AND_WAIT"|"CONTINUE"
    }
]
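
Escaping a long JSON string on the command line is error-prone. The AWS CLI can read a parameter value from a file with the file:// prefix, so the steps can live in their own file. A sketch, reusing the placeholder bucket and class names from the create-cluster command; the file name steps.json is an arbitrary choice:

```shell
# Write the step definition to steps.json. Each spark-submit token is its own
# array element, as in the inline --steps value of the create-cluster command.
cat > steps.json <<'EOF'
[
  {
    "Name": "your-project-name",
    "Type": "Spark",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "--deploy-mode", "cluster",
      "--class", "top.ilovestudy.data.GdeltProcessor",
      "s3://your-bucket-name/data-project/libs/processer-0.0.1-all.jar"
    ]
  }
]
EOF

# Then pass the file instead of the inline JSON, keeping the other options:
#   aws emr create-cluster --name your-cluster-name --steps file://steps.json
```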

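I could not find a detailed explanation of the step parameters in the official documentation. Experiments show that when Type is CUSTOM_JAR, the Args values are assembled into a command like this: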

hadoop jar your-jar your-args
# For example
hadoop jar /mnt/var/lib/hadoop/steps/s-3NK85TMMYQGT/processer-0.0.1-all.jar /home/hadoop/spark/bin/spark-submit

When Type is Spark, the command that actually runs has the following structure (back to the familiar spark-submit script):

spark-submit your-args
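
The two cases can be summarized in a small helper that prints the command line a step roughly turns into. This only mirrors the behavior observed in the experiment above, not documented EMR internals, and the function name is invented for illustration:

```shell
# Hypothetical helper: step_command <type> <jar> [args...]
# For Spark steps the jar argument is ignored; pass "" for it.
step_command() {
  step_type="$1"
  jar="$2"
  shift 2
  case "$step_type" in
    CUSTOM_JAR) printf '%s\n' "hadoop jar $jar $*" ;;
    Spark)      printf '%s\n' "spark-submit $*" ;;
    *)          echo "unknown step type: $step_type" >&2; return 1 ;;
  esac
}

step_command CUSTOM_JAR your-jar your-args
step_command Spark "" --deploy-mode cluster --class top.ilovestudy.data.GdeltProcessor app.jar
```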

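Every operation can be performed through the web console, the AWS CLI, or the SDKs. It is worth doing each one by hand in the console once first, which makes the meaning of each AWS CLI parameter much clearer.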

Using Amazon Elasticsearch Service

Create an Amazon ES domain.

aws es create-elasticsearch-domain \
  --domain-name your-domain-name \
  --elasticsearch-version 7.1 \
  --elasticsearch-cluster-config InstanceType=t2.small.elasticsearch,InstanceCount=1 \
  --ebs-options EBSEnabled=true,VolumeType=standard,VolumeSize=10 \
  --access-policies '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":"*"},"Action":["es:*"],"Condition":{"IpAddress":{"aws:SourceIp":["your_ip_address"]}}}]}'

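Remember to replace the domain-name value and the address in aws:SourceIp. The address must be the public IP you use to reach the internet, not your local 192.x.x.x address; you can find it by searching for "what is my IP address" in a search engine.

Running curl https://checkip.amazonaws.com also returns your public IP directly.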
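
The access policy can also be generated from the detected IP instead of being edited by hand. A sketch, where access_policy is a hypothetical helper name and the policy text is the one used in the create-elasticsearch-domain command:

```shell
# Hypothetical helper: access_policy <public-ip>
# Prints the access policy JSON with the given IP as the only allowed source.
access_policy() {
  printf '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":"*"},"Action":["es:*"],"Condition":{"IpAddress":{"aws:SourceIp":["%s"]}}}]}\n' "$1"
}

# Usage (needs network access):
#   MY_IP="$(curl -s https://checkip.amazonaws.com)"
#   aws es create-elasticsearch-domain --domain-name your-domain-name \
#     --access-policies "$(access_policy "$MY_IP")"
```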

Initialization is slow (about 10 minutes). Check the status of the ES domain you just created:

aws es describe-elasticsearch-domain --domain-name your-domain-name
# Or list all ES domains in a given region
aws es list-domain-names --region ap-southeast-1
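
Since creation takes a while, a polling loop saves re-running the command by hand. A sketch, assuming the DomainStatus.Processing field in the describe output turns false once the domain is ready and that the domain has a public endpoint in DomainStatus.Endpoint; the helper name and the 30-second default interval are arbitrary:

```shell
# Hypothetical helper: wait_for_domain <domain-name> [poll-seconds]
wait_for_domain() {
  domain="$1"
  interval="${2:-30}"
  while :; do
    processing="$(aws es describe-elasticsearch-domain \
      --domain-name "$domain" \
      --query 'DomainStatus.Processing' --output text)" || return 1
    case "$processing" in
      False|false) break ;;
    esac
    echo "Domain $domain is still processing, checking again in ${interval}s" >&2
    sleep "$interval"
  done
  # Print the endpoint to use in the curl commands.
  aws es describe-elasticsearch-domain --domain-name "$domain" \
    --query 'DomainStatus.Endpoint' --output text
}

# Usage: wait_for_domain your-domain-name
```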

Uploading a document to the Amazon ES domain

curl -XPUT elasticsearch_domain_endpoint/movies/_doc/1 -d '{"director": "Burton, Tim", "genre": ["Comedy","Sci-Fi"], "year": 1996, "actor": ["Jack Nicholson","Pierce Brosnan","Sarah Jessica Parker"], "title": "Mars Attacks!"}' -H 'Content-Type: application/json'

You can also upload documents in bulk, for example from a file named bulk_movies.json

curl -XPOST elasticsearch_domain_endpoint/_bulk --data-binary @bulk_movies.json -H 'Content-Type: application/json'
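
For reference, the bulk file is newline-delimited JSON: each document takes an action line followed by the document itself, and the file has to end with a newline. A sketch that writes a two-document bulk_movies.json; the movie entries are sample data chosen for illustration:

```shell
# Each document takes two lines: an action line, then the document source.
# The heredoc leaves the required trailing newline at the end of the file.
cat > bulk_movies.json <<'EOF'
{ "index": { "_index": "movies", "_id": "2" } }
{ "director": "Frankenheimer, John", "genre": ["Drama", "Mystery", "Thriller"], "year": 1962, "title": "The Manchurian Candidate" }
{ "index": { "_index": "movies", "_id": "3" } }
{ "director": "Baird, Stuart", "genre": ["Action", "Crime", "Thriller"], "year": 1998, "title": "U.S. Marshals" }
EOF
```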

Searching documents in the Amazon ES domain
```
curl -XGET 'elasticsearch_domain_endpoint/movies/_search?q=mars'
```
Amazon ES comes with a Kibana plugin that you can open from the web UI.

Deleting the Amazon ES domain
```
aws es delete-elasticsearch-domain --domain-name your-domain-name
```
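The service is billed by time, so be sure to delete the domain when you are done. Do not forget! If you need to restore the ES cluster later, you can use the snapshot feature.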

Some EC2 command-line operations

The goal is to deploy Airflow on an EC2 machine.

Create a new key pair for connecting to EC2

# Create a new key pair and save the private key
aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem
# Show the key pair that was created
aws ec2 describe-key-pairs --key-names MyKeyPair
# Delete the key pair
aws ec2 delete-key-pair --key-name MyKeyPair

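EC2 security groups

Create a new security group to control inbound and outbound traffic for EC2. The following example shows how to create a security group for a specified VPC.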

aws ec2 create-security-group --group-name my-sg --description "My security group" --vpc-id vpc-1a2b3c4d

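Likewise, you can inspect the new group with describe-security-groups. Refer to it by its group ID (sg-...) rather than by its name; the group ID is returned when the security group is created.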

aws ec2 describe-security-groups --group-ids sg-903004f8
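
A new security group has no inbound rules, so SSH access has to be opened explicitly before connecting. A sketch using authorize-security-group-ingress; allow_ssh is a hypothetical helper name, and limiting the rule to a single /32 address is a choice made here, not something the walkthrough prescribes:

```shell
# Hypothetical helper: allow_ssh <group-id> <public-ip>
# Opens TCP port 22 for exactly one source address.
allow_ssh() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --protocol tcp \
    --port 22 \
    --cidr "$2/32"
}

# Usage (needs network access for the IP lookup):
#   allow_ssh sg-903004f8 "$(curl -s https://checkip.amazonaws.com)"
```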

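Launching an instance

Choose an operating system template from an AMI (Amazon Machine Image), and specify the key pair and security group created earlier. One more thing to note: run-instances has no VPC option, because the VPC is determined by the subnet. If you pass --subnet-id, the instance launches in that subnet's VPC; if you omit it, the instance launches in a default subnet of your default VPC, so the subnet is only mandatory when no default VPC exists.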
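
A sketch of the launch step with run-instances. The helper name, the t2.micro instance type, and the IDs in the usage line are placeholders rather than values from this walkthrough; the subnet option is added only when a subnet ID is given:

```shell
# Hypothetical helper:
#   launch_instance <ami-id> <key-name> <security-group-id> [subnet-id]
launch_instance() {
  ami="$1"
  key="$2"
  sg="$3"
  subnet="${4:-}"
  # Build the argument list step by step so the subnet stays optional.
  set -- --image-id "$ami" --count 1 --instance-type t2.micro \
         --key-name "$key" --security-group-ids "$sg"
  if [ -n "$subnet" ]; then
    set -- "$@" --subnet-id "$subnet"
  fi
  aws ec2 run-instances "$@"
}

# Usage: launch_instance ami-xxxxxxxx MyKeyPair sg-903004f8 subnet-xxxxxxxx
```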
