We want to deploy a trained TensorFlow model to AWS SageMaker for inference with the tensorflow-serving-container. The TensorFlow version is 2.1. Following the guide at https://github.com/aws/sagemaker-tensorflow-serving-container, the following steps were taken:
import os
import sagemaker
from sagemaker.tensorflow.serving import Model
from sagemaker.tensorflow.model import TensorFlowModel
from sagemaker.predictor import json_deserializer, json_serializer, RealTimePredictor
from sagemaker.content_types import CONTENT_TYPE_JSON
def create_tfs_sagemaker_model():
    sagemaker_session = sagemaker.Session()
    role = 'arn:aws:iam::XXXXXXXXX:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXX'
    bucket = 'tf-serving'
    prefix = 'sagemaker/tfs-test'
    s3_path = 's3://{}/{}'.format(bucket, prefix)
    image = 'XXXXXXXX.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-tensorflow-serving:2.1.0-cpu'
    model_data = sagemaker_session.upload_data('model.tar.gz', bucket, os.path.join(prefix, 'model'))
    endpoint_name = 'tf-serving-ep-test-1'
    tensorflow_serving_model = Model(model_data=model_data, role=role, sagemaker_session=sagemaker_session, image=image, framework_version='2.1')
    tensorflow_serving_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)
    rt_predictor = RealTimePredictor(endpoint=endpoint_name, sagemaker_session=sagemaker_session, serializer=json_serializer, content_type=CONTENT_TYPE_JSON, accept=CONTENT_TYPE_JSON)
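The model.tar.gz uploaded above is assumed to contain the SavedModel under a numeric version directory, since the container logs further down show TF Serving reading from /opt/ml/model/1. A minimal packaging sketch under that assumption (the my_model.h5 file and the export directory are hypothetical names, not taken from the setup above):

import tarfile
import tensorflow as tf

# Hypothetical export step: TF Serving expects a numeric version directory,
# and the logs below confirm the container reads from /opt/ml/model/1.
model = tf.keras.models.load_model('my_model.h5')  # placeholder for the trained model
tf.saved_model.save(model, 'export/1')

# Package the version directory at the archive root, so extracting
# model.tar.gz into /opt/ml/model yields /opt/ml/model/1/saved_model.pb.
with tarfile.open('model.tar.gz', 'w:gz') as tar:
    tar.add('export/1', arcname='1')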
def create_tfs_sagemaker_batch_transform():
    sagemaker_session = sagemaker.Session()
    print(sagemaker_session.boto_region_name)
    role = 'arn:aws:iam::XXXXXXXXXXX:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXXX'
    bucket = 'XXXXXXX-tf-serving'
    prefix = 'sagemaker/tfs-test'
    image = 'XXXXXXXXXX.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-tensorflow-serving:2.1.0-cpu'
    s3_path = 's3://{}/{}'.format(bucket, prefix)
    model_data = sagemaker_session.upload_data('model.tar.gz', bucket, os.path.join(prefix, 'model'))
    tensorflow_serving_model = Model(model_data=model_data, role=role, sagemaker_session=sagemaker_session, image=image, name='deep-net-0', framework_version='2.1')
    print(tensorflow_serving_model.model_data)
    out_path = 's3://XXXXXX-serving-out/'
    input_path = 's3://XXXXXX-serving-in/'
    tensorflow_serving_transformer = tensorflow_serving_model.transformer(instance_count=1, instance_type='ml.c4.xlarge', accept='application/json', output_path=out_path)
    tensorflow_serving_transformer.transform(input_path, content_type='application/json')
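For the transform to produce output, the objects in the input bucket have to be requests the serving container can forward to TF Serving. A minimal sketch of uploading one request in the REST "instances" format (the key name and the instance shape are placeholders, not taken from the setup above):

import json
import boto3

# One JSON request per S3 object, in TF Serving's REST "instances" format.
payload = {"instances": [[0.1, 0.2, 0.3, 0.4]]}  # placeholder input shape

s3 = boto3.client('s3')
s3.put_object(Bucket='XXXXXX-serving-in',  # matches input_path above
              Key='request-0001.json',     # hypothetical key
              Body=json.dumps(payload))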
Both steps (the endpoint deployment and the batch transform) run, and in the AWS CloudWatch logs we see the instance start successfully, the model load, and TF-Serving enter the event loop:
INFO:main:starting services
2020-07-08T17:07:16.156+02:00 INFO:main:nginx config:
2020-07-08T17:07:16.156+02:00 load_module modules/ngx_http_js_module.so;
2020-07-08T17:07:16.156+02:00 worker_processes auto;
2020-07-08T17:07:16.156+02:00 daemon off;
2020-07-08T17:07:16.156+02:00 pid /tmp/nginx.pid;
2020-07-08T17:07:16.157+02:00 error_log /dev/stderr error;
2020-07-08T17:07:16.157+02:00 worker_rlimit_nofile 4096;
2020-07-08T17:07:16.157+02:00 events { worker_connections 2048;
2020-07-08T17:07:16.157+02:00 }
2020-07-08T17:07:16.162+02:00 http { include /etc/nginx/mime.types; default_type application/json; access_log /dev/stdout combined; js_include tensorflow-serving.js; upstream tfs_upstream { server localhost:10001; } upstream gunicorn_upstream { server unix:/tmp/gunicorn.sock fail_timeout=1; } server { listen 8080 deferred; client_max_body_size 0; client_body_buffer_size 100m; subrequest_output_buffer_size 100m; set $tfs_version 2.1; set $default_tfs_model None; location /tfs { rewrite ^/tfs/(.*) /$1 break; proxy_redirect off; proxy_pass_request_headers off; proxy_set_header Content-Type 'application/json'; proxy_set_header Accept 'application/json'; proxy_pass http://tfs_upstream; } location /ping { js_content ping; } location /invocations { js_content invocations; } location /models { proxy_pass http://gunicorn_upstream/models; } location / { return 404 '{"error": "Not Found"}'; } keepalive_timeout 3; }
2020-07-08T17:07:16.162+02:00 }
2020-07-08T17:07:16.162+02:00 INFO:tfs_utils:using default model name: model
2020-07-08T17:07:16.162+02:00 INFO:tfs_utils:tensorflow serving model config:
2020-07-08T17:07:16.162+02:00 model_config_list: { config: { name: "model", base_path: "/opt/ml/model", model_platform: "tensorflow" }
2020-07-08T17:07:16.162+02:00 }
2020-07-08T17:07:16.162+02:00 INFO:main:using default model name: model
2020-07-08T17:07:16.162+02:00 INFO:main:tensorflow serving model config:
2020-07-08T17:07:16.163+02:00 model_config_list: { config: { name: "model", base_path: "/opt/ml/model", model_platform: "tensorflow" }
2020-07-08T17:07:16.163+02:00 }
2020-07-08T17:07:16.163+02:00 INFO:main:tensorflow version info:
TensorFlow ModelServer: 2.1.0-rc1+dev.sha.075ffcf
2020-07-08T17:07:16.163+02:00 TensorFlow Library: 2.1.0
2020-07-08T17:07:16.163+02:00 INFO:main:tensorflow serving command: tensorflow_model_server --port=10000 --rest_api_port=10001 --model_config_file=/sagemaker/model-config.cfg --max_num_load_retries=0
2020-07-08T17:07:16.163+02:00 INFO:main:started tensorflow serving (pid: 13)
2020-07-08T17:07:16.163+02:00 INFO:main:nginx version info:
2020-07-08T17:07:16.163+02:00 nginx version: nginx/1.18.0
2020-07-08T17:07:16.163+02:00 built by gcc 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
2020-07-08T17:07:16.163+02:00 built with OpenSSL 1.1.1  11 Sep 2018
2020-07-08T17:07:16.163+02:00 TLS SNI support enabled
2020-07-08T17:07:16.163+02:00 configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fdebug-prefix-map=/data/builder/debuild/nginx-1.18.0/debian/debuild-base/nginx-1.18.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'
2020-07-08T17:07:16.163+02:00 INFO:main:started nginx (pid: 15)
2020-07-08T17:07:16.163+02:00 2020-07-08 15:07:15.075708: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2020-07-08T17:07:16.163+02:00 2020-07-08 15:07:15.075760: I tensorflow_serving/model_servers/server_core.cc:573] (Re-)adding model: model
2020-07-08T17:07:16.163+02:00 2020-07-08 15:07:15.180755: I tensorflow_serving/util/retrier.cc:46] Retrying of Reserving resources for servable: {name: model version: 1} exhausted max_num_retries: 0
2020-07-08T17:07:16.163+02:00 2020-07-08 15:07:15.180887: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: model version: 1}
2020-07-08T17:07:16.163+02:00 2020-07-08 15:07:15.180919: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: model version: 1}
2020-07-08T17:07:16.163+02:00 2020-07-08 15:07:15.180944: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: model version: 1}
2020-07-08T17:07:16.163+02:00 2020-07-08 15:07:15.180995: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /opt/ml/model/1
2020-07-08T17:07:16.163+02:00 2020-07-08 15:07:15.205712: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-07-08T17:07:16.164+02:00 2020-07-08 15:07:15.205825: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:264] Reading SavedModel debug info (if present) from: /opt/ml/model/1
2020-07-08T17:07:16.164+02:00 2020-07-08 15:07:15.208599: I external/org_tensorflow/tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2020-07-08T17:07:16.164+02:00 2020-07-08 15:07:15.328057: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:203] Restoring SavedModel bundle.
2020-07-08T17:07:17.165+02:00 2020-07-08 15:07:16.578796: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:152] Running initialization op on SavedModel bundle at path: /opt/ml/model/1
2020-07-08T17:07:17.165+02:00 2020-07-08 15:07:16.626494: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:333] SavedModel load for tags { serve }; Status: success: OK. Took 1445495 microseconds.
2020-07-08T17:07:17.165+02:00 2020-07-08 15:07:16.630443: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:105] No warmup data file found at /opt/ml/model/1/assets.extra/tf_serving_warmup_requests
2020-07-08T17:07:17.165+02:00 2020-07-08 15:07:16.632461: I tensorflow_serving/util/retrier.cc:46] Retrying of Loading servable: {name: model version: 1} exhausted max_num_retries: 0
2020-07-08T17:07:17.165+02:00 2020-07-08 15:07:16.632484: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: model version: 1}
2020-07-08T17:07:17.165+02:00 2020-07-08 15:07:16.634727: I tensorflow_serving/model_servers/server.cc:362] Running gRPC ModelServer at 0.0.0.0:10000 ...
2020-07-08T17:07:17.165+02:00 [warn] getaddrinfo: address family for nodename not supported
2020-07-08T17:07:17.165+02:00 2020-07-08 15:07:16.635747: I tensorflow_serving/model_servers/server.cc:382] Exporting HTTP/REST API at:localhost:10001 ...
2020-07-08T17:07:17.165+02:00 [evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
However, both the endpoint and the batch transform fail the SageMaker ping health check:
2020-07-08T17:07:32.169+02:00 2020/07/08 15:07:31 [error] 16#16: *1 js: failed ping{ "error": "Could not find any versions of model None" }
2020-07-08T17:07:32.170+02:00 169.254.255.130 - - [08/Jul/2020:15:07:31 +0000] "GET /ping HTTP/1.1" 502 157 "-" "Go-http-client/1.1"
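For context, the /ping location in the nginx config above is handled by the container's JS shim, which asks TF Serving's model-status API about the default model; with that name resolving to the literal string "None", the failing check is roughly equivalent to this hypothetical reproduction (run inside the container):

import requests

# TF Serving's REST model-status endpoint; with the default model name
# mis-resolved to the string "None", this returns the error seen above.
resp = requests.get('http://localhost:10001/v1/models/None')
print(resp.json())  # {'error': 'Could not find any versions of model None'}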
Furthermore, when testing locally with the self-built tensorflow-serving Docker container, the model runs without problems and can be queried with curl. What could be the issue?
The solution to the problem is the following:
The environment variable SAGEMAKER_TFS_DEFAULT_MODEL_NAME needs to be set to the correct model name, e.g. "model":
import os
import sagemaker
from sagemaker.tensorflow.serving import Model
from sagemaker.tensorflow.model import TensorFlowModel
from sagemaker.predictor import json_deserializer, json_serializer, RealTimePredictor
from sagemaker.content_types import CONTENT_TYPE_JSON
def create_tfs_sagemaker_model():
    sagemaker_session = sagemaker.Session()
    role = 'arn:aws:iam::XXXXXXXXX:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXX'
    bucket = 'tf-serving'
    prefix = 'sagemaker/tfs-test'
    s3_path = 's3://{}/{}'.format(bucket, prefix)
    image = 'XXXXXXXX.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-tensorflow-serving:2.1.0-cpu'
    model_data = sagemaker_session.upload_data('model.tar.gz', bucket, os.path.join(prefix, 'model'))
    endpoint_name = 'tf-serving-ep-test-1'
    env = {"SAGEMAKER_TFS_DEFAULT_MODEL_NAME": "model"}
    tensorflow_serving_model = Model(model_data=model_data, role=role, sagemaker_session=sagemaker_session, image=image, name='model', framework_version='2.1', env=env)
    tensorflow_serving_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)
    rt_predictor = RealTimePredictor(endpoint=endpoint_name, sagemaker_session=sagemaker_session, serializer=json_serializer, content_type=CONTENT_TYPE_JSON, accept=CONTENT_TYPE_JSON)
这将正确创建endpoint并通过ping健康检查:
2020-07-16T12:08:20.654+02:00 10.32.0.2 - - [16/Jul/2020:10:08:20 +0000] "GET /ping HTTP/1.1" 200 0 "-" "AHC/2.0"
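With the endpoint healthy, the predictor created above can be queried. A minimal usage sketch (the instance shape is a placeholder for whatever the model expects):

# RealTimePredictor serializes this dict to JSON per the serializer and
# content_type configured above; TF Serving expects the "instances" format.
result = rt_predictor.predict({"instances": [[0.1, 0.2, 0.3, 0.4]]})
print(result)  # e.g. {'predictions': [...]}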
It looks like your model is named `model`, as TensorFlow Serving reports earlier in the logs:
2020-07-08T17:07:16.162+02:00 INFO:main:using default model name: model
2020-07-08T17:07:16.162+02:00 INFO:main:tensorflow serving model config:
But in the error, the ping check is being routed to TensorFlow Serving for a model named "None":
`Could not find any versions of model None`
I'm not sure whether this error comes from the Docker container or from the SageMaker side. But... I did find this suspicious environment variable, `TFS_DEFAULT_MODEL_NAME`:
class PythonServiceResource:

    def __init__(self):
        if SAGEMAKER_MULTI_MODEL_ENABLED:
            self._model_tfs_rest_port = {}
            self._model_tfs_grpc_port = {}
            self._model_tfs_pid = {}
            self._tfs_ports = self._parse_sagemaker_port_range(SAGEMAKER_TFS_PORT_RANGE)
        else:
            self._tfs_grpc_port = TFS_GRPC_PORT
            self._tfs_rest_port = TFS_REST_PORT

        self._tfs_enable_batching = SAGEMAKER_BATCHING_ENABLED == 'true'
        self._tfs_default_model_name = os.environ.get('TFS_DEFAULT_MODEL_NAME', "None")
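Note the fallback: os.environ.get returns the string "None", not Python's None, when the variable is unset, and that string then ends up verbatim in the model lookup. A quick illustration:

import os

# With TFS_DEFAULT_MODEL_NAME unset, the fallback is the 4-character
# string "None", which then appears verbatim in the ping error.
name = os.environ.get('TFS_DEFAULT_MODEL_NAME', "None")
print(name)                           # None  (a string)
print('/v1/models/{}'.format(name))   # /v1/models/None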
Can you try setting `TFS_DEFAULT_MODEL_NAME` in your container and see what happens?
If that doesn't work, you could file a bug on the TensorFlow SageMaker container GitHub; Amazon experts check it regularly.
By the way, I'd love to talk more about how you're using SageMaker endpoints with TensorFlow models, for some research I'm doing. If you're interested, send me an email at yoavz@modelzoo.dev.