问题：

AWS sagemaker容器：如何创建或传递resourceconfig。json到培训框架？

夏建弼

2023-03-14

我正在尝试为Amazon Sagemaker创建自定义模型/图像/容器。我已经阅读了所有的基础教程，如何根据您的要求创建图像。实际上，我有一个正确设置的映像，它运行tensorflow，在本地训练、部署和服务模型。

当我尝试使用sagemaker python SDK运行容器时，问题就出现了。更准确地说，尝试使用框架模块和类创建我自己的自定义估计器来运行自定义图像/容器。

在这里，我张贴的最低代码来解释我的情况：

文件结构：

.
├── Dockerfile
├── variables.env
├── requirements.txt
├── test_sagemaker.ipynb
├── src
|   ├── train
|   ├── serve
|   ├── predict.py
|   └── custom_code/my_model_functions
|
└── local_test
    ├── train_local.sh
    ├── serve_local.sh
    ├── predict.sh
    └── test_dir
        ├── model/model.pkl
        ├── output/output.txt
        └── input
            ├── data/data.pkl
            └── config
                ├── hyperparameters.json
                ├── inputdataconfig.json
                └── resourceconfig.json

Dockerfile.

FROM ubuntu:16.04

MAINTAINER Amazon AI <sage-learner@amazon.com>

# Install python and other runtime dependencies
RUN apt-get update && \
    apt-get -y install build-essential libatlas-dev git wget curl nginx jq && \
    apt-get -y install python3-dev python3-setuptools

# Install pip
RUN cd /tmp && \
    curl -O https://bootstrap.pypa.io/get-pip.py && \
    python3 get-pip.py && \
    rm get-pip.py

# Installing Requirements
COPY requirements.txt /requirements.txt
RUN pip3 install -r /requirements.txt

# Set SageMaker training environment variables
ENV SM_ENV_VARIABLES env_variables

COPY local_test/test_dir /opt/ml

# Set up the program in the image
COPY src /opt/program
WORKDIR /opt/program

火车


from __future__ import absolute_import

import json, sys, logging, os, subprocess, time, traceback
from pprint import pprint

# Custom Code Functions
from custom_code.custom_estimator import CustomEstimator
from custom_code.custom_dataset import create_dataset

# Important Seagemaker Modules
import sagemaker_containers.beta.framework as framework
from sagemaker_containers import _env

logger = logging.getLogger(__name__)

def run_algorithm_mode():
    """Run training in algorithm mode, which does not require a user entry point. """

    train_config = os.environ.get('training_env_variables')
    model_path = os.environ.get("model_path")

    print("Downloading Dataset")
    train_dataset,  test_dataset = create_dataset(None)
    print("Creating Model")
    clf = CustomEstimator.create_model(train_config)
    print("Starting Training")
    clf = clf.train_model(train_dataset, test_dataset)
    print("Saving Model")
    module_name = 'classifier.pkl'
    CustomEstimator.save_model(clf, model_path)


def train(training_environment):
    """Run Custom Model training in either 'algorithm mode' or using a user supplied module in local SageMaker environment.
    The user supplied module and its dependencies are downloaded from S3.
    Training is invoked by calling a "train" function in the user supplied module.
    Args:
        training_environment: training environment object containing environment variables,
                               training arguments and hyperparameters
    """

    if training_environment.user_entry_point is not None:
        print("Entry Point Receive")
        framework.modules.run_module(training_environment.module_dir,
                                     training_environment.to_cmd_args(),
                                     training_environment.to_env_vars(),
                                     training_environment.module_name,
                                     capture_error=False)
        print_directories()
    else:
        logger.info("Running Custom Model Sagemaker in 'algorithm mode'")
        try:
            _env.write_env_vars(training_environment.to_env_vars())
        except Exception as error:
            print(error)
        run_algorithm_mode()

def main():
    train(framework.training_env())
    sys.exit(0)

if __name__ == '__main__':
    main()

测试你的制造厂商。ipynb

我使用sagemaker估计器的框架类创建了这个定制的sagemaker估计器。

import boto3
from sagemaker.estimator import Framework

class ScriptModeTensorFlow(Framework):
    """This class is temporary until the final version of Script Mode is released.
    """

    __framework_name__ = "tensorflow-scriptmode"

    create_model = TensorFlow.create_model

    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        image_name=None,
        **kwargs
    ):
        super(ScriptModeTensorFlow, self).__init__(
            entry_point, source_dir , hyperparameters, image_name=image_name, **kwargs
        )
        self.py_version = py_version
        self.image_name = None
        self.framework_version = '2.0.0'
        self.user_entry_point = entry_point
        print(self.user_entry_point)

然后创建传递entry_point和图像（类需要运行的所有其他参数）的估计器。

estimator = ScriptModeTensorFlow(entry_point='training_script_path/train_model.py',
                       image_name='sagemaker-custom-image:latest',
                       source_dir='source_dir_path/input/config',
                       train_instance_type='local',      # Run in local mode
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       py_version='py3',
                       role=role)

最后，击球训练...

estimator.fit({"train": "s3://s3-bucket-path/training_data"})

但我得到了以下错误：

Creating tmpm3ft7ijm_algo-1-mjqkd_1 ... 
Attaching to tmpm3ft7ijm_algo-1-mjqkd_12mdone
algo-1-mjqkd_1  | Reporting training FAILURE
algo-1-mjqkd_1  | framework error: 
algo-1-mjqkd_1  | Traceback (most recent call last):
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 65, in train
algo-1-mjqkd_1  |     env = sagemaker_containers.training_env()
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/__init__.py", line 27, in training_env
algo-1-mjqkd_1  |     resource_config=_env.read_resource_config(),
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 240, in read_resource_config
algo-1-mjqkd_1  |     return _read_json(resource_config_file_dir)
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 192, in _read_json
algo-1-mjqkd_1  |     with open(path, "r") as f:
algo-1-mjqkd_1  | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1  | 
algo-1-mjqkd_1  | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1  | Traceback (most recent call last):
algo-1-mjqkd_1  |   File "/usr/local/bin/dockerd-entrypoint.py", line 24, in <module>
algo-1-mjqkd_1  |     subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
algo-1-mjqkd_1  |   File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
algo-1-mjqkd_1  |     raise CalledProcessError(retcode, cmd)
algo-1-mjqkd_1  | subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 2.
tmpm3ft7ijm_algo-1-mjqkd_1 exited with code 1
Aborting on container exit...

乍一看，这个错误似乎很明显，文件'/opt/ml/输入/配置/resourceconfig.json'丢失了。问题是我没有办法创建这个文件，以便sagemaker框架可以得到主机进行多重处理（我不需要他们耶）。当我创建图像'sagemaker自定义图像：最新'以下的文件夹结构显示以下，我已经给'resoruceconfig.json'的'/选择/ml/输入/配置/'文件夹内的图像。

/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   ├── inputdataconfig.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
│   └── <model files>
└── output
    └── failure

阅读AWS中的文档，当使用sagemaker sdk运行您的映像时，它表示“opt/ml”文件夹中容器中的所有文件在培训期间可能不再可见。

/opt/ml和所有子目录由Amazon SageMaker training保留。在构建算法的docker映像时，请确保不要将算法所需的任何数据放在它们下面，因为这些数据在培训期间可能不再可见。亚马逊SageMaker如何运行您的培训形象

这基本上恢复了我的问题。

是的，我知道我可以利用sagemaker预先构建的估计器和图像。

是的，我知道我可以绕过framwork库，从docker run运行图像列车。

但是我需要实现一个完全定制的sagemaker sdk/image/container/model来与entrypoint一起使用。我知道他有点野心勃勃。

因此，重新表述我的问题：如何让Sagemaker框架或SDK在映像内部创建require resourceconfig。json文件？

共有2个答案

穆建元

2023-03-14

该文件似乎已从“resourceConfig.json”重命名为“resourceConfig.json”。

宋烨烁

2023-03-14

显然，远程运行映像解决了这个问题。我正在使用远程aws机器“ml.m5”。大的。sagemaker sdk代码中的某个地方正在创建并提供映像所需的文件。但仅当在远程计算机中运行时，而不是在本地运行时。

AWS sagemaker容器：如何创建或传递resourceconfig。json到培训框架？

共有2个答案

相关问答

相关文章

相关阅读

相关工具

相关文档