Baidu AI Cloud Container Engine (CCE): Creating a PyTorch Task

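You can create a task of type PyTorch.

Prerequisites

You have installed the CCE AI Job Scheduler and CCE Deep Learning Frameworks Operator components. Without them, the cloud-native AI features are unavailable.
If you are a sub-user, you can create tasks in a queue only if you are one of the users associated with that queue.
The PyTorch deep learning framework was installed when the CCE Deep Learning Frameworks Operator component was installed.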

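Procedure

Log in to the Baidu AI Cloud website and open the management console.
Select "Products & Services > Cloud Native > Container Engine CCE" to open the container engine management console.
In the left navigation pane, click Cluster Management > Cluster List.
On the cluster list page, click the name of the target cluster to open its cluster management page.
On the cluster management page, click Cloud Native AI > Task Management.
On the task management page, click Create Task.
On the create task page, fill in the basic task information: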

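Task name: a custom name for the task. It may contain uppercase and lowercase letters, digits, and the special characters -_ /. It must start with a Chinese character or a letter and be 1 to 65 characters long.
Queue: select the queue to associate with the new task.
Framework: select the deep learning framework for the task, "Pytorch".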
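The task name rule above can be checked before submitting the form. The following is a minimal sketch, not the console's actual validation logic. It assumes that "Chinese character" means the CJK Unified Ideographs range U+4E00 to U+9FFF, that such characters are accepted at any position, and that the space shown in the special-character list is a separator rather than an allowed character. The names `TASK_NAME_RE` and `is_valid_task_name` are hypothetical.

```python
import re

# Assumptions (not confirmed by the documentation):
# - "Chinese character" = CJK Unified Ideographs, U+4E00..U+9FFF
# - Chinese characters are allowed anywhere, not only at the start
# - a space is NOT an allowed character
_FIRST = r"[A-Za-z\u4e00-\u9fff]"
_REST = r"[A-Za-z0-9\u4e00-\u9fff\-_/.]"

# One leading character plus up to 64 more gives a total length of 1 to 65.
TASK_NAME_RE = re.compile(rf"{_FIRST}{_REST}{{0,64}}")


def is_valid_task_name(name: str) -> bool:
    """Return True if `name` satisfies the assumed task name rule."""
    return TASK_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["pytorch-dist-mnist-gloo", "1job", "a" * 66]:
        print(candidate[:30], is_valid_task_name(candidate))
```

If the console turns out to accept spaces, add a space to the `_REST` character class.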

Complete the configuration using the following YAML template as a reference:

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
            # if your libcuda.so.1 is in custom path, set the correct path with the following annotation
            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: registry.baidubce.com/cce-public/kubeflow/pytorch-dist-mnist-test-with-data:1.0
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for gpu core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 10
                  baidu.com/v100_32g_cgpu_memory: "2"
              # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
              # `mnist.py` needs to be replaced with the name of your gpu process.
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: registry.baidubce.com/cce-public/kubeflow/pytorch-dist-mnist-test-with-data:1.0
              args: ["--backend", "gloo"]
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  baidu.com/v100_32g_cgpu: "1"
                  # for gpu core/memory isolation
                  baidu.com/v100_32g_cgpu_core: 20
                  baidu.com/v100_32g_cgpu_memory: "4"
              # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
              # `mnist.py` needs to be replaced with the name of your gpu process.
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep mnist.py | grep -v grep | awk '{print $2}'` && sleep 1"]

Click OK to finish creating the task.
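The template's comments say that `mnist.py` in the preStop hook must be replaced with the name of your own GPU process. As an illustration, the sketch below builds the same command for an arbitrary process name and prints it as a JSON array, which is also a valid YAML flow sequence and can be pasted after `command:`. The helper `build_prestop_command` is hypothetical, and restricting the process name to letters, digits, `_`, `.`, and `-` is an extra safety assumption made here, not a requirement stated by CCE.

```python
import json
import re
from typing import List


def build_prestop_command(process_name: str) -> List[str]:
    """Build the template's preStop command for the given GPU process name."""
    # Assumption: reject names containing shell metacharacters so the
    # generated pipeline cannot be altered by the process name itself.
    if not re.fullmatch(r"[A-Za-z0-9_.\-]+", process_name):
        raise ValueError(f"unsupported process name: {process_name!r}")

    # Same pipeline as the template: find the PID by name, send signal 10,
    # then wait one second before the container is stopped.
    pipeline = (
        f"kill -10 `ps -ef | grep {process_name} | grep -v grep "
        "| awk '{print $2}'` && sleep 1"
    )
    return ["/bin/sh", "-c", pipeline]


if __name__ == "__main__":
    # `train.py` is a placeholder for your own script name.
    print("command: " + json.dumps(build_prestop_command("train.py")))
```

As in the template, `grep` matches the name as a pattern, so choose a name specific enough that it does not match unrelated processes in the container.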

