Follow me on CSDN: https://spike.blog.csdn.net/
Article URL: https://blog.csdn.net/caroline_wendy/article/details/136499768

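PyTorchJob is a Kubernetes custom resource for running PyTorch training jobs on Kubernetes. It is part of Kubeflow and has stable status. With PyTorchJob you can create and manage PyTorch jobs the same way you manage Kubernetes' built-in resources. To use PyTorchJob, you first need to install the PyTorch Operator; by default, it is deployed as a controller inside the training operator.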

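The YAML configuration is as follows, where:

  • kind: PyTorchJob
  • metadata/name: the name of the Job to run; it must not clash with an existing one
  • Nodes use the Worker role: replicas sets the number of replicated nodes and resources sets the number of GPUs per replica, so both 2 machines x 1 GPU and 1 machine x 2 GPUs are supported
  • command: the command to run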
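
To make these fields concrete, here is a minimal Python sketch that assembles the same kind of manifest as a dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The helper names `build_pytorchjob` and `total_gpus` and their parameters are hypothetical, chosen only for illustration; the field names follow the YAML in this post, and everything not discussed here (node selectors, volumes, tolerations) is left out.

```python
import json


def build_pytorchjob(name, replicas, gpus_per_worker, image, command):
    """Return a PyTorchJob manifest as a plain dict.

    Hypothetical helper for illustration: only the fields discussed
    in this post are filled in.
    """
    container = {
        "name": "pytorch",
        "image": image,
        "command": ["/bin/sh", "-cl", command],
        "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Worker": {
                    "replicas": replicas,
                    "template": {"spec": {"containers": [container]}},
                }
            }
        },
    }


def total_gpus(job):
    """Total GPUs requested: Worker replicas times GPUs per replica."""
    worker = job["spec"]["pytorchReplicaSpecs"]["Worker"]
    limits = worker["template"]["spec"]["containers"][0]["resources"]["limits"]
    return worker["replicas"] * limits["nvidia.com/gpu"]


IMAGE = "harbor.[xxx].com/cryoem:v1.3.1"  # placeholder registry, as in the post
COMMAND = "bash k8s/run_grid0_for_gpu1.sh > nohup.test.log 2>&1"

# 1 machine x 2 GPUs, the layout used in this post.
one_by_two = build_pytorchjob("pytorch-simple-001", 1, 2, IMAGE, COMMAND)
# 2 machines x 1 GPU: same GPU total, spread over two Worker replicas.
two_by_one = build_pytorchjob("pytorch-simple-002", 2, 1, IMAGE, COMMAND)

print(json.dumps(one_by_two, indent=2))
```

Both layouts request the same number of GPUs in total; only the split between replicas and per-replica limits differs.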

Source:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple-001
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            file-mount: "true"
            user-mount: "true"
        spec:
          #hostNetwork: false  # New
          containers:
            - name: pytorch
              command:
                - /bin/sh
                - -cl
                - "bash k8s/run_grid0_for_gpu1.sh > nohup.test.log 2>&1"
              image: "harbor.[xxx].com/cryoem:v1.3.1"
              imagePullPolicy: Always
              securityContext:  # New
                privileged: false
                capabilities:
                  add: [ "IPC_LOCK" ]
              resources:
                limits:
                  rdma/hca: 1
                  cpu: 12
                  memory: "100G"
                  nvidia.com/gpu: 2
              workingDir: "workspace/cryoem-project/"
              volumeMounts:
                - name: cache-volume  # change the name to your volume on k8s
                  mountPath: /dev/shm
          nodeSelector:
            gpu.device: "a100"  # support 'a10' or 'a100'
            group: "algo2"
          tolerations:
            - effect: NoSchedule
              key: role
              operator: Equal
              value: "algo2"
          volumes:
            - name: cache-volume  # change the name to your volume on k8s
              emptyDir:
                medium: Memory
                sizeLimit: "960G"

Check the run status:

kubectl get pytorchjobs
# kubectl delete pytorchjobs pytorch-simple-001
kubectl get pods
kubectl exec -it -n [your name] pytorch-simple-001-worker-0 -- bash
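
For scripted checks, the job state can also be read from the output of `kubectl get pytorchjobs pytorch-simple-001 -o json`. The Python sketch below assumes the job's `status.conditions` list holds entries with `type` (such as `Created`, `Running`, `Succeeded`, `Failed`) and `status` fields, which is how the training operator normally reports progress; `current_state` is a hypothetical helper name, not part of any library.

```python
import json


def current_state(job_json):
    """Return the type of the latest condition whose status is "True".

    Assumes the layout status.conditions[*].{type,status}; returns
    "Unknown" when the job has no active condition yet.
    """
    job = json.loads(job_json)
    conditions = job.get("status", {}).get("conditions", [])
    active = [c["type"] for c in conditions if c.get("status") == "True"]
    return active[-1] if active else "Unknown"
```

To use it, pass in the JSON text captured from the `kubectl` command, for example the `stdout` of `subprocess.run([...], capture_output=True, text=True, check=True)`.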

Run result:

Thu Mar  7 07:39:13 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:58:00.0 Off |                    0 |
| N/A   52C    P0   259W / 400W |   7833MiB / 81920MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
| N/A   52C    P0   235W / 400W |  12917MiB / 81920MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+