cube-studio/job-template/job/kaldi_distributed_on_volcanojob

Kaldi distributed training template

Image: ccr.ccs.tencentyun.com/cube-studio/kaldi_distributed_on_volcano:v2

Mounts: 4G(memory):/dev/shm,kubernetes-config(configmap):/root/.kube

Environment variables:

NO_RESOURCE_CHECK=true
TASK_RESOURCE_CPU=4
TASK_RESOURCE_MEMORY=4G
TASK_RESOURCE_GPU=0

Account: kubeflow-pipeline

Startup parameters:

{
    "shell": {
        "--working_dir": {
            "type": "str",
            "item_type": "str",
            "label": "",
            "require": 1,
            "choice": [],
            "range": "",
            "default": "",
            "placeholder": "Working directory",
            "describe": "Working directory",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--user_cmd": {
            "type": "str",
            "item_type": "str",
            "label": "",
            "require": 1,
            "choice": [],
            "range": "",
            "default": "./run.sh",
            "placeholder": "Startup command",
            "describe": "Startup command",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--num_worker": {
            "type": "str",
            "item_type": "str",
            "label": "",
            "require": 1,
            "choice": [],
            "range": "",
            "default": "2",
            "placeholder": "Number of workers",
            "describe": "Number of workers",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        },
        "--image": {
            "type": "str",
            "item_type": "str",
            "label": "",
            "require": 1,
            "choice": [],
            "range": "",
            "default": "ccr.ccs.tencentyun.com/cube-studio/kaldi_distributed_worker:v1",
            "placeholder": "",
            "describe": "Worker image: the environment image that runs your code directly. <a href='https://github.com/tencentmusic/cube-studio/tree/master/images'>Base images</a>",
            "editable": 1,
            "condition": "",
            "sub_args": {}
        }
    }
}

Usage

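The user specifies a working directory through the working_dir parameter. The working directory must contain run.sh, path.sh, and the other recipe files; see the official Kaldi example /egs/wsj/s5 for reference. KALDI_ROOT in path.sh must also be changed to /opt/kaldi/.

To run a distributed task on k8s, replace each .pl script in cmd.sh with k8s.pl, much as you would replace them with queue.pl. k8s.pl takes the same arguments as run.pl and does not accept resource-limiting arguments; resources are adjusted through the template parameters instead. At run time, run.sh in the working directory is the entry point.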
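The path.sh and cmd.sh changes described above might look like the following. This is a minimal sketch, not the template's own files: the variable names train_cmd, decode_cmd, and mkgraph_cmd are assumed from the stock /egs/wsj/s5 recipe, and k8s.pl is assumed to be resolvable the same way run.pl or queue.pl is in your recipe. A real path.sh also contains further PATH setup, which is left out here.

```shell
# path.sh (excerpt): point the recipe at the Kaldi install inside the image.
export KALDI_ROOT=/opt/kaldi/

# cmd.sh: use k8s.pl in place of run.pl/queue.pl.
# No resource options such as "--mem 2G" are passed here, because k8s.pl
# does not accept them; resources come from the template parameters.
# Variable names are assumed from the wsj/s5 recipe; adapt to your own cmd.sh.
export train_cmd="k8s.pl"
export decode_cmd="k8s.pl"
export mkgraph_cmd="k8s.pl"
```

Recipes that use other command variables would replace those in the same way.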

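  • --working_dir: as described above.
  • --num_worker: the number of worker nodes to start.
  • --image: the default is fine. To upgrade Kaldi you can substitute another image, provided that Kaldi is installed at /opt/kaldi/ in the image and that the SSH tools are installed.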
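Before submitting a task, it can help to confirm that the directory passed as --working_dir meets the requirements above. The function below is a hypothetical pre-flight check, not part of this template: the name check_working_dir and the exact pattern used to match the KALDI_ROOT line are assumptions made for illustration.

```shell
# Hypothetical helper (not shipped with the template): verify that a working
# directory has the files the template expects before using it as --working_dir.
check_working_dir() {
    dir="$1"

    # run.sh is the entry point and path.sh carries KALDI_ROOT.
    for f in run.sh path.sh; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $dir/$f" >&2
            return 1
        fi
    done

    # path.sh should set KALDI_ROOT to /opt/kaldi/ (trailing slash and quotes
    # optional). The pattern is an assumption about how the line is written.
    if ! grep -Eq '^[[:space:]]*(export[[:space:]]+)?KALDI_ROOT="?/opt/kaldi/?"?[[:space:]]*$' "$dir/path.sh"; then
        echo "KALDI_ROOT in $dir/path.sh is not set to /opt/kaldi/" >&2
        return 1
    fi

    return 0
}
```

Calling check_working_dir with the path of your recipe directory returns 0 when both files are present and KALDI_ROOT is set as expected, and prints the reason to stderr otherwise.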