add help url

This commit is contained in:
pengluan 2022-05-24 11:53:31 +08:00
parent 4ab9308235
commit cf447feb24
5 changed files with 279 additions and 103 deletions

View File

@ -1,55 +0,0 @@
Architecture diagram
![image](https://user-images.githubusercontent.com/20157705/167534673-322f4784-e240-451e-875e-ada57f121418.png)
Multi-cluster management
![image](https://user-images.githubusercontent.com/20157705/167534695-d63b8239-e85e-42c4-bc7b-5999b9eff882.png)
Distributed storage
![image](https://user-images.githubusercontent.com/20157705/167534724-733ad796-745e-47e1-9224-9e749f918cf2.png)
Online debugging
![image](https://user-images.githubusercontent.com/20157705/167534731-8d19cab9-1420-46cf-8a1d-a4c68823c63d.png)
Pipeline orchestration
![image](https://user-images.githubusercontent.com/20157705/167534748-9adf82ae-fd08-46f1-9ba6-a60b55bb8d3b.png)
Job templates
![image](https://user-images.githubusercontent.com/20157705/167534770-505ffce8-8172-49be-9506-b265cd6ed465.png)
NNI hyperparameter search
![image](https://user-images.githubusercontent.com/20157705/167534784-255f101a-3273-4eea-9254-f2df6879ddf1.png)
Distributed frameworks
![image](https://user-images.githubusercontent.com/20157705/167534807-ca9a847f-45dc-4acb-a124-099e5915d81f.png)
Inference service
![image](https://user-images.githubusercontent.com/20157705/167534820-9202851a-a97c-41f7-8d63-900d73e4c57e.png)
Real-time large-model training
![image](https://user-images.githubusercontent.com/20157705/167534836-418855cf-daef-45a5-85c9-3bb1b7135f4f.png)
UI preview
![image](https://user-images.githubusercontent.com/20157705/167534850-e7f40f1e-058d-4370-be01-8bbcaf80c3e0.png)

View File

@ -79,7 +79,8 @@ __notebook__ starts a jupyter-notebook with your personal working directory mounted automatically.
### jupyter example
![image](https://user-images.githubusercontent.com/20157705/167538488-cba41bf6-ba66-4150-b17e-f31f5cc5013d.png)
<img width="70%" alt="167874734-5b1629e0-c3bb-41b0-871d-ffa43d914066" src="https://user-images.githubusercontent.com/20157705/167538488-cba41bf6-ba66-4150-b17e-f31f5cc5013d.png">
### vscode example
@ -100,37 +101,15 @@ __notebook__ starts a jupyter-notebook with your personal working directory mounted automatically.
![image](https://user-images.githubusercontent.com/20157705/167538625-39c19c33-a63d-44fa-a16a-2aaa7b480190.png)
### Common base images
#### ubuntu
cuda10.2-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.2-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.2-cudnn7-python3.7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.2-cudnn7-python3.8
cuda10.1-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.1-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.1-cudnn7-python3.6
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.1-cudnn7-python3.7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.1-cudnn7-python3.8
cuda10.0-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.0-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.0-cudnn7-python3.6
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.0-cudnn7-python3.7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.0-cudnn7-python3.8
cuda9.1-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda9.1-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda9.1-cudnn7-python3.6
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda9.1-cudnn7-python3.7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda9.1-cudnn7-python3.8
cuda10.1-cuda10.0-cuda9.0-cudnn7.6
- ai.tencentmusic.com/tme-public/gpu:ubuntu18.04-python3.6-cuda10.1-cuda10.0-cuda9.0-cudnn7.6-base
Advanced configuration through the extension field:
```bash
{
"volume_mount":"kubeflow-user-workspace(pvc):/mnt,kubeflow-archives(pvc):/archives",
"resource_memory":"8G",
"resource_cpu": "4"
}
```
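The `volume_mount` value above packs several mounts into one comma-separated string. The following is a minimal sketch of how such a string could be split into its parts, assuming every entry follows the `name(type):path` shape seen in the example. The helper name `parse_volume_mount` is hypothetical and is not part of the platform.

```python
import re

# Assumed entry shape, inferred from the example above: name(type):mount_path
_ENTRY = re.compile(r"^(?P<name>[^()]+)\((?P<type>[^()]+)\):(?P<path>/.*)$")


def parse_volume_mount(spec):
    """Split a volume_mount string into (name, type, mount_path) tuples."""
    mounts = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        match = _ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unrecognized volume_mount entry: {entry!r}")
        mounts.append((match["name"], match["type"], match["path"]))
    return mounts


print(parse_volume_mount(
    "kubeflow-user-workspace(pvc):/mnt,kubeflow-archives(pvc):/archives"
))
```

Checking the string this way before saving the configuration catches a missing parenthesis or path early.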
[Base images and packaging reference](https://github.com/tencentmusic/cube-studio/tree/master/images)
# Configuring, debugging, and scheduling pipelines
@ -187,7 +166,8 @@ Pod result
Editing page for configuring a scheduled pipeline
![image](https://user-images.githubusercontent.com/20157705/167538811-3644c420-5b00-4c13-af75-c672aef899b2.png)
<img width="50%" alt="167874734-5b1629e0-c3bb-41b0-871d-ffa43d914066" src="https://user-images.githubusercontent.com/20157705/167538811-3644c420-5b00-4c13-af75-c672aef899b2.png">
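Where to find it: Training > Scheduled run records
@ -201,3 +181,257 @@ Pod result
1. The platform decides whether to trigger a run based on the pipeline's configuration.
2. The status link shows the state of the workflow launched by this scheduled run.
3. The log link shows the logs produced by this scheduled run.
# NNI hyperparameter search
Follow the conventions described on the [NNI project page](https://github.com/microsoft/nni).
## Search space
The search space must be valid, standard JSON. Example: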
```
{
"batch_size": {"_type":"choice", "_value": [16, 32, 64, 128]},
"hidden_size":{"_type":"choice","_value":[128, 256, 512, 1024]},
"lr":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]},
"momentum":{"_type":"uniform","_value":[0, 1]}
}
```
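Because the search space has to be standard JSON, it can help to check it locally before submitting an experiment. The following is a minimal sketch, assuming only that every entry needs the `_type` and `_value` keys used in the example above. `check_search_space` is a hypothetical helper, not an NNI or platform function.

```python
import json


def check_search_space(text):
    """Parse a search-space string and verify each entry has _type and _value."""
    space = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(space, dict):
        raise ValueError("search space must be a JSON object")
    for name, spec in space.items():
        if not isinstance(spec, dict) or "_type" not in spec or "_value" not in spec:
            raise ValueError(f"{name}: expected an object with _type and _value")
    return space


space = check_search_space('{"lr": {"_type": "choice", "_value": [0.001, 0.01]}}')
print(sorted(space))
```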
Different tuning algorithms support different search-space types:
| |choice |choice(nested) |randint |uniform |quniform |loguniform |qloguniform |normal |qnormal |lognormal |qlognormal |
| :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
|TPE Tuner |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |
|Random Search Tuner |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |
|Anneal Tuner |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |
|Evolution Tuner |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |
|SMAC Tuner |✓ | |✓ |✓ |✓ |✓ | | | | | |
|Batch Tuner |✓ | | | | | | | | | | |
|Grid Search Tuner |✓ | |✓ | |✓ | | | | | | |
|Hyperband Advisor |✓ | |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |✓ |
|Metis Tuner |✓ | |✓ |✓ |✓ | | | | | | |
|GP Tuner |✓ | |✓ |✓ |✓ |✓ |✓ | | | | |
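A table like this can be turned into a quick pre-flight check. The sketch below transcribes four rows of the table into a dictionary and lists the parameters a chosen tuner would not accept. The dictionary, the helper name `unsupported_params`, and the sample search space are illustrative assumptions, not part of NNI or the platform.

```python
# Supported types per tuner, transcribed from a few rows of the table above.
SUPPORTED_TYPES = {
    "Batch Tuner": {"choice"},
    "Grid Search Tuner": {"choice", "randint", "quniform"},
    "Metis Tuner": {"choice", "randint", "uniform", "quniform"},
    "SMAC Tuner": {"choice", "randint", "uniform", "quniform", "loguniform"},
}


def unsupported_params(tuner, search_space):
    """Return the names of parameters whose _type the tuner does not support."""
    allowed = SUPPORTED_TYPES[tuner]
    return sorted(
        name for name, spec in search_space.items() if spec["_type"] not in allowed
    )


search_space = {
    "batch_size": {"_type": "choice", "_value": [16, 32, 64, 128]},
    "momentum": {"_type": "uniform", "_value": [0, 1]},
}
print(unsupported_params("Grid Search Tuner", search_space))  # ['momentum']
print(unsupported_params("SMAC Tuner", search_space))  # []
```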
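## Code requirements
### Receiving parameters
When a hyperparameter search starts, the platform uses the configured search algorithm to pick a value for each hyperparameter and passes the chosen values to the user's container. For example, given a search-space definition like the one above, arguments of the following form are passed when the user's Docker container runs (the parameter names in this sample do not match the search space above; it only illustrates the format). Users therefore do not need to add these variables to the startup command or arguments, because the system adds them automatically. The business code only needs to accept these arguments and output a result value based on them.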
```
--lr=0.021593113434583065 --num-layers=5 --optimizer=ftrl
```
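A minimal sketch of how business code might accept arguments like these with the standard `argparse` module. The default values and the `parse_hyperparams` name are assumptions for illustration. `parse_known_args` is used so that any additional arguments on the command line do not cause an error.

```python
import argparse


def parse_hyperparams(argv=None):
    """Accept hyperparameters from the command line and ignore unknown arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--num-layers", type=int, default=2)
    parser.add_argument("--optimizer", type=str, default="sgd")
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparams(
    ["--lr=0.021593113434583065", "--num-layers=5", "--optimizer=ftrl"]
)
# argparse exposes --num-layers as the attribute num_layers
print(args.lr, args.num_layers, args.optimizer)
```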
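### Reporting results
The user's container and code start up, receive the hyperparameters, and run the iterative computation. The search advances as the code actively reports its results.
As the example below shows, user code must accept the hyperparameter values as input arguments, report the result of each epoch with nni.report_intermediate_result, and report the final result of each trial with nni.report_final_result.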
```
import os
import argparse
import logging
import random
import time

import nni
from nni.utils import merge_parameter

logger = logging.getLogger('mnist_AutoML')


def main(args):
    # Simulated training: start from a random accuracy and raise it every epoch
    test_acc = random.randint(30, 50)
    for epoch in range(1, 11):
        test_acc_epoch = random.randint(3, 5)
        time.sleep(3)
        test_acc += test_acc_epoch
        # Report the target value for the current epoch
        nni.report_intermediate_result(test_acc)
    # Report the final target value for this trial
    nni.report_final_result(test_acc)


def get_params():
    # The hyperparameters must be accepted as input arguments
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch_size', type=int, default=64, help='input batch size for training (default: 64)')
    args, _ = parser.parse_known_args()
    return args


if __name__ == '__main__':
    try:
        # Get parameters from the tuner
        tuner_params = nni.get_next_parameter()
        params = vars(merge_parameter(get_params(), tuner_params))
        print(tuner_params, params)
        main(params)
    except Exception as exception:
        logger.exception(exception)
        raise
```
## Starting a search experiment from the web UI
![image](https://user-images.githubusercontent.com/20157705/169943169-6fb72bdf-0913-4873-92be-6702b11084c7.png)
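## Viewing search results in the web UI
For reference, see https://nni.readthedocs.io/zh/stable/Tutorial/WebUI.html
The overview page shows the experiment ID and the run status of the current instances.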
![image](https://user-images.githubusercontent.com/20157705/169943044-65efa03d-6023-4675-978e-e2b10570dc54.png)
![image](https://user-images.githubusercontent.com/20157705/169943083-9eef65fd-dd1f-4a75-8100-c794be9a236b.png)
You can see how each trial ran and the target value it computed.
![image](https://user-images.githubusercontent.com/20157705/169943117-43a19fc7-7598-44d6-82bf-af32ca618d12.png)
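You can also see the result value of each epoch within a given trial.
# Internal services
## General services
### Develop and register
1. Build your service image and push it to a Docker registry.
2. Register your service.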
![image](https://user-images.githubusercontent.com/20157705/169932303-0ec981cc-09ca-423c-96f9-da164ed309da.png)
## MySQL web service
Image: ai.tencentmusic.com/tme-public/phpmyadmin
Environment variables:
```
PMA_HOST=xx.xx.xx.xx
PMA_PORT=xx
PMA_USER=xx
PMA_PASSWORD=xx
```
Port: 80
## MongoDB web service
Image: mongo-express:0.54.0
Environment variables:
```
ME_CONFIG_MONGODB_SERVER=xx.xx.xx.xx
ME_CONFIG_MONGODB_PORT=xx
ME_CONFIG_MONGODB_ENABLE_ADMIN=true
ME_CONFIG_MONGODB_ADMINUSERNAME=xx
ME_CONFIG_MONGODB_ADMINPASSWORD=xx
ME_CONFIG_MONGODB_AUTH_DATABASE=xx
VCAP_APP_HOST=0.0.0.0
VCAP_APP_PORT=8081
ME_CONFIG_OPTIONS_EDITORTHEME=ambiance
```
Port: 8081
## Redis web service
Image: ai.tencentmusic.com/tme-public/patrikx3:latest
Environment variables:
```
REDIS_NAME=xx
REDIS_HOST=xx
REDIS_PORT=xx
REDIS_PASSWORD=xx
```
Port: 7843
## neo4j graph database
Image: ai.tencentmusic.com/tme-public/neo4j:4.4
Environment variables:
```
NEO4J_AUTH=neo4j/admin
```
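Ports: 7474, 7687
## Jaeger distributed tracing
Image: jaegertracing/all-in-one:1.29
Ports: 5775, 16686
# Inference service
## How versions, domains, and pods relate
`$service_name=$service_type-$model_name-$model_version (digits of the version only)`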
![image](https://user-images.githubusercontent.com/20157705/169943323-0849f8fd-b20e-4036-9ce5-33892a5bb643.png)
`$k8s-deployment-name=$service_name`
![image](https://user-images.githubusercontent.com/20157705/169943360-b7883e39-f070-4dbb-af16-caf021e3b7fa.png)
`$k8s-hpa-name=$service_name`
An HPA is created only when the minimum and maximum replica counts differ.
![image](https://user-images.githubusercontent.com/20157705/169943401-6e7abef7-29e2-4986-a4c9-cb3d5da4a7f0.png)
`$k8s-service-name=$service_name` is used for proxying by domain name.
`$k8s-service-name=$service_name-external` is used for proxying by IP/L5.
![image](https://user-images.githubusercontent.com/20157705/169943472-34b161c2-b487-4aab-a335-f45465bda33b.png)
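The naming rules in this section can be sketched as follows. This is an illustration only: the function name, the example inputs, and the use of a regular expression to keep just the digits of the version are assumptions, not the platform's actual code.

```python
import re


def k8s_names(service_type, model_name, model_version):
    """Derive the resource names described above from a service's attributes."""
    version_digits = re.sub(r"\D", "", model_version)  # keep only the digits
    service_name = f"{service_type}-{model_name}-{version_digits}"
    return {
        "deployment": service_name,
        "hpa": service_name,  # only created when min and max replicas differ
        "service": service_name,  # proxied by domain name
        "service_external": f"{service_name}-external",  # proxied by IP/L5
    }


names = k8s_names("tfserving", "resnet", "v2022.10.01")
print(names["deployment"])        # tfserving-resnet-20221001
print(names["service_external"])  # tfserving-resnet-20221001-external
```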
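## Built-in domains
Automatic domain configuration requires wildcard-domain support. For example, with the wildcard domain domain = *.kfserving.woa.com
Production domain
http://$service_name.service.$domain
Test-environment domains
http://test.$service_name.service.$domain
http://debug.$service_name.service.$domain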
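The URL patterns above can be expressed as a small helper. This sketch assumes that `$domain` in the patterns stands for the wildcard domain with its leading `*.` removed. The function name and the sample service name are hypothetical.

```python
def service_urls(service_name, domain):
    """Build the production, test, and debug URLs for a service.

    `domain` is assumed to be the wildcard domain without its leading "*." part.
    """
    base = f"{service_name}.service.{domain}"
    return {
        "prod": f"http://{base}",
        "test": f"http://test.{base}",
        "debug": f"http://debug.{base}",
    }


urls = service_urls("tfserving-resnet-20221001", "kfserving.woa.com")
for env, url in urls.items():
    print(env, url)
```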
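## Custom domains
Users can set a service's access domain through the host field, but the value must end with the wildcard domain.
Multiple services may be configured with the same domain.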
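The suffix rule for custom hosts can be checked with a few lines of code. This is a minimal sketch under the assumption that the wildcard domain is written as `*.kfserving.woa.com`, as in the built-in domain example. `is_valid_host` is a hypothetical name.

```python
def is_valid_host(host, wildcard_domain):
    """Return True if host ends with the suffix of a wildcard domain."""
    suffix = wildcard_domain.lstrip("*")  # "*.kfserving.woa.com" -> ".kfserving.woa.com"
    # The host needs at least one label in front of the shared suffix.
    return host.endswith(suffix) and len(host) > len(suffix)


print(is_valid_host("my-model.kfserving.woa.com", "*.kfserving.woa.com"))  # True
print(is_valid_host("my-model.example.com", "*.kfserving.woa.com"))        # False
```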
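## Traffic mirroring and splitting
Configure the same domain for multiple services (they may serve the same model or different models).
1. The traffic-split field controls how much traffic is routed to other services; the remaining traffic stays with the current service.
2. The traffic-mirror field controls how much traffic is copied to other services. Only the current service's response is returned to the client.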
![image](https://user-images.githubusercontent.com/20157705/169944196-bd98064d-124f-4233-af24-5b226ab38831.png)
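The arithmetic behind the split field can be illustrated as follows. The sketch assumes the split is expressed as integer percentages per target service. The data shape, the function name, and the service names are hypothetical, since the actual field format is only shown in the screenshot.

```python
def split_traffic(current_service, split_to_others):
    """Return each service's share (in percent) of the incoming traffic.

    split_to_others maps other service names to the percentage routed to them;
    whatever is left over stays with the current service.
    """
    routed = sum(split_to_others.values())
    if routed > 100 or any(p < 0 for p in split_to_others.values()):
        raise ValueError("split percentages must be non-negative and sum to at most 100")
    shares = dict(split_to_others)
    shares[current_service] = 100 - routed
    return shares


shares = split_traffic("model-v1", {"model-v2": 20, "model-v3": 10})
print(shares)  # {'model-v2': 20, 'model-v3': 10, 'model-v1': 70}
```

Mirrored traffic is not part of this calculation: a mirrored request is a copy, and the client still receives the response from the current service.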
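## Canary upgrades
1. Canary upgrade within a single service: change the service's configuration and redeploy it. The pods are rolled automatically.
2. Canary upgrade across services, for example between versions of the same model: have the services use the same domain. Once the newly deployed service is up and healthy, the old service on that domain is taken offline automatically.
## Autoscaling
Autoscaling can be triggered by custom metrics, using either a single metric or several. Example: cpu:50%,mem:50%,gpu:50%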
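A metric string of this form is easy to parse into a dictionary. The sketch below assumes comma-separated `name:percent` pairs as in the example. The function name `parse_scaling_metrics` is hypothetical and this is not the platform's parser.

```python
def parse_scaling_metrics(text):
    """Parse a string such as 'cpu:50%,gpu:50%' into {'cpu': 50, 'gpu': 50}."""
    metrics = {}
    for item in text.split(","):
        name, _, value = item.partition(":")
        # Drop the percent sign and surrounding spaces before converting.
        metrics[name.strip()] = int(value.strip().strip("%"))
    return metrics


print(parse_scaling_metrics("cpu:50%,gpu:50%"))  # {'cpu': 50, 'gpu': 50}
```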
## Environment variables
Environment variables injected by the system:
```bash
KUBEFLOW_ENV=test
KUBEFLOW_MODEL_PATH=
KUBEFLOW_MODEL_VERSION=
KUBEFLOW_MODEL_IMAGES=
KUBEFLOW_MODEL_NAME=
KUBEFLOW_AREA=shanghai/guangzhou
K8S_NODE_NAME=
K8S_POD_NAMESPACE=
K8S_POD_IP=
K8S_HOST_IP=
K8S_POD_NAME=
```
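Inside the serving container these variables can be read with `os.environ`. The following is a minimal sketch. The helper name `load_model_settings` and the choice of which variables to collect are assumptions for illustration.

```python
import os


def load_model_settings(environ=os.environ):
    """Collect some of the platform-provided variables listed above."""
    return {
        "env": environ.get("KUBEFLOW_ENV", ""),
        "model_name": environ.get("KUBEFLOW_MODEL_NAME", ""),
        "model_version": environ.get("KUBEFLOW_MODEL_VERSION", ""),
        "model_path": environ.get("KUBEFLOW_MODEL_PATH", ""),
        "pod_name": environ.get("K8S_POD_NAME", ""),
    }


settings = load_model_settings()
print(settings["env"] or "KUBEFLOW_ENV is not set")
```

Using `.get` with a default keeps the code runnable outside the platform, where the variables are absent.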

View File

@ -1,6 +1,6 @@
# Building images online
![](../pic/tapd_20424693_1630748567_87.png)
![image](https://user-images.githubusercontent.com/20157705/167538625-39c19c33-a63d-44fa-a16a-2aaa7b480190.png)
Advanced configuration through the extension field (example):
```

View File

@ -707,19 +707,16 @@ NOTEBOOK_GPU_TYPE='NVIDIA'
# Help documentation for each model list page
HELP_URL={
"pipeline":"http://xx.xx/xx",
"job_template":"http://xx.xx/xx",
"task":"http://xx.xx/xx",
"hp":"http://xx.xx/xx",
"nni":"http://xx.xx/xx",
"images":"http://xx.xx/xx",
"notebook":"http://xx.xx/xx",
"service":"http://xx.xx/xx",
"kfserving":"http://xx.xx/xx",
"inferenceservice":"http://xx.xx/xx",
"model":"http://xx.xx/xx",
"run":"http://xx.xx/xx",
"docker":"http://xx.xx/xx"
"pipeline":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
"job_template":"https://github.com/tencentmusic/cube-studio/tree/master/job-template",
"task":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
"nni":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
"images":"https://github.com/tencentmusic/cube-studio/tree/master/images",
"notebook":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
"service":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
"inferenceservice":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
"run":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
"docker":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example"
}
# Names of templates that use the user's image directly instead of the image defined in the template

View File

@ -161,7 +161,7 @@ def init():
"range": "",
"default": "ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.1-cudnn7-python3.6",
"placeholder": "",
"describe": "Image to debug, <a target='_blank' href='https://github.com/tencentmusic/cube-studio/tree/master/docs/example/images'>base image reference</a>",
"describe": "Image to debug, <a target='_blank' href='https://github.com/tencentmusic/cube-studio/tree/master/images'>base image reference</a>",
"editable": 1,
"condition": "",
"sub_args": {}