Mirror of https://github.com/tencentmusic/cube-studio.git, synced 2025-02-23 14:51:43 +08:00

add help url

Commit cf447feb24 (parent 4ab9308235)
@ -1,55 +0,0 @@
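Architecture diagram

![image](https://user-images.githubusercontent.com/20157705/163531071-f5fc7d7c-4108-4e14-9cbf-a93df0dd8c62.png)


Multi-cluster management

![image](https://user-images.githubusercontent.com/20157705/160049564-9e4a3d8d-2efd-41ab-9ae8-52e05a5b8d8e.png)


Distributed storage

![image](https://user-images.githubusercontent.com/20157705/160049616-63b2d91d-06b8-4a82-84b3-a6c0d7a53689.png)


Online debugging

![image](https://user-images.githubusercontent.com/20157705/166098757-ec8b3ede-0cb6-4b2d-a47f-3cb5fb75bca7.png)


Pipeline orchestration

![image](https://user-images.githubusercontent.com/20157705/151136851-5a2d3ad8-cc13-4eee-a0ba-cb5c2a7d8d7b.png)


Job templates

![image](https://user-images.githubusercontent.com/20157705/151139288-dafc6b07-1c50-4a19-8a54-a6a54a6bd5cc.png)


NNI hyperparameter search

![image](https://user-images.githubusercontent.com/20157705/151111086-cc4d6ab8-e2c4-4df3-8cc6-48ed0ee1f2c8.png)


Distributed frameworks

![image](https://user-images.githubusercontent.com/20157705/151142946-05ab6e28-38b9-4aa3-9f8f-dbd04a61c8f5.png)


Inference service

![image](https://user-images.githubusercontent.com/20157705/152465168-8d1e0aa2-d0b1-4f8f-9c3f-7c6d0b4c8e0d.png)


Real-time training of large models

![image](https://user-images.githubusercontent.com/20157705/151293768-dfb6aa5c-e89c-43b5-85b9-c4c3e8a0d7a1.png)


UI screenshots

![image](https://user-images.githubusercontent.com/20157705/146310992-bd6b2ea4-9b64-4c0e-a8ee-d7cbdf7b0c1e.png)
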
@ -79,7 +79,8 @@ __notebook__: starts a jupyter-notebook with the personal working directory mounted automatically.

### Jupyter example:

![image](https://user-images.githubusercontent.com/20157705/151150856-9f3e5c0b-67b2-4e09-9b95-5c8e2f0fa5c9.png)
<img width="70%" alt="167874734-5b1629e0-c3bb-41b0-871d-ffa43d914066" src="https://user-images.githubusercontent.com/20157705/167538488-cba41bf6-ba66-4150-b17e-f31f5cc5013d.png">


### VS Code example:

@ -100,37 +101,15 @@ __notebook__: starts a jupyter-notebook with the personal working directory mounted automatically.

![image](https://user-images.githubusercontent.com/20157705/145206257-d4a0b4d6-c0d6-4b4e-9d2e-b1d1b5c6e0b1.png)

### Common base images

#### ubuntu

cuda10.2-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.2-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.2-cudnn7-python3.7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.2-cudnn7-python3.8

cuda10.1-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.1-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.1-cudnn7-python3.6
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.1-cudnn7-python3.7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.1-cudnn7-python3.8

cuda10.0-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.0-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.0-cudnn7-python3.6
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.0-cudnn7-python3.7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.0-cudnn7-python3.8

cuda9.1-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda9.1-cudnn7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda9.1-cudnn7-python3.6
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda9.1-cudnn7-python3.7
- ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda9.1-cudnn7-python3.8


cuda10.1-cuda10.0-cuda9.0-cudnn7.6
- ai.tencentmusic.com/tme-public/gpu:ubuntu18.04-python3.6-cuda10.1-cuda10.0-cuda9.0-cudnn7.6-base

Advanced configuration via the extended field:
```bash
{
    "volume_mount":"kubeflow-user-workspace(pvc):/mnt,kubeflow-archives(pvc):/archives",
    "resource_memory":"8G",
    "resource_cpu": "4"
}
```
[Base images and packaging method reference](https://github.com/tencentmusic/cube-studio/tree/master/images)
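The `volume_mount` value in the extended configuration packs several mounts into one comma-separated string. As a rough illustration of how such a string could be read, here is a minimal Python sketch. The `VolumeMount` type, the `parse_volume_mount` helper, and the regular expression are assumptions made for this example and are not taken from the cube-studio code base.

```python
import json
import re
from typing import List, NamedTuple


class VolumeMount(NamedTuple):
    name: str  # volume name, for example a PVC name
    kind: str  # text inside the parentheses, for example "pvc"
    path: str  # mount path inside the container


# Hypothetical pattern for a single "name(kind):/path" entry.
_ENTRY = re.compile(r"^(?P<name>[^()]+)\((?P<kind>[^()]+)\):(?P<path>/.*)$")


def parse_volume_mount(spec: str) -> List[VolumeMount]:
    """Split a comma-separated volume_mount string into structured entries."""
    mounts = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        match = _ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unrecognized volume_mount entry: {entry!r}")
        mounts.append(VolumeMount(**match.groupdict()))
    return mounts


# The extended configuration from the document, parsed as JSON.
config = json.loads("""
{
    "volume_mount": "kubeflow-user-workspace(pvc):/mnt,kubeflow-archives(pvc):/archives",
    "resource_memory": "8G",
    "resource_cpu": "4"
}
""")
mounts = parse_volume_mount(config["volume_mount"])
for mount in mounts:
    print(mount.name, mount.kind, mount.path)
```

Note that `resource_cpu` is a JSON string in the example configuration, so any consumer would need to convert it before doing arithmetic on it.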

# Configuring, debugging, and scheduling pipelines

@ -187,7 +166,8 @@ pod view:

Configure the schedule: pipeline editing page

![image](https://user-images.githubusercontent.com/20157705/151150856-c2f7b5c0-5d1a-4e2e-bd0e-1d9f2c3e4a5b.png)
<img width="50%" alt="167874734-5b1629e0-c3bb-41b0-871d-ffa43d914066" src="https://user-images.githubusercontent.com/20157705/167538811-3644c420-5b00-4c13-af75-c672aef899b2.png">


Where to view: Training - Scheduled Run Records

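@ -201,3 +181,257 @@ pod view:

1. The platform decides whether to trigger a run based on the pipeline's configuration.
2. The status link shows how the workflow launched by this scheduled run is progressing.
3. The log link shows the log produced by this scheduled run.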

# NNI hyperparameter search

You can follow the conventions described on the [NNI project page](https://github.com/microsoft/nni).

## Search space
The search space must be standard JSON. Example:
```
{
    "batch_size": {"_type":"choice", "_value": [16, 32, 64, 128]},
    "hidden_size":{"_type":"choice","_value":[128, 256, 512, 1024]},
    "lr":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]},
    "momentum":{"_type":"uniform","_value":[0, 1]}
}
```
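Because the search space has to be standard JSON, it can help to check a draft before submitting it. The sketch below uses only Python's standard `json` module; the `load_search_space` helper is a hypothetical name introduced here, and the structural check (each entry is an object with `_type` and `_value`) is inferred from the example above rather than from the platform's own validation.

```python
import json


def load_search_space(text):
    """Parse a search space and check the basic shape of every entry."""
    space = json.loads(text)  # raises json.JSONDecodeError on non-standard JSON
    if not isinstance(space, dict):
        raise ValueError("search space must be a JSON object")
    for name, spec in space.items():
        if not isinstance(spec, dict) or "_type" not in spec or "_value" not in spec:
            raise ValueError(f"{name}: expected an object with _type and _value")
    return space


example = """
{
    "batch_size": {"_type": "choice", "_value": [16, 32, 64, 128]},
    "hidden_size": {"_type": "choice", "_value": [128, 256, 512, 1024]},
    "lr": {"_type": "choice", "_value": [0.0001, 0.001, 0.01, 0.1]},
    "momentum": {"_type": "uniform", "_value": [0, 1]}
}
"""
space = load_search_space(example)
print(sorted(space))

# A trailing comma is accepted by some tools but is not standard JSON.
try:
    load_search_space('{"lr": {"_type": "choice", "_value": [0.1, 0.01],}}')
except json.JSONDecodeError as error:
    print("rejected:", error.msg)
```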
Different tuning algorithms support different search space types:

| Tuner | choice | choice(nested) | randint | uniform | quniform | loguniform | qloguniform | normal | qnormal | lognormal | qlognormal |
| :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- | :----- |
| TPE Tuner | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Random Search Tuner | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Anneal Tuner | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Evolution Tuner | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| SMAC Tuner | ✓ | | ✓ | ✓ | ✓ | ✓ | | | | | |
| Batch Tuner | ✓ | | | | | | | | | | |
| Grid Search Tuner | ✓ | | ✓ | | ✓ | | | | | | |
| Hyperband Advisor | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Metis Tuner | ✓ | | ✓ | ✓ | ✓ | | | | | | |
| GP Tuner | ✓ | | ✓ | ✓ | ✓ | ✓ | ✓ | | | | |
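
## Code requirements

### Receiving parameters
When a hyperparameter search starts, the platform picks candidate values according to the search algorithm the user configured and passes the chosen values to the user's container. For a search space like the one above, the user's container is started with arguments of the following form. Users therefore do not add these variables to the start command or its arguments; the system appends them automatically. User code only needs to accept these arguments and report a result computed from them.

```
--lr=0.021593113434583065 --num-layers=5 --optimizer=ftrl
```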
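As a small illustration of accepting such arguments, the sketch below parses the example command line with Python's standard `argparse`. The `parse_trial_args` helper and the default values are assumptions made for this example; only the three flag names come from the command line shown above.

```python
import argparse


def parse_trial_args(argv=None):
    """Accept the hyperparameters that the platform appends to the command line."""
    parser = argparse.ArgumentParser(description="trial entry point (illustrative)")
    # Defaults are placeholders used only when a flag is not supplied.
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--num-layers", type=int, default=2)
    parser.add_argument("--optimizer", type=str, default="sgd")
    # parse_known_args ignores any extra flags instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_trial_args(
    ["--lr=0.021593113434583065", "--num-layers=5", "--optimizer=ftrl"]
)
print(args.lr, args.num_layers, args.optimizer)
```

`argparse` exposes `--num-layers` as the attribute `num_layers`, so the hyphen in the flag name does not need special handling in the training code.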

### Reporting results
The user's container and code start up, receive the hyperparameters, and run the iterative computation; the search advances as the code actively reports its results.
In the example below, the user code accepts every hyperparameter in the search space as an input argument, calls nni.report_intermediate_result to report the result of each epoch, and calls nni.report_final_result to report the final result of each trial.
```
import argparse
import logging
import random
import time

import nni
from nni.utils import merge_parameter

logger = logging.getLogger('mnist_AutoML')


def main(args):
    # Simulated accuracy; a real trial would train with the values in args.
    test_acc = random.randint(30, 50)
    for epoch in range(1, 11):
        test_acc_epoch = random.randint(3, 5)
        time.sleep(3)
        test_acc += test_acc_epoch
        # Report the objective value of the current epoch
        nni.report_intermediate_result(test_acc)
    # Report the final objective value
    nni.report_final_result(test_acc)


def get_params():
    # Every hyperparameter must be accepted as an input argument
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch_size', type=int, default=64, help='input batch size for training (default: 64)')
    # merge_parameter raises ValueError for tuner keys that are not declared here,
    # so the other keys of the search space are declared too (defaults are illustrative).
    parser.add_argument('--hidden_size', type=int, default=512)
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--momentum', type=float, default=0.5)

    args, _ = parser.parse_known_args()
    return args


if __name__ == '__main__':
    try:
        # get parameters from tuner
        tuner_params = nni.get_next_parameter()
        params = vars(merge_parameter(get_params(), tuner_params))
        print(tuner_params, params)
        main(params)
    except Exception as exception:
        logger.exception(exception)
        raise
```
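
## Starting a search experiment from the web UI

![image](https://user-images.githubusercontent.com/20157705/151111063-4f6fbc3f-5f4e-4d3a-9f3d-3a1b9d7e2c4f.png)

## Viewing search results in the web UI

See: https://nni.readthedocs.io/zh/stable/Tutorial/WebUI.html

The overview page shows the experiment ID and the status of the currently running instance.

![image](https://user-images.githubusercontent.com/20157705/168518940-4d3f0d9b-2a1c-4e5f-8b6a-7c9d0e1f2a3b.png)

![image](https://user-images.githubusercontent.com/20157705/168518967-5e4a1b2c-3d4e-4f5a-9b6c-8d7e0f1a2b3c.png)

You can view how each trial ran and the objective value it computed.


![image](https://user-images.githubusercontent.com/20157705/168518995-6f5b2c3d-4e5f-4a6b-8c7d-9e0f1a2b3c4d.png)

You can also view the result produced by each epoch within a given trial.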
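
# Internal services

## General services

### Development and registration

1. Build your service image and push it to a Docker registry.

2. Register your service.

![image](https://user-images.githubusercontent.com/20157705/168519270-7a6c3d4e-5f6a-4b7c-9d8e-0f1a2b3c4d5e.png)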

## MySQL web service

Image: ai.tencentmusic.com/tme-public/phpmyadmin

Environment variables:
```
PMA_HOST=xx.xx.xx.xx
PMA_PORT=xx
PMA_USER=xx
PMA_PASSWORD=xx
```
Port: 80

## MongoDB web service
Image: mongo-express:0.54.0

Environment variables:
```
ME_CONFIG_MONGODB_SERVER=xx.xx.xx.xx
ME_CONFIG_MONGODB_PORT=xx
ME_CONFIG_MONGODB_ENABLE_ADMIN=true
ME_CONFIG_MONGODB_ADMINUSERNAME=xx
ME_CONFIG_MONGODB_ADMINPASSWORD=xx
ME_CONFIG_MONGODB_AUTH_DATABASE=xx
VCAP_APP_HOST=0.0.0.0
VCAP_APP_PORT=8081
ME_CONFIG_OPTIONS_EDITORTHEME=ambiance
```
Port: 8081

## Redis web
Image: ai.tencentmusic.com/tme-public/patrikx3:latest

Environment variables:
```
REDIS_NAME=xx
REDIS_HOST=xx
REDIS_PORT=xx
REDIS_PASSWORD=xx
```
Port: 7843

## Neo4j graph database

Image: ai.tencentmusic.com/tme-public/neo4j:4.4

Environment variables:
```
NEO4J_AUTH=neo4j/admin
```
Ports: 7474, 7687

## Jaeger distributed tracing

Image: jaegertracing/all-in-one:1.29

Ports: 5775, 16686
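
# Inference service

## How versions, domains, and pods relate
`$service_name=$service_type-$model_name-$model_version (only the digits of the version are kept)`

![image](https://user-images.githubusercontent.com/20157705/168520163-8b7d4e5f-6a7b-4c8d-ae9f-1a2b3c4d5e6f.png)

`$k8s-deployment-name=$service_name`

![image](https://user-images.githubusercontent.com/20157705/168520250-9c8e5f6a-7b8c-4d9e-bf0a-2b3c4d5e6f7a.png)

`$k8s-hpa-name=$service_name`

An HPA is created only when the minimum and maximum replica counts differ.

![image](https://user-images.githubusercontent.com/20157705/168520346-ad9f6a7b-8c9d-4eaf-c01b-3c4d5e6f7a8b.png)

`$k8s-service-name=$service_name` is used for proxying domain traffic.

`$k8s-service-name=$service_name-external` is used for proxying IP/L5 traffic.

![image](https://user-images.githubusercontent.com/20157705/168520468-beaf7b8c-9dae-4fb0-d12c-4d5e6f7a8b9c.png)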


## Built-in domains

Automatic domain configuration requires wildcard-domain support. For example, with the wildcard domain domain = *.kfserving.woa.com

Production domain

http://$service_name.service.$domain

Test environment domains

http://test.$service_name.service.$domain
http://debug.$service_name.service.$domain
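To make the three URL patterns concrete, here is a minimal Python sketch that expands them for one service. It assumes that `$domain` in the patterns means the wildcard domain with its leading `*.` removed; the `builtin_urls` helper and the sample service name are hypothetical.

```python
def builtin_urls(name, wildcard_domain):
    """Expand the production, test, and debug URL patterns for one service."""
    # Assumption: "*.kfserving.woa.com" contributes "kfserving.woa.com" to the URL.
    if wildcard_domain.startswith("*."):
        domain = wildcard_domain[2:]
    else:
        domain = wildcard_domain
    base = f"{name}.service.{domain}"
    return {
        "prod": f"http://{base}",
        "test": f"http://test.{base}",
        "debug": f"http://debug.{base}",
    }


urls = builtin_urls("tfserving-resnet50-20221001", "*.kfserving.woa.com")
for env, url in urls.items():
    print(env, url)
```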

## Custom domains

Users can set a service's access domain through the host field, but the value must end with the wildcard domain.

Multiple services may be configured with the same domain.
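The suffix rule for custom hosts can be checked with a few lines of Python. This is only a sketch: `is_valid_custom_host` is a hypothetical helper, and the platform may apply stricter or different validation.

```python
def is_valid_custom_host(host, wildcard_domain):
    """Return True when host ends with the wildcard domain's suffix."""
    # "*.kfserving.woa.com" becomes ".kfserving.woa.com"; keeping the dot
    # prevents look-alike hosts such as "evilkfserving.woa.com" from matching.
    suffix = wildcard_domain.lstrip("*")
    return host.endswith(suffix) and len(host) > len(suffix)


print(is_valid_custom_host("my-model.kfserving.woa.com", "*.kfserving.woa.com"))
print(is_valid_custom_host("my-model.example.com", "*.kfserving.woa.com"))
```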
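
## Traffic mirroring and splitting

Multiple services (for the same model or for different models) are configured with the same domain.
1. The traffic-splitting field controls what share of traffic is sent to other services; the remaining traffic stays with the current service.
2. The traffic-mirroring field controls what share of traffic is copied to other services, but only the current service's response is returned to the client.

![image](https://user-images.githubusercontent.com/20157705/168521059-cfb08c9d-aebf-40c1-e23d-5e6f7a8b9cad.png)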
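
## Canary upgrades

1. Canary upgrade of a single service: modify the service's configuration and redeploy; the service performs a rolling update of its pods automatically.
2. Canary upgrade across services, for example between versions of the same model: the services share one domain, and once the newly deployed service is up and healthy, the old services on that domain are taken offline automatically.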

## Autoscaling

Trigger conditions for autoscaling: custom metrics are supported, and you can use a single metric or several. Example: cpu:50%,mem:50%,gpu:50%
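A trigger string of the form `cpu:50%,mem:50%,gpu:50%` is easy to turn into structured data. The sketch below is a hypothetical parser written for illustration; it assumes each item is `name:NN%` and does not reflect how the platform itself reads the field.

```python
def parse_scaling_metrics(text):
    """Parse a string such as "cpu:50%,mem:50%" into {metric: percent}."""
    metrics = {}
    for item in text.split(","):
        name, _, value = item.strip().partition(":")
        if not name or not value.endswith("%"):
            raise ValueError(f"expected name:NN%, got {item!r}")
        metrics[name] = int(value[:-1])  # int() rejects non-numeric thresholds
    return metrics


print(parse_scaling_metrics("cpu:50%,mem:50%,gpu:50%"))
```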

## Environment variables

Environment variables provided by the system:
```bash
KUBEFLOW_ENV=test
KUBEFLOW_MODEL_PATH=
KUBEFLOW_MODEL_VERSION=
KUBEFLOW_MODEL_IMAGES=
KUBEFLOW_MODEL_NAME=
KUBEFLOW_AREA=shanghai/guangzhou

K8S_NODE_NAME=
K8S_POD_NAMESPACE=
K8S_POD_IP=
K8S_HOST_IP=
K8S_POD_NAME=
```
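Inside the serving container these variables can be read like any other environment variable. A minimal Python sketch follows; the `read_platform_env` helper and the choice of an empty string as the fallback are assumptions made for this example.

```python
import os

# Variable names as listed in the section above.
SYSTEM_ENV_KEYS = (
    "KUBEFLOW_ENV",
    "KUBEFLOW_MODEL_PATH",
    "KUBEFLOW_MODEL_VERSION",
    "KUBEFLOW_MODEL_IMAGES",
    "KUBEFLOW_MODEL_NAME",
    "KUBEFLOW_AREA",
    "K8S_NODE_NAME",
    "K8S_POD_NAMESPACE",
    "K8S_POD_IP",
    "K8S_HOST_IP",
    "K8S_POD_NAME",
)


def read_platform_env(environ=None):
    """Collect the platform-provided variables, using "" for any that are unset."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, "") for key in SYSTEM_ENV_KEYS}


settings = read_platform_env({"KUBEFLOW_ENV": "test", "K8S_POD_NAME": "pod-0"})
print(settings["KUBEFLOW_ENV"], settings["K8S_POD_NAME"])
```

Passing a plain dictionary, as in the last lines, keeps the helper easy to exercise outside a cluster; calling it with no argument reads the real process environment.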
@ -1,6 +1,6 @@
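# Building images online

![image](https://user-images.githubusercontent.com/20157705/145206257-e3b0c442-98fc-4c14-9afb-f4c8996fb924.png)
![image](https://user-images.githubusercontent.com/20157705/167534695-d63b8239-e85e-42c4-bc7e-5a5b6e9a0b4c.png)

Advanced configuration via the extended field (for example):
```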
@ -707,19 +707,16 @@ NOTEBOOK_GPU_TYPE='NVIDIA'

 # Help documents for the various model list pages
 HELP_URL={
-    "pipeline":"http://xx.xx/xx",
-    "job_template":"http://xx.xx/xx",
-    "task":"http://xx.xx/xx",
-    "hp":"http://xx.xx/xx",
-    "nni":"http://xx.xx/xx",
-    "images":"http://xx.xx/xx",
-    "notebook":"http://xx.xx/xx",
-    "service":"http://xx.xx/xx",
-    "kfserving":"http://xx.xx/xx",
-    "inferenceservice":"http://xx.xx/xx",
-    "model":"http://xx.xx/xx",
-    "run":"http://xx.xx/xx",
-    "docker":"http://xx.xx/xx"
+    "pipeline":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
+    "job_template":"https://github.com/tencentmusic/cube-studio/tree/master/job-template",
+    "task":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
+    "nni":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
+    "images":"https://github.com/tencentmusic/cube-studio/tree/master/images",
+    "notebook":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
+    "service":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
+    "inferenceservice":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
+    "run":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example",
+    "docker":"https://github.com/tencentmusic/cube-studio/tree/master/docs/example"
 }

 # Names of templates that use the user's image directly instead of the image defined in the template

@ -161,7 +161,7 @@ def init():
     "range": "",
     "default": "ai.tencentmusic.com/tme-public/ubuntu-gpu:cuda10.1-cudnn7-python3.6",
     "placeholder": "",
-    "describe": "Image to debug, <a target='_blank' href='https://github.com/tencentmusic/cube-studio/tree/master/docs/example/images'>base image reference<a>",
+    "describe": "Image to debug, <a target='_blank' href='https://github.com/tencentmusic/cube-studio/tree/master/imagess'>base image reference<a>",
     "editable": 1,
     "condition": "",
     "sub_args": {}
