[AI in Practice] Using vLLM, a Deployment and Inference Framework for Large Language Models

  • Introduction to vLLM
  • Environment Setup
    • Requirements
    • Installing vLLM
  • Compute Capability Requirements
    • How to Check Compute Capability
    • Compute Capability Errors
  • Quickstart
    • Offline Batched Inference
    • API Server
    • OpenAI-Compatible Server
  • Serving
    • Distributed Inference and Serving
    • Serving with SkyPilot
  • Models
    • Models Supported by vLLM
    • Adding Your Own Model
  • References

Introduction to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Optimized CUDA kernels

vLLM is flexible and easy to use:

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server

vLLM seamlessly supports most HuggingFace models, including:

  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • LLaMA (lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)

Environment Setup

Requirements

  • OS: Linux

  • Python: 3.8 or higher

  • CUDA: 11.0 – 11.8

  • GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)

Installing vLLM

  • Install with pip:
pip install vllm
  • Install from source:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .  # This may take 5-10 minutes.

Compute Capability Requirements

How to Check Compute Capability

  1. Open the Bing search page: https://cn.bing.com/
  2. Switch the search mode to "International".
  3. Enter the query:
    t4 GPUs compute capability

    My GPU is a T4; replace "t4" with your own GPU model.

  4. The search results show the GPU's compute capability (7.5 for the T4).
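
Alternatively, you can check the compute capability locally; a minimal sketch using PyTorch (assuming torch, which vLLM already depends on, is installed):

import torch

# Print the compute capability of each visible GPU; vLLM needs >= 7.0.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i}: {name}, compute capability {major}.{minor}")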

Compute Capability Errors

vLLM requires a GPU compute capability of at least 7.0; on older GPUs it fails with the following error:

RuntimeError: GPUs with compute capability less than 7.0 are not supported.

Quickstart

Offline Batched Inference

Example code:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

API Server

The example uses a FastAPI server; it relies on the AsyncLLMEngine class to handle requests asynchronously.

  • Start the server:
python -m vllm.entrypoints.api_server

Default address: http://localhost:8000
Default model: OPT-125M

  • Test:
curl http://localhost:8000/generate \
    -d '{
        "prompt": "San Francisco is a",
        "use_beam_search": true,
        "n": 4,
        "temperature": 0
    }'
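
The same request can also be sent from Python; a minimal sketch using the requests library, assuming the default host/port and the /generate endpoint used above:

import requests

# Send the same request as the curl example to the default vLLM API server.
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "San Francisco is a",
        "use_beam_search": True,
        "n": 4,
        "temperature": 0,
    },
)
print(response.json())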

OpenAI-Compatible Server

  • Start the server:
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m

Optional arguments: --host, --port

  • List the served models:
curl http://localhost:8000/v1/models
  • Test:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
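
Since the server is OpenAI-compatible, the official openai Python package can be pointed at it as well; a minimal sketch, assuming the pre-1.0 openai client interface:

import openai

# Point the OpenAI client at the local vLLM server instead of api.openai.com.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"  # the local server does not validate the key

completion = openai.Completion.create(
    model="facebook/opt-125m",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion)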

Serving

Distributed Inference and Serving

Install the dependency:

pip install ray
  • Multi-GPU inference
    Inference on 4 GPUs:
from vllm import LLM

llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")

Use tensor_parallel_size to set the number of GPUs.

  • Multi-GPU serving
python -m vllm.entrypoints.api_server \
    --model facebook/opt-13b \
    --tensor-parallel-size 4
  • Scaling to multiple nodes
    Start the Ray runtime before running vLLM:
# On head node
ray start --head

# On worker nodes
ray start --address=

Serving with SkyPilot

Install SkyPilot:

pip install skypilot
sky check

serving.yaml:

resources:
  accelerators: A100

envs:
  MODEL_NAME: decapoda-research/llama-13b-hf
  TOKENIZER: hf-internal-testing/llama-tokenizer

setup: |
  conda create -n vllm python=3.9 -y
  conda activate vllm

  git clone https://github.com/vllm-project/vllm.git
  cd vllm
  pip install .
  pip install gradio

run: |
  conda activate vllm
  echo 'Starting vllm api server...'
  python -u -m vllm.entrypoints.api_server \
    --model $MODEL_NAME \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --tokenizer $TOKENIZER 2>&1 | tee api_server.log &
  echo 'Waiting for vllm api server to start...'
  while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
  echo 'Starting gradio server...'
  python vllm/examples/gradio_webserver.py

Launch the service:

sky launch serving.yaml

Other optional arguments:

sky launch -c vllm-serve-new -s serving.yaml --gpus A100:8 --env MODEL_NAME=decapoda-research/llama-65b-hf

Test:
Open the Gradio share link printed by the server in a browser: https://<gradio-id>.gradio.live

Models

Models Supported by vLLM

https://vllm.readthedocs.io/en/latest/models/supported_models.html#supported-models

Adding Your Own Model

The following document provides a high-level guide to integrating a HuggingFace Transformers model into vLLM:
https://vllm.readthedocs.io/en/latest/models/adding_model.html

References

1. https://vllm.readthedocs.io/en/latest/
2. https://github.com/vllm-project/vllm
3. https://vllm.ai/
4. https://github.com/vllm-project/vllm/discussions
5. https://github.com/skypilot-org/skypilot/blob/master/llm/vllm