What is vLLM-DeepSeek-R1-Distill
Knowledge distillation can transfer the reasoning capabilities of large models to smaller ones, significantly enhancing the performance of compact models. Using reasoning data generated by DeepSeek-R1, the DeepSeek team fine-tuned several widely used small dense models and open-sourced the DeepSeek-R1-Distill series, built on the Qwen2.5 and Llama3 architectures and covering 1.5B, 7B, 8B, 14B, 32B, and 70B parameter scales. The model names and recommended GPU configurations are listed below:
| Model Name | Recommended Configuration |
| --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | 1x RTX 4090 |
| DeepSeek-R1-Distill-Qwen-7B | 1x RTX 4090 |
| DeepSeek-R1-Distill-Llama-8B | 1x RTX 4090 |
| DeepSeek-R1-Distill-Qwen-14B | 2x RTX 4090 |
| DeepSeek-R1-Distill-Qwen-32B | 4x RTX 4090 |
| DeepSeek-R1-Distill-Llama-70B | 8x RTX 4090 |
Advantages of Distilled Models:
- Performance: Maintains performance comparable to the original large model at a much smaller parameter count; the distilled small models outperform same-size counterparts trained directly with reinforcement learning (RL) on reasoning tasks.
- Resource Efficiency: Compact models are well suited to hardware with limited resources (single- or multi-GPU setups), delivering faster inference with lower GPU memory consumption.
vLLM-DeepSeek-R1-Distill integrates the DeepSeek-R1-Distill models with the vLLM framework, which optimizes inference performance for large language models through efficient GPU memory management and distributed inference support, substantially improving serving efficiency.
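For reference, vLLM can also be driven directly from Python rather than through the `vllm serve` command used later in this guide. The following is a minimal offline-inference sketch; the model path is borrowed from the serve examples below and is an assumption about your instance layout:

```python
# Minimal offline-inference sketch with vLLM's Python API (optional; the
# serve-based workflow below does not require this). Adjust the model path
# to match your instance.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/model/HuggingFace/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # assumed path
    max_model_len=65536,  # same context limit used in the serve example below
)
params = SamplingParams(temperature=0.6, max_tokens=512)

outputs = llm.generate(["Briefly explain knowledge distillation."], params)
print(outputs[0].outputs[0].text)
```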
How to Run vLLM-DeepSeek-R1-Distill
Starting vLLM Service
The pre-configured environment requires no additional setup. Launch the service with a single command!
After initializing your instance, execute the following command in JupyterLab:
```bash
# Start the vLLM API service
vllm serve <model_path> --port 8000
```
Example for DeepSeek-R1-Distill-Qwen-7B:
```bash
# DeepSeek-R1-Distill-Qwen-7B
vllm serve /model/HuggingFace/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --port 8000 --max-model-len 65536
```
When you see the output below in the console, the service is successfully running:
The service uses port 8000 by default. Enable this port in your firewall for public access.
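Once the service is running, you can confirm it is reachable with a quick request to the OpenAI-compatible `/v1/models` endpoint. This is an optional check, not part of the pre-configured scripts:

```python
# Optional health check: list the models served by the local vLLM instance.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # should show the model path passed to `vllm serve`
```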
Recommended GPU Configurations & Launch Commands
For 1x RTX 4090:
```bash
# DeepSeek-R1-Distill-Qwen-1.5B
vllm serve /model/HuggingFace/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 8000

# DeepSeek-R1-Distill-Qwen-7B
vllm serve /model/HuggingFace/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --port 8000 --max-model-len 65536

# DeepSeek-R1-Distill-Llama-8B
vllm serve /model/HuggingFace/deepseek-ai/DeepSeek-R1-Distill-Llama-8B --port 8000 --max-model-len 17984
```
For 2x RTX 4090:
```bash
# DeepSeek-R1-Distill-Qwen-14B
vllm serve /model/HuggingFace/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B --port 8000 -tp 2 --max-model-len 59968
```
For 4x RTX 4090:
```bash
# DeepSeek-R1-Distill-Qwen-32B
vllm serve /model/HuggingFace/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --port 8000 -tp 4 --max-model-len 65168
```
For 8x RTX 4090:
```bash
# DeepSeek-R1-Distill-Llama-70B
vllm serve /model/HuggingFace/deepseek-ai/DeepSeek-R1-Distill-Llama-70B --port 8000 -tp 8 --max-model-len 88048
```
Running Examples
After starting the vLLM service (do not terminate the process), create a new Launcher via the top-left "+" button:
Open a new Terminal from the Launcher:
In the workspace/ directory, run test.py to validate the model. This script asks the model to generate a 200-word essay:

```bash
python test.py
```

You can view the full code of test.py in JupyterLab.
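For orientation, here is a minimal sketch of what a script like test.py might look like, assuming the Qwen-7B service started above on port 8000; the bundled test.py may differ in model path, prompt, and parameters:

```python
# Hypothetical sketch of a test script: send one chat request to the local
# vLLM OpenAI-compatible endpoint and print the reply.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="/model/HuggingFace/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # served model name
    messages=[{"role": "user", "content": "Write a 200-word essay about the ocean."}],  # assumed prompt
    temperature=0.6,
)
print(response.choices[0].message.content)
```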
After execution, the console will display the model's response (see below). Because the DeepSeek-R1-Distill series are reasoning models, the reply has two parts: the text wrapped in <think> and </think> tags is the model's reasoning process, and the text outside these tags is the final answer we want.
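If downstream code only needs the final answer, the reasoning block can be stripped; a minimal sketch (not part of the provided scripts):

```python
# Remove the <think>...</think> reasoning block and keep only the final answer.
import re

raw = "<think>First I outline the essay...</think>The ocean covers most of Earth's surface..."
final_answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
print(final_answer)
```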
Benchmark Scores & Usage Scenarios
Recommended Use Cases:
1. Resource-constrained scenarios: Qwen-7B (Codeforces rating 1189 / MATH-500 92.8)
2. Balanced performance needs: Qwen-14B (Codeforces rating 1481 / MATH-500 93.9)
3. High-performance demands: Qwen-32B or Llama-70B (Codeforces rating 1600+ / MATH-500 94%+)
4. Quality-critical tasks: Llama-70B (GPQA Diamond 65.2 / AIME cons@64 86.7)