Qwen 2.5 Local LLM
FastAPI proxy inference qua vLLM backend. Model Qwen/Qwen2.5-7B-Instruct-AWQ (AWQ 4-bit). RTX 3060, CUDA 2, port 8043.
Hoạt động
Thông tin:
AWQ 4-bit quantization giúp giảm memory. Hỗ trợ tool calling, agent capabilities, /agent endpoint structured reasoning. Context 8K tokens.
Base URL
https://pnt.badt.vn/qwen
Authentication
Bearer token (API_AI_TOKEN):
Authorization: Bearer <API_AI_TOKEN>
API Endpoints
POST
/v1/chat/completions
OpenAI Compatible
OpenAI-compatible format, gọi vLLM backend port 8091.
| Param | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | qwen-7b-awq-vllm |
| messages | array | Yes | Message objects (role + content) |
| max_tokens | int | No | Default: 4096 |
| temperature | float | No | 0.0 - 2.0, default: 0.7 |
| stream | bool | No | SSE streaming |
POST
/chat
Native
Native endpoint, hỗ trợ system_prompt, streaming.
| Param | Type | Required | Description |
|---|---|---|---|
| messages | array | Yes | Message objects |
| system_prompt | string | No | System instruction |
| max_tokens | int | No | Default: 4096 |
| temperature | float | No | Default: 0.7 |
| stream | bool | No | SSE streaming |
GET
/health
Utility
Health check service and vLLM backend.
Thông số kỹ thuật
ModelQwen/Qwen2.5-7B-Instruct-AWQ
QuantizationAWQ 4-bit
Context8K tokens
vLLM Port8091 (CUDA 2)
Proxy Port8043 (FastAPI)
GPURTX 3060 12GB
Khả năng
Chat
Đa lượt
Analysis
Phân tích text
Translation
Đa ngôn ngữ
Tool Calling
Function calling
Agent
Agent capabilities