Qwen 2.5 Local LLM

FastAPI proxy inference qua vLLM backend. Model Qwen/Qwen2.5-7B-Instruct-AWQ (AWQ 4-bit). RTX 3060, CUDA 2, port 8043.

Hoạt động
Thông tin: AWQ 4-bit quantization giúp giảm memory. Hỗ trợ tool calling, agent capabilities, /agent endpoint structured reasoning. Context 8K tokens.

Base URL

https://pnt.badt.vn/qwen

Authentication

Bearer token (API_AI_TOKEN):

Authorization: Bearer <API_AI_TOKEN>

API Endpoints

POST /v1/chat/completions OpenAI Compatible

OpenAI-compatible format, gọi vLLM backend port 8091.

ParamTypeRequiredDescription
modelstringYesqwen-7b-awq-vllm
messagesarrayYesMessage objects (role + content)
max_tokensintNoDefault: 4096
temperaturefloatNo0.0 - 2.0, default: 0.7
streamboolNoSSE streaming
POST /chat Native

Native endpoint, hỗ trợ system_prompt, streaming.

ParamTypeRequiredDescription
messagesarrayYesMessage objects
system_promptstringNoSystem instruction
max_tokensintNoDefault: 4096
temperaturefloatNoDefault: 0.7
streamboolNoSSE streaming
GET /health Utility

Health check service and vLLM backend.

Thông số kỹ thuật

ModelQwen/Qwen2.5-7B-Instruct-AWQ
QuantizationAWQ 4-bit
Context8K tokens
vLLM Port8091 (CUDA 2)
Proxy Port8043 (FastAPI)
GPURTX 3060 12GB

Khả năng

Chat

Đa lượt

Analysis

Phân tích text

Translation

Đa ngôn ngữ

Tool Calling

Function calling

Agent

Agent capabilities