Qwen 2.5 Local LLM

A FastAPI proxy that serves inference through a vLLM backend. Model: Qwen/Qwen2.5-7B-Instruct-AWQ (AWQ 4-bit). Runs on an RTX 3060 (CUDA 2); the proxy listens on port 8043.

Status: Operational

Info: AWQ 4-bit quantization reduces memory usage. Supports tool calling, agent capabilities, and structured reasoning through the /agent endpoint. Context window: 8K tokens.

Base URL

https://pnt.badt.vn/qwen

Authentication

Bearer token (API_AI_TOKEN):

Authorization: Bearer <API_AI_TOKEN>
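
The header above can be built in client code as in the following minimal sketch. It assumes the token is stored in an environment variable named API_AI_TOKEN (the variable name mirrors the placeholder above; where the token is actually stored is an assumption), and the helper name `auth_headers` is hypothetical.

```python
import os


def auth_headers(token=None):
    """Return HTTP headers carrying the bearer token.

    Falls back to the API_AI_TOKEN environment variable (an assumed
    storage location) when no token is passed explicitly.
    """
    token = token or os.environ.get("API_AI_TOKEN", "")
    if not token:
        raise ValueError("no token given and API_AI_TOKEN is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

Raising early on a missing token avoids sending a request with an empty `Bearer` value and getting a less obvious rejection from the server.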

API Endpoints

POST /v1/chat/completions (OpenAI-compatible)

OpenAI-compatible format; requests are forwarded to the vLLM backend on port 8091.

Param	Type	Required	Description
model	string	Yes	qwen-7b-awq-vllm
messages	array	Yes	Message objects (role + content)
max_tokens	int	No	Default: 4096
temperature	float	No	0.0 - 2.0, default: 0.7
stream	bool	No	SSE streaming
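
The parameters in the table above map onto a request body roughly as sketched below, using only the Python standard library. The helper names are hypothetical. The response handling assumes the standard OpenAI shapes (`choices[0].message.content` for a normal reply, `data: {...}` lines ending with `data: [DONE]` for SSE); the page only states that the endpoint is OpenAI-compatible, so treat those shapes as assumptions.

```python
import json
import urllib.request

BASE_URL = "https://pnt.badt.vn/qwen"


def build_chat_completion_request(messages, token, max_tokens=4096,
                                  temperature=0.7, stream=False):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    payload = {
        "model": "qwen-7b-awq-vllm",
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def iter_stream_content(lines):
    """Yield text fragments from OpenAI-style SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed OpenAI-style end-of-stream marker
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


def chat(messages, token, max_tokens=4096, temperature=0.7):
    """Send a non-streaming request and return the reply text (needs network)."""
    req = build_chat_completion_request(messages, token, max_tokens, temperature)
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    # Assumes the standard OpenAI response shape.
    return body["choices"][0]["message"]["content"]
```

For streaming, build the request with `stream=True`, open it with `urllib.request.urlopen`, and pass the response object to `iter_stream_content`; iterating over the response yields one line of bytes at a time.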

POST /chat (native)

Native endpoint; supports system_prompt and streaming.

Param	Type	Required	Description
messages	array	Yes	Message objects
system_prompt	string	No	System instruction
max_tokens	int	No	Default: 4096
temperature	float	No	Default: 0.7
stream	bool	No	SSE streaming
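
A request to the native endpoint could be assembled as in this sketch. It differs from the OpenAI-compatible call in that there is no `model` field and the system instruction travels in `system_prompt` rather than as a message. The helper name is hypothetical, the message objects are assumed to use the same role + content shape as above, and the response format is not documented here, so the sketch returns the parsed JSON without interpreting it.

```python
import json
import urllib.request

BASE_URL = "https://pnt.badt.vn/qwen"


def build_native_chat_request(messages, token, system_prompt=None,
                              max_tokens=4096, temperature=0.7, stream=False):
    """Build (but do not send) a POST request for the native /chat endpoint."""
    payload = {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }
    # system_prompt is optional, so include it only when provided.
    if system_prompt is not None:
        payload["system_prompt"] = system_prompt
    return urllib.request.Request(
        f"{BASE_URL}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def native_chat(messages, token, system_prompt=None):
    """Send the request and return the decoded JSON body (needs network)."""
    req = build_native_chat_request(messages, token, system_prompt)
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```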

GET /health (utility)

Checks the health of the proxy service and the vLLM backend.
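
A health probe might look like the sketch below. The response body format is not documented on this page, so the function returns the HTTP status plus whatever body came back (parsed as JSON when possible). Whether /health requires the bearer token is also an assumption: the token is sent only if one is supplied. The `opener` parameter is a hypothetical hook that makes the function testable without network access.

```python
import json
import urllib.request

BASE_URL = "https://pnt.badt.vn/qwen"


def check_health(token=None, opener=urllib.request.urlopen, timeout=10):
    """Call GET /health and return (http_status, body)."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    req = urllib.request.Request(f"{BASE_URL}/health", headers=headers,
                                 method="GET")
    with opener(req, timeout=timeout) as resp:
        status = resp.status
        raw = resp.read().decode("utf-8")
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        body = raw  # fall back to plain text if the body is not JSON
    return status, body
```

With the default opener, a non-2xx response raises `urllib.error.HTTPError`, so callers that poll this endpoint should catch that exception and treat it as unhealthy.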

Technical specifications

Model: Qwen/Qwen2.5-7B-Instruct-AWQ

Quantization: AWQ 4-bit

Context: 8K tokens

vLLM port: 8091 (CUDA 2)

Proxy port: 8043 (FastAPI)

GPU: RTX 3060 12GB
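
To see why 4-bit quantization matters on a 12GB card, the following back-of-the-envelope calculation compares weight storage at 16 and 4 bits per parameter. The parameter count of about 7.6 billion is an assumption taken from the public Qwen2.5-7B model card, not from this page. The estimate covers weights only: it ignores quantization scales, any layers kept in higher precision, the KV cache, and activations, so real usage is higher.

```python
PARAMS = 7.6e9   # approximate parameter count (assumption, see lead-in)
GPU_GB = 12      # RTX 3060 memory from the specification above


def weight_gb(params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


fp16_gb = weight_gb(PARAMS, 16)  # about 15.2 GB, more than the card holds
awq4_gb = weight_gb(PARAMS, 4)   # about 3.8 GB, leaving room for the KV cache

print(f"FP16 weights: {fp16_gb:.1f} GB (fits in {GPU_GB} GB: {fp16_gb < GPU_GB})")
print(f"4-bit weights: {awq4_gb:.1f} GB (fits in {GPU_GB} GB: {awq4_gb < GPU_GB})")
```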

Capabilities

Chat: multi-turn conversation

Analysis: text analysis

Translation: multilingual

Tool Calling: function calling

Agent: agent capabilities