Qwen3 14B Claude-Distill

FastAPI proxy inference qua vLLM backend (port 8093). Model Qwen3 14B distilled từ Claude. RTX 5060 Ti 16GB CUDA 3.

Hoạt động
Model mạnh nhất hệ thống. Tốt cho coding, debugging, reasoning, analysis. OpenAI-compatible + tool calling. Dùng VLLM_API_KEY cho auth.

Base URL

https://pnt.badt.vn/agenticcoder

Authentication

Bearer token (VLLM_API_KEY):

Authorization: Bearer <VLLM_API_KEY>

API Endpoints

POST /v1/chat/completions OpenAI Compatible

OpenAI-compatible format, gọi vLLM backend port 8093.

ParamTypeRequiredDescription
modelstringYesqwen3-claude-distill
messagesarrayYesMessage objects
max_tokensintNoDefault: 4096
temperaturefloatNo0.0 - 2.0, default: 0.7
streamboolNoSSE streaming
toolsarrayNoFunction/tool definitions
POST /chat Native

Native endpoint, system_prompt + streaming + tool calling.

ParamTypeRequiredDescription
messagesarrayYesMessage objects
system_promptstringNoSystem instruction
max_tokensintNoDefault: 4096
temperaturefloatNoDefault: 0.7
streamboolNoSSE
GET /health Utility

Service & vLLM health.

GET /v1/models OpenAI

List available models.

Thông số kỹ thuật

Modelqwen3-claude-distill (14B)
Context~14K tokens
vLLM Port8093 (CUDA 3)
Proxy Port8045 (FastAPI)
GPURTX 5060 Ti 16GB
AuthVLLM_API_KEY

Khả năng

Coding

Code gen & debug

Reasoning

Deep reasoning

Debug

Troubleshoot

Chat

Đa lượt

Tool Calling

Function calling