Qwen3 14B LLM - API Documentation

Overview

Qwen3 14B LLM Service

A locally hosted LLM service running Qwen3-14B-Claude-4.5-Opus-Distill (GGUF Q4_K_M). The model is distilled from Claude Opus 4.5 and carries its in-depth reasoning ability over into a compact, efficient 14B model. It runs on an RTX 5060 Ti 16GB GPU with a vLLM backend and supports streaming and a full OpenAI-compatible API.

🧬 Claude-Distilled Model

This model was trained with knowledge distillation from Claude Opus 4.5, one of the strongest LLMs currently available. The result is a 14B model that reasons well beyond what its size suggests, particularly on complex medical tasks, clinical analysis, and structured question answering.

Technical specifications

Model	Qwen3-14B-Claude-4.5-Opus-Distill (`qwen3-claude-distill`)
Quantization	GGUF Q4_K_M (~14 GiB loaded)
Framework	vLLM (OpenAI-compatible backend)
Architecture ports	vLLM backend :8093, FastAPI gateway :8045
GPU	RTX 5060 Ti 16GB (CUDA_VISIBLE_DEVICES=3)
VRAM Usage	~14GB (model weights + KV cache)
Context Window	14,384 tokens (`max_model_len=14384`)
Languages	Vietnamese, English, Chinese, and 29+ other languages
Role in the system	High-reasoning LLM: complex clinical analysis, medical teaching

Key features

High Reasoning

In-depth reasoning, distilled from Claude Opus 4.5 for complex medical tasks

Streaming

Server-Sent Events (SSE) streaming: token-by-token responses, with VS Code Continue integration

14K Context

14,384-token context window for analyzing long patient records and medical documents

Multilingual

High-quality Vietnamese, English, and Chinese, suited to hospital environments

OpenAI API

Drop-in replacement for the OpenAI Python/JS SDK; existing code needs no changes beyond the base URL and token

Local & Secure

All processing is local; patient data never leaves the system

API Endpoints

Base URL

https://pnt.badt.vn/qwen3

Endpoint list

GET /health

Health check: returns the status of the service and the vLLM backend. No authentication required.

Response

{
  "status": "healthy"
}

Returns 503 if the vLLM backend is not ready yet.
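Because /health returns 503 while the backend is still loading, a client can poll it before sending real requests. The following is a minimal sketch using only the standard library; the function names, the attempt count, and the delay are illustrative assumptions, not part of the service.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://pnt.badt.vn/qwen3/health"


def is_healthy(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Return True on HTTP 200, False on 503 or when the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        # HTTPError must be caught before URLError (it is a subclass).
        if err.code == 503:
            return False
        raise
    except urllib.error.URLError:
        return False


def wait_until_ready(check=is_healthy, attempts: int = 30,
                     delay: float = 10.0, sleep=time.sleep) -> bool:
    """Poll `check` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready()` with the defaults waits up to about five minutes; `check` and `sleep` are injectable so the loop can be exercised without a network.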

GET /

Root info: service status and the vLLM ready flag.

Response

{
  "service": "Qwen3-14B API",
  "vllm_ready": true
}

POST /chat

Flexible endpoint that accepts both Simple Chat (`message` field) and the OpenAI format (`messages` field).

Note: With the `messages` format, the endpoint automatically returns a streaming (SSE) response. With the simple `message` field, it returns regular JSON.

Request — Simple Chat

{
  "message": "string (required)",
  "system": "You are a professional medical assistant.",
  "temperature": 0.7,
  "max_tokens": 2048
}

Response — Simple Chat

{
  "response": "string",
  "model": "qwen3-14b",
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 180,
    "total_tokens": 205
  }
}

Request — OpenAI Messages Format (Streaming)

{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "temperature": 0.7,
  "max_tokens": 2048
}
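The two request shapes accepted by /chat can be wrapped in a small helper so callers know in advance whether to expect JSON or an SSE stream. This is a sketch only; `build_chat_payload` and `expects_stream` are hypothetical names, and the defaults mirror the sample values shown above.

```python
def build_chat_payload(prompt=None, messages=None, system=None,
                       temperature=0.7, max_tokens=2048):
    """Build a /chat body from either a single prompt or a list of messages."""
    if (prompt is None) == (messages is None):
        raise ValueError("pass exactly one of prompt or messages")
    payload = {"temperature": temperature, "max_tokens": max_tokens}
    if messages is not None:
        # OpenAI format: the documented behavior is an SSE stream.
        payload["messages"] = list(messages)
    else:
        # Simple Chat: the documented behavior is a regular JSON response.
        payload["message"] = prompt
        if system is not None:
            payload["system"] = system
    return payload


def expects_stream(payload) -> bool:
    """True when the payload uses the `messages` field."""
    return "messages" in payload
```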

POST /v1/chat/completions

OpenAI-compatible chat completions endpoint. Fully compatible with the OpenAI Python/JS SDK and VS Code Continue.

Request body

{
  "model": "qwen3-14b",
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."}
  ],
  "temperature": 0.7,
  "max_tokens": 2048
}

GET /v1/models

Lists available models (OpenAI format). Proxied to the vLLM backend.
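Listing the models is a quick way to confirm which model ID the server actually exposes before using it in a request. The sketch below assumes the standard OpenAI list shape (`{"object": "list", "data": [{"id": ...}]}`) and that the token is sent as a Bearer header; the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://pnt.badt.vn/qwen3/v1"


def extract_model_ids(body: dict) -> list[str]:
    """Pull the model IDs out of an OpenAI-style list response."""
    return [entry["id"] for entry in body.get("data", [])]


def fetch_model_ids(token: str, base_url: str = BASE_URL,
                    timeout: float = 10.0) -> list[str]:
    """GET {base_url}/models with a Bearer token and return the model IDs."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

For example, `fetch_model_ids(os.environ["QWEN3_API_TOKEN"])` would return the IDs to pass as `model` in chat requests.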

Authentication

Bearer Token Authentication Required

All endpoints except /health require a Bearer token in the Authorization header. Contact the admin to obtain an API token.

Usage

Authorization: Bearer YOUR_API_TOKEN

Security notes

Do not share the token publicly or commit it to source code
Store the token in environment variables
Rate limiting is applied to ensure fair usage
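One way to follow these notes is to read the token from the environment at startup and fail early when it is missing. A minimal sketch, assuming the `QWEN3_API_TOKEN` variable name used in the examples in this document; `auth_headers` is a hypothetical helper.

```python
import os


def auth_headers(env_var: str = "QWEN3_API_TOKEN", environ=os.environ) -> dict:
    """Build request headers from a token stored in an environment variable."""
    token = environ.get(env_var, "").strip()
    if not token:
        # Failing here avoids sending "Bearer None" and getting a confusing 401.
        raise RuntimeError(f"{env_var} is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```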

Usage examples

1. Simple Chat (cURL)

curl -X POST https://pnt.badt.vn/qwen3/chat \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What are the symptoms of pneumonia?",
    "max_tokens": 300,
    "temperature": 0.7
  }'

Response

{
  "response": "Common symptoms of pneumonia include:\n\n1. Productive cough (sputum may be yellow, green, or red)\n2. High fever (38-40°C) with chills\n3. Shortness of breath, rapid breathing...",
  "model": "qwen3-14b",
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 135,
    "total_tokens": 157
  }
}

2. Chat with a specialized system prompt (Python)

import requests
import os

url = "https://pnt.badt.vn/qwen3/chat"
headers = {
    "Authorization": f"Bearer {os.getenv('QWEN3_API_TOKEN')}",
    "Content-Type": "application/json"
}
data = {
    "message": "58-year-old male patient, ECG: ST elevation in V1-V4, Troponin T: 3.2 ng/mL. Diagnosis and management?",
    "system": "You are an interventional cardiologist. Give an in-depth but concise analysis with specific treatment recommendations.",
    "max_tokens": 600,
    "temperature": 0.4
}

response = requests.post(url, json=data, headers=headers)
response.raise_for_status()  # surface 4xx/5xx errors before reading the body
result = response.json()
print(result["response"])

3. Health Check

curl https://pnt.badt.vn/qwen3/health

Response (when OK)

{
  "status": "healthy"
}

Response (when vLLM is not ready)

{
  "status": "vLLM not ready",
  "vllm_url": "http://localhost:8093"
}

Streaming (SSE)

⚡ Server-Sent Events Streaming

When the /chat endpoint receives a payload with a `messages` field, it automatically returns a StreamingResponse (text/event-stream). This is the default mode when used with the VS Code Continue plugin or any OpenAI-compatible client.

Streaming example in Python

import requests
import json
import os

url = "https://pnt.badt.vn/qwen3/chat"
headers = {
    "Authorization": f"Bearer {os.getenv('QWEN3_API_TOKEN')}",
    "Content-Type": "application/json"
}
data = {
    "messages": [
        {"role": "system", "content": "You are an AI physician specializing in internal medicine."},
        {"role": "user", "content": "Explain the pathogenesis of chronic heart failure."}
    ],
    "temperature": 0.6,
    "max_tokens": 800
}

with requests.post(url, json=data, headers=headers, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            line = line.decode("utf-8")
            if line.startswith("data: "):
                chunk = line[6:]
                if chunk == "[DONE]":
                    break
                try:
                    obj = json.loads(chunk)
                    delta = obj["choices"][0]["delta"].get("content") or ""  # content may be null
                    print(delta, end="", flush=True)
                except json.JSONDecodeError:
                    pass
print()  # newline

VS Code Continue integration

// ~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen3 14B (Local)",
      "provider": "openai",
      "model": "qwen3-14b",
      "apiBase": "https://pnt.badt.vn/qwen3/v1",
      "apiKey": "YOUR_API_TOKEN"
    }
  ]
}

OpenAI Compatible API

The Qwen3 service exposes a full OpenAI-compatible API and works directly with the OpenAI Python/JS SDK:

Python with the OpenAI SDK

from openai import OpenAI
import os

client = OpenAI(
    # The SDK sends api_key as "Authorization: Bearer <api_key>"
    api_key=os.environ["QWEN3_API_TOKEN"],
    base_url="https://pnt.badt.vn/qwen3/v1",
)

response = client.chat.completions.create(
    model="qwen3-14b",
    messages=[
        {"role": "system", "content": "You are an AI physician specializing in internal medicine."},
        {"role": "user", "content": "Analyze: 70-year-old patient, hypertension, Cr 2.8 mg/dL. Which antihypertensive is appropriate?"}
    ],
    temperature=0.5,
    max_tokens=500
)

print(response.choices[0].message.content)

JavaScript with the OpenAI SDK

import OpenAI from 'openai';

const client = new OpenAI({
    // The SDK sends apiKey as "Authorization: Bearer <apiKey>"
    apiKey: process.env.QWEN3_API_TOKEN,
    baseURL: 'https://pnt.badt.vn/qwen3/v1'
});

const response = await client.chat.completions.create({
    model: 'qwen3-14b',
    messages: [
        { role: 'system', content: 'You are a professional AI medical assistant.' },
        { role: 'user', content: 'Hello, I need advice about diabetes.' }
    ],
    max_tokens: 400
});

console.log(response.choices[0].message.content);

Comparison of LLMs in the system

Choose the right LLM for each task

Feature	Qwen3 14B	Qwen 2.5-7B	Gemma4 E4B
Parameters	14B (GGUF Q4_K_M)	7B (AWQ 4-bit)	4.5B eff. (BnB INT4)
Context	14K tokens	8K tokens	128K tokens
Reasoning	✅ Claude-distilled	⚡ Standard	✅ Thinking mode
Tool-calling	❌	✅	✅
Multimodal	❌ Text only	❌ Text only	✅ Text + Image
Streaming	✅ SSE	❌	❌
VRAM	~14GB	~5GB	~10GB
GPU	RTX 5060 Ti 16GB	RTX 3060 12GB	RTX 3060 12GB
Base URL	`/qwen3/v1`	`/qwen/v1`	`/gemma4/v1`
Best for	Deep analysis, teaching	Agent tasks, tool-use	Long documents, vision
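The table can be read as a simple routing rule: prefer Qwen3 14B for reasoning, and fall back to another model when the request needs tools, images, or a longer context. The sketch below encodes that reading; `ModelProfile` and `pick_model` are hypothetical, and the 8K and 128K limits are rounded approximations of the table values rather than exact server settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    base_path: str
    context_tokens: int
    tool_calling: bool
    vision: bool


# Ordered by preference: the high-reasoning model is tried first.
PROFILES = [
    ModelProfile("Qwen3 14B", "/qwen3/v1", 14_384, False, False),
    ModelProfile("Qwen 2.5-7B", "/qwen/v1", 8_000, True, False),     # "8K" in the table
    ModelProfile("Gemma4 E4B", "/gemma4/v1", 128_000, True, True),   # "128K" in the table
]


def pick_model(prompt_tokens: int, needs_tools: bool = False,
               needs_vision: bool = False) -> ModelProfile:
    """Return the first profile that satisfies every requirement."""
    for profile in PROFILES:
        if prompt_tokens >= profile.context_tokens:
            continue
        if needs_tools and not profile.tool_calling:
            continue
        if needs_vision and not profile.vision:
            continue
        return profile
    raise ValueError("no model in the table satisfies these requirements")
```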

Rate Limits

Fair Usage Policy

Max concurrent requests: depends on the vLLM configuration
Request timeout: 1,200 seconds (streaming long-form)
Max tokens per request: 2,048 (default), up to ~14,000
Max model context: 14,384 tokens
Context truncation: older conversation history is trimmed automatically when the limit is exceeded
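Since the prompt and the completion share the 14,384-token context, the usable `max_tokens` shrinks as the prompt grows. A small sketch of that arithmetic follows; `completion_budget` and the 64-token safety margin are assumptions for illustration, and the prompt token count has to come from the caller.

```python
MAX_MODEL_LEN = 14_384      # max_model_len from the specifications
DEFAULT_MAX_TOKENS = 2_048  # documented default per request


def completion_budget(prompt_tokens: int, requested: int = DEFAULT_MAX_TOKENS,
                      safety_margin: int = 64) -> int:
    """Clamp the requested max_tokens to what still fits in the context window."""
    available = MAX_MODEL_LEN - prompt_tokens - safety_margin
    if available <= 0:
        raise ValueError("prompt leaves no room for a completion")
    return min(requested, available)
```

With a 13,000-token prompt, for instance, the budget comes out at 14,384 - 13,000 - 64 = 1,320 tokens instead of the 2,048 default.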

Automatic context truncation

The service automatically trims conversation history when the total token count exceeds the limit. Strategy: always keep the system message and the user's last message, and drop the oldest messages in between.
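The strategy described above can be sketched on the client side as follows. This is not the service's actual implementation: `truncate_history` is a hypothetical helper, and `estimate_tokens` is a rough characters-per-token heuristic standing in for a real tokenizer.

```python
def estimate_tokens(message: dict) -> int:
    """Rough estimate (assumption): about 4 characters per token plus overhead."""
    return len(message["content"]) // 4 + 4


def truncate_history(messages: list, max_tokens: int, count=estimate_tokens) -> list:
    """Drop the oldest unprotected messages until the history fits max_tokens.

    System messages and the last user message are always kept.
    """
    if not messages:
        return []
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=None,
    )
    protected = {i for i, m in enumerate(messages) if m["role"] == "system"}
    if last_user is not None:
        protected.add(last_user)

    kept = list(range(len(messages)))
    total = sum(count(m) for m in messages)
    for i in range(len(messages)):  # oldest first
        if total <= max_tokens:
            break
        if i in protected:
            continue
        kept.remove(i)
        total -= count(messages[i])
    return [messages[i] for i in kept]
```

If the protected messages alone exceed the budget, this sketch returns them anyway and leaves the overflow for the server to reject or trim.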

Error Codes

Code	Error	Description
`400`	Bad Request	Invalid request parameters
`401`	Unauthorized	Bearer token missing or invalid
`422`	Validation Error	Invalid request format: either `messages` or `message` is required
`500`	Internal Server Error	Model server error
`503`	Service Unavailable	vLLM backend not ready (model still loading)
`504`	Gateway Timeout	Request timed out (>1200s)
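A client can use these codes to decide whether retrying makes sense. The sketch below shows one possible policy, not a rule set by the service: retry only on 503 (model still loading), treat 400/401/422 as caller errors, and report everything else. Both function names and the backoff parameters are assumptions.

```python
from typing import Iterator


def classify_status(status: int) -> str:
    """Map an HTTP status code to a client action."""
    if status < 400:
        return "ok"
    if status == 503:
        return "retry"        # backend not ready yet; try again later
    if status in (400, 401, 422):
        return "fix_request"  # retrying the same request will not help
    return "report"           # 500, 504, or anything unexpected


def backoff_delays(attempts: int, base: float = 2.0,
                   cap: float = 60.0) -> Iterator[float]:
    """Yield exponentially growing delays in seconds, capped at `cap`."""
    for n in range(attempts):
        yield min(cap, base * 2 ** n)
```

A retry loop would sleep for each value from `backoff_delays(...)` while `classify_status` keeps returning "retry".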

Additional resources

Interactive API Docs

Swagger UI: try the API directly in the browser

Health Check

Real-time service status

HuggingFace Model

Qwen3-14B: original model card