Gemma2 9B Local LLM

Overview

Gemma2 9B Local LLM Service

A locally hosted LLM service running Google Gemma 2 9B Instruct (AWQ 4-bit quantized). It is optimized for Vietnamese and medical use and is very fast (~39 tokens/sec), 5-10 times faster than API-based models.

Technical specifications

Model	Google Gemma 2 9B Instruct (hugging-quants/gemma-2-9b-it-AWQ-INT4)
Quantization	AWQ 4-bit với awq_marlin kernel
Framework	vLLM 0.12.0 - High-performance inference engine
GPU	RTX 3060 12GB (CUDA_VISIBLE_DEVICES=1)
VRAM Usage	~6.5GB (Model: 5.78GB + KV Cache: 0.71GB)
Performance	~39 tokens/second, Response time: 0.5-6.3s
Languages	Vietnamese, English, Multilingual
Max Context	4096 tokens
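The throughput figure above gives a rough way to budget response time. The sketch below assumes a steady decode rate of 39 tokens/sec and ignores prompt processing and network overhead, so treat the result as a lower bound rather than a measured value; the function name is illustrative and not part of the service.

```python
def estimate_generation_seconds(completion_tokens: int,
                                tokens_per_second: float = 39.0) -> float:
    """Rough time to generate `completion_tokens` at a steady decode rate.

    Assumption: the ~39 tokens/sec figure from the spec table holds for the
    whole response. Prompt processing and network latency are not included.
    """
    if completion_tokens < 0 or tokens_per_second <= 0:
        raise ValueError("completion_tokens must be >= 0 and rate must be > 0")
    return completion_tokens / tokens_per_second


# Token counts taken from the example requests on this page
for n in (32, 85, 200):
    print(f"{n} tokens: ~{estimate_generation_seconds(n):.2f}s")
```

At this rate a 200-token completion takes about 5.1 s, which falls inside the 0.5-6.3 s response-time range listed in the table.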

Key features

Very fast

5-10x faster than the Vistral API, ~39 tokens/sec

Vietnamese

Excellent Vietnamese support, including medical context

Medical

Optimized for the healthcare and education domains

OpenAI API

Compatible with the OpenAI Python SDK

Free

100% free, no API cost

Security

Data is processed locally and is not sent to external services

API Endpoints

Base URL

https://pnt.badt.vn/gemma2

Available Endpoints

GET /health

Health check endpoint

GET /status

Service status and model information

POST /chat

Simple chat endpoint (backward compatible)

POST /v1/chat/completions

OpenAI-compatible chat completions API

GET /v1/models

List available models

Authentication

Bearer token authentication required

This API requires a Bearer token in the Authorization header. Contact the admin to obtain an API token.

How to use

Add the following header to every request:

Authorization: Bearer YOUR_API_TOKEN

Security notes

Keep the API token secret and do not share it publicly
Do not commit the token to source code
Use environment variables to store the token
Rate limiting still applies to ensure fair use
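One way to follow these notes is to read the token from the environment once and fail early if it is missing. This is a minimal sketch: the variable name GEMMA2_API_TOKEN matches the Python examples on this page, while the helper name auth_headers is an assumption made for illustration.

```python
import os

TOKEN_ENV_VAR = "GEMMA2_API_TOKEN"  # name used in this page's examples


def auth_headers(env_var: str = TOKEN_ENV_VAR) -> dict[str, str]:
    """Build request headers from a token stored in the environment.

    Raises instead of silently sending "Bearer None" when the variable
    is unset or empty.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} before calling the API")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as the headers argument of any HTTP client, so the token never appears in source code.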

Usage examples

1. Simple Chat (curl)

curl -X POST https://pnt.badt.vn/gemma2/chat \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Triệu chứng của bệnh viêm phổi là gì?",
    "max_tokens": 200,
    "temperature": 0.7
  }'

Response

{
  "response": "Triệu chứng của bệnh viêm phổi thường bao gồm:\n\n1. Ho có đờm (có thể có máu)\n2. Sốt cao\n3. Khó thở, thở nhanh\n4. Đau ngực khi hít thở sâu\n5. Mệt mỏi\n...",
  "model": "hugging-quants/gemma-2-9b-it-AWQ-INT4",
  "elapsed_time": "2.45s",
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 85,
    "total_tokens": 100
  }
}
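The response fields can be combined to check the throughput a request actually achieved. The sketch below assumes that elapsed_time is always a string of seconds with a trailing "s", as in the sample above; the page does not specify the format beyond that example, and the function name is illustrative.

```python
def tokens_per_second(result: dict) -> float:
    """Completion tokens divided by wall-clock time for one /chat response.

    Assumption: `elapsed_time` looks like "2.45s" (seconds plus a trailing
    "s"), which is the only format shown in the examples.
    """
    elapsed = float(result["elapsed_time"].rstrip("s"))
    if elapsed <= 0:
        raise ValueError("elapsed_time must be positive")
    return result["usage"]["completion_tokens"] / elapsed


# Values copied from the sample response above
sample = {
    "elapsed_time": "2.45s",
    "usage": {"prompt_tokens": 15, "completion_tokens": 85, "total_tokens": 100},
}
print(f"{tokens_per_second(sample):.1f} tokens/sec")
```

Because elapsed_time covers the whole request, the measured rate is somewhat lower than the ~39 tokens/sec decode figure.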

2. Python with requests

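import os

import requests

url = "https://pnt.badt.vn/gemma2/chat"
headers = {
    # os.environ raises KeyError if the token is unset,
    # instead of sending "Bearer None"
    "Authorization": f"Bearer {os.environ['GEMMA2_API_TOKEN']}",
    "Content-Type": "application/json",
}
data = {
    "prompt": "Xin chào! Bạn là trợ lý AI y tế.",
    "max_tokens": 150,
    "temperature": 0.7,
}

# The service allows up to 600 seconds per request
response = requests.post(url, json=data, headers=headers, timeout=600)
response.raise_for_status()  # surface 4xx/5xx errors before parsing
result = response.json()

print(result["response"])
print(f"Time: {result['elapsed_time']}")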

3. Text Correction (Medical Transcription)

Use case: correcting errors in medical transcription

With a well-tuned prompt, the model returns only the corrected sentence, without any extra explanation.

curl -X POST https://pnt.badt.vn/gemma2/chat \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Chỉ viết lại câu đã sửa lỗi, không thêm gì khác:\n\nMột số yếu tố như là kháng trị với corticoid, lợi yếu tố nguy cơ, tuổi trên 55 tuổi lợi yếu tố nguy cơ",
    "max_tokens": 50,
    "temperature": 0.2
  }'

Response

{
  "response": "Một số yếu tố như kháng trị với corticoid, lợi yếu tố nguy cơ, tuổi trên 55 tuổi là yếu tố nguy cơ.\n",
  "model": "hugging-quants/gemma-2-9b-it-AWQ-INT4",
  "elapsed_time": "0.88s",
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 32,
    "total_tokens": 77
  }
}

Best practices for text correction

Low temperature (0.2-0.3): reduces randomness and gives more stable output
Small max tokens (50-100): keeps the model from generating unnecessary extra text
Clear prompt: "Chỉ viết lại" (rewrite only), "không thêm gì khác" (add nothing else), "không giải thích" (do not explain)
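These settings can be bundled into a small helper so every correction request uses the same prompt and parameters. This is a sketch under assumptions: the helper names are hypothetical, the instruction text and defaults are copied from the curl example above, and trimming whitespace reflects the trailing newline seen in the sample response.

```python
# Instruction copied from the curl example above
CORRECTION_INSTRUCTION = "Chỉ viết lại câu đã sửa lỗi, không thêm gì khác:"


def correction_payload(text: str, max_tokens: int = 50,
                       temperature: float = 0.2) -> dict:
    """Build a /chat request body for transcription correction."""
    return {
        "prompt": f"{CORRECTION_INSTRUCTION}\n\n{text.strip()}",
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def clean_correction(response_text: str) -> str:
    """Strip surrounding whitespace such as the trailing newline in the sample."""
    return response_text.strip()
```

The payload can be sent to POST /chat exactly like the other examples; only the corrected sentence is expected back.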

4. Health Check

curl https://pnt.badt.vn/gemma2/health \
  -H "Authorization: Bearer YOUR_API_TOKEN"

Response

{
  "status": "healthy",
  "service": "gemma2_service",
  "vllm_server": "healthy",
  "model": "hugging-quants/gemma-2-9b-it-AWQ-INT4",
  "timestamp": "2025-12-06T06:24:48.316887"
}
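A client can use this response to decide whether to send work. The sketch below treats the service as ready only when both status and vllm_server report "healthy"; other possible values are not documented on this page, so anything else is conservatively treated as not ready. The function name is illustrative.

```python
def is_ready(health: dict) -> bool:
    """True only if both the wrapper service and the vLLM backend are healthy.

    Assumption: "healthy" is the only ready state; the page shows no other
    values, so missing or different values count as not ready.
    """
    return (health.get("status") == "healthy"
            and health.get("vllm_server") == "healthy")


# Sample copied from the response above
sample_health = {
    "status": "healthy",
    "service": "gemma2_service",
    "vllm_server": "healthy",
    "model": "hugging-quants/gemma-2-9b-it-AWQ-INT4",
    "timestamp": "2025-12-06T06:24:48.316887",
}
print(is_ready(sample_health))
```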

OpenAI Compatible API

The Gemma2 service exposes an OpenAI-compatible API, so it can be used with the OpenAI Python SDK:

Python with the OpenAI SDK

from openai import OpenAI
import os

# Initialize client with Bearer token
client = OpenAI(
    api_key="not-needed",
    base_url="https://pnt.badt.vn/gemma2/v1",
    default_headers={
        "Authorization": f"Bearer {os.getenv('GEMMA2_API_TOKEN')}"
    }
)

# Chat completion
response = client.chat.completions.create(
    model="gemma-2-9b-it",
    messages=[
        {"role": "user", "content": "Triệu chứng của bệnh tiểu đường?"}
    ],
    temperature=0.7,
    max_tokens=300
)

print(response.choices[0].message.content)

JavaScript with the OpenAI SDK

import OpenAI from 'openai';

const client = new OpenAI({
    apiKey: 'not-needed',
    baseURL: 'https://pnt.badt.vn/gemma2/v1',
    defaultHeaders: {
        'Authorization': `Bearer ${process.env.GEMMA2_API_TOKEN}`
    }
});

const response = await client.chat.completions.create({
    model: 'gemma-2-9b-it',
    messages: [
        { role: 'user', content: 'Xin chào!' }
    ],
    max_tokens: 100
});

console.log(response.choices[0].message.content);
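The SDK examples pass model="gemma-2-9b-it", while the /chat responses report "hugging-quants/gemma-2-9b-it-AWQ-INT4". If a request is rejected because of the model name, GET /v1/models should show which identifiers the server accepts. The sketch below uses only the standard library and assumes the endpoint returns the usual OpenAI list shape ({"data": [{"id": ...}]}); this page does not show that response, so verify it against the live service. The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "https://pnt.badt.vn/gemma2"


def model_ids(models_payload: dict) -> list[str]:
    """Extract model identifiers from an OpenAI-style model list."""
    return [m["id"] for m in models_payload.get("data", [])]


def fetch_model_ids(token: str, timeout: float = 30.0) -> list[str]:
    """Call GET /v1/models and return the identifiers it lists."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return model_ids(json.load(response))
```

Calling fetch_model_ids with a valid token and passing one of the returned identifiers as the model argument avoids guessing the name.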

Rate Limits

Fair Usage Policy

This service is a shared resource, so please use it responsibly:

Max concurrent requests: 3
Request timeout: 600 seconds
Max tokens per request: 4096
Max prompt size: 10MB
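Batch jobs can stay within the concurrency limit by capping the number of worker threads on the client side. This is a minimal sketch: run_limited and the send callable are hypothetical names, and send stands for whatever function performs one request (for example a requests.post call with timeout=600).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

MAX_CONCURRENT = 3  # "Max concurrent requests" from the policy above


def run_limited(prompts: Iterable[str],
                send: Callable[[str], str],
                max_concurrent: int = MAX_CONCURRENT) -> list[str]:
    """Run `send` over all prompts with at most `max_concurrent` in flight.

    Results are returned in the same order as the input prompts.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(send, prompts))
```

This only limits a single process; several clients sharing one token would still need to coordinate among themselves.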

Error Codes

Code	Error	Description
`400`	Bad Request	Invalid request parameters
`422`	Validation Error	Request validation failed
`500`	Internal Server Error	Model server error
`503`	Service Unavailable	vLLM server not ready
`504`	Gateway Timeout	Request timeout (>600s)
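A client might handle these codes by retrying only when the server is temporarily not ready. The sketch below retries on 503 with exponential backoff and returns every other status immediately; which codes are safe to retry is an assumption here, not something the service documents, and call_with_retry is a hypothetical helper.

```python
import time
from typing import Callable

# Assumption: only 503 (vLLM server not ready) is worth retrying.
# 400/422 indicate a bad request, and a 504 already waited over 600 s.
RETRYABLE_STATUS = {503}


def call_with_retry(send: Callable[[], tuple[int, dict]],
                    attempts: int = 4,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> tuple[int, dict]:
    """Call `send` until it returns a non-retryable status or attempts run out.

    `send` performs one request and returns (status_code, parsed_body).
    Delays double after each retry: base_delay, 2*base_delay, 4*base_delay...
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        status, body = send()
        if status not in RETRYABLE_STATUS or attempt == attempts - 1:
            return status, body
        sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")  # the loop always returns
```

Passing a custom sleep function makes the backoff easy to test without waiting.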
