# AGENTS.md
This file provides guidance to agents when working with code in this repository.
## Project Overview

`watsonx-openai-proxy` is an OpenAI-compatible API proxy for IBM watsonx.ai. It enables any tool or application that supports the OpenAI API format to seamlessly work with watsonx.ai models.
### Core Purpose

- Provide a drop-in replacement for OpenAI API endpoints
- Translate OpenAI API requests into watsonx.ai API calls
- Handle IBM Cloud authentication and token management automatically
- Support streaming responses via Server-Sent Events (SSE)
## Technology Stack

- **Framework**: FastAPI (async web framework)
- **Language**: Python 3.9+
- **HTTP client**: httpx (async HTTP client)
- **Validation**: Pydantic v2 (data validation and settings)
- **Server**: uvicorn (ASGI server)
## Architecture

The codebase follows a clean, modular architecture:

```
app/
├── main.py     # FastAPI app initialization, middleware, lifespan management
├── config.py   # Settings management, model mapping, environment variables
├── routers/    # API endpoint handlers (chat, completions, embeddings, models)
├── services/   # Business logic (watsonx_service for API interactions)
├── models/     # Pydantic models for OpenAI-compatible schemas
└── utils/      # Helper functions (request/response transformers)
```
### Key Design Patterns

- **Service Layer**: `watsonx_service.py` encapsulates all watsonx.ai API interactions
- **Transformer Pattern**: `transformers.py` handles bidirectional conversion between OpenAI and watsonx formats
- **Singleton Services**: global service instances (`watsonx_service`, `settings`) for shared state
- **Async/Await**: all I/O operations are asynchronous for better performance
- **Middleware**: custom authentication middleware for optional API key validation
## Building and Running

### Prerequisites

```bash
# Python 3.9 or higher required
python --version

# IBM Cloud credentials needed:
# - IBM_CLOUD_API_KEY
# - WATSONX_PROJECT_ID
```
### Installation

```bash
# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your IBM Cloud credentials
```
### Running the Server

```bash
# Development (with auto-reload)
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Production (with workers)
uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4

# Using the Python module
python -m app.main
```
### Docker Deployment

```bash
# Build image
docker build -t watsonx-openai-proxy .

# Run container
docker run -p 8000:8000 --env-file .env watsonx-openai-proxy

# Using docker-compose
docker-compose up
```
### Testing

```bash
# Install test dependencies
pip install pytest pytest-asyncio httpx

# Run tests
pytest tests/

# Run with coverage
pytest tests/ --cov=app
```
## Development Conventions

### Code Style

- **Async First**: use `async`/`await` for all I/O operations (HTTP requests, file operations)
- **Type Hints**: all functions should have type annotations for parameters and return values
- **Docstrings**: use Google-style docstrings for functions and classes
- **Logging**: use the `logging` module with appropriate log levels (info, warning, error)
### Error Handling

- Catch exceptions at the router level and return OpenAI-compatible error responses
- Use `HTTPException` with proper status codes and error details
- Log errors with full context using `logger.error(..., exc_info=True)`
- Return structured error responses matching OpenAI's error format
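A minimal sketch of the OpenAI-style error body these bullets describe; the `openai_error_body` helper name is hypothetical, not a function from the project:

```python
from typing import Optional

# Hypothetical helper: build an error payload matching OpenAI's error format.
def openai_error_body(message: str, err_type: str = "api_error",
                      code: Optional[str] = None) -> dict:
    return {
        "error": {
            "message": message,
            "type": err_type,
            "param": None,
            "code": code,
        }
    }
```

A router would return this dict as the JSON body of the error response (or pass it as the `detail` of an `HTTPException`).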
### Configuration Management

- All configuration via environment variables (`.env` file)
- Use `pydantic-settings` for type-safe configuration
- Model mapping via `MODEL_MAP_*` environment variables
- Settings accessed through the global `settings` instance
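As a rough illustration of the environment-driven settings pattern, a plain-Python sketch; the real project uses `pydantic-settings`, and the field names here are assumptions:

```python
import os
from dataclasses import dataclass

# Sketch only: a dataclass stands in for the pydantic-settings model.
@dataclass(frozen=True)
class Settings:
    ibm_cloud_api_key: str
    watsonx_project_id: str
    cluster: str = "us-south"
    log_level: str = "info"

def load_settings() -> Settings:
    # Required variables raise KeyError if missing; optional ones have defaults.
    return Settings(
        ibm_cloud_api_key=os.environ["IBM_CLOUD_API_KEY"],
        watsonx_project_id=os.environ["WATSONX_PROJECT_ID"],
        cluster=os.environ.get("WATSONX_CLUSTER", "us-south"),
        log_level=os.environ.get("LOG_LEVEL", "info"),
    )
```

With `pydantic-settings`, the same shape would be a `BaseSettings` subclass that reads the `.env` file automatically and validates types.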
### Token Management

- Bearer tokens automatically refreshed every 50 minutes (they expire at 60 minutes)
- Token refreshed on 401 errors from watsonx.ai
- Thread-safe token refresh using `asyncio.Lock`
- Initial token obtained during application startup
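The refresh logic can be sketched as follows; `TokenManager`, its method names, and the placeholder `_fetch_token()` are illustrative assumptions, not the project's actual API:

```python
import asyncio
import time

class TokenManager:
    # Refresh every 50 minutes; IAM tokens expire at 60.
    REFRESH_INTERVAL = 50 * 60

    def __init__(self) -> None:
        self._token = None
        self._fetched_at = 0.0
        self._lock = asyncio.Lock()

    async def _fetch_token(self) -> str:
        # Placeholder for the real IBM Cloud IAM token request.
        return "fake-iam-token"

    async def get_token(self) -> str:
        # The lock ensures only one task refreshes at a time.
        async with self._lock:
            stale = time.monotonic() - self._fetched_at > self.REFRESH_INTERVAL
            if self._token is None or stale:
                self._token = await self._fetch_token()
                self._fetched_at = time.monotonic()
            return self._token

tm = TokenManager()
token = asyncio.run(tm.get_token())
```

On a 401 from watsonx.ai, the service would force a refresh by clearing the cached token and retrying once.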
### API Compatibility

- Maintain strict OpenAI API compatibility in request/response formats
- Use Pydantic models from `openai_models.py` for validation
- Transform requests/responses using the functions in `transformers.py`
- Support both streaming and non-streaming responses
### Adding New Endpoints

1. Create a router in `app/routers/` (e.g., `new_endpoint.py`)
2. Define Pydantic models in `app/models/openai_models.py`
3. Add transformation logic in `app/utils/transformers.py`
4. Add a watsonx.ai API method in `app/services/watsonx_service.py`
5. Register the router in `app/main.py` using `app.include_router()`
### Streaming Responses

- Use `StreamingResponse` with `media_type="text/event-stream"`
- Format chunks as Server-Sent Events using `format_sse_event()`
- Always send a `[DONE]` message at the end of the stream
- Handle errors gracefully and send error events in SSE format
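A minimal sketch of the SSE chunk formatting described above; `format_sse_event` mirrors the helper named in this document, but its exact signature is an assumption:

```python
import json

def format_sse_event(data: dict) -> str:
    # One SSE event: a "data:" line followed by a blank line.
    return f"data: {json.dumps(data)}\n\n"

# Terminal sentinel expected by OpenAI streaming clients.
DONE_EVENT = "data: [DONE]\n\n"

chunk = format_sse_event({"choices": [{"delta": {"content": "Hi"}}]})
```

A streaming generator would yield one such string per delta, then yield `DONE_EVENT` last.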
### Model Mapping

- Map OpenAI model names to watsonx models via environment variables
- Format: `MODEL_MAP_<OPENAI_MODEL>=<WATSONX_MODEL_ID>`
- Example: `MODEL_MAP_GPT4=ibm/granite-4-h-small`
- Mapping applied in `settings.map_model()` before API calls
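A sketch of how `MODEL_MAP_*` resolution might work; the name-normalization rule and the fall-through behavior are assumptions about what `settings.map_model()` does:

```python
import os

def map_model(openai_model: str) -> str:
    # Assumed normalization: uppercase and replace separators with "_".
    env_key = "MODEL_MAP_" + openai_model.upper().replace("-", "_").replace(".", "_")
    # Fall back to the requested name if no mapping is configured.
    return os.environ.get(env_key, openai_model)

os.environ["MODEL_MAP_GPT4"] = "ibm/granite-4-h-small"
```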
## Security Considerations

- Optional API key authentication via the `API_KEY` environment variable
- Middleware validates the Bearer token in the `Authorization` header
- IBM Cloud API key stored securely in environment variables
- CORS configured via `ALLOWED_ORIGINS` (default: `*`)
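The middleware's Bearer-token check can be sketched as a plain function; the name and exact behavior are assumptions:

```python
import hmac

def is_authorized(authorization_header: str, expected_key: str) -> bool:
    # Expect a header of the form "Bearer <token>".
    scheme, _, token = (authorization_header or "").partition(" ")
    if scheme != "Bearer" or not token:
        return False
    # Constant-time comparison avoids leaking the key via timing.
    return hmac.compare_digest(token, expected_key)
```

When `API_KEY` is unset, the middleware would skip this check entirely; `/health` bypasses it in either case.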
## Logging Best Practices

- Use structured logging with context (model names, request IDs)
- Log level controlled by the `LOG_LEVEL` environment variable
- Log token refresh events at INFO level
- Log API errors at ERROR level with full traceback
- Include request/response details for debugging
## Dependencies

- Keep `requirements.txt` minimal and pinned to specific versions
- FastAPI and Pydantic are core dependencies; avoid breaking changes
- Use httpx for async HTTP; prefer it over requests/aiohttp
- Use `uvicorn[standard]` for a production-ready server
## Important Implementation Notes

### watsonx.ai API Specifics

- Base URL format: `https://{cluster}.ml.cloud.ibm.com/ml/v1`
- API version parameter: `version=2024-02-13` (required on all requests)
- Chat endpoint: `/text/chat` (non-streaming) or `/text/chat_stream` (streaming)
- Text generation: `/text/generation`
- Embeddings: `/text/embeddings`
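Putting these conventions together, a hypothetical URL builder (`build_watsonx_url` is not a real project function, just a sketch of the pattern above):

```python
WATSONX_API_VERSION = "2024-02-13"

def build_watsonx_url(cluster: str, path: str) -> str:
    # Base URL plus endpoint path plus the mandatory version parameter.
    base = f"https://{cluster}.ml.cloud.ibm.com/ml/v1"
    return f"{base}{path}?version={WATSONX_API_VERSION}"

url = build_watsonx_url("us-south", "/text/chat")
```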
### Request/Response Transformation

- OpenAI messages → watsonx messages: direct mapping with role/content
- watsonx responses → OpenAI format: extract choices, usage, and metadata
- Streaming chunks: parse SSE format, transform delta objects
- Generate unique IDs: `chatcmpl-{uuid}` for chat, `cmpl-{uuid}` for completions
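A sketch of the non-streaming response transform; the watsonx payload shape assumed here is illustrative, not the documented watsonx.ai schema:

```python
import time
import uuid

def to_openai_chat_response(wx: dict, model: str) -> dict:
    # Build an OpenAI chat.completion object with a chatcmpl-{uuid} id.
    return {
        "id": f"chatcmpl-{uuid.uuid4()}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": i,
                "message": {
                    "role": "assistant",
                    "content": c.get("message", {}).get("content", ""),
                },
                "finish_reason": c.get("finish_reason", "stop"),
            }
            for i, c in enumerate(wx.get("choices", []))
        ],
        "usage": wx.get("usage", {}),
    }

resp = to_openai_chat_response(
    {"choices": [{"message": {"content": "Hello"}, "finish_reason": "stop"}],
     "usage": {"total_tokens": 5}},
    model="gpt4",
)
```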
### Common Pitfalls

- Don't forget to refresh tokens before they expire (50-minute interval)
- Always close the httpx client on shutdown (`await watsonx_service.close()`)
- Handle both string and list formats for the `stop` parameter
- Validate that model IDs exist in watsonx.ai before making requests
- Set appropriate timeouts for long-running generation requests (300s default)
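For the `stop` pitfall above, a small normalization sketch (the helper name is assumed):

```python
from typing import List, Optional, Union

def normalize_stop(stop: Union[str, List[str], None]) -> Optional[List[str]]:
    # OpenAI accepts a string or a list of strings; normalize to a list.
    if stop is None:
        return None
    if isinstance(stop, str):
        return [stop]
    return list(stop)
```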
### Performance Optimization

- Reuse the httpx client instance (don't create one per request)
- Use connection pooling (httpx default behavior)
- Consider worker processes for production (`--workers 4`)
- Monitor token refresh to avoid rate limiting
## Environment Variables Reference

### Required

- `IBM_CLOUD_API_KEY`: IBM Cloud API key for authentication
- `WATSONX_PROJECT_ID`: watsonx.ai project ID

### Optional

- `WATSONX_CLUSTER`: region (default: `us-south`)
- `HOST`: server host (default: `0.0.0.0`)
- `PORT`: server port (default: `8000`)
- `LOG_LEVEL`: logging level (default: `info`)
- `API_KEY`: optional proxy authentication key
- `ALLOWED_ORIGINS`: CORS origins (default: `*`)
- `MODEL_MAP_*`: model name mappings
## API Endpoints

- `GET /` - API information and available endpoints
- `GET /health` - health check (bypasses authentication)
- `GET /docs` - interactive Swagger UI documentation
- `POST /v1/chat/completions` - chat completions (streaming supported)
- `POST /v1/completions` - text completions (legacy)
- `POST /v1/embeddings` - generate embeddings
- `GET /v1/models` - list available models
- `GET /v1/models/{model_id}` - get specific model info