# AGENTS.md
This file provides guidance to agents when working with code in this repository.
## Project Overview

`watsonx-openai-proxy` is an OpenAI-compatible API proxy for IBM watsonx.ai. It enables any tool or application that supports the OpenAI API format to seamlessly work with watsonx.ai models.
### Core Purpose

- Provide a drop-in replacement for OpenAI API endpoints
- Translate OpenAI API requests into watsonx.ai API calls
- Handle IBM Cloud authentication and token management automatically
- Support streaming responses via Server-Sent Events (SSE)
## Technology Stack

- **Framework**: FastAPI (async web framework)
- **Language**: Python 3.9+
- **HTTP client**: httpx (async HTTP client)
- **Validation**: Pydantic v2 (data validation and settings)
- **Server**: uvicorn (ASGI server)
## Architecture

The codebase follows a clean, modular architecture:

```
app/
├── main.py     # FastAPI app initialization, middleware, lifespan management
├── config.py   # Settings management, model mapping, environment variables
├── routers/    # API endpoint handlers (chat, completions, embeddings, models)
├── services/   # Business logic (watsonx_service for API interactions)
├── models/     # Pydantic models for OpenAI-compatible schemas
└── utils/      # Helper functions (request/response transformers)
```
### Key Design Patterns

- **Service Layer**: `watsonx_service.py` encapsulates all watsonx.ai API interactions
- **Transformer Pattern**: `transformers.py` handles bidirectional conversion between OpenAI and watsonx formats
- **Singleton Services**: global service instances (`watsonx_service`, `settings`) for shared state
- **Async/Await**: all I/O operations are asynchronous for better performance
- **Middleware**: custom authentication middleware for optional API key validation
## Building and Running

### Prerequisites

```bash
# Python 3.9 or higher required
python --version

# IBM Cloud credentials needed:
# - IBM_CLOUD_API_KEY
# - WATSONX_PROJECT_ID
```
### Installation

```bash
# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your IBM Cloud credentials
```
### Running the Server

```bash
# Development (with auto-reload)
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Production (with workers)
uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4

# Using the Python module
python -m app.main
```
### Docker Deployment

```bash
# Build image
docker build -t watsonx-openai-proxy .

# Run container
docker run -p 8000:8000 --env-file .env watsonx-openai-proxy

# Using docker-compose
docker-compose up
```
### Testing

```bash
# Install test dependencies
pip install pytest pytest-asyncio httpx

# Run tests
pytest tests/

# Run with coverage
pytest tests/ --cov=app
```
## Development Conventions

### Code Style

- **Async First**: use `async`/`await` for all I/O operations (HTTP requests, file operations)
- **Type Hints**: all functions should have type annotations for parameters and return values
- **Docstrings**: use Google-style docstrings for functions and classes
- **Logging**: use the `logging` module with appropriate log levels (info, warning, error)
### Error Handling

- Catch exceptions at the router level and return OpenAI-compatible error responses
- Use `HTTPException` with proper status codes and error details
- Log errors with full context using `logger.error(..., exc_info=True)`
- Return structured error responses matching OpenAI's error format
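A minimal sketch of the OpenAI-style error body these bullets describe; the `openai_error_body` helper name is hypothetical, not a function from the project:

```python
from typing import Optional

# Hypothetical helper: build an error payload matching OpenAI's error format.
def openai_error_body(message: str, err_type: str = "api_error",
                      code: Optional[str] = None) -> dict:
    return {
        "error": {
            "message": message,
            "type": err_type,
            "param": None,
            "code": code,
        }
    }
```

A router would return this dict as the JSON body of the error response (or pass it as the `detail` of an `HTTPException`).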
### Configuration Management

- All configuration via environment variables (`.env` file)
- Use `pydantic-settings` for type-safe configuration
- Model mapping via `MODEL_MAP_*` environment variables
- Settings accessed through the global `settings` instance
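As a rough illustration of the environment-driven settings pattern, a plain-Python sketch; the real project uses `pydantic-settings`, and the field names here are assumptions:

```python
import os
from dataclasses import dataclass

# Sketch only: a dataclass stands in for the pydantic-settings model.
@dataclass(frozen=True)
class Settings:
    ibm_cloud_api_key: str
    watsonx_project_id: str
    cluster: str = "us-south"
    log_level: str = "info"

def load_settings() -> Settings:
    # Required variables raise KeyError if missing; optional ones have defaults.
    return Settings(
        ibm_cloud_api_key=os.environ["IBM_CLOUD_API_KEY"],
        watsonx_project_id=os.environ["WATSONX_PROJECT_ID"],
        cluster=os.environ.get("WATSONX_CLUSTER", "us-south"),
        log_level=os.environ.get("LOG_LEVEL", "info"),
    )
```

With `pydantic-settings`, the same shape would be a `BaseSettings` subclass that reads the `.env` file automatically and validates types.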
### Token Management

- Bearer tokens automatically refreshed every 50 minutes (they expire at 60 minutes)
- Token refreshed on 401 errors from watsonx.ai
- Thread-safe token refresh using `asyncio.Lock`
- Initial token obtained during application startup
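The refresh logic can be sketched as follows; `TokenManager`, its method names, and the placeholder `_fetch_token()` are illustrative assumptions, not the project's actual API:

```python
import asyncio
import time

class TokenManager:
    # Refresh every 50 minutes; IAM tokens expire at 60.
    REFRESH_INTERVAL = 50 * 60

    def __init__(self) -> None:
        self._token = None
        self._fetched_at = 0.0
        self._lock = asyncio.Lock()

    async def _fetch_token(self) -> str:
        # Placeholder for the real IBM Cloud IAM token request.
        return "fake-iam-token"

    async def get_token(self) -> str:
        # The lock ensures only one task refreshes at a time.
        async with self._lock:
            stale = time.monotonic() - self._fetched_at > self.REFRESH_INTERVAL
            if self._token is None or stale:
                self._token = await self._fetch_token()
                self._fetched_at = time.monotonic()
            return self._token

tm = TokenManager()
token = asyncio.run(tm.get_token())
```

On a 401 from watsonx.ai, the service would force a refresh by clearing the cached token and retrying once.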
### API Compatibility

- Maintain strict OpenAI API compatibility in request/response formats
- Use Pydantic models from `openai_models.py` for validation
- Transform requests/responses using the functions in `transformers.py`
- Support both streaming and non-streaming responses
### Adding New Endpoints

1. Create a router in `app/routers/` (e.g., `new_endpoint.py`)
2. Define Pydantic models in `app/models/openai_models.py`
3. Add transformation logic in `app/utils/transformers.py`
4. Add a watsonx.ai API method in `app/services/watsonx_service.py`
5. Register the router in `app/main.py` using `app.include_router()`
### Streaming Responses

- Use `StreamingResponse` with `media_type="text/event-stream"`
- Format chunks as Server-Sent Events using `format_sse_event()`
- Always send a `[DONE]` message at the end of the stream
- Handle errors gracefully and send error events in SSE format
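A minimal sketch of the SSE chunk formatting described above; `format_sse_event` mirrors the helper named in this document, but its exact signature is an assumption:

```python
import json

def format_sse_event(data: dict) -> str:
    # One SSE event: a "data:" line followed by a blank line.
    return f"data: {json.dumps(data)}\n\n"

# Terminal sentinel expected by OpenAI streaming clients.
DONE_EVENT = "data: [DONE]\n\n"

chunk = format_sse_event({"choices": [{"delta": {"content": "Hi"}}]})
```

A streaming generator would yield one such string per delta, then yield `DONE_EVENT` last.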
### Model Mapping

- Map OpenAI model names to watsonx models via environment variables
- Format: `MODEL_MAP_<OPENAI_MODEL>=<WATSONX_MODEL_ID>`
- Example: `MODEL_MAP_GPT4=ibm/granite-4-h-small`
- Mapping applied in `settings.map_model()` before API calls
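A sketch of how `MODEL_MAP_*` resolution might work; the name-normalization rule and the fall-through behavior are assumptions about what `settings.map_model()` does:

```python
import os

def map_model(openai_model: str) -> str:
    # Assumed normalization: uppercase and replace separators with "_".
    env_key = "MODEL_MAP_" + openai_model.upper().replace("-", "_").replace(".", "_")
    # Fall back to the requested name if no mapping is configured.
    return os.environ.get(env_key, openai_model)

os.environ["MODEL_MAP_GPT4"] = "ibm/granite-4-h-small"
```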
## Security Considerations

- Optional API key authentication via the `API_KEY` environment variable
- Middleware validates the Bearer token in the `Authorization` header
- IBM Cloud API key stored securely in environment variables
- CORS configured via `ALLOWED_ORIGINS` (default: `*`)
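The middleware's Bearer-token check can be sketched as a plain function; the name and exact behavior are assumptions:

```python
import hmac

def is_authorized(authorization_header: str, expected_key: str) -> bool:
    # Expect a header of the form "Bearer <token>".
    scheme, _, token = (authorization_header or "").partition(" ")
    if scheme != "Bearer" or not token:
        return False
    # Constant-time comparison avoids leaking the key via timing.
    return hmac.compare_digest(token, expected_key)
```

When `API_KEY` is unset, the middleware would skip this check entirely; `/health` bypasses it in either case.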
## Logging Best Practices

- Use structured logging with context (model names, request IDs)
- Log level controlled by the `LOG_LEVEL` environment variable
- Log token refresh events at INFO level
- Log API errors at ERROR level with full traceback
- Include request/response details for debugging
## Dependencies

- Keep `requirements.txt` minimal and pinned to specific versions
- FastAPI and Pydantic are core dependencies; avoid breaking changes
- Use httpx for async HTTP; prefer it over requests/aiohttp
- Use `uvicorn[standard]` for a production-ready server
## Important Implementation Notes

### watsonx.ai API Specifics

- Base URL format: `https://{cluster}.ml.cloud.ibm.com/ml/v1`
- API version parameter: `version=2024-02-13` (required on all requests)
- Chat endpoint: `/text/chat` (non-streaming) or `/text/chat_stream` (streaming)
- Text generation: `/text/generation`
- Embeddings: `/text/embeddings`
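Putting these conventions together, a hypothetical URL builder (`build_watsonx_url` is not a real project function, just a sketch of the pattern above):

```python
WATSONX_API_VERSION = "2024-02-13"

def build_watsonx_url(cluster: str, path: str) -> str:
    # Base URL plus endpoint path plus the mandatory version parameter.
    base = f"https://{cluster}.ml.cloud.ibm.com/ml/v1"
    return f"{base}{path}?version={WATSONX_API_VERSION}"

url = build_watsonx_url("us-south", "/text/chat")
```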
### Request/Response Transformation

- OpenAI messages → watsonx messages: direct mapping with role/content
- watsonx responses → OpenAI format: extract choices, usage, and metadata
- Streaming chunks: parse SSE format, transform delta objects
- Generate unique IDs: `chatcmpl-{uuid}` for chat, `cmpl-{uuid}` for completions
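A sketch of the non-streaming response transform; the watsonx payload shape assumed here is illustrative, not the documented watsonx.ai schema:

```python
import time
import uuid

def to_openai_chat_response(wx: dict, model: str) -> dict:
    # Build an OpenAI chat.completion object with a chatcmpl-{uuid} id.
    return {
        "id": f"chatcmpl-{uuid.uuid4()}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": i,
                "message": {
                    "role": "assistant",
                    "content": c.get("message", {}).get("content", ""),
                },
                "finish_reason": c.get("finish_reason", "stop"),
            }
            for i, c in enumerate(wx.get("choices", []))
        ],
        "usage": wx.get("usage", {}),
    }

resp = to_openai_chat_response(
    {"choices": [{"message": {"content": "Hello"}, "finish_reason": "stop"}],
     "usage": {"total_tokens": 5}},
    model="gpt4",
)
```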
### Common Pitfalls

- Don't forget to refresh tokens before they expire (50-minute interval)
- Always close the httpx client on shutdown (`await watsonx_service.close()`)
- Handle both string and list formats for the `stop` parameter
- Validate that model IDs exist in watsonx.ai before making requests
- Set appropriate timeouts for long-running generation requests (300s default)
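For the `stop` pitfall above, a small normalization sketch (the helper name is assumed):

```python
from typing import List, Optional, Union

def normalize_stop(stop: Union[str, List[str], None]) -> Optional[List[str]]:
    # OpenAI accepts a string or a list of strings; normalize to a list.
    if stop is None:
        return None
    if isinstance(stop, str):
        return [stop]
    return list(stop)
```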
### Performance Optimization

- Reuse the httpx client instance (don't create one per request)
- Use connection pooling (httpx default behavior)
- Consider worker processes for production (`--workers 4`)
- Monitor token refresh to avoid rate limiting
## Environment Variables Reference

### Required

- `IBM_CLOUD_API_KEY`: IBM Cloud API key for authentication
- `WATSONX_PROJECT_ID`: watsonx.ai project ID

### Optional

- `WATSONX_CLUSTER`: region (default: `us-south`)
- `HOST`: server host (default: `0.0.0.0`)
- `PORT`: server port (default: `8000`)
- `LOG_LEVEL`: logging level (default: `info`)
- `API_KEY`: optional proxy authentication key
- `ALLOWED_ORIGINS`: CORS origins (default: `*`)
- `MODEL_MAP_*`: model name mappings
## API Endpoints

- `GET /` - API information and available endpoints
- `GET /health` - health check (bypasses authentication)
- `GET /docs` - interactive Swagger UI documentation
- `POST /v1/chat/completions` - chat completions (streaming supported)
- `POST /v1/completions` - text completions (legacy)
- `POST /v1/embeddings` - generate embeddings
- `GET /v1/models` - list available models
- `GET /v1/models/{model_id}` - get specific model info