AGENTS.md

This file provides guidance to agents when working with code in this repository.

Project Overview

watsonx-openai-proxy is an OpenAI-compatible API proxy for IBM watsonx.ai. It enables any tool or application that supports the OpenAI API format to work seamlessly with watsonx.ai models.

Core Purpose

  • Provide a drop-in replacement for OpenAI API endpoints
  • Translate OpenAI API requests to watsonx.ai API calls
  • Handle IBM Cloud authentication and token management automatically
  • Support streaming responses via Server-Sent Events (SSE)

Technology Stack

  • Framework: FastAPI (async web framework)
  • Language: Python 3.9+
  • HTTP Client: httpx (async HTTP client)
  • Validation: Pydantic v2 (data validation and settings)
  • Server: uvicorn (ASGI server)

Architecture

The codebase follows a clean, modular architecture:

app/
├── main.py              # FastAPI app initialization, middleware, lifespan management
├── config.py            # Settings management, model mapping, environment variables
├── routers/             # API endpoint handlers (chat, completions, embeddings, models)
├── services/            # Business logic (watsonx_service for API interactions)
├── models/              # Pydantic models for OpenAI-compatible schemas
└── utils/               # Helper functions (request/response transformers)

Key Design Patterns:

  • Service Layer: watsonx_service.py encapsulates all watsonx.ai API interactions
  • Transformer Pattern: transformers.py handles bidirectional conversion between OpenAI and watsonx formats
  • Singleton Services: Global service instances (watsonx_service, settings) for shared state
  • Async/Await: All I/O operations are asynchronous for better performance
  • Middleware: Custom authentication middleware for optional API key validation

Building and Running

Prerequisites

# Python 3.9 or higher required
python --version

# IBM Cloud credentials needed:
# - IBM_CLOUD_API_KEY
# - WATSONX_PROJECT_ID

Installation

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your IBM Cloud credentials

Running the Server

# Development (with auto-reload)
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Production (with workers)
uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4

# Using Python module
python -m app.main

Docker Deployment

# Build image
docker build -t watsonx-openai-proxy .

# Run container
docker run -p 8000:8000 --env-file .env watsonx-openai-proxy

# Using docker-compose
docker-compose up

Testing

# Install test dependencies
pip install pytest pytest-asyncio httpx

# Run tests
pytest tests/

# Run with coverage
pytest tests/ --cov=app

Development Conventions

Code Style

  • Async First: Use async/await for all I/O operations (HTTP requests, file operations)
  • Type Hints: All functions should have type annotations for parameters and return values
  • Docstrings: Use Google-style docstrings for functions and classes
  • Logging: Use the logging module with appropriate log levels (info, warning, error)

Error Handling

  • Catch exceptions at router level and return OpenAI-compatible error responses
  • Use HTTPException with proper status codes and error details
  • Log errors with full context using logger.error(..., exc_info=True)
  • Return structured error responses matching OpenAI's error format
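The error envelope routers should return can be sketched as below. The field names follow OpenAI's documented error format; the helper name and signature are illustrative, not the repo's actual code.

```python
from typing import Optional

def openai_error(message: str, err_type: str = "invalid_request_error",
                 code: Optional[str] = None) -> dict:
    """Build an error body matching OpenAI's error envelope.

    Routers would wrap this in a JSONResponse (or raise HTTPException
    with this as the detail) so clients see the familiar shape.
    """
    return {
        "error": {
            "message": message,
            "type": err_type,
            "param": None,
            "code": code,
        }
    }
```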

Configuration Management

  • All configuration via environment variables (.env file)
  • Use pydantic-settings for type-safe configuration
  • Model mapping via MODEL_MAP_* environment variables
  • Settings accessed through global settings instance
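The real config.py uses pydantic-settings; the stdlib sketch below illustrates the same idea (typed fields with environment-variable defaults, exposed as a module-level singleton) without the dependency. All names here are hypothetical stand-ins.

```python
import os
from dataclasses import dataclass, field

@dataclass
class Settings:
    """Stand-in for the pydantic-settings class: each field reads its
    environment variable at construction time, with the documented default."""
    host: str = field(default_factory=lambda: os.getenv("HOST", "0.0.0.0"))
    port: int = field(default_factory=lambda: int(os.getenv("PORT", "8000")))
    log_level: str = field(default_factory=lambda: os.getenv("LOG_LEVEL", "info"))

# Global instance, mirroring how the app shares one settings object
settings = Settings()
```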

Token Management

  • Bearer tokens automatically refreshed every 50 minutes (tokens expire after 60 minutes)
  • Token refresh on 401 errors from watsonx.ai
  • Thread-safe token refresh using asyncio.Lock
  • Initial token obtained during application startup
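The refresh pattern above can be sketched as follows. The class and method names are hypothetical; it assumes a caller-supplied async `fetch` coroutine that exchanges the IBM Cloud API key for a bearer token.

```python
import asyncio
import time

class TokenManager:
    """Thread-safe (asyncio.Lock) bearer-token cache with proactive refresh."""

    REFRESH_AFTER = 50 * 60  # refresh 50 minutes into the 60-minute lifetime

    def __init__(self, fetch):
        self._fetch = fetch          # async callable returning a fresh token
        self._token = None
        self._obtained_at = 0.0
        self._lock = asyncio.Lock()

    async def get_token(self) -> str:
        async with self._lock:       # only one coroutine refreshes at a time
            expired = time.monotonic() - self._obtained_at > self.REFRESH_AFTER
            if self._token is None or expired:
                self._token = await self._fetch()
                self._obtained_at = time.monotonic()
            return self._token

    async def invalidate(self) -> None:
        """Force a refresh on the next call (e.g. after a 401 from watsonx.ai)."""
        async with self._lock:
            self._token = None
```

At startup the app would call `get_token()` once to obtain the initial token, matching the last bullet above.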

API Compatibility

  • Maintain strict OpenAI API compatibility in request/response formats
  • Use Pydantic models from openai_models.py for validation
  • Transform requests/responses using functions in transformers.py
  • Support both streaming and non-streaming responses

Adding New Endpoints

  1. Create router in app/routers/ (e.g., new_endpoint.py)
  2. Define Pydantic models in app/models/openai_models.py
  3. Add transformation logic in app/utils/transformers.py
  4. Add watsonx.ai API method in app/services/watsonx_service.py
  5. Register router in app/main.py using app.include_router()

Streaming Responses

  • Use StreamingResponse with media_type="text/event-stream"
  • Format chunks as Server-Sent Events using format_sse_event()
  • Always send [DONE] message at the end of stream
  • Handle errors gracefully and send error events in SSE format
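A plausible shape for the `format_sse_event()` helper referenced above (the repo's actual implementation may differ):

```python
import json

def format_sse_event(data) -> str:
    """Render one chunk in the Server-Sent Events wire format.

    OpenAI streaming clients expect `data: <json>\n\n` per chunk and the
    literal sentinel `data: [DONE]\n\n` to close the stream.
    """
    if data == "[DONE]":
        return "data: [DONE]\n\n"
    return f"data: {json.dumps(data)}\n\n"
```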

Model Mapping

  • Map OpenAI model names to watsonx models via environment variables
  • Format: MODEL_MAP_<OPENAI_MODEL>=<WATSONX_MODEL_ID>
  • Example: MODEL_MAP_GPT4=ibm/granite-4-h-small
  • Mapping applied in settings.map_model() before API calls
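A sketch of how `MODEL_MAP_*` variables could be collected and applied. The key normalisation (lowercasing and stripping hyphens/dots so that a client's `gpt-4` matches `MODEL_MAP_GPT4`) is an assumption; `settings.map_model()` may normalise differently.

```python
import os

PREFIX = "MODEL_MAP_"

def build_model_map(environ=os.environ) -> dict:
    """Collect MODEL_MAP_* variables into a lookup table keyed by the
    lowercased name after the prefix."""
    return {
        key[len(PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }

def map_model(name: str, model_map: dict) -> str:
    """Map an OpenAI model name to a watsonx model ID; fall back to the
    requested name when no mapping exists."""
    normalized = name.lower().replace("-", "").replace(".", "")
    return model_map.get(normalized, name)
```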

Security Considerations

  • Optional API key authentication via API_KEY environment variable
  • Middleware validates Bearer token in Authorization header
  • IBM Cloud API key stored securely in environment variables
  • CORS configured via ALLOWED_ORIGINS (default: *)
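The middleware's authentication decision reduces to a check like the one below (function name illustrative): open access when `API_KEY` is unset, otherwise require `Bearer <API_KEY>` in the Authorization header.

```python
from typing import Optional

def is_authorized(auth_header: Optional[str], api_key: Optional[str]) -> bool:
    """Validate the incoming Authorization header against the proxy's API_KEY."""
    if not api_key:  # API_KEY unset: authentication is disabled
        return True
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    return auth_header[len("Bearer "):] == api_key
```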

Logging Best Practices

  • Use structured logging with context (model names, request IDs)
  • Log level controlled by LOG_LEVEL environment variable
  • Log token refresh events at INFO level
  • Log API errors at ERROR level with full traceback
  • Include request/response details for debugging

Dependencies

  • Keep requirements.txt minimal and pinned to specific versions
  • FastAPI and Pydantic are core dependencies - avoid breaking changes
  • httpx for async HTTP - prefer over requests/aiohttp
  • Use uvicorn[standard] for production-ready server

Important Implementation Notes

watsonx.ai API Specifics

  • Base URL format: https://{cluster}.ml.cloud.ibm.com/ml/v1
  • API version parameter: version=2024-02-13 (required on all requests)
  • Chat endpoint: /text/chat (non-streaming) or /text/chat_stream (streaming)
  • Text generation: /text/generation
  • Embeddings: /text/embeddings
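Putting the pieces above together, endpoint URLs can be composed like this (illustrative helper, not the repo's actual code):

```python
def watsonx_url(cluster: str, path: str, version: str = "2024-02-13") -> str:
    """Compose a watsonx.ai endpoint URL with the required version parameter."""
    return f"https://{cluster}.ml.cloud.ibm.com/ml/v1{path}?version={version}"
```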

Request/Response Transformation

  • OpenAI messages → watsonx messages: Direct mapping with role/content
  • watsonx responses → OpenAI format: Extract choices, usage, and metadata
  • Streaming chunks: Parse SSE format, transform delta objects
  • Generate unique IDs: chatcmpl-{uuid} for chat, cmpl-{uuid} for completions
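The response-side transformation and ID generation can be sketched as below. The watsonx field names (`choices`, `content`, `finish_reason`, `usage`) are assumptions for illustration; the real transformers.py maps whatever `/text/chat` actually returns.

```python
import time
import uuid

def to_openai_chat_response(wx: dict, model: str) -> dict:
    """Shape a (simplified) watsonx.ai chat response as an OpenAI
    chat.completion object, including a chatcmpl-{uuid} ID."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": i,
                "message": {"role": "assistant", "content": c.get("content", "")},
                "finish_reason": c.get("finish_reason", "stop"),
            }
            for i, c in enumerate(wx.get("choices", []))
        ],
        "usage": wx.get("usage", {}),
    }
```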

Common Pitfalls

  • Don't forget to refresh tokens before they expire (50-minute interval)
  • Always close httpx client on shutdown (await watsonx_service.close())
  • Handle both string and list formats for stop parameter
  • Validate model IDs exist in watsonx.ai before making requests
  • Set appropriate timeouts for long-running generation requests (300s default)
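The string-vs-list `stop` pitfall above is easiest to handle with a small normaliser at the edge of the transformation layer (name illustrative):

```python
from typing import Optional, Union

def normalize_stop(stop: Union[str, list, None]) -> Optional[list]:
    """Accept OpenAI's `stop` parameter as a string, list, or None;
    always emit a list (or None) for the downstream watsonx payload."""
    if stop is None:
        return None
    if isinstance(stop, str):
        return [stop]
    return list(stop)
```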

Performance Optimization

  • Reuse httpx client instance (don't create per request)
  • Use connection pooling (httpx default behavior)
  • Consider worker processes for production (--workers 4)
  • Monitor token refresh to avoid rate limiting

Environment Variables Reference

Required

  • IBM_CLOUD_API_KEY: IBM Cloud API key for authentication
  • WATSONX_PROJECT_ID: watsonx.ai project ID

Optional

  • WATSONX_CLUSTER: Region (default: us-south)
  • HOST: Server host (default: 0.0.0.0)
  • PORT: Server port (default: 8000)
  • LOG_LEVEL: Logging level (default: info)
  • API_KEY: Optional proxy authentication key
  • ALLOWED_ORIGINS: CORS origins (default: *)
  • MODEL_MAP_*: Model name mappings

API Endpoints

  • GET / - API information and available endpoints
  • GET /health - Health check (bypasses authentication)
  • GET /docs - Interactive Swagger UI documentation
  • POST /v1/chat/completions - Chat completions (streaming supported)
  • POST /v1/completions - Text completions (legacy)
  • POST /v1/embeddings - Generate embeddings
  • GET /v1/models - List available models
  • GET /v1/models/{model_id} - Get specific model info