# AGENTS.md

This file provides guidance to agents when working with code in this repository.

## Project Overview

**watsonx-openai-proxy** is an OpenAI-compatible API proxy for IBM watsonx.ai. It lets any tool or application that supports the OpenAI API format work with watsonx.ai models.

### Core Purpose

- Provide a drop-in replacement for OpenAI API endpoints
- Translate OpenAI API requests to watsonx.ai API calls
- Handle IBM Cloud authentication and token management automatically
- Support streaming responses via Server-Sent Events (SSE)

### Technology Stack

- **Framework**: FastAPI (async web framework)
- **Language**: Python 3.9+
- **HTTP Client**: httpx (async HTTP client)
- **Validation**: Pydantic v2 (data validation and settings)
- **Server**: uvicorn (ASGI server)

### Architecture

The codebase follows a clean, modular architecture:

```
app/
├── main.py      # FastAPI app initialization, middleware, lifespan management
├── config.py    # Settings management, model mapping, environment variables
├── routers/     # API endpoint handlers (chat, completions, embeddings, models)
├── services/    # Business logic (watsonx_service for API interactions)
├── models/      # Pydantic models for OpenAI-compatible schemas
└── utils/       # Helper functions (request/response transformers)
```

**Key Design Patterns**:

- **Service Layer**: `watsonx_service.py` encapsulates all watsonx.ai API interactions
- **Transformer Pattern**: `transformers.py` handles bidirectional conversion between OpenAI and watsonx formats
- **Singleton Services**: Global service instances (`watsonx_service`, `settings`) for shared state
- **Async/Await**: All I/O operations are asynchronous for better performance
- **Middleware**: Custom authentication middleware for optional API key validation

## Building and Running

### Prerequisites

```bash
# Python 3.9 or higher required
python --version

# IBM Cloud credentials needed:
# - IBM_CLOUD_API_KEY
# - WATSONX_PROJECT_ID
```

### Installation

```bash
# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your IBM Cloud credentials
```

### Running the Server

```bash
# Development (with auto-reload)
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Production (with workers)
uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4

# Using Python module
python -m app.main
```

### Docker Deployment

```bash
# Build image
docker build -t watsonx-openai-proxy .

# Run container
docker run -p 8000:8000 --env-file .env watsonx-openai-proxy

# Using docker-compose
docker-compose up
```

### Testing

```bash
# Install test dependencies (pytest-cov provides the --cov option used below)
pip install pytest pytest-asyncio pytest-cov httpx

# Run tests
pytest tests/

# Run with coverage
pytest tests/ --cov=app
```

## Development Conventions

### Code Style

- **Async First**: Use `async`/`await` for all I/O operations (HTTP requests, file operations)
- **Type Hints**: All functions should have type annotations for parameters and return values
- **Docstrings**: Use Google-style docstrings for functions and classes
- **Logging**: Use the `logging` module with appropriate log levels (info, warning, error)

### Error Handling

- Catch exceptions at router level and return OpenAI-compatible error responses
- Use `HTTPException` with proper status codes and error details
- Log errors with full context using `logger.error(..., exc_info=True)`
- Return structured error responses matching OpenAI's error format

### Configuration Management

- All configuration via environment variables (`.env` file)
- Use `pydantic-settings` for type-safe configuration
- Model mapping via `MODEL_MAP_*` environment variables
- Settings accessed through global `settings` instance

### Token Management

- Bearer tokens automatically refreshed every 50 minutes (expire at 60 minutes)
- Token refresh on 401 errors from watsonx.ai
- Concurrency-safe token refresh using `asyncio.Lock` (it serializes coroutines on one event loop, not OS threads)
- Initial token obtained during application startup

### API Compatibility

- Maintain strict OpenAI API compatibility in request/response formats
- Use Pydantic models from `openai_models.py` for validation
- Transform requests/responses using functions in `transformers.py`
- Support both streaming and non-streaming responses

### Adding New Endpoints

1. Create router in `app/routers/` (e.g., `new_endpoint.py`)
2. Define Pydantic models in `app/models/openai_models.py`
3. Add transformation logic in `app/utils/transformers.py`
4. Add watsonx.ai API method in `app/services/watsonx_service.py`
5. Register router in `app/main.py` using `app.include_router()`

### Streaming Responses

- Use `StreamingResponse` with `media_type="text/event-stream"`
- Format chunks as Server-Sent Events using `format_sse_event()`
- Always send `[DONE]` message at the end of stream
- Handle errors gracefully and send error events in SSE format

### Model Mapping

- Map OpenAI model names to watsonx models via environment variables
- Format: `MODEL_MAP_=`
- Example: `MODEL_MAP_GPT4=ibm/granite-4-h-small`
- Mapping applied in `settings.map_model()` before API calls

### Security Considerations

- Optional API key authentication via `API_KEY` environment variable
- Middleware validates Bearer token in Authorization header
- IBM Cloud API key is read from environment variables; never hard-code or commit it
- CORS configured via `ALLOWED_ORIGINS` (default: `*`)

### Logging Best Practices

- Use structured logging with context (model names, request IDs)
- Log level controlled by `LOG_LEVEL` environment variable
- Log token refresh events at INFO level
- Log API errors at ERROR level with full traceback
- Include request/response details for debugging

### Dependencies

- Keep `requirements.txt` minimal and pinned to specific versions
- FastAPI and Pydantic are core dependencies - avoid breaking changes
- httpx for async HTTP - prefer over requests/aiohttp
- Use `uvicorn[standard]` for production-ready server

## Important Implementation Notes

### watsonx.ai API Specifics

- Base URL format: `https://{cluster}.ml.cloud.ibm.com/ml/v1`
- API version parameter: `version=2024-02-13` (required on all requests)
- Chat endpoint: `/text/chat` (non-streaming) or `/text/chat_stream` (streaming)
- Text generation: `/text/generation`
- Embeddings: `/text/embeddings`

### Request/Response Transformation

- OpenAI messages → watsonx messages: Direct mapping with role/content
- watsonx responses → OpenAI format: Extract choices, usage, and metadata
- Streaming chunks: Parse SSE format, transform delta objects
- Generate unique IDs: `chatcmpl-{uuid}` for chat, `cmpl-{uuid}` for completions

### Common Pitfalls

- Don't forget to refresh tokens before they expire (50-minute interval)
- Always close httpx client on shutdown (`await watsonx_service.close()`)
- Handle both string and list formats for `stop` parameter
- Validate model IDs exist in watsonx.ai before making requests
- Set appropriate timeouts for long-running generation requests (300s default)

### Performance Optimization

- Reuse httpx client instance (don't create per request)
- Use connection pooling (httpx default behavior)
- Consider worker processes for production (`--workers 4`)
- Monitor token refresh to avoid rate limiting

## Environment Variables Reference

### Required

- `IBM_CLOUD_API_KEY`: IBM Cloud API key for authentication
- `WATSONX_PROJECT_ID`: watsonx.ai project ID

### Optional

- `WATSONX_CLUSTER`: Region (default: `us-south`)
- `HOST`: Server host (default: `0.0.0.0`)
- `PORT`: Server port (default: `8000`)
- `LOG_LEVEL`: Logging level (default: `info`)
- `API_KEY`: Optional proxy authentication key
- `ALLOWED_ORIGINS`: CORS origins (default: `*`)
- `MODEL_MAP_*`: Model name mappings

## API Endpoints

- `GET /` - API information and available endpoints
- `GET /health` - Health check (bypasses authentication)
- `GET /docs` - Interactive Swagger UI documentation
- `POST /v1/chat/completions` - Chat completions (streaming supported)
- `POST /v1/completions` - Text completions (legacy)
- `POST /v1/embeddings` - Generate embeddings
- `GET /v1/models` - List available models
- `GET /v1/models/{model_id}` - Get specific model info
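
## Sketch: Token Refresh

The token management rules in this guide (refresh after 50 minutes, refresh again on a 401, serialize refreshes with `asyncio.Lock`) might be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `watsonx_service.py`: the `TokenCache` class, the injected `fetch_token` coroutine, and the `clock` parameter are hypothetical names introduced here so the example runs without network access.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional

# Refresh interval taken from this guide: refresh at 50 minutes, expiry at 60.
REFRESH_INTERVAL_SECONDS = 50 * 60


class TokenCache:
    """Hypothetical helper that caches a bearer token and refreshes it lazily."""

    def __init__(
        self,
        fetch_token: Callable[[], Awaitable[str]],  # assumed: performs the IAM call
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_token = fetch_token
        self._clock = clock
        # Create the cache inside a running event loop (e.g. at app startup).
        self._lock = asyncio.Lock()
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def _is_stale(self) -> bool:
        if self._token is None:
            return True
        return self._clock() - self._fetched_at >= REFRESH_INTERVAL_SECONDS

    async def get_token(self, force_refresh: bool = False) -> str:
        """Return a valid token; pass force_refresh=True after a 401 response."""
        if force_refresh or self._is_stale():
            async with self._lock:
                # Re-check under the lock so waiting coroutines reuse the
                # token that the first coroutine just fetched.
                if force_refresh or self._is_stale():
                    self._token = await self._fetch_token()
                    self._fetched_at = self._clock()
        assert self._token is not None
        return self._token
```

The second staleness check inside the lock is what keeps a burst of concurrent requests from triggering one refresh each. A forced refresh deliberately skips that shortcut, so several simultaneous 401s would each refetch; a real implementation may want to deduplicate those too.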
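
## Sketch: SSE Formatting

The streaming rules in this guide (one SSE event per chunk, a final `[DONE]` message) can be illustrated with a small generator. The repository's `format_sse_event()` is named in this guide but its signature is not shown, so the version below is an assumed stand-in that takes a dict and returns a `data:` line; `sse_stream` is a hypothetical helper.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def format_sse_event(payload: Dict[str, Any]) -> str:
    """Assumed shape: serialize one chunk as a single SSE `data:` event."""
    return f"data: {json.dumps(payload)}\n\n"


def sse_stream(chunks: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield each chunk as an SSE event, then the terminating [DONE] event."""
    for chunk in chunks:
        yield format_sse_event(chunk)
    # OpenAI-style streams end with a literal [DONE] sentinel, not JSON.
    yield "data: [DONE]\n\n"
```

In the proxy, a generator like this (or its async equivalent) would be what gets passed to `StreamingResponse` with `media_type="text/event-stream"`.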
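
## Sketch: Model Mapping

To make the `MODEL_MAP_*` convention concrete, here is one possible way a lookup such as `settings.map_model()` could work. How the real implementation turns an OpenAI model name into an environment variable suffix is not stated in this guide; the normalization below (drop non-alphanumeric characters, uppercase, so `gpt-4` becomes `GPT4`) is purely an assumption, as are the function names `load_model_map` and `map_model`.

```python
import re
from typing import Dict, Mapping

PREFIX = "MODEL_MAP_"


def load_model_map(env: Mapping[str, str]) -> Dict[str, str]:
    """Collect MODEL_MAP_* entries, keyed by the suffix after the prefix."""
    return {
        key[len(PREFIX):]: value
        for key, value in env.items()
        if key.startswith(PREFIX) and value
    }


def map_model(requested: str, model_map: Mapping[str, str]) -> str:
    """Return the mapped watsonx model ID, or the requested name unchanged."""
    # Assumed normalization: "gpt-4" -> "GPT4" to match MODEL_MAP_GPT4.
    suffix = re.sub(r"[^A-Za-z0-9]", "", requested).upper()
    return model_map.get(suffix, requested)
```

With `MODEL_MAP_GPT4=ibm/granite-4-h-small` in the environment, `map_model("gpt-4", load_model_map(os.environ))` would resolve to the Granite model under this assumed scheme, while unmapped names pass through untouched so native watsonx model IDs keep working.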
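
## Sketch: OpenAI-Style Error Body

The error handling conventions call for responses that match OpenAI's error format, which wraps the details in a top-level `error` object with `message`, `type`, `param`, and `code` fields. The helper below is a minimal sketch of building that body; the function name `openai_error_body` and its default `error_type` are illustrative choices, not names from this repository.

```python
from typing import Any, Dict, Optional


def openai_error_body(
    message: str,
    error_type: str = "api_error",
    param: Optional[str] = None,
    code: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a JSON-serializable error payload in OpenAI's envelope shape."""
    return {
        "error": {
            "message": message,
            "type": error_type,
            "param": param,  # name of the offending request field, if any
            "code": code,    # short machine-readable code, if any
        }
    }
```

A router-level exception handler could return this dict as the JSON body alongside the appropriate HTTP status code, so OpenAI client libraries can parse failures the same way they parse upstream errors.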
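
## Sketch: Request Helpers

Three details from the implementation notes (the base URL with its required `version` parameter, a `stop` value that may be a string or a list, and prefixed completion IDs) are easy to get subtly wrong, so here is a hedged sketch of each. The function names are hypothetical, and using the 32-character hex form of the UUID is an assumption; this guide only specifies `chatcmpl-{uuid}` and `cmpl-{uuid}`.

```python
import uuid
from typing import List, Optional, Sequence, Union
from urllib.parse import urlencode

API_VERSION = "2024-02-13"  # version value given in this guide


def watsonx_url(cluster: str, path: str, version: str = API_VERSION) -> str:
    """Join the regional base URL, an endpoint path, and the version query."""
    base = f"https://{cluster}.ml.cloud.ibm.com/ml/v1"
    return f"{base}{path}?{urlencode({'version': version})}"


def normalize_stop(stop: Optional[Union[str, Sequence[str]]]) -> List[str]:
    """Accept None, a single string, or a sequence; always return a list."""
    if stop is None:
        return []
    if isinstance(stop, str):
        return [stop]
    return list(stop)


def new_completion_id(chat: bool = True) -> str:
    """Generate `chatcmpl-...` for chat or `cmpl-...` for text completions."""
    prefix = "chatcmpl" if chat else "cmpl"
    return f"{prefix}-{uuid.uuid4().hex}"
```

Checking `isinstance(stop, str)` before treating the value as a sequence matters because a Python string is itself iterable; without that check, `"END"` would be split into `["E", "N", "D"]`.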