# AGENTS.md

This file provides guidance to agents when working with code in this repository.

## Project Overview

**watsonx-openai-proxy** is an OpenAI-compatible API proxy for IBM watsonx.ai. It enables any tool or application that supports the OpenAI API format to seamlessly work with watsonx.ai models.

### Core Purpose

- Provide a drop-in replacement for OpenAI API endpoints
- Translate OpenAI API requests to watsonx.ai API calls
- Handle IBM Cloud authentication and token management automatically
- Support streaming responses via Server-Sent Events (SSE)

### Technology Stack

- **Framework**: FastAPI (async web framework)
- **Language**: Python 3.9+
- **HTTP Client**: httpx (async HTTP client)
- **Validation**: Pydantic v2 (data validation and settings)
- **Server**: uvicorn (ASGI server)

### Architecture

The codebase follows a clean, modular architecture:

```
app/
├── main.py      # FastAPI app initialization, middleware, lifespan management
├── config.py    # Settings management, model mapping, environment variables
├── routers/     # API endpoint handlers (chat, completions, embeddings, models)
├── services/    # Business logic (watsonx_service for API interactions)
├── models/      # Pydantic models for OpenAI-compatible schemas
└── utils/       # Helper functions (request/response transformers)
```

**Key Design Patterns**:

- **Service Layer**: `watsonx_service.py` encapsulates all watsonx.ai API interactions
- **Transformer Pattern**: `transformers.py` handles bidirectional conversion between OpenAI and watsonx formats
- **Singleton Services**: Global service instances (`watsonx_service`, `settings`) for shared state
- **Async/Await**: All I/O operations are asynchronous for better performance
- **Middleware**: Custom authentication middleware for optional API key validation

## Building and Running

### Prerequisites

```bash
# Python 3.9 or higher required
python --version

# IBM Cloud credentials needed:
# - IBM_CLOUD_API_KEY
# - WATSONX_PROJECT_ID
```

### Installation

```bash
# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your IBM Cloud credentials
```

### Running the Server

```bash
# Development (with auto-reload)
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Production (with workers)
uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4

# Using Python module
python -m app.main
```

### Docker Deployment

```bash
# Build image
docker build -t watsonx-openai-proxy .

# Run container
docker run -p 8000:8000 --env-file .env watsonx-openai-proxy

# Using docker-compose
docker-compose up
```

### Testing

```bash
# Install test dependencies
pip install pytest pytest-asyncio httpx

# Run tests
pytest tests/

# Run with coverage
pytest tests/ --cov=app
```

## Development Conventions

### Code Style

- **Async First**: Use `async`/`await` for all I/O operations (HTTP requests, file operations)
- **Type Hints**: All functions should have type annotations for parameters and return values
- **Docstrings**: Use Google-style docstrings for functions and classes
- **Logging**: Use the `logging` module with appropriate log levels (info, warning, error)

### Error Handling

- Catch exceptions at the router level and return OpenAI-compatible error responses
- Use `HTTPException` with proper status codes and error details
- Log errors with full context using `logger.error(..., exc_info=True)`
- Return structured error responses matching OpenAI's error format

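The structured error body can be sketched as a small helper; the function name `openai_error_body` is illustrative (it is not a function in this repo), but the field layout follows OpenAI's documented error format:

```python
from typing import Optional

def openai_error_body(
    message: str,
    err_type: str = "invalid_request_error",
    param: Optional[str] = None,
    code: Optional[str] = None,
) -> dict:
    """Build a response body in OpenAI's error schema."""
    return {
        "error": {
            "message": message,
            "type": err_type,
            "param": param,
            "code": code,
        }
    }
```

Routers would serialize this dict as the JSON body alongside the appropriate HTTP status code.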
### Configuration Management

- All configuration via environment variables (`.env` file)
- Use `pydantic-settings` for type-safe configuration
- Model mapping via `MODEL_MAP_*` environment variables
- Settings accessed through global `settings` instance

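As a dependency-free illustration of the settings flow (the real project uses a `pydantic-settings` class, not this function), required and optional variables could be resolved like this, with defaults taken from the Environment Variables Reference below:

```python
from typing import Dict

# Variables the proxy cannot run without.
REQUIRED = ("IBM_CLOUD_API_KEY", "WATSONX_PROJECT_ID")

def load_settings(env: Dict[str, str]) -> dict:
    """Resolve settings from an environment mapping; fail fast on gaps."""
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    return {
        "ibm_cloud_api_key": env["IBM_CLOUD_API_KEY"],
        "watsonx_project_id": env["WATSONX_PROJECT_ID"],
        "watsonx_cluster": env.get("WATSONX_CLUSTER", "us-south"),
        "port": int(env.get("PORT", "8000")),
        "log_level": env.get("LOG_LEVEL", "info"),
    }
```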
### Token Management

- Bearer tokens automatically refreshed every 50 minutes (expire at 60 minutes)
- Token refresh on 401 errors from watsonx.ai
- Thread-safe token refresh using `asyncio.Lock`
- Initial token obtained during application startup

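These refresh rules can be sketched as a small token manager. The IAM call itself is stubbed out (`_fetch_token` is a placeholder, not the repo's actual method); the lock and timing logic are the point:

```python
import asyncio
import time

# Refresh 10 minutes before the 60-minute expiry, per the convention above.
REFRESH_INTERVAL = 50 * 60

class TokenManager:
    """Minimal sketch of lock-guarded bearer-token caching."""

    def __init__(self) -> None:
        self._token = None
        self._fetched_at = 0.0
        self._lock = asyncio.Lock()

    async def _fetch_token(self) -> str:
        # Placeholder for the real IBM Cloud IAM token exchange.
        return f"token-{int(time.time())}"

    async def get_token(self, force: bool = False) -> str:
        # The lock ensures only one coroutine performs a refresh at a time;
        # force=True models the retry-on-401 path.
        async with self._lock:
            expired = time.time() - self._fetched_at >= REFRESH_INTERVAL
            if force or self._token is None or expired:
                self._token = await self._fetch_token()
                self._fetched_at = time.time()
            return self._token
```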
### API Compatibility

- Maintain strict OpenAI API compatibility in request/response formats
- Use Pydantic models from `openai_models.py` for validation
- Transform requests/responses using functions in `transformers.py`
- Support both streaming and non-streaming responses

### Adding New Endpoints

1. Create router in `app/routers/` (e.g., `new_endpoint.py`)
2. Define Pydantic models in `app/models/openai_models.py`
3. Add transformation logic in `app/utils/transformers.py`
4. Add watsonx.ai API method in `app/services/watsonx_service.py`
5. Register router in `app/main.py` using `app.include_router()`

### Streaming Responses

- Use `StreamingResponse` with `media_type="text/event-stream"`
- Format chunks as Server-Sent Events using `format_sse_event()`
- Always send `[DONE]` message at the end of stream
- Handle errors gracefully and send error events in SSE format

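The chunk framing can be sketched as follows (the exact signature of the repo's `format_sse_event()` may differ):

```python
import json
from typing import Iterable, Iterator

def format_sse_event(data: dict) -> str:
    """Serialize one delta chunk as a Server-Sent Event."""
    return f"data: {json.dumps(data)}\n\n"

def sse_events(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield each chunk as an SSE event, always ending with [DONE]."""
    for chunk in chunks:
        yield format_sse_event(chunk)
    yield "data: [DONE]\n\n"
```

In the app this generator would be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`.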
### Model Mapping

- Map OpenAI model names to watsonx models via environment variables
- Format: `MODEL_MAP_<OPENAI_MODEL>=<WATSONX_MODEL_ID>`
- Example: `MODEL_MAP_GPT4=ibm/granite-4-h-small`
- Mapping applied in `settings.map_model()` before API calls

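One plausible reading of the lookup is sketched below; the name normalization (dropping `-` and `.` so that `gpt-4` matches `MODEL_MAP_GPT4`) is an assumption, and the exact logic in `settings.map_model()` may differ:

```python
from typing import Dict

PREFIX = "MODEL_MAP_"

def build_model_map(env: Dict[str, str]) -> Dict[str, str]:
    """Collect MODEL_MAP_* variables into a lookup table."""
    return {k[len(PREFIX):].lower(): v for k, v in env.items() if k.startswith(PREFIX)}

def map_model(name: str, model_map: Dict[str, str]) -> str:
    """Resolve an OpenAI model name; fall through to the name when unmapped."""
    key = name.replace("-", "").replace(".", "").lower()
    return model_map.get(key, name)
```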
### Security Considerations

- Optional API key authentication via `API_KEY` environment variable
- Middleware validates Bearer token in Authorization header
- IBM Cloud API key stored securely in environment variables
- CORS configured via `ALLOWED_ORIGINS` (default: `*`)

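In the app this check runs as FastAPI middleware; reduced to a pure function over the incoming headers (a sketch, including the `/health` bypass noted under API Endpoints), it looks roughly like:

```python
from typing import Dict, Optional

def is_authorized(headers: Dict[str, str], api_key: Optional[str], path: str) -> bool:
    """Return True when the request may proceed past the auth middleware."""
    if api_key is None:      # auth is optional: disabled when API_KEY is unset
        return True
    if path == "/health":    # health check bypasses authentication
        return True
    return headers.get("authorization", "") == f"Bearer {api_key}"
```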
### Logging Best Practices

- Use structured logging with context (model names, request IDs)
- Log level controlled by `LOG_LEVEL` environment variable
- Log token refresh events at INFO level
- Log API errors at ERROR level with full traceback
- Include request/response details for debugging

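A contextual log line following these conventions might look like the sketch below; the logger name and field keys are illustrative, not the repo's exact choices:

```python
import logging

logger = logging.getLogger("watsonx_proxy")  # name is illustrative

def log_completion_request(model: str, request_id: str) -> str:
    """Emit an INFO line carrying the model name and request ID as context."""
    message = f"chat completion requested model={model} request_id={request_id}"
    logger.info(message)
    return message
```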
### Dependencies

- Keep `requirements.txt` minimal and pinned to specific versions
- FastAPI and Pydantic are core dependencies - avoid breaking changes
- httpx for async HTTP - prefer over requests/aiohttp
- Use `uvicorn[standard]` for production-ready server

## Important Implementation Notes

### watsonx.ai API Specifics

- Base URL format: `https://{cluster}.ml.cloud.ibm.com/ml/v1`
- API version parameter: `version=2024-02-13` (required on all requests)
- Chat endpoint: `/text/chat` (non-streaming) or `/text/chat_stream` (streaming)
- Text generation: `/text/generation`
- Embeddings: `/text/embeddings`

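Putting those pieces together, URL construction can be sketched as (the helper itself is illustrative; the base URL, paths, and version parameter are the ones listed above):

```python
WATSONX_VERSION = "2024-02-13"  # required on all requests

def watsonx_url(cluster: str, endpoint: str) -> str:
    """Build a full watsonx.ai URL for a given cluster and endpoint path."""
    base = f"https://{cluster}.ml.cloud.ibm.com/ml/v1"
    return f"{base}{endpoint}?version={WATSONX_VERSION}"
```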
### Request/Response Transformation

- OpenAI messages → watsonx messages: Direct mapping with role/content
- watsonx responses → OpenAI format: Extract choices, usage, and metadata
- Streaming chunks: Parse SSE format, transform delta objects
- Generate unique IDs: `chatcmpl-{uuid}` for chat, `cmpl-{uuid}` for completions

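The response-side transformation can be sketched as wrapping a watsonx result in an OpenAI-shaped envelope; the helper name and its parameters are illustrative, while the `chatcmpl-{uuid}` ID scheme and the envelope fields follow the mapping above:

```python
import time
import uuid

def to_openai_chat_response(model: str, content: str, finish_reason: str = "stop") -> dict:
    """Wrap generated text in an OpenAI chat.completion envelope."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "finish_reason": finish_reason,
            }
        ],
    }
```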
### Common Pitfalls

- Don't forget to refresh tokens before they expire (50-minute interval)
- Always close httpx client on shutdown (`await watsonx_service.close()`)
- Handle both string and list formats for `stop` parameter
- Validate model IDs exist in watsonx.ai before making requests
- Set appropriate timeouts for long-running generation requests (300s default)

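The `stop` pitfall can be handled with a small normalizer (a sketch; the repo's transformer may do this inline):

```python
from typing import List, Optional, Union

def normalize_stop(stop: Union[str, List[str], None]) -> List[str]:
    """Accept OpenAI's stop parameter as a string, list, or None."""
    if stop is None:
        return []
    if isinstance(stop, str):
        return [stop]
    return list(stop)
```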
### Performance Optimization

- Reuse httpx client instance (don't create per request)
- Use connection pooling (httpx default behavior)
- Consider worker processes for production (`--workers 4`)
- Monitor token refresh to avoid rate limiting

## Environment Variables Reference

### Required

- `IBM_CLOUD_API_KEY`: IBM Cloud API key for authentication
- `WATSONX_PROJECT_ID`: watsonx.ai project ID

### Optional

- `WATSONX_CLUSTER`: Region (default: `us-south`)
- `HOST`: Server host (default: `0.0.0.0`)
- `PORT`: Server port (default: `8000`)
- `LOG_LEVEL`: Logging level (default: `info`)
- `API_KEY`: Optional proxy authentication key
- `ALLOWED_ORIGINS`: CORS origins (default: `*`)
- `MODEL_MAP_*`: Model name mappings

## API Endpoints

- `GET /` - API information and available endpoints
- `GET /health` - Health check (bypasses authentication)
- `GET /docs` - Interactive Swagger UI documentation
- `POST /v1/chat/completions` - Chat completions (streaming supported)
- `POST /v1/completions` - Text completions (legacy)
- `POST /v1/embeddings` - Generate embeddings
- `GET /v1/models` - List available models
- `GET /v1/models/{model_id}` - Get specific model info