# AGENTS.md
This file provides guidance to agents when working with code in this repository.
## Project Overview
**watsonx-openai-proxy** is an OpenAI-compatible API proxy for IBM watsonx.ai. It enables any tool or application that supports the OpenAI API format to seamlessly work with watsonx.ai models.
### Core Purpose
- Provide a drop-in replacement for OpenAI API endpoints
- Translate OpenAI API requests to watsonx.ai API calls
- Handle IBM Cloud authentication and token management automatically
- Support streaming responses via Server-Sent Events (SSE)
### Technology Stack
- **Framework**: FastAPI (async web framework)
- **Language**: Python 3.9+
- **HTTP Client**: httpx (async HTTP client)
- **Validation**: Pydantic v2 (data validation and settings)
- **Server**: uvicorn (ASGI server)
### Architecture
The codebase follows a clean, modular architecture:
```
app/
├── main.py # FastAPI app initialization, middleware, lifespan management
├── config.py # Settings management, model mapping, environment variables
├── routers/ # API endpoint handlers (chat, completions, embeddings, models)
├── services/ # Business logic (watsonx_service for API interactions)
├── models/ # Pydantic models for OpenAI-compatible schemas
└── utils/ # Helper functions (request/response transformers)
```
**Key Design Patterns**:
- **Service Layer**: `watsonx_service.py` encapsulates all watsonx.ai API interactions
- **Transformer Pattern**: `transformers.py` handles bidirectional conversion between OpenAI and watsonx formats
- **Singleton Services**: Global service instances (`watsonx_service`, `settings`) for shared state
- **Async/Await**: All I/O operations are asynchronous for better performance
- **Middleware**: Custom authentication middleware for optional API key validation
## Building and Running
### Prerequisites
```bash
# Python 3.9 or higher required
python --version
# IBM Cloud credentials needed:
# - IBM_CLOUD_API_KEY
# - WATSONX_PROJECT_ID
```
### Installation
```bash
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your IBM Cloud credentials
```
### Running the Server
```bash
# Development (with auto-reload)
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
# Production (with workers)
uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4
# Using Python module
python -m app.main
```
### Docker Deployment
```bash
# Build image
docker build -t watsonx-openai-proxy .
# Run container
docker run -p 8000:8000 --env-file .env watsonx-openai-proxy
# Using docker-compose
docker-compose up
```
### Testing
```bash
# Install test dependencies
pip install pytest pytest-asyncio httpx
# Run tests
pytest tests/
# Run with coverage
pytest tests/ --cov=app
```
## Development Conventions
### Code Style
- **Async First**: Use `async`/`await` for all I/O operations (HTTP requests, file operations)
- **Type Hints**: All functions should have type annotations for parameters and return values
- **Docstrings**: Use Google-style docstrings for functions and classes
- **Logging**: Use the `logging` module with appropriate log levels (info, warning, error)
### Error Handling
- Catch exceptions at router level and return OpenAI-compatible error responses
- Use `HTTPException` with proper status codes and error details
- Log errors with full context using `logger.error(..., exc_info=True)`
- Return structured error responses matching OpenAI's error format
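As a sketch, an OpenAI-compatible error envelope can be built like this. The helper name `openai_error` is illustrative (not an existing function in this codebase); in a router, the returned dict would typically be used as the `detail` of an `HTTPException` or serialized directly as a JSON response body.

```python
import json
from typing import Optional

def openai_error(message: str, err_type: str = "invalid_request_error",
                 code: Optional[str] = None) -> dict:
    """Build an error body matching OpenAI's error envelope."""
    return {
        "error": {
            "message": message,
            "type": err_type,
            "param": None,
            "code": code,
        }
    }

body = openai_error("Model 'gpt-5' is not mapped", code="model_not_found")
print(json.dumps(body))
```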
### Configuration Management
- All configuration via environment variables (`.env` file)
- Use `pydantic-settings` for type-safe configuration
- Model mapping via `MODEL_MAP_*` environment variables
- Settings accessed through global `settings` instance
### Token Management
- Bearer tokens automatically refreshed every 50 minutes (expire at 60 minutes)
- Token refresh on 401 errors from watsonx.ai
- Thread-safe token refresh using `asyncio.Lock`
- Initial token obtained during application startup
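The refresh logic above can be sketched as follows. This is a minimal illustration, not the actual `watsonx_service` implementation: `_fetch_token` is a placeholder for the real IAM exchange of `IBM_CLOUD_API_KEY` for a bearer token.

```python
import asyncio
import time
from typing import Optional

REFRESH_INTERVAL = 50 * 60  # refresh 10 minutes before the 60-minute expiry

class TokenManager:
    """Caches a bearer token and refreshes it under an asyncio.Lock."""

    def __init__(self) -> None:
        self._token: Optional[str] = None
        self._fetched_at: float = 0.0
        self._lock = asyncio.Lock()

    async def _fetch_token(self) -> str:
        # Placeholder: the real service exchanges IBM_CLOUD_API_KEY
        # for a bearer token at the IBM Cloud IAM endpoint.
        return f"token-{int(time.time())}"

    async def get_token(self) -> str:
        # The lock makes concurrent refresh attempts serialize, so only
        # one request actually fetches a new token when it expires.
        async with self._lock:
            if self._token is None or time.time() - self._fetched_at >= REFRESH_INTERVAL:
                self._token = await self._fetch_token()
                self._fetched_at = time.time()
            return self._token

async def main() -> str:
    mgr = TokenManager()
    return await mgr.get_token()

print(asyncio.run(main()))
```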
### API Compatibility
- Maintain strict OpenAI API compatibility in request/response formats
- Use Pydantic models from `openai_models.py` for validation
- Transform requests/responses using functions in `transformers.py`
- Support both streaming and non-streaming responses
### Adding New Endpoints
1. Create router in `app/routers/` (e.g., `new_endpoint.py`)
2. Define Pydantic models in `app/models/openai_models.py`
3. Add transformation logic in `app/utils/transformers.py`
4. Add watsonx.ai API method in `app/services/watsonx_service.py`
5. Register router in `app/main.py` using `app.include_router()`
### Streaming Responses
- Use `StreamingResponse` with `media_type="text/event-stream"`
- Format chunks as Server-Sent Events using `format_sse_event()`
- Always send `[DONE]` message at the end of stream
- Handle errors gracefully and send error events in SSE format
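A possible shape for `format_sse_event()` is shown below; the actual signature lives in `app/utils/transformers.py`, so treat this version as an assumption. Each event is a `data:` line followed by a blank line, and the terminal `[DONE]` marker is sent as a plain string.

```python
import json
from typing import Union

def format_sse_event(data: Union[dict, str]) -> str:
    """Serialize a chunk as a Server-Sent Event ('data: ...' plus blank line)."""
    payload = data if isinstance(data, str) else json.dumps(data)
    return f"data: {payload}\n\n"

chunk = {"id": "chatcmpl-123", "choices": [{"delta": {"content": "Hi"}}]}
print(format_sse_event(chunk), end="")
print(format_sse_event("[DONE]"), end="")  # always the last event in the stream
```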
### Model Mapping
- Map OpenAI model names to watsonx models via environment variables
- Format: `MODEL_MAP_<OPENAI_MODEL>=<WATSONX_MODEL_ID>`
- Example: `MODEL_MAP_GPT4=ibm/granite-4-h-small`
- Mapping applied in `settings.map_model()` before API calls
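A sketch of the lookup, assuming a normalization rule of uppercasing and stripping non-alphanumeric characters (so `gpt-4` resolves `MODEL_MAP_GPT4`); the exact rule the proxy uses is defined in `settings.map_model()`, and unmapped names are passed through unchanged here.

```python
import os
import re

def map_model(openai_model: str) -> str:
    """Resolve an OpenAI model name to a watsonx model id via MODEL_MAP_* vars."""
    # Assumed normalization: uppercase, drop non-alphanumeric characters.
    key = "MODEL_MAP_" + re.sub(r"[^A-Za-z0-9]", "", openai_model).upper()
    return os.environ.get(key, openai_model)  # fall through for unmapped names

os.environ["MODEL_MAP_GPT4"] = "ibm/granite-4-h-small"
print(map_model("gpt-4"))
print(map_model("some-other-model"))
```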
### Security Considerations
- Optional API key authentication via `API_KEY` environment variable
- Middleware validates Bearer token in Authorization header
- IBM Cloud API key stored securely in environment variables
- CORS configured via `ALLOWED_ORIGINS` (default: `*`)
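The Bearer check the middleware performs can be sketched like this (the function name is illustrative; a constant-time comparison via `hmac.compare_digest` avoids timing side channels):

```python
import hmac
from typing import Optional

def is_authorized(auth_header: Optional[str], expected_key: Optional[str]) -> bool:
    """Validate 'Authorization: Bearer <key>' against the configured API_KEY."""
    if not expected_key:  # auth is disabled when API_KEY is unset
        return True
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    supplied = auth_header[len("Bearer "):]
    return hmac.compare_digest(supplied, expected_key)  # constant-time compare

print(is_authorized("Bearer secret", "secret"))  # valid key
print(is_authorized(None, None))                 # auth disabled
print(is_authorized("Bearer wrong", "secret"))   # rejected
```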
### Logging Best Practices
- Use structured logging with context (model names, request IDs)
- Log level controlled by `LOG_LEVEL` environment variable
- Log token refresh events at INFO level
- Log API errors at ERROR level with full traceback
- Include request/response details for debugging
### Dependencies
- Keep `requirements.txt` minimal and pinned to specific versions
- FastAPI and Pydantic are core dependencies - avoid breaking changes
- httpx for async HTTP - prefer over requests/aiohttp
- Use `uvicorn[standard]` for production-ready server
## Important Implementation Notes
### watsonx.ai API Specifics
- Base URL format: `https://{cluster}.ml.cloud.ibm.com/ml/v1`
- API version parameter: `version=2024-02-13` (required on all requests)
- Chat endpoint: `/text/chat` (non-streaming) or `/text/chat_stream` (streaming)
- Text generation: `/text/generation`
- Embeddings: `/text/embeddings`
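Putting the pieces above together, a URL builder might look like this (an illustrative sketch, not the project's actual helper):

```python
def watsonx_url(cluster: str, path: str, version: str = "2024-02-13") -> str:
    """Build a watsonx.ai endpoint URL with the required version parameter."""
    return f"https://{cluster}.ml.cloud.ibm.com/ml/v1{path}?version={version}"

print(watsonx_url("us-south", "/text/chat"))
print(watsonx_url("eu-de", "/text/embeddings"))
```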
### Request/Response Transformation
- OpenAI messages → watsonx messages: Direct mapping with role/content
- watsonx responses → OpenAI format: Extract choices, usage, and metadata
- Streaming chunks: Parse SSE format, transform delta objects
- Generate unique IDs: `chatcmpl-{uuid}` for chat, `cmpl-{uuid}` for completions
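The ID generation and response wrapping can be sketched as below. The envelope fields follow OpenAI's `chat.completion` object; the helper names and the exact field layout of the real transformers in `app/utils/transformers.py` are assumptions.

```python
import time
import uuid

def make_completion_id(kind: str = "chat") -> str:
    """Generate an OpenAI-style id: chatcmpl-{uuid} for chat, cmpl-{uuid} otherwise."""
    prefix = "chatcmpl" if kind == "chat" else "cmpl"
    return f"{prefix}-{uuid.uuid4().hex}"

def to_openai_chat_response(model: str, text: str, usage: dict) -> dict:
    """Wrap generated text in an OpenAI chat.completion envelope (a sketch)."""
    return {
        "id": make_completion_id("chat"),
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
        "usage": usage,
    }

resp = to_openai_chat_response(
    "gpt-4", "Hello!",
    {"prompt_tokens": 3, "completion_tokens": 2, "total_tokens": 5},
)
print(resp["object"])
```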
### Common Pitfalls
- Don't forget to refresh tokens before they expire (50-minute interval)
- Always close httpx client on shutdown (`await watsonx_service.close()`)
- Handle both string and list formats for `stop` parameter
- Validate model IDs exist in watsonx.ai before making requests
- Set appropriate timeouts for long-running generation requests (300s default)
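The `stop` parameter pitfall above can be handled with a small normalizer (illustrative; the real handling lives in the transformers):

```python
from typing import List, Optional, Union

def normalize_stop(stop: Optional[Union[str, List[str]]]) -> List[str]:
    """Accept None, a single string, or a list of strings; always return a list."""
    if stop is None:
        return []
    if isinstance(stop, str):
        return [stop]
    return list(stop)

print(normalize_stop(None))        # []
print(normalize_stop("###"))       # ['###']
print(normalize_stop(["a", "b"]))  # ['a', 'b']
```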
### Performance Optimization
- Reuse httpx client instance (don't create per request)
- Use connection pooling (httpx default behavior)
- Consider worker processes for production (`--workers 4`)
- Monitor token refresh to avoid rate limiting
## Environment Variables Reference
### Required
- `IBM_CLOUD_API_KEY`: IBM Cloud API key for authentication
- `WATSONX_PROJECT_ID`: watsonx.ai project ID
### Optional
- `WATSONX_CLUSTER`: Region (default: `us-south`)
- `HOST`: Server host (default: `0.0.0.0`)
- `PORT`: Server port (default: `8000`)
- `LOG_LEVEL`: Logging level (default: `info`)
- `API_KEY`: Optional proxy authentication key
- `ALLOWED_ORIGINS`: CORS origins (default: `*`)
- `MODEL_MAP_*`: Model name mappings
## API Endpoints
- `GET /` - API information and available endpoints
- `GET /health` - Health check (bypasses authentication)
- `GET /docs` - Interactive Swagger UI documentation
- `POST /v1/chat/completions` - Chat completions (streaming supported)
- `POST /v1/completions` - Text completions (legacy)
- `POST /v1/embeddings` - Generate embeddings
- `GET /v1/models` - List available models
- `GET /v1/models/{model_id}` - Get specific model info