# AGENTS.md

This file provides guidance to agents when working with code in this repository.

## Project Overview

**watsonx-openai-proxy** is an OpenAI-compatible API proxy for IBM watsonx.ai. It enables any tool or application that supports the OpenAI API format to seamlessly work with watsonx.ai models.

### Core Purpose

- Provide a drop-in replacement for OpenAI API endpoints
- Translate OpenAI API requests to watsonx.ai API calls
- Handle IBM Cloud authentication and token management automatically
- Support streaming responses via Server-Sent Events (SSE)

### Technology Stack

- **Framework**: FastAPI (async web framework)
- **Language**: Python 3.9+
- **HTTP Client**: httpx (async HTTP client)
- **Validation**: Pydantic v2 (data validation and settings)
- **Server**: uvicorn (ASGI server)

### Architecture

The codebase follows a clean, modular architecture:

```
app/
├── main.py      # FastAPI app initialization, middleware, lifespan management
├── config.py    # Settings management, model mapping, environment variables
├── routers/     # API endpoint handlers (chat, completions, embeddings, models)
├── services/    # Business logic (watsonx_service for API interactions)
├── models/      # Pydantic models for OpenAI-compatible schemas
└── utils/       # Helper functions (request/response transformers)
```

**Key Design Patterns**:

- **Service Layer**: `watsonx_service.py` encapsulates all watsonx.ai API interactions
- **Transformer Pattern**: `transformers.py` handles bidirectional conversion between OpenAI and watsonx formats
- **Singleton Services**: Global service instances (`watsonx_service`, `settings`) for shared state
- **Async/Await**: All I/O operations are asynchronous for better performance
- **Middleware**: Custom authentication middleware for optional API key validation

## Building and Running

### Prerequisites

```bash
# Python 3.9 or higher required
python --version

# IBM Cloud credentials needed:
# - IBM_CLOUD_API_KEY
# - WATSONX_PROJECT_ID
```

### Installation

```bash
# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your IBM Cloud credentials
```

### Running the Server

```bash
# Development (with auto-reload)
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Production (with workers)
uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4

# Using Python module
python -m app.main
```

### Docker Deployment

```bash
# Build image
docker build -t watsonx-openai-proxy .

# Run container
docker run -p 8000:8000 --env-file .env watsonx-openai-proxy

# Using docker-compose
docker-compose up
```

### Testing

```bash
# Install test dependencies
pip install pytest pytest-asyncio httpx

# Run tests
pytest tests/

# Run with coverage
pytest tests/ --cov=app
```

## Development Conventions

### Code Style

- **Async First**: Use `async`/`await` for all I/O operations (HTTP requests, file operations)
- **Type Hints**: All functions should have type annotations for parameters and return values
- **Docstrings**: Use Google-style docstrings for functions and classes
- **Logging**: Use the `logging` module with appropriate log levels (info, warning, error)

### Error Handling

- Catch exceptions at the router level and return OpenAI-compatible error responses
- Use `HTTPException` with proper status codes and error details
- Log errors with full context using `logger.error(..., exc_info=True)`
- Return structured error responses matching OpenAI's error format

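The structured error body can be sketched as a small helper; the function name `openai_error_body` is illustrative (it is not a function in this repo), but the field layout follows OpenAI's documented error format:

```python
from typing import Optional

def openai_error_body(
    message: str,
    err_type: str = "invalid_request_error",
    param: Optional[str] = None,
    code: Optional[str] = None,
) -> dict:
    """Build a response body in OpenAI's error schema."""
    return {
        "error": {
            "message": message,
            "type": err_type,
            "param": param,
            "code": code,
        }
    }
```

Routers would serialize this dict as the JSON body alongside the appropriate HTTP status code.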
### Configuration Management

- All configuration via environment variables (`.env` file)
- Use `pydantic-settings` for type-safe configuration
- Model mapping via `MODEL_MAP_*` environment variables
- Settings accessed through global `settings` instance

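As a dependency-free illustration of the settings flow (the real project uses a `pydantic-settings` class, not this function), required and optional variables could be resolved like this, with defaults taken from the Environment Variables Reference below:

```python
from typing import Dict

# Variables the proxy cannot run without.
REQUIRED = ("IBM_CLOUD_API_KEY", "WATSONX_PROJECT_ID")

def load_settings(env: Dict[str, str]) -> dict:
    """Resolve settings from an environment mapping; fail fast on gaps."""
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    return {
        "ibm_cloud_api_key": env["IBM_CLOUD_API_KEY"],
        "watsonx_project_id": env["WATSONX_PROJECT_ID"],
        "watsonx_cluster": env.get("WATSONX_CLUSTER", "us-south"),
        "port": int(env.get("PORT", "8000")),
        "log_level": env.get("LOG_LEVEL", "info"),
    }
```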
### Token Management

- Bearer tokens automatically refreshed every 50 minutes (expire at 60 minutes)
- Token refresh on 401 errors from watsonx.ai
- Thread-safe token refresh using `asyncio.Lock`
- Initial token obtained during application startup

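These refresh rules can be sketched as a small token manager. The IAM call itself is stubbed out (`_fetch_token` is a placeholder, not the repo's actual method); the lock and timing logic are the point:

```python
import asyncio
import time

# Refresh 10 minutes before the 60-minute expiry, per the convention above.
REFRESH_INTERVAL = 50 * 60

class TokenManager:
    """Minimal sketch of lock-guarded bearer-token caching."""

    def __init__(self) -> None:
        self._token = None
        self._fetched_at = 0.0
        self._lock = asyncio.Lock()

    async def _fetch_token(self) -> str:
        # Placeholder for the real IBM Cloud IAM token exchange.
        return f"token-{int(time.time())}"

    async def get_token(self, force: bool = False) -> str:
        # The lock ensures only one coroutine performs a refresh at a time;
        # force=True models the retry-on-401 path.
        async with self._lock:
            expired = time.time() - self._fetched_at >= REFRESH_INTERVAL
            if force or self._token is None or expired:
                self._token = await self._fetch_token()
                self._fetched_at = time.time()
            return self._token
```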
### API Compatibility

- Maintain strict OpenAI API compatibility in request/response formats
- Use Pydantic models from `openai_models.py` for validation
- Transform requests/responses using functions in `transformers.py`
- Support both streaming and non-streaming responses

### Adding New Endpoints

1. Create router in `app/routers/` (e.g., `new_endpoint.py`)
2. Define Pydantic models in `app/models/openai_models.py`
3. Add transformation logic in `app/utils/transformers.py`
4. Add watsonx.ai API method in `app/services/watsonx_service.py`
5. Register router in `app/main.py` using `app.include_router()`

### Streaming Responses

- Use `StreamingResponse` with `media_type="text/event-stream"`
- Format chunks as Server-Sent Events using `format_sse_event()`
- Always send `[DONE]` message at the end of stream
- Handle errors gracefully and send error events in SSE format

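The chunk framing can be sketched as follows (the exact signature of the repo's `format_sse_event()` may differ):

```python
import json
from typing import Iterable, Iterator

def format_sse_event(data: dict) -> str:
    """Serialize one delta chunk as a Server-Sent Event."""
    return f"data: {json.dumps(data)}\n\n"

def sse_events(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield each chunk as an SSE event, always ending with [DONE]."""
    for chunk in chunks:
        yield format_sse_event(chunk)
    yield "data: [DONE]\n\n"
```

In the app this generator would be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`.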
### Model Mapping

- Map OpenAI model names to watsonx models via environment variables
- Format: `MODEL_MAP_<OPENAI_MODEL>=<WATSONX_MODEL_ID>`
- Example: `MODEL_MAP_GPT4=ibm/granite-4-h-small`
- Mapping applied in `settings.map_model()` before API calls

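One plausible reading of the lookup is sketched below; the name normalization (dropping `-` and `.` so that `gpt-4` matches `MODEL_MAP_GPT4`) is an assumption, and the exact logic in `settings.map_model()` may differ:

```python
from typing import Dict

PREFIX = "MODEL_MAP_"

def build_model_map(env: Dict[str, str]) -> Dict[str, str]:
    """Collect MODEL_MAP_* variables into a lookup table."""
    return {k[len(PREFIX):].lower(): v for k, v in env.items() if k.startswith(PREFIX)}

def map_model(name: str, model_map: Dict[str, str]) -> str:
    """Resolve an OpenAI model name; fall through to the name when unmapped."""
    key = name.replace("-", "").replace(".", "").lower()
    return model_map.get(key, name)
```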
### Security Considerations

- Optional API key authentication via `API_KEY` environment variable
- Middleware validates Bearer token in Authorization header
- IBM Cloud API key stored securely in environment variables
- CORS configured via `ALLOWED_ORIGINS` (default: `*`)

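In the app this check runs as FastAPI middleware; reduced to a pure function over the incoming headers (a sketch, including the `/health` bypass noted under API Endpoints), it looks roughly like:

```python
from typing import Dict, Optional

def is_authorized(headers: Dict[str, str], api_key: Optional[str], path: str) -> bool:
    """Return True when the request may proceed past the auth middleware."""
    if api_key is None:      # auth is optional: disabled when API_KEY is unset
        return True
    if path == "/health":    # health check bypasses authentication
        return True
    return headers.get("authorization", "") == f"Bearer {api_key}"
```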
### Logging Best Practices

- Use structured logging with context (model names, request IDs)
- Log level controlled by `LOG_LEVEL` environment variable
- Log token refresh events at INFO level
- Log API errors at ERROR level with full traceback
- Include request/response details for debugging

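A contextual log line following these conventions might look like the sketch below; the logger name and field keys are illustrative, not the repo's exact choices:

```python
import logging

logger = logging.getLogger("watsonx_proxy")  # name is illustrative

def log_completion_request(model: str, request_id: str) -> str:
    """Emit an INFO line carrying the model name and request ID as context."""
    message = f"chat completion requested model={model} request_id={request_id}"
    logger.info(message)
    return message
```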
### Dependencies

- Keep `requirements.txt` minimal and pinned to specific versions
- FastAPI and Pydantic are core dependencies - avoid breaking changes
- httpx for async HTTP - prefer over requests/aiohttp
- Use `uvicorn[standard]` for production-ready server

## Important Implementation Notes

### watsonx.ai API Specifics

- Base URL format: `https://{cluster}.ml.cloud.ibm.com/ml/v1`
- API version parameter: `version=2024-02-13` (required on all requests)
- Chat endpoint: `/text/chat` (non-streaming) or `/text/chat_stream` (streaming)
- Text generation: `/text/generation`
- Embeddings: `/text/embeddings`

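Putting those pieces together, URL construction can be sketched as (the helper itself is illustrative; the base URL, paths, and version parameter are the ones listed above):

```python
WATSONX_VERSION = "2024-02-13"  # required on all requests

def watsonx_url(cluster: str, endpoint: str) -> str:
    """Build a full watsonx.ai URL for a given cluster and endpoint path."""
    base = f"https://{cluster}.ml.cloud.ibm.com/ml/v1"
    return f"{base}{endpoint}?version={WATSONX_VERSION}"
```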
### Request/Response Transformation

- OpenAI messages → watsonx messages: Direct mapping with role/content
- watsonx responses → OpenAI format: Extract choices, usage, and metadata
- Streaming chunks: Parse SSE format, transform delta objects
- Generate unique IDs: `chatcmpl-{uuid}` for chat, `cmpl-{uuid}` for completions

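The response-side transformation can be sketched as wrapping a watsonx result in an OpenAI-shaped envelope; the helper name and its parameters are illustrative, while the `chatcmpl-{uuid}` ID scheme and the envelope fields follow the mapping above:

```python
import time
import uuid

def to_openai_chat_response(model: str, content: str, finish_reason: str = "stop") -> dict:
    """Wrap generated text in an OpenAI chat.completion envelope."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "finish_reason": finish_reason,
            }
        ],
    }
```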
### Common Pitfalls

- Don't forget to refresh tokens before they expire (50-minute interval)
- Always close httpx client on shutdown (`await watsonx_service.close()`)
- Handle both string and list formats for `stop` parameter
- Validate model IDs exist in watsonx.ai before making requests
- Set appropriate timeouts for long-running generation requests (300s default)

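The `stop` pitfall can be handled with a small normalizer (a sketch; the repo's transformer may do this inline):

```python
from typing import List, Optional, Union

def normalize_stop(stop: Union[str, List[str], None]) -> List[str]:
    """Accept OpenAI's stop parameter as a string, list, or None."""
    if stop is None:
        return []
    if isinstance(stop, str):
        return [stop]
    return list(stop)
```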
### Performance Optimization

- Reuse httpx client instance (don't create per request)
- Use connection pooling (httpx default behavior)
- Consider worker processes for production (`--workers 4`)
- Monitor token refresh to avoid rate limiting

## Environment Variables Reference

### Required

- `IBM_CLOUD_API_KEY`: IBM Cloud API key for authentication
- `WATSONX_PROJECT_ID`: watsonx.ai project ID

### Optional

- `WATSONX_CLUSTER`: Region (default: `us-south`)
- `HOST`: Server host (default: `0.0.0.0`)
- `PORT`: Server port (default: `8000`)
- `LOG_LEVEL`: Logging level (default: `info`)
- `API_KEY`: Optional proxy authentication key
- `ALLOWED_ORIGINS`: CORS origins (default: `*`)
- `MODEL_MAP_*`: Model name mappings

## API Endpoints

- `GET /` - API information and available endpoints
- `GET /health` - Health check (bypasses authentication)
- `GET /docs` - Interactive Swagger UI documentation
- `POST /v1/chat/completions` - Chat completions (streaming supported)
- `POST /v1/completions` - Text completions (legacy)
- `POST /v1/embeddings` - Generate embeddings
- `GET /v1/models` - List available models
- `GET /v1/models/{model_id}` - Get specific model info