Add AGENTS.md documentation for AI agent guidance

AGENTS.md (new file, 223 lines)
@@ -0,0 +1,223 @@
# AGENTS.md

This file provides guidance to agents when working with code in this repository.

## Project Overview

**watsonx-openai-proxy** is an OpenAI-compatible API proxy for IBM watsonx.ai. It enables any tool or application that supports the OpenAI API format to work seamlessly with watsonx.ai models.

### Core Purpose

- Provide a drop-in replacement for OpenAI API endpoints
- Translate OpenAI API requests into watsonx.ai API calls
- Handle IBM Cloud authentication and token management automatically
- Support streaming responses via Server-Sent Events (SSE)

### Technology Stack

- **Framework**: FastAPI (async web framework)
- **Language**: Python 3.9+
- **HTTP Client**: httpx (async HTTP client)
- **Validation**: Pydantic v2 (data validation and settings)
- **Server**: uvicorn (ASGI server)

### Architecture

The codebase follows a clean, modular architecture:

```
app/
├── main.py      # FastAPI app initialization, middleware, lifespan management
├── config.py    # Settings management, model mapping, environment variables
├── routers/     # API endpoint handlers (chat, completions, embeddings, models)
├── services/    # Business logic (watsonx_service for API interactions)
├── models/      # Pydantic models for OpenAI-compatible schemas
└── utils/       # Helper functions (request/response transformers)
```

**Key Design Patterns**:

- **Service Layer**: `watsonx_service.py` encapsulates all watsonx.ai API interactions
- **Transformer Pattern**: `transformers.py` handles bidirectional conversion between OpenAI and watsonx formats
- **Singleton Services**: Global service instances (`watsonx_service`, `settings`) for shared state
- **Async/Await**: All I/O operations are asynchronous for better performance
- **Middleware**: Custom authentication middleware for optional API key validation

## Building and Running

### Prerequisites

```bash
# Python 3.9 or higher required
python --version

# IBM Cloud credentials needed:
# - IBM_CLOUD_API_KEY
# - WATSONX_PROJECT_ID
```

### Installation

```bash
# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your IBM Cloud credentials
```

### Running the Server

```bash
# Development (with auto-reload)
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Production (with workers)
uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4

# Using the Python module
python -m app.main
```

### Docker Deployment

```bash
# Build image
docker build -t watsonx-openai-proxy .

# Run container
docker run -p 8000:8000 --env-file .env watsonx-openai-proxy

# Using docker-compose
docker-compose up
```

### Testing

```bash
# Install test dependencies
pip install pytest pytest-asyncio httpx

# Run tests
pytest tests/

# Run with coverage
pytest tests/ --cov=app
```

## Development Conventions

### Code Style

- **Async First**: Use `async`/`await` for all I/O operations (HTTP requests, file operations)
- **Type Hints**: All functions should have type annotations for parameters and return values
- **Docstrings**: Use Google-style docstrings for functions and classes
- **Logging**: Use the `logging` module with appropriate log levels (info, warning, error)

### Error Handling

- Catch exceptions at the router level and return OpenAI-compatible error responses
- Use `HTTPException` with proper status codes and error details
- Log errors with full context using `logger.error(..., exc_info=True)`
- Return structured error responses matching OpenAI's error format
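
The error envelope these bullets describe can be sketched as a small helper. The function name `openai_error` is illustrative; the repository's actual handlers may be structured differently:

```python
from typing import Optional


def openai_error(message: str,
                 err_type: str = "invalid_request_error",
                 code: Optional[str] = None) -> dict:
    """Build a response body matching OpenAI's error envelope.

    Hypothetical helper for illustration; not code from this repository.
    """
    return {"error": {"message": message, "type": err_type, "code": code}}
```

In a router this would typically be returned as `JSONResponse(status_code=..., content=openai_error(...))`.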

### Configuration Management

- All configuration via environment variables (`.env` file)
- Use `pydantic-settings` for type-safe configuration
- Model mapping via `MODEL_MAP_*` environment variables
- Settings accessed through the global `settings` instance

### Token Management

- Bearer tokens automatically refreshed every 50 minutes (expire at 60 minutes)
- Token refresh on 401 errors from watsonx.ai
- Thread-safe token refresh using `asyncio.Lock`
- Initial token obtained during application startup
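
A minimal sketch of the lock-guarded refresh pattern described above. `TokenManager` and `_fetch_token` are illustrative stand-ins, not this repository's actual class; the real IAM call is replaced with a placeholder:

```python
import asyncio
import time


class TokenManager:
    """Illustrative sketch of lock-guarded bearer-token refresh.

    The asyncio.Lock ensures only one coroutine refreshes the token at a
    time; concurrent callers wait and then reuse the cached token.
    """

    REFRESH_INTERVAL = 50 * 60  # refresh every 50 minutes; tokens expire at 60

    def __init__(self):
        self._token = None
        self._obtained_at = 0.0
        self._lock = asyncio.Lock()

    async def _fetch_token(self) -> str:
        # Placeholder for the real IAM exchange (POST the IBM Cloud API key
        # to the IAM token endpoint and read the bearer token back).
        return "example-bearer-token"

    async def get_token(self) -> str:
        async with self._lock:
            stale = time.monotonic() - self._obtained_at > self.REFRESH_INTERVAL
            if self._token is None or stale:
                self._token = await self._fetch_token()
                self._obtained_at = time.monotonic()
            return self._token
```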

### API Compatibility

- Maintain strict OpenAI API compatibility in request/response formats
- Use Pydantic models from `openai_models.py` for validation
- Transform requests/responses using functions in `transformers.py`
- Support both streaming and non-streaming responses

### Adding New Endpoints

1. Create a router in `app/routers/` (e.g., `new_endpoint.py`)
2. Define Pydantic models in `app/models/openai_models.py`
3. Add transformation logic in `app/utils/transformers.py`
4. Add a watsonx.ai API method in `app/services/watsonx_service.py`
5. Register the router in `app/main.py` using `app.include_router()`

### Streaming Responses

- Use `StreamingResponse` with `media_type="text/event-stream"`
- Format chunks as Server-Sent Events using `format_sse_event()`
- Always send a `[DONE]` message at the end of the stream
- Handle errors gracefully and send error events in SSE format
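
The SSE wire format can be sketched as follows. This is a simplified stand-in for the repository's `format_sse_event()`, whose actual signature may differ:

```python
import json


def format_sse_event(data: dict) -> str:
    """Serialize one chunk as a Server-Sent Event: a data line plus a blank line."""
    return f"data: {json.dumps(data)}\n\n"


def sse_done() -> str:
    """Terminal sentinel that every OpenAI-compatible stream must end with."""
    return "data: [DONE]\n\n"
```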

### Model Mapping

- Map OpenAI model names to watsonx models via environment variables
- Format: `MODEL_MAP_<OPENAI_MODEL>=<WATSONX_MODEL_ID>`
- Example: `MODEL_MAP_GPT4=ibm/granite-4-h-small`
- Mapping applied in `settings.map_model()` before API calls
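
The naming convention can be sketched like this, mirroring the normalization done in `app/config.py` (the standalone function name is illustrative):

```python
import re


def env_var_to_model_name(env_key: str) -> str:
    """Derive the OpenAI-style model name from a MODEL_MAP_* variable name."""
    name = env_key.replace("MODEL_MAP_", "")
    # Insert a hyphen between an uppercase run and trailing digits: GPT4 -> GPT-4
    name = re.sub(r"([A-Z]+)(\d+)", r"\1-\2", name)
    return name.lower().replace("_", "-")
```

So `MODEL_MAP_GPT4=...` is looked up when a client requests model `gpt-4`.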

### Security Considerations

- Optional API key authentication via the `API_KEY` environment variable
- Middleware validates the Bearer token in the Authorization header
- The IBM Cloud API key is stored in environment variables, never in code
- CORS configured via `ALLOWED_ORIGINS` (default: `*`)

### Logging Best Practices

- Use structured logging with context (model names, request IDs)
- Log level controlled by the `LOG_LEVEL` environment variable
- Log token refresh events at INFO level
- Log API errors at ERROR level with full traceback
- Include request/response details for debugging

### Dependencies

- Keep `requirements.txt` minimal and pinned to specific versions
- FastAPI and Pydantic are core dependencies - avoid breaking changes
- httpx for async HTTP - prefer it over requests/aiohttp
- Use `uvicorn[standard]` for a production-ready server

## Important Implementation Notes

### watsonx.ai API Specifics

- Base URL format: `https://{cluster}.ml.cloud.ibm.com/ml/v1`
- API version parameter: `version=2024-02-13` (required on all requests)
- Chat endpoint: `/text/chat` (non-streaming) or `/text/chat_stream` (streaming)
- Text generation: `/text/generation`
- Embeddings: `/text/embeddings`

### Request/Response Transformation

- OpenAI messages → watsonx messages: direct mapping with role/content
- watsonx responses → OpenAI format: extract choices, usage, and metadata
- Streaming chunks: parse SSE format, transform delta objects
- Generate unique IDs: `chatcmpl-{uuid}` for chat, `cmpl-{uuid}` for completions
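
The response-side transformation can be sketched as below. Field names follow the OpenAI chat completion schema; the repository's `transformers.py` is the authoritative version:

```python
import time
import uuid


def to_openai_chat_response(model: str, text: str,
                            finish_reason: str = "stop") -> dict:
    """Wrap generated text in an OpenAI-compatible chat completion envelope.

    Illustrative sketch; usage accounting and extra metadata are omitted.
    """
    return {
        "id": f"chatcmpl-{uuid.uuid4()}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": finish_reason,
        }],
    }
```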

### Common Pitfalls

- Don't forget to refresh tokens before they expire (50-minute interval)
- Always close the httpx client on shutdown (`await watsonx_service.close()`)
- Handle both string and list formats for the `stop` parameter
- Validate that model IDs exist in watsonx.ai before making requests
- Set appropriate timeouts for long-running generation requests (300s default)
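
For example, normalizing the `stop` parameter might look like this hedged sketch (not the repository's exact code):

```python
from typing import List, Optional, Union


def normalize_stop(stop: Optional[Union[str, List[str]]]) -> List[str]:
    """Accept None, a single string, or a list of strings; always return a list.

    OpenAI clients may send `stop` in any of these shapes, so normalize
    before forwarding to watsonx.ai.
    """
    if stop is None:
        return []
    if isinstance(stop, str):
        return [stop]
    return list(stop)
```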

### Performance Optimization

- Reuse the httpx client instance (don't create one per request)
- Use connection pooling (httpx's default behavior)
- Consider worker processes for production (`--workers 4`)
- Monitor token refresh to avoid rate limiting

## Environment Variables Reference

### Required

- `IBM_CLOUD_API_KEY`: IBM Cloud API key for authentication
- `WATSONX_PROJECT_ID`: watsonx.ai project ID

### Optional

- `WATSONX_CLUSTER`: Region (default: `us-south`)
- `HOST`: Server host (default: `0.0.0.0`)
- `PORT`: Server port (default: `8000`)
- `LOG_LEVEL`: Logging level (default: `info`)
- `API_KEY`: Optional proxy authentication key
- `ALLOWED_ORIGINS`: CORS origins (default: `*`)
- `MODEL_MAP_*`: Model name mappings

## API Endpoints

- `GET /` - API information and available endpoints
- `GET /health` - Health check (bypasses authentication)
- `GET /docs` - Interactive Swagger UI documentation
- `POST /v1/chat/completions` - Chat completions (streaming supported)
- `POST /v1/completions` - Text completions (legacy)
- `POST /v1/embeddings` - Generate embeddings
- `GET /v1/models` - List available models
- `GET /v1/models/{model_id}` - Get specific model info

Dockerfile (new file, 20 lines)
@@ -0,0 +1,20 @@
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY app ./app

# Expose port
EXPOSE 8000

# Health check (raise_for_status ensures a non-2xx response fails the check)
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import httpx; httpx.get('http://localhost:8000/health').raise_for_status()"

# Run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

README.md (new file, 353 lines)
@@ -0,0 +1,353 @@
# watsonx-openai-proxy

OpenAI-compatible API proxy for IBM watsonx.ai. This proxy allows you to use watsonx.ai models with any tool or application that supports the OpenAI API format.

## Features

- ✅ **Full OpenAI API Compatibility**: Drop-in replacement for the OpenAI API
- ✅ **Chat Completions**: `/v1/chat/completions` with streaming support
- ✅ **Text Completions**: `/v1/completions` (legacy endpoint)
- ✅ **Embeddings**: `/v1/embeddings` for text embeddings
- ✅ **Model Listing**: `/v1/models` endpoint
- ✅ **Streaming Support**: Server-Sent Events (SSE) for real-time responses
- ✅ **Model Mapping**: Map OpenAI model names to watsonx models
- ✅ **Automatic Token Management**: Handles IBM Cloud authentication automatically
- ✅ **CORS Support**: Configurable cross-origin resource sharing
- ✅ **Optional API Key Authentication**: Secure your proxy with an API key

## Quick Start

### Prerequisites

- Python 3.9 or higher
- IBM Cloud account with watsonx.ai access
- IBM Cloud API key
- watsonx.ai Project ID

### Installation

1. Clone or download this directory:

```bash
cd watsonx-openai-proxy
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Configure environment variables:

```bash
cp .env.example .env
# Edit .env with your credentials
```

4. Run the server:

```bash
python -m app.main
```

Or with uvicorn:

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

The server will start at `http://localhost:8000`.

## Configuration

### Environment Variables

Create a `.env` file with the following variables:

```bash
# Required: IBM Cloud Configuration
IBM_CLOUD_API_KEY=your_ibm_cloud_api_key_here
WATSONX_PROJECT_ID=your_watsonx_project_id_here
WATSONX_CLUSTER=us-south  # Options: us-south, eu-de, eu-gb, jp-tok, au-syd, ca-tor

# Optional: Server Configuration
HOST=0.0.0.0
PORT=8000
LOG_LEVEL=info

# Optional: API Key for Proxy Authentication
API_KEY=your_optional_api_key_for_proxy_authentication

# Optional: CORS Configuration
ALLOWED_ORIGINS=*  # Comma-separated, or * for all

# Optional: Model Mapping
MODEL_MAP_GPT4=ibm/granite-4-h-small
MODEL_MAP_GPT35=ibm/granite-3-8b-instruct
MODEL_MAP_GPT4_TURBO=meta-llama/llama-3-3-70b-instruct
MODEL_MAP_TEXT_EMBEDDING_ADA_002=ibm/slate-125m-english-rtrvr
```

### Model Mapping

You can map OpenAI model names to watsonx models using environment variables:

```bash
MODEL_MAP_<OPENAI_MODEL_NAME>=<WATSONX_MODEL_ID>
```

For example:

- `MODEL_MAP_GPT4=ibm/granite-4-h-small` maps `gpt-4` to `ibm/granite-4-h-small`
- `MODEL_MAP_GPT35_TURBO=ibm/granite-3-8b-instruct` maps `gpt-3.5-turbo` to `ibm/granite-3-8b-instruct`

## Usage

### With the OpenAI Python SDK

```python
from openai import OpenAI

# Point the client at your proxy
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-proxy-api-key"  # Optional, if you set API_KEY in .env
)

# Use as normal
response = client.chat.completions.create(
    model="ibm/granite-3-8b-instruct",  # Or use a mapped name like "gpt-4"
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)

print(response.choices[0].message.content)
```

### With Streaming

```python
stream = client.chat.completions.create(
    model="ibm/granite-3-8b-instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### With cURL

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-proxy-api-key" \
  -d '{
    "model": "ibm/granite-3-8b-instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

### Embeddings

```python
response = client.embeddings.create(
    model="ibm/slate-125m-english-rtrvr",
    input="Your text to embed"
)

print(response.data[0].embedding)
```

## Available Endpoints

- `GET /` - API information
- `GET /health` - Health check
- `GET /docs` - Interactive API documentation (Swagger UI)
- `POST /v1/chat/completions` - Chat completions
- `POST /v1/completions` - Text completions (legacy)
- `POST /v1/embeddings` - Generate embeddings
- `GET /v1/models` - List available models
- `GET /v1/models/{model_id}` - Get model information

## Supported Models

The proxy supports all watsonx.ai models available in your project, including:

### Chat Models

- IBM Granite models (3.x, 4.x series)
- Meta Llama models (3.x, 4.x series)
- Mistral models
- Other models available on watsonx.ai

### Embedding Models

- `ibm/slate-125m-english-rtrvr`
- `ibm/slate-30m-english-rtrvr`

See the `/v1/models` endpoint for the complete list.

## Authentication

### Proxy Authentication (Optional)

If you set `API_KEY` in your `.env` file, clients must provide it:

```python
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-proxy-api-key"
)
```

### IBM Cloud Authentication

The proxy handles IBM Cloud authentication automatically using your `IBM_CLOUD_API_KEY`. Bearer tokens are:

- Automatically obtained on startup
- Refreshed every 50 minutes (tokens expire after 60 minutes)
- Refreshed on 401 errors

## Deployment

### Docker (Recommended)

Create a `Dockerfile` (credentials are passed at run time via `--env-file`, so the `.env` file is not baked into the image):

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app ./app

EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Build and run:

```bash
docker build -t watsonx-openai-proxy .
docker run -p 8000:8000 --env-file .env watsonx-openai-proxy
```

### Production Deployment

For production, consider:

1. **Use a production ASGI server**: The included uvicorn is suitable, but configure workers:
   ```bash
   uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4
   ```

2. **Set up HTTPS**: Use a reverse proxy like nginx or Caddy

3. **Configure CORS**: Set `ALLOWED_ORIGINS` to specific domains

4. **Enable API key authentication**: Set `API_KEY` in the environment

5. **Monitor logs**: Set `LOG_LEVEL=info` or `warning` in production

6. **Use environment secrets**: Don't commit the `.env` file; use secret management

## Troubleshooting

### 401 Unauthorized

- Check that `IBM_CLOUD_API_KEY` is valid
- Verify your IBM Cloud account has watsonx.ai access
- Check server logs for token refresh errors

### Model Not Found

- Verify the model ID exists in watsonx.ai
- Check that your project has access to the model
- Use the `/v1/models` endpoint to see available models

### Connection Errors

- Verify `WATSONX_CLUSTER` matches your project's region
- Check firewall/network settings
- Ensure watsonx.ai services are accessible

### Streaming Issues

- Some models may not support streaming
- Check that your client library supports SSE (Server-Sent Events)
- Verify the network doesn't buffer streaming responses

## Development

### Running Tests

```bash
# Install dev dependencies
pip install pytest pytest-asyncio httpx

# Run tests
pytest tests/
```

### Code Structure

```
watsonx-openai-proxy/
├── app/
│   ├── main.py                 # FastAPI application
│   ├── config.py               # Configuration management
│   ├── routers/                # API endpoint routers
│   │   ├── chat.py             # Chat completions
│   │   ├── completions.py      # Text completions
│   │   ├── embeddings.py       # Embeddings
│   │   └── models.py           # Model listing
│   ├── services/               # Business logic
│   │   └── watsonx_service.py  # watsonx.ai API client
│   ├── models/                 # Pydantic models
│   │   └── openai_models.py    # OpenAI-compatible schemas
│   └── utils/                  # Utilities
│       └── transformers.py     # Request/response transformers
├── tests/                      # Test files
├── requirements.txt            # Python dependencies
├── .env.example                # Environment template
└── README.md                   # This file
```

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## License

Apache 2.0 License - See the LICENSE file for details.

## Related Projects

- [watsonx-unofficial-aisdk-provider](../wxai-provider/) - Vercel AI SDK provider for watsonx.ai
- [OpenCode watsonx plugin](../.opencode/plugins/) - Token management plugin for OpenCode

## Disclaimer

This is **not an official IBM product**. It's a community-maintained proxy for integrating watsonx.ai with OpenAI-compatible tools. watsonx.ai is a trademark of IBM.

## Support

For issues and questions:

- Check the [Troubleshooting](#troubleshooting) section
- Review server logs (`LOG_LEVEL=debug` for detailed logs)
- Open an issue in the repository
- Consult the [IBM watsonx.ai documentation](https://www.ibm.com/docs/en/watsonx-as-a-service)

app/__init__.py (new file, 3 lines)
@@ -0,0 +1,3 @@
"""watsonx-openai-proxy application package."""

__version__ = "1.0.0"

app/config.py (new file, 91 lines)
@@ -0,0 +1,91 @@
"""Configuration management for watsonx-openai-proxy."""

import os
from typing import Dict, Optional
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Application settings loaded from environment variables."""

    # IBM Cloud Configuration
    ibm_cloud_api_key: str
    watsonx_project_id: str
    watsonx_cluster: str = "us-south"

    # Server Configuration
    host: str = "0.0.0.0"
    port: int = 8000
    log_level: str = "info"

    # API Configuration
    api_key: Optional[str] = None
    allowed_origins: str = "*"

    # Token Management
    token_refresh_interval: int = 3000  # 50 minutes in seconds

    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=False,
        extra="allow",  # Allow extra fields for model mapping
    )

    @property
    def watsonx_base_url(self) -> str:
        """Construct the watsonx.ai base URL from the cluster."""
        return f"https://{self.watsonx_cluster}.ml.cloud.ibm.com/ml/v1"

    @property
    def cors_origins(self) -> list[str]:
        """Parse CORS origins from a comma-separated string."""
        if self.allowed_origins == "*":
            return ["*"]
        return [origin.strip() for origin in self.allowed_origins.split(",")]

    def get_model_mapping(self) -> Dict[str, str]:
        """Extract model mappings from environment variables.

        Looks for variables like MODEL_MAP_GPT4=ibm/granite-4-h-small
        and builds a mapping dict.
        """
        import re

        mapping = {}

        # Check os.environ first
        for key, value in os.environ.items():
            if key.startswith("MODEL_MAP_"):
                model_name = key.replace("MODEL_MAP_", "")
                model_name = re.sub(r'([A-Z]+)(\d+)', r'\1-\2', model_name)
                openai_model = model_name.lower().replace("_", "-")
                mapping[openai_model] = value

        # Also check pydantic's extra fields (from the .env file).
        # These come in lowercase: model_map_gpt4 instead of MODEL_MAP_GPT4.
        extra = getattr(self, '__pydantic_extra__', {}) or {}
        for key, value in extra.items():
            if key.startswith("model_map_"):
                # Convert back to uppercase for processing
                model_name = key.replace("model_map_", "").upper()
                model_name = re.sub(r'([A-Z]+)(\d+)', r'\1-\2', model_name)
                openai_model = model_name.lower().replace("_", "-")
                mapping[openai_model] = value

        return mapping

    def map_model(self, openai_model: str) -> str:
        """Map an OpenAI model name to a watsonx model ID.

        Args:
            openai_model: OpenAI model name (e.g., "gpt-4", "gpt-3.5-turbo")

        Returns:
            The corresponding watsonx model ID, or the original name if no mapping exists.
        """
        model_map = self.get_model_mapping()
        return model_map.get(openai_model, openai_model)


# Global settings instance
settings = Settings()
157
app/main.py
Normal file
157
app/main.py
Normal file
@@ -0,0 +1,157 @@
|
|||||||
|
"""Main FastAPI application for watsonx-openai-proxy."""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from contextlib import asynccontextmanager
|
||||||
|
from fastapi import FastAPI, Request, status
|
||||||
|
from fastapi.middleware.cors import CORSMiddleware
|
||||||
|
from fastapi.responses import JSONResponse
|
||||||
|
from app.config import settings
|
||||||
|
from app.routers import chat, completions, embeddings, models
|
||||||
|
from app.services.watsonx_service import watsonx_service
|
||||||
|
|
||||||
|
# Configure logging
|
||||||
|
logging.basicConfig(
|
||||||
|
level=getattr(logging, settings.log_level.upper()),
|
||||||
|
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
|
||||||
|
)
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
@asynccontextmanager
|
||||||
|
async def lifespan(app: FastAPI):
|
||||||
|
"""Manage application lifespan events."""
|
||||||
|
# Startup
|
||||||
|
logger.info("Starting watsonx-openai-proxy...")
|
||||||
|
logger.info(f"Cluster: {settings.watsonx_cluster}")
|
||||||
|
logger.info(f"Project ID: {settings.watsonx_project_id[:8]}...")
|
||||||
|
|
||||||
|
# Initialize token
|
||||||
|
try:
|
||||||
|
await watsonx_service._refresh_token()
|
||||||
|
logger.info("Initial bearer token obtained successfully")
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to obtain initial bearer token: {e}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
yield
|
||||||
|
|
||||||
|
# Shutdown
|
||||||
|
logger.info("Shutting down watsonx-openai-proxy...")
|
||||||
|
await watsonx_service.close()
|
||||||
|
|
||||||
|
|
||||||
|
# Create FastAPI app
|
||||||
|
app = FastAPI(
|
||||||
|
title="watsonx-openai-proxy",
|
||||||
|
description="OpenAI-compatible API proxy for IBM watsonx.ai",
|
||||||
|
version="1.0.0",
|
||||||
|
lifespan=lifespan,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Add CORS middleware
|
||||||
|
app.add_middleware(
|
||||||
|
CORSMiddleware,
|
||||||
|
allow_origins=settings.cors_origins,
|
||||||
|
allow_credentials=True,
|
||||||
|
allow_methods=["*"],
|
||||||
|
allow_headers=["*"],
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# Optional API key authentication middleware
|
||||||
|
@app.middleware("http")
async def authenticate(request: Request, call_next):
    """Authenticate requests if an API key is configured."""
    if settings.api_key:
        # Skip authentication for the health check
        if request.url.path == "/health":
            return await call_next(request)

        # Check the Authorization header
        auth_header = request.headers.get("Authorization")
        if not auth_header:
            return JSONResponse(
                status_code=status.HTTP_401_UNAUTHORIZED,
                content={
                    "error": {
                        "message": "Missing Authorization header",
                        "type": "authentication_error",
                        "code": "missing_authorization",
                    }
                },
            )

        # Validate the API key
        if not auth_header.startswith("Bearer "):
            return JSONResponse(
                status_code=status.HTTP_401_UNAUTHORIZED,
                content={
                    "error": {
                        "message": "Invalid Authorization header format",
                        "type": "authentication_error",
                        "code": "invalid_authorization_format",
                    }
                },
            )

        token = auth_header[7:]  # Remove the "Bearer " prefix
        if token != settings.api_key:
            return JSONResponse(
                status_code=status.HTTP_401_UNAUTHORIZED,
                content={
                    "error": {
                        "message": "Invalid API key",
                        "type": "authentication_error",
                        "code": "invalid_api_key",
                    }
                },
            )

    return await call_next(request)


# Include routers
app.include_router(chat.router, tags=["Chat"])
app.include_router(completions.router, tags=["Completions"])
app.include_router(embeddings.router, tags=["Embeddings"])
app.include_router(models.router, tags=["Models"])


@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "service": "watsonx-openai-proxy",
        "cluster": settings.watsonx_cluster,
    }


@app.get("/")
async def root():
    """Root endpoint with API information."""
    return {
        "service": "watsonx-openai-proxy",
        "description": "OpenAI-compatible API proxy for IBM watsonx.ai",
        "version": "1.0.0",
        "endpoints": {
            "chat": "/v1/chat/completions",
            "completions": "/v1/completions",
            "embeddings": "/v1/embeddings",
            "models": "/v1/models",
            "health": "/health",
        },
        "documentation": "/docs",
    }


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(
        "app.main:app",
        host=settings.host,
        port=settings.port,
        log_level=settings.log_level,
        reload=False,
    )
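The middleware's bearer-token check can be exercised in isolation. A minimal sketch, assuming the extraction logic is pulled into a helper (`extract_bearer_token` is a hypothetical name, not part of this codebase):

```python
from typing import Optional


def extract_bearer_token(auth_header: Optional[str]) -> Optional[str]:
    """Return the token from an "Authorization: Bearer <token>" header, or None.

    Mirrors the middleware above: a missing header or a non-Bearer scheme
    is rejected, otherwise the "Bearer " prefix (7 characters) is stripped.
    """
    if not auth_header or not auth_header.startswith("Bearer "):
        return None
    return auth_header[7:]
```

A request is accepted only when the extracted token compares equal to the configured `settings.api_key`.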
31
app/models/__init__.py
Normal file
@@ -0,0 +1,31 @@
"""OpenAI-compatible data models."""

from app.models.openai_models import (
    ChatMessage,
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatCompletionChunk,
    CompletionRequest,
    CompletionResponse,
    EmbeddingRequest,
    EmbeddingResponse,
    ModelsResponse,
    ModelInfo,
    ErrorResponse,
    ErrorDetail,
)

__all__ = [
    "ChatMessage",
    "ChatCompletionRequest",
    "ChatCompletionResponse",
    "ChatCompletionChunk",
    "CompletionRequest",
    "CompletionResponse",
    "EmbeddingRequest",
    "EmbeddingResponse",
    "ModelsResponse",
    "ModelInfo",
    "ErrorResponse",
    "ErrorDetail",
]
213
app/models/openai_models.py
Normal file
@@ -0,0 +1,213 @@
"""OpenAI-compatible request and response models."""

from typing import Any, Dict, List, Literal, Optional, Union
from pydantic import BaseModel, Field


# ============================================================================
# Chat Completions Models
# ============================================================================

class ChatMessage(BaseModel):
    """A chat message in the conversation."""
    role: Literal["system", "user", "assistant", "function", "tool"]
    content: Optional[str] = None
    name: Optional[str] = None
    function_call: Optional[Dict[str, Any]] = None
    tool_calls: Optional[List[Dict[str, Any]]] = None


class FunctionCall(BaseModel):
    """Function call specification."""
    name: str
    arguments: str


class ToolCall(BaseModel):
    """Tool call specification."""
    id: str
    type: Literal["function"]
    function: FunctionCall


class ChatCompletionRequest(BaseModel):
    """OpenAI chat completion request."""
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = Field(default=1.0, ge=0, le=2)
    top_p: Optional[float] = Field(default=1.0, ge=0, le=1)
    n: Optional[int] = Field(default=1, ge=1)
    stream: Optional[bool] = False
    stop: Optional[Union[str, List[str]]] = None
    max_tokens: Optional[int] = Field(default=None, ge=1)
    presence_penalty: Optional[float] = Field(default=0, ge=-2, le=2)
    frequency_penalty: Optional[float] = Field(default=0, ge=-2, le=2)
    logit_bias: Optional[Dict[str, float]] = None
    user: Optional[str] = None
    functions: Optional[List[Dict[str, Any]]] = None
    function_call: Optional[Union[str, Dict[str, str]]] = None
    tools: Optional[List[Dict[str, Any]]] = None
    tool_choice: Optional[Union[str, Dict[str, Any]]] = None


class ChatCompletionChoice(BaseModel):
    """A single chat completion choice."""
    index: int
    message: ChatMessage
    finish_reason: Optional[str] = None
    logprobs: Optional[Dict[str, Any]] = None


class ChatCompletionUsage(BaseModel):
    """Token usage information."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


class ChatCompletionResponse(BaseModel):
    """OpenAI chat completion response."""
    id: str
    object: Literal["chat.completion"] = "chat.completion"
    created: int
    model: str
    choices: List[ChatCompletionChoice]
    usage: ChatCompletionUsage
    system_fingerprint: Optional[str] = None


class ChatCompletionChunkDelta(BaseModel):
    """Delta content in a streaming response."""
    role: Optional[str] = None
    content: Optional[str] = None
    function_call: Optional[Dict[str, Any]] = None
    tool_calls: Optional[List[Dict[str, Any]]] = None


class ChatCompletionChunkChoice(BaseModel):
    """A single streaming chunk choice."""
    index: int
    delta: ChatCompletionChunkDelta
    finish_reason: Optional[str] = None
    logprobs: Optional[Dict[str, Any]] = None


class ChatCompletionChunk(BaseModel):
    """OpenAI streaming chat completion chunk."""
    id: str
    object: Literal["chat.completion.chunk"] = "chat.completion.chunk"
    created: int
    model: str
    choices: List[ChatCompletionChunkChoice]
    system_fingerprint: Optional[str] = None


# ============================================================================
# Completions Models (Legacy)
# ============================================================================

class CompletionRequest(BaseModel):
    """OpenAI completion request (legacy)."""
    model: str
    prompt: Union[str, List[str], List[int], List[List[int]]]
    suffix: Optional[str] = None
    max_tokens: Optional[int] = Field(default=16, ge=1)
    temperature: Optional[float] = Field(default=1.0, ge=0, le=2)
    top_p: Optional[float] = Field(default=1.0, ge=0, le=1)
    n: Optional[int] = Field(default=1, ge=1)
    stream: Optional[bool] = False
    logprobs: Optional[int] = Field(default=None, ge=0, le=5)
    echo: Optional[bool] = False
    stop: Optional[Union[str, List[str]]] = None
    presence_penalty: Optional[float] = Field(default=0, ge=-2, le=2)
    frequency_penalty: Optional[float] = Field(default=0, ge=-2, le=2)
    best_of: Optional[int] = Field(default=1, ge=1)
    logit_bias: Optional[Dict[str, float]] = None
    user: Optional[str] = None


class CompletionChoice(BaseModel):
    """A single completion choice."""
    text: str
    index: int
    logprobs: Optional[Dict[str, Any]] = None
    finish_reason: Optional[str] = None


class CompletionResponse(BaseModel):
    """OpenAI completion response."""
    id: str
    object: Literal["text_completion"] = "text_completion"
    created: int
    model: str
    choices: List[CompletionChoice]
    usage: ChatCompletionUsage


# ============================================================================
# Embeddings Models
# ============================================================================

class EmbeddingRequest(BaseModel):
    """OpenAI embedding request."""
    model: str
    input: Union[str, List[str], List[int], List[List[int]]]
    encoding_format: Optional[Literal["float", "base64"]] = "float"
    dimensions: Optional[int] = None
    user: Optional[str] = None


class EmbeddingData(BaseModel):
    """A single embedding result."""
    object: Literal["embedding"] = "embedding"
    embedding: List[float]
    index: int


class EmbeddingUsage(BaseModel):
    """Token usage for embeddings."""
    prompt_tokens: int
    total_tokens: int


class EmbeddingResponse(BaseModel):
    """OpenAI embedding response."""
    object: Literal["list"] = "list"
    data: List[EmbeddingData]
    model: str
    usage: EmbeddingUsage


# ============================================================================
# Models List
# ============================================================================

class ModelInfo(BaseModel):
    """Information about a single model."""
    id: str
    object: Literal["model"] = "model"
    created: int
    owned_by: str


class ModelsResponse(BaseModel):
    """List of available models."""
    object: Literal["list"] = "list"
    data: List[ModelInfo]


# ============================================================================
# Error Models
# ============================================================================

class ErrorDetail(BaseModel):
    """Error detail information."""
    message: str
    type: str
    param: Optional[str] = None
    code: Optional[str] = None


class ErrorResponse(BaseModel):
    """OpenAI error response."""
    error: ErrorDetail
5
app/routers/__init__.py
Normal file
@@ -0,0 +1,5 @@
"""API routers for watsonx-openai-proxy."""

from app.routers import chat, completions, embeddings, models

__all__ = ["chat", "completions", "embeddings", "models"]
156
app/routers/chat.py
Normal file
@@ -0,0 +1,156 @@
"""Chat completions endpoint router."""

import json
import uuid
from typing import Optional, Union
from fastapi import APIRouter, HTTPException, Request
from fastapi.responses import StreamingResponse
from app.models.openai_models import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ErrorResponse,
    ErrorDetail,
)
from app.services.watsonx_service import watsonx_service
from app.utils.transformers import (
    transform_messages_to_watsonx,
    transform_tools_to_watsonx,
    transform_watsonx_to_openai_chat,
    transform_watsonx_to_openai_chat_chunk,
    format_sse_event,
)
from app.config import settings
import logging

logger = logging.getLogger(__name__)

router = APIRouter()


@router.post(
    "/v1/chat/completions",
    response_model=Union[ChatCompletionResponse, ErrorResponse],
    responses={
        200: {"model": ChatCompletionResponse},
        400: {"model": ErrorResponse},
        401: {"model": ErrorResponse},
        500: {"model": ErrorResponse},
    },
)
async def create_chat_completion(
    request: ChatCompletionRequest,
    http_request: Request,
):
    """Create a chat completion using the OpenAI-compatible API.

    This endpoint accepts OpenAI-formatted requests and translates them
    to watsonx.ai API calls.
    """
    try:
        # Map the model name if needed
        watsonx_model = settings.map_model(request.model)
        logger.info(f"Chat completion request: {request.model} -> {watsonx_model}")

        # Transform messages
        watsonx_messages = transform_messages_to_watsonx(request.messages)

        # Transform tools if present
        watsonx_tools = transform_tools_to_watsonx(request.tools)

        # Handle streaming
        if request.stream:
            return StreamingResponse(
                stream_chat_completion(
                    watsonx_model,
                    watsonx_messages,
                    request,
                    watsonx_tools,
                ),
                media_type="text/event-stream",
            )

        # Non-streaming response
        watsonx_response = await watsonx_service.chat_completion(
            model_id=watsonx_model,
            messages=watsonx_messages,
            temperature=request.temperature or 1.0,
            max_tokens=request.max_tokens,
            top_p=request.top_p or 1.0,
            stop=request.stop if isinstance(request.stop, list) else [request.stop] if request.stop else None,
            tools=watsonx_tools,
        )

        # Transform the response
        openai_response = transform_watsonx_to_openai_chat(
            watsonx_response,
            request.model,
        )

        return openai_response

    except Exception as e:
        logger.error(f"Error in chat completion: {str(e)}", exc_info=True)
        raise HTTPException(
            status_code=500,
            detail={
                "error": {
                    "message": str(e),
                    "type": "internal_error",
                    "code": "internal_error",
                }
            },
        )


async def stream_chat_completion(
    watsonx_model: str,
    watsonx_messages: list,
    request: ChatCompletionRequest,
    watsonx_tools: Optional[list] = None,
):
    """Stream chat completion responses.

    Args:
        watsonx_model: The watsonx model ID
        watsonx_messages: Transformed messages
        request: Original OpenAI request
        watsonx_tools: Transformed tools

    Yields:
        Server-Sent Events with chat completion chunks
    """
    request_id = f"chatcmpl-{uuid.uuid4().hex[:24]}"

    try:
        async for chunk in watsonx_service.chat_completion_stream(
            model_id=watsonx_model,
            messages=watsonx_messages,
            temperature=request.temperature or 1.0,
            max_tokens=request.max_tokens,
            top_p=request.top_p or 1.0,
            stop=request.stop if isinstance(request.stop, list) else [request.stop] if request.stop else None,
            tools=watsonx_tools,
        ):
            # Transform the chunk to OpenAI format
            openai_chunk = transform_watsonx_to_openai_chat_chunk(
                chunk,
                request.model,
                request_id,
            )

            # Send as SSE
            yield format_sse_event(openai_chunk.model_dump_json())

        # Send the [DONE] message
        yield format_sse_event("[DONE]")

    except Exception as e:
        logger.error(f"Error in streaming chat completion: {str(e)}", exc_info=True)
        error_response = ErrorResponse(
            error=ErrorDetail(
                message=str(e),
                type="internal_error",
                code="stream_error",
            )
        )
        yield format_sse_event(error_response.model_dump_json())
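`format_sse_event` is imported from `app/utils/transformers.py`, which is not part of this hunk. Judging from how it is called above (with a JSON payload and with the literal `"[DONE]"`), it presumably applies Server-Sent Events framing; a hedged sketch of what such a helper typically looks like:

```python
def format_sse_event(data: str) -> str:
    """Wrap a payload in SSE wire framing: "data: <payload>\n\n".

    This is a sketch of the assumed behavior, not the actual
    implementation from app/utils/transformers.py.
    """
    return f"data: {data}\n\n"
```

The trailing blank line (the second `\n`) is what terminates each SSE event, and `data: [DONE]` is the OpenAI convention for ending a stream.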
109
app/routers/completions.py
Normal file
@@ -0,0 +1,109 @@
"""Text completions endpoint router (legacy)."""

import uuid
from typing import Union
from fastapi import APIRouter, HTTPException, Request
from app.models.openai_models import (
    CompletionRequest,
    CompletionResponse,
    ErrorResponse,
    ErrorDetail,
)
from app.services.watsonx_service import watsonx_service
from app.utils.transformers import transform_watsonx_to_openai_completion
from app.config import settings
import logging

logger = logging.getLogger(__name__)

router = APIRouter()


@router.post(
    "/v1/completions",
    response_model=Union[CompletionResponse, ErrorResponse],
    responses={
        200: {"model": CompletionResponse},
        400: {"model": ErrorResponse},
        401: {"model": ErrorResponse},
        500: {"model": ErrorResponse},
    },
)
async def create_completion(
    request: CompletionRequest,
    http_request: Request,
):
    """Create a text completion using the OpenAI-compatible API (legacy).

    This endpoint accepts OpenAI-formatted completion requests and translates
    them to watsonx.ai text generation API calls.
    """
    try:
        # Map the model name if needed
        watsonx_model = settings.map_model(request.model)
        logger.info(f"Completion request: {request.model} -> {watsonx_model}")

        # Handle the prompt (can be a string or a list)
        if isinstance(request.prompt, list):
            if len(request.prompt) == 0:
                raise HTTPException(
                    status_code=400,
                    detail={
                        "error": {
                            "message": "Prompt cannot be empty",
                            "type": "invalid_request_error",
                            "code": "invalid_prompt",
                        }
                    },
                )
            # For now, just use the first prompt
            # TODO: Handle multiple prompts with the n parameter
            prompt = request.prompt[0] if isinstance(request.prompt[0], str) else ""
        else:
            prompt = request.prompt

        # Note: Streaming is not implemented for completions yet
        if request.stream:
            raise HTTPException(
                status_code=400,
                detail={
                    "error": {
                        "message": "Streaming not supported for completions endpoint",
                        "type": "invalid_request_error",
                        "code": "streaming_not_supported",
                    }
                },
            )

        # Call watsonx text generation
        watsonx_response = await watsonx_service.text_generation(
            model_id=watsonx_model,
            prompt=prompt,
            temperature=request.temperature or 1.0,
            max_tokens=request.max_tokens,
            top_p=request.top_p or 1.0,
            stop=request.stop if isinstance(request.stop, list) else [request.stop] if request.stop else None,
        )

        # Transform the response
        openai_response = transform_watsonx_to_openai_completion(
            watsonx_response,
            request.model,
        )

        return openai_response

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error in completion: {str(e)}", exc_info=True)
        raise HTTPException(
            status_code=500,
            detail={
                "error": {
                    "message": str(e),
                    "type": "internal_error",
                    "code": "internal_error",
                }
            },
        )
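The chat, completions, and streaming code paths all repeat the same inline expression to normalize the OpenAI `stop` parameter (`str | list | None`) into the list form watsonx expects. Factored out as a helper (`normalize_stop` is a hypothetical name, offered as a possible refactor, not something in this diff):

```python
from typing import List, Optional, Union


def normalize_stop(stop: Optional[Union[str, List[str]]]) -> Optional[List[str]]:
    """Normalize OpenAI's stop parameter to a list of stop sequences or None.

    Equivalent to the inline expression used by the routers:
    stop if isinstance(stop, list) else [stop] if stop else None
    """
    if not stop:
        return None
    return stop if isinstance(stop, list) else [stop]
```

Each router could then pass `stop=normalize_stop(request.stop)` instead of repeating the conditional.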
114
app/routers/embeddings.py
Normal file
@@ -0,0 +1,114 @@
"""Embeddings endpoint router."""

from typing import Union
from fastapi import APIRouter, HTTPException, Request
from app.models.openai_models import (
    EmbeddingRequest,
    EmbeddingResponse,
    ErrorResponse,
    ErrorDetail,
)
from app.services.watsonx_service import watsonx_service
from app.utils.transformers import transform_watsonx_to_openai_embeddings
from app.config import settings
import logging

logger = logging.getLogger(__name__)

router = APIRouter()


@router.post(
    "/v1/embeddings",
    response_model=Union[EmbeddingResponse, ErrorResponse],
    responses={
        200: {"model": EmbeddingResponse},
        400: {"model": ErrorResponse},
        401: {"model": ErrorResponse},
        500: {"model": ErrorResponse},
    },
)
async def create_embeddings(
    request: EmbeddingRequest,
    http_request: Request,
):
    """Create embeddings using the OpenAI-compatible API.

    This endpoint accepts OpenAI-formatted embedding requests and translates
    them to watsonx.ai embeddings API calls.
    """
    try:
        # Map the model name if needed
        watsonx_model = settings.map_model(request.model)
        logger.info(f"Embeddings request: {request.model} -> {watsonx_model}")

        # Handle the input (can be a string or a list)
        if isinstance(request.input, str):
            inputs = [request.input]
        elif isinstance(request.input, list):
            if len(request.input) == 0:
                raise HTTPException(
                    status_code=400,
                    detail={
                        "error": {
                            "message": "Input cannot be empty",
                            "type": "invalid_request_error",
                            "code": "invalid_input",
                        }
                    },
                )
            # Handle a list of strings or a list of token IDs
            if isinstance(request.input[0], str):
                inputs = request.input
            else:
                # Token IDs are not supported yet
                raise HTTPException(
                    status_code=400,
                    detail={
                        "error": {
                            "message": "Token ID input not supported",
                            "type": "invalid_request_error",
                            "code": "unsupported_input_type",
                        }
                    },
                )
        else:
            raise HTTPException(
                status_code=400,
                detail={
                    "error": {
                        "message": "Invalid input type",
                        "type": "invalid_request_error",
                        "code": "invalid_input_type",
                    }
                },
            )

        # Call watsonx embeddings
        watsonx_response = await watsonx_service.embeddings(
            model_id=watsonx_model,
            inputs=inputs,
        )

        # Transform the response
        openai_response = transform_watsonx_to_openai_embeddings(
            watsonx_response,
            request.model,
        )

        return openai_response

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error in embeddings: {str(e)}", exc_info=True)
        raise HTTPException(
            status_code=500,
            detail={
                "error": {
                    "message": str(e),
                    "type": "internal_error",
                    "code": "internal_error",
                }
            },
        )
120
app/routers/models.py
Normal file
@@ -0,0 +1,120 @@
"""Models endpoint router."""

import time
from fastapi import APIRouter
from app.models.openai_models import ModelsResponse, ModelInfo
from app.config import settings
import logging

logger = logging.getLogger(__name__)

router = APIRouter()


# Predefined list of available models
# This can be extended or made dynamic based on the watsonx.ai API
AVAILABLE_MODELS = [
    # Granite Models
    "ibm/granite-3-1-8b-base",
    "ibm/granite-3-2-8b-instruct",
    "ibm/granite-3-3-8b-instruct",
    "ibm/granite-3-8b-instruct",
    "ibm/granite-4-h-small",
    "ibm/granite-8b-code-instruct",

    # Llama Models
    "meta-llama/llama-3-1-70b-gptq",
    "meta-llama/llama-3-1-8b",
    "meta-llama/llama-3-2-11b-vision-instruct",
    "meta-llama/llama-3-2-90b-vision-instruct",
    "meta-llama/llama-3-3-70b-instruct",
    "meta-llama/llama-3-405b-instruct",
    "meta-llama/llama-4-maverick-17b-128e-instruct-fp8",

    # Mistral Models
    "mistral-large-2512",
    "mistralai/mistral-medium-2505",
    "mistralai/mistral-small-3-1-24b-instruct-2503",

    # Other Models
    "openai/gpt-oss-120b",

    # Embedding Models
    "ibm/slate-125m-english-rtrvr",
    "ibm/slate-30m-english-rtrvr",
]


@router.get(
    "/v1/models",
    response_model=ModelsResponse,
)
async def list_models():
    """List available models in OpenAI-compatible format.

    Returns a list of models that can be used with the API.
    Includes both the actual watsonx model IDs and any mapped names.
    """
    created_time = int(time.time())
    models = []

    # Add all available watsonx models
    for model_id in AVAILABLE_MODELS:
        models.append(
            ModelInfo(
                id=model_id,
                created=created_time,
                owned_by="ibm-watsonx",
            )
        )

    # Add mapped model names (e.g., gpt-4 -> ibm/granite-4-h-small)
    model_mapping = settings.get_model_mapping()
    for openai_name, watsonx_id in model_mapping.items():
        if watsonx_id in AVAILABLE_MODELS:
            models.append(
                ModelInfo(
                    id=openai_name,
                    created=created_time,
                    owned_by="ibm-watsonx",
                )
            )

    return ModelsResponse(data=models)


@router.get(
    "/v1/models/{model_id}",
    response_model=ModelInfo,
)
async def retrieve_model(model_id: str):
    """Retrieve information about a specific model.

    Args:
        model_id: The model ID to retrieve

    Returns:
        Model information
    """
    # Map the model if needed
    watsonx_model = settings.map_model(model_id)

    # Check whether the model exists
    if watsonx_model not in AVAILABLE_MODELS:
        from fastapi import HTTPException
        raise HTTPException(
            status_code=404,
            detail={
                "error": {
                    "message": f"Model '{model_id}' not found",
                    "type": "invalid_request_error",
                    "code": "model_not_found",
                }
            },
        )

    return ModelInfo(
        id=model_id,
        created=int(time.time()),
        owned_by="ibm-watsonx",
    )
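`settings.map_model` and `settings.get_model_mapping` live in `app/config.py`, which is not part of this hunk. Based on how the routers use them (and the `gpt-4 -> ibm/granite-4-h-small` comment above), a plausible minimal sketch, where the mapping table itself is purely illustrative:

```python
# Hypothetical sketch of the config-side model mapping; the real table
# in app/config.py is configured via environment variables.
MODEL_MAPPING = {
    "gpt-4": "ibm/granite-4-h-small",  # example entry from the comment above
}


def map_model(name: str) -> str:
    """Return the watsonx model ID for an OpenAI-style name.

    Unmapped names pass through unchanged, so callers can supply
    watsonx IDs directly.
    """
    return MODEL_MAPPING.get(name, name)
```

This pass-through behavior is what lets `retrieve_model` accept either an OpenAI alias or a raw watsonx model ID.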
5
app/services/__init__.py
Normal file
@@ -0,0 +1,5 @@
"""Services for watsonx-openai-proxy."""

from app.services.watsonx_service import watsonx_service, WatsonxService

__all__ = ["watsonx_service", "WatsonxService"]
316
app/services/watsonx_service.py
Normal file
@@ -0,0 +1,316 @@
"""Service for interacting with IBM watsonx.ai APIs."""

import asyncio
import time
from typing import AsyncIterator, Dict, List, Optional
import httpx
from app.config import settings
import logging

logger = logging.getLogger(__name__)


class WatsonxService:
    """Service for managing watsonx.ai API interactions."""

    def __init__(self):
        self.base_url = settings.watsonx_base_url
        self.project_id = settings.watsonx_project_id
        self.api_key = settings.ibm_cloud_api_key
        self._bearer_token: Optional[str] = None
        self._token_expiry: Optional[float] = None
        self._token_lock = asyncio.Lock()
        self._client: Optional[httpx.AsyncClient] = None

    async def _get_client(self) -> httpx.AsyncClient:
        """Get or create the HTTP client."""
        if self._client is None:
            self._client = httpx.AsyncClient(timeout=300.0)
        return self._client

    async def close(self):
        """Close the HTTP client."""
        if self._client:
            await self._client.aclose()
            self._client = None

    async def _refresh_token(self) -> str:
        """Get a fresh bearer token from IBM Cloud IAM."""
        async with self._token_lock:
            # Check whether the cached token is still valid
            if self._bearer_token and self._token_expiry:
                if time.time() < self._token_expiry - 300:  # 5-minute buffer
                    return self._bearer_token

            logger.info("Refreshing IBM Cloud bearer token...")

            client = await self._get_client()
            response = await client.post(
                "https://iam.cloud.ibm.com/identity/token",
                headers={"Content-Type": "application/x-www-form-urlencoded"},
                data=f"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={self.api_key}",
            )

            if response.status_code != 200:
                raise Exception(f"Failed to get bearer token: {response.text}")

            data = response.json()
            self._bearer_token = data["access_token"]
            self._token_expiry = time.time() + data.get("expires_in", 3600)

            logger.info(f"Bearer token refreshed. Expires in {data.get('expires_in', 3600)} seconds")
            return self._bearer_token

    async def _get_headers(self) -> Dict[str, str]:
        """Get headers with a valid bearer token."""
        token = await self._refresh_token()
        return {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        }

    async def chat_completion(
        self,
        model_id: str,
        messages: List[Dict],
        temperature: float = 1.0,
        max_tokens: Optional[int] = None,
        top_p: float = 1.0,
        stop: Optional[List[str]] = None,
        stream: bool = False,
        tools: Optional[List[Dict]] = None,
        **kwargs,
    ) -> Dict:
        """Create a chat completion using watsonx.ai.

        Args:
            model_id: The watsonx model ID
            messages: List of chat messages
            temperature: Sampling temperature
            max_tokens: Maximum tokens to generate
            top_p: Nucleus sampling parameter
            stop: Stop sequences
            stream: Whether to stream the response
            tools: Tool/function definitions
            **kwargs: Additional parameters

        Returns:
            Chat completion response
        """
        headers = await self._get_headers()
        client = await self._get_client()

        # Build the watsonx request
        payload = {
            "model_id": model_id,
            "project_id": self.project_id,
            "messages": messages,
            "parameters": {
                "temperature": temperature,
                "top_p": top_p,
            },
        }

        if max_tokens:
            payload["parameters"]["max_tokens"] = max_tokens

        if stop:
            payload["parameters"]["stop_sequences"] = stop if isinstance(stop, list) else [stop]

        if tools:
            payload["tools"] = tools

        url = f"{self.base_url}/text/chat"
        params = {"version": "2024-02-13"}

        if stream:
            params["stream"] = "true"

        response = await client.post(
            url,
            headers=headers,
            json=payload,
            params=params,
        )

        if response.status_code != 200:
            raise Exception(f"watsonx API error: {response.status_code} - {response.text}")

        return response.json()

    async def chat_completion_stream(
        self,
        model_id: str,
        messages: List[Dict],
        temperature: float = 1.0,
        max_tokens: Optional[int] = None,
        top_p: float = 1.0,
        stop: Optional[List[str]] = None,
        tools: Optional[List[Dict]] = None,
        **kwargs,
    ) -> AsyncIterator[Dict]:
        """Stream a chat completion using watsonx.ai.

        Args:
            model_id: The watsonx model ID
            messages: List of chat messages
|
||||||
|
temperature: Sampling temperature
|
||||||
|
max_tokens: Maximum tokens to generate
|
||||||
|
top_p: Nucleus sampling parameter
|
||||||
|
stop: Stop sequences
|
||||||
|
tools: Tool/function definitions
|
||||||
|
**kwargs: Additional parameters
|
||||||
|
|
||||||
|
Yields:
|
||||||
|
Chat completion chunks
|
||||||
|
"""
|
||||||
|
headers = await self._get_headers()
|
||||||
|
client = await self._get_client()
|
||||||
|
|
||||||
|
# Build watsonx request
|
||||||
|
payload = {
|
||||||
|
"model_id": model_id,
|
||||||
|
"project_id": self.project_id,
|
||||||
|
"messages": messages,
|
||||||
|
"parameters": {
|
||||||
|
"temperature": temperature,
|
||||||
|
"top_p": top_p,
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
if max_tokens:
|
||||||
|
payload["parameters"]["max_tokens"] = max_tokens
|
||||||
|
|
||||||
|
if stop:
|
||||||
|
payload["parameters"]["stop_sequences"] = stop if isinstance(stop, list) else [stop]
|
||||||
|
|
||||||
|
if tools:
|
||||||
|
payload["tools"] = tools
|
||||||
|
|
||||||
|
url = f"{self.base_url}/text/chat_stream"
|
||||||
|
params = {"version": "2024-02-13"}
|
||||||
|
|
||||||
|
async with client.stream(
|
||||||
|
"POST",
|
||||||
|
url,
|
||||||
|
headers=headers,
|
||||||
|
json=payload,
|
||||||
|
params=params,
|
||||||
|
) as response:
|
||||||
|
if response.status_code != 200:
|
||||||
|
text = await response.aread()
|
||||||
|
raise Exception(f"watsonx API error: {response.status_code} - {text.decode()}")
|
||||||
|
|
||||||
|
async for line in response.aiter_lines():
|
||||||
|
if line.startswith("data: "):
|
||||||
|
data = line[6:] # Remove "data: " prefix
|
||||||
|
if data.strip() == "[DONE]":
|
||||||
|
break
|
||||||
|
try:
|
||||||
|
import json
|
||||||
|
yield json.loads(data)
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
continue
|
||||||
|
|
||||||
|
async def text_generation(
|
||||||
|
self,
|
||||||
|
model_id: str,
|
||||||
|
prompt: str,
|
||||||
|
temperature: float = 1.0,
|
||||||
|
max_tokens: Optional[int] = None,
|
||||||
|
top_p: float = 1.0,
|
||||||
|
stop: Optional[List[str]] = None,
|
||||||
|
**kwargs,
|
||||||
|
) -> Dict:
|
||||||
|
"""Generate text completion using watsonx.ai.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: The watsonx model ID
|
||||||
|
prompt: The input prompt
|
||||||
|
temperature: Sampling temperature
|
||||||
|
max_tokens: Maximum tokens to generate
|
||||||
|
top_p: Nucleus sampling parameter
|
||||||
|
stop: Stop sequences
|
||||||
|
**kwargs: Additional parameters
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Text generation response
|
||||||
|
"""
|
||||||
|
headers = await self._get_headers()
|
||||||
|
client = await self._get_client()
|
||||||
|
|
||||||
|
payload = {
|
||||||
|
"model_id": model_id,
|
||||||
|
"project_id": self.project_id,
|
||||||
|
"input": prompt,
|
||||||
|
"parameters": {
|
||||||
|
"temperature": temperature,
|
||||||
|
"top_p": top_p,
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
if max_tokens:
|
||||||
|
payload["parameters"]["max_new_tokens"] = max_tokens
|
||||||
|
|
||||||
|
if stop:
|
||||||
|
payload["parameters"]["stop_sequences"] = stop if isinstance(stop, list) else [stop]
|
||||||
|
|
||||||
|
url = f"{self.base_url}/text/generation"
|
||||||
|
params = {"version": "2024-02-13"}
|
||||||
|
|
||||||
|
response = await client.post(
|
||||||
|
url,
|
||||||
|
headers=headers,
|
||||||
|
json=payload,
|
||||||
|
params=params,
|
||||||
|
)
|
||||||
|
|
||||||
|
if response.status_code != 200:
|
||||||
|
raise Exception(f"watsonx API error: {response.status_code} - {response.text}")
|
||||||
|
|
||||||
|
return response.json()
|
||||||
|
|
||||||
|
async def embeddings(
|
||||||
|
self,
|
||||||
|
model_id: str,
|
||||||
|
inputs: List[str],
|
||||||
|
**kwargs,
|
||||||
|
) -> Dict:
|
||||||
|
"""Generate embeddings using watsonx.ai.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_id: The watsonx embedding model ID
|
||||||
|
inputs: List of texts to embed
|
||||||
|
**kwargs: Additional parameters
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Embeddings response
|
||||||
|
"""
|
||||||
|
headers = await self._get_headers()
|
||||||
|
client = await self._get_client()
|
||||||
|
|
||||||
|
payload = {
|
||||||
|
"model_id": model_id,
|
||||||
|
"project_id": self.project_id,
|
||||||
|
"inputs": inputs,
|
||||||
|
}
|
||||||
|
|
||||||
|
url = f"{self.base_url}/text/embeddings"
|
||||||
|
params = {"version": "2024-02-13"}
|
||||||
|
|
||||||
|
response = await client.post(
|
||||||
|
url,
|
||||||
|
headers=headers,
|
||||||
|
json=payload,
|
||||||
|
params=params,
|
||||||
|
)
|
||||||
|
|
||||||
|
if response.status_code != 200:
|
||||||
|
raise Exception(f"watsonx API error: {response.status_code} - {response.text}")
|
||||||
|
|
||||||
|
return response.json()
|
||||||
|
|
||||||
|
|
||||||
|
# Global service instance
|
||||||
|
watsonx_service = WatsonxService()
|
||||||
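The SSE read loop in `chat_completion_stream` can be sketched in isolation. This is a hedged, self-contained mirror of that parsing logic (the helper name `parse_sse_lines` and the sample payloads are illustrative, not part of the repository):

```python
import json


def parse_sse_lines(lines):
    """Collect decoded JSON payloads from SSE 'data:' lines, stopping at [DONE]."""
    events = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments and blank keep-alive lines
        data = line[6:]  # strip the "data: " prefix, as the service method does
        if data.strip() == "[DONE]":
            break
        try:
            events.append(json.loads(data))
        except json.JSONDecodeError:
            continue  # malformed chunks are skipped rather than raised
    return events


sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    ': keep-alive comment',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
chunks = parse_sse_lines(sample)
text = "".join(c["choices"][0]["delta"]["content"] for c in chunks)
# text == "Hello"
```

The same skip-on-decode-error behavior is what lets the service tolerate partial or non-JSON lines mid-stream.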
21
app/utils/__init__.py
Normal file
@@ -0,0 +1,21 @@
"""Utility functions for watsonx-openai-proxy."""

from app.utils.transformers import (
    transform_messages_to_watsonx,
    transform_tools_to_watsonx,
    transform_watsonx_to_openai_chat,
    transform_watsonx_to_openai_chat_chunk,
    transform_watsonx_to_openai_completion,
    transform_watsonx_to_openai_embeddings,
    format_sse_event,
)

__all__ = [
    "transform_messages_to_watsonx",
    "transform_tools_to_watsonx",
    "transform_watsonx_to_openai_chat",
    "transform_watsonx_to_openai_chat_chunk",
    "transform_watsonx_to_openai_completion",
    "transform_watsonx_to_openai_embeddings",
    "format_sse_event",
]
272
app/utils/transformers.py
Normal file
@@ -0,0 +1,272 @@
"""Utilities for transforming between OpenAI and watsonx formats."""

import time
import uuid
from typing import Any, Dict, List, Optional

from app.models.openai_models import (
    ChatMessage,
    ChatCompletionChoice,
    ChatCompletionUsage,
    ChatCompletionResponse,
    ChatCompletionChunk,
    ChatCompletionChunkChoice,
    ChatCompletionChunkDelta,
    CompletionChoice,
    CompletionResponse,
    EmbeddingData,
    EmbeddingUsage,
    EmbeddingResponse,
)


def transform_messages_to_watsonx(messages: List[ChatMessage]) -> List[Dict[str, Any]]:
    """Transform OpenAI messages to watsonx format.

    Args:
        messages: List of OpenAI ChatMessage objects

    Returns:
        List of watsonx-compatible message dicts
    """
    watsonx_messages = []

    for msg in messages:
        watsonx_msg = {
            "role": msg.role,
        }

        if msg.content:
            watsonx_msg["content"] = msg.content

        if msg.name:
            watsonx_msg["name"] = msg.name

        if msg.tool_calls:
            watsonx_msg["tool_calls"] = msg.tool_calls

        if msg.function_call:
            watsonx_msg["function_call"] = msg.function_call

        watsonx_messages.append(watsonx_msg)

    return watsonx_messages


def transform_tools_to_watsonx(tools: Optional[List[Dict]]) -> Optional[List[Dict]]:
    """Transform OpenAI tools to watsonx format.

    Args:
        tools: List of OpenAI tool definitions

    Returns:
        List of watsonx-compatible tool definitions
    """
    if not tools:
        return None

    # watsonx uses similar format to OpenAI for tools
    return tools


def transform_watsonx_to_openai_chat(
    watsonx_response: Dict[str, Any],
    model: str,
    request_id: Optional[str] = None,
) -> ChatCompletionResponse:
    """Transform watsonx chat response to OpenAI format.

    Args:
        watsonx_response: Response from watsonx chat API
        model: Model name to include in response
        request_id: Optional request ID, generates one if not provided

    Returns:
        OpenAI-compatible ChatCompletionResponse
    """
    response_id = request_id or f"chatcmpl-{uuid.uuid4().hex[:24]}"
    created = int(time.time())

    # Extract choices
    choices = []
    watsonx_choices = watsonx_response.get("choices", [])

    for idx, choice in enumerate(watsonx_choices):
        message_data = choice.get("message", {})

        message = ChatMessage(
            role=message_data.get("role", "assistant"),
            content=message_data.get("content"),
            tool_calls=message_data.get("tool_calls"),
            function_call=message_data.get("function_call"),
        )

        choices.append(
            ChatCompletionChoice(
                index=idx,
                message=message,
                finish_reason=choice.get("finish_reason"),
            )
        )

    # Extract usage
    usage_data = watsonx_response.get("usage", {})
    usage = ChatCompletionUsage(
        prompt_tokens=usage_data.get("prompt_tokens", 0),
        completion_tokens=usage_data.get("completion_tokens", 0),
        total_tokens=usage_data.get("total_tokens", 0),
    )

    return ChatCompletionResponse(
        id=response_id,
        created=created,
        model=model,
        choices=choices,
        usage=usage,
    )


def transform_watsonx_to_openai_chat_chunk(
    watsonx_chunk: Dict[str, Any],
    model: str,
    request_id: str,
) -> ChatCompletionChunk:
    """Transform watsonx streaming chunk to OpenAI format.

    Args:
        watsonx_chunk: Streaming chunk from watsonx
        model: Model name to include in response
        request_id: Request ID for this stream

    Returns:
        OpenAI-compatible ChatCompletionChunk
    """
    created = int(time.time())

    # Extract choices
    choices = []
    watsonx_choices = watsonx_chunk.get("choices", [])

    for idx, choice in enumerate(watsonx_choices):
        delta_data = choice.get("delta", {})

        delta = ChatCompletionChunkDelta(
            role=delta_data.get("role"),
            content=delta_data.get("content"),
            tool_calls=delta_data.get("tool_calls"),
            function_call=delta_data.get("function_call"),
        )

        choices.append(
            ChatCompletionChunkChoice(
                index=idx,
                delta=delta,
                finish_reason=choice.get("finish_reason"),
            )
        )

    return ChatCompletionChunk(
        id=request_id,
        created=created,
        model=model,
        choices=choices,
    )


def transform_watsonx_to_openai_completion(
    watsonx_response: Dict[str, Any],
    model: str,
    request_id: Optional[str] = None,
) -> CompletionResponse:
    """Transform watsonx text generation response to OpenAI completion format.

    Args:
        watsonx_response: Response from watsonx text generation API
        model: Model name to include in response
        request_id: Optional request ID, generates one if not provided

    Returns:
        OpenAI-compatible CompletionResponse
    """
    response_id = request_id or f"cmpl-{uuid.uuid4().hex[:24]}"
    created = int(time.time())

    # Extract results
    results = watsonx_response.get("results", [])
    choices = []

    for idx, result in enumerate(results):
        choices.append(
            CompletionChoice(
                text=result.get("generated_text", ""),
                index=idx,
                finish_reason=result.get("stop_reason"),
            )
        )

    # Extract usage
    usage_data = watsonx_response.get("usage", {})
    usage = ChatCompletionUsage(
        prompt_tokens=usage_data.get("prompt_tokens", 0),
        completion_tokens=usage_data.get("generated_tokens", 0),
        total_tokens=usage_data.get("prompt_tokens", 0) + usage_data.get("generated_tokens", 0),
    )

    return CompletionResponse(
        id=response_id,
        created=created,
        model=model,
        choices=choices,
        usage=usage,
    )


def transform_watsonx_to_openai_embeddings(
    watsonx_response: Dict[str, Any],
    model: str,
) -> EmbeddingResponse:
    """Transform watsonx embeddings response to OpenAI format.

    Args:
        watsonx_response: Response from watsonx embeddings API
        model: Model name to include in response

    Returns:
        OpenAI-compatible EmbeddingResponse
    """
    # Extract results
    results = watsonx_response.get("results", [])
    data = []

    for idx, result in enumerate(results):
        embedding = result.get("embedding", [])
        data.append(
            EmbeddingData(
                embedding=embedding,
                index=idx,
            )
        )

    # Calculate usage
    input_token_count = watsonx_response.get("input_token_count", 0)
    usage = EmbeddingUsage(
        prompt_tokens=input_token_count,
        total_tokens=input_token_count,
    )

    return EmbeddingResponse(
        data=data,
        model=model,
        usage=usage,
    )


def format_sse_event(data: str) -> str:
    """Format data as Server-Sent Event.

    Args:
        data: JSON string to send

    Returns:
        Formatted SSE string
    """
    return f"data: {data}\n\n"
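The completion transformer above is a straight shape-mapping. A hedged, dict-level sketch of the same mapping (plain dicts stand in for the repository's Pydantic models, and the `"object"` field value is an assumption based on OpenAI's schema, not taken from this codebase):

```python
import time
import uuid


def to_openai_completion(watsonx_response, model):
    """Map a watsonx text-generation response dict to an OpenAI-shaped dict."""
    usage = watsonx_response.get("usage", {})
    prompt = usage.get("prompt_tokens", 0)
    generated = usage.get("generated_tokens", 0)
    return {
        "id": f"cmpl-{uuid.uuid4().hex[:24]}",
        "object": "text_completion",  # assumed OpenAI object name
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "text": r.get("generated_text", ""),
                "index": i,
                "finish_reason": r.get("stop_reason"),
            }
            for i, r in enumerate(watsonx_response.get("results", []))
        ],
        "usage": {
            "prompt_tokens": prompt,
            "completion_tokens": generated,
            "total_tokens": prompt + generated,
        },
    }


resp = to_openai_completion(
    {
        "results": [{"generated_text": "Paris.", "stop_reason": "eos_token"}],
        "usage": {"prompt_tokens": 7, "generated_tokens": 3},
    },
    model="ibm/granite-3-8b-instruct",
)
# resp["choices"][0]["text"] == "Paris."; resp["usage"]["total_tokens"] == 10
```

Note the asymmetry the real transformer also handles: watsonx reports `generated_tokens` while OpenAI clients expect `completion_tokens`, so the total must be recomputed rather than copied.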
27
docker-compose.yml
Normal file
@@ -0,0 +1,27 @@
version: '3.8'

services:
  watsonx-openai-proxy:
    build: .
    ports:
      - "8000:8000"
    environment:
      - IBM_CLOUD_API_KEY=${IBM_CLOUD_API_KEY}
      - WATSONX_PROJECT_ID=${WATSONX_PROJECT_ID}
      - WATSONX_CLUSTER=${WATSONX_CLUSTER:-us-south}
      - HOST=0.0.0.0
      - PORT=8000
      - LOG_LEVEL=${LOG_LEVEL:-info}
      - API_KEY=${API_KEY:-}
      - ALLOWED_ORIGINS=${ALLOWED_ORIGINS:-*}
      - MODEL_MAP_GPT4=${MODEL_MAP_GPT4:-ibm/granite-4-h-small}
      - MODEL_MAP_GPT35=${MODEL_MAP_GPT35:-ibm/granite-3-8b-instruct}
      - MODEL_MAP_GPT4_TURBO=${MODEL_MAP_GPT4_TURBO:-meta-llama/llama-3-3-70b-instruct}
      - MODEL_MAP_TEXT_EMBEDDING_ADA_002=${MODEL_MAP_TEXT_EMBEDDING_ADA_002:-ibm/slate-125m-english-rtrvr}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import httpx; httpx.get('http://localhost:8000/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5s
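The compose file relies on `${VAR:-default}` interpolation throughout, so every setting can be overridden from the host environment or a `.env` file while sane defaults apply otherwise. A minimal sketch of that shell expansion rule (values here are illustrative):

```shell
# ${VAR:-default} expands to $VAR when it is set and non-empty,
# otherwise to the literal default — exactly how Compose fills
# LOG_LEVEL, WATSONX_CLUSTER, and the MODEL_MAP_* entries above.
unset LOG_LEVEL
echo "LOG_LEVEL=${LOG_LEVEL:-info}"   # unset -> default applies

LOG_LEVEL=debug
echo "LOG_LEVEL=${LOG_LEVEL:-info}"   # set -> the value wins
```

This is why `API_KEY=${API_KEY:-}` defaults to empty (auth disabled) rather than failing when the variable is absent.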
183
example_usage.py
Normal file
@@ -0,0 +1,183 @@
"""Example usage of watsonx-openai-proxy with OpenAI Python SDK."""

import os

from openai import OpenAI

# Configure the client to use the proxy
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=os.getenv("API_KEY", "not-needed-if-proxy-has-no-auth"),
)


def example_chat_completion():
    """Example: Basic chat completion."""
    print("\n=== Chat Completion Example ===")

    response = client.chat.completions.create(
        model="ibm/granite-3-8b-instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        temperature=0.7,
        max_tokens=100,
    )

    print(f"Response: {response.choices[0].message.content}")
    print(f"Tokens used: {response.usage.total_tokens}")


def example_streaming_chat():
    """Example: Streaming chat completion."""
    print("\n=== Streaming Chat Example ===")

    stream = client.chat.completions.create(
        model="ibm/granite-3-8b-instruct",
        messages=[
            {"role": "user", "content": "Tell me a short story about a robot."},
        ],
        stream=True,
        max_tokens=200,
    )

    print("Response: ", end="", flush=True)
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()


def example_with_model_mapping():
    """Example: Using mapped model names."""
    print("\n=== Model Mapping Example ===")

    # If you configured MODEL_MAP_GPT4=ibm/granite-4-h-small in .env
    # you can use "gpt-4" and it will be mapped automatically
    response = client.chat.completions.create(
        model="gpt-4",  # This gets mapped to ibm/granite-4-h-small
        messages=[
            {"role": "user", "content": "Explain quantum computing in one sentence."},
        ],
        max_tokens=50,
    )

    print(f"Response: {response.choices[0].message.content}")


def example_embeddings():
    """Example: Generate embeddings."""
    print("\n=== Embeddings Example ===")

    response = client.embeddings.create(
        model="ibm/slate-125m-english-rtrvr",
        input=[
            "The quick brown fox jumps over the lazy dog.",
            "Machine learning is a subset of artificial intelligence.",
        ],
    )

    print(f"Generated {len(response.data)} embeddings")
    print(f"Embedding dimension: {len(response.data[0].embedding)}")
    print(f"First embedding (first 5 values): {response.data[0].embedding[:5]}")


def example_list_models():
    """Example: List available models."""
    print("\n=== List Models Example ===")

    models = client.models.list()

    print(f"Available models: {len(models.data)}")
    print("\nFirst 5 models:")
    for model in models.data[:5]:
        print(f"  - {model.id}")


def example_completion_legacy():
    """Example: Legacy completion endpoint."""
    print("\n=== Legacy Completion Example ===")

    response = client.completions.create(
        model="ibm/granite-3-8b-instruct",
        prompt="Once upon a time, in a land far away,",
        max_tokens=50,
        temperature=0.8,
    )

    print(f"Completion: {response.choices[0].text}")


def example_with_functions():
    """Example: Function calling (if supported by model)."""
    print("\n=== Function Calling Example ===")

    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather in a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                        },
                    },
                    "required": ["location"],
                },
            },
        }
    ]

    try:
        response = client.chat.completions.create(
            model="ibm/granite-3-8b-instruct",
            messages=[
                {"role": "user", "content": "What's the weather like in Boston?"},
            ],
            tools=tools,
            tool_choice="auto",
        )

        message = response.choices[0].message
        if message.tool_calls:
            print(f"Function called: {message.tool_calls[0].function.name}")
            print(f"Arguments: {message.tool_calls[0].function.arguments}")
        else:
            print(f"Response: {message.content}")
    except Exception as e:
        print(f"Function calling may not be supported by this model: {e}")


if __name__ == "__main__":
    print("watsonx-openai-proxy Usage Examples")
    print("=" * 50)

    try:
        # Run examples
        example_chat_completion()
        example_streaming_chat()
        example_embeddings()
        example_list_models()
        example_completion_legacy()

        # Optional examples (may require specific configuration)
        # example_with_model_mapping()
        # example_with_functions()

        print("\n" + "=" * 50)
        print("All examples completed successfully!")

    except Exception as e:
        print(f"\nError: {e}")
        print("\nMake sure:")
        print("1. The proxy server is running (python -m app.main)")
        print("2. Your .env file is configured correctly")
        print("3. You have the OpenAI Python SDK installed (pip install openai)")
7
requirements.txt
Normal file
@@ -0,0 +1,7 @@
fastapi==0.115.0
uvicorn[standard]==0.32.0
pydantic==2.9.2
pydantic-settings==2.6.0
httpx==0.27.2
python-dotenv==1.0.1
python-multipart==0.0.12
87
tests/test_basic.py
Normal file
@@ -0,0 +1,87 @@
"""Basic tests for watsonx-openai-proxy."""

import pytest
from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)


def test_health_check():
    """Test the health check endpoint."""
    response = client.get("/health")
    assert response.status_code == 200
    data = response.json()
    assert data["status"] == "healthy"
    assert "cluster" in data


def test_root_endpoint():
    """Test the root endpoint."""
    response = client.get("/")
    assert response.status_code == 200
    data = response.json()
    assert data["service"] == "watsonx-openai-proxy"
    assert "endpoints" in data


def test_list_models():
    """Test listing available models."""
    response = client.get("/v1/models")
    assert response.status_code == 200
    data = response.json()
    assert data["object"] == "list"
    assert len(data["data"]) > 0
    assert all(model["object"] == "model" for model in data["data"])


def test_retrieve_model():
    """Test retrieving a specific model."""
    response = client.get("/v1/models/ibm/granite-3-8b-instruct")
    assert response.status_code == 200
    data = response.json()
    assert data["id"] == "ibm/granite-3-8b-instruct"
    assert data["object"] == "model"


def test_retrieve_nonexistent_model():
    """Test retrieving a model that doesn't exist."""
    response = client.get("/v1/models/nonexistent-model")
    assert response.status_code == 404


# Note: The following tests require valid IBM Cloud credentials
# and should be run with pytest markers or in integration tests

@pytest.mark.skip(reason="Requires valid IBM Cloud credentials")
def test_chat_completion():
    """Test chat completion endpoint."""
    response = client.post(
        "/v1/chat/completions",
        json={
            "model": "ibm/granite-3-8b-instruct",
            "messages": [
                {"role": "user", "content": "Hello!"}
            ],
        },
    )
    assert response.status_code == 200
    data = response.json()
    assert "choices" in data
    assert len(data["choices"]) > 0


@pytest.mark.skip(reason="Requires valid IBM Cloud credentials")
def test_embeddings():
    """Test embeddings endpoint."""
    response = client.post(
        "/v1/embeddings",
        json={
            "model": "ibm/slate-125m-english-rtrvr",
            "input": "Test text",
        },
    )
    assert response.status_code == 200
    data = response.json()
    assert "data" in data
    assert len(data["data"]) > 0