Intelligent load balancing, model orchestration, API control, and API request conformance that transform multiple AI inference and embedding instances into a unified, high-availability fabric
Create multiple virtual Ollama or OpenAI-compatible endpoints, each mapping to a set of backend instances running Ollama, vLLM, SharpAI, and more!
Distribute requests intelligently across healthy backends using round-robin or random strategies
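A minimal sketch of these selection strategies (illustrative only, not OllamaFlow's actual implementation): pick the next backend from the currently healthy set using round-robin or random selection. The backend URLs are hypothetical examples.

```python
import itertools
import random

# Backends currently reported healthy by the health checker (hypothetical hosts).
healthy_backends = [
    "http://gpu-node-1:11434",
    "http://gpu-node-2:11434",
    "http://cpu-node-1:11434",
]

round_robin = itertools.cycle(healthy_backends)

def pick_round_robin() -> str:
    """Rotate through healthy backends in order."""
    return next(round_robin)

def pick_random() -> str:
    """Choose any healthy backend with equal probability."""
    return random.choice(healthy_backends)
```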
Ensure all Ollama backends have required models with automatic discovery and parallel downloads
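To illustrate the idea (this is not OllamaFlow's internal code), a script can discover each backend's models via Ollama's /api/tags endpoint and pull any missing ones in parallel via /api/pull. The backend URLs and model names below are hypothetical examples.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

REQUIRED_MODELS = ["llama3.1:8b", "nomic-embed-text"]
BACKENDS = ["http://gpu-node-1:11434", "http://gpu-node-2:11434"]

def ensure_models(backend: str) -> None:
    # Discover which models the backend already has.
    tags = requests.get(f"{backend}/api/tags", timeout=10).json()
    present = {m["name"] for m in tags.get("models", [])}
    for model in REQUIRED_MODELS:
        if model not in present:
            # stream=False makes Ollama respond once the pull has finished.
            requests.post(
                f"{backend}/api/pull",
                json={"model": model, "stream": False},
                timeout=None,
            ).raise_for_status()

# Check and pull against all backends in parallel.
with ThreadPoolExecutor() as pool:
    list(pool.map(ensure_models, BACKENDS))
```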
Control how your hardware resources are used and enforce API request conformance so you can scale with confidence
Real-time health checks with configurable thresholds ensure requests only go to healthy backends
Seamlessly handle backend failures: health checks detect unhealthy backends, and requests are automatically proxied to healthy endpoints
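A conceptual sketch of threshold-based health checking (not OllamaFlow's code): a backend leaves the rotation after a configurable number of consecutive probe failures and rejoins after a configurable number of consecutive successes. Thresholds and hosts here are hypothetical.

```python
import requests

UNHEALTHY_THRESHOLD = 2  # consecutive failures before removal from rotation
HEALTHY_THRESHOLD = 2    # consecutive successes before re-admission

state = {
    "http://gpu-node-1:11434": {"healthy": True, "failures": 0, "successes": 0},
}

def probe(backend: str) -> bool:
    """Ollama answers a plain GET on its root URL when it is up."""
    try:
        return requests.get(backend, timeout=2).ok
    except requests.RequestException:
        return False

def update_health(backend: str) -> None:
    s = state[backend]
    if probe(backend):
        s["successes"], s["failures"] = s["successes"] + 1, 0
        if not s["healthy"] and s["successes"] >= HEALTHY_THRESHOLD:
            s["healthy"] = True
    else:
        s["failures"], s["successes"] = s["failures"] + 1, 0
        if s["healthy"] and s["failures"] >= UNHEALTHY_THRESHOLD:
            s["healthy"] = False
```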
OllamaFlow translates Ollama and OpenAI API requests based on what your backend supports, while adding intelligent routing, high availability, and management
/api/generate - Text generation
/api/chat - Chat completions
/api/pull - Model pulling
/api/push - Model pushing
/api/show - Model information
/api/tags - List models
/api/ps - Running models
/api/embed - Generate embeddings
/api/delete - Delete models
/v1/completions - Completions
/v1/chat/completions - Chat completions
/v1/embeddings - Generate embeddings
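Both API families can be exercised against the same OllamaFlow frontend. A minimal Python example, assuming a hypothetical frontend URL and an example model name:

```python
import requests

OLLAMAFLOW = "http://localhost:43411"  # placeholder: replace with your OllamaFlow frontend URL
MODEL = "llama3.1:8b"                  # example model name

# Ollama-native chat request
resp = requests.post(f"{OLLAMAFLOW}/api/chat", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
})
print(resp.json()["message"]["content"])

# OpenAI-compatible chat completion against the same frontend
resp = requests.post(f"{OLLAMAFLOW}/v1/chat/completions", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "Hello!"}],
})
print(resp.json()["choices"][0]["message"]["content"])
```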
Fine-grained security controls to ensure AI endpoints are used in accordance with your objectives, allowing you to confidently scale
Enable or disable embeddings and completions APIs, and enforce request parameter compliance - models, temperature, context size, and others
Pin requests to specific backends based on incoming labels to meet regulatory requirements or to target systems with specific attributes
Configure multiple frontends with different security policies to safely serve different teams or customers
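As a purely conceptual illustration (this is not OllamaFlow's configuration format), a per-frontend policy could enable or disable APIs and constrain request parameters, as described above. Frontend names, fields, and limits below are hypothetical.

```python
# Hypothetical per-frontend policies: which APIs are exposed and which
# request parameters are acceptable for each team or customer.
POLICIES = {
    "internal-research": {
        "allow_embeddings": True,
        "allowed_models": {"llama3.1:8b", "llama3.1:70b"},
        "max_temperature": 1.0,
    },
    "customer-facing": {
        "allow_embeddings": False,
        "allowed_models": {"llama3.1:8b"},
        "max_temperature": 0.7,
    },
}

def conforms(frontend: str, request: dict) -> bool:
    """Return True if the request satisfies the frontend's policy."""
    policy = POLICIES[frontend]
    if request.get("model") not in policy["allowed_models"]:
        return False
    temperature = request.get("options", {}).get("temperature", 0.0)
    return temperature <= policy["max_temperature"]
```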
The OllamaFlow API Explorer is a browser-based tool for testing, debugging, and evaluating AI inference APIs.
Perfect for GPU systems and dense CPU systems like those powered by Ampere processors
Distribute AI workloads across multiple GPU servers for maximum performance and utilization
Ensure your AI services stay online 24/7 with automatic failover and health monitoring
Easily switch between different model configurations and test various deployment scenarios
Maximize hardware utilization across your infrastructure by intelligently routing requests
Isolate workloads while sharing infrastructure through multiple frontend configurations
Complete Postman collection included for easy API testing!