Scale Your Ollama Infrastructure

Intelligent load balancing and model orchestration that transform multiple Ollama instances into a unified, high-availability AI inference cluster

MIT Licensed · Free & Open Source · .NET 8.0

Why OllamaFlow?

🎯 Multiple Virtual Endpoints

Create multiple virtual Ollama endpoints, each mapped to its own set of backend Ollama instances

⚖️ Smart Load Balancing

Distribute requests intelligently across healthy backends using round-robin or random strategies

🔄 Automatic Model Sync

Ensure all backends have required models with automatic discovery and parallel downloading

❤️ Health Monitoring

Real-time health checks with configurable thresholds ensure requests only go to healthy backends

📊 Reduce Downtime

Handle backend failures seamlessly: health checks detect failing backends and requests are automatically proxied to healthy endpoints

🛠️ RESTful Admin API

Full control through a comprehensive management API with bearer token authentication

Ollama API Compatibility

OllamaFlow transparently proxies all Ollama API endpoints while adding intelligent routing and management capabilities

/api/generate    Text generation
/api/chat        Chat completions
/api/pull        Model pulling
/api/push        Model pushing
/api/show        Model information
/api/tags        List models
/api/ps          Running models
/api/embed       Generate embeddings
/api/delete      Delete models
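
For example, a text generation request sent to a virtual endpoint is forwarded to one of its healthy backends exactly as if you were talking to Ollama directly. This sketch assumes a frontend is configured and listening on the default port 43411 used in the quick start below, and that a model named llama3 is available:

# Generate text through the virtual endpoint (proxied to a backend Ollama instance)
curl http://localhost:43411/api/generate \
  -d '{"model": "llama3", "prompt": "Why is the sky blue?", "stream": false}'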

Key Features

Load Balancing

  • Round-robin and random distribution
  • Request routing based on health
  • Automatic failover
  • Configurable rate limiting

Model Management

  • Automatic model discovery
  • Intelligent synchronization
  • Dynamic model requirements
  • Parallel downloads

Enterprise Ready

  • Bearer token authentication
  • Comprehensive logging
  • Docker & Compose ready
  • SQLite persistence

Use Cases

Scalable CPU Inference

Perfect for dense CPU systems like Ampere processors, enabling cost-effective AI inference

GPU Cluster Management

Distribute AI workloads across multiple GPU servers for maximum performance and utilization

High Availability

Ensure your AI services stay online 24/7 with automatic failover and health monitoring

Development & Testing

Easily switch between different model configurations and test various deployment scenarios

Cost Optimization

Maximize hardware utilization across your infrastructure by intelligently routing requests

Multi-Tenant Scenarios

Isolate workloads while sharing infrastructure through multiple frontend configurations

Get Started in Minutes

Using Docker

# Pull the image
docker pull ollamaflow/ollamaflow

# Run with configuration
docker run -d -p 43411:43411 \
  -v $(pwd)/ollamaflow.json:/app/ollamaflow.json \
  ollamaflow/ollamaflow
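
Prefer Compose? Below is a minimal sketch that mirrors the docker run command above; the image name, port, and config path are taken from that command, so adjust them to your environment:

# Write a minimal compose file and start the service
cat > docker-compose.yml <<'EOF'
services:
  ollamaflow:
    image: ollamaflow/ollamaflow
    ports:
      - "43411:43411"
    volumes:
      - ./ollamaflow.json:/app/ollamaflow.json
EOF
docker compose up -d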

Using .NET

# Clone and build
git clone https://github.com/ollamaflow/ollamaflow.git
cd ollamaflow/src && dotnet build

# Run
cd OllamaFlow.Server/bin/Debug/net8.0
dotnet OllamaFlow.Server.dll
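
Whichever way you start the server, you can confirm the proxy is working by calling the standard Ollama endpoints through the frontend (this assumes a frontend and at least one healthy backend are configured, and that the server listens on the default port 43411):

# List models available through the virtual endpoint
curl http://localhost:43411/api/tags

# Check running models via the proxied /api/ps endpoint
curl http://localhost:43411/api/ps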

Complete Postman collection included for easy API testing!
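
The admin API is protected by bearer token authentication, and its routes are documented in the Postman collection. The call below is illustrative only; the /frontends path is a placeholder, not a confirmed route:

# Call the management API with a bearer token
# (replace $TOKEN with your configured token; the path is a placeholder, see the Postman collection for actual routes)
curl -H "Authorization: Bearer $TOKEN" http://localhost:43411/v1.0/frontends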