Scale Your Ollama Infrastructure

Intelligent load balancing and model orchestration that transforms multiple Ollama instances into a unified, high-availability AI inference cluster

MIT Licensed · Free & Open Source · .NET 8.0

Why OllamaFlow?

🎯

Multiple Virtual Endpoints

Create multiple virtual Ollama endpoints, each mapping to a pool of backend Ollama instances

⚖️

Smart Load Balancing

Distribute requests intelligently across healthy backends using round-robin or random strategies

🔄

Automatic Model Sync

Ensure all backends have required models with automatic discovery and parallel downloading

❤️

Health Monitoring

Real-time health checks with configurable thresholds ensure requests only go to healthy backends

📊

Reduce Downtime

Handle backend failures seamlessly: health checks detect failing backends and requests are automatically proxied to the remaining healthy endpoints

🛠️

RESTful Admin API

Full control through a comprehensive management API with bearer token authentication

Ollama API Compatibility

OllamaFlow transparently proxies all Ollama API endpoints while adding intelligent routing and management capabilities

/api/generate Text generation
/api/chat Chat completions
/api/pull Model pulling
/api/push Model pushing
/api/show Model information
/api/tags List models
/api/ps Running models
/api/embed Generate embeddings
/api/delete Delete models
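
Because the API surface is unchanged, existing Ollama clients only need to point at the OllamaFlow frontend instead of a single instance. For example, a standard generate request looks like this (the port 43411 matches the Docker example below; the model name is just a placeholder):

# Send a normal Ollama generate request through the OllamaFlow frontend
curl http://localhost:43411/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'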

Key Features

Load Balancing

  • Round-robin and random distribution
  • Request routing based on health
  • Automatic failover
  • Configurable rate limiting

Model Management

  • Automatic model discovery
  • Intelligent synchronization
  • Dynamic model requirements
  • Parallel downloads

Enterprise Ready

  • Bearer token authentication
  • Comprehensive logging
  • Docker & Compose ready
  • SQLite persistence
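
As a rough illustration of the bearer token flow, an admin request carries the token in the Authorization header. The route shown below is hypothetical, not the actual admin API path; consult the included Postman collection for the real endpoints:

# Illustrative only: hypothetical admin route, real paths are in the Postman collection
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  http://localhost:43411/v1.0/frontends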

Use Cases

GPU Cluster Management

Distribute AI workloads across multiple GPU servers for maximum performance and utilization

Scalable CPU Inference

Perfect for dense CPU systems like Ampere processors, enabling cost-effective AI inference

High Availability

Ensure your AI services stay online 24/7 with automatic failover and health monitoring

Development & Testing

Easily switch between different model configurations and test various deployment scenarios

Cost Optimization

Maximize hardware utilization across your infrastructure by intelligently routing requests

Multi-Tenant Scenarios

Isolate workloads while sharing infrastructure through multiple frontend configurations

Get Started in Minutes

Using Docker

# Pull the image
docker pull ollamaflow/ollamaflow

# Run with configuration
docker run -d -p 43411:43411 \
  -v $(pwd)/ollamaflow.json:/app/ollamaflow.json \
  ollamaflow/ollamaflow
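
The mounted ollamaflow.json defines your frontends and the backends they map to. The snippet below is only a sketch of the idea; the field names are illustrative and may not match the actual schema, so check the repository for the real configuration format. Port 11434 is the default Ollama port.

# Sketch of a possible ollamaflow.json (field names are illustrative, not the actual schema)
cat > ollamaflow.json <<'EOF'
{
  "Frontends": [
    {
      "Name": "default",
      "LoadBalancing": "RoundRobin",
      "RequiredModels": ["llama3"],
      "Backends": ["http://gpu-node-1:11434", "http://gpu-node-2:11434"]
    }
  ]
}
EOF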

Using .NET

# Clone and build
git clone https://github.com/ollamaflow/ollamaflow.git
cd ollamaflow && dotnet build

# Run
cd OllamaFlow.Server/bin/Debug/net8.0
dotnet OllamaFlow.Server.dll
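
Once the server is running, you can sanity-check the proxy with any Ollama endpoint, for example listing models through the frontend (assuming the 43411 port used in the Docker example and at least one configured backend):

# List models visible through the frontend
curl http://localhost:43411/api/tags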

Complete Postman collection included for easy API testing!