Intelligent load balancing and model orchestration that transforms multiple Ollama instances into a unified, high-availability AI inference cluster
Create multiple virtual Ollama endpoints, each mapping to a set of backend Ollama instances
Distribute requests intelligently across healthy backends using round-robin or random strategies
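To make the frontend-to-backend mapping and the two selection strategies concrete, here is a minimal Python sketch. The frontend names, backend URLs, and the dictionary structure are illustrative only and do not reflect OllamaFlow's actual configuration schema.

```python
import itertools
import random

# Illustrative mapping of virtual frontends to backend Ollama instances;
# the real OllamaFlow configuration format may differ.
frontends = {
    "frontend-1": ["http://gpu-node-1:11434", "http://gpu-node-2:11434"],
    "frontend-2": ["http://cpu-node-1:11434"],
}

def round_robin(backends):
    """Cycle through backends in a fixed order."""
    return itertools.cycle(backends)

def pick_random(backends):
    """Pick a backend uniformly at random."""
    return random.choice(backends)

rr = round_robin(frontends["frontend-1"])
print(next(rr), next(rr), next(rr))          # alternates between the two backends
print(pick_random(frontends["frontend-1"]))  # random selection
```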
Ensure all backends have the required models through automatic discovery and parallel downloading
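OllamaFlow performs this synchronization automatically; the sketch below only illustrates the underlying idea using the standard Ollama API (`GET /api/tags` to list installed models, `POST /api/pull` to download one). The backend URLs and model names are placeholders, and the pull parameter name follows the current Ollama API documentation.

```python
import concurrent.futures
import requests

BACKENDS = ["http://gpu-node-1:11434", "http://gpu-node-2:11434"]  # placeholders
REQUIRED = {"llama3.1:8b", "nomic-embed-text"}                     # placeholders

def installed_models(backend: str) -> set[str]:
    """List models already present on a backend via the Ollama API."""
    resp = requests.get(f"{backend}/api/tags", timeout=10)
    resp.raise_for_status()
    return {m["name"] for m in resp.json().get("models", [])}

def pull_model(backend: str, model: str) -> None:
    """Ask a backend to download a model (non-streaming pull)."""
    requests.post(f"{backend}/api/pull",
                  json={"model": model, "stream": False},
                  timeout=3600).raise_for_status()

# Discover which required models each backend is missing, then pull in parallel.
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = []
    for backend in BACKENDS:
        for model in REQUIRED - installed_models(backend):
            futures.append(pool.submit(pull_model, backend, model))
    for f in futures:
        f.result()
```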
Real-time health checks with configurable thresholds ensure requests only go to healthy backends
Handle backend failures seamlessly: health checks detect failing backends, and requests are automatically proxied to healthy endpoints
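A simplified sketch of threshold-based health tracking follows. The threshold values and the check target (Ollama answers a plain `GET /` on its API port) are illustrative, not OllamaFlow's defaults.

```python
import requests

HEALTHY_THRESHOLD = 2    # consecutive successes to mark healthy (illustrative)
UNHEALTHY_THRESHOLD = 3  # consecutive failures to mark unhealthy (illustrative)

class BackendHealth:
    """Tracks consecutive check results and flips state at the thresholds."""

    def __init__(self, url: str):
        self.url = url
        self.healthy = True
        self.successes = 0
        self.failures = 0

    def record(self, ok: bool) -> None:
        if ok:
            self.successes += 1
            self.failures = 0
            if self.successes >= HEALTHY_THRESHOLD:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= UNHEALTHY_THRESHOLD:
                self.healthy = False

    def check(self) -> None:
        try:
            ok = requests.get(self.url, timeout=5).ok
        except requests.RequestException:
            ok = False
        self.record(ok)

backends = [BackendHealth("http://gpu-node-1:11434"),
            BackendHealth("http://gpu-node-2:11434")]
for b in backends:
    b.check()
eligible = [b.url for b in backends if b.healthy]  # only healthy backends receive traffic
```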
Full control through a comprehensive management API with bearer token authentication
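Administrative calls carry a bearer token in the `Authorization` header. The host, port, and endpoint path in this sketch are placeholders rather than documented OllamaFlow routes; the included Postman collection describes the actual management API.

```python
import requests

OLLAMAFLOW_URL = "http://localhost:43411"   # placeholder address
ADMIN_TOKEN = "your-admin-token"            # placeholder bearer token

# Hypothetical management call; real route names come from the
# OllamaFlow management API / Postman collection.
resp = requests.get(
    f"{OLLAMAFLOW_URL}/v1.0/backends",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```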
OllamaFlow transparently proxies all Ollama API endpoints while adding intelligent routing and management capabilities; a usage example follows the endpoint table below.
| Endpoint | Description |
|----------|-------------|
| `/api/generate` | Text generation |
| `/api/chat` | Chat completions |
| `/api/pull` | Model pulling |
| `/api/push` | Model pushing |
| `/api/show` | Model information |
| `/api/tags` | List models |
| `/api/ps` | Running models |
| `/api/embed` | Generate embeddings |
| `/api/delete` | Delete models |
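Because the proxying is transparent, an existing Ollama client only needs to point at the OllamaFlow frontend address instead of an individual Ollama instance. The frontend URL and model name below are placeholders; the request body follows the standard Ollama `/api/generate` API.

```python
import requests

# Point an ordinary Ollama API call at the OllamaFlow frontend (placeholder URL)
# instead of a single Ollama instance; OllamaFlow routes it to a healthy backend.
FRONTEND_URL = "http://ollamaflow.local:43411"

resp = requests.post(
    f"{FRONTEND_URL}/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```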
Distribute AI workloads across multiple GPU servers for maximum performance and utilization
Perfect for dense CPU systems like Ampere processors, enabling cost-effective AI inference
Ensure your AI services stay online 24/7 with automatic failover and health monitoring
Easily switch between different model configurations and test various deployment scenarios
Maximize hardware utilization across your infrastructure by intelligently routing requests
Isolate workloads while sharing infrastructure through multiple frontend configurations
Complete Postman collection included for easy API testing!