Intelligent load balancing, model orchestration, API control, and API request conformance that transform multiple AI inference and embedding instances into a unified, high-availability fabric
Create multiple virtual Ollama or OpenAI-compatible endpoints, each mapping to a set of backend instances running Ollama, vLLM, SharpAI, and more!
Distribute requests intelligently across healthy backends using round-robin or random strategies
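A minimal sketch of these selection strategies (illustrative only, not OllamaFlow's actual implementation): pick the next backend from the currently healthy set using round-robin or random selection. The backend URLs are hypothetical examples.

```python
import itertools
import random

# Backends currently reported healthy by the health checker (hypothetical hosts).
healthy_backends = [
    "http://gpu-node-1:11434",
    "http://gpu-node-2:11434",
    "http://cpu-node-1:11434",
]

round_robin = itertools.cycle(healthy_backends)

def pick_round_robin() -> str:
    """Rotate through healthy backends in order."""
    return next(round_robin)

def pick_random() -> str:
    """Choose any healthy backend with equal probability."""
    return random.choice(healthy_backends)
```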
Ensure all Ollama backends have required models with automatic discovery and parallel downloads
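To illustrate the idea (this is not OllamaFlow's internal code), a script can discover each backend's models via Ollama's /api/tags endpoint and pull any missing ones in parallel via /api/pull. The backend URLs and model names below are hypothetical examples.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

REQUIRED_MODELS = ["llama3.1:8b", "nomic-embed-text"]
BACKENDS = ["http://gpu-node-1:11434", "http://gpu-node-2:11434"]

def ensure_models(backend: str) -> None:
    # Discover which models the backend already has.
    tags = requests.get(f"{backend}/api/tags", timeout=10).json()
    present = {m["name"] for m in tags.get("models", [])}
    for model in REQUIRED_MODELS:
        if model not in present:
            # stream=False makes Ollama respond once the pull has finished.
            requests.post(
                f"{backend}/api/pull",
                json={"model": model, "stream": False},
                timeout=None,
            ).raise_for_status()

# Check and pull against all backends in parallel.
with ThreadPoolExecutor() as pool:
    list(pool.map(ensure_models, BACKENDS))
```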
Control how your hardware resources are used and enforce API request conformance so you can scale with confidence
Real-time health checks with configurable thresholds ensure requests only go to healthy backends
Seamlessly handle backend failures: health checks detect unhealthy backends, and requests are automatically proxied to healthy endpoints
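A conceptual sketch of threshold-based health checking (not OllamaFlow's code): a backend leaves the rotation after a configurable number of consecutive probe failures and rejoins after a configurable number of consecutive successes. Thresholds and hosts here are hypothetical.

```python
import requests

UNHEALTHY_THRESHOLD = 2  # consecutive failures before removal from rotation
HEALTHY_THRESHOLD = 2    # consecutive successes before re-admission

state = {
    "http://gpu-node-1:11434": {"healthy": True, "failures": 0, "successes": 0},
}

def probe(backend: str) -> bool:
    """Ollama answers a plain GET on its root URL when it is up."""
    try:
        return requests.get(backend, timeout=2).ok
    except requests.RequestException:
        return False

def update_health(backend: str) -> None:
    s = state[backend]
    if probe(backend):
        s["successes"], s["failures"] = s["successes"] + 1, 0
        if not s["healthy"] and s["successes"] >= HEALTHY_THRESHOLD:
            s["healthy"] = True
    else:
        s["failures"], s["successes"] = s["failures"] + 1, 0
        if s["healthy"] and s["failures"] >= UNHEALTHY_THRESHOLD:
            s["healthy"] = False
```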
OllamaFlow translates Ollama and OpenAI API requests based on what your backend supports, while adding intelligent routing, high availability, and management
/api/generate - Text generation
/api/chat - Chat completions
/api/pull - Model pulling
/api/push - Model pushing
/api/show - Model information
/api/tags - List models
/api/ps - Running models
/api/embed - Generate embeddings
/api/delete - Delete models
/v1/completions - Completions
/v1/chat/completions - Chat completions
/v1/embeddings - Generate embeddings
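Both API families can be exercised against the same OllamaFlow frontend. A minimal Python example, assuming a hypothetical frontend URL and an example model name:

```python
import requests

OLLAMAFLOW = "http://localhost:43411"  # placeholder: replace with your OllamaFlow frontend URL
MODEL = "llama3.1:8b"                  # example model name

# Ollama-native chat request
resp = requests.post(f"{OLLAMAFLOW}/api/chat", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
})
print(resp.json()["message"]["content"])

# OpenAI-compatible chat completion against the same frontend
resp = requests.post(f"{OLLAMAFLOW}/v1/chat/completions", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "Hello!"}],
})
print(resp.json()["choices"][0]["message"]["content"])
```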
Fine-grained security controls to ensure AI endpoints are used in accordance with your objectives, allowing you to confidently scale
Enable or disable embeddings and completions APIs, and enforce request parameter compliance - models, temperature, context size, and others
Pin requests to specific backends based on incoming labels to meet regulatory requirements or to target systems with specific attributes
Configure multiple frontends with different security policies to safely serve different teams or customers
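As a purely conceptual illustration (this is not OllamaFlow's configuration format), a per-frontend policy could enable or disable APIs and constrain request parameters, as described above. Frontend names, fields, and limits below are hypothetical.

```python
# Hypothetical per-frontend policies: which APIs are exposed and which
# request parameters are acceptable for each team or customer.
POLICIES = {
    "internal-research": {
        "allow_embeddings": True,
        "allowed_models": {"llama3.1:8b", "llama3.1:70b"},
        "max_temperature": 1.0,
    },
    "customer-facing": {
        "allow_embeddings": False,
        "allowed_models": {"llama3.1:8b"},
        "max_temperature": 0.7,
    },
}

def conforms(frontend: str, request: dict) -> bool:
    """Return True if the request satisfies the frontend's policy."""
    policy = POLICIES[frontend]
    if request.get("model") not in policy["allowed_models"]:
        return False
    temperature = request.get("options", {}).get("temperature", 0.0)
    return temperature <= policy["max_temperature"]
```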
The OllamaFlow API Explorer is a browser-based tool for testing, debugging, and evaluating AI inference APIs.
Perfect for GPU systems and dense CPU systems like those powered by Ampere processors
Distribute AI workloads across multiple GPU servers for maximum performance and utilization
Ensure your AI services stay online 24/7 with automatic failover and health monitoring
Easily switch between different model configurations and test various deployment scenarios
Maximize hardware utilization across your infrastructure by intelligently routing requests
Isolate workloads while sharing infrastructure through multiple frontend configurations
Complete Postman collection included for easy API testing!