Dockerizing Local LLMs (Ollama) for Cloud Deployments
Introduction
Running Large Language Models locally during development is easy thanks to tools like Ollama. But when it's time to deploy your AI application to a production server (like AWS EC2, GCP, or RunPod), relying on hand-installed binaries is fragile: versions drift between machines, and the setup is hard to reproduce when a server has to be rebuilt.
By containerizing Ollama with Docker, you ensure that your deployment is immutable, reproducible, and scalable. However, there are two major hurdles when running LLMs in Docker: passing the GPU through to the container and persisting massive model weights.
Step 1: The NVIDIA Container Toolkit
Before touching Docker, your Linux host must have the NVIDIA Container Toolkit installed. Docker containers are isolated by default; they cannot see your physical GPU. The toolkit bridges this gap.
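# Example installation for Ubuntu.
# Assumes the NVIDIA driver is installed and NVIDIA's apt repository
# for the toolkit has already been added (see NVIDIA's install guide).
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker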
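Before moving on, it is worth confirming that a container can actually see the GPU. The snippet below is a minimal sketch, not an official procedure: verify_gpu_passthrough is a hypothetical helper name, and the ubuntu image is just a convenient throwaway base (the toolkit is expected to inject nvidia-smi into the container).

```shell
# Sketch: check that both the host and a throwaway container can see the GPU.
# verify_gpu_passthrough is a hypothetical helper name, not part of any tool.
verify_gpu_passthrough() {
  # The host needs a working NVIDIA driver before the toolkit can help.
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: install the NVIDIA driver on the host first" >&2
    return 1
  fi
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found on PATH" >&2
    return 1
  fi
  # Host-level check.
  nvidia-smi || return 1
  # Container-level check: --gpus all asks Docker to expose every GPU.
  docker run --rm --gpus all ubuntu nvidia-smi
}

# Usage:
#   verify_gpu_passthrough && echo "GPU is visible inside containers"
```

If the second nvidia-smi table matches the first, the runtime is wired up correctly. If only the host check succeeds, the toolkit configuration or the Docker restart is the likely culprit.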
Step 2: The Docker Compose File
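We use docker-compose.yml to define our infrastructure as code. Notice the deploy block: it reserves one NVIDIA GPU for the container. Without it, the container cannot see the GPU and Ollama silently falls back to CPU execution, which is typically far slower for inference (often by an order of magnitude or more, depending on the model and hardware).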
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-gpu
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
volumes:
  ollama_data:
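Because the CPU fallback is silent, it helps to check where a model actually runs once the stack is up. Here is a rough sketch, assuming the container name ollama-gpu from the file above; start_and_check is a hypothetical helper, and llama3 is only an example model tag.

```shell
# Sketch: start the stack, pull a model, and check where it is running.
# start_and_check is a hypothetical helper; "llama3" is an example model tag.
start_and_check() {
  model="${1:-llama3}"
  # Start the service defined in docker-compose.yml in the background.
  docker compose up -d || return 1
  # Download the weights into the ollama_data volume.
  docker exec ollama-gpu ollama pull "$model" || return 1
  # Send a one-off prompt so the model gets loaded into memory.
  docker exec ollama-gpu ollama run "$model" "Say hello" || return 1
  # List loaded models; the PROCESSOR column reports GPU vs CPU placement.
  docker exec ollama-gpu ollama ps
}

# Usage:
#   start_and_check llama3
```

If the PROCESSOR column reports CPU, the deploy block or the host toolkit setup is the first place to look.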
Why Volumes Matter
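The ollama_data:/root/.ollama volume is critical. Models like Llama 3 can be 4GB to 40GB in size, depending on the variant. Without a persistent volume, the downloaded weights live in the container's writable layer and are lost whenever the container is removed or recreated (for example, after docker compose down or an image update), forcing a massive re-download.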
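One way to convince yourself the volume is doing its job is to recreate the container and see whether the models survive. This is a sketch under the same assumptions as the Compose file above; check_persistence is a hypothetical helper name, and the short sleep is a guess at how long the server needs to start.

```shell
# Sketch: recreate the container and confirm the pulled models are still there.
# check_persistence is a hypothetical helper name.
check_persistence() {
  # Models present before the container is recreated.
  docker exec ollama-gpu ollama list || return 1
  # Removes the container but keeps named volumes (no -v flag).
  docker compose down || return 1
  docker compose up -d || return 1
  # Give the server a moment to start; the exact delay is an assumption.
  sleep 3
  # The same models should be listed again, with no re-download.
  docker exec ollama-gpu ollama list
}

# Usage:
#   check_persistence
```

Note that docker compose down -v would delete the named volume as well, so leave that flag off unless you really want to discard the weights.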
Conclusion & Deployment
With this configuration, you can run docker compose up -d on any cloud instance with an NVIDIA GPU and the Container Toolkit installed. After pulling a model, you can start serving GPU-accelerated inference via port 11434.
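To try the endpoint, a request against Ollama's /api/generate route might look like the sketch below. build_generate_payload and ask_ollama are hypothetical helper names, OLLAMA_URL is an illustrative variable rather than anything Ollama reads, and the model tag is assumed to have been pulled already.

```shell
# Sketch: send a single non-streaming prompt to the Ollama HTTP API.
# build_generate_payload and ask_ollama are hypothetical helper names.

build_generate_payload() {
  # $1 = model tag, $2 = prompt.
  # Assumes neither value contains quotes or backslashes; for arbitrary
  # text, build the JSON with a real encoder such as jq instead.
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$2"
}

ask_ollama() {
  # OLLAMA_URL is an illustrative variable; it defaults to the port
  # published in docker-compose.yml.
  base_url="${OLLAMA_URL:-http://localhost:11434}"
  curl -sS "$base_url/api/generate" -d "$(build_generate_payload "$1" "$2")"
}

# Usage:
#   ask_ollama llama3 "Why is the sky blue?"
```

With stream set to false, the server replies with a single JSON object instead of a stream of partial tokens, which is easier to inspect by hand.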
Check out the full code scaffold in our repository for advanced tricks, like pre-baking models directly into the Dockerfile!