PRODUCTION INFRASTRUCTURE

Dockerizing Local LLMs (Ollama) for Cloud Deployments

By FS AI Hub

2026-06-27

Introduction

Running large language models locally during development is easy thanks to tools like Ollama. But when it's time to deploy your AI application to a production server (on AWS EC2, GCP, or RunPod, for example), relying on a hand-installed binary makes the environment hard to reproduce and risky to upgrade.

By containerizing Ollama with Docker, you ensure that your deployment is immutable, reproducible, and scalable. However, there are two major hurdles when running LLMs in Docker: passing the GPU through to the container and persisting massive model weights.

Step 1: The NVIDIA Container Toolkit

Before touching Docker, install the NVIDIA Container Toolkit on your Linux host (the NVIDIA driver itself must already be present). Docker containers are isolated by default and cannot see your physical GPU; the toolkit bridges this gap.

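# Example installation for Ubuntu
# Assumes NVIDIA's apt repository is already configured (see the NVIDIA Container Toolkit install guide)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime in Docker's daemon config, then restart Docker to apply it
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker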
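
A quick way to confirm the toolkit is wired up is to run nvidia-smi inside a throwaway container. The sketch below assumes the host's NVIDIA driver is already installed; the ubuntu image and the function name check_gpu_passthrough are illustrative choices, not part of the original setup.

```shell
# Hypothetical helper: runs nvidia-smi inside a disposable container.
# If the toolkit is configured, the usual GPU table is printed;
# otherwise docker exits with an error.
check_gpu_passthrough() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found on PATH" >&2
    return 1
  fi
  # --gpus all exposes every host GPU; --rm deletes the container afterwards
  docker run --rm --gpus all ubuntu nvidia-smi
}
```

Calling check_gpu_passthrough after the restart should list your GPU. If it errors out instead, revisit the toolkit installation before moving on to Compose.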

Step 2: The Docker Compose File

We use docker-compose.yml to define our infrastructure as code. Notice the deploy block: it reserves one NVIDIA GPU for the container. Without it, Ollama will silently fall back to CPU execution, which is typically many times slower.

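services:
  ollama:
    # For reproducible deploys, consider pinning a specific version tag instead of latest
    image: ollama/ollama:latest
    container_name: ollama-gpu
    ports:
      - "11434:11434"  # Ollama's HTTP API
    volumes:
      - ollama_data:/root/.ollama  # persist downloaded model weights
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1  # number of GPUs to reserve
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ollama_data: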
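
Because the CPU fallback is silent, it is worth checking that the GPU is actually visible inside the service container. The following is a minimal sketch, assuming the compose file above sits in the current directory; verify_ollama_gpu is a hypothetical helper name.

```shell
# Hypothetical helper: starts the stack and checks GPU visibility
# from inside the ollama-gpu container defined in the compose file.
verify_ollama_gpu() {
  # Start (or update) the stack in the background
  docker compose up -d || return 1
  # nvidia-smi is normally made available inside the container by the
  # NVIDIA runtime; the command fails if no GPU was passed through
  docker exec ollama-gpu nvidia-smi
}
```

If verify_ollama_gpu prints the GPU table, the deploy block is doing its job. An error at the second step usually points at the deploy block or the host toolkit rather than at Ollama itself.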

Why Volumes Matter

The ollama_data:/root/.ollama volume is critical. Models like Llama 3 can be 4 GB to 40 GB in size. Without a persistent volume, the downloaded weights live in the container's writable layer and are lost whenever the container is removed or recreated (for example, on docker compose down or an image upgrade), forcing a massive re-download.
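
One way to see the volume at work is to pull a model, recreate the container, and list the models again. This is a sketch under stated assumptions: the model tag llama3 and the helper name check_model_persistence are illustrative, and the short retry loop is only there to give the server a moment to start.

```shell
# Hypothetical helper: shows that pulled weights survive container re-creation.
# The default model tag (llama3) is an assumption; pass another tag as $1.
check_model_persistence() {
  model="${1:-llama3}"
  docker exec ollama-gpu ollama pull "$model" || return 1
  # down removes the container but keeps named volumes (no -v flag)
  docker compose down || return 1
  docker compose up -d || return 1
  # Retry briefly while the Ollama server inside the new container starts
  for attempt in 1 2 3 4 5; do
    docker exec ollama-gpu ollama list && return 0
    sleep 1
  done
  return 1
}
```

After check_model_persistence finishes, the model should appear in the final listing without a second download. Running docker compose down -v instead would delete the volume and the weights with it.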

Conclusion & Deployment

With this configuration, you can run docker compose up -d on any cloud instance equipped with a GPU and, once you have pulled a model, start serving high-performance inference via port 11434.
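
To try the endpoint, you can send a single prompt to Ollama's /api/generate route. The sketch below assumes the stack is running on the same machine and that the model you name has already been pulled into the container; ask_ollama is a hypothetical helper, and the prompt is inserted into the JSON as-is, so it must not contain double quotes or backslashes.

```shell
# Hypothetical helper: sends one non-streaming prompt to the Ollama HTTP API.
# $1 = model tag, $2 = prompt text (no JSON escaping is performed here)
ask_ollama() {
  model="$1"
  prompt="$2"
  # "stream": false asks for one JSON object instead of a token stream
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$model\", \"prompt\": \"$prompt\", \"stream\": false}"
}
```

For example, ask_ollama llama3 "Why is the sky blue?" should return a JSON object whose response field holds the generated text. From another machine, replace localhost with the instance's address.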

Check out the full code scaffold in our repository for advanced tricks, like pre-baking models directly into the Docker image!