LLM Inference on the Olivia HPC

Olivia is the latest and greatest Sigma2 high-performance computing (HPC) system for Norwegian research. It is excellent at running open-weight large language model (LLM) inference workloads. Think of running AI models similar to ChatGPT, but inside a Norwegian underground mine instead of the American clouds.

It is, however, not a straightforward process, as is tradition in the HPC world. This guide will help you get started with running LLM inference on Olivia.

Highlights

  • The Olivia HPC is excellent for running LLM inference workloads, but it's not necessarily worth the hassle for everyone.
  • You need to fill out a form to get access and wait, sometimes for months.
  • Olivia is an HPC environment using Slurm and Apptainer; you need to be familiar with the UNIX command line, but it's nothing too complex.
  • This guide explains how to run either Ollama or vLLM LLM servers on Olivia.
  • Ollama is easy and convenient; vLLM is faster.

Is using Olivia to run LLM inference worth the hassle?

Olivia is fast and great. It features many NVIDIA GH200-powered nodes, basically some of the best AI hardware one can use in 2025. But you may not need Olivia.

Perhaps you can run your LLM inference workloads on your local machine. A recent MacBook Pro is a very capable LLM inference machine, for example. If you made the mistake of going with the standard SINTEF HP laptop, well, sorry. It's probably not going to be very good at running LLMs. By the way, a phone is faster than this HP laptop. Exhibit A and Exhibit B.

Using your own hardware restricts you to the smaller LLMs, but you may not need the biggest models anyway. Is your research really going to benefit from running DeepSeek V3.1 671B or Qwen3 235B instead of a smaller but still good Gemma3 27B or Qwen3 8B? Results may be slightly better, but at what cost?

Also, current small models are better than much larger models from months ago. And while the current larger models are better, they are likely going to be surpassed by smaller models in the near future. Perhaps your research is more about the methodology than the absolute best results you achieved on a specific date.

Perhaps you don't have data privacy requirements and can use a cloud provider instead, such as Azure, GCP, or AWS. Perhaps OpenRouter.ai or HuggingFace's Inference API can serve your needs. They need your credit card, but dealing with credit card expenses is a better experience than Sigma2's forms, in my humble opinion.

Finally, we have a "chatgpt-at-home" server running LLM inference on-premises at SINTEF. It's definitely not as fast as Olivia, but it's easy to use and available within seconds, and it can run pretty large models.

But if you need the best performance and want to run the largest open-weight LLMs yourself on someone else's computer, Olivia is the way to go.

The Art of Getting Access to Olivia

The first step is to apply for a project. Everything needs a project number at Sigma2 too. The best approach is to follow the Sigma2 documentation, as the process is long and cumbersome: Sigma2 - How to apply for resources.

Please know that you need the following to apply:

  • Be a permanent SINTEF employee.
  • Have an active research project funded by the Norwegian Research Council.
  • Be good at planning ahead, as applications are only processed twice a year.
  • Have at least a rough plan of what you are going to do and how many resources you will need, years ahead.

If you are like me, you will likely find it challenging to estimate how many CPU and GPU hours you will use over the next two years. I personally don't predict the future very well, so I have no idea about the future resource needs of LLM inference workloads a few months from now.

My suggestion is to wet your favourite finger and hold it up to feel the wind. Then, try to focus your eyesight on the tip of your nose. Finally, fill in the resource usage estimates table with numbers that are not very high and feel reasonable. You are likely not going to need a lot of hours from Sigma2's point of view. I'm guessing that this section of the form is more about identifying projects that need to run very large and long computational tasks. Your LLM inference workloads using a few hours are not going to be a problem. And you can always apply for more resources later, if needed. You don't have to wait months to request more resources.

Once your project is approved and started, you need to add users to the project. This involves another Sigma2 application form and some delays, but this is usually much faster. These users should be approved by the manager of the project.

When logging in to the "Metacenter Administration System" (MAS) as an administrator, you should select the "Feide - Norwegian educational institutions" option, then search and select "SINTEF" as the organisation, and finally click on the "Bruk arbeids- eller skolekonto" button with the Microsoft Windows logo, as SINTEF loves Microsoft. When logging in as a user, it's a more classic username and password form, with TOTP (the annoying 6-digit codes that change all the time).

If everything goes well, you will eventually manage to successfully SSH into Olivia: ssh your_username@olivia.sigma2.no. Congratulations!

Please note that SSH public key authentication is not enabled; it's TOTP and password only. The TOTP is likely named "NRIS: your_username" in your authenticator app.

If you have trouble getting access, don't hesitate to contact the Sigma2 support. They are very helpful.

Olivia uses Slurm, a Job Scheduler

At this stage, I assume that you know what SSH is, you didn't panic when you didn't get visual feedback when typing your password, and you have successfully logged in to Olivia. By the way, Microsoft Windows comes with a built-in SSH client nowadays, so you don't need to install anything weird like PuTTY.

I recommend digging into the extensive Sigma2 documentation, but here is some information to get you started.

First of all, Olivia, like many research HPCs, uses Slurm. Slurm is not very complex, but you may want to read about it a bit. In short, it's a job scheduler. You submit jobs and they run when resources are available. It may take a few seconds to a few days for a job to start. A job is a script and some resource requests. For example, you can request 200 CPUs and 4 GPUs for 2 hours, with 1024GB of memory, on 2 nodes. You can run many jobs in parallel.

Unlike commercial cloud providers, Olivia does not over-provision: if you request 100 CPUs, you get 100 actual physical CPUs. If you only need 8 of them, the other 92 will sit idle and no one else can use them. There is an art to requesting the right amount of resources. You shouldn't request too much, because you will be billed for it and it's wasteful, but you shouldn't request too little either, or you may get poor performance.

Basic Slurm commands

  • sinfo - Show the current status of the cluster. You are probably interested in the accel partition, and how many gpu-1 nodes are currently available (idle state).
  • squeue - Show the global job queue. It can be pretty long, but it may give you an idea of how busy the cluster is.
  • squeue -u $USER - Show the queue, only for your user.
  • sbatch your_job.batch - Submit a job script to the scheduler. We will come to the job scripts later.
  • scancel job_id - Cancel a job with the given id.
  • srun -p accel -N 1 --gres=gpu:2 --cpus-per-task=32 --mem=128G -t 01:00:00 --account=your_project_id --job-name=$USER_interactive --pty bash - Start an interactive job session with 2 GPUs, 32 CPUs, 128GB of memory, for 1 hour. You may have to wait in the queue before the session starts. This can be useful for testing and debugging, but interactive sessions are not recommended in general. The job will stop once you exit the bash shell, if you cancel it, or if the time limit is reached. The project ID likely follows the "nn[0-9]+k" pattern.
  • scontrol update job=<job_id> TimeLimit=HH:MM:SS - Update the time limit of a running job.

Estimating the Resource Usage for LLM Inference

When running LLMs, we can roughly estimate the resource usage based on the model size and the context size. You may want to adjust the resource requests based on the actual usage you observe, but you need to start somewhere. Otherwise, you can also request a full node; it's a bit wasteful, but it's simpler.

But let's assume you want to be more precise in your Slurm job specifications. If your model weights are 40GB on disk, you will need at least roughly 40GB of GPU memory just to hold the model. You then need to take into account the context size. The context size is how many tokens the model can see at once. The bigger the context size, the more memory you need.

Thanks to the flash attention algorithm, context memory usage is no longer quadratic in the context size; it grows linearly, dominated by the KV cache, though with a sizeable constant factor. The actual size requirements depend on the model architecture, and you will need to read the model documentation. It's usually on the order of 2GB to 16GB, but it can be more. You should also plan for an extra few GB to be safe.
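
To make this concrete, here is a small back-of-the-envelope sketch in Python. All the numbers are illustrative assumptions for a hypothetical model architecture, not values for any real model; replace them with the figures from your model's documentation.

# Rough GPU memory estimate for LLM inference. All numbers are illustrative assumptions.
weights_gb = 40            # size of the model weights on disk
context_length = 16384     # number of tokens of context you plan to use

# KV cache per token ≈ 2 (K and V) * layers * KV heads * head dimension * bytes per value
n_layers = 32              # hypothetical model architecture, check your model's documentation
n_kv_heads = 8
head_dim = 128
bytes_per_value = 2        # fp16/bf16 KV cache; roughly 1 with q8_0 quantization

kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_length / 1e9
overhead_gb = 5            # activations, CUDA buffers, safety margin

total_gb = weights_gb + kv_cache_gb + overhead_gb
print(f"KV cache: ~{kv_cache_gb:.1f}GB, total GPU memory needed: ~{total_gb:.0f}GB")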

For bigger models, the weights may not fit entirely in one GPU's memory. They can then be split across multiple GPUs on the same node, or across GPUs on different nodes. You could also spill over into the CPU and system RAM, but this is much slower and I'm not convinced you should attempt it on Olivia. Using the CPU and the slow RAM as spillover is something for hobbyists with more time than resources. In terms of power consumption and performance, it's not great.

In this guide, we will focus on single-node multi-GPU inference, as it's the easiest to set up and fast. Multi-node inference is very much possible on Olivia, but it's a lot more complex to set up, and I'm not sure the need is there for open-weight LLM inference at the moment. It's more interesting to stick to smaller LLMs that fit in one node, in my opinion.

By the way, Olivia has about as much external network bandwidth as the standard SINTEF HP laptop has between its CPU and RAM. I find that mesmerizing. Multi-node inference is likely going to perform pretty well if you decide to go for it, but this is beyond the scope of this guide.

Also, the Olivia GH200 nodes are marketed as having 120GB of memory per GPU. In practice, I noticed that the usable memory is about 97GB (97871MiB). To simplify, I would say that you can fit models that weigh up to 360GB per node. I went with 360 because it's a cool number, but the exact figure is in this ballpark. This is more than good enough for most open-weight LLMs.

When it comes to system RAM rather than GPU memory, each GPU node has about 858GB of RAM available, but you can request less. I usually request the GPU memory requirement plus a small buffer of a few GB. For CPUs, each node has 288 cores, but 16 or 32 should be more than enough for GPU-powered LLM inference. Use more if you notice that you are CPU-bound.
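
Here is a sketch of how such an estimate translates into a Slurm request. The 97GB-per-GPU, 858GB-of-RAM, and 288-core figures come from the observations above; the memory estimate is the hypothetical one from the previous snippet.

import math

# Figures observed on Olivia's GH200 nodes (see above).
usable_gpu_mem_gb = 97
node_ram_gb = 858
node_cpus = 288

total_gpu_mem_needed_gb = 47   # hypothetical estimate from the previous sketch

n_gpus = math.ceil(total_gpu_mem_needed_gb / usable_gpu_mem_gb)
ram_request_gb = total_gpu_mem_needed_gb + 10   # GPU memory requirement plus a small buffer
cpus = 16                                       # 16 to 32 is usually plenty for GPU-bound inference

assert ram_request_gb <= node_ram_gb and cpus <= node_cpus  # sanity check: fits on one node
print(f"--gres=gpu:{n_gpus} --mem={ram_request_gb}G --cpus-per-task={cpus}")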

Which LLM server? Ollama, vLLM, or others?

There are quite a few LLM servers out there. But we can focus on the two interesting ones: Ollama and vLLM.

Ollama is more popular, is simpler to install, has more features, and has a more polished user experience. vLLM runs inference much faster on Olivia. They both provide an OpenAI-compatible HTTP REST API.

I would recommend going with Ollama if you want a simple set-up that is about the same whether it runs on your laptop or on Olivia. As Ollama reuses llama.cpp under the hood, you could use llama.cpp instead, but Ollama adds nice features such as model downloading from a registry, and hot and fast model swapping.

vLLM was more challenging to set up on Olivia, as it required help from Sigma2's support, but it should now work smoothly for everyone. vLLM can only run one model at a time, and you need to download it beforehand. It can also take a little while to start. This is clearly less flexible, but you should use vLLM if you want the best performance today.

We will cover both Ollama and vLLM in this guide.

Installing Ollama on Olivia

Olivia is a bit sneaky: its login nodes, the computers you SSH into, use the AMD64 architecture, while the GPU compute nodes use the ARM64 architecture. I think this is a design mistake by Sigma2 (the login nodes for the GPU nodes should be ARM64), but that's how it is.

A lot of installation scripts and software detect the architecture automatically, but cannot guess that you want to install ARM64 software while being on an AMD64 machine. Do not use the automatic installation script from Ollama on the login nodes; use the following instead:

# Download Ollama's latest version for ARM64
wget "https://ollama.com/download/ollama-linux-arm64.tgz"

# Extract the archive in your home directory
mkdir -p ~/ollama
tar -xvzf ollama-linux-arm64.tgz -C ~/ollama

# Test the installation
export ACCOUNT_ID="nn12068k"  # Replace with your project ID
srun -p accel -N 1 --gres=gpu:0 -t 00:01:00 --job-name=test-ollama-version --account=$ACCOUNT_ID --mem=1G --cpus-per-task=1 ~/ollama/bin/ollama --version

As you can see, Ollama is simple to install as it packages everything it needs.

Run Ollama on Olivia

  • Replace <your_project_id> with your actual project ID in the script below.
  • Adjust the number of GPUs as needed in the gres=gpu:X option.
  • Also adjust the time limit --time=HH:MM:SS as needed, depending on how long you want Ollama to keep running when you forget to stop it. You can always update the time limit of a running job later with scontrol update job=<job_id> TimeLimit=HH:MM:SS.

LLMs can use a lot of storage and you may want to place them in your project folder instead of your home folder. This is done using the OLLAMA_MODELS environment variable.

ollama-server.sbatch

#!/bin/bash
#SBATCH --job-name=ollama-server
#SBATCH --partition=accel
#SBATCH --ntasks=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=01:00:00
#SBATCH --output=ollama_server_%j.log
#SBATCH --error=ollama_server_%j.err
#SBATCH --account=<your_project_id>

set -euo pipefail

# Configure Ollama environment
export OLLAMA_MODELS="/cluster/work/projects/<your_project_id>/ollama_models"
mkdir -p $OLLAMA_MODELS

# Increase the context length, which is ridiculously small by default. Adjust as needed.
export OLLAMA_CONTEXT_LENGTH=16384

# Most models have flash attention enabled by default nowadays, but it's good to have it explicit
export OLLAMA_FLASH_ATTENTION=1

# Quantize the KV cache to use less memory for context with flash attention. Probably worth it, with minimal quality loss. Comment it out if in doubt.
export OLLAMA_KV_CACHE_TYPE=q8_0

# We are running on beefy hardware, so we can afford those values. The number of parallel requests where it stops being faster is model-dependent.
export OLLAMA_MAX_LOADED_MODELS=8
export OLLAMA_NUM_PARALLEL=8

# Expose the server to everyone without authentication, MongoDB style.
export OLLAMA_HOST=0.0.0.0:11434

ml load NRIS/GPU

# Start the Ollama server
srun ~/ollama/bin/ollama serve

To submit the job script:

sbatch ollama-server.sbatch

To stop the server, cancel the job:

scancel $JOB_ID

To extend the time limit of the running job:

scontrol update job=$JOB_ID TimeLimit=HH:MM:SS

To see the logs:

tail -f ollama_server_$JOB_ID.log
# and
tail -f ollama_server_$JOB_ID.err

Retrieve the IPv4 address of the Ollama server

squeue -u $USER # to find the node name
nslookup <node_name>  # e.g. nslookup gpu-1-33

Do not assume that the node number in its name is the same as the last octet of its IP address. It is not always the case on Olivia. For example, while gpu-1-57 has 10.168.0.57, gpu-1-33 has 10.168.0.23.

Check connectivity to the Ollama server

curl http://<node_ip>:11434/ # e.g. http://10.168.0.23:11434/
# should print "Ollama is running"


curl http://<node_name>:11434/ # works on login nodes, but fails on the GPU nodes due to
# DNS issues within Olivia's HTTP proxy. It's always DNS, so use the IPv4 address instead.

Run Ollama CLI

# replace <node_ip> and <your_project_id> accordingly
srun -p accel -N 1 --gres=gpu:0 -t 00:15:00 --job-name=ollama-cli --account=<your_project_id> --mem=1G --cpus-per-task=1 --pty --export=ALL,OLLAMA_HOST=http://<node_ip>:11434 ~/ollama/bin/ollama run gemma3

Ollama Usage Examples

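Using the official Ollama Python client (the ollama package):
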
from ollama import Client

client = Client(host="http://<node_ip>:11434")
response = client.chat(
    model="gpt-oss:120b",
    messages=[{"role":"user","content":"What is the capital of Norway?"}],
)
print(response.message.content)

or using the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(base_url="http://<node_ip>:11434/v1", api_key="")
response = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[{"role":"user","content":"What is the capital of Norway?"}],
)
print(response.choices[0].message.content)

Please note that you can also use the Ollama HTTP API directly using your favourite HTTP client.
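
For example, here is a minimal sketch using the requests library against Ollama's native /api/chat endpoint. It assumes requests is installed in your Python environment; see the Ollama API documentation for the full schema.

import requests

# Ollama's native chat endpoint. Replace <node_ip> with the GPU node's IP address.
response = requests.post(
    "http://<node_ip>:11434/api/chat",
    json={
        "model": "gpt-oss:120b",
        "messages": [{"role": "user", "content": "What is the capital of Norway?"}],
        "stream": False,  # return a single JSON object instead of a stream of chunks
    },
    timeout=600,
)
response.raise_for_status()
print(response.json()["message"]["content"])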

Installing vLLM on Olivia

vLLM is already installed on Olivia, thanks to Sigma2's support team.

It is an Apptainer container located at /cluster/apptainer/vllm/vllm.sif. Apptainer containers are similar to the common OCI/Docker containers, but they are not compatible with them; they are less isolated and designed for HPCs. You don't have to know much more about them. You can see Apptainer containers as glorified .tar.gz archives containing software.

Some models will work out-of-the-box, while others may need a bit more configuration. Notably, if you are using a recent OpenAI model such as gpt-oss-20b or gpt-oss-120b, you need to download some tiktoken files as they fail to be fetched automatically due to Olivia's network restrictions.

mkdir -p /cluster/work/projects/<your_project_id>/vllm_harmony_workaround
cd /cluster/work/projects/<your_project_id>/vllm_harmony_workaround
wget https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
cd -

Running vLLM on Olivia

  • Replace <your_project_id> with your actual project ID in the script below.
  • Replace <model_name> with the actual model name you want to run, e.g. openai/gpt-oss-120b.
  • Adjust the number of GPUs as needed in the gres=gpu:X option, and the --tensor_parallel_size=X option.
  • Also adjust the time limit --time=HH:MM:SS as needed, depending on how long you want vLLM to keep running when you forget to stop it. You can always update the time limit of a running job later with scontrol update job=<job_id> TimeLimit=HH:MM:SS.
  • Also consider adjusting the other vLLM options as needed.

vLLM has a cache folder that may use a lot of storage, so you may want to configure it using the VLLM_CACHE_ROOT environment variable. It's in ~/.cache/vllm by default. Models are downloaded from HuggingFace automatically and stored by default in ~/.cache/huggingface/hub. Set HF_HOME to change this location.

vllm-server.sbatch

#!/bin/bash
#SBATCH --job-name=vllm-server
#SBATCH --partition=accel
#SBATCH --ntasks=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=01:00:00
#SBATCH --output=vllm_server_%j.log
#SBATCH --error=vllm_server_%j.err
#SBATCH --account=<your_project_id>

set -euo pipefail
export PROJECT_PATH=/cluster/work/projects/<your_project_id>

export VLLM_CACHE_ROOT=$PROJECT_PATH/vllm_cache
mkdir -p $VLLM_CACHE_ROOT
export HF_HOME=$PROJECT_PATH/huggingface_home
mkdir -p $HF_HOME

# Disable vLLM telemetry
export VLLM_DO_NOT_TRACK=1

export MODEL_NAME="<model_name>"
# export MODEL_NAME="openai/gpt-oss-120b"

# Harmony workaround if you use recent OpenAI models
export TIKTOKEN_ENCODINGS_BASE=$PROJECT_PATH/vllm_harmony_workaround

ml load NRIS/GPU
ml load CUDA

srun apptainer exec \
  --nv --bind $CUDA_HOME:/usr/local/cuda \
  --bind /cluster \
  /cluster/apptainer/vllm/vllm.sif \
  vllm serve $MODEL_NAME \
  --tool-call-parser hermes \
  --tensor_parallel_size=4 \
  --enable-expert-parallel \
  --swap-space 16 \
  --max-num-seqs 1024 \
  --max-model-len 32768 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 --port 8000

Please note that starting vLLM will take a little while. It should eventually display "Application startup complete." in the logs when it's ready to accept requests.
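
If you script against the server, you can wait for it to be ready by polling its health endpoint before sending requests. Here is a minimal sketch, assuming the requests library is installed and that the OpenAI-compatible vLLM server exposes /health (it should return 200 once the model is loaded).

import time
import requests

BASE_URL = "http://<node_ip>:8000"  # replace <node_ip> with the GPU node's IP address

# Poll the health endpoint until vLLM has finished loading the model.
while True:
    try:
        if requests.get(f"{BASE_URL}/health", timeout=5).status_code == 200:
            print("vLLM is ready")
            break
    except requests.exceptions.ConnectionError:
        pass  # the server is not up yet
    time.sleep(10)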

vLLM Job Usage

The vLLM job script can be used similarly to the Ollama one:

sbatch vllm-server.sbatch
scancel $JOB_ID
scontrol update job=$JOB_ID TimeLimit=HH:MM:SS
tail -f vllm_server_$JOB_ID.log
tail -f vllm_server_$JOB_ID.err

squeue -u $USER
nslookup <node_name>
curl http://<node_ip>:8000/

vLLM API

vLLM features an OpenAI-compatible HTTP REST API. See the vLLM documentation.

Testing the API with curl:

curl http://<node_ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "what is the capital of norway?"}
    ],
    "max_tokens": 1024
  }'

Or with the OpenAI Python client:

from openai import OpenAI

client = OpenAI(base_url="http://<node_ip>:8000/v1", api_key="")
response = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[{"role":"user","content":"What is the capital of Norway?"}],
)
print(response.choices[0].message.content)

Authentication and Access Control

As you may have noticed, both Ollama and vLLM servers are started without any authentication or access control. This means that anyone with access to Olivia can connect to your LLM servers over the internal network.

This may be acceptable for some use cases. The threats are limited:

  • Another Olivia user could run inference against your LLM server while your job runs, instead of running an LLM server on their own project. Unlikely and not very harmful.
  • They could use the Ollama API to download more models or delete the models you downloaded. It's annoying but not a huge security issue. It may cause your jobs to fail, or you may run out of storage.
  • vLLM is started with --trust-remote-code, but you are the one specifying which HuggingFace model it loads.

Starting vLLM or Ollama servers on localhost only instead of 0.0.0.0, with an authentication HTTP proxy in front, would be more secure. You would need to reserve the whole node for yourself, and it adds complexity. I think it's not worth it for most use cases.