wir bloggen über software_

Setting Up An ML HPC Server (Part 2 – Driver Setup and Running Language Models with llama.cpp)

2026-01-27T00:00:00-06:00

Recap

In Part 1, we showed you how we set up the hardware for our new HPC server. In Part 2, we will now continue with the software.

Installing the CUDA Toolkit and NVIDIA Drivers

While NVIDIA drivers are provided in the non-free (restricted) Ubuntu repository, they turn out to be outdated. We therefore take the current drivers from NVIDIA:

NVIDIA provides a package repository for Ubuntu. Installation is thus possible without any special effort.
For CUDA, NVIDIA ensures a high degree of backward compatibility. CUDA 12.8, the most recent version at the time, still supports the much older “Pascal” GPU architecture of the Tesla P40.
An up-to-date version of CUDA and all required drivers is a prerequisite for working with the latest AI tools.

The NVIDIA Installation Guide for the CUDA toolkit lists various possibilities and paths:

Package Manager vs. Runfile Installation: Installation via the Package Manager is more convenient and has better system integration.
Local Repo Installation vs. Network Repo Installation: As long as the machine has internet access, the network repo option is better. This gives us the latest updates from NVIDIA via apt upgrade.
We choose the proprietary packages and not the open source packages, because in many respects the open source implementation is still lagging behind.

Our first step should be to check whether any packages are already installed which could lead to conflicts:

$ dpkg -l | grep nvidia
$ dpkg -l | grep cuda

This output should be empty. If it is not, any existing packages must be uninstalled.

According to section 3.8.3 of the installation guide, the GPG key and the repository can both be set up by installing a deb package. (For $UBUNTU_VERSION, substitute your Ubuntu version following the pattern "ubuntu2404".)

$ wget https://developer.download.nvidia.com/compute/cuda/repos/$UBUNTU_VERSION/x86_64/cuda-keyring_1.1-1_all.deb

# dpkg -i cuda-keyring_1.1-1_all.deb

The repository is then automatically created under /etc/apt/sources.list.d/cuda-ubuntu2404-x86_64.list. After an apt update, installing the CUDA toolkit only requires a single meta-package:

# apt install cuda-toolkit

The second step is to install the kernel modules. NVIDIA provides another guide for this. Since the repository is already set up, the meta-package can be installed directly:

# apt install cuda-drivers

This also installs a whole range of packages that are only needed for desktops with displays. At the moment a compute-only variant is offered for Fedora, SUSE and Debian, but not yet for Ubuntu.

That’s it. After a reboot, all drivers should be set up.

The CUDA binaries are located in /usr/local/cuda-12.8/bin and should be included in the PATH as described in Section 10.1.1 Environment Setup. An extension of LD_LIBRARY_PATH should not be necessary, as the configuration has already been done by the corresponding Ubuntu package (/etc/ld.so.conf.d/988_cuda-12.conf).
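The PATH extension can be sketched as follows. This is a minimal example that assumes the toolkit is installed in /usr/local/cuda-12.8 as described above; the variable name CUDA_BIN is only illustrative. To make the change permanent, the lines could go into a shell startup file such as ~/.profile.

```shell
# Assumption: the toolkit lives in /usr/local/cuda-12.8, as in the setup above.
CUDA_BIN=/usr/local/cuda-12.8/bin

# Prepend the directory only if it is not already part of PATH,
# so that sourcing this snippet repeatedly does not grow PATH.
case ":${PATH}:" in
    *":${CUDA_BIN}:"*) ;;
    *) export PATH="${CUDA_BIN}${PATH:+:${PATH}}" ;;
esac
```

Afterwards, nvcc should be found without specifying the full path.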

Verification

Below you will find a few useful commands which can be used to verify the installation. To verify that the desired driver version has been loaded:

$ cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  570.133.20  Sun Apr 13 04:50:56 UTC 2025
GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)

To verify that the CUDA compiler has been installed:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

To access the NVIDIA System Management Interface (SMI):

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P40                      Off |   00000000:04:00.0 Off |                  Off |
| N/A   23C    P8              9W /  250W |       5MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla P40                      Off |   00000000:05:00.0 Off |                  Off |
| N/A   24C    P8              9W /  250W |       5MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
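For scripted checks, nvidia-smi can also print selected fields in CSV form. The following sketch sums the memory of all detected GPUs; the helper name sum_vram_mib is hypothetical, and the sample input simply repeats the two values from the table above so that the snippet also runs on a machine without a GPU.

```shell
# Sum per-GPU memory totals (MiB, one value per line) read from stdin.
sum_vram_mib() {
    awk '{ total += $1 } END { print total + 0 }'
}

# On the server itself (requires the NVIDIA driver):
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | sum_vram_mib
# Here we feed in the two values shown in the table instead:
printf '24576\n24576\n' | sum_vram_mib
```

For the two cards listed above this yields 49152 MiB.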

Getting Started

For a stress test of the graphics cards, we use gpu-burn. As the name implies, this pushes the power consumption of the GPUs almost to the limit of 250 watts per unit. The iDRAC shows the thermal effects impressively: the temperature in the chassis (which has a volume of approx. 16 liters) rapidly rises to 60 °C. You don’t need maintenance software to notice this: the fans become noticeably louder and shriller, and there is a faint smell of burning.

Inference with llama.cpp

There are a number of software options for running current open source language models. Examples include KoboldCpp and ollama, both of which have one thing in common: they rely on the library llama.cpp, which does all the “hard work” in the background.

For initial tests, it is a good idea to work directly with llama.cpp – for several reasons:

Maximum control during configuration and optimization
Detailed output of parameters and hardware properties
New models often require up-to-date features – which usually are added to llama.cpp first

It is recommended to compile llama.cpp directly from the sources:

$ cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/opt/llama-cpp -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON
$ cmake --build build --config Release -j 16

Optionally, install the binaries to /opt/llama-cpp:

$ sudo cmake --install build

Compiling from sources

When compiling from sources, it is not uncommon to be confronted by cryptic errors. However, this is not a reason for concern.

The basic prerequisite is development tools such as cmake and g++, which must be installed on the system. The build-essential meta-package, which bundles the most important tools, is recommended.

More often than not, the so-called dev packages for required libraries are missing as well. Unfortunately, it is often not stated directly which package has to be installed; this has to be derived from the error message. For example, llama.cpp expects the curl development files, which are included in the libcurl4-openssl-dev package.
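Before starting the build, it can save time to check which tools are missing. The following is a small sketch; the helper name missing_tools is hypothetical, and the list of tools is only an example based on the prerequisites named above.

```shell
# Print every given tool that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example check for the build described above; prints nothing if all are present.
missing_tools cmake g++ nvcc

# On Ubuntu, the prerequisites named in the text can be installed with:
#   sudo apt install build-essential cmake libcurl4-openssl-dev
```

Note that cmake is listed separately in the apt command because, on Ubuntu, it is a package of its own and not part of build-essential.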

Download the models

Language models can be found on https://huggingface.co/. To use a model in llama.cpp, it must be in GGUF format. You can easily do the conversion yourself: llama.cpp includes the tools convert_hf_to_gguf.py and llama-quantize for this purpose. For popular models, however, you can often find pre-converted GGUF files on Hugging Face.

Next there is the matter of deciding on a quantization. With a smaller quantization, the model consumes less vRAM and runs faster, but at the price of reduced output quality. You can start with the largest version that fits into the vRAM and then step down if the speed is not sufficient. A 4-bit quantization is usually a good compromise: the vRAM is used efficiently while the loss in quality remains low.
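As a rough rule of thumb, the size of a quantized model is the number of parameters times the bits per weight, divided by 8. The sketch below applies this to a 7B model. The helper name estimate_gib is hypothetical, and the bits-per-weight values (about 8.5 for Q8_0 and 4.5 for Q4_0, i.e. the quantized values plus one 16-bit scale per block of 32 weights) are assumptions about the GGUF block layouts rather than figures from this article.

```shell
# Rough size estimate in GiB: parameters (in billions) x bits per weight / 8.
estimate_gib() {
    awk -v params="$1" -v bpw="$2" \
        'BEGIN { printf "%.2f\n", params * 1e9 * bpw / 8 / 1073741824 }'
}

estimate_gib 6.74 8.5   # 7B model in Q8_0
estimate_gib 6.74 4.5   # 7B model in Q4_0
```

This gives about 6.67 GiB and 3.53 GiB, close to the 6.67 GiB and 3.56 GiB that llama-bench reports in the benchmark section. Real files can be slightly larger than the estimate, and the context needs additional vRAM on top of the weights.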

We can download ready-made GGUF files with the Hugging Face CLI tool, which can be installed with the following command:

$ pip install -U "huggingface_hub"

Alternatively, one can use uvx to run the most current version of the tool directly. A corresponding alias might look something like this:

$ alias hf="uvx --from huggingface_hub hf"

The actual download to the current directory would look something like this:

$ hf download bartowski/Qwen_Qwen3-30B-A3B-GGUF --include "Qwen_Qwen3-30B-A3B-Q4_K_M.gguf" --local-dir .

Chat in the command line

llama.cpp provides a number of command line tools. With llama-cli you can directly start a chat:

$ llama-cli -m ~/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -co -cnv -fa -ngl 99

-m: Path to the GGUF model file.
-co: Colored output for better differentiation of input and output.
-cnv: Activates Conversation Mode.
-fa: Turns on Flash Attention (if supported).
-ngl 99: Allows up to 99 model layers to be computed on the GPU.

The -ngl 99 parameter tells llama.cpp to handle up to 99 model layers on the GPU. In practice, this means that all layers are offloaded – as the output also confirms:

offloaded 49/49 layers to GPU

If not all layers fit on the GPU, the remaining ones are processed on the CPU, which, as is to be expected, leads to significant performance losses.

Benchmarks

A main goal of this setup is to get a sufficient amount of vRAM at a reasonable price. But at the end of the day, we don’t just want to load huge models, we also want to execute them quickly. The rack should stand up to comparison with the consumer hardware that friends and colleagues put on their desks. A front-runner there is the Apple Silicon M series with its unified memory architecture.

We use a detailed community benchmark of the various Apple products as our reference. It tests a smaller model, Llama 7B v2. Our setup could run many instances of this model in parallel, but the maximum speed of a single run is still crucial.

With the tool llama-bench we determine two speeds:

pp (prompt parsing): Reading the question/prompt
tg (text generation): Generating the response

$ llama-bench -m llama-2-7b.Q8_0.gguf -m llama-2-7b.Q4_0.gguf -p 512 -n 128 -ngl 99 -fa 1

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	1	pp512	1024.05 ± 0.74
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	1	tg128	36.18 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	pp512	1073.58 ± 0.35
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1	tg128	55.80 ± 0.83

build: 578754b3 (5117)
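To put the effect of the quantization into numbers, the ratios between the Q4_0 and Q8_0 rows can be computed directly from the table. The helper name speedup is hypothetical; the figures are the mean values from the llama-bench output above.

```shell
# Ratio of two throughput figures, rounded to two decimals.
speedup() {
    awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.2f\n", fast / slow }'
}

speedup 1073.58 1024.05   # pp512: Q4_0 relative to Q8_0
speedup 55.80 36.18       # tg128: Q4_0 relative to Q8_0
```

The result is about 1.05 for prompt parsing and 1.54 for text generation: on this hardware the 4-bit model hardly reads the prompt any faster, but generates text roughly one and a half times as fast.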

Prompt parsing speed of the P40 compared to Apple Silicon M‑series products for the Llama 7B v2 model in 8‑bit quantization (in tokens per second)

Text generation comparison (Llama 7B v2 model in 8‑bit quantization)

Prompt parsing comparison (Llama 7B v2 model in 4‑bit quantization)

Text generation comparison (Llama 7B v2 model in 4‑bit quantization)

The P40 performs very well in the area of prompt parsing. In text generation, it is above the Pro series and in the lower range of the Max series.

Result

We have built a powerful HPC server with a total of 96 GB of vRAM, creating a solid basis for demanding AI and data processing projects, all without breaking the bank.

Rental server alternatives

Depending on the scenario, renting AI GPU servers might be a viable option. If you only need occasional training phases or want to test prototypes on short notice, you can benefit from hour-based rental offers. With long-term, continuous use, however, especially with dedicated resources and full access, the costs add up quickly. Setups in the range of EUR 1,000 per month are not uncommon, which makes running your own server more economically attractive in the long run.

Practical framework

However, while operating a high-performance server has its advantages, it also has its infrastructural requirements:

Space & noise: Under full load, a suitable server room with sufficient ventilation is required.
Energy requirements: Continuous load can result in electricity costs in the range of EUR 100 per month or more.
Maintenance & updates: You are responsible for driver and software updates, as well as repair and replacement of defective hardware.
Limited driver/software compatibility of used hardware: In our case, the P40’s “Pascal” GPU architecture was still supported well enough for us to run current models. However, we are already encountering deprecation warnings. Before purchasing, the current support for drivers (CUDA) and the important libraries and frameworks (especially PyTorch) should be researched.

Added value for the team

For our team, the current setup is a real asset:

Free experimentation: No approval processes or time pressure from expensive rental hours.
Data sovereignty: Local LLMs and AI models make it possible to securely process confidential data.
Full control: We determine the hardware, the software and the access rights. Anything goes, no matter how obscure.

Setting Up an ML HPC Server (Part 1 - Hardware)

2025-12-03T00:00:00-06:00

Motivation

Many powerful AI models such as gpt-oss or DeepSeek are now published as open source. Running current, larger models at good speed requires powerful graphics cards (GPUs). The decisive criterion is the available graphics memory (vRAM). High-end gaming GPUs come with up to 24 GB of vRAM, which is not sufficient for larger language models. Professional cards such as the NVIDIA H100 Tensor Core GPU have 80 GB of vRAM, but currently cost around €30,000.

Our goal was to build a machine learning computer that can run medium-sized models locally, without cloud providers, and that is as powerful as possible on a manageable budget. The choice fell on a Dell PowerEdge C4130 rack server with two Nvidia Tesla P40 GPUs, 64 Xeon cores, 128 GB RAM and 800 GB hot-swap disks. The used hardware cost a total of €1,550. In 2020 the P40 GPUs were in the upper performance class, and Nvidia continues to provide driver updates for them. How their performance has stood the test of time is revealed in the benchmarks in the second part of the article.

For now we describe the setup of the basic system without starting up the GPUs. The goal is a working environment that can be operated entirely without physical access. The server hardware has some interesting features for this purpose, which we will now look at more closely.

Initial assessment

The chassis of the C4130 is designed for mounting in a 19” rack. It is one rack unit (1U) high and almost 90 cm deep. It has two redundant 2 kW power supplies, one of which was unfortunately damaged during shipping.

Delivery damage to the power supply.

While the seller exchanged the damaged part without any trouble, the matching C19 power cables were not included and had to be ordered separately.

The machine is designed entirely for remote maintenance, so once it is installed in the data center it rarely requires anyone on site. It has two Gigabit Ethernet ports and a maintenance port. It can also be accessed via VGA and USB, but we do not use this for lack of a suitable VGA adapter. The handbook documents the various access routes.

When switched on for the first time, the LEDs on the front and back of the chassis flash orange. Ideally they should be solid blue, so the system is apparently not entirely healthy. The maintenance access (iDRAC) has a somewhat old-fashioned web interface on the factory-set IP 192.168.0.120. Commendably, the maintenance port works on a switch as well as directly on a laptop (auto-sense); in the latter case the laptop has to be given a manual IP address in the same subnet. The iDRAC is completely independent of the main system and can be reached as soon as the chassis receives power. In its diagnostics area, the condition of all components is visible. In our case, as expected, the removed power supply is flagged, and a fan is also defective, which is why the status LEDs flash orange.

Speaking of fans: there are 8 built-in cooling units, each with 2 fans. Because of the low height (1U is about 4.5 cm), they already spin at 8,000 rpm at idle. The limit is about 20,000 rpm, which is unpleasantly loud. Colleagues present in the room left quickly after the machine had been switched on.

Other interior features: 128 GB of main memory, 64 cores in 2 Xeon E5-2697A processors, and two 800 GB hot-swappable SSDs (1.8” uSATA). When you remove the lid of the chassis, the 4 GPU bays directly in front of the fans immediately catch the eye. Several slots are free for more main memory, and there is still room for more drives at the back. The iDRAC logs every opening and closing of the chassis, even when the machine is switched off.

The iDRAC includes a VNC console which allows access to the BIOS and other diagnostic tools. We ran a detailed memory test, which finished after several hours without reporting any errors.

iDRAC interface has that look and feel of the 90s.

Before the first boot of the main system, we change the boot order in the BIOS and disable the default network start (PXE). Thanks to this we avoid long pauses at startup.

Before we can turn our attention to the GPUs, a basic operating system is required. The choice fell on Ubuntu because it is widely used and Nvidia supplies current GPU drivers and libraries for it. Our requirements:

Encryption on both SSDs (cryptsetup + LUKS)
LVM (Logical Volume Manager) with 2 physical volumes
Within it, logical volumes for /, /var and /home

We decide against RAID1 on the hot-swappable disks in favor of more usable space for our AI models.

We start the Ubuntu server installer from a USB stick and access it via the VNC console in iDRAC. Caution is advised when entering passwords during installation: the keyboard layout of the VNC viewer in the iDRAC console is neither German nor English, but a wild mixture. We also noticed that the VNC console did not run stably and the connection did not always work; a cold start might help.

The Ubuntu installer is somewhat overwhelmed by our partitioning requests: it apparently fails because the two encrypted disks are to be combined into one LVM volume group. We work around the problem by initially setting up only one encrypted SSD with an LVM root volume. With this reduced setup, the initial installation, including a reboot, is complete within 5 minutes.

LVM allows us to change volume sizes in the file system relatively easily afterwards, as well as to include additional disks. The necessary connections are already available in the chassis.

Manual setup of the second hard drive

We would like to have /home on the second (still unformatted) disk /dev/sdb, as we want plenty of room for our AI models. To do this, we create an encrypted partition:

# parted /dev/sdb mklabel gpt
# parted -a optimal /dev/sdb mkpart primary 0% 100%
# cryptsetup luksFormat /dev/sdb1

To be able to unlock both disks with the same password, we use the script decrypt_keyctl (included in cryptsetup). It requires keyctl from the keyutils package, which we still need to install manually. The script is then entered in /etc/crypttab for both disks:

# apt install keyutils
# cat /etc/crypttab
dm_crypt-0 UUID=035c6de5-99df-4e81-ba49-578d6b97c4cf none luks,keyscript=decrypt_keyctl
crypt_sdb1 UUID=97675b26-983a-42f8-8e2c-a5edb0fb051f none luks,keyscript=decrypt_keyctl
# update-initramfs -u
# reboot

The next time the machine is restarted, both disks are decrypted as planned. We use the newly available space entirely for /home in another LVM physical volume. In principle, LVM could be dispensed with for a single partition, but it allows us to change the layout of the disks later if necessary.

# pvcreate /dev/mapper/crypt_sdb1
# vgcreate data-vg /dev/mapper/crypt_sdb1
# lvcreate -n data-home -l 100%FREE data-vg
# mkfs.ext4 /dev/data-vg/data-home
# cat /etc/fstab
...
/dev/disk/by-uuid/8209347b-0ddd-47f8-a5ba-b505cb822085 /home ext4 defaults 0 1

Normally, the password for encrypted disks is requested on the console at system startup. However, the console will no longer be accessible as soon as the machine is placed in the rack. We therefore install dropbear-initramfs to be able to unlock the disks via SSH. Deviating from the usual procedure, we convert the existing OpenSSH host keys to Dropbear format and install them in the initramfs, so that we can use the normal SSH port (22) for unlocking without causing host key conflicts.

# /usr/lib/dropbear/dropbearconvert openssh dropbear \
    /etc/ssh/ssh_host_ecdsa_key \
    /etc/dropbear/initramfs/dropbear_ecdsa_host_key
# /usr/lib/dropbear/dropbearconvert openssh dropbear \
    /etc/ssh/ssh_host_ed25519_key \
    /etc/dropbear/initramfs/dropbear_ed25519_host_key
# /usr/lib/dropbear/dropbearconvert openssh dropbear \
    /etc/ssh/ssh_host_rsa_key \
    /etc/dropbear/initramfs/dropbear_rsa_host_key

Finally, the public keys of all administrators are entered in /etc/dropbear/initramfs/authorized_keys and the ramdisk is updated:

# update-initramfs -u
# reboot

Et voilà, after a reboot, the disks can be unlocked via SSH.

Involuntary rework

During the final system cleanup, we unfortunately overlook the fact that cryptsetup-initramfs is not marked as a manually installed package, so it is automatically removed. As a result, the system no longer boots because the root partition cannot be decrypted.

Luckily, a full rescue system is hidden in the help menu of the Ubuntu installer. From there, we manually mount the installed filesystem and reinstall cryptsetup-initramfs in a chroot. Now the machine boots again.

Setting Up An ML HPC Server (Part 1 - Hardware)

2025-12-02T00:00:00-06:00

Motivation

Many powerful AI models such as gpt-oss or DeepSeek are now released as open source. Running current and larger models with good performance requires powerful graphics cards (GPUs). A decisive criterion is the available graphics memory (vRAM).

High-end gaming GPUs come with up to 24 GB of vRAM, which is not enough for larger language models. Professional cards such as the NVIDIA H100 Tensor Core GPU have 80 GB of vRAM, but currently cost about €30,000. Our goal was to build a machine learning server that is as powerful as possible on a manageable budget, on which medium-sized models can be run locally without using cloud providers.

We chose a Dell PowerEdge C4130 rack server with two Nvidia Tesla P40 GPUs, 64 Xeon cores, 128 GB of RAM and 800 GB hot-swap disks. The purchase cost for the used hardware totals €1,550. The P40 GPUs were in the upper performance class around 2020 and are still supplied with driver updates by Nvidia. What can still be done with them today is revealed by the benchmarks in the second part of the article.

The first part describes the setup of the base system without bringing the GPUs into operation. The goal is a working environment that can be operated entirely without physical access. To this end, the ordered server hardware has some interesting peculiarities, which we take a closer look at.

Initial inspection

The C4130 chassis is intended for mounting in a 19” rack; it is one rack unit (1U) high and almost 90 cm deep. It has 2 redundant 2 kW power supplies, one of which unfortunately has unmistakable transport damage.

Delivery damage to the power supply.

The seller exchanges it without any issues. Annoyingly, the matching C19 power cables are not included and have to be ordered separately as well. The machine is designed entirely for remote maintenance, so it normally requires no on-site presence once it is installed in the data center. For this purpose it has 2 Gigabit Ethernet ports and a maintenance port. It can also be accessed via VGA and USB, but we skip this for lack of a suitable VGA adapter. The manual documents the various access routes. When the machine is switched on for the first time, the LEDs on the front and back of the chassis flash orange. Normally they should be solid blue, so the system is evidently not entirely healthy.

The maintenance access (iDRAC) has a somewhat old-fashioned web interface on the factory-set IP 192.168.0.120. Commendably, the maintenance port can be connected to a switch as well as directly to a laptop (auto-sense); in the latter case, an IP address in the same network has to be set manually on the laptop.

The iDRAC is completely independent of the main system and can be reached as soon as the chassis receives power. Its diagnostics area shows the condition of all components. In our case, as expected, the removed power supply is flagged, and a fan is also defective, which is why the status LEDs flash orange. Speaking of fans: there are 8 built-in cooling units, each with 2 fans. Because of the low height (1U is about 4.5 cm), they already spin at 8,000 rpm at idle. The limit is about 20,000 rpm, which is unpleasantly loud. Colleagues present in the room left quickly after the machine had been switched on.

Other interior features: 128 GB of main memory, 64 cores in 2 Xeon E5-2697A processors, and two 800 GB hot-swappable SSDs (1.8” uSATA). When you remove the lid of the chassis, the 4 GPU bays directly in front of the fans immediately catch the eye. Several slots are free for more main memory, and there is still room for more drives at the back. The opening and reclosing of the chassis is logged by the iDRAC, even when the machine is switched off. The iDRAC includes a VNC console that provides access to the BIOS and other diagnostic tools, among other things. We run a detailed memory test, which finishes after several hours without errors.

The iDRAC interface has the look and feel of the 90s.

Before the first boot of the main system, we change the boot order in the BIOS, where network boot (PXE) is the default. We disable it to avoid long pauses at startup.

Basic Linux installation

Before we can turn our attention to the GPUs, a basic operating system is required. We chose Ubuntu because it is widely used and Nvidia supplies it with current GPU drivers and libraries.

We would like to have:

Encryption on both SSDs (cryptsetup + LUKS),
LVM on top of it with 2 physical volumes,
and within it logical volumes for /, /var and /home.

We decide against RAID1 on the hot-swappable disks in favor of more usable space for AI models. We start the Ubuntu server installer from a USB stick and access it via the VNC console in the iDRAC. Caution is advised when entering passwords during installation: the keyboard layout of the VNC viewer in the iDRAC console is idiosyncratic, neither German nor English, but a wild mixture.

We notice that the VNC console does not run entirely stably; sometimes the connection cannot be established. A cold start can help.

The Ubuntu installer is somewhat overwhelmed by our partitioning wishes: it apparently fails because the two encrypted disks are to be combined into one LVM volume (LVM = Logical Volume Manager). We work around the problem by initially setting up only one encrypted SSD with an LVM root volume. With that, the initial installation is complete within 5 minutes after a reboot.

LVM allows us to change volume sizes in the file system relatively easily afterwards, or to include additional disks. The necessary connections are already available in the chassis.

Manual setup of the second hard drive

We would like to have /home on the second (still unformatted) disk /dev/sdb, as we want plenty of room for AI models. To do this, we create an encrypted partition:

# parted /dev/sdb mklabel gpt
# parted -a optimal /dev/sdb mkpart primary 0% 100%
# cryptsetup luksFormat /dev/sdb1

To be able to unlock both disks with the same password, we use the script decrypt_keyctl (included in cryptsetup). It requires keyctl from the keyutils package, which we still need to install manually. The script is then entered in /etc/crypttab for both disks:

# apt install keyutils
# cat /etc/crypttab
dm_crypt-0 UUID=035c6de5-99df-4e81-ba49-578d6b97c4cf none luks,keyscript=decrypt_keyctl
crypt_sdb1 UUID=97675b26-983a-42f8-8e2c-a5edb0fb051f none luks,keyscript=decrypt_keyctl
# update-initramfs -u
# reboot

The next time the machine is restarted, both disks are decrypted as planned. We use the newly available space entirely for /home in another LVM physical volume. In principle, LVM could be dispensed with for a single partition, but it allows us to change the layout of the disks later if necessary.

# pvcreate /dev/mapper/crypt_sdb1
# vgcreate data-vg /dev/mapper/crypt_sdb1
# lvcreate -n data-home -l 100%FREE data-vg
# mkfs.ext4 /dev/data-vg/data-home
# cat /etc/fstab
...
/dev/disk/by-uuid/8209347b-0ddd-47f8-a5ba-b505cb822085 /home ext4 defaults 0 1

Normally, the password for encrypted disks is requested on the console at system startup. However, the console will no longer be accessible as soon as the machine is placed in the rack. We therefore install dropbear-initramfs to be able to unlock the disks via SSH. Deviating from the usual procedure, we convert the existing OpenSSH host keys to Dropbear format and install them in the initramfs, so that we can use the normal SSH port 22 for unlocking without host key conflicts.

# /usr/lib/dropbear/dropbearconvert openssh dropbear \
/etc/ssh/ssh_host_ecdsa_key \
/etc/dropbear/initramfs/dropbear_ecdsa_host_key
# /usr/lib/dropbear/dropbearconvert openssh dropbear \
/etc/ssh/ssh_host_ed25519_key \
/etc/dropbear/initramfs/dropbear_ed25519_host_key
# /usr/lib/dropbear/dropbearconvert openssh dropbear \
/etc/ssh/ssh_host_rsa_key \
/etc/dropbear/initramfs/dropbear_rsa_host_key

Finally, the public keys of the administrators are entered in /etc/dropbear/initramfs/authorized_keys and the ramdisk is updated:

# update-initramfs -u
# reboot

Voilà, after a reboot, the disks can also be unlocked via SSH.

Involuntary rework

During the final system cleanup, we unfortunately overlook the fact that cryptsetup-initramfs is not marked as a manually installed package, so it is automatically removed. As a result, the system no longer boots because the root partition cannot be decrypted.

A full rescue system is hidden in the help menu of the Ubuntu installer. From there, we manually mount the installed filesystem and reinstall cryptsetup-initramfs in a chroot. Now the machine boots again.

For the next step, we mount the P40 GPUs in bays 1+2. Their setup and the measurement of computing performance are described in the second part.

Automated Security Testing: Playwright for Robust Web Security

2025-11-13T00:00:00-06:00

Introduction

With automated end-to-end tests, you can not only find bugs, but also regularly check if your software is compliant with security standards. Automation brings several advantages:

Automated security tests provide reliable verification that security features are working as intended.
They help keep security mechanisms stable during further development and detect unwanted regressions at an early stage.
Writing automated tests allows you to look at your software from the perspective of potential attackers.

In this article, we use concrete examples to show how Playwright can be used to reliably test security-relevant aspects such as Content Security Policy (CSP), clickjacking or Cross-Site Request Forgery (CSRF).

Approach: Playwright end-to-end security testing

We will now focus on reviewing selected security aspects using automated end-to-end tests. These tests can be implemented alongside the end-to-end tests for the application’s features and can run in the same pipeline as these “normal” tests. Their development therefore feels like developing tests for application features. Using Content Security Policy (CSP) as an example, we show how some aspects can be checked with Playwright. The CSP is sent in a header of the HTTP response that delivers the page, and it is configured during development of the frontend. To check the CSP, it therefore makes sense to open the page as part of a test and perform the checks there. Playwright is currently a widely used tool for end-to-end testing of web applications. By and large, the same approaches and methods can be used for security tests as for end-to-end tests of new features. In our tests for the CSP, we want to check various aspects.

Content Security Policy Review

The first aspect concerns simply opening the page under test. First, we want to make sure that the existing implementation does not violate the CSP. We therefore open the page and check that no warning appears in the browser’s console. With a small helper function, we can capture browser console error messages produced during our Playwright test and store them in an array. We simply pass the page and the target array to the function, and its implementation appends any console errors to that array as they occur.

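import type { Page } from "@playwright/test";

// Collects every console message of type "error" in the given array.
function logBrowserErrors(page: Page, errors: string[]) {
  page.on("console", (message) => {
    if (message.type() === "error") {
      errors.push(message.text());
    }
  });
}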

After opening the page under test, we can thus validate that no CSP warnings or other error messages were triggered on the page. The check can be done using Playwright’s expect function.

expect(errors).toHaveLength(0);

When Playwright opens the page, we also receive the response to this request, whose header contains the CSP attributes. We write these values to a so-called validation file, which is filled with the current CSP attributes when the test is run for the first time. These values must initially be reviewed critically against the expected values. If they deviate, the CSP must be adjusted until the values in the validation file match the expected ones.

Once the validation file has been approved, each subsequent run of the test, whether locally or in a pipeline, compares the contents of the file with the attributes currently received. If a deviation is detected, the test fails. In this way, all changes to the CSP are reliably detected. For planned changes to the CSP, the file can be adapted. In all other cases, we investigate why the CSP has changed and decide whether the change needs to be reverted or can be kept.

Here’s an example of what the contents of such a validation file look like:

{
  "cspHeaderValues": [
    "default-src 'self'",
    "connect-src 'self'",
    "script-src 'nonce-[NONCE]' 'strict-dynamic' 'wasm-unsafe-eval'",
    "style-src-elem 'self' 'nonce-[NONCE]'",
    "style-src-attr 'unsafe-inline'",
    "img-src 'self' blob: data:",
    "font-src 'self' data:",
    "object-src 'none'",
    "base-uri 'self'",
    "form-action 'self'",
    "frame-ancestors 'none'"
  ]
}

We have masked the nonce values in this file because they are regenerated in each run and therefore the test cannot test for a concrete nonce value.
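
The masking step itself is not shown in this article. A minimal sketch of how it could look, assuming the nonce is replaced with the fixed placeholder [NONCE] before the values are compared or written (maskNonces is a hypothetical helper name, not taken from the original test suite):

```typescript
// Hypothetical helper (not part of the original test suite): replaces the
// per-request nonce in every CSP source expression with a fixed placeholder.
const NONCE_PLACEHOLDER = "[NONCE]";

function maskNonces(cspValues: string[]): string[] {
  // A nonce source looks like 'nonce-<value>'; only the value is masked.
  return cspValues.map((value) =>
    value.replace(/'nonce-[^']+'/g, `'nonce-${NONCE_PLACEHOLDER}'`),
  );
}

// Example: two responses with different nonces yield the same masked values.
const firstRun = maskNonces(["script-src 'nonce-r4nd0mA=' 'strict-dynamic'"]);
const secondRun = maskNonces(["script-src 'nonce-Zm9vYmFy' 'strict-dynamic'"]);
console.log(firstRun[0] === secondRun[0]); // true
```

Because the replacement is idempotent, applying it to values that are already masked leaves them unchanged.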

async function validateCSPData(
  response: Response,
  page: Page,
) {
  const cspHeaderValues =
    (await response.allHeaders())["content-security-policy"] ?? "";
  if (cspHeaderValues === "") {
    throw new Error("CSP must not be empty.");
  }
  const hasMetaCSP = await checkMetaCSP(page);
  expect(hasMetaCSP).toBeFalsy();
  const snapshot: Record<string, string[]> = {};
  snapshot.cspHeaderValues = cspHeaderValues
    .split(/;\s*/)
    .filter((str) => str !== "");
  await compareActualWithValidationFile(snapshot);
}

The validateCSPData method shown above contains our implementation for validating the CSP attributes. All we have to do is pass it the page and the response to the page request. The method extracts the part of the response that concerns the CSP. In a first validation, we make sure the CSP is not empty. We then validate that there are no meta CSP attributes in the HTML part of the response: as our own standard, we have decided not to allow meta CSP attributes, and we check this here to avoid conflicts between the CSP in the header and in the meta attributes. At the end of the method, we format the CSP attributes and pass them to our method that compares the values with those in the validation file mentioned above.
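
The two helpers used by validateCSPData, checkMetaCSP and compareActualWithValidationFile, are not listed in the article. The following is only a sketch of how they might be implemented. The structural LocatorSource type, the optional filePath parameter and the default file name csp-validation.json are assumptions made so that the example runs without Playwright installed:

```typescript
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Minimal structural type so the sketch does not depend on Playwright itself;
// Playwright's Page object satisfies it.
interface LocatorSource {
  locator(selector: string): { count(): Promise<number> };
}

// Assumed implementation: true if the document contains a CSP <meta> element.
async function checkMetaCSP(page: LocatorSource): Promise<boolean> {
  const metaCount = await page
    .locator('meta[http-equiv="Content-Security-Policy"]')
    .count();
  return metaCount > 0;
}

// Assumed implementation: the first run creates the validation file,
// every later run compares against it and throws on any deviation.
async function compareActualWithValidationFile(
  snapshot: Record<string, string[]>,
  filePath = "csp-validation.json", // hypothetical default location
): Promise<void> {
  const actual = JSON.stringify(snapshot, null, 2);
  if (!existsSync(filePath)) {
    writeFileSync(filePath, actual, "utf-8");
    return;
  }
  const expected = readFileSync(filePath, "utf-8");
  if (actual !== expected) {
    throw new Error(`CSP differs from validation file ${filePath}`);
  }
}
```

Comparing the serialized JSON keeps the sketch short. A real implementation would probably report which values were added or removed, so that a failing pipeline run points directly at the changed part of the CSP.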

Check CSP Warning

In a further step, we manipulate the HTML part of our page to be checked in order to verify that the expected CSP warnings appear in the browser’s console. In our example we add the following line to the HTML body of the page:

<script src="https://bad.test/evil.js" async=""></script>

This manipulation simulates an attack via XSS (Cross-Site Scripting). In such an attack, “malicious code”, usually in the form of JavaScript, is injected into a website. If the code were executed, sensitive data could be stolen, for example. It is therefore important to check that injected code is never executed.

We manipulate the HTML body using the route method, which we apply to Playwright’s page object:

async function setupRouteWithModifiedBody(
  page: Page
) {
  await page.route(
    page.url(),
    async (route) => {
      const response = await route.fetch();
      let bodyForModification = await response.text();
      bodyForModification = bodyForModification.replace(
        "</body>",
        `<script src="https://bad.test/evil.js" async=""></script></body>`,
      );
      await route.fulfill({
        response,
        body: bodyForModification,
      });
    }
  );
}

In this method, we manipulate the request for the page under test. We apply the route method to the URL of the page and manipulate the HTML body in the process. The first parameter of the route method is the URL we want to manipulate. As the second parameter, we define the instructions that manipulate the body. To do this, we first use route.fetch to store the actual response to requests for the page in a variable. We then modify this response by adding a “bad” script at the end. Using route.fulfill, we instruct Playwright to return the manipulated body when the page is requested.

After the method has been called in the test, every request for the page is intercepted by Playwright and the HTML body of the response is replaced by the manipulated body.

In case the script is requested after all because of an insufficient CSP, we again use Playwright’s route method. This redirects the request for the script to a script that we have defined:

async function setupRouteForEvilScript(page: Page) {
  await page.route("https://bad.test/evil.js", async (route) => {
    const jsContent = `console.log("Hello world!");`;
    await route.fulfill({
      status: 200,
      contentType: "application/javascript",
      body: jsContent,
    });
  });
}

If the page with the manipulated body is called up during the test execution, a warning is issued in the console of the browser and the “evil” script is not loaded.

The screenshot was taken during the test execution and shows several violated CSP rules. These error messages are written to the array mentioned at the beginning. They are validated against a separate validation file, just like the CSP in the header of the HTTP response. If the error message changes or does not appear at all during a test run, the test fails, and the cause must be investigated.
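
Browser error messages for CSP violations usually quote the violated directive, which can include the nonce of the current request. Before such messages are compared with a validation file, they would therefore need to be normalized. A minimal sketch under that assumption follows; normalizeCspErrors is a hypothetical helper, and the sample message is only illustrative because the exact wording depends on browser and version:

```typescript
// Hypothetical helper: keeps only CSP-related console errors and masks the
// nonce, so the result is stable across test runs.
function normalizeCspErrors(errors: string[]): string[] {
  return errors
    .filter((text) => text.includes("Content Security Policy"))
    .map((text) => text.replace(/'nonce-[^']+'/g, "'nonce-[NONCE]'"));
}

// Illustrative messages only; the exact wording depends on browser and version.
const sampleErrors = [
  "Refused to load the script 'https://bad.test/evil.js' because it violates " +
    "the following Content Security Policy directive: " +
    "\"script-src 'nonce-abc123' 'strict-dynamic'\".",
  "Failed to load resource: net::ERR_FAILED",
];
console.log(normalizeCspErrors(sampleErrors).length); // 1
```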

Prevent clickjacking with CSP

CSP can also be used to prevent “malicious” websites from embedding our page via an iframe element, which is the basis of a so-called clickjacking attack. When our page is embedded, the malicious website overlays it, and neither the users nor we as the operator notice that functions on our page are being triggered unintentionally. To prevent this, “frame-ancestors 'none'” is added to the CSP, which causes any embedding attempt to fail. For our test, we created a minimal website that embeds our page in an iframe element. We again used the route method.

async function setupRouteForIframeSite(page: Page) {
  const body = `<!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <title>ClickJacking Test</title>
    </head>
    <body><iframe src="${page.url()}"></iframe></body>
    </html>`;
  await page.route("https://bad.test/clickjacking", (route) =>
    route.fulfill({
      contentType: "text/html;charset=utf-8",
      body,
    }),
  );
}

The method setupRouteForIframeSite ensures that when the URL “https://bad.test/clickjacking” is requested in the test, the page defined in the method is served. If the CSP is configured correctly, the iframe element does not load our page. In addition, an error message appears in the browser console.

This can be seen in the screenshot above. The error message also names the violated CSP directive “frame-ancestors 'none'”. This error message is written to a validation file as described above and checked each time the test runs.
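
In addition to the browser-level test, the header itself can be checked for the directive. The following sketch uses two hypothetical helpers, getDirective and forbidsFraming, which are not part of the original test suite. It checks for the directive explicitly because frame-ancestors does not fall back to default-src:

```typescript
// Hypothetical helper: returns the sources of one directive, or undefined if
// the directive is absent from the policy. The first occurrence wins.
function getDirective(cspHeader: string, name: string): string[] | undefined {
  for (const part of cspHeader.split(";")) {
    const [directive = "", ...sources] = part.trim().split(/\s+/);
    if (directive.toLowerCase() === name) {
      return sources;
    }
  }
  return undefined;
}

// True only if the policy contains exactly "frame-ancestors 'none'".
function forbidsFraming(cspHeader: string): boolean {
  const sources = getDirective(cspHeader, "frame-ancestors");
  return sources !== undefined && sources.length === 1 && sources[0] === "'none'";
}

console.log(forbidsFraming("default-src 'self'; frame-ancestors 'none'")); // true
console.log(forbidsFraming("default-src 'self'")); // false
```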

Test CSRF attack

Finally, we present a CSRF scenario that can be checked by means of end-to-end tests in Playwright. First, Playwright logs in to the software under test. For this test, we have created two minimal websites that send a request to the software under test when a link is clicked, which is not obvious to a user at first glance. For demonstration and test purposes, we used a state-changing GET request.

We test both a cross-origin and a same-site case.

The first website has a different domain from the page being tested. The second website has a subdomain of our page as a URL (as pictured above). As you can see, for the purposes of this test it has been kept very minimal and essentially only contains the malicious link. When Playwright clicks on the link in the test, we always check that an error message appears when calling up the link. In addition, we use Playwright’s route method to monitor the endpoint that is attacked by the malicious calls, in this case clicking on the link.

async function monitorAttackedEndpoint(
  page: Page,
) {
  await page.route(attackedEndpoint, async (route) => {
    const response = await route.fetch();
    expect(response.status()).toBe(403);

    await route.fulfill({ response: response });
  });
}

One way of preventing such an attack is the use of CSRF cookies. This prevents the endpoint from acting on the malicious request, as the malicious site does not have access to the CSRF cookies that must be sent along for a successful request. Our software returns an HTTP 403 status code when a CSRF attack is attempted. We check this using the method presented above.
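
The article does not say which CSRF mechanism the tested software uses. Purely as an illustration, the following sketch assumes the double-submit cookie pattern, in which the token from a cookie must also be sent in a request header. The cookie name csrf_token and both helper functions are hypothetical:

```typescript
// Hypothetical cookie name; real applications choose their own.
const CSRF_COOKIE = "csrf_token";

// Extracts one cookie value from a Cookie request header.
function readCookie(cookieHeader: string, name: string): string | undefined {
  for (const pair of cookieHeader.split(";")) {
    const [key = "", ...rest] = pair.trim().split("=");
    if (key === name) {
      return rest.join("=");
    }
  }
  return undefined;
}

// Returns the status a state-changing endpoint would answer with.
function csrfStatus(cookieHeader: string, headerToken: string | undefined): 200 | 403 {
  const cookieToken = readCookie(cookieHeader, CSRF_COOKIE);
  const valid =
    cookieToken !== undefined && cookieToken !== "" && cookieToken === headerToken;
  return valid ? 200 : 403;
}

// The browser attaches the cookie automatically, but a foreign page cannot
// read it and therefore cannot supply the matching header value.
console.log(csrfStatus("session=abc; csrf_token=t0k3n", "t0k3n")); // 200
console.log(csrfStatus("session=abc; csrf_token=t0k3n", undefined)); // 403
```

A production implementation would typically also compare the tokens in constant time and bind them to the user session.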

Final Reflections

In this article, we used some examples to show how security aspects of web applications, such as CSP or CSRF, can be tested automatically with end-to-end tests in Playwright. We showed how different aspects can be tested, such as the presence of the expected CSP in the HTTP response. The tests can be adapted to the needs of different web applications and can thus be used across projects. The tests presented are only a small excerpt of the security tests that can be automated. Other security aspects, such as access permissions or brute-force attacks, can also be tested automatically with end-to-end tests in Playwright.

Automated Security Testing: Playwright for Robust Web Security

2025-11-13T00:00:00-06:00

Introduction

Automated end-to-end tests can be used not only to find bugs, but also to check regularly that security measures are being adhered to. This has a number of advantages:

Automated security tests reliably verify that security features work as intended.
They help keep security mechanisms stable during further development and detect unwanted regressions at an early stage.
Writing automated tests means taking the perspective of potential attackers.

In this article, we use concrete examples to show how Playwright can be used to reliably test security-relevant aspects such as Content Security Policy (CSP), clickjacking or Cross-Site Request Forgery (CSRF).

Approach: Playwright end-to-end security testing

In this article, we focus on reviewing selected security aspects using automated end-to-end tests. These tests can be implemented alongside the end-to-end tests for the application’s features. They can run in the same pipeline as these “normal” tests. Their development therefore feels like developing tests for application features. Using Content Security Policy (CSP) as an example, we show how some aspects can be checked with Playwright. The CSP is sent in a header of the HTTP response that delivers the page. It is configured during development of the frontend. To check the CSP, it therefore makes sense to open the page as part of a test and perform the checks there. Playwright is currently a widely used tool for end-to-end testing of web applications. Here we look specifically at the particularities of testing the CSP with Playwright. By and large, the same approaches and methods can be used for security tests as for end-to-end tests of new features. In our tests for the CSP, we want to check various aspects.

Content Security Policy Review

The first aspect concerns simply opening the page under test. First, we want to make sure that the existing implementation does not violate the CSP. We therefore open the page and check that no warning appears in the browser’s console. With a small function, we can instruct Playwright to write the browser console error messages produced during the test to an array. To do this, we pass the page and the array to the function, and its implementation ensures that the error messages are written to our array.

// Collects every console message of type "error" in the given array.
function logBrowserErrors(page: Page, errors: string[]) {
  page.on("console", (message) => {
    if (message.type() === "error") {
      errors.push(message.text());
    }
  });
}

After opening the page under test, we can thus validate that no CSP warnings or other error messages were triggered on the page. The check can be done using Playwright’s expect function.

expect(errors).toHaveLength(0);

When Playwright opens the page, we also receive the response to this request, whose header contains the CSP attributes. We write these values to a so-called validation file, which is filled with the current CSP attributes when the test is run for the first time. These values must initially be reviewed critically against the expected values. If they deviate, the CSP must be adjusted until the values in the validation file match the expected ones.

Once the validation file has been approved, each subsequent run of the test, whether locally or in a pipeline, compares the contents of the file with the attributes currently received. If a deviation is detected, the test fails. In this way, all changes to the CSP are reliably detected. For planned changes to the CSP, the file can be adapted. In all other cases, we investigate why the CSP has changed and decide whether the change needs to be reverted or can be kept.

Here’s an example of what the contents of such a validation file look like:

{
  "cspHeaderValues": [
    "default-src 'self'",
    "connect-src 'self'",
    "script-src 'nonce-[NONCE]' 'strict-dynamic' 'wasm-unsafe-eval'",
    "style-src-elem 'self' 'nonce-[NONCE]'",
    "style-src-attr 'unsafe-inline'",
    "img-src 'self' blob: data:",
    "font-src 'self' data:",
    "object-src 'none'",
    "base-uri 'self'",
    "form-action 'self'",
    "frame-ancestors 'none'"
  ]
}

We have masked the nonce values in this file because they are regenerated in each run and therefore the test cannot test for a concrete nonce value.

async function validateCSPData(
  response: Response,
  page: Page,
) {
  const cspHeaderValues =
    (await response.allHeaders())["content-security-policy"] ?? "";
  if (cspHeaderValues === "") {
    throw new Error("CSP must not be empty.");
  }
  const hasMetaCSP = await checkMetaCSP(page);
  expect(hasMetaCSP).toBeFalsy();
  const snapshot: Record<string, string[]> = {};
  snapshot.cspHeaderValues = cspHeaderValues
    .split(/;\s*/)
    .filter((str) => str !== "");
  await compareActualWithValidationFile(snapshot);
}

The validateCSPData method shown above contains our implementation for validating the CSP attributes. All we have to do is pass it the page (page) and the response to the page request (response). The method extracts the part of the response that concerns the CSP. In a first validation, we make sure the CSP is not empty. We then run a further check and validate that there are no meta CSP attributes in the HTML part of the response. As our own standard, we have decided not to allow meta CSP attributes, and we check this here to avoid conflicts between the CSP in the header and in the meta attributes. At the end of the method, we format the CSP attributes and pass them to our method that compares the values with the file mentioned above.

Check CSP Warning

In a further step, we manipulate the HTML part of the page under test and verify that the expected CSP warnings appear in the browser’s console. One such manipulation, for example, consists of the following line, which we add to the HTML body of the page:

<script src="https://bad.test/evil.js" async=""></script>

This manipulation simulates an attack via XSS (Cross-Site Scripting). In such an attack, “malicious code”, usually in the form of JavaScript, is injected into a website. If the code were executed, sensitive data could be stolen, for example. It is therefore important to check that injected code is never executed.

We manipulate the HTML body using the route method, which we apply to Playwright’s page object:

async function setupRouteWithModifiedBody(
  page: Page
) {
  await page.route(
    page.url(),
    async (route) => {
      const response = await route.fetch();
      let bodyForModification = await response.text();
      bodyForModification = bodyForModification.replace(
        "</body>",
        `<script src="https://bad.test/evil.js" async=""></script></body>`,
      );
      await route.fulfill({
        response,
        body: bodyForModification,
      });
    }
  );
}

In this method, we manipulate the request for the page under test. We apply the route method to the URL of the page and modify the HTML body in the process. The first parameter of route is the URL whose requests we want to intercept. The second parameter is the handler that modifies the body. In it, we first use route.fetch to perform the actual request to the page under test and store the response in a variable. We then modify this response by appending an "evil" script at the end of the body. Finally, route.fulfill instructs Playwright to return the manipulated body when the page is requested.

Once this method has been called in the test, Playwright intercepts every request for the page and replaces the HTML body of the response with the manipulated one.
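
One detail of the string replacement is worth guarding against: if the response contains no closing body tag, replace silently returns the unchanged HTML and the test would exercise an unmodified page. A minimal sketch of a stricter variant could look like this; injectBeforeBodyEnd is a hypothetical helper that is not part of the original test code, and it assumes a lowercase closing body tag.

```typescript
// Hypothetical helper for illustration; not part of the original test code.
// Inserts a snippet directly before the last closing </body> tag and fails
// loudly if the tag is missing, instead of returning the HTML unchanged.
function injectBeforeBodyEnd(html: string, snippet: string): string {
  const marker = "</body>";
  const index = html.lastIndexOf(marker);
  if (index === -1) {
    throw new Error("No closing </body> tag found; nothing was injected.");
  }
  return html.slice(0, index) + snippet + html.slice(index);
}

const evilScript = `<script src="https://bad.test/evil.js" async=""></script>`;
const manipulatedHtml = injectBeforeBodyEnd(
  "<html><body><p>Hi</p></body></html>",
  evilScript,
);
console.log(manipulatedHtml);
```

Inside the route handler, the result of such a helper would take the place of bodyForModification before it is passed to route.fulfill.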

In case an insufficient CSP allows the script to be requested, we also use Playwright's route method here. It redirects the request for the script to a script that we define ourselves:

async function setupRouteForEvilScript(page: Page) {
  await page.route("https://bad.test/evil.js", async (route) => {
    const jsContent = `console.log("Hello world!");`;
    await route.fulfill({
      status: 200,
      contentType: "application/javascript",
      body: jsContent,
    });
  });
}

When the page with the manipulated body is loaded during the test run, a warning is printed to the browser console and the "evil" script is not loaded.

The screenshot taken during the test run shows several violated CSP rules. These error messages are written to the array mentioned at the beginning. Like the CSP in the header of the HTTP response, they are validated against a separate file. If the error message changes or disappears entirely during a test run, the test fails, and the cause and a fix have to be found.
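
Collecting the console messages into an array can be sketched roughly as follows. This is an illustration under stated assumptions, not the original implementation: the collector and its names are hypothetical, the message objects are reduced to the type() and text() methods that Playwright's console messages provide, and the filter relies on the wording Chromium uses for CSP violations.

```typescript
// Minimal structural type covering only what the collector needs.
// Playwright's ConsoleMessage provides type() and text(); everything else here is an assumption.
interface ConsoleMessageLike {
  type(): string;
  text(): string;
}

// Hypothetical collector: keeps only console errors that mention a CSP violation.
function createCspViolationCollector() {
  const violations: string[] = [];
  const onConsole = (message: ConsoleMessageLike): void => {
    // Assumption: Chromium-style wording containing "Content Security Policy".
    if (message.type() === "error" && message.text().includes("Content Security Policy")) {
      violations.push(message.text());
    }
  };
  return { violations, onConsole };
}

// In a Playwright test, the handler would be registered via page.on("console", collector.onConsole).
// Here we feed it fake messages so the sketch runs without a browser.
const collector = createCspViolationCollector();
const fakeMessage = (type: string, text: string): ConsoleMessageLike => ({
  type: () => type,
  text: () => text,
});
collector.onConsole(
  fakeMessage(
    "error",
    `Refused to load the script 'https://bad.test/evil.js' because it violates the following Content Security Policy directive: "script-src 'self'".`,
  ),
);
collector.onConsole(fakeMessage("log", "Hello world!"));
console.log(collector.violations);
```

The collected array can then be handed to the same file comparison that is used for the header values.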

Preventing clickjacking with CSP

The CSP can also prevent malicious websites from embedding our page in their own site via an iframe element, which is known as a clickjacking attack. When our page is embedded this way, the malicious website overlays it, and neither the users nor we as operators notice that functions on our page are being triggered unintentionally. To prevent this, the directive "frame-ancestors 'none'" is added to the CSP. It causes embedding on other websites to fail. For our test, we created a minimal website that contains an iframe element pointing to our page. We again used the route method for this.

async function setupRouteForIframeSite(page: Page) {
  // Minimal attacker page that tries to embed the page under test in an iframe.
  const body = `<!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <title>ClickJacking Test</title>
    </head>
    <body><iframe src="${page.url()}"></iframe></body>
    </html>`;
  await page.route("https://bad.test/clickjacking", (route) =>
    route.fulfill({
      contentType: "text/html;charset=utf-8",
      body,
    }),
  );
}

The setupRouteForIframeSite method ensures that when the test navigates to the URL "https://bad.test/clickjacking", the page defined in the method is served. If the CSP is configured correctly, the iframe element does not load our page. In addition, an error message is printed to the console.

This can be seen in the screenshot above. The error message also names the violated CSP directive "frame-ancestors 'none'". As described above, this error message is also written to a validation file and checked on every test run.
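
Independently of the browser behaviour, the header value itself can be checked for the directive. The helper below is a hypothetical sketch and not part of the original tests; it only recognizes the strict 'none' form. Note that browsers ignore frame-ancestors when a policy is delivered through a meta element, which is one more reason to rely on the header.

```typescript
// Hypothetical helper: true only if the policy contains exactly "frame-ancestors 'none'".
function blocksAllFraming(cspHeaderValue: string): boolean {
  return cspHeaderValue
    .split(";")
    .map((directive) => directive.trim().toLowerCase().split(/\s+/))
    .some(
      (tokens) =>
        tokens.length === 2 && tokens[0] === "frame-ancestors" && tokens[1] === "'none'",
    );
}

console.log(blocksAllFraming("default-src 'self'; frame-ancestors 'none'"));
console.log(blocksAllFraming("default-src 'self'"));
```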

Testing a CSRF attack

Finally, we present a CSRF scenario that can be checked with end-to-end tests in Playwright. In a first step, the Playwright test logs in to the software under test. For this test, we created two minimal websites that send a request to the software under test when a link is clicked. At first glance, this is not apparent to a user. For demonstration and testing purposes, we used a state-changing GET request.

We test both a cross-origin and a same-site case.

The first website has a domain different from that of the page under test. The second website uses a subdomain of the page under test as its URL. This page is shown above. As you can see, it is kept very minimal for the test and essentially contains only the malicious link. When Playwright clicks the link in the test, we verify in each case that the request to the software under test results in an error message. In addition, we use Playwright's route method to monitor the endpoint that is attacked by the malicious requests, in this case by clicking the link.

async function monitorAttackedEndpoint(
  page: Page,
) {
  await page.route(attackedEndpoint, async (route) => {
    const response = await route.fetch();
    expect(response.status()).toBe(403);

    await route.fulfill({ response: response });
  });
}

To prevent such an attack, CSRF cookies can be used, for example. The endpoint then refuses to answer the malicious request, because the malicious page has no access to the CSRF cookies that must be sent along for a request to succeed. In our software, an attempted CSRF attack results in an HTTP 403 status code. We check this with the method shown above.
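
The article does not say which CSRF scheme the software uses. As an illustration, the following sketch shows one common variant, the double-submit cookie pattern, in which the token from the cookie must be repeated in a request header. The request shape, the cookie name and the header name are assumptions made for this example only.

```typescript
// Simplified request shape for illustration only.
interface IncomingRequest {
  cookies: Record<string, string | undefined>;
  headers: Record<string, string | undefined>;
}

// Cookie and header names are assumptions, not taken from the tested software.
const CSRF_COOKIE = "XSRF-TOKEN";
const CSRF_HEADER = "x-xsrf-token";

// Double-submit check: the token from the cookie must also be present in a header.
// A foreign page can trigger a request that carries the cookie, but it cannot
// read the cookie value and therefore cannot set the matching header.
function csrfStatus(request: IncomingRequest): 200 | 403 {
  const cookieToken = request.cookies[CSRF_COOKIE];
  const headerToken = request.headers[CSRF_HEADER];
  const valid =
    cookieToken !== undefined && cookieToken !== "" && cookieToken === headerToken;
  return valid ? 200 : 403;
}

// A forged request carries the cookie but no matching header.
const forgedRequest: IncomingRequest = { cookies: { [CSRF_COOKIE]: "abc123" }, headers: {} };
// A legitimate request from our own frontend repeats the token in the header.
const legitimateRequest: IncomingRequest = {
  cookies: { [CSRF_COOKIE]: "abc123" },
  headers: { [CSRF_HEADER]: "abc123" },
};
console.log(csrfStatus(forgedRequest), csrfStatus(legitimateRequest));
```

A production implementation would typically also compare the tokens in constant time and bind them to the session; both are left out to keep the sketch short.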

Conclusion

We have used a few examples to show how security aspects of web applications, such as CSP and CSRF protection, can be tested automatically with end-to-end tests in Playwright. We showed in principle how several different aspects can be tested, for example the presence of the expected CSP in the HTTP response. The tests can be adapted to different web applications and can thus be used across projects. The tests presented here are only a small selection of the security tests that can be automated. Other security aspects, such as access permissions or brute-force attacks, can also be tested automatically with Playwright end-to-end tests.

Using OpenRewrite for large-scale refactoring

2025-10-23T00:00:00-05:00

Our Starting Position

What makes OpenRewrite so compelling is its automated nature. Migrating your code base between Java versions or upgrading a framework becomes a more relaxed task: You add the corresponding so-called “recipe”, execute rewriteRun, verify the code with your automated tests and then you’re done. Instead of replacing imports by hand or fighting with Gradle because of a rogue transitive dependency, you can take a coffee break while OpenRewrite works in the background.

An OpenRewrite recipe contains the logic to do a specific task, like replacing org.junit imports with their org.assertj equivalents. Due to the large user base and the open-source nature of most recipes, you can find recipes for everything from Spring Boot upgrades to switching from JUnit to AssertJ in minutes. In some cases, it might also be useful for enforcing code standards – much like an auto-formatter – where OpenRewrite can be integrated into the normal development pipeline, for example as a pre-commit hook.

How Does It Work?

There are “declarative” and “imperative” recipes which have different purposes. You can imagine declarative recipes like Lego. They are defined in a simple YAML file and typically consist of a list of existing recipes that should be executed together. Many of these recipes are available in OpenRewrite’s public repositories¹ and are designed for common tasks, such as dependency upgrades or framework migrations. For example, the AssertJ² recipe I mentioned earlier shows how an entire framework change can be automated with just a single declarative recipe.

Imperative recipes, on the other hand, are implemented in code. They define the actual logic that transforms your source code; in many cases by replacing old methods with new ones or changing an import. While there are many of these already available, OpenRewrite also provides a comprehensive Java API for writing your own recipes which we’ll explore in more detail next.

Lossless Semantic Tree and Visitor Pattern

OpenRewrite builds a Lossless Semantic Tree or LST³ when it is invoked. An LST, as its name suggests, is a much more detailed version of an AST (Abstract Syntax Tree). While the AST only contains the information necessary for evaluating the logical structure of the program, the LST includes whitespace information as well as a complete representation of the type relations. This means that once OpenRewrite has parsed a source file into an LST it can generate an exact replica from that LST alone. Because of this, local design abnormalities like an unusual indentation will be preserved as OpenRewrite doesn’t assume anything about your code styles. Additionally, because of the extensive type information, it can correctly identify the type of any given field. This is incredibly helpful if a recipe only wants to act on a very specific set of statements, for example for fixing a known vulnerability in a specific method from a package. OpenRewrite also uses this to verify that the new code uses existing types and doesn’t reference unavailable classes.

Once that LST is built, we get a chance to modify it. OpenRewrite is designed around the visitor pattern⁴ which allows us to define the behavior of a “visitor” which is moving along the LST. Different visitor types exist to balance how much you’re able to change vs. what can be validated by OpenRewrite. For example, a JavaIsoVisitor isn’t allowed to replace a method declaration with a field, however this is possible when using a JavaVisitor. We would do this by overriding visitX methods for all kinds of elements of a source file, such as class declarations, method declarations/invocations or conditionals. In each of these methods, we get some representation of that LST node in our code. These are immutable objects which contain the information present in the source file. We can use these when we want to change something for the current element, such as only renaming methods that start with “test”:

@Override
public J.MethodDeclaration visitMethodDeclaration(J.MethodDeclaration method, ExecutionContext executionContext) {
   if (method.getSimpleName().startsWith("test")) {
       // TODO: Rename this method
   }
   return super.visitMethodDeclaration(method, executionContext);
}

To allow more control over how the LST is traversed, OpenRewrite leaves it up to us to decide if and where we call super.visitX. OpenRewrite generally recommends starting any visitX method with the call to super. Omitting this call entirely will mean that the sub-tree is not traversed at all. This can be beneficial for improving performance; however, it isn’t needed in most cases. To further expand upon our example from above, let’s now change the method name. In OpenRewrite, the LST itself should not be mutated. Instead, we build a new “method object” that we then return from our method.

@Override
public J.MethodDeclaration visitMethodDeclaration(J.MethodDeclaration method, ExecutionContext executionContext) {
   String methodName = method.getSimpleName();

   if (methodName.startsWith("test")) {
       String newName = methodName.replaceFirst("test", "check");
       return method.withName(method.getName().withSimpleName(newName));
   }
   return super.visitMethodDeclaration(method, executionContext);
}

OpenRewrite detects that we returned an object different to what was passed into the method. It concludes that we must have changed something about the code and will store this new object in place of the old node in the LST. If you want to instead completely remove a statement, simply return null. In cases where you don’t want to do anything you should return super.visitX.

After the first visitor has traversed the whole LST, OpenRewrite will run another visitor through our recipe. If it detects any further changes, it will repeat this step, until no changes are made anymore. To make sure that changes from our recipe did not cause a “regression” in another active recipe, it will then re-run all other recipes in a similar pattern. Once that finishes it can confidently assert that all recipes have applied their logic to every single piece of code in the code base and every possible change has been made.
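
The idea of repeating cycles until nothing changes can be illustrated with a small conceptual model. The sketch below is deliberately written in TypeScript and is not the OpenRewrite API: the node types, the function names and the cycle cap are all assumptions made for illustration. It only mirrors the two ideas described above, namely that a change is detected by a different object being returned and that the visitor is re-run until the tree is stable.

```typescript
// Conceptual model only: this is NOT the OpenRewrite API.
interface MethodNode {
  readonly name: string;
}
interface ClassNode {
  readonly methods: readonly MethodNode[];
}

// Returns the same object when nothing changes and a new object otherwise,
// so that a change can be detected by comparing references.
function renameTestMethod(method: MethodNode): MethodNode {
  return method.name.startsWith("test")
    ? { name: method.name.replace("test", "check") }
    : method;
}

function visitClass(node: ClassNode): ClassNode {
  const methods = node.methods.map(renameTestMethod);
  const changed = methods.some((method, index) => method !== node.methods[index]);
  return changed ? { methods } : node;
}

// Re-runs the visitor until a cycle returns the identical tree.
// The cap on the number of cycles is an assumption of this sketch.
function runUntilStable(tree: ClassNode, maxCycles = 3): { tree: ClassNode; cycles: number } {
  let current = tree;
  for (let cycle = 1; cycle <= maxCycles; cycle++) {
    const next = visitClass(current);
    if (next === current) {
      return { tree: current, cycles: cycle };
    }
    current = next;
  }
  return { tree: current, cycles: maxCycles };
}

const stableResult = runUntilStable({ methods: [{ name: "testLogin" }, { name: "helper" }] });
console.log(stableResult.tree.methods.map((method) => method.name), stableResult.cycles);
```

In this model the first cycle renames testLogin, and the second cycle confirms that no further change is pending.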

Lessons learned

Because of the inherent complexity in this type of meta programming, a test-driven development approach is highly favorable. It allows you to effectively cover the many possible edge cases.

Something that OpenRewrite already warns about in their documentation is recipe state. Recipe state increases the risk of artifacts from previous data unexpectedly changing the behaviour of your recipe. This not only introduces bugs that are difficult to find and fix, it also massively increases the complexity of your recipe. In our above example this can’t be avoided entirely, since we not only need to rename method declarations but also adjust any calls to those methods. This means we need to pass the information about our new names to visitMethodInvocation so that we can adjust the method calls accordingly.

The first option we have is the cursor. While the Java API of OpenRewrite itself doesn’t expose explicit methods like enterClass and exitClass, the cursor keeps track of where exactly we currently are in a stack-like structure, hence the name. It is cleared between every single cycle of a recipe and is best suited for communicating between two methods inside a visitor that come after each other. This wouldn’t be suitable for our scenario since a method call may come from a completely different place in the code base. Another possible solution is to put our information into the execution context. It is only ever cleared after all recipes have run, so it is a much more persistent storage location. There are some limitations that you need to keep track of, however. The execution context does not allow mutating stored data, to avoid hard-to-debug problems that occur due to state conflicts. You also need to make sure that you don’t overwrite data from other recipes. The optimal way would be a ScanningRecipe⁵ visitor, where we first get the opportunity to scan the whole code base and collect information, after which a second visitor can apply changes.

Final Thoughts

With an extensive collection of open-source recipes and a fleshed-out Java API, OpenRewrite is a great way to approach code refactoring at a large scale. While the in-memory nature of the LST naturally will become a bottleneck for bigger projects, this problem is solved by Moderne’s custom solution with which it is possible to split the tree generation and store it more permanently. While OpenRewrite is primarily focused on Java and the surrounding ecosystem, it also offers recipes for YAML, XML, JSON and even a few other languages like C# or Scala (although in a much more limited capacity). Further code examples can be found in the cronn GitHub⁶.

Performance Testing with k6: A Field Report

2025-07-18T00:00:00-05:00

Project context

GA-Lotse is a modular web application for health authorities which is intended to simplify internal documentation and external communication with citizens. Different departments are mapped in modules, which then can be configured by the health authorities. To ensure that the application meets highest security standards, the data is stored separately for each module. This and other security features – such as the Zero Trust principle – lead to intrinsic performance losses, which is why performance testing was an important part of the project.

Selecting the load testing tool

It is often the case that you don’t have to implement everything yourself, so we looked for a tool which supports performance testing. Since we want to test a web application, the tool must allow browser testing. Our additional requirements were as follows:

The ability to write the test code in TypeScript, as we also use TypeScript for the frontend of the application and the end-to-end tests
Open-source availability of the tool
Executability on a self-hosted server (not a pure cloud solution)
Good reporting to visualize the results of the tests for us and the developers.

After evaluating several tools, we decided on k6. k6 supports browser tests, enables development in TypeScript and, in combination with Grafana and through individually definable metrics, offers comprehensive reporting.

Our setup

k6 runs the performance tests and generates some metrics, such as TTFB or the duration of the individual requests. However, in order to visualize these and other test results, we needed even more tools. We chose InfluxDB as the database, as it is optimized for storing time-series data. To visualize the results, we used Grafana dashboards because k6 belongs to Grafana and Grafana provides an interface to InfluxDB. To query the data from InfluxDB, we used InfluxDB’s own query language, Flux. However, this is not a long-term solution as Flux will probably no longer be supported – or only supported to a limited extent – in the next major version. We decided to use the tools locally and package them in Docker containers in order to be able to run the tests hardware-independently and not be dependent on cloud providers. Alternatively, there is the option of using Grafana Cloud k6 to avoid installing the tools locally.

Performance testing with k6

A test with k6 can be executed with a JavaScript or TypeScript file (see example script).

import { Options, Scenario } from "k6/options";
import { schoolEntryBrowserTest } from "@/modules/browser/schoolEntryBrowserTest";
import { schoolEntryApiTest } from "@/modules/api/schoolEntryApiTest";

const scenarios: Record<string, Scenario> = {
  schoolEntryBrowser: {
    exec: 'schoolEntryBrowserTestFunction',
    executor: 'constant-vus',
    vus: 3,
    duration: '15m',
    options: {
      browser: {
        type: 'chromium',
      }
    }
  },
  schoolEntryApi: {
    exec: 'schoolEntryApiTestFunction',
    executor: 'ramping-vus',
    startVUs: 1,
    stages: [
      { target: 3, duration: '5m' },
      { target: 5, duration: '5m' },
      { target: 3, duration: '5m' },
    ]
  }
};

export const options: Options = {
  discardResponseBodies: true,
  scenarios: scenarios,
  systemTags: ['status', 'url', 'check', 'scenario'],
  setupTimeout: '5m',
};

export async function schoolEntryBrowserTestFunction() {
  await schoolEntryBrowserTest();
}

export async function schoolEntryApiTestFunction() {
  await schoolEntryApiTest();
}

This script defines options for the test and the test functions to be executed. The options are defined as JSON. An important option which determines the course of the test is scenarios. This is where executable scenarios can be defined, thus mapping the actual test.

To define a scenario one must define a function to be executed, as well as the number of executing parallel users, which in k6 are called Virtual Users (VU). The total duration of the scenario can be determined by specifying time periods. In addition, ramps can be defined to increase or decrease the number of parallel users during the test. Another way to influence the course of the test is to set a time interval in which a specific number of VUs should go through the scenario.
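
To get a feeling for how long a ramping scenario runs and how many VUs it reaches at its peak, the stages can be evaluated with a few lines of code. This is a minimal sketch that runs without k6: the Stage type is a local stand-in for the k6 option shape, the helper names are hypothetical, and only durations made up of hours, minutes and seconds are supported. It also ignores any graceful stop or ramp-down time that k6 may add.

```typescript
// Local stand-in for the stage shape used in the script above, so the sketch runs without k6.
interface Stage {
  target: number;
  duration: string;
}

function unitToSeconds(unit: string): number {
  if (unit === "h") return 3600;
  if (unit === "m") return 60;
  return 1; // "s"
}

// Parses simple duration strings such as "30s", "5m" or "1h30m" into seconds.
function durationToSeconds(duration: string): number {
  const pattern = /(\d+)(h|m|s)/g;
  let total = 0;
  let matchedLength = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(duration)) !== null) {
    total += Number(match[1]) * unitToSeconds(match[2]);
    matchedLength += match[0].length;
  }
  // Reject anything that is not fully covered by the supported units.
  if (matchedLength === 0 || matchedLength !== duration.length) {
    throw new Error(`Unsupported duration: ${duration}`);
  }
  return total;
}

function totalStageSeconds(stages: Stage[]): number {
  return stages.reduce((sum, stage) => sum + durationToSeconds(stage.duration), 0);
}

function peakVus(stages: Stage[]): number {
  return Math.max(...stages.map((stage) => stage.target));
}

// The stages of the schoolEntryApi scenario from the example script.
const apiStages: Stage[] = [
  { target: 3, duration: "5m" },
  { target: 5, duration: "5m" },
  { target: 3, duration: "5m" },
];
console.log(totalStageSeconds(apiStages), peakVus(apiStages));
```

For the example script, the three stages add up to 900 seconds, which matches the 15 minutes of the browser scenario, with a peak of 5 VUs.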

Several such scenarios can be defined for a test, which are then run using different configurations. To make this definition of the scenarios easier and faster than editing a long JSON file, we have developed a builder that dynamically creates the scenario configuration and makes it available on GitHub: https://github.com/cronn/k6-scenario-builder.

Our findings

During testing, we noticed a few things which need to be taken into account. First of all, it makes sense to have a dedicated machine available to run the tests. Since performance is not only affected by the load of many simultaneous users, but also by the amount of data in the database, we created both short spike tests as well as test scenarios that have a runtime of several hours in order to constantly increase the amount of data and simulate a kind of time-lapse of the actual use of the application. These tests can be carried out much more comfortably by an external machine than on your own laptop.

In addition, the execution of a test requires sufficient resources on the executing machine. Therefore, care should be taken to ensure that there are always free resources available during the execution of a test so as not to unintentionally influence the results. We noticed this when running browser tests with several VUs. Too many browsers open at the same time turned the machine into a bottleneck. Our solution is to define, alongside the browser tests, scenarios that depict the same user journey as closely as possible but send the necessary requests directly to the backend, which increases the load on the backend without going through a browser. Such API scenarios are also well suited to quickly assemble a scenario and thus get an overview of the backend’s performance.

Another insight we gained was to test in an environment which was as close to production as possible. After all, the configuration of an environment, especially a complex microservice cluster, can have a significant impact on performance. In addition to running the tests from another machine and testing on a production-like environment, it was still important for us to enable testing entirely on our own laptop. This allows developers to independently develop new scenarios and provides easy access to databases and logs.

It also happened that the configuration of our scenarios caused us to exceed domain-specific limits, especially during long tests. For example, we created an unrealistic number of appointments for one day or user, or even had too many users with the same permissions. Many different parameters can influence performance and should therefore be defined as early as possible, allowing us to avoid uninformative test runs. Nevertheless, it was also important for us to deliberately exceed the known limits to test how the application reacts and then improve it where necessary. After all, the customer may not know their domain limits, or the limits might be exceeded through technical errors. The application should not become unusable because the user booked one appointment too many. One lesson learned was therefore to clarify domain limits at an early stage and to take them into account in the tests.

Pros and Cons of k6

We ran into problems from time to time during testing with k6. A significant limitation of developing performance tests with k6 is the lack of a debugger. k6 uses its own JavaScript engine to execute the test code, and there is no built-in debugger. The JavaScript engine also has other weaknesses which you should be aware of; for example, it does not support the popular fetch API. In the context of browser tests, methods such as goto() are a weakness, as they do not always work reliably in combination with Chromium, which occasionally leads to timing problems. In addition, locators must be identified via XPaths, which are very susceptible to regressions and often long and unsightly. Finally, the k6 documentation is often rather brief.

However, k6 also has many advantages. The reporting in combination with InfluxDB and Grafana works very well. Meaningful plots can be quickly created in such a setup without much prior knowledge and then be displayed in a dashboard so that the test results can be analyzed and communicated. In addition, the parallel execution of different scenarios, each of which is also executed with parallel virtual users, works very well. It allows you to create complex scenarios which map different types of performance tests, such as load tests, spike tests, and soak tests. The fact that the test options (and especially the scenarios) are described in JSON is an advantage as it provides a smooth transition to the TypeScript code. You also have the option of running the browser tests in headful mode, so that problems can be detected and fixed during execution.

Summary

Since we had constantly developed both our tests and setup during the test phase, an iterative approach paid off for us. We started with two simple scenarios for application-critical modules. In these initial scenarios, we realized that we needed more metrics and plots in our reports to analyze the results. Iteratively, we then added metrics to our tests and visualized them in the Grafana board. These metrics included information such as the duration of requests, the loading times of certain pages, or even the CPU and RAM usage of the executing machine. The duration of individual requests was particularly important for us, but which information is relevant depends on the application. Metric types built into k6 allow the collection of information to be flexibly designed. Working with k6 has shown us both strengths and weaknesses of the tool. Whether k6 is the best choice certainly depends on the use case, but for us it was a suitable tool despite some significant weaknesses.

Performance Testing with k6: A Field Report

2025-07-18T00:00:00-05:00

Project context

GA-Lotse (Gesundheitsamt-Lotse) is a modular web application for public health authorities that is intended to simplify internal documentation and external communication with citizens. The various departments of a health authority are mapped to modules, which can be configured for each authority. To ensure that the application meets the highest security standards, the data is stored separately for each module. This and other security features, such as the zero-trust principle, lead to intrinsic performance losses, which is why performance testing was an important part of the project.

Selecting the load testing tool

As is so often the case, you do not have to implement everything yourself, so we looked for a tool that supports performance testing. Since we want to test a web application, it should allow browser tests. Our main requirements were also the following:

The ability to write the test code in TypeScript, as we also use TypeScript for the frontend of the application and the end-to-end tests
Open-source availability of the tool
Executability on a self-hosted server (not a pure cloud solution)
Good reporting to visualize the results of the tests for us and the developers.

After evaluating several tools, we decided on k6. k6 supports browser tests, allows development in TypeScript and, in combination with Grafana and through individually definable metrics, offers comprehensive reporting.

Our setup

k6 runs the performance tests and already produces some metrics along the way, such as TTFB or the duration of individual requests. To persist and visualize these and other test results, we needed additional tools.

We chose InfluxDB as the database because it is optimized for storing time-series data. To visualize the results, we used Grafana dashboards, partly because k6 belongs to Grafana and Grafana offers an interface to InfluxDB. To query the data from InfluxDB, we used InfluxDB's own query language, Flux. However, Flux will probably no longer be supported, or only to a limited extent, in the next major version, v3.

We decided to run the tools locally and package them in Docker containers so that we can run the tests independently of the hardware and without depending on cloud providers. Alternatively, Grafana Cloud k6 can be used to avoid installing the tools locally.

Performance testing with k6

A k6 test can be run from a JavaScript or TypeScript file (see the example script).

import { Options, Scenario } from "k6/options";
import { schoolEntryBrowserTest } from "@/modules/browser/schoolEntryBrowserTest";
import { schoolEntryApiTest } from "@/modules/api/schoolEntryApiTest";

const scenarios: Record<string, Scenario> = {
  schoolEntryBrowser: {
    exec: 'schoolEntryBrowserTestFunction',
    executor: 'constant-vus',
    vus: 3,
    duration: '15m',
    options: {
      browser: {
        type: 'chromium',
      }
    }
  },
  schoolEntryApi: {
    exec: 'schoolEntryApiTestFunction',
    executor: 'ramping-vus',
    startVUs: 1,
    stages: [
      { target: 3, duration: '5m' },
      { target: 5, duration: '5m' },
      { target: 3, duration: '5m' },
    ]
  }
};

export const options: Options = {
  discardResponseBodies: true,
  scenarios: scenarios,
  systemTags: ['status', 'url', 'check', 'scenario'],
  setupTimeout: '5m',
};

export async function schoolEntryBrowserTestFunction() {
  await schoolEntryBrowserTest();
}

export async function schoolEntryApiTestFunction() {
  await schoolEntryApiTest();
}

In diesem Skript werden Optionen für den Test sowie die auszuführenden Testfunktionen definiert. Die Optionen werden als JSON definiert. Eine wichtige Option, die den Testverlauf bestimmt, ist scenarios. Dort können Szenarien definiert werden, die ausgeführt werden und somit den eigentlichen Test abbilden.

Für ein solches Szenario wird eine auszuführende Funktion, sowie die Anzahl an ausführenden parallelen Nutzern, die in k6 Virtual User (VU) genannt werden, definiert. Mit der Angabe von Zeiträumen kann die Gesamtdauer des Szenarios bestimmt werden. Außerdem können Rampen definiert werden, um die Anzahl der parallelen User während des Tests zu erhöhen oder zu verringern. Eine andere Möglichkeit den Testverlauf zu beeinflussen, ist, ein Zeitintervall festzulegen, in dem eine konkrete Anzahl an VUs das Szenario durchlaufen sollen.

Für einen Test können mehrere solcher Szenarien definiert werden, die mit unterschiedlichen Konfigurationen durchlaufen werden. Um diese Definition der Szenarien einfacher und schneller zu gestalten als ein langes JSON-File zu editieren, haben wir einen Builder entwickelt, der die Szenario-Konfiguration dynamisch erstellt und diesen auf GitHub zur Verfügung gestellt: https://github.com/cronn/k6-scenario-builder.

Our Findings

During testing we noticed a few things that we think are worth keeping in mind. First, it makes sense to have a dedicated machine available to run the tests. Performance is affected not only by the load of many concurrent users but also by the amount of data in the database. Besides short spike tests, we therefore created test scenarios that run for several hours, steadily increasing the amount of data and simulating a kind of time-lapse of the application's actual usage. Such tests are much more convenient to run from an external machine than from your own laptop.

Running a test also requires sufficient resources on the executing machine. Make sure that free resources remain available throughout a test run so that the results are not unintentionally distorted. We noticed this when running browser tests with several VUs: too many simultaneously open browsers turned the executing machine into the bottleneck. Our solution is to define, alongside the browser tests, scenarios that follow the same user journey as closely as possible but send the necessary requests directly to the backend, increasing the backend load independently of the browser. Such API scenarios are also a good way to put a scenario together quickly and get a browser-independent overview of backend performance.

Another finding was to test in an environment that is as close to production as possible, because the configuration of an environment, especially a complex microservice cluster, can affect performance considerably. Alongside running tests from a separate machine and against a production-like environment, it was still important to us that tests can also run entirely on a developer's own laptop. This lets developers build new scenarios independently and gives them easy access to databases and logs.

Occasionally the configuration of our scenarios, especially in long tests, caused us to exceed business limits. For example, we created unrealistically many appointments for a single day or user, or even had too many users with the same permissions. Many quantities can influence performance and should therefore be pinned down as early as possible, which avoids test runs that say little. Even so, it was important to us to exceed the known limits deliberately in order to test how the application reacts and, where necessary, improve it. After all, there is no guarantee that the customer knows their business limits, or that technical errors will not cause them to be exceeded. One appointment too many should not make the application unusable. One lesson for us was therefore to clarify business limits early and take them into account in the tests.

Pros and Cons of k6

While testing with k6 we ran into problems every now and then. A considerable limitation when developing performance tests with k6 is the lack of a debugger. k6 uses its own JavaScript engine to execute the test code, and there is no debugger for it. The JavaScript engine has other weaknesses you should be aware of. For example, it does not support the widely used Fetch API. As for browser tests, one weakness of k6 is that methods such as goto(), which are supposed to wait until a page has loaded, do not always work reliably with Chromium, which occasionally leads to timing problems. In addition, locators have to be identified via XPaths, which is very prone to regressions and often long and unwieldy. Finally, the k6 documentation is often rather terse.

A few other things turned out to be advantages of k6. Reporting in combination with InfluxDB and Grafana worked very well, as we had hoped. With this setup, meaningful plots can be created quickly and without much prior knowledge and shown in a dashboard, so that test results can be analysed and communicated. Running different scenarios in parallel, each in turn executed by parallel virtual users, also works very well. This makes it possible to build complex scenarios that cover different kinds of performance tests such as load tests, spike tests and soak tests. Describing the test options, and the scenarios in particular, as JSON is very pleasant because it blends smoothly into the TypeScript code. Browser tests can also be run in headful mode, so that problems can be spotted during execution and fixed.

Summary

Because we continuously developed our tests and our setup during the test phase, an iterative approach paid off for us. We started with two simple scenarios for modules that are among the most important in the application. With these first scenarios we found that we needed further metrics and plots in our reports to be able to analyse the results. We then iteratively added metrics to our tests and visualised them in the Grafana board. These included the duration of requests, the load times of particular pages, and the CPU and RAM usage of the executing machine. For us, the duration of individual requests mattered most; which information is relevant depends on the application. The metric types built into k6 make it possible to collect information flexibly.

Working with k6 showed us both strengths and weaknesses of the tool. Whether k6 is a good fit certainly depends on the use case; for us it was a suitable tool despite some significant weaknesses.

Analyzing Business Reports with LLMs – Part 2

2025-06-24T00:00:00-05:00

Welcome back to our series on analysing annual reports with AI. In Part One we showed how the extraction of key figures from annual reports with LLMs (such as ChatGPT) works. Now we are going deeper and showing the final working solution, which we are using in cooperation with North Data.

We have already demonstrated how relevant information can be filtered out of the dense text of annual reports in a structured way. But if you want to scale this process in practice, you quickly reach its limits – be it in terms of accuracy across many different documents, the robust processing of complex layouts and tables, or the cost-effectiveness of large-scale analysis.

This is exactly where there have been many exciting developments. With Gemini Flash from Google, a model is available which reshuffles the cards for automated document analysis in terms of speed, contextual understanding, and the delivery of structured data.¹ In this second part, we ask: what makes Gemini Flash so much more powerful for this specific task than previous approaches or classic OCR pipelines? How does it enable the step from feasibility study to production tool? Let us look under the hood.

Gemini extracts structured JSON code from PDFs.

The classic approach: OCR as the basis, but not the whole solution

Before we dive into Gemini’s capabilities, it is worth looking at the traditional way of extracting data from PDFs. This most commonly starts with Optical Character Recognition (OCR). OCR tools generate text from scanned documents or image-only PDFs by converting pixels into letters. The result is not only the raw text content, but often also its position on the page, usually in the form of coordinates or so-called bounding boxes for each recognized word or line.

OCR Bounding Boxes from Azure Document Intelligence.

However, for a meaningful analysis we need structured data, not continuous text. This is where the challenges begin.

The first hurdle lies in recognizing structure in the plain text output. How do you automatically identify tables, related key-value pairs (such as “revenue: €10 million”) or semantically meaningful blocks? This often requires complex downstream steps, whether purpose-built parsers, rule-based systems that look for specific patterns, or even separate machine learning models trained on tasks such as table recognition.

However, these downstream systems are often susceptible to layout changes. Small adjustments in the design of a report from one year to the next or the format differing between companies can throw off painstakingly created rules or parsers and make them unusable.

In addition, there is a lack of contextual understanding. OCR provides the text but does not understand its meaning. Recognizing that the term “Total Assets” on page 10 refers to the same metric as a detailed breakdown in a table on page 45 is beyond the capabilities of pure text recognition.

All these factors create complexity and thus lead to high development and maintenance effort. In short, OCR is a valuable tool, but for the extraction of structured data it is usually only the first step in a complex and often fragile processing chain.

Our path to productive use: evaluation, model selection and integration

The leap from successful demonstration (as shown in Part 1²) to a reliable, scalable production system required a systematic approach and further developments in several areas.

Firstly, a solid evaluation was essential. To this end we manually curated a dataset of 100 representative English annual reports. For the most important key figures, the correct values (ground truth) were annotated by hand and collected in a table. Only with such a reliable basis can the quality of different models and approaches be objectively measured and tracked over time.
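A comparison against such a ground-truth table can be sketched in a few lines. The following is only an illustration under our own assumptions: the data layout (report id mapped to field values), the 1% relative tolerance for numbers, the case-insensitive text comparison and the two sample reports are all made up for this example and do not describe the actual evaluation code.

```python
from collections import defaultdict


def values_match(expected, predicted, rel_tol=0.01):
    """Numbers match within a relative tolerance; other values must be equal."""
    if expected is None or predicted is None:
        return expected is predicted
    if isinstance(expected, (int, float)) and isinstance(predicted, (int, float)):
        return abs(expected - predicted) <= rel_tol * abs(expected)
    return str(expected).strip().lower() == str(predicted).strip().lower()


def accuracy_by_field(ground_truth, predictions):
    """Both arguments map report id -> {field name: value}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for report_id, expected_fields in ground_truth.items():
        predicted_fields = predictions.get(report_id, {})
        for field, expected in expected_fields.items():
            totals[field] += 1
            hits[field] += values_match(expected, predicted_fields.get(field))
    per_field = {field: hits[field] / totals[field] for field in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_field, overall


# Two made-up reports, purely for illustration.
ground_truth = {
    "report_a": {"revenue": 1_000_000.0, "auditorName": "Example LLP"},
    "report_b": {"revenue": 250_000.0, "auditorName": "Sample & Co"},
}
predictions = {
    "report_a": {"revenue": 1_000_000.0, "auditorName": "example llp"},
    "report_b": {"revenue": 250.0, "auditorName": "Sample & Co"},  # scale error
}
per_field, overall = accuracy_by_field(ground_truth, predictions)
print(per_field, overall)
```

Keeping the per-field breakdown next to the overall score makes it possible to see which key figures a model change actually improves.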

Secondly, we significantly expanded the scope of extraction. Instead of just a few key figures, the goal was now to reliably extract a wide range of over 20 relevant values per report. This includes, among other things, the wage costs, information on profit and loss, cash flow, but also data such as the average number of employees or the name of the auditor.

These more demanding goals led us to test different models. In the end, the choice fell on Gemini 2.0 Flash Lite: This model optimally combined all the decisive factors for our application.

LLM comparison based on the parameters "intelligence" and "price", via artificialanalysis.ai.

Quality & Speed: In our tests, Gemini 2.0 Flash Lite showed high accuracy for most of the targeted metrics, often keeping up with larger, more expensive models. Google itself positions the Flash models as optimized for tasks that require high speed and efficiency without sacrificing quality ³. Our experience confirms that the model lives up to the “Flash” in its name in terms of processing speed.

Cost: A decisive factor for large-scale deployment is cost. Gemini 2.0 Flash Lite is significantly cheaper than the larger Pro models. Compared to older models like gpt-3.5-turbo-16k, which still cost about $3 per million input tokens in July 2023 ⁴, the Gemini Flash variant we used is cheaper by a factor of 40 ⁵! This makes the processing of thousands of reports economically viable.

Multimodality & Context: A significant advantage over plain text models or classic OCR pipelines is Gemini’s multimodality. Put simply, instead of just delivering the raw text and its coordinates (like traditional OCR), Gemini Flash can “read” the text and “see” the page layout at the same time. It “understands” how text is arranged in columns or tables, recognizes headings, and can interpret images or charts in the document. As a result, it is better at capturing context which the pure text order often does not convey. This is a great advantage, especially with the complex and varied layouts of annual reports. Coupled with the long context window, which allows the analysis of large document sections in one go, this is a decisive step forward.

This combination of good quality, high speed, low cost, and the ability to understand documents holistically made Gemini 2.0 Flash Lite a viable choice for our productive deployment in collaboration with North Data.

Gemini Flash in Action: The Workflow with Structured Outputs

The core of our approach combines the strengths of Gemini with pragmatic solutions to deal with the peculiarities of large documents.

A central problem with annual reports is that they often comprise hundreds of pages. While handing over the entire document to Gemini would be ideal for context, it is too expensive for mass use. To get around this problem, we have developed a multi-step approach: First, we still rely on proven OCR technology to extract the plain text of the entire document. This raw text then serves as the basis for a quick preliminary analysis using keywords. We look for terms and phrases that typically indicate relevant sections, such as “Consolidated Balance Sheet”, “Income Statement” or “Notes to the Financial Statements”.

Based on this analysis we then select the up to 100 pages that are most likely to contain the financial ratios we are looking for. Only this selection is then passed on to Gemini Flash Lite as a PDF context. This trick not only significantly reduces processing costs but also helps to focus the model on the important parts of the document and minimize the “noise” of irrelevant pages.
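The pre-selection step can be sketched roughly as follows. This is a simplified illustration, not the production code: the keyword list only contains the three phrases mentioned above, the scoring (a plain count of keyword occurrences per page) is an assumption, and the function name is hypothetical. Writing the selected pages into a new PDF is left out.

```python
KEYWORDS = (
    "consolidated balance sheet",
    "income statement",
    "notes to the financial statements",
)


def select_pages(page_texts, keywords=KEYWORDS, max_pages=100):
    """Return zero-based indices of the highest-scoring pages, in document order."""
    scores = []
    for index, text in enumerate(page_texts):
        lowered = text.lower()
        score = sum(lowered.count(keyword) for keyword in keywords)
        if score > 0:
            scores.append((score, index))
    # Highest score first; ties are broken in favour of earlier pages.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return sorted(index for _, index in scores[:max_pages])


# OCR text per page of a made-up five-page report.
pages = [
    "Chairman's statement and strategic outlook",
    "Consolidated Balance Sheet as at 31 December",
    "Income Statement for the year",
    "Photographs of our offices",
    "Notes to the Financial Statements: consolidated balance sheet items",
]
print(select_pages(pages, max_pages=2))
```

Returning the indices in document order keeps the selected pages in their original sequence when they are handed to the model.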

After isolating the relevant pages, we instruct Gemini to extract the data from them into a predefined format. Another building block for precise results is the use of so-called structured outputs: Gemini can not only generate text but can also directly return structured JSON data that follows a predefined schema.

To do this, we define a clear target schema in advance, which specifies exactly which data fields we expect and in which format (such as “number”, “text”, “currency symbol”). In Python, we like to use Pydantic for easy definition and validation. We explicitly give this structure to the model as an instruction. This is not only practical for automated further processing, but also measurably improves quality: in our tests, this step alone improved the evaluation result by around 4%.

Here is a simplified Python example to illustrate the principle with the google-genai library and structured outputs:

import os

from google import genai
from google.genai import types
from pydantic import BaseModel, Field


# Read the API key from the environment instead of hard-coding it
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])


# Define the desired output structure using Pydantic
class FinancialData(BaseModel):
    revenue: float | None = Field(
        description="Total revenue reported for the fiscal year."
    )
    net_income: float | None = Field(description="Net income or profit after tax.")
    total_assets: float | None = Field(description="Total assets value.")
    fiscal_year: int | None = Field(description="The ending year of the fiscal period.")
    currency_symbol: str | None = Field(
        description="Currency symbol used for major values (e.g., $, £, €)."
    )


# Upload the relevant PDF pages (assuming 'selected_report_pages.pdf' was created by pre-filtering)
pdf_file = client.files.upload(file="selected_report_pages.pdf")

prompt = """
Please analyze the provided pages from the annual report PDF.
Extract the following financial figures for the main consolidated entity reported:
- Total Revenue
- Net Income (Profit after tax)
- Total Assets
- The Fiscal Year End
- The primary Currency Symbol used for the main financial figures (£, $, € etc.)

Return the data strictly adhering to the provided 'FinancialData' schema.
If a value cannot be found or determined confidently, leave the corresponding field null.
Pay close attention to units (e.g., thousands, millions).
"""

try:
    response = client.models.generate_content(
        model="gemini-2.0-flash-lite-001",
        contents=[prompt, pdf_file],
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=FinancialData,
        ),
    )
    extracted_data = FinancialData.model_validate_json(response.text)
    print(extracted_data)

except Exception as e:
    print(f"\nAn error occurred: {e}")

finally:
    client.files.delete(name=pdf_file.name)

A look at the numbers: How well does it really work?

To objectively assess the actual performance of our approach with Gemini Flash, we created a dataset of 100 manually annotated business reports. This serves as ground truth against which we check the extraction results of the model.

The overall accuracy of our approach across all metrics and reports was 83.5%. These were the initial figures for the solution we integrated at North Data. This is a solid basis which demonstrates that the approach works. However, it gets more interesting when you look at the accuracy for individual metrics:

Key figure (parameters)	Accuracy
Overall	83.5%
capital	96.0%
cash	95.0%
employees	95.0%
revenue	95.0%
equity	98.0%
currencySymbol	99.0%
auditorName	89.0%
materials	89.0%
…	…
liabilities (creditors)	75.0%
currentAssets	64.0%
realEstate	60.0%
receivables	52.0%
tax	41.0%

What does this table tell us and what are the current hurdles?

The results paint a clear picture: The model achieves remarkably high accuracy values for clearly defined master data or values, which are often prominently and relatively uniformly shown in annual reports. These include, for example, capital, equity, employees, cash or the currency symbol. Fortunately, hallucinations – for example inventing numbers that do not exist in the document – were not a significant problem in our tests. If errors occurred, it was usually due to misinterpretations of existing figures and not to their free invention.

It becomes more difficult for the model with more complex key figures. This is where the limitations of the current approach become apparent, especially when it comes to semantic fuzziness and varying levels of detail. Many balance sheet items can be defined, named, or broken down differently in reports. Terms such as “total assets” are not always clear: does it mean the balance sheet total before or after deduction of certain items such as goodwill, i.e. intangible value?

The exact definition of current assets, receivables or liabilities varies between companies and reporting standards. This is where the model sometimes reaches its limits in deducing the exact definition valid in the respective report from the immediate context alone.

The dependence on layouts and the placement of information also plays a role. Some assets, such as realEstate (real estate assets), are often not shown prominently on the main pages of the balance sheet but are buried in the details of the “Notes to the Financial Statements” (Appendix). Mapping such information correctly across different pages and layouts is a heavy challenge for the model and results in lower accuracy scores.

Finally, some metrics require more complex interpretations or implicit calculations. The extraction of values such as tax is a good example of this. Different types of taxes (income taxes, sales taxes, etc.) and deferred taxes can often be spread over several sections. The correct aggregation and interpretation of this information is challenging, which explains the current accuracy of only 41% for this metric.

These quantitative results confirm our qualitative observations: the model is excellent at finding clearly labelled information. However, it reaches its limits when dealing with issues such as ambiguities in wording, widely varying or complex layouts, and the need to understand implicit knowledge or contexts across multiple text passages.

Another important aspect is the varying accuracy between different companies. The standard deviation of accuracy per company is about 9.2%. It is particularly striking that the accuracy for the large, individually designed reports from listed companies (PLCs) such as AstraZeneca (50%), Barclays (65%), HSBC (50%), Shell (70%) or Unilever (55%) tends to be significantly lower than average. Tests with excerpts of different lengths showed that the length of the context is not a major difficulty for Gemini. We therefore assume that the uniqueness of these groups' reporting structures is what makes them particularly challenging for the model. While Gemini Flash Lite handles the layouts that smaller companies often create with off-the-shelf software well, these complex cases are a bigger hurdle. One explanation could be that reports deviating from the standard rarely made it into Gemini’s training data.

Another recurring problem is the correct capture of units and scales. Missing or misinterpreting information such as “in thousands of £” or “millions of USD” will result in extracted values that are wrong by factors of 1,000 or 1,000,000. Here, robust downstream validation rules and targeted prompting are necessary to sensitize the model to these details.
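A downstream validation rule of this kind could look roughly like the sketch below. It is an assumption-laden illustration: the function names, the small set of recognised scale words and the idea of comparing a value against a reference figure (for example last year's value) are ours and not a description of the rules actually used in production.

```python
import re

SCALE_FACTORS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}
SCALE_RULE = re.compile(r"\b(thousand|million|billion)s?\b", re.IGNORECASE)


def detect_scale(unit_note: str) -> int:
    """Map a note such as 'in thousands of £' to its multiplier (1 if none found)."""
    match = SCALE_RULE.search(unit_note)
    return SCALE_FACTORS[match.group(1).lower()] if match else 1


def to_absolute(value: float, unit_note: str) -> float:
    """Convert a value as printed in the report into absolute currency units."""
    return value * detect_scale(unit_note)


def plausible(value: float, reference: float, max_ratio: float = 100.0) -> bool:
    """Flag values that differ from a reference figure by a suspicious factor."""
    if value == 0 or reference == 0:
        return value == reference
    ratio = abs(value / reference)
    return 1 / max_ratio <= ratio <= max_ratio


revenue = to_absolute(1234.0, "in thousands of £")
print(revenue, plausible(revenue, reference=1_100_000.0))
# A value extracted without its scale stands out against the same reference.
print(plausible(1234.0, reference=1_100_000.0))
```

A check like this cannot say which value is correct, but it can route suspicious extractions to a second pass or to manual review.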

The representation of negative numbers, which annual reports often mark with parentheses (e.g. “(1,234)” instead of “-1,234”), also requires an explicit note in the prompt so that the model interprets this convention correctly and extracts the numbers with the correct sign. As already mentioned, hallucinations are not a major problem here, unlike with older models; it is the interpretation of the numbers that does not always succeed.
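The same convention can also be handled defensively in post-processing, for instance when the model returns a number as text. The helper below is a hypothetical sketch that assumes English number formatting (comma as thousands separator, dot as decimal point); it is not part of the workflow described above.

```python
def parse_reported_number(raw: str) -> float:
    """Parse '1,234', '(1,234)' or '-1,234' as printed in English-language reports."""
    text = raw.strip()
    # Accounting convention: parentheses mark a negative value.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1].strip()
    if text.startswith(("-", "\u2212")):  # ASCII hyphen or Unicode minus sign
        negative = True
        text = text[1:].strip()
    value = float(text.replace(",", ""))
    return -value if negative else value


for sample in ["1,234", "(1,234)", "-1,234", "(12.5)"]:
    print(sample, parse_reported_number(sample))
```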

Finally, we are also faced with the classic trade-off between costs and performance in particularly complex cases. More sophisticated reasoning approaches such as Chain-of-Thought (CoT), in which the model makes its “thought steps” explicit, or the use of even larger and more powerful models (for example Gemini 2.5 Pro) could remedy the problems mentioned, especially when analysing the more complex reports.

However, these are currently often much more expensive. For example, Gemini 2.5 Pro is currently 16 to 32 times more expensive than the Gemini 2.0 Flash Lite we used. The widely used GPT-4.1, which is used in ChatGPT, costs $2 per 1 million input tokens, about 27 times as much as Gemini 2.0 Flash Lite. By comparison, using our solution to process an average 30-page report from our test dataset costs only about $0.0007!
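The price comparison can be retraced with a few lines of arithmetic. The sketch below uses only figures quoted in this post; the Flash Lite input price is derived from the factor of 40 relative to gpt-3.5-turbo-16k, and the batch size of 10,000 reports is a made-up example.

```python
# Input prices in USD per 1 million tokens, as quoted in the text.
GPT_35_TURBO_16K = 3.00
GPT_41 = 2.00
# Derived from the "factor of 40" stated earlier: 3.00 / 40.
GEMINI_20_FLASH_LITE = GPT_35_TURBO_16K / 40

COST_PER_REPORT = 0.0007  # USD, average report from the test dataset


def batch_cost(num_reports: int, cost_per_report: float = COST_PER_REPORT) -> float:
    """Total cost in USD for processing a batch of reports."""
    return num_reports * cost_per_report


print(f"Flash Lite input price: ${GEMINI_20_FLASH_LITE:.3f} per 1M tokens")
print(f"GPT-4.1 vs Flash Lite: {GPT_41 / GEMINI_20_FLASH_LITE:.1f}x")
print(f"10,000 reports: ${batch_cost(10_000):.2f}")
```

At these prices, even a five-digit number of reports stays in the range of a few dollars, which is what makes the OCR pre-filtering plus Flash Lite combination attractive for mass processing.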

Conclusion: Gemini Flash as a powerful addition to the toolbox

Gemini Flash has proven to be a useful building block for us to take the extraction of structured data from annual reports to a new level and bring it into productive use at North Data. It does not necessarily replace the entire classic pipeline (as our OCR pre-filtering shows), but it does provide a powerful, integrated alternative to the core process of intelligent data extraction and structuring.

The ability to understand layouts, work within a larger context, and deliver structured outputs significantly reduces complexity and maintenance compared to traditional, multi-tiered approaches. The challenges remain, but the progress is clear and opens new opportunities for automated financial data analysis.

We are excited to see how this technology will develop further and what new solutions will emerge. Have you had similar experiences or developed different strategies? Share your thoughts with us!

This blog post was written with the support of Gemini 2.5 Pro.

¹ OmniAI OCR Benchmark, retrieved 17/06/25
² cronn Blog: Analyzing Business Reports with ChatGPT – Part I
³ Documentation Google Gemini 2.0 Flash-Lite, retrieved 17/06/25
⁴ Web Archive: OpenAI prices as of 14 June 2023, retrieved 17/06/25
⁵ Prices for Gemini Developer API, retrieved 17/06/25