Skip to main content

Compiler Extensions

C++ Compiler Extensions for High-Performance Computing

Over the past few years, we have proposed several C++ language extensions using attributes to reduce a solver’s memory footprint and optimise its data layouts. These extensions also allow programmers to specify where compute kernels should run (aka to offload to GPU). Our core idea is that a programmer can annotate code to inform the compiler that a floating-point variable holds, for example, only 10 significant digits, and that the subsequent code would benefit from data organised as Structure of Arrays (SoA) rather than Array of Structs (AoS). However, programmers can continue to work with native IEEE formats and AoS. Our Clang compiler extensions then process this information: they store the data using only the 10 valid digits, convert it into native C++ types ahead of the any actual computation, and reorganise the data into SoA behind the scenes. Furthermore, they rewrite the compute kernels to exploit this altered layout, particularly if these kernels are executed on the GPU.

There are several logical follow-up steps that frame a new PhD thesis: How can we apply these concepts to higher-dimensional arrays? At present, we mainly study 1D data, such as a series of particles. Can the annotation mechanism be extended to support dynamic changes in precision during computation? Can we completely hide all arising conversion overhead – in particular on accelerated compute systems?

In this PhD project, we are searching for a colleague who wants to champion our compiler extensions and push them to the next generation. This includes maintaining the code base, applying it to a complicated PDE solver (where we have plenty higher-dimensional arrays), and tackling the research questions from above.

Preparatory work

All of our previous work has been published in a series of papers that discuss various aspects and dimensions of the idea to use compiler annotations to write HPC C++ code more efficiently. The ACM TOMS paper (the first one) is certainly the reference to-go paper, but the other ones also make valuable, albeit smaller contributions.

All the code modifications are implemented within a specific LLVM fork which is publicly available. That means that users can either compile their own compiler, or they can use this particular LLVM fork as source-to-source compiler running prior to the actual compiler on their system. There are certainly more elegant solutions to ship the compiler extensions to potential users, but we have not been able to identify them at the time being.

Pawel, who championed this idea as part of his PhD thesis also published a poster about the idea which won the SIAM PP 2026 best poster price:

Methodology: Higher-dimensional, hierarchical data

In our experiments above, we have always demonstrated that the ideas pay off by means of a “simple” SPH solver or its compute kernels, respectively. That means, the data structures we used have always been sequences of particles and therefore something that is intrinsically 1d. It is a natural thing to think about high-dimensional arrays, as they are omnipresent in scientific computing.

There is work along these lines, where other groups cluster d-dimensional data into chunks which they then compress. But it is not clear how that fits into our annotation-based mindset.

One of the tricks that we use in some of the papers stems actually from work that we have done way earlier with Eckhardt et al and that we later had already rolled out to multigrid solvers (here and here). Pick an arbitrary value from a sequence. In our earlier work we used the average, in the compiler work we pick the first entry. Then store all the other values relative to it. For example, if you have a cell with particles full with velocities. Often, all the particles in a cell drift in the same direction, so their velocities are very similar and we can store them as differences with only a few bits.

Such hierarchical storage is very similar to hierarchical bases or generating systems – notably if we apply them to d-dimensional arrays. Our assumption is that such storage could be induced by compiler attributes, too, such that it is simple for programmers to optimise the data layout used (and store only few bits per quantity), while the calculation still works with native, nodal data.

We can continue to draw the analogy to hierarchical bases and bring this plus our annotations together with iterative refinement algorithms: It is a natural question to start to apply the concept of a hierarchical, compressed data storage to iterative algorithms – as they are omnipresent in scientific computing – using dynamic precision. For example, we could store a 2d array of quantities where we know that each entry holds up to 10 valid digits as a sequence of presentations: the first one has only 1 valid digit, the second one adds a second one and so forth. If we have such a hierarchical representation, a compiler can generate compute code which starts to compute with the 1 digit data representation and lets the code load, parallel to the compute, load the next, more accurate presentation, which is then in a subsequent iteration of the algorithm used to refine the outcome successively. But users really should only write the iterative algorithm and the annotation then informs the compiler that the data representation is hierarchical and that we can iteratively stream the data into the core while we compute. Obviously, this opens up the door for a pipeline parallelisation, where one core unpacks the hierarchical data structure into proper nodal data, one core does the actual compute, and another core finally transfers the outcomes back into a hierarchical, compressed representation. The arising methodological question behind the scenes reads as follows: With such a mindset, it should be possible to determine reasonable precisions dynamically, i.e. the user does not have to know how many bits matter anymore, but the system can find out on-the-fly by assessing if additional bits of accuracy make a difference or not.

Software

We would expect the PhD candidate to take ownership of the compiler extensions and to ensure that the outcomes become usable to a wide community as open source. Within the project, the ideas of the dynamic, hierarchical compression should be applied to compute kernels of the ExaHyPE project.

At the moment, we realise all of our extensions for Clang only and make them based upon OpenMP. For the next evolution stage of this piece of software, we may ask if other languages (e.g. Fortran) can be (partially) supported or if the front-end could rely exclusively on the new, upcoming C++ standard providing built-in support for multithreading and GPU offloading via execution policies.

@tobiasweinzierl.bsky.social

  • Great to be back in Munich for the ExaHyPE anniversary workshop. Great talk by Han @icc-durham.bsky.social on ExaGRyPE.
  • Citing a NWOBHM song from 1983: The Eagle Has Landed https://dl.acm.org/doi/10.1145/3799887 All if open access, all is worth reading!
  • Congratulations: https://www.durham.ac.uk/news-events/latest-news/2026/06/durham-expert-recognised-among-global-leaders-in-scientific-computing/ And nice to see the HPC Days featured on there as well. One week to go ...
  • I've finally updated my project pages describing our long-standing and fruitful collaboration with Intel at @durham.ac.uk : https://tobiasweinzierl.webspace.durham.ac.uk/intel-academic-centre-of-excellence/
  • We will feature our HAI-End and @shareing.bsky.social projects in collaboration with @durham-comp-sci.bsky.social and @arc-durhamuni.bsky.social at the conference as well. Thanks to @cake-dri.bsky.social for making this possible. [contains quote post or other embedded content]