Is Your Compute Problem Taking Too Long? Here Are Some Common Strategies to Speed It Up!

In this short post, we explore some common strategies to improve the performance of complex computational problems. Given the complexity of today’s computational methods, the discussion is necessarily high-level, but it provides a broad overview of the different approaches to this challenge.

  1. Expert Domain Knowledge 🧠 Understanding the right methods for your specific field can make a significant difference. Often, a different way of modeling the same computational problem leads to better performance, or a specialized algorithm outperforms general-purpose methods. The biggest improvements are usually achieved when domain knowledge is combined with computing expertise. For example, in fluid dynamics, the Lattice-Boltzmann method can be much more efficient than traditional Navier-Stokes solvers for certain problems and can be parallelized effectively on GPUs (a small illustration of specialized-beats-general is the tridiagonal-solver sketch after this list).
  2. Code Parallelization 🔀 Can parts of your code run in parallel? This is a broad problem with many strategies and challenges. In the simplest scenario, the existing method can be parallelized without changing the underlying math (see the OpenMP sketch after this list). Very often, unfortunately, existing methods are inherently sequential, and one has to develop new methods that solve the problem at hand but are also parallelizable. These new methods may have different properties - convergence, numerical stability, accuracy, etc. - requiring careful analysis of their impact on the overall problem. They can be purely “mathematical”, or they can come from the domain level, as discussed above. A final point is often overlooked but can become the biggest practical hurdle: once a parallelization strategy is identified, one often finds that it is “locked” by the design of the current code. Most programmers think in the sequential programming model, and the resulting data structures and program flow prevent a parallel “view” of the computations. In such cases, significant parts of the code have to be rewritten, a process the HPC community sometimes calls “throughputization”.
  3. Compute Accelerators 💻📟 Compute accelerators are highly specialized, often massively parallel hardware that can only tackle a narrow class of problems - but for that class, they often deliver tremendous speedups. One of the main challenges in exploiting them successfully is transforming the problem at hand into an equivalent one that can be implemented efficiently on the accelerator. In this world, optimizing code for the underlying hardware is paramount and requires expert knowledge. An often overlooked benefit is that accelerators are in fact co-processors: with a proper implementation, the host machine can continue computing while the accelerators do their work (see the offload sketch after this list)!
  4. Linear Algebra Packages 🧮 Computer science loves hierarchies! In what is often called a “layered architecture”, different problems are implemented on top of common abstraction layers, and computing is no exception: domain-level problems are translated into numerical methods, which very often end up as - you guessed it - linear algebra problems! One of the most common of these, found under the hood across computational science, is solving systems of linear equations (SLEs). Look under the hood of a Finite Element Method-based structural load simulation, and you will find that it is often reduced to an SLE and solved in that form. Peek behind the scenes of the Simplex method in Operations Research, and you’ll again find SLEs! A wide variety of linear algebra packages and libraries can be used for this purpose: some are parallel, some tailored to specific problems, and some to particular hardware. Knowing what to choose and how to use it can significantly improve the performance of your problem (see the LAPACK sketch after this list).
  5. Data Structures 🎒 On its own and in all of the previous sections, how you lay out your data can be crucial! A simple example: let’s say you discretize an Ordinary Differential Equation on a grid of 100,000 points. In most cases, a grid point does not interact with every other point, but only with a select few in what we call its “neighborhood”. This propagates down into the linear algebra: in the resulting 100,000 x 100,000 matrix - which models the interactions between all pairs of grid points - most entries will be zero. Storing these zeros explicitly would mean a whopping 74 GB of memory use! Compared to this, a sparse matrix in a simple format can require as little as 11 MB (see the CSR sketch after this list). As usual, sparse formats come with their own issues, such as non-optimal memory access patterns, but using them can make the difference between solving a problem and not even being able to model it on your system.
  6. Numerics and Compilers 📈 This is perhaps the least disruptive approach, yet it can still deliver impactful gains - without changing a single line of code! Continuing the hierarchy example from above, many linear algebra packages build on a common set of low-level routines called BLAS (Basic Linear Algebra Subprograms). There are different vendors and implementations of BLAS, offering different performance in different situations; benefits here are unlocked by choosing the right implementation and tuning its parameters for the problem at hand. Likewise, tweaking compiler flags for your specific hardware can produce better-optimized binaries and further performance boosts - again, without touching the code (see the final sketch after this list).
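
To make point 1 concrete, here is a minimal, illustrative C++ sketch (the function name is ours, not from any particular library) of the classic Thomas algorithm. If domain knowledge tells you that your system is tridiagonal, this specialized solver runs in O(n), where a general-purpose dense solver would cost O(n³).

```cpp
#include <vector>

// Specialized solver for tridiagonal systems (Thomas algorithm).
// a = sub-diagonal (a[0] unused), b = diagonal, c = super-diagonal,
// d = right-hand side. Runs in O(n) instead of the O(n^3) of a
// general dense solve - structure known from the domain pays off.
std::vector<double> solve_tridiagonal(std::vector<double> a,
                                      std::vector<double> b,
                                      std::vector<double> c,
                                      std::vector<double> d) {
    const std::size_t n = b.size();
    // Forward sweep: eliminate the sub-diagonal.
    for (std::size_t i = 1; i < n; ++i) {
        const double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    // Back substitution.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) {
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    }
    return x;
}
```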
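
For point 2, a minimal OpenMP sketch of the “simplest scenario”: a Jacobi-style update reads only the previous iterate, so the loop iterations are independent and can be parallelized without changing the underlying math. (A Gauss-Seidel update, by contrast, is inherently sequential and would need a reformulation first.)

```cpp
#include <omp.h>
#include <vector>

// Jacobi-style smoothing step: each new value depends only on the
// *previous* iterate, so iterations are independent. Compile with
// e.g. g++ -O3 -fopenmp to run the loop across all cores.
void jacobi_step(const std::vector<double>& old_u, std::vector<double>& new_u) {
    const std::size_t n = old_u.size();
    if (n < 3) return;  // nothing to update for tiny grids
    #pragma omp parallel for
    for (std::size_t i = 1; i < n - 1; ++i) {
        new_u[i] = 0.5 * (old_u[i - 1] + old_u[i + 1]);
    }
}
```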
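
For point 3, a sketch of the co-processor idea using OpenMP target offload (assuming a toolchain built with GPU offload support; the same pattern applies to CUDA streams and similar mechanisms): the device task is launched with “nowait”, so the host keeps computing until the synchronization point.

```cpp
#include <vector>

// Offload a vector update to an accelerator and let the host continue.
void saxpy_offloaded(float a, std::vector<float>& x, const std::vector<float>& y) {
    float* xp = x.data();
    const float* yp = y.data();
    const std::size_t n = x.size();

    // "nowait" turns the offload into an asynchronous task on the device.
    #pragma omp target teams distribute parallel for \
            map(tofrom: xp[0:n]) map(to: yp[0:n]) nowait
    for (std::size_t i = 0; i < n; ++i) {
        xp[i] = a * xp[i] + yp[i];
    }

    // ... host-side work can proceed here while the device computes ...

    #pragma omp taskwait  // synchronize before using the results
}
```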
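
For point 4, a minimal sketch of handing an SLE to an established library rather than hand-rolling a solver - here the standard LAPACK driver dgesv via its C interface (assuming a LAPACKE-enabled installation such as OpenBLAS or Intel MKL):

```cpp
#include <lapacke.h>   // C interface to LAPACK (provided by OpenBLAS, MKL, ...)
#include <cstdio>
#include <vector>

// Solve A x = b with dgesv (LU factorization with partial pivoting).
int main() {
    lapack_int n = 3, nrhs = 1;
    // Row-major 3x3 matrix A and right-hand side b.
    std::vector<double> A = { 4, 1, 0,
                              1, 4, 1,
                              0, 1, 4 };
    std::vector<double> b = { 1, 2, 3 };
    std::vector<lapack_int> ipiv(n);

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, nrhs,
                                    A.data(), n, ipiv.data(),
                                    b.data(), nrhs);
    if (info != 0) { std::printf("dgesv failed: %d\n", (int)info); return 1; }
    std::printf("x = %f %f %f\n", b[0], b[1], b[2]);  // solution overwrites b
    return 0;
}
```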
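
For point 5, a sketch of the classic Compressed Sparse Row (CSR) format from the example above: only the nonzeros and their positions are stored, so the 100,000-point grid with a 3-point neighborhood needs roughly 300,000 stored values instead of 10¹⁰.

```cpp
#include <cstddef>
#include <vector>

// Compressed Sparse Row storage: values + column indices + row offsets.
struct CsrMatrix {
    std::size_t n;                    // matrix dimension (n x n)
    std::vector<double> values;       // nonzero values, row by row
    std::vector<std::size_t> col;     // column index of each stored value
    std::vector<std::size_t> row_ptr; // row i occupies [row_ptr[i], row_ptr[i+1])
};

// Sparse matrix-vector product y = A * x, touching only the nonzeros.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.n, 0.0);
    for (std::size_t i = 0; i < A.n; ++i) {
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
            y[i] += A.values[k] * x[A.col[k]];
        }
    }
    return y;
}
```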
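
Finally, for point 6, a sketch showing why this layer needs no source changes: the dgemm call below is the same whichever BLAS you link against, and the build flags in the comments (illustrative, not exhaustive) do the rest.

```cpp
#include <cblas.h>   // C interface to BLAS (OpenBLAS, MKL, ... all provide one)
#include <vector>

// The same call, different backends and flags - zero source changes:
//   g++ -O3 -march=native gemm.cpp -lopenblas    (OpenBLAS)
//   g++ -O3 -march=native gemm.cpp ... -lmkl_rt  (Intel MKL, simplified)
void gemm_example() {
    const int n = 512;
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
    // C = 1.0 * A * B + 0.0 * C, row-major, no transposes.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
}
```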

The strategies outlined above are roughly ordered by the typical implementation effort, from highest to lowest. For example, changing the domain-level method often requires rewriting the entire application, whereas optimizing compiler flags requires no changes to the codebase. However, performance problems rarely come down to a single bottleneck; an effective investigation considers strategies across several - or all - of these layers. By combining them, you can significantly reduce computation times and improve overall efficiency.

Need help optimizing your HPC workloads? Let’s connect and explore how we can accelerate your success!
