Optimizing Rust code with LLVM: A detailed breakdown

When you compile Rust code by calling cargo run or by invoking rustc directly, one of the phases of compilation is handing the code off to LLVM to optimize it and generate machine code. Let’s dive into how it works and how to generate the fastest Rust code possible!

What is LLVM?

LLVM is a suite of compiler technologies that can be used across many different languages. LLVM used to stand for "Low Level Virtual Machine," but as the project expanded into more subprojects, that name made less and less sense. Since 2011, the official name of the project has simply been "LLVM," no longer an acronym.

Rust uses various parts of LLVM, such as its intermediate representation (IR), which is a fairly low-level representation of code. We’ll look at a detailed breakdown of an example LLVM IR in a later section.

Additionally, LLVM has a large set of optimization transforms to make the code run more efficiently. Let’s go over some examples below.

With dead code elimination, any operations with results that aren’t used — and that have no side effects — can be eliminated.
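
As a hypothetical sketch (dead_code_demo is my name, not from the post), LLVM can delete the multiplication below because nothing reads its result:

fn dead_code_demo(x: u32) -> u32 {
    // This multiplication has no side effects and its result is never read,
    // so dead code elimination can remove it from the optimized output.
    let _unused = x * 42;
    x + 1
}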

With loop-invariant code motion, calculations inside a loop that don’t depend on any values that change between loop iterations can be moved outside of the loop. If the loop runs many times, this can result in significant execution time savings.
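
For example, in this sketch (licm_demo is a hypothetical name), a * b + 7 doesn't change between iterations, so LLVM can compute it once before the loop:

fn licm_demo(values: &mut [u32], a: u32, b: u32) {
    for v in values.iter_mut() {
        // a * b + 7 is loop-invariant: it doesn't depend on v or the
        // iteration count, so LLVM can hoist it out and compute it once.
        *v += a * b + 7;
    }
}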

LLVM also uses basic block vectorization, so code that performs the same operation on a lot of data can be turned into SIMD instructions from instruction set extensions like SSE or AVX.
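
For instance, an element-wise loop like this sketch (vectorize_demo is a hypothetical name) is a candidate for auto-vectorization when optimizations are enabled:

fn vectorize_demo(out: &mut [f32], a: &[f32], b: &[f32]) {
    // Each iteration performs the same independent addition, so LLVM's
    // vectorizers can rewrite the loop to process several floats per instruction.
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}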

Finally, LLVM utilizes backends that emit machine code from the optimized IR. Since machine code is specific to an architecture, there are specific backends for each supported architecture. LLVM supports many architectures, and so does Rust; the most common are x86 (Intel/AMD 32-bit), x64 (Intel/AMD 64-bit), and ARM64.

Rust has its own frontend that compiles down to LLVM IR and then relies on LLVM to optimize this and compile it down to machine code. C, C++, and Objective-C can also do this through the Clang frontend, which is a part of LLVM. Some other languages that can do this are Swift, Kotlin, Ruby, Julia, and Scala.

The main advantage of this approach is that all the frontend has to do is turn Rust code into LLVM IR — although this is still a lot of work! Then the existing LLVM parts can optimize the code and emit machine code for a bunch of different platforms.

What does the LLVM IR look like?

Helpfully, rustc has an option to emit the LLVM IR that a crate compiles down to with --emit llvm-ir. Let’s look at an example!

Here’s some very simple Rust code:

fn simple_add(x: u32, y: u32) -> u32 {
    return x + y;
}
fn main() {
    let z = simple_add(3, 4);
    println!("{}", z);
}

I put this inside a crate in main.rs, then called rustc main.rs --emit llvm-ir. This produces a file named main.ll with the LLVM IR. The file is actually surprisingly large. To keep things simple, let’s just look at the simple_add() function:

; main::simple_add
; Function Attrs: uwtable
define internal i32 @_ZN4main10simple_add17hdafc9bea2a13499fE(i32 %x, i32 %y) unnamed_addr #1 {
start:
  %0 = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %x, i32 %y)
  %_5.0 = extractvalue { i32, i1 } %0, 0
  %_5.1 = extractvalue { i32, i1 } %0, 1
  %1 = call i1 @llvm.expect.i1(i1 %_5.1, i1 false)
  br i1 %1, label %panic, label %bb1

bb1:                                              ; preds = %start
  ret i32 %_5.0

panic:                                            ; preds = %start
; call core::panicking::panic
  call void @_ZN4core9panicking5panic17h2d50e3e44ac775d8E(ptr align 1 @str.1, i64 28, ptr align 8 @alloc27) #7
  unreachable
}

That’s a lot of code for a one-line function! Let’s break down what’s going on.

Breakdown of our example LLVM IR

The first line is a comment saying what the “real” name of this function is. The second line indicates that an unwind table entry is required for purposes of exception handling:

; main::simple_add
; Function Attrs: uwtable

Most functions on x64 — the platform I’m using — require this.

Let’s take a look at the next piece:

define internal i32 @_ZN4main10simple_add17hdafc9bea2a13499fE(i32 %x, i32 %y) unnamed_addr #1 {

This is the declaration of the function.

internal means that this function is private to this module.

i32 means that this function returns a 32-bit integer type. Note that, unusually, i32 is used for both signed and unsigned types. There are different operations that treat them as different types when necessary.

@_ZN4main10simple_add17hdafc9bea2a13499fE is the internal name of the function. LLVM IR symbols that start with the @ symbol are global symbols, while ones that start with the % symbol are local symbols.

This internal function name is a name-mangled version of main::simple_add, which includes a hash of the function's contents at the end to allow multiple versions of the same crate to be compiled together. If you're interested, Rust's current name-mangling code lives in the rustc source, although there is an RFC to change the mangling scheme.

The function arguments are (i32 %x, i32 %y).

unnamed_addr indicates to LLVM that the address of this function doesn’t matter, only its contents. This can be used to merge two functions together if they do exactly the same thing.

#1 indicates that attributes for this function are defined elsewhere. Later on in the file, this specifies that the target architecture is x86-64:

attributes #1 = { uwtable "target-cpu"="x86-64" }

The next line is just start:, which defines a label with the name “start.” This can be used as a branch target, although it isn’t in this function. We’ll need this label for preds = specifications below.

Next, here’s where the actual add happens! LLVM IR has a lot of different versions of add. This one adds two unsigned i32 numbers, hence the “u” in uadd:

  %0 = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %x, i32 %y)

Function calls in LLVM IR require you to specify the return type. Here, the return type is { i32, i1 }, which is the syntax used for a structure with an i32 and an i1.

The i1 is a one-bit integer — in other words, a boolean — and holds whether the add overflowed or not. The struct result of the function call is stored in the local variable %0.
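
As an aside, safe Rust exposes the same add-plus-overflow-flag pattern directly, which is a handy way to see what the IR is modeling:

fn main() {
    // overflowing_add returns the wrapped sum and a bool overflow flag,
    // mirroring the { i32, i1 } struct from @llvm.uadd.with.overflow.i32.
    let (sum, overflowed) = u32::MAX.overflowing_add(1);
    assert_eq!(sum, 0);
    assert!(overflowed);
}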

The extractvalue instruction returns a value from a structure. Here the code is extracting the two values from %0 and placing them in %_5:

  %_5.0 = extractvalue { i32, i1 } %0, 0
  %_5.1 = extractvalue { i32, i1 } %0, 1

You can tell this is unoptimized code because a human reader can spot redundancy here, such as extracting both fields into %_5 just to use each one once. The nice part about using LLVM is that the Rust frontend can just focus on emitting correct IR, even if it's inefficient, and rely on LLVM's optimizations to clean up the inefficiency.

Next, the @llvm.expect intrinsic is a hint to the optimizer that the first parameter probably has the value of the second parameter:

  %1 = call i1 @llvm.expect.i1(i1 %_5.1, i1 false)

This is useful for producing more efficient code. For example, if we're going to branch on this value, the code for the more likely branch can be placed closer in memory, so the CPU doesn't have to fetch instructions from far away in the common case.

Here, the code is saying that the add operation above probably did not overflow. The intrinsic also returns the value that was passed in, for convenience.
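
Rust doesn't have a stable likely/unlikely hint of its own, but you can express something similar on stable with the #[cold] attribute. Here's a rough sketch (the function names are mine, not from the post):

// #[cold] tells the optimizer this function is rarely called, so the
// panic path can be moved away from the hot code, much like @llvm.expect.
#[cold]
fn overflow_panic() -> ! {
    panic!("attempt to add with overflow");
}

fn add_or_panic(x: u32, y: u32) -> u32 {
    match x.checked_add(y) {
        Some(sum) => sum,
        None => overflow_panic(),
    }
}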

Next, br is a branch instruction:

  br i1 %1, label %panic, label %bb1

If %1 is true — i.e., if the add operation from above overflowed — the code will jump to the helpfully-named %panic label. Otherwise, it will jump to %bb1:

bb1:                                              ; preds = %start
  ret i32 %_5.0

The bb1 label handles the case where there was no overflow, so all that's left to do is return the result of the add operation.

Note that the preds = %start comment indicates that the basic block labeled by %start is the only predecessor of this block, meaning that the only way to get to this block was to have jumped here from %start. This is useful for some analysis and optimization passes.

Next is the call to panic if there was an overflow:

panic:                                            ; preds = %start
; call core::panicking::panic
  call void @_ZN4core9panicking5panic17h2d50e3e44ac775d8E(ptr align 1 @str.1, i64 28, ptr align 8 @alloc27) #7
  unreachable

Note that @str.1 is a global variable defined elsewhere as:

@str.1 = internal constant [28 x i8] c"attempt to add with overflow"

The 28 value passed is the length of that string, and @alloc27 has information about the call stack. The #7 attribute is declared elsewhere as:

attributes #7 = { noreturn }

This indicates that a call to this function will never return; in this case, that's because the process will exit. Finally, the unreachable instruction tells LLVM that control flow can never reach this point.
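
Incidentally, this whole check-and-panic path exists because overflow checks are on by default in unoptimized builds. In release builds they're off by default and the add simply wraps, though you can opt back in from Cargo.toml:

[profile.release]
# Overflow checks are off by default in release builds (the add wraps);
# this restores the checked, panicking behavior.
overflow-checks = true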

Whew, that was a lot! One thing to note is that all the borrowing rules and such are enforced by the Rust frontend, not by LLVM. If the borrow checker finds a violation, it emits an error and exits before LLVM IR is generated.

The reference manual for LLVM IR is available on LLVM’s website if you’re interested!

How to make LLVM fully optimize your Rust code

rustc allows passing arguments directly to LLVM with -C llvm-args, but most of the time you won’t need to do this to get the best optimization for your code.

If you want your code to run as fast as possible, here are some lines to add to your Cargo.toml file that affect the release configuration of a build:

[profile.release]
lto = "fat"
codegen-units = 1

Let’s break this down, starting with the following:

lto = "fat"

This setting turns on LLVM’s link-time optimization. Most optimizations can only be done at a per-module level, but turning this on lets LLVM perform optimizations across all modules.

Note that in my experience, this makes compile times significantly slower; the Rust lto documentation states that setting this to “thin” is almost as good in most cases, but compiles significantly faster.
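
If fat LTO is too slow for your workflow, the thin variant looks like this:

[profile.release]
# Most of the cross-module optimization benefit at a much lower
# compile-time cost than "fat".
lto = "thin"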

The next piece of code controls how many units a crate is split up into when compiling:

codegen-units = 1

The higher this number is, the faster compilation will be, because there is more opportunity for parallelism. However, splitting the crate up also means some optimizations can't be applied across unit boundaries. Setting this to 1 gives LLVM the most to work with, but it's another change that makes compiling a crate significantly slower!

Also note that whenever you're playing with settings like this, it's a good idea to benchmark your application; cargo bench is a handy way to do this!

Benchmarking your application can help you determine how much these changes improve performance versus how much they affect your build times. You may discover that certain settings aren't worth turning on.
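
As a sketch of what that could look like, here's a benchmark for the earlier simple_add() function using the third-party criterion crate (cargo's built-in #[bench] harness is nightly-only). This assumes criterion is added as a dev-dependency and the file is registered as a [[bench]] target with harness = false:

// benches/simple_add.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn simple_add(x: u32, y: u32) -> u32 {
    x + y
}

fn bench_simple_add(c: &mut Criterion) {
    // black_box keeps LLVM from constant-folding the call away,
    // so we measure the actual work.
    c.bench_function("simple_add", |b| {
        b.iter(|| simple_add(black_box(3), black_box(4)))
    });
}

criterion_group!(benches, bench_simple_add);
criterion_main!(benches);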

Additionally, note that in some cases, setting opt-level = 2 instead of the default 3 might actually produce faster code!

Conclusion

LLVM offers many ways to optimize your Rust code. You can see the full list of options you can set in a profile in the Cargo reference's documentation on profiles.

To begin with, you might want to consider trying panic = "abort", an option that can make your code a little faster and smaller. However, it changes behavior: on panic, the process aborts immediately instead of unwinding the stack, so destructors don't run and panics can't be caught.
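
In Cargo.toml, that looks like this:

[profile.release]
# Abort the process immediately on panic instead of unwinding the stack.
panic = "abort"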

If your application uses a lot of memory, you could try using a different allocator that might be faster for your workload. Two popular options are jemalloc and mimalloc; feel free to try them out, and comment below with any questions!
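
As a minimal sketch, swapping in mimalloc as the global allocator looks like this, assuming the mimalloc crate has been added as a dependency:

use mimalloc::MiMalloc;

// Route every heap allocation in the program through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // This Vec (and all other allocations) now use mimalloc.
    let v: Vec<u32> = (0..1_000).collect();
    println!("{}", v.len());
}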
