SIMD Vectors in Zig

The program demonstrates a classic SIMD speedup by comparing two implementations of the same element-wise vector addition (c[i] = a[i] + b[i]) on two 100-million-element arrays of 32-bit floats. The scalarAdd() function is the straightforward baseline: a single for loop iterates over the input slices a and b in lockstep with their indices, performs one floating-point addition per element and writes the result into the output slice c. As written, it expresses exactly one add per element and leaves any vectorisation entirely to the optimizer, which makes it simple, portable and easy to understand, but also the slowest path on modern hardware whenever the compiler does not auto-vectorise the loop.

const std = @import("std");

fn scalarAdd(a: []const f32, b: []const f32, c: []f32) void {
    for (a, b, 0..) |x, y, i| {
        c[i] = x + y;
    }
}

fn simdAdd(a: []const f32, b: []const f32, c: []f32, comptime vec_len: usize) void {
    const Vec = @Vector(vec_len, f32);
    var i: usize = 0;
    while (i + vec_len <= a.len) : (i += vec_len) {
        const va: Vec = a[i..][0..vec_len].*;
        const vb: Vec = b[i..][0..vec_len].*;
        const vc = va + vb;
        c[i..][0..vec_len].* = vc;
    }
    // handle remainder scalar-style
    while (i < a.len) : (i += 1) {
        c[i] = a[i] + b[i];
    }
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const N: usize = 100_000_000; // large enough to see speedup
    const a = try allocator.alloc(f32, N);
    defer allocator.free(a);
    const b = try allocator.alloc(f32, N);
    defer allocator.free(b);
    const c_scalar = try allocator.alloc(f32, N);
    defer allocator.free(c_scalar);
    const c_simd = try allocator.alloc(f32, N);
    defer allocator.free(c_simd);

    // fill arrays with deterministic values
    for (a, 0..) |*x, i| x.* = @as(f32, @floatFromInt(i));
    for (b, 0..) |*x, i| x.* = @as(f32, @floatFromInt(i)) * 0.5;

    // === SCALAR VERSION ===
    var timer = try std.time.Timer.start();
    scalarAdd(a, b, c_scalar);
    const scalar_ns = timer.lap();

    // === SIMD VERSION ===
    const vec_len = 32;
    timer.reset();
    simdAdd(a, b, c_simd, vec_len);
    const simd_ns = timer.lap();

    std.debug.print("Scalar time: {d:.2} ms\n", .{@as(f64, @floatFromInt(scalar_ns)) / 1_000_000.0});
    std.debug.print("SIMD time (vec_len={d}): {d:.2} ms\n", .{ vec_len, @as(f64, @floatFromInt(simd_ns)) / 1_000_000.0 });
    std.debug.print("Speedup: {d:.1}x\n", .{@as(f64, @floatFromInt(scalar_ns)) / @as(f64, @floatFromInt(simd_ns))});

    // optional: verify results match (they will)
    std.debug.assert(std.mem.eql(f32, c_scalar, c_simd));
}

The simdAdd() function shows the vectorised version. It defines a compile-time vector type Vec = @Vector(vec_len, f32), where vec_len is fixed at 32. That is a 1024-bit vector, wider than any single hardware register (even AVX-512 registers are only 512 bits), so the compiler lowers each vector operation to a short sequence of SIMD instructions using the widest registers the target CPU supports. The main while loop advances through the arrays in steps of exactly 32 elements. For each chunk it converts the next 32 floats from a and b into vectors using the idiomatic slice-to-vector coercion a[i..][0..vec_len].*, performs a single vector addition va + vb (which the compiler translates into SIMD instructions that add many pairs of floats in parallel) and writes the result vector back into c with another slice assignment. After the main loop finishes, a short scalar tail loop handles any remaining elements when the array length is not a multiple of 32, guaranteeing correctness for any size.
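The slice-to-vector coercion is worth seeing in isolation, since it is the least obvious part of the loop. A minimal standalone sketch (using a hypothetical 8-element array rather than the 100-million-element ones from the benchmark):

```zig
const std = @import("std");

pub fn main() void {
    const Vec4 = @Vector(4, f32);
    const data = [_]f32{ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 };
    const s: []const f32 = &data;

    // s[4..] is a runtime slice; re-slicing it with comptime-known bounds
    // [0..4] yields a pointer to a fixed-size array (*const [4]f32), and
    // dereferencing that array coerces it to the vector type.
    const v: Vec4 = s[4..][0..4].*; // {5, 6, 7, 8}
    const w: Vec4 = s[0..][0..4].*; // {1, 2, 3, 4}
    const sum = v + w; // element-wise SIMD add: {6, 8, 10, 12}
    std.debug.print("{any}\n", .{sum});
}
```

The key detail is that the inner bounds (0..4) are comptime-known, which is what turns the slice into a fixed-size array pointer that can coerce to a vector.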

In main() we allocate four large slices (a, b, c_scalar, c_simd) using the page allocator, fill them with deterministic values so the results are reproducible and then time each implementation separately with std.time.Timer. The scalar version runs first, followed by a timer reset and the SIMD version. Finally the program prints the elapsed times in milliseconds and the calculated speedup factor, and std.debug.assert verifies that both implementations produce identical results (they always will, because both loops perform the same additions in the same order, so the outputs are bitwise equal). Note that std.debug.assert is compiled out in ReleaseFast and ReleaseSmall builds, so the check is only active in Debug and ReleaseSafe modes.
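One refinement worth mentioning: instead of hard-coding vec_len = 32, the standard library can suggest a lane count matched to the compile target. A minimal sketch, assuming a Zig version that ships std.simd.suggestVectorLength:

```zig
const std = @import("std");

pub fn main() void {
    // suggestVectorLength returns a lane count tuned to the target's SIMD
    // register width, or null if the target has no vector registers at all;
    // fall back to 1 (plain scalar) in that case.
    const vec_len = comptime std.simd.suggestVectorLength(f32) orelse 1;
    std.debug.print("suggested f32 vector length: {d}\n", .{vec_len});
}
```

Passing this value as the comptime vec_len argument of simdAdd would let the same source adapt to SSE, NEON, AVX2 or AVX-512 targets without editing the constant.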

Save the code as simd.zig and execute it as follows:

$ zig run simd.zig
Scalar time: 386.61 ms
SIMD time (vec_len=32): 95.45 ms
Speedup: 4.1x

On my machine the scalar version took 386.61 ms while the SIMD version finished in 95.45 ms, delivering a clean 4.1× speedup. The exact figures will vary with CPU, memory bandwidth and optimization mode (building with -O ReleaseFast is worth trying), but the SIMD path should win comfortably on any modern machine.

Happy coding in Zig!