
Learn Zig Series (#19) - SIMD with @Vector

What will I learn
- You will learn what SIMD is and why it matters for performance;
- Zig's @Vector(N, T) type for SIMD operations;
- element-wise arithmetic on vectors;
- loading and storing vectors from slices;
- SIMD reductions: sum, min, max across vector lanes;
- shuffling and selecting vector elements;
- auto-vectorization hints vs explicit SIMD;
- practical examples: fast array summation and dot products.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Zig 0.14+ distribution (download from ziglang.org);
- The ambition to learn Zig programming.
Difficulty
- Intermediate
Curriculum (of the Learn Zig Series):
- Zig Programming Tutorial - ep001 - Intro
- Learn Zig Series (#2) - Hello Zig, Variables and Types
- Learn Zig Series (#3) - Functions and Control Flow
- Learn Zig Series (#4) - Error Handling (Zig's Best Feature)
- Learn Zig Series (#5) - Arrays, Slices, and Strings
- Learn Zig Series (#6) - Structs, Enums, and Tagged Unions
- Learn Zig Series (#7) - Memory Management and Allocators
- Learn Zig Series (#8) - Pointers and Memory Layout
- Learn Zig Series (#9) - Comptime (Zig's Superpower)
- Learn Zig Series (#10) - Project Structure, Modules, and File I/O
- Learn Zig Series (#11) - Mini Project: Building a Step Sequencer
- Learn Zig Series (#12) - Testing and Test-Driven Development
- Learn Zig Series (#13) - Interfaces via Type Erasure
- Learn Zig Series (#14) - Generics with Comptime Parameters
- Learn Zig Series (#15) - The Build System (build.zig)
- Learn Zig Series (#16) - Sentinel-Terminated Types and C Strings
- Learn Zig Series (#17) - Packed Structs and Bit Manipulation
- Learn Zig Series (#18) - Async Concepts and Event Loops
- Learn Zig Series (#18b) - Addendum: Async Returns in Zig 0.16
- Learn Zig Series (#19) - SIMD with @Vector (this post)
Learn Zig Series (#19) - SIMD with @Vector
Welcome back! In episode #18 we built event loops from scratch using poll() and epoll, and in the #18b addendum we covered the new std.Io interface coming in Zig 0.16. All of that was about doing many things at once -- concurrency and I/O multiplexing. Today we go in a completely different direction: doing the SAME thing to many data elements at once. That's SIMD, and Zig's support for it is one of its most underappreciated features.
SIMD stands for Single Instruction, Multiple Data. The idea: instead of processing one number at a time, you tell the CPU to process 4, 8, 16, or even 64 numbers in a single instruction. Your CPU already has special wide registers for this -- SSE gives you 128-bit registers (4 floats), AVX gives you 256 bits (8 floats), and AVX-512 gives you 512 bits (16 floats at once). These registers and instructions exist on every x86 CPU made in the last 15 years. ARM has NEON (128-bit). The hardware is there -- the question is whether your language makes it easy to USE.
Most languages make it hard. In C, you either write intrinsics (_mm256_add_ps(a, b)) which are verbose and non-portable, or you hope the compiler auto-vectorizes your loops (which it often doesn't). Rust has the std::simd module that's been unstable for years. Python doesn't even try -- NumPy does SIMD internally, but you can't access it directly.
Zig takes a different approach: @Vector(N, T) is a first-class type that maps directly to SIMD hardware. You write normal arithmetic on vectors, and the compiler emits the right instructions for your target CPU. No intrinsics, no pragmas, no hoping.
Here we go!
Solutions to Episode 18 Exercises
Episode #18b was an addendum without exercises, so the solutions below are for ep18's exercises on event loops and I/O multiplexing. Full code, copy-paste-and-run.
Exercise 1 -- EventLoop with per-client byte tracking:
const std = @import("std");
const posix = std.posix;
const net = std.net;
const EventLoop = struct {
    const max_fds = 256;

    server_fd: posix.socket_t,
    fds: [max_fds]posix.pollfd,
    bytes_received: [max_fds]u64,
    nfds: usize,
    running: bool,

    pub fn init(port: u16) !EventLoop {
        const address = net.Address.initIp4(.{ 127, 0, 0, 1 }, port);
        const fd = try posix.socket(
            posix.AF.INET,
            posix.SOCK.STREAM | posix.SOCK.NONBLOCK,
            0,
        );
        errdefer posix.close(fd);
        try posix.setsockopt(
            fd, posix.SOL.SOCKET, posix.SO.REUSEADDR,
            &std.mem.toBytes(@as(c_int, 1)),
        );
        try posix.bind(fd, &address.any, address.getOsSockLen());
        try posix.listen(fd, 128);
        var self = EventLoop{
            .server_fd = fd,
            .fds = undefined,
            .bytes_received = [_]u64{0} ** max_fds,
            .nfds = 1,
            .running = true,
        };
        self.fds[0] = .{
            .fd = fd,
            .events = posix.POLL.IN,
            .revents = 0,
        };
        return self;
    }

    pub fn deinit(self: *EventLoop) void {
        var i: usize = 0;
        while (i < self.nfds) : (i += 1) {
            posix.close(self.fds[i].fd);
        }
    }

    pub fn run(self: *EventLoop) !void {
        std.debug.print("Event loop running\n", .{});
        while (self.running) {
            const ready = try posix.poll(self.fds[0..self.nfds], 1000);
            if (ready == 0) continue;
            var i: usize = 0;
            while (i < self.nfds) {
                if (self.fds[i].revents == 0) { i += 1; continue; }
                if (self.fds[i].fd == self.server_fd) {
                    try self.handleAccept();
                } else {
                    if (!self.handleClient(i)) continue;
                }
                i += 1;
            }
        }
    }

    fn handleAccept(self: *EventLoop) !void {
        const client = posix.accept(
            self.server_fd, null, null, posix.SOCK.NONBLOCK,
        ) catch |err| {
            if (err == error.WouldBlock) return;
            return err;
        };
        if (self.nfds >= max_fds) { posix.close(client); return; }
        self.fds[self.nfds] = .{
            .fd = client, .events = posix.POLL.IN, .revents = 0,
        };
        self.bytes_received[self.nfds] = 0;
        self.nfds += 1;
        std.debug.print("[+] Connection (total: {d})\n", .{self.nfds - 1});
    }

    fn handleClient(self: *EventLoop, index: usize) bool {
        var buf: [4096]u8 = undefined;
        if (self.fds[index].revents & (posix.POLL.HUP | posix.POLL.ERR) != 0) {
            self.removeClient(index);
            return false;
        }
        if (self.fds[index].revents & posix.POLL.IN != 0) {
            const n = posix.read(self.fds[index].fd, &buf) catch {
                self.removeClient(index);
                return false;
            };
            if (n == 0) { self.removeClient(index); return false; }
            self.bytes_received[index] += n;
            _ = posix.write(self.fds[index].fd, buf[0..n]) catch {};
        }
        return true;
    }

    fn removeClient(self: *EventLoop, index: usize) void {
        std.debug.print("[-] Disconnect fd={d}, total bytes received: {d}\n", .{
            self.fds[index].fd, self.bytes_received[index],
        });
        posix.close(self.fds[index].fd);
        self.fds[index] = self.fds[self.nfds - 1];
        self.bytes_received[index] = self.bytes_received[self.nfds - 1];
        self.nfds -= 1;
    }
};

pub fn main() !void {
    var loop = try EventLoop.init(8080);
    defer loop.deinit();
    try loop.run();
}
The key addition: a parallel bytes_received array indexed the same way as fds. When a client sends data, we accumulate the byte count. On disconnect, we print the total. The swap-on-remove trick from ep18 also swaps the byte counter -- keeping the arrays in sync.
Exercise 2 -- Multi-source monitor (stdin + TCP):
const std = @import("std");
const posix = std.posix;
pub fn main() !void {
    const address = std.net.Address.initIp4(.{ 127, 0, 0, 1 }, 8080);
    const server = try posix.socket(
        posix.AF.INET,
        posix.SOCK.STREAM | posix.SOCK.NONBLOCK,
        0,
    );
    defer posix.close(server);
    try posix.setsockopt(
        server, posix.SOL.SOCKET, posix.SO.REUSEADDR,
        &std.mem.toBytes(@as(c_int, 1)),
    );
    try posix.bind(server, &address.any, address.getOsSockLen());
    try posix.listen(server, 128);

    var fds: [66]posix.pollfd = undefined;
    var nfds: usize = 2;
    var connections_handled: u32 = 0;
    // Slot 0: stdin
    fds[0] = .{ .fd = 0, .events = posix.POLL.IN, .revents = 0 };
    // Slot 1: server socket
    fds[1] = .{ .fd = server, .events = posix.POLL.IN, .revents = 0 };
    std.debug.print("Monitoring stdin + TCP :8080\n", .{});

    var running = true;
    while (running) {
        const ready = try posix.poll(fds[0..nfds], -1);
        if (ready == 0) continue;
        var i: usize = 0;
        while (i < nfds) {
            if (fds[i].revents == 0) { i += 1; continue; }
            if (i == 0) {
                // stdin
                var buf: [1024]u8 = undefined;
                const n = posix.read(0, &buf) catch { running = false; break; };
                if (n == 0) { running = false; break; } // EOF
                const line = std.mem.trimRight(u8, buf[0..n], "\n\r");
                if (std.mem.eql(u8, line, "quit")) { running = false; break; }
                std.debug.print("[stdin] {s}\n", .{line});
            } else if (fds[i].fd == server) {
                const client = posix.accept(server, null, null, posix.SOCK.NONBLOCK) catch |err| {
                    if (err == error.WouldBlock) { i += 1; continue; }
                    return err;
                };
                connections_handled += 1;
                std.debug.print("[net] new connection fd={d}\n", .{client});
                if (nfds < fds.len) {
                    fds[nfds] = .{ .fd = client, .events = posix.POLL.IN, .revents = 0 };
                    nfds += 1;
                } else {
                    posix.close(client);
                }
            } else {
                var buf: [4096]u8 = undefined;
                const n = posix.read(fds[i].fd, &buf) catch 0;
                if (n == 0) {
                    posix.close(fds[i].fd);
                    fds[i] = fds[nfds - 1];
                    nfds -= 1;
                    continue;
                }
                std.debug.print("[net] data: {s}", .{buf[0..n]});
            }
            i += 1;
        }
    }

    // Clean up all remaining connections
    var i: usize = 2;
    while (i < nfds) : (i += 1) {
        posix.close(fds[i].fd);
    }
    std.debug.print("Shutdown. Total connections handled: {d}\n", .{connections_handled});
}
The trick: stdin is file descriptor 0, and poll() treats it just like any socket. You add it to the pollfd array and the kernel wakes you when the user types something. This is the power of Unix's "everything is a file descriptor" design -- poll() works uniformly across sockets, pipes, terminals, and more.
Exercise 3 -- epoll server with inactivity timeout:
const std = @import("std");
const posix = std.posix;
const linux = std.os.linux;
const max_clients = 256;
const timeout_ms: i64 = 10_000; // 10 seconds
var last_activity: [max_clients]i64 = [_]i64{0} ** max_clients;
var fd_to_index: [max_clients]posix.socket_t = [_]posix.socket_t{-1} ** max_clients;
var active_count: usize = 0;
fn findSlot() ?usize {
    for (0..max_clients) |i| {
        if (fd_to_index[i] == -1) return i;
    }
    return null;
}

fn findByFd(fd: posix.socket_t) ?usize {
    for (0..max_clients) |i| {
        if (fd_to_index[i] == fd) return i;
    }
    return null;
}

fn removeSlot(index: usize, epfd: posix.fd_t) void {
    _ = epfd; // closing the fd removes it from the epoll set automatically
    const fd = fd_to_index[index];
    if (fd == -1) return;
    std.debug.print("[-] Timeout/close fd={d}\n", .{fd});
    posix.close(fd);
    fd_to_index[index] = -1;
    last_activity[index] = 0;
    active_count -= 1;
}

pub fn main() !void {
    const address = std.net.Address.initIp4(.{ 127, 0, 0, 1 }, 8080);
    const server = try posix.socket(
        posix.AF.INET,
        posix.SOCK.STREAM | posix.SOCK.NONBLOCK,
        0,
    );
    defer posix.close(server);
    try posix.setsockopt(server, posix.SOL.SOCKET, posix.SO.REUSEADDR,
        &std.mem.toBytes(@as(c_int, 1)));
    try posix.bind(server, &address.any, address.getOsSockLen());
    try posix.listen(server, 128);

    const epfd = try posix.epoll_create1(0);
    defer posix.close(epfd);
    var server_event = linux.epoll_event{
        .events = linux.EPOLL.IN,
        .data = .{ .fd = server },
    };
    try posix.epoll_ctl(epfd, linux.EPOLL.CTL_ADD, server, &server_event);
    std.debug.print("Listening :8080 with epoll + 10s inactivity timeout\n", .{});

    var events: [64]linux.epoll_event = undefined;
    while (true) {
        const n_ready = posix.epoll_wait(epfd, &events, 1000);
        const now = std.time.milliTimestamp();
        for (events[0..n_ready]) |ev| {
            if (ev.data.fd == server) {
                const client = posix.accept(server, null, null,
                    posix.SOCK.NONBLOCK) catch continue;
                if (findSlot()) |slot| {
                    var client_event = linux.epoll_event{
                        .events = linux.EPOLL.IN,
                        .data = .{ .fd = client },
                    };
                    posix.epoll_ctl(epfd, linux.EPOLL.CTL_ADD,
                        client, &client_event) catch { posix.close(client); continue; };
                    fd_to_index[slot] = client;
                    last_activity[slot] = now;
                    active_count += 1;
                    std.debug.print("[+] fd={d} (active: {d})\n", .{ client, active_count });
                } else {
                    posix.close(client);
                }
            } else {
                if (findByFd(ev.data.fd)) |slot| {
                    var buf: [4096]u8 = undefined;
                    const n = posix.read(ev.data.fd, &buf) catch 0;
                    if (n == 0) {
                        removeSlot(slot, epfd);
                        continue;
                    }
                    last_activity[slot] = now;
                    _ = posix.write(ev.data.fd, buf[0..n]) catch {};
                }
            }
        }
        // Sweep for stale connections
        for (0..max_clients) |i| {
            if (fd_to_index[i] != -1 and (now - last_activity[i]) > timeout_ms) {
                std.debug.print("Timeout: fd={d} idle {d}ms\n", .{
                    fd_to_index[i], now - last_activity[i],
                });
                removeSlot(i, epfd);
            }
        }
    }
}
The timeout mechanism: epoll_wait returns after 1 second even with no I/O events (the 1000ms timeout). On each iteration we record std.time.milliTimestamp() and sweep all tracked connections. Any connection idle longer than 10 seconds gets closed. The sweep is O(max_clients) but with 256 slots and a 1-second interval, that's negligible. Production servers would use a timer wheel or sorted expiry queue for thousands of connections, but the principle is the same.
Right -- on to SIMD ;-)
What SIMD actually is
Your CPU has two kinds of arithmetic. Scalar arithmetic operates on single values: load one number from memory, add another number to it, store the result. The CPU has general-purpose registers (64-bit on modern x86) and each arithmetic instruction operates on one register.
SIMD arithmetic operates on multiple values packed into a single wide register. An SSE register is 128 bits wide -- that's 4x 32-bit floats, or 2x 64-bit doubles, or 16x 8-bit integers. One instruction adds all 4 floats simultaneously. Same clock cycle, 4x the work. AVX doubles this to 256 bits (8 floats), and AVX-512 doubles again to 512 bits (16 floats).
The speedup isn't theoretical -- it's a direct hardware capability. Your CPU literally has wider datapaths dedicated to SIMD. When you see benchmarks showing 4x or 8x speedups on array operations, that's not clever algorithmic optimization -- it's using the hardware that was sitting there unused.
Where does SIMD matter in practice? Image processing (apply the same operation to every pixel), audio processing (mix/filter samples), physics simulations (update particle positions), cryptography (hash multiple blocks), game engines (transform vertices), scientific computing (matrix operations), and data processing (compare/filter/aggregate arrays). Any time you apply the same operation to many independent data elements, SIMD can help.
In C, SIMD programming traditionally means writing intrinsics -- platform-specific function calls like _mm256_add_ps() that map directly to hardware instructions. They're fast but ugly and non-portable. Or you write normal loops and hope the compiler auto-vectorizes them, which it sometimes does and sometimes doesn't, and you never quite know which.
Zig gives you a third option: @Vector, a built-in type that's both readable AND guaranteed to use SIMD when the hardware supports it.
@Vector(N, T) -- the basics
A @Vector(N, T) is a fixed-length SIMD vector of N elements of type T. The element type can be any integer, float, or bool. Any length is allowed, but powers of two (4, 8, 16, ...) map cleanly onto hardware registers and generate the best code.
const std = @import("std");
pub fn main() void {
    // 4-element vector of f32 -- fits in one SSE register
    const a: @Vector(4, f32) = .{ 1.0, 2.0, 3.0, 4.0 };
    const b: @Vector(4, f32) = .{ 10.0, 20.0, 30.0, 40.0 };

    // Element-wise addition -- one SIMD instruction
    const sum = a + b;
    std.debug.print("sum: {d}\n", .{sum}); // { 11, 22, 33, 44 }

    // Element-wise multiplication
    const product = a * b;
    std.debug.print("product: {d}\n", .{product}); // { 10, 40, 90, 160 }

    // Scalar broadcast: every lane gets the same value
    const scale: @Vector(4, f32) = @splat(2.0);
    const doubled = a * scale;
    std.debug.print("doubled: {d}\n", .{doubled}); // { 2, 4, 6, 8 }
}
That's it. Normal arithmetic operators (+, -, *, /), comparison operators (<, >, ==), and bitwise operators (&, |, ^, ~) all work element-wise on vectors. No special syntax, no function calls. The compiler emits addps, mulps, or whatever the target instruction is. If the target doesn't have SIMD (like an old embedded CPU), Zig falls back to scalar operations automatically -- your code still compiles and runs correctly.
The @splat builtin creates a vector where every element has the same value. You use it when you want to apply a scalar operation across all lanes -- like multiplying every element by 2.0 in the example above. This is such a common pattern that most SIMD instruction sets have dedicated broadcast instructions for it.
Element-wise arithmetic and comparisons
All standard operators work element-wise. This includes the comparison operators, which return a @Vector(N, bool) -- a vector of booleans, one per lane:
const std = @import("std");
pub fn main() void {
    const a: @Vector(4, i32) = .{ 10, 20, 30, 40 };
    const b: @Vector(4, i32) = .{ 15, 15, 15, 15 };

    // Arithmetic
    const added = a + b; // { 25, 35, 45, 55 }
    const subbed = a - b; // { -5, 5, 15, 25 }
    const negated = -a; // { -10, -20, -30, -40 }
    std.debug.print("add: {d}\n", .{added});
    std.debug.print("sub: {d}\n", .{subbed});
    std.debug.print("neg: {d}\n", .{negated});

    // Comparison: returns @Vector(4, bool)
    const mask = a > b; // { false, true, true, true }
    std.debug.print("a > b: {any}\n", .{mask});

    // Bitwise (integer vectors)
    const x: @Vector(4, u8) = .{ 0xFF, 0x0F, 0xAA, 0x55 };
    const y: @Vector(4, u8) = .{ 0x0F, 0xF0, 0x55, 0xAA };
    const and_result = x & y; // { 0x0F, 0x00, 0x00, 0x00 }
    const xor_result = x ^ y; // { 0xF0, 0xFF, 0xFF, 0xFF }
    std.debug.print("AND: {x}\n", .{and_result});
    std.debug.print("XOR: {x}\n", .{xor_result});
}
The comparison mask (@Vector(N, bool)) is useful for conditional operations. In scalar code you'd write if (a > b) x else y. In SIMD you use @select:
const std = @import("std");
pub fn main() void {
    const a: @Vector(4, f32) = .{ 1.0, 5.0, 3.0, 7.0 };
    const b: @Vector(4, f32) = .{ 4.0, 2.0, 6.0, 1.0 };

    // Element-wise max: pick the larger of each pair
    const mask = a > b;
    const max_vals = @select(f32, mask, a, b);
    // mask = { false, true, false, true }
    // result: { 4.0, 5.0, 6.0, 7.0 }
    std.debug.print("max: {d}\n", .{max_vals});

    // Clamp values to range [2.0, 5.0]
    const lo: @Vector(4, f32) = @splat(2.0);
    const hi: @Vector(4, f32) = @splat(5.0);
    const clamped = @min(@max(a, lo), hi);
    std.debug.print("clamped: {d}\n", .{clamped}); // { 2, 5, 3, 5 }
}
@select(T, mask, a, b) picks from a where mask is true, from b where false. This compiles to a single SIMD blend or select instruction. No branches, no mispredictions. On modern CPUs, a branch mispredict costs 15-20 cycles. A SIMD select costs 1 cycle. When you're processing millions of elements, that difference is massive.
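To make the branch-free idea concrete, here's a small sketch of my own (not from the Zig docs): a vectorized ReLU -- zero out the negative lanes -- implemented with @select and no branches at all:

```zig
const std = @import("std");

pub fn main() void {
    const x: @Vector(8, f32) = .{ -2.0, 3.0, -0.5, 7.0, 0.0, -9.0, 1.5, -1.0 };
    const zero: @Vector(8, f32) = @splat(0.0);

    // Keep x where it's positive, substitute 0 everywhere else --
    // one compare plus one blend, no branch per element
    const relu = @select(f32, x > zero, x, zero);
    std.debug.print("relu: {d}\n", .{relu}); // { 0, 3, 0, 7, 0, 0, 1.5, 0 }
}
```

For this particular case @max(x, zero) compiles to the same thing; @select is the general tool for when the two sources aren't simply min/max candidates.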
Loading and storing from slices
Real data lives in slices and arrays, not in vector literals. Zig provides clean conversions between slices and vectors:
const std = @import("std");
pub fn main() void {
    var data = [_]f32{ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 };

    // Load 4 elements from the slice into a vector
    const v1: @Vector(4, f32) = data[0..4].*;
    const v2: @Vector(4, f32) = data[4..8].*;
    std.debug.print("v1: {d}\n", .{v1}); // { 1, 2, 3, 4 }
    std.debug.print("v2: {d}\n", .{v2}); // { 5, 6, 7, 8 }

    // Process
    const result = v1 + v2; // { 6, 8, 10, 12 }

    // Store back to a slice
    data[0..4].* = result;
    std.debug.print("data after store: {d}\n", .{data});
    // { 6, 8, 10, 12, 5, 6, 7, 8 }
}
The data[0..4].* syntax dereferences a length-4 array pointer, which Zig implicitly coerces to a @Vector(4, f32). Going the other way -- storing a vector back into a slice -- works the same way with assignment. No memcpy, no conversion functions. The compiler generates aligned or unaligned SIMD load/store instructions depending on the data's alignment.
Alignment matters for SIMD. Aligned loads (where the address is a multiple of the vector width) are faster on many CPUs, especially older ones. Modern Intel CPUs handle unaligned loads with little penalty, but ARM NEON still benefits from alignment. If your hot loop processes millions of vectors, aligning your data to 16 or 32 bytes can make a measurable difference:
const std = @import("std");
pub fn main() void {
    // Force 32-byte alignment (good for AVX)
    var data: [8]f32 align(32) = .{ 1, 2, 3, 4, 5, 6, 7, 8 };
    const v: @Vector(8, f32) = data[0..8].*;
    const two: @Vector(8, f32) = @splat(2.0);
    const doubled = v * two;
    data[0..8].* = doubled;
    std.debug.print("doubled: {d}\n", .{data});
}
The align(32) forces the array to start at a 32-byte boundary. The compiler can then use aligned AVX instructions (vmovaps instead of vmovups). In most real-world code the difference is small on modern Intel, but on ARM or in extremely latency-sensitive paths (audio callbacks, game physics) it can matter.
Reductions: collapsing a vector to a scalar
SIMD arithmetic gives you element-wise results. But often you need a single answer from a vector -- the sum of all elements, the minimum, the maximum. These are reductions, and Zig provides builtin functions for them:
const std = @import("std");
pub fn main() void {
    const v: @Vector(8, f32) = .{ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 };

    // Sum all elements
    const sum = @reduce(.Add, v);
    std.debug.print("sum: {d}\n", .{sum}); // 36.0

    // Minimum element
    const min_val = @reduce(.Min, v);
    std.debug.print("min: {d}\n", .{min_val}); // 1.0

    // Maximum element
    const max_val = @reduce(.Max, v);
    std.debug.print("max: {d}\n", .{max_val}); // 8.0

    // Works with integers too
    const ints: @Vector(4, i32) = .{ 10, -5, 30, -15 };
    std.debug.print("int sum: {d}\n", .{@reduce(.Add, ints)}); // 20
    std.debug.print("int min: {d}\n", .{@reduce(.Min, ints)}); // -15
    std.debug.print("int max: {d}\n", .{@reduce(.Max, ints)}); // 30

    // Multiply all elements
    const product = @reduce(.Mul, v);
    std.debug.print("product: {d}\n", .{product}); // 40320.0

    // Boolean reductions: AND (all true?) and OR (any true?)
    const mask: @Vector(4, bool) = .{ true, true, false, true };
    const all = @reduce(.And, mask); // false
    const any = @reduce(.Or, mask); // true
    std.debug.print("all: {}, any: {}\n", .{ all, any });
}
@reduce takes an operation (.Add, .Mul, .Min, .Max, .And, .Or, .Xor) and a vector, and collapses it to a single scalar. The compiler uses horizontal SIMD instructions when available (like haddps on SSE3) or a log2 tree of pairwise operations.
The boolean reductions are particularly handy for search operations -- "is ANY element in this vector equal to the target?" is a common pattern in string search, data filtering, and bounds checking. We'll see this in the practical example later.
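As a small taste of that pattern, here's a sketch (the helper name containsByte is my own) that checks 16 bytes for a match using one vector compare and one boolean reduction:

```zig
const std = @import("std");

// Does any of the 16 bytes equal `needle`? One compare, one reduce.
fn containsByte(haystack: *const [16]u8, needle: u8) bool {
    const v: @Vector(16, u8) = haystack.*;
    const target: @Vector(16, u8) = @splat(needle);
    // v == target yields a @Vector(16, bool); .Or asks "any lane true?"
    return @reduce(.Or, v == target);
}

pub fn main() void {
    const block = "find the needle!".*; // exactly 16 bytes
    std.debug.print("has 'd': {}\n", .{containsByte(&block, 'd')}); // true
    std.debug.print("has 'z': {}\n", .{containsByte(&block, 'z')}); // false
}
```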
Shuffling and selecting
Sometimes you need to rearrange elements within a vector or between vectors. Zig provides @shuffle for this:
const std = @import("std");
pub fn main() void {
    const a: @Vector(4, f32) = .{ 10.0, 20.0, 30.0, 40.0 };
    const b: @Vector(4, f32) = .{ 50.0, 60.0, 70.0, 80.0 };

    // Reverse a vector. The mask indices select from 'a';
    // with no second source needed, pass undefined for it.
    const reversed = @shuffle(f32, a, undefined, [4]i32{ 3, 2, 1, 0 });
    std.debug.print("reversed: {d}\n", .{reversed}); // { 40, 30, 20, 10 }

    // Interleave: positive index N picks element N from 'a';
    // ~N (bitwise NOT, i.e. a negative index) picks element N from 'b'
    const interleaved = @shuffle(f32, a, b, [4]i32{ 0, ~@as(i32, 0), 1, ~@as(i32, 1) });
    std.debug.print("interleaved: {d}\n", .{interleaved}); // { 10, 50, 20, 60 }

    // Broadcast one element to all lanes
    const broadcast = @shuffle(f32, a, undefined, [4]i32{ 2, 2, 2, 2 });
    std.debug.print("broadcast [2]: {d}\n", .{broadcast}); // { 30, 30, 30, 30 }

    // Widen: duplicate each element (useful for upsampling)
    const wide = @shuffle(f32, a, undefined, [8]i32{ 0, 0, 1, 1, 2, 2, 3, 3 });
    std.debug.print("widened: {d}\n", .{wide});
    // { 10, 10, 20, 20, 30, 30, 40, 40 }
}
@shuffle takes two source vectors and an index array. Positive indices select from the first vector, indices formed with bitwise NOT (~@as(i32, N)) select from the second. When you don't need a second source, pass undefined.
Shuffles compile to hardware shuffle instructions (shufps, vpermps, etc.) which execute in 1-3 cycles. They're how you do things like matrix transpositions, AoS-to-SoA conversions, and interleaving/deinterleaving for audio processing. If you've ever worked with audio interleaved as LRLRLR and needed it as LLLL RRRR, a shuffle does that in one instruction per vector.
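For instance, here's a minimal deinterleave sketch of my own, using 4-sample frames for brevity: even mask indices pull out the left channel, odd ones the right.

```zig
const std = @import("std");

pub fn main() void {
    // Interleaved stereo: L R L R L R L R
    const lrlr: @Vector(8, f32) = .{ 1.0, -1.0, 2.0, -2.0, 3.0, -3.0, 4.0, -4.0 };

    // Even source indices are the left channel, odd indices the right
    const left = @shuffle(f32, lrlr, undefined, [4]i32{ 0, 2, 4, 6 });
    const right = @shuffle(f32, lrlr, undefined, [4]i32{ 1, 3, 5, 7 });

    std.debug.print("left:  {d}\n", .{left}); // { 1, 2, 3, 4 }
    std.debug.print("right: {d}\n", .{right}); // { -1, -2, -3, -4 }
}
```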
Practical example: fast array sum
Let's put it all together with a real benchmark. Summing a large array is the "hello world" of SIMD -- simple enough to understand, but the speedup is real and measurable:
const std = @import("std");
fn scalarSum(data: []const f32) f32 {
    var total: f32 = 0;
    for (data) |val| {
        total += val;
    }
    return total;
}

fn simdSum(data: []const f32) f32 {
    const vec_len = 8; // AVX: 8 floats at a time
    var accum: @Vector(vec_len, f32) = @splat(0.0);

    // Process 8 elements per iteration
    var i: usize = 0;
    while (i + vec_len <= data.len) : (i += vec_len) {
        const chunk: @Vector(vec_len, f32) = data[i..][0..vec_len].*;
        accum += chunk;
    }

    // Reduce the accumulator to a scalar
    var total = @reduce(.Add, accum);

    // Handle remaining elements (< 8)
    while (i < data.len) : (i += 1) {
        total += data[i];
    }
    return total;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const n: usize = 1_000_000;
    const data = try allocator.alloc(f32, n);
    defer allocator.free(data);

    // Fill with test data
    for (data, 0..) |*val, idx| {
        val.* = @floatFromInt(idx % 1000);
    }

    // Benchmark scalar
    var timer = try std.time.Timer.start();
    const scalar_result = scalarSum(data);
    const scalar_ns = timer.read();

    // Benchmark SIMD
    timer.reset();
    const simd_result = simdSum(data);
    const simd_ns = timer.read();

    std.debug.print("Scalar sum: {d:.1} ({d}ns)\n", .{ scalar_result, scalar_ns });
    std.debug.print("SIMD sum:   {d:.1} ({d}ns)\n", .{ simd_result, simd_ns });
    std.debug.print("Speedup: {d:.1}x\n", .{
        @as(f64, @floatFromInt(scalar_ns)) / @as(f64, @floatFromInt(simd_ns)),
    });
}
The SIMD version uses an accumulator vector of 8 floats. Each iteration loads 8 consecutive floats from the array and adds them to the accumulator. After the loop, @reduce(.Add, accum) collapses the 8 partial sums into one. The tail loop handles any remaining elements that don't fill a full vector.
The speedup you get depends on your CPU and the data size. On a modern x86 with AVX, expect roughly 4-6x for this specific pattern. It's not exactly 8x because @reduce has some overhead and memory bandwidth limits throughput at large sizes. (Note also that the SIMD version adds elements in a different order, so the floating-point result can differ slightly from the scalar sum -- expected, and harmless here.) But 4-6x from changing maybe 10 lines of code? That's the appeal of SIMD.
Notice the pattern: accumulate into a vector, reduce at the end. This is better than reducing after every load, because horizontal reduction is expensive (it needs to shuffle data between lanes). By accumulating across the full loop, you only pay the reduction cost once.
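One optional refinement on top of this -- not something the benchmark above needs, but a common next step: use two independent accumulators so consecutive vector adds don't wait on each other. A vector add has a few cycles of latency, and a single accumulator chains every iteration on the previous one. A sketch (simdSum2 is my own name):

```zig
const std = @import("std");

// Two independent accumulators let the CPU overlap vector adds
// instead of serializing them on one register's dependency chain.
fn simdSum2(data: []const f32) f32 {
    const vl = 8;
    var acc0: @Vector(vl, f32) = @splat(0.0);
    var acc1: @Vector(vl, f32) = @splat(0.0);

    var i: usize = 0;
    while (i + 2 * vl <= data.len) : (i += 2 * vl) {
        const c0: @Vector(vl, f32) = data[i..][0..vl].*;
        const c1: @Vector(vl, f32) = data[i + vl ..][0..vl].*;
        acc0 += c0; // these two adds have no dependency on each other
        acc1 += c1;
    }

    // Combine the partial accumulators, then reduce once
    var total = @reduce(.Add, acc0 + acc1);
    while (i < data.len) : (i += 1) total += data[i];
    return total;
}

pub fn main() void {
    var data: [20]f32 = undefined;
    for (&data, 0..) |*v, idx| v.* = @floatFromInt(idx);
    std.debug.print("sum: {d}\n", .{simdSum2(&data)}); // 190
}
```

Whether this helps depends on how memory-bound the loop already is; it pays off most when the data is hot in cache.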
Practical example: dot product
The dot product of two arrays is sum(a[i] * b[i]) -- multiply corresponding elements, then sum the products. It shows up everywhere: machine learning (matrix multiplication is just many dot products), physics (projections, reflections), signal processing (convolution, correlation), graphics (lighting calculations).
const std = @import("std");
fn scalarDot(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);
    var total: f32 = 0;
    for (a, b) |ai, bi| {
        total += ai * bi;
    }
    return total;
}

fn simdDot(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);
    const vec_len = 8;
    var accum: @Vector(vec_len, f32) = @splat(0.0);

    var i: usize = 0;
    while (i + vec_len <= a.len) : (i += vec_len) {
        const va: @Vector(vec_len, f32) = a[i..][0..vec_len].*;
        const vb: @Vector(vec_len, f32) = b[i..][0..vec_len].*;
        accum += va * vb; // fused multiply-add on modern CPUs
    }

    var total = @reduce(.Add, accum);
    while (i < a.len) : (i += 1) {
        total += a[i] * b[i];
    }
    return total;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const n: usize = 100_000;
    const a = try allocator.alloc(f32, n);
    defer allocator.free(a);
    const b = try allocator.alloc(f32, n);
    defer allocator.free(b);
    for (a, b, 0..) |*ai, *bi, idx| {
        ai.* = @floatFromInt(idx % 100);
        bi.* = @floatFromInt((idx * 7 + 3) % 100);
    }

    var timer = try std.time.Timer.start();
    const scalar_result = scalarDot(a, b);
    const scalar_ns = timer.read();
    timer.reset();
    const simd_result = simdDot(a, b);
    const simd_ns = timer.read();

    std.debug.print("Scalar dot: {d:.1} ({d}ns)\n", .{ scalar_result, scalar_ns });
    std.debug.print("SIMD dot:   {d:.1} ({d}ns)\n", .{ simd_result, simd_ns });
    std.debug.print("Speedup: {d:.1}x\n", .{
        @as(f64, @floatFromInt(scalar_ns)) / @as(f64, @floatFromInt(simd_ns)),
    });
}
The va * vb in the SIMD version does 8 multiplications in one instruction. On CPUs with FMA (fused multiply-add, which is basically everything since Haswell/2013), the compiler may combine the multiply and add into a single vfmadd instruction -- two operations per element per cycle. The scalar version does one multiply and one add per iteration.
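If you want the fusion guaranteed rather than left to the optimizer, Zig's @mulAdd builtin also works lane-wise on vectors and is defined to round only once. A sketch (note the caveat: on hardware without FMA, the single-rounding semantics must be emulated, which can actually be slower):

```zig
const std = @import("std");

pub fn main() void {
    const a: @Vector(4, f32) = .{ 1.0, 2.0, 3.0, 4.0 };
    const b: @Vector(4, f32) = .{ 10.0, 10.0, 10.0, 10.0 };
    var acc: @Vector(4, f32) = @splat(5.0);

    // acc = a * b + acc, computed with a single rounding per lane
    acc = @mulAdd(@Vector(4, f32), a, b, acc);
    std.debug.print("fma: {d}\n", .{acc}); // { 15, 25, 35, 45 }
}
```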
The dot product is the inner kernel of matrix multiplication, which is the inner kernel of neural network inference, which is the inner kernel of most modern AI workloads. Understanding SIMD dot products gives you a direct view into why GPUs (which are essentially massive SIMD engines) are so critical for AI.
Auto-vectorization vs explicit SIMD
A fair question: why bother with explicit @Vector when the compiler can auto-vectorize scalar loops? Let me show you when each approach makes sense.
Auto-vectorization is when the compiler looks at your scalar loop and decides to use SIMD instructions on its own. Zig (via LLVM) does this when:
- The loop is simple (no function calls, no complex control flow)
- Elements are independent (no loop-carried dependencies beyond simple reductions)
- The compiler can prove there's no aliasing (pointers don't overlap)
const std = @import("std");
// The compiler WILL auto-vectorize this in ReleaseFast
fn addArrays(dst: []f32, a: []const f32, b: []const f32) void {
    for (dst, a, b) |*d, ai, bi| {
        d.* = ai + bi;
    }
}

// The compiler CANNOT auto-vectorize this (loop-carried dependency)
fn runningTotal(data: []f32) void {
    var total: f32 = 0;
    for (data) |*val| {
        total += val.*;
        val.* = total; // each element depends on the previous
    }
}

pub fn main() void {
    var a = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8 };
    const b = [_]f32{ 10, 20, 30, 40, 50, 60, 70, 80 };
    var dst: [8]f32 = undefined;
    addArrays(&dst, &a, &b);
    std.debug.print("added: {d}\n", .{dst});
    runningTotal(&a);
    std.debug.print("running total: {d}\n", .{a});
}
For addArrays, the compiler (in ReleaseFast mode) will generate SIMD instructions automatically. You get the speedup for free. For runningTotal, auto-vectorization is impossible because each iteration depends on the previous one -- there's a true data dependency.
When to use explicit @Vector:
- The compiler doesn't auto-vectorize your loop (check with zig build-exe -O ReleaseFast --verbose-llvm-ir and look for vector types)
- You need a specific vector width or reduction pattern
- You're writing a library that MUST use SIMD regardless of compiler heuristics
- You need shuffles, selects, or other operations that auto-vectorization won't infer
- Performance is critical enough that you want deterministic SIMD, not "maybe the compiler will do it"
When auto-vectorization is fine:
- Simple element-wise transforms (add, multiply, convert)
- You're writing application code where 80% of the speedup comes from algorithmic choices, not SIMD
- The code is clearer in scalar form and the compiler handles it well
Having said that, if you care about SIMD performance, always verify what the compiler emits. I've seen too many cases where developers assumed auto-vectorization was happening and it wasn't. The compiler has heuristics, and heuristics are wrong sometimes. Explicit @Vector removes the guesswork ;-)
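To make the contrast concrete, here's a sketch of addArrays written with explicit @Vector (the addArraysSimd name is mine). The vector operations are part of the source, so they're emitted in every build mode, whether or not LLVM's heuristics cooperate:

```zig
const std = @import("std");

// Explicit 8-wide version of addArrays. Unlike the scalar loop,
// this does not depend on the auto-vectorizer firing.
fn addArraysSimd(dst: []f32, a: []const f32, b: []const f32) void {
    const vec_len = 8;
    var i: usize = 0;
    while (i + vec_len <= dst.len) : (i += vec_len) {
        const va: @Vector(vec_len, f32) = a[i..][0..vec_len].*;
        const vb: @Vector(vec_len, f32) = b[i..][0..vec_len].*;
        dst[i..][0..vec_len].* = va + vb;
    }
    while (i < dst.len) : (i += 1) { // scalar tail
        dst[i] = a[i] + b[i];
    }
}

pub fn main() void {
    const a = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    const b = [_]f32{ 10, 20, 30, 40, 50, 60, 70, 80, 90 };
    var dst: [9]f32 = undefined;
    addArraysSimd(&dst, &a, &b);
    std.debug.print("added: {d}\n", .{dst});
}
```

Note the deliberately awkward length of 9: the explicit version has to handle the tail itself, which is the price you pay for deterministic SIMD.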
Practical pattern: SIMD byte search
Here's a more advanced example that shows SIMD at its most useful -- searching for a byte in a buffer. This is the core of memchr, string search, newline counting, and many parsing operations:
const std = @import("std");
fn simdFindByte(haystack: []const u8, needle: u8) ?usize {
const vec_len = 16; // 128-bit SSE
const needle_vec: @Vector(vec_len, u8) = @splat(needle);
var offset: usize = 0;
while (offset + vec_len <= haystack.len) : (offset += vec_len) {
const chunk: @Vector(vec_len, u8) = haystack[offset..][0..vec_len].*;
const matches = chunk == needle_vec; // @Vector(16, bool)
if (@reduce(.Or, matches)) {
// At least one match in this chunk -- find which one
// Convert bool mask to integer for bit scanning
const mask: u16 = @bitCast(matches);
const pos = @ctz(mask);
return offset + pos;
}
}
// Scalar tail
while (offset < haystack.len) : (offset += 1) {
if (haystack[offset] == needle) return offset;
}
return null;
}
pub fn main() void {
const data = "The quick brown fox jumps over the lazy dog";
if (simdFindByte(data, 'f')) |pos| {
std.debug.print("Found 'f' at position {d}\n", .{pos});
}
if (simdFindByte(data, 'z')) |pos| {
std.debug.print("Found 'z' at position {d}\n", .{pos});
}
if (simdFindByte(data, '!')) |_| {
std.debug.print("Found '!'\n", .{});
} else {
std.debug.print("'!' not found\n", .{});
}
}
This compares 16 bytes against the needle simultaneously. The == comparison produces a vector of bools, @reduce(.Or, ...) checks if ANY matched, and if so we convert the bool vector to a bitmask and use @ctz (count trailing zeros) to find the first match position. The entire 16-byte comparison and test takes ~3 cycles. A scalar loop checking one byte at a time would take 16 iterations.
This pattern -- broadcast a target, compare against chunks, bitmask, find first -- is the foundation of high-performance string processing. Glibc's memchr uses exactly this approach. JSON parsers use it to find delimiters. CSV parsers use it to find commas and newlines. If you're processing large amounts of text or binary data in Zig, this is one of the first optimizations worth reaching for.
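The same bitmask supports more than "find first". Because mask & (mask - 1) clears the lowest set bit, you can walk every match in a chunk -- exactly what a delimiter scanner needs. A sketch, with a function name (simdPrintMatches) of my own invention:

```zig
const std = @import("std");

// Report every occurrence of `needle`, not just the first. Inside a
// matching chunk, `mask &= mask - 1` clears the lowest set bit, so
// @ctz yields each successive match without rescanning the chunk.
fn simdPrintMatches(haystack: []const u8, needle: u8) usize {
    const vec_len = 16;
    const needle_vec: @Vector(vec_len, u8) = @splat(needle);
    var count: usize = 0;
    var offset: usize = 0;
    while (offset + vec_len <= haystack.len) : (offset += vec_len) {
        const chunk: @Vector(vec_len, u8) = haystack[offset..][0..vec_len].*;
        var mask: u16 = @bitCast(chunk == needle_vec);
        while (mask != 0) {
            std.debug.print("match at {d}\n", .{offset + @ctz(mask)});
            count += 1;
            mask &= mask - 1; // clear lowest set bit
        }
    }
    while (offset < haystack.len) : (offset += 1) { // scalar tail
        if (haystack[offset] == needle) {
            std.debug.print("match at {d}\n", .{offset});
            count += 1;
        }
    }
    return count;
}

pub fn main() void {
    const n = simdPrintMatches("one,two,three,four", ',');
    std.debug.print("{d} commas\n", .{n});
}
```

This is the shape of the hot loop in a CSV or JSON tokenizer: one 16-byte compare, then a handful of cheap bit operations per delimiter found.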
Exercises
1. Write a simdMin function that finds the minimum value in a []const f32 slice using @Vector(8, f32). Process 8 elements at a time, keeping a running minimum vector (initialized to std.math.inf(f32) via @splat). After the main loop, reduce with @reduce(.Min, ...) and handle the scalar tail. Test it against a scalar version to verify correctness, and benchmark both with a 1 million element array.
2. Write a simdCountByte function that counts how many times a given u8 byte appears in a []const u8 slice. Use 16-wide vectors like the byte search example, but instead of returning on the first match, accumulate the count. Remember that @Vector(16, bool) can be @bitCast to a u16, and @popCount counts set bits. This pattern is the foundation of histogram building and frequency analysis.
3. Write a function that takes two []const f32 slices representing 2D points (one slice for x coordinates, one for y coordinates) and computes the distance of each point from the origin using sqrt(x*x + y*y). Store results in an output slice. Use @Vector(8, f32) and note that Zig provides @sqrt, which works on vectors. Compare execution time against a scalar version for 100,000 points. This "structure of arrays" (SoA) layout, where x and y coordinates live in separate slices, is the standard SIMD-friendly data layout -- as opposed to "array of structs" (AoS), where each point is a struct { x: f32, y: f32 }.
Before you close this tab...
- SIMD (Single Instruction, Multiple Data) processes 4, 8, or 16+ data elements in a single CPU instruction using wide registers (SSE=128bit, AVX=256bit, AVX-512=512bit). The hardware is in every modern CPU -- Zig just makes it easy to use.
- @Vector(N, T) is Zig's built-in SIMD type. All arithmetic, comparison, and bitwise operators work element-wise. The compiler emits native SIMD instructions for your target CPU, or falls back to scalar if SIMD isn't available.
- @splat broadcasts a scalar to all vector lanes. @reduce collapses a vector to a scalar (sum, min, max, product, boolean AND/OR). @shuffle rearranges elements within or between vectors. @select picks elements conditionally based on a bool mask.
- Loading from slices uses data[i..][0..N].* and storing uses the reverse assignment. Alignment (align(32)) can improve performance on some architectures.
- The accumulate-then-reduce pattern (process the entire array with vector accumulators, reduce once at the end) is the standard SIMD approach. It minimizes expensive horizontal operations.
- Auto-vectorization works for simple loops but isn't guaranteed. Explicit @Vector gives you deterministic SIMD regardless of compiler heuristics. Verify with --verbose-llvm-ir when performance matters.
- SIMD excels at bulk data processing: array math, image/audio processing, search, physics, cryptography -- anything where the same operation applies to many independent elements.
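Since @shuffle and @select get only a one-line mention in the recap, here's a minimal, self-contained sketch of both with toy values of my own choosing:

```zig
const std = @import("std");

pub fn main() void {
    const a: @Vector(4, f32) = .{ 1, 2, 3, 4 };
    const b: @Vector(4, f32) = .{ -1, -2, -3, -4 };

    // @shuffle: a non-negative mask index i picks a[i];
    // a negative index ~i (that is, -1 - i) picks b[i].
    const mixed: [4]f32 = @shuffle(f32, a, b, @Vector(4, i32){ 0, ~@as(i32, 0), 1, ~@as(i32, 1) });
    std.debug.print("shuffled: {d}\n", .{mixed}); // lanes: 1, -1, 2, -2

    // @select: lane i takes a[i] where pred[i] is true, else b[i].
    const pred: @Vector(4, bool) = .{ true, false, true, false };
    const chosen: [4]f32 = @select(f32, pred, a, b);
    std.debug.print("selected: {d}\n", .{chosen}); // lanes: 1, -2, 3, -4
}
```

The shuffle mask must be comptime-known; @select's predicate can be computed at runtime, which is what makes it useful for branch-free conditional code.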
Next episode we switch gears to data formats -- specifically working with JSON in Zig. If you thought Zig's type system was strict about memory, wait until you see how it handles structured data parsing ;-)