Projects
An online textbook guiding readers from AArch64 assembly all the way to a primitive-based tensor compiler. Covers Arm Neon and the Scalable Matrix Extension, just-in-time code generation, and real hardware.
Explores AMD XDNA1 and XDNA2 NPU microarchitectures and demonstrates writing tensor contraction kernels for them. Covers the VLIW ISA, BF16/BFP16 matrix instructions, and kernels reaching up to 1760 BFP16 GFLOPS.
Documents SME microbenchmarks on Apple M4, the first publicly available silicon with SME support. Achieves 1833 GFLOPS for an FP32 GEMM, and covers JIT primitive generation and upstreaming into LIBXSMM. Companion to an SC'24 paper.
Apps
A Python package built on top of the einsum_ir C++ backend. Provides a Pythonic API to define, configure, optimize, and execute complex tensor contractions and elementwise operations. Supports dimension fusion, splitting, multiple backends, and a built-in contraction optimizer.
A self-contained single-page app that generates etops.TensorOperationConfig objects for the etops Python package. Specify backends, data types, primitive types, dimension types, execution modes, and strides, then export the encoded config string.
A browser-based tool for constructing and inspecting Einstein summation contraction trees. Supports drag-and-drop reordering of tensor indices, dimension type annotation (C/M/N/K), permutation node insertion, and responsive layouts for mobile and desktop.
A browser-based tool for visualizing GEMM benchmark files. Renders interactive D3.js line charts of GFLOPS vs. matrix dimensions, supports dimension filtering, overlays user-defined performance models, and exports charts to SVG or PDF.
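For readers unfamiliar with the GFLOPS figures quoted above, here is a minimal sketch of how the metric is conventionally derived for a GEMM, assuming the usual count of 2·M·N·K floating-point operations (one multiply and one add per inner-product term). The function name `gemm_gflops` and its parameters are hypothetical and not taken from any of the listed projects.

```python
def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Return the GFLOPS rate of an (m x k) by (k x n) matrix multiplication.

    Assumes the conventional count of 2*m*n*k floating-point operations.
    This helper is illustrative only and is not part of any listed tool.
    """
    if seconds <= 0.0:
        raise ValueError("seconds must be positive")
    flops = 2.0 * m * n * k      # one multiply + one add per inner-product term
    return flops / seconds / 1e9  # convert FLOP/s to GFLOPS


if __name__ == "__main__":
    # A 512 x 512 x 512 GEMM measured at one millisecond.
    print(f"{gemm_gflops(512, 512, 512, 1e-3):.2f} GFLOPS")
```

A benchmark file would typically record one such value per (M, N, K) configuration, which is the kind of data a GFLOPS-versus-dimension line chart plots.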