"How NOT To Program an Out-of-order Vector Processor" slides are public.
static.sched.com/hosted_files...
I only slightly disagree with using segmented load/store for transposes. If you need to transpose from memory, fine, but for a register-to-register transpose, going through memory isn't the best option. I'd use vslide1up/down or, in the future, vpaire/vpairo:
github.com/ved-rivos/ri...
riscv-isa-manual/src/zvzip.adoc at zvzip · ved-rivos/riscv-isa-manual
Fuzzing tip: use VLAs instead of fixed-size buffers or malloc
1. With fixed-size buffers, ASan won't catch everything: an out-of-bounds access that still lands inside the oversized buffer goes unnoticed.
2. VLAs are faster than malloc; in my case I get 15% faster fuzzing.
If VLAs aren't portable enough, just check __STDC_NO_VLA__ and select between the other options.
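A minimal sketch of what such a harness could look like, assuming a libFuzzer-style entry point. `xor_encode` and `MAX_INPUT` are hypothetical stand-ins for the real function under test and for whatever stack budget you are comfortable with. It also assumes your ASan build instruments dynamic stack allocations, so the VLA gets redzones at its exact size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_INPUT 4096 /* hypothetical cap, keeps stack usage bounded */

/* Hypothetical function under test: writes exactly len bytes to out. */
static void xor_encode(const uint8_t *in, size_t len, uint8_t *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = in[i] ^ 0x5a;
}

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    /* A zero-length VLA is undefined behavior, so skip empty inputs. */
    if (size == 0 || size > MAX_INPUT)
        return 0;

#ifndef __STDC_NO_VLA__
    uint8_t out[size];            /* exact-size buffer on the stack */
#else
    uint8_t *out = malloc(size);  /* fallback when VLAs are unavailable */
    if (!out)
        return 0;
#endif

    xor_encode(data, size, out);
    assert(out[0] == (uint8_t)(data[0] ^ 0x5a));

#ifdef __STDC_NO_VLA__
    free(out);
#endif
    return 0;
}
```

Because the buffer is exactly `size` bytes, a write to `out[size]` leaves the allocation instead of landing in unused slack.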
Tenstorrent decided to publish the first benchmark data for Ascalon's RVV implementation using the instruction throughput benchmark of my rvv-bench benchmark suite. <3
camel-cdr.github.io/rvv-bench-re...
Overall, the results look really good so far:
* Most instructions have an inverse throughput of 0.5/1/2/4 for LMUL=1/2/4/8, even vslide1up/down, 64-bit vmulh, viota, vpopc and integer reductions
* 0.5/0.5/1/2 for vector-scalar/immediate compares and 0.5/1/2/- for narrowing instructions (see "Microarchitecture speculations" section)
*correction: 0.5/0.5/2/4 for vector-scalar/immediate compares (0.5/2/4/8 for vector-vector)
So if you are currently involved with ISA-level decisions about inclusion of any pext/pdep-like instructions:
Please consider including SAG/inverse-SAG with bit-reversal of the goats.
No matter which of the two implementation methods you are using: All you need to do is not mask the goat bits.
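To make the request concrete, here is a bit-serial C sketch of one possible reading of "SAG with bit-reversal of the goats". The convention is my assumption, not taken from any spec: sheep are the bits selected by the mask and get packed toward bit 0 in order (what pext would return), goats are the unselected bits and get packed toward the top in reversed order. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed convention: mask bit 1 = sheep, mask bit 0 = goat. */
uint32_t sag_reflect32(uint32_t x, uint32_t mask)
{
    uint32_t out = 0;
    unsigned lo = 0, hi = 31; /* next free sheep / goat position */
    for (unsigned i = 0; i < 32; i++) {
        uint32_t bit = (x >> i) & 1u;
        if ((mask >> i) & 1u)
            out |= bit << lo++;  /* sheep: packed upward from bit 0 */
        else
            out |= bit << hi--;  /* goats: packed downward from bit 31 */
    }
    return out;
}

/* Inverse: scatter the packed bits back to their original positions. */
uint32_t sag_reflect32_inv(uint32_t y, uint32_t mask)
{
    uint32_t out = 0;
    unsigned lo = 0, hi = 31;
    for (unsigned i = 0; i < 32; i++) {
        if ((mask >> i) & 1u)
            out |= ((y >> lo++) & 1u) << i;
        else
            out |= ((y >> hi--) & 1u) << i;
    }
    return out;
}

/* Small self-check of the convention described above. */
void sag_reflect32_demo(void)
{
    /* Low half are sheep and stay put, high half are goats and get reversed. */
    assert(sag_reflect32(0x80001234u, 0x0000FFFFu) == 0x00011234u);
    assert(sag_reflect32_inv(0x00011234u, 0x0000FFFFu) == 0x80001234u);
}
```

With an all-ones mask this is the identity, with an all-zeros mask it is a plain bit reversal.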
TIL about Trace Cache:
www.realworldtech.com/forum/?threa... (thread on Apple's trace cache)
Ventana's Veyron V2/V3 seem to also use something like a trace cache.
Their V2 slides say that they have a macro-op cache equivalent in size to a regular 32 KiB icache.
It can store variable-length entries of up to 48 macro-ops, which can be fused from non-sequential instruction runs by collapsing taken branches.

Ventana’s Second Gen RISC V Processor for Data Center and Other High Performance | Greg Favor
YouTube video by Ventana Micro
The sixth Championship of Branch Prediction (CBP2025) happened a week ago:
ericrotenberg.wordpress.ncsu.edu/cbp2025-work...
Ohh, the talk recordings are on YouTube:
www.youtube.com/watch?v=1lwz...
CBP2025 - Opening Remarks - Rami Sheikh
YouTube video by Rami Sheikh
I wrote a reference implementation for a SAG without bit reflection:
github.com/clairexen/ed...
And I wrote a parametric SAG core for any bit width:
github.com/clairexen/ed...
edu-sag/param.v at main · clairexen/edu-sag
Educational 8-Bit Sheep-And-Goats (SAG) Verilog Reference IP - clairexen/edu-sag
SiFive X280 RVV benchmarks:
camel-cdr.github.io/rvv-bench-re...
Civil was so nice to run my RVV benchmark on the SiFive X280 cores on the Tenstorrent Blackhole.
TIL you can't do forward compatible syscalls with inline assembly because the kernel can decide to clobber architectural state that was added after you wrote the code.
If you use svc with inline assembly, you have to explicitly clobber SVE registers.
Good luck doing this back in 2015 when you wrote
I just had this problem on RISC-V, where I didn't clobber the vector registers and some autovectorized surrounding code broke on a newer kernel version.
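A sketch of what the fixed RISC-V wrapper could look like. `raw_write` is a hypothetical helper, and I'm assuming your compiler accepts `v0` to `v31` as clobber names once the V extension is enabled, and that it already treats inline assembly as invalidating vl/vtype. On any other target the sketch simply falls back to libc's `syscall()`.

```c
#define _DEFAULT_SOURCE /* for syscall() under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: write(2) via a raw syscall.
 * Note the differing error conventions: the ecall path returns -errno,
 * the fallback returns -1 and sets errno. Both are negative on failure. */
static long raw_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
#if defined(__linux__) && defined(__riscv) && defined(__riscv_vector)
    register long a0 __asm__("a0") = fd;
    register long a1 __asm__("a1") = (long)buf;
    register long a2 __asm__("a2") = (long)len;
    register long a7 __asm__("a7") = SYS_write;
    __asm__ volatile("ecall"
        : "+r"(a0)
        : "r"(a1), "r"(a2), "r"(a7)
        : "memory",
          /* The kernel may clobber vector state across the syscall,
           * so tell the compiler not to keep live values in here. */
          "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
          "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
          "v16", "v17", "v18", "v19", "v20", "v21", "v22", "v23",
          "v24", "v25", "v26", "v27", "v28", "v29", "v30", "v31");
    return a0;
#else
    return syscall(SYS_write, fd, buf, len);
#endif
}
```

This only helps for state that existed when the code was written, which is exactly the forward-compatibility problem: a wrapper written before the vector extension could not have listed these registers.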
@clairexen.bsky.social Hi Claire, we are trying to propose some of the dropped bitmanip instructions for RVV:
lists.riscv.org/g/sig-vector...
Since you were deeply involved in the development of the bitmanip spec, I was wondering if you could answer some questions about your bextdep implementation.
[Proposal] Bit Compress & Bit Decompress Instructions for RVV
For Spark/Flink workloads in data centers, reading large-scale Parquet files is often a performance bottleneck. Therefore, adding support for these instructions can effectively fill this gap, ensuring RISC-V's competitiveness with other ISAs.
Sidenote: My pseudocode for the LEB128 decoder using RVV pext/pdep instructions isn't completely correct.
I'll revisit it properly, with spike/qemu implementation, once I finish my project.
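For readers who haven't used them, a scalar, bit-serial C sketch of what bit compress (pext) and bit decompress (pdep) compute. It follows the usual x86 BMI2 pext/pdep semantics; the function names are mine, and this is neither the bextdep hardware algorithm nor the proposed RVV encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Gather the bits of x selected by mask into the low bits of the result. */
uint64_t bit_compress(uint64_t x, uint64_t mask)
{
    uint64_t out = 0;
    for (unsigned k = 0; mask != 0; mask &= mask - 1, k++) {
        uint64_t lowest = mask & (~mask + 1); /* lowest set bit of mask */
        if (x & lowest)
            out |= (uint64_t)1 << k;
    }
    return out;
}

/* Scatter the low bits of x to the positions selected by mask. */
uint64_t bit_decompress(uint64_t x, uint64_t mask)
{
    uint64_t out = 0;
    for (unsigned k = 0; mask != 0; mask &= mask - 1, k++) {
        uint64_t lowest = mask & (~mask + 1);
        if ((x >> k) & 1u)
            out |= lowest;
    }
    return out;
}

/* Small self-check: pick out two nibbles and put them back. */
void bit_compress_demo(void)
{
    assert(bit_compress(0xABCD, 0x0F0F) == 0xBD);
    assert(bit_decompress(0xBD, 0x0F0F) == 0x0B0D);
}
```

Decompress undoes compress only for the bits under the mask; everything else comes back as zero.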
oh no
> When source and destination registers overlap and have different EEW, the instruction is mask- and tail-agnostic, regardless of the setting of the vta and vma bits in vtype.
Looks like GCC generates wrong code, and Clang is too conservative with overlaps and generates redundant moves:
godbolt.org/z/1czr8oGab
just created a bug report:
gcc.gnu.org/bugzilla/sho...
I'll have to check all RVV assembly I've written.
Edit: I thought I found a dav1d bug (vwadd.wx v0, v0, v8), but I didn't notice the .wx, so it wasn't a bug.
I'll have to check the rest of the code later.
Ok, there don't seem to be any bugs related to this in dav1d.
"Efficient Implementation of RISC-V Vector Permutation Instructions" --
arxiv.org/abs/2505.07112
"Efficient Architecture for RISC-V Vector Memory Access" --
arxiv.org/abs/2504.08334
I love how these two were released so close to each other.