HGPU group
High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC
- Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs #CUDA #LLM #Package hgpu.org?p=30520
- BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics #Bioinformatics #AI #LLM #Package hgpu.org?p=30519
- Nsight Python: A Python-First Profiling Toolkit for Seamless GPU Kernel Analysis (Tool) #CUDA #Triton #Profiling #Package hgpu.org?p=30517
- A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization #CUDA #LLM hgpu.org?p=30510
- SynPerf: A Hybrid Analytical-ML Framework for GPU Performance Prediction #Triton #CUDA #Performance #ML hgpu.org?p=30509
- Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10 #CUDA #Performance hgpu.org?p=30508
- The New Compiler Stack: A Survey on the Synergy of LLMs and Compilers #Compilers #LLM hgpu.org?p=30502
- DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation #CodeGeneration #LLM hgpu.org?p=30501
- AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis #Triton #CUDA #CodeGeneration #DSL #LLM hgpu.org?p=30499
- ParaCodex: A Profiling-Guided Autonomous Coding Agent for Reliable Parallel Code Generation and Translation #CUDA #OpenMP #CodeGeneration #LLM #Package hgpu.org?p=30498
- Hardware Acceleration for Neural Networks: A Comprehensive Survey #FPGA #TPU #NeuralNetworks #NN #Survey hgpu.org?p=30496
- Generative Video Compression: Towards 0.01% Compression Rate for Video Transmission #Compression #Video #AI hgpu.org?p=30495
- GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs #CUDA #HIP #HPC #LLM #Performance hgpu.org?p=30494
- KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta #CUDA #Triton #PTX #AI #Meta #LLM hgpu.org?p=30493
- Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation #CUDA #PTX #Triton #ProgrammingLanguages #Package hgpu.org?p=30481
- Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs #CUDA #AI #Memory #Package hgpu.org?p=30480
- AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization #LLM #AI #Performance hgpu.org?p=30479
- Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs #CUDA #ProgrammingLanguages hgpu.org?p=30478
- PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations #CUDA #HIP #HLSL #AI #LLM #NLP hgpu.org?p=30477
- CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning #CUDA #CUBLAS #MatrixMultiplication #Package hgpu.org?p=30469
- BoltzGen: Toward Universal Binder Design #Biology #Bioinformatics #Biomolecules #Package hgpu.org?p=30468
- Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation #CUDA #CodeGeneration #LLM hgpu.org?p=30467
- PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage #CUDA #PyTorch #Databases hgpu.org?p=30466
- ML Inference Scheduling with Predictable Latency #ML #MachineLearning #TaskScheduling hgpu.org?p=30465
- cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution #CUDA #LLM #AI #Package hgpu.org?p=30464
- Accelerating Molecular Simulations with Triton: Fused GPU Kernels for TensorNet Neural Potentials #Triton #CUDA #MolecularDynamics #MD #MolecularSimulations #PyTorch #Chemistry #Biology hgpu.org?p=30453
- Chopper: A Multi-Level GPU Characterization Tool & Derived Insights Into LLM Training Inefficiency #ROCm #LLM hgpu.org?p=30452
- Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters #GPUcluster #TaskScheduling #Package hgpu.org?p=30451
- TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization #Triton #CUDA #PyTorch #Package hgpu.org?p=30450
- tritonBLAS: Triton-based Analytical Approach for GEMM Kernel Parameter Selection #Triton #BLAS #GEMM #AMD #ROCm #HPC #Performance #Package hgpu.org?p=30441
- Targeted Testing of Compiler Optimizations via Grammar-Level Composition Styles #LLVM #Compilers hgpu.org?p=30440
- Decoupled Triton: A Block-Level Decoupled Language for Writing and Exploring Efficient Machine-Learning Kernels #Triton #Compilers #MachineLearning #ML #Thesis hgpu.org?p=30439
- hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware #FPGA #HLS #MachineLearning #ML #DeepLearning #DL #Package hgpu.org?p=30438
- Microbenchmarking NVIDIA’s Blackwell Architecture: An in-depth Architectural Analysis #PTX #CUDA #Benchmarking #Blackwell #HPC hgpu.org?p=30437
- QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation #Triton #CUDA #AI #CodeGeneration #LLM hgpu.org?p=30413
- KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit #Triton #CUDA #LLM #CodeGeneration hgpu.org?p=30412
- ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels #CUDA #AI #Package hgpu.org?p=30409
- Iris: First-Class Multi-GPU Programming Experience in Triton #Triton #HIP #CUDA #Package hgpu.org?p=30375
- ProofWright: Towards Agentic Formal Verification of CUDA #CUDA #LLM #CodeGeneration hgpu.org?p=30374
- AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs #ROCm #CUDA #LLM hgpu.org?p=30373