arXiv Sound
Automated posting of sound-related articles uploaded to arxiv.org (eess.AS + cs.SD)
Source: github.com/dsuedholt/bsky-pape…
Inspired by @paperposterbot.bsky.social and twitter.com/ArxivSound
- HiPPO, a hierarchical pronunciation assessment model, evaluates L2 learner proficiency at multiple linguistic levels; a contrastive ordinal regularizer and curriculum learning improve assessment accuracy.
- LGTSE, extended with TripleC Learning and parallel universal training, improves multi-condition target speech extraction and outperforms condition-specific models on Libri2Mix tasks.
- AcuLa aligns audio encoders with medical language models for semantic understanding, improving AUROC on cardio-respiratory tasks from 0.68 to 0.79 and on COVID-19 cough detection from 0.55 to 0.89.
- A contract-driven QoE auditing framework built on MOS regression shows that classical MOS regression is a special case with a degenerate contract set, and that contract-driven quality is more stable than MOS.
- A shared embedding space with an Adaptive Angular Margin (AAM) loss for face and voice features achieved first place in the FAME 2026 challenge with an average Equal Error Rate (EER) of 23.99% (a generic angular-margin loss formula is sketched after this list).
- YingMusic-SVC uses a singing-trained RVC timbre shifter, an F0-aware timbre adaptor, and an energy-balanced rectified flow matching loss to improve timbre similarity, intelligibility, and perceptual naturalness.
- An extended eMoBi-Q model incorporates a nonlinear auditory filterbank and loudness perception to predict binaural audio quality in normal-hearing and hearing-impaired populations.
- A melody-driven SVS framework uses a Diffusion Transformer (DiT) enhanced with a melody extraction module applied to reference audio; Flow-GRPO reinforcement learning improves pronunciation clarity and melodic fidelity.
- M3-TTS, a multi-modal diffusion transformer (MM-DiT) architecture, achieves state-of-the-art non-autoregressive text-to-speech performance with word error rates of 1.36% (English) and 1.31% (Chinese).
- LargeSC combines the Mimi speech codec and the Moshi foundation model with LoRA to achieve adaptive semantic compression and robust transmission over lossy channels, outperforming baselines at bandwidths from 550 bps to 2.06 kbps.
- Machine learning models, particularly logistic regression, predict Bisgaard audiogram types from loudness perception data with reasonable accuracy using PCA feature extraction, supporting remote audiology applications.
- Robust Reward Policy Optimization (RRPO) mitigates reward hacking in emotional TTS by using a hybrid regularization scheme and a robust Reward Model (RM), improving both emotional expressiveness and naturalness.
- A multi-loss learning framework with energy-adaptive mixup and frame-level attention yields state-of-the-art speech emotion recognition performance on the IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE datasets.
- Aliasing-aware Patch Embedding (AaPE), a new patch stem, mitigates aliasing in Transformer-based audio SSL by augmenting patch tokens with features from a complex sinusoidal kernel; yields state-of-the-art performance on some tasks.
- BioMamba, a Mamba-based audio LLM, achieves comparable performance to Transformer-based AVES on bioacoustic tasks with significantly less VRAM usage after pretraining and fine-tuning on the BEANS benchmark.
- A universal harmonic discriminator with a learnable triangular band-pass filter bank is proposed for GAN-based vocoders to improve time-frequency representation; validated on speech and singing datasets.
- Supervised finite scalar quantization (FSQ) methods for semantic speech token extraction outperform unsupervised K-means clustering in child ASR and even surpass continuous representations at ultra-low bitrates (see the FSQ sketch after this list).
- Perceptual evaluation of acoustic level of detail (ALOD) in virtual acoustic environments shows that strong ALOD reduction is feasible while maintaining plausibility, speech intelligibility, and externalization; early reflections' accuracy is less relevant if late reverberation is represented.
- Unsupervised dimensionality reduction methods, PCA and autoencoders, define sonic behavior spaces for quality diversity algorithms; automatic approaches achieve greater diversity than handcrafted spaces, with PCA proving most effective.
- ImageBind-LoRA, leveraging ImageBind with LoRA, demonstrates cross-lingual generalization in face-voice association; fine-tuned on Arabic audio, it achieves an EER of 24.73% on unseen languages.
- Four approaches for dysarthria severity classification were compared using the SAND dataset; a feature-engineered XGBoost ensemble achieved the highest macro-F1 score, while deep learning models offered competitive performance.
- Pianist Transformer, a model for expressive piano performance rendering, uses a unified MIDI data representation and asymmetric architecture; self-supervised pre-training with 10B tokens achieves state-of-the-art performance.
- A generative feedback framework for singing voice synthesis evaluation provides multi-dimensional language and audio critiques using an audio-language model; experiments validate effectiveness for guiding generative model improvement.
- VibOmni, a multi-modal speech enhancement system for earables, uses bone-conducted vibrations captured by IMUs; a novel data augmentation technique generates synthetic vibration data from limited recordings.
- An interactive continual learning framework for singing voice separation allows users to fine-tune a U-Net model by marking false positives; experiments show performance improvements over the base model in various settings.
- Story2MIDI, a Transformer model, generates emotion-aligned music from text using a dataset of text-music pairs evoking similar emotions; evaluations confirm the model's ability to capture intended emotional cues.
- Token-level adaptation of ASR systems improves dysfluency transcription on LibriStutter and KSoF datasets; language-adaptive pretraining and tokenizer analysis address English-centric bias in multilingual systems.
- The Parallel Delayed Memory Unit (PDMU), a delay-gated state-space module, enhances temporal modeling in bio-signals by compressing temporal information using Legendre Memory Units (LMU); demonstrates improved memory capacity and model performance.
- LLM2Fx-Tools is a multimodal tool-calling framework that generates audio effects chains for music post-production using a large language model; validated in style transfer setting.
- Q2D2, a geometry-aware audio codec, uses two-dimensional quantization on structured grids; improves compression efficiency with low token rates and high codebook utilization while maintaining state-of-the-art reconstruction quality.
- Identifiability conditions for acoustic feedback cancellation with the 2ch-AFC algorithm are derived; identifiability can be achieved when the order of the forward path feedforward filter exceeds the AR model order.
- Arabic TTS baselines based on FastPitch were created, adversarial training was introduced to address oversmoothing using cepstral-domain metrics, and synthetic voices were used to improve prosodic diversity.
- CLAM, a dual-stream detection architecture, uses MERT and Wav2Vec2 to detect synthetic music by identifying inconsistencies between vocal and instrumental elements; achieves a state-of-the-art F1 score of 0.925 on the MoM benchmark.
- An explainable multimodal deep learning framework detects lung diseases from respiratory audio signals, integrating a CNN-BiLSTM Attention spectral-temporal encoder with handcrafted acoustic features; achieves 91.21% accuracy.
- Parametric dithering improves ASR input compression at low bitrates; shows CER improvements of 25% at 1-bit resolution.
- The internal representations of a speech enhancement model were probed across SNRs using CKA and diffusion distance; noise levels differentially activate model regions and induce distinct inter-layer dynamics (see the CKA sketch after this list).
- STCTS, a generative semantic compression framework, decomposes speech into text, prosody, and timbre for ultra-low bitrate voice communication; achieves 75x bitrate reduction versus Opus while maintaining perceptual quality.
- Art2Music, a cross-modal framework, generates music from artistic images and text using OpenCLIP, LSTM, and HiFi-GAN; evaluations on ArtiCaps show improvements in multiple metrics, including feeling alignment.
- MoLT uses layer-wise tokens from late transformer layers for parameter- and memory-efficient audio-visual learning; outperforms existing methods on audio-visual benchmarks.
- The HPSU benchmark, comprising 20,000 expert-validated samples, evaluates human-level perception of Speech LLMs, revealing a gap in understanding intentions and emotions in real-world speech despite advances in ASR and SER.
- GRAPAM, a group-aware partial model merging approach, adapts adult-pretrained models to children's speech recognition by clustering children's data and merging partially fine-tuned models, achieving a 6% relative improvement on the MyST corpus.
- A framework for calibrating and fusing EEND models at the probability level improves diarization; calibration substantially improves even individual models, with gains up to 19% on CallHome.
- A hybrid augmentation strategy using deep generative models like diffusion models and traditional methods enhances Southern Resident Killer Whale detection, with diffusion-based augmentation achieving the highest recall (0.87) and a hybrid approach yielding an F1-score of 0.81.
- PURE Codec enhances speech codec learning via progressive unfolding of residual entropy, guiding multi-stage quantization with a pre-trained enhancement model for stable training and improved reconstruction, outperforming RVQ-based codecs.
- Diffusion models enhance speech synthesis but struggle when conditioning deviates from training; GLA-Grad++ improves upon GLA-Grad by applying the correction term once, accelerating generation and improving out-of-domain performance.
- A transformer-based language model trained on discrete representations from a disentangled neural audio codec achieves high-quality bandwidth extension; joint design improves codec structure and transformer modeling.
- HarmonicAttack, an efficient audio watermark removal method, employs a dual-path convolutional autoencoder and GAN-style training to separate watermarks, outperforming previous methods while running in near real time.
- A diffusion model, trained to generate solo vocals conditioned on music mixtures, improves singing voice separation and achieves competitive objective scores against non-generative baselines with supplementary data; iterative sampling allows quality-efficiency control.
- SONAR, a frequency-guided deepfake detector, disentangles audio into low-frequency content and high-frequency residuals via XLSR encoder and SRM filters, using frequency cross-attention and contrastive loss for state-of-the-art performance.
- An acoustic neural network framework trains conventional architectures under physical constraints (non-negative signals/weights, no bias) for speech classification; SincHSRNN achieves high accuracy by combining bandpass filters and hierarchical processing.
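
As background for the angular-margin losses mentioned in the face-voice association entries above, the widely used additive angular-margin (ArcFace-style) softmax is sketched below with scale s and margin m. This is only the generic textbook form; the challenge systems' "Adaptive Angular Margin" variants may modulate the margin, which is not shown here.

```latex
% Generic additive angular-margin softmax (ArcFace-style); s and m are the
% scale and margin hyperparameters, and theta_j is the angle between the
% embedding of sample i and the weight vector of class j.
\mathcal{L}_{\mathrm{AAM}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log \frac{e^{\, s \cos(\theta_{y_i} + m)}}
              {e^{\, s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{\, s \cos \theta_{j}}}
```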
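
The finite scalar quantization mentioned for semantic speech tokens can be illustrated with a minimal NumPy sketch: each latent dimension is bounded and rounded to a small fixed set of levels, and the tuple of rounded indices forms the discrete token. The level counts and tanh bounding below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def fsq_quantize(z, levels=(8, 5, 5, 5)):
    """Finite scalar quantization sketch: bound each latent dimension,
    then round it to one of a small number of fixed levels."""
    z = np.asarray(z, dtype=np.float64)
    L = np.asarray(levels, dtype=np.float64)       # levels per dimension
    half = (L - 1) / 2.0
    bounded = np.tanh(z) * half                    # squash into [-half, half]
    codes = np.round(bounded + half).astype(int)   # integers in {0, ..., L-1}
    quantized = (codes - half) / half              # dequantized values in [-1, 1]
    return quantized, codes

# Example: a 4-dimensional latent maps to one of 8 * 5 * 5 * 5 = 1000 tokens.
zq, idx = fsq_quantize([0.3, -1.2, 2.5, 0.0])
token_id = int(np.ravel_multi_index(idx, (8, 5, 5, 5)))
print(idx, token_id)
```

(The full FSQ formulation also uses a straight-through estimator so gradients can pass through the rounding step during training.)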
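
For the representation-probing item above, linear centered kernel alignment (CKA) compares two sets of layer activations computed on the same inputs; a minimal NumPy sketch follows, with arbitrary placeholder activation shapes.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n, d1) and Y (n, d2),
    computed on the same n inputs; returns a similarity in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2      # HSIC with linear kernels
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

# Example: compare two layers' responses to the same batch of inputs.
rng = np.random.default_rng(0)
act_a = rng.standard_normal((128, 256))
act_b = act_a @ rng.standard_normal((256, 64))   # linearly related to act_a
print(linear_cka(act_a, act_b))   # high for related activations, near 0 for unrelated ones
```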