Filter by Topics
audio generation
speech synthesis
music technology
audio separation
audio understanding
* Equal contribution, Research mentorship

2026

@article{wang2025capspeech,
  title = {{CapSpeech}: Enabling Downstream Applications in Style-Captioned Text-to-Speech},
  author = {Wang, Helin and Hai, Jiarui and Chong, Dading and Thakkar, Karan and Feng, Tiantian and Yang, Dongchao and Lee, Junhyeok and Thebaud, Thomas and Moro-Vel{\'a}zquez, Laureano and Villalba, Jes{\'u}s and Qin, Zengyi and Narayanan, Shrikanth and Elhilali, Mounya and Dehak, Najim},
  journal = {IEEE Transactions on Audio, Speech and Language Processing},
  year = {2026}
}
CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech

[BibTeX]

A style-captioned TTS dataset and framework enabling controllable, style-aware text-to-speech for downstream applications.

Helin Wang*, Jiarui Hai*, Dading Chong, Karan Thakkar, Tiantian Feng, Dongchao Yang, Junhyeok Lee, Thomas Thebaud, Laureano Moro Velazquez, Jesús Villalba, Zengyi Qin, Shrikanth Narayanan, Mounya Elhilali, Najim Dehak

Dataset IEEE Transactions on Audio, Speech and Language Processing (TASLP)

@article{zang2026summary,
  title = {Summary of The Inaugural Music Source Restoration Challenge},
  author = {Zang, Yongyi and Hai, Jiarui and Ge, Wanying and Kong, Qiuqiang and Dai, Zheqi and Wang, Helin and Mitsufuji, Yuki and Plumbley, Mark D},
  journal = {arXiv preprint arXiv:2601.04343},
  year = {2026}
}
Summary of The Inaugural Music Source Restoration Challenge

[BibTeX]

A summary of the inaugural Music Source Restoration Challenge, covering tasks, data, baselines, and key findings.

Yongyi Zang, Jiarui Hai, Wanying Ge, Qiuqiang Kong, Zheqi Dai, Helin Wang, Yuki Mitsufuji, Mark D Plumbley

Challenge IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

@article{wang2026solospeech,
  title = {{SoloSpeech}: Enhancing Intelligibility and Quality in Target Speech Extraction Through a Cascaded Generative Pipeline},
  author = {Wang, Helin and Hai, Jiarui and Yang, Dongchao and Chen, Chen and Li, Kai and Peng, Junyi and Thebaud, Thomas and Moro-Vel{\'a}zquez, Laureano and Villalba, Jes{\'u}s and Dehak, Najim},
  journal = {IEEE Transactions on Audio, Speech and Language Processing},
  year = {2026}
}
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction Through a Cascaded Generative Pipeline

[BibTeX]

A cascaded generative pipeline for target speech extraction that improves intelligibility and perceptual quality.

Helin Wang, Jiarui Hai, Dongchao Yang, Chen Chen, Kai Li, Junyi Peng, Thomas Thebaud, Laureano Moro Velazquez, Jesús Villalba, Najim Dehak

IEEE Transactions on Audio, Speech and Language Processing (TASLP)

2025

@inproceedings{hai2025flexsed,
  title = {{FlexSED}: Towards Open-Vocabulary Sound Event Detection},
  author = {Hai, Jiarui and Wang, Helin and Guo, Weizhe and Elhilali, Mounya},
  booktitle = {2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year = {2025}
}
FlexSED: Towards Open-Vocabulary Sound Event Detection

[BibTeX]

An open-vocabulary sound event detection approach that generalizes to unseen classes via flexible text-conditioned modeling.

Jiarui Hai, Helin Wang, Weizhe Guo, Mounya Elhilali

Spotlight IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

@inproceedings{hai2025ezaudio,
  title = {EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author = {Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  booktitle = {Interspeech 2025},
  year = {2025}
}
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

[BibTeX]

An efficient diffusion transformer that improves text-to-audio generation quality while reducing compute.

Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

Oral Interspeech

@inproceedings{hai2025synsonic,
  title = {SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering},
  author = {Hai, Jiarui and Elhilali, Mounya},
  booktitle = {2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year = {2025}
}
SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering

[BibTeX]

A text-to-audio diffusion ControlNet augmentation pipeline with sample filtering to improve sound event detection.

Jiarui Hai, Mounya Elhilali

IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

@inproceedings{wang2025soloaudio,
  title = {{SoloAudio}: Target Sound Extraction with Language-Oriented Audio Diffusion Transformer},
  author = {Wang, Helin and Hai, Jiarui and Lu, Yen-Ju and Thakkar, Karan and Elhilali, Mounya and Dehak, Najim},
  booktitle = {ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year = {2025}
}
SoloAudio: Target Sound Extraction with Language-Oriented Audio Diffusion Transformer

[BibTeX]

A language-conditioned audio diffusion transformer for target sound extraction from mixtures.

Helin Wang*, Jiarui Hai*, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

2024

DreamVoice: Text-Guided Voice Conversion

[BibTeX]

Text-guided voice conversion that follows natural-language prompts to control voice attributes and speaking style.

Jiarui Hai*, Karan Thakkar*, Helin Wang, Zengyi Qin, Mounya Elhilali

Dataset Interspeech

DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction

[BibTeX]

A diffusion probabilistic model for target sound extraction that separates a desired source from audio mixtures.

Jiarui Hai*, Helin Wang*, Dongchao Yang, Karan Thakkar, Najim Dehak, Mounya Elhilali

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

2023

Diff-Pitcher: Diffusion-based Singing Voice Pitch Correction

[BibTeX]

A diffusion-based method for singing voice pitch correction that adjusts pitch while preserving timbre and expression.

Jiarui Hai, Mounya Elhilali

Oral IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

Boosting Modality Representation with Pre-trained Models and Multi-task Training for Multimodal Sentiment Analysis

[BibTeX]

Multi-task training with pre-trained modality encoders to learn stronger multimodal representations for sentiment analysis.

Jiarui Hai*, Yu-Jeh Liu*, Mounya Elhilali

IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)