Jiarui Hai

Education

Johns Hopkins University

Aug. 2022 - Present

PhD Candidate @ ECE & DSAI

Baltimore, MD

Advised by Prof. Mounya Elhilali

Tsinghua University

Aug. 2020 – Jun. 2022

Master of Engineering

Beijing, China

Tsinghua University

Aug. 2016 – Jun. 2020

Bachelor of Engineering

Beijing, China

Bachelor of Science

Experience

Apple

Apr. 2026 - Present

ML Research Intern

USA

Adobe | CAVA

May 2025 - Feb. 2026

Research Scientist Intern

San Francisco, CA, USA

Tencent | AI Lab

May 2024 - Sep. 2024

Research Scientist Intern

Bellevue, WA, USA

Kuaishou | AI Platform

Aug. 2021 - Feb. 2022

Music Technology Intern

Beijing, China

About Me

I am a fourth-year PhD student at Johns Hopkins University, where my research focuses on audio and speech signal processing, with an emphasis on diffusion-based generative models and multimodal audio–language understanding.

I am also an active music producer, always exploring research projects that unlock new possibilities in audio and music production.

Introducing OpenSound

OpenSound is a community-driven initiative led by researchers from Johns Hopkins University, dedicated to advancing research in audio, speech, and music. The project brings together contributors to explore and develop interactive demos, build and refine models, curate high-quality datasets, and establish meaningful benchmarks for evaluation.

News

Second time at WASPAA, giving an oral presentation of FlexSED.

Oct 2025

Co-founded OpenSound, a community for sharing speech and audio models with interactive demos.

Sep 2024

Attended WASPAA and presented my first first-author paper, Diff-Pitcher.

Oct 2023

Joined LCAP @ Johns Hopkins University as a PhD student.

Aug 2022

* Equal contribution, † Research mentorship

Research Highlights

CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech

A style-captioned TTS dataset and framework enabling controllable, style-aware text-to-speech for downstream applications.

Helin Wang*, Jiarui Hai*, Dading Chong, Karan Thakkar, Tiantian Feng, Dongchao Yang, Junhyeok Lee, Thomas Thebaud, Laureano Moro Velazquez, Jesús Villalba, Zengyi Qin, Shrikanth Narayanan, Mounya Elhilali, Najim Dehak

Dataset | IEEE Transactions on Audio, Speech and Language Processing (TASLP) | 2026

[Paper] [Code] [Homepage] [Live Demo]

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

An efficient diffusion transformer that improves text-to-audio generation quality while reducing compute.

Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

Oral | Interspeech | 2025

[Paper] [Code] [Homepage] [Live Demo]

FlexSED: Towards Open-Vocabulary Sound Event Detection

Diff-Pitcher: Diffusion-based Singing Voice Pitch Correction
