Jiarui Hai
Researcher in Audio, Speech, and Music
Education
  • Johns Hopkins University
    Aug. 2022 - Present
    PhD Candidate @ ECE & DSAI
    Baltimore, MD
    Advised by Prof. Mounya Elhilali
  • Tsinghua University
    Aug. 2020 – Jun. 2020
    Master of Engineering
    Beijing, China
  • Tsinghua University
    Aug. 2016 – Jun. 2020
    Bachelor of Engineering and Bachelor of Science
    Beijing, China
Experience
  • Apple
    Apr. 2026 - Present
    ML Research Intern
    USA
  • Adobe | CAVA
    May 2025 - Feb. 2026
    Research Scientist Intern
    San Francisco, CA, USA
  • Tencent | AI Lab
    May 2024 - Sep. 2025
    Research Scientist Intern
    Bellevue, WA, USA
  • Kuaishou | AI Platform
    Aug. 2024 - Feb. 2025
    Music Technology Intern
    Beijing, China
    About Me

    I am a fourth-year PhD student at Johns Hopkins University, where my research focuses on audio and speech signal processing, with an emphasis on diffusion-based generative models and multimodal audio–language understanding.

    I am also an active music producer, always looking for research projects that unlock new possibilities in audio and music production.

    About OpenSound
    OpenSound is a community-driven initiative led by researchers from Johns Hopkins University, dedicated to advancing research in audio, speech, and music. The project brings together contributors to explore and develop interactive demos, build and refine models, curate high-quality datasets, and establish meaningful benchmarks for evaluation.
    News
    Second time at WASPAA, giving an oral presentation of FlexSED. Oct 2025
    Co-founded OpenSound, a community for sharing speech and audio models with interactive demos. Sep 2024
    Attended WASPAA and presented my first first-author paper, Diff-Pitcher. Oct 2023
    Joined LCAP @ Johns Hopkins University as a PhD student. Aug 2022
    * Equal contribution,
    Research mentorship
    Research Highlights
    @article{wang2025capspeech, title = {Capspeech: Enabling downstream applications in style-captioned text-to-speech}, author = {Wang, Helin and Hai, Jiarui and Chong, Dading and Thakkar, Karan and Feng, Tiantian and Yang, Dongchao and Lee, Junhyeok and Thebaud, Thomas and Velazquez, Laureano Moro and Villalba, Jesus and Qin, Zengyi and Narayanan, Shrikanth and Elhilali, Mounya and Dehak, Najim}, journal = {IEEE Transactions on Audio, Speech and Language Processing}, year = {2026} }
    CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech

    A style-captioned TTS dataset and framework enabling controllable, style-aware text-to-speech for downstream applications.

    Helin Wang*, Jiarui Hai*, Dading Chong, Karan Thakkar, Tiantian Feng, Dongchao Yang, Junhyeok Lee, Thomas Thebaud, Laureano Moro Velazquez, Jesús Villalba, Zengyi Qin, Shrikanth Narayanan, Mounya Elhilali, Najim Dehak

    Dataset | IEEE Transactions on Audio, Speech and Language Processing (TASLP) | 2026

    @inproceedings{hai2025flexsed, title = {Flexsed: Towards open-vocabulary sound event detection}, author = {Hai, Jiarui and Wang, Helin and Guo, Weizhe and Elhilali, Mounya}, booktitle = {2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)}, year = {2025} }
    FlexSED: Towards Open-Vocabulary Sound Event Detection

    An open-vocabulary sound event detection approach that generalizes to unseen classes via flexible text-conditioned modeling.

    Jiarui Hai, Helin Wang, Weizhe Guo, Mounya Elhilali

    Spotlight | IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) | 2025

    @inproceedings{hai2025ezaudio, title = {EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer}, author = {Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong}, booktitle = {Interspeech 2025}, year = {2025} }
    EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

    An efficient diffusion transformer that improves text-to-audio generation quality while reducing compute.

    Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

    Oral | Interspeech | 2025

    Diff-Pitcher: Diffusion-based Singing Voice Pitch Correction

    A diffusion-based method for singing voice pitch correction that adjusts pitch while preserving timbre and expression.

    Jiarui Hai, Mounya Elhilali

    Oral | IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) | 2023
