Harmonizing Emotion and Sound: A Novel Framework for Procedural Sound Generation Based on Emotional Dynamics

Hariyady Hariyady - Universiti Malaysia Sabah, 88400 Kota Kinabalu, Malaysia
Ag Asri Ag Ibrahim - Universiti Malaysia Sabah, 88400 Kota Kinabalu, Malaysia
Jason Teo - Universiti Malaysia Sabah, 88400 Kota Kinabalu, Malaysia
Ahmad Fuzi Md Ajis - UiTM Cawangan Johor, 85000 Segamat, Johor, Malaysia
Azhana Ahmad - Universiti Tenaga Nasional, Putrajaya Campus, Jalan Kajang - Puchong, 43000 Kajang, Selangor, Malaysia
Fouziah Md Yassin - Universiti Malaysia Sabah, 88400 Kota Kinabalu, Malaysia
Carolyn Salimun - Universiti Malaysia Sabah, 88400 Kota Kinabalu, Malaysia
Ng Giap Weng - Universiti Malaysia Sabah, 88400 Kota Kinabalu, Malaysia


DOI: http://dx.doi.org/10.62527/joiv.8.4.3101

Abstract


This work proposes SONEEG, a novel framework for emotion-driven procedural sound generation that merges emotion recognition with dynamic sound synthesis to enhance user experience in interactive digital environments. The framework uses physiological and emotional data, drawing on datasets such as DREAMER and EMOPIA, to generate emotion-adaptive sound. Its primary innovation is the ability to capture emotions dynamically by mapping them onto a circumplex model of valence and arousal for precise classification. A Transformer-based architecture synthesizes sound sequences conditioned on this emotional information, and a procedural audio generation module combines machine learning approaches with granular synthesis, wavetable synthesis, and physical modeling to produce adaptive, personalized soundscapes. A user study with 64 participants evaluated the framework through subjective ratings of sound quality and emotional fidelity. The analysis revealed differences in perceived sound quality across samples: some received consistently high scores while others drew mixed ratings. The emotion recognition model reached 70.3% overall accuracy; it distinguished high-arousal emotions reliably but struggled to separate emotions of similar arousal. The framework is applicable to fields such as healthcare, education, entertainment, and marketing, where real-time emotion recognition can deliver personalized, adaptive experiences. Future work includes incorporating multimodal emotion recognition and richer physiological data to better understand users' emotional states.
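To illustrate the valence-arousal mapping described above, the following Python sketch classifies a (valence, arousal) estimate into one of the four circumplex quadrants and derives a few procedural-synthesis parameters (grain size, grain density, and pitch shift for a granular synthesizer) from that estimate. The quadrant labels, parameter ranges, and function names are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch: map a valence-arousal estimate to a circumplex quadrant
# and derive illustrative granular-synthesis parameters.
# All labels, ranges, and names are assumptions for illustration only.

from dataclasses import dataclass


@dataclass
class GrainParams:
    grain_size_ms: float   # shorter grains at high arousal -> busier texture
    grain_density: float   # grains per second
    pitch_shift: float     # semitones, raised for positive valence


def classify_quadrant(valence: float, arousal: float) -> str:
    """Classify a (valence, arousal) pair in [-1, 1]^2 into a circumplex quadrant."""
    if arousal >= 0:
        return "excited/happy" if valence >= 0 else "angry/tense"
    return "calm/content" if valence >= 0 else "sad/depressed"


def grain_params_from_emotion(valence: float, arousal: float) -> GrainParams:
    """Map arousal to grain size/density and valence to pitch shift (illustrative ranges)."""
    arousal01 = (arousal + 1.0) / 2.0          # rescale arousal to [0, 1]
    grain_size = 200.0 - 150.0 * arousal01     # 200 ms (calm) down to 50 ms (aroused)
    density = 5.0 + 45.0 * arousal01           # 5 to 50 grains per second
    pitch = 4.0 * valence                      # -4 to +4 semitones
    return GrainParams(grain_size, density, pitch)


if __name__ == "__main__":
    v, a = 0.6, 0.8   # e.g., a "happy/excited" estimate from the recognition model
    print(classify_quadrant(v, a))             # -> excited/happy
    print(grain_params_from_emotion(v, a))

In a full system, the emotion recognition model would update (valence, arousal) continuously and the synthesis parameters would be smoothed over time rather than switched per frame; the sketch only shows the shape of the mapping.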

Keywords


Affective Computing; Emotion Recognition; Emotional Dynamics; Artificial Emotional Intelligence; Procedural Sound Generation.


References


F. Cuadrado, I. L. Cobo, T. M. Blanco, and A. Tajadura‐Jiménez, “Arousing the Sound: A Field Study on the Emotional Impact on Children of Arousing Sound Design and 3D Audio Spatialization in an Audio Story,” Frontiers Media, vol. 11, 2020, doi: 10.3389/fpsyg.2020.00737.

B. Kenwright, “There’s More to Sound Than Meets the Ear: Sound in Interactive Environments,” Institute of Electrical and Electronics Engineers, vol. 40, no. 4, pp. 62–70, 2020, doi: 10.1109/mcg.2020.2996371.

F. Abri, L. Gutiérrez, A. S. Namin, D. R. W. Sears, and K. S. Jones, “Predicting Emotions Perceived from Sounds,” Cornell University, 2020. doi: 10.48550/arXiv.2012.02643.

D. Jain et al., “A Taxonomy of Sounds in Virtual Reality,” 2021. doi: 10.1145/3461778.3462106.

Z. Jia, Y. Lin, X. Cai, C. Haobin, G. Haijun, and J. Wang, “SST-EmotionNet: Spatial-Spectral-Temporal based Attention 3D Dense Network for EEG Emotion Recognition,” 2020. doi: 10.1145/3394171.3413724.

P. Thiparpakul, S. Mokekhaow, and K. Supabanpot, “How Can Video Game Atmosphere Affect Audience Emotion with Sound,” 2021. doi: 10.1109/iciet51873.2021.9419652.

D. Williams, “Psychophysiological Approaches to Sound and Music in Games,” Cambridge University Press, 2021. doi: 10.1017/9781108670289.019.

A. Pinilla, J. García, W. L. Raffe, J. Voigt-Antons, R. Spang, and S. Möller, “Affective visualization in Virtual Reality: An integrative review,” Cornell University, 2020. doi: 10.48550/arXiv.2012.08849.

A. Schmitz, C. Holloway, and Y. Cho, “Hearing through Vibrations: Perception of Musical Emotions by Profoundly Deaf People,” Cornell University, 2020. doi: 10.48550/arXiv.2012.13265.

A. N. Nagele et al., “Interactive Audio Augmented Reality in Participatory Performance,” Frontiers Media, vol. 1, 2021, doi: 10.3389/frvir.2020.610320.

M. Geronazzo and S. Serafin, “Sonic Interactions in Virtual Environments: the Egocentric Audio Perspective of the Digital Twin,” Cornell University, 2022. doi: 10.48550/arXiv.2204.09919.

J. Atherton and G. Wang, “Doing vs. Being: A philosophy of design for artful VR,” Routledge, vol. 49, no. 1, pp. 35–59, 2020, doi: 10.1080/09298215.2019.1705862.

E. Svikhnushina and P. Pu, “Social and Emotional Etiquette of Chatbots: A Qualitative Approach to Understanding User Needs and Expectations,” Cornell University, 2020. doi: 10.48550/arXiv.2006.13883.

P. Slovák, A. N. Antle, N. Theofanopoulou, C. D. Roquet, J. J. Gross, and K. Isbister, “Designing for emotion regulation interventions: an agenda for HCI theory and research,” Cornell University, 2022. doi: 10.48550/arXiv.2204.00118.

N. Marhamati and S. C. Creston, “Visual Response to Emotional State of User Interaction,” Cornell University, 2023. doi: 10.48550/arXiv.2303.17608.

X. Mao, W. Yu, K. D. Yamada, and M. R. Zielewski, “Procedural Content Generation via Generative Artificial Intelligence,” Cornell University, 2024. doi: 10.48550/arXiv.2407.09013.

D. Serrano and M. Cartwright, “A General Framework for Learning Procedural Audio Models of Environmental Sounds,” Cornell University, 2023. doi: 10.48550/arXiv.2303.02396.

T. Marrinan, P. Akram, O. Gurmessa, and A. Shishkin, “Leveraging AI to Generate Audio for User-generated Content in Video Games,” Cornell University, 2024. doi: 10.48550/arXiv.2404.

K. Fukaya, D. Daylamani-Zad, and H. Agius, “Intelligent Generation of Graphical Game Assets: A Conceptual Framework and Systematic Review of the State of the Art,” Cornell University, 2023. doi: 10.48550/arXiv.2311.

C. Bossalini, W. L. Raffe, and J. García, “Generative Audio and Real-Time Soundtrack Synthesis in Gaming Environments,” 2020. doi: 10.1145/3441000.3441075.

A. Dash and K. Agres, “AI-Based Affective Music Generation Systems: A Review of Methods, and Challenges,” Cornell University, 2023. doi: 10.48550/arXiv.2301.06890.

A. Kadhim Ali, A. Mohsin Abdullah, and S. Fawzi Raheem, “Impact the Classes’ Number on the Convolutional Neural Networks Performance for Image Classification,” 2023. doi: 10.62527/ijasce.5.2.132.

F. Miladiyenti, F. Rozi, W. Haslina, and D. Marzuki, “Incorporating Mobile-based Artificial Intelligence to English Pronunciation Learning in Tertiary-level Students: Developing Autonomous Learning,” 2022. doi: 10.62527/ijasce.4.3.92.

R. Bansal, “Read it to me: An emotionally aware Speech Narration Application,” Cornell University, 2022. doi: 10.48550/arXiv.2209.02785.

K. Agres, A. Dash, and P. Chua, “AffectMachine-Classical: A novel system for generating affective classical music,” Cornell University, 2023. doi: 10.48550/arXiv.2304.04915.

K. Zhou, B. Şişman, R. Rana, B. W. Schuller, and H. Li, “Speech Synthesis with Mixed Emotions,” Cornell University, 2022. doi: 10.48550/arXiv.2208.05890.

Y. Lei, S. Yang, X. Wang, and L. Xie, “MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis,” Cornell University, 2022. doi: 10.48550/arXiv.2201.06460.

Y. Lei, S. Yang, and L. Xie, “Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis,” Cornell University, 2020. doi: 10.48550/arXiv.2011.08477.

A. Scarlatos, “Sonispace: a simulated-space interface for sound design and experimentation,” Cornell University, 2020. doi: 10.48550/arXiv.2009.14268.

S. Torresin et al., “Acoustics for Supportive and Healthy Buildings: Emerging Themes on Indoor Soundscape Research,” Multidisciplinary Digital Publishing Institute, vol. 12, no. 15, pp. 6054–6054, 2020, doi: 10.3390/su12156054.

E. Easthope, “SnakeSynth: New Interactions for Generative Audio Synthesis,” Cornell University, 2023. doi: 10.48550/arXiv.2307.05830.

S. Afzal, H. Khan, I. Khan, and M. J. Piran, “A Comprehensive Survey on Affective Computing; Challenges, Trends, Applications, and Future Directions,” Cornell University, 2023. doi: 10.48550/arXiv.2305.07665.

K. Makantasis, A. Liapis, and G. N. Yannakakis, “The Pixels and Sounds of Emotion: General-Purpose Representations of Arousal in Games,” Institute of Electrical and Electronics Engineers, vol. 14, no. 1, pp. 680–693, 2023, doi: 10.1109/taffc.2021.3060877.

A. E. Ali, “Designing for Affective Augmentation: Assistive, Harmful, or Unfamiliar?,” Cornell University, 2023. doi: 10.48550/arXiv.2303.18038.

D. Harley, A. P. Tarun, B. J. Stinson, T. Tibu, and A. Mazalek, “Playing by Ear: Designing for the Physical in a Sound-Based Virtual Reality Narrative,” 2021. doi: 10.1145/3430524.3440635.

A. Kern, W. Ellermeier, and L. Jost, “The influence of mood induction by music or a soundscape on presence and emotions in a virtual reality park scenario,” 2020. doi: 10.1145/3411109.3411129.

T. Zhou, Y. Wu, Q. Meng, and J. Kang, “Influence of the Acoustic Environment in Hospital Wards on Patient Physiological and Psychological Indices,” Frontiers Media, vol. 11, 2020, doi: 10.3389/fpsyg.2020.01600.

A. Kern and W. Ellermeier, “Audio in VR: Effects of a Soundscape and Movement-Triggered Step Sounds on Presence,” Frontiers Media, vol. 7, 2020, doi: 10.3389/frobt.2020.00020.

D. Eckhoff, R. Ng, and Á. Cassinelli, “Virtual Reality Therapy for the Psychological Well-being of Palliative Care Patients in Hong Kong,” Cornell University, 2022. doi: 10.48550/arXiv.2207.

G. Nie and Y. Zhan, “A Review of Affective Generation Models,” Cornell University, 2022. doi: 10.48550/arXiv.2202.10763.

Z. Yang, X. Jing, A. Triantafyllopoulos, M. Song, I. Aslan, and B. W. Schuller, “An Overview & Analysis of Sequence-to-Sequence Emotional Voice Conversion,” Cornell University, 2022. doi: 10.48550/arXiv.2203.15873.

H. Hung, J. Ching, S. Doh, N. Kim, J. Nam, and Y. Yang, “EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation,” Cornell University, 2021. doi: 10.48550/arXiv.2108.01374.

S. Cunningham, H. Ridley, J. Weinel, and R. Picking, “Supervised machine learning for audio emotion recognition,” Springer Science+Business Media, vol. 25, no. 4, pp. 637–650, 2020, doi: 10.1007/s00779-020-01389-0.

K. Matsumoto, S. Hara, and M. Abe, “Speech-Like Emotional Sound Generation Using WaveNet,” Institute of Electronics, Information and Communication Engineers, vol. E105.D, no. 9, pp. 1581–1589, 2022, doi: 10.1587/transinf.2021edp7236.

X. Ji et al., “Audio-Driven Emotional Video Portraits,” Cornell University, 2021. doi: 10.48550/arXiv.2104.07452.

N. R. Prabhu, B. Lay, S. Welker, N. Lehmann‐Willenbrock, and T. Gerkmann, “EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data,” Cornell University, 2023. doi: 10.48550/arXiv.2309.07828.

M. Liuni, L. Ardaillon, L. Bonal, L. Seropian, and J. Aucouturier, “ANGUS: Real-time manipulation of vocal roughness for emotional speech transformations,” Cornell University, 2020. doi: 10.48550/arXiv.2008.11241.

V. Isnard, T. Nguyen, and I. Viaud‐Delmon, “Exploiting Voice Transformation to Promote Interaction in Virtual Environments,” 2021. doi: 10.1109/vrw52623.2021.00021.

M. N. Dar, M. U. Akram, S. G. Khawaja, and A. N. Pujari, “CNN and LSTM-based emotion charting using physiological signals,” Sensors (Basel), vol. 20, no. 16, p. 4551, 2020, doi: 10.3390/s20164551.

H. Cui, A. Liu, X. Zhang, X. Chen, K. Wang, and X. Chen, “EEG-based emotion recognition using an end-to-end regional-asymmetric convolutional neural network,” Knowl Based Syst, vol. 205, no. 106243, p. 106243, 2020, doi: 10.1016/j.knosys.2020.106243.

M. Behnke, M. Buchwald, A. Bykowski, S. Kupiński, and L. D. Kaczmarek, “Psychophysiology of positive and negative emotions, dataset of 1157 cases and 8 biosignals,” Sci Data, vol. 9, no. 1, p. 10, 2022, doi: 10.1038/s41597-021-01117-0.

M. S. Khan, N. Salsabil, M. G. R. Alam, M. A. A. Dewan, and M. Z. Uddin, “CNN-XGBoost fusion-based affective state recognition using EEG spectrogram image analysis,” Sci Rep, vol. 12, no. 1, p. 14122, 2022, doi: 10.1038/s41598-022-18257-x.

P. L. Neves, J. Fornari, and J. Florindo, “Generating music with sentiment using Transformer-GANs,” Cornell University, Jan. 2022, doi: 10.48550/arXiv.2212.

S. Ji and X. Yang, “Emotion-Conditioned Melody Harmonization with Hierarchical Variational Autoencoder,” Cornell University, Jan. 2023, doi: 10.48550/arxiv.2306.03718.

D. Andreoletti, L. Luceri, T. Leidi, A. Peternier, and S. Giordano, “The Virtual Emotion Loop: Towards Emotion-Driven Services via Virtual Reality,” Cornell University, Jan. 2021, doi: 10.48550/arxiv.2102.13407.

S. H. Paplu, C. Mishra, and K. Berns, “Real-time Emotion Appraisal with Circumplex Model for Human-Robot Interaction,” Cornell University, Jan. 2022, doi: 10.48550/arxiv.2202.09813.

R. Nandy, K. Nandy, and S. T. Walters, “Relationship Between Valence and Arousal for Subjective Experience in a Real-life Setting for Supportive Housing Residents: Results From an Ecological Momentary Assessment Study,” Jan. 2023, doi: 10.2196/34989.

S. N. Chennoor, B. R. K. Madhur, M. Ali, and T. K. Kumar, “Human Emotion Detection from Audio and Video Signals,” Cornell University, Jan. 2020, doi: 10.48550/arxiv.2006.11871.

M. Singh and Y. Fang, “Emotion Recognition in Audio and Video Using Deep Neural Networks,” Cornell University, Jan. 2020, doi: 10.48550/arxiv.2006.08129.

K. Zhou, B. Şişman, R. Rana, B. W. Schuller, and H. Li, “Emotion Intensity and its Control for Emotional Voice Conversion,” Institute of Electrical and Electronics Engineers, Jan. 2023, doi: 10.1109/taffc.2022.3175578.

H. Tang, X. Zhang, J. Wang, N. Cheng, and J. Xiao, “EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis,” Cornell University, Jan. 2023, doi: 10.48550/arxiv.2306.00648.

W. Peng, Y. Hu, Y. Xie, L. Xing, and Y. Sun, “CogIntAc: Modeling the Relationships between Intention, Emotion and Action in Interactive Process from Cognitive Perspective,” Cornell University, Jan. 2022, doi: 10.48550/arxiv.2205.03540.

E. Osuna, L.-F. Rodríguez, and J. O. Gutiérrez-García, “Toward integrating cognitive components with computational models of emotion using software design patterns,” Elsevier BV, Jan. 2021, doi: 10.1016/j.cogsys.2020.10.004.

K. Opong-Mensah, “Simulation of Human and Artificial Emotion (SHArE),” Oct. 2023. [Online]. Available: https://arxiv.org/pdf/2011.02151.pdf

G. Zhang and et al., “iEmoTTS: Toward Robust Cross-Speaker Emotion Transfer and Control for Speech Synthesis based on Disentanglement between Prosody and Timbre,” Cornell University, Jan. 2022, doi: 10.48550/arxiv.2206.14866.

A. Vinay and A. Lerch, “Evaluating generative audio systems and their metrics,” Cornell University, Jan. 2022, doi: 10.48550/arxiv.2209.00130.

H. Mo, S. Ding, and S. C. Hui, “A Multimodal Data-driven Framework for Anxiety Screening,” Cornell University, Jan. 2023, doi: 10.48550/arxiv.2303.09041.

F. Yan, N. Wu, A. M. Iliyasu, K. Kawamoto, and K. Hirota, “Framework for identifying and visualising emotional atmosphere in online learning environments in the COVID-19 Era,” Springer Science+Business Media, Jan. 2022, doi: 10.1007/s10489-021-02916-z.

R. Habibi, J. Pfau, J. Holmes, and M. S. El‐Nasr, “Empathetic AI for Empowering Resilience in Games,” Cornell University, Jan. 2023, doi: 10.48550/arxiv.2302.09070.