EmoHarmonix: Innovating emotional audio insights with harmonic-synthesis deep learning architectures

Matthew T. Tull
Dorothea Quack
Kathryn Zumberg‐Smith
Hussain Seam

Abstract

Emotional audio analysis has significant potential in applications such as human-computer interaction, mental health monitoring, and customer service, yet current methods often fail to capture the intricate harmonic structures within audio signals, limiting emotion classification accuracy. To address this, this study proposes harmonic-synthesis deep learning architectures that leverage the interplay between chords and musical harmonies for a more nuanced representation of emotional content in audio. The methodology comprises the proposed architectures, a carefully curated emotional audio dataset spanning diverse emotional states, and feature extraction techniques that exploit harmonic properties. The harmonic-synthesis feature extraction algorithm proceeds in several steps: preprocessing, the short-time Fourier transform (STFT), harmonic detection, harmonic feature extraction, temporal dynamics modeling with an LSTM, feature aggregation, and normalization. Extensive evaluations show that the proposed models achieve significantly higher classification accuracy than traditional methods, with the harmonic-synthesis CNN reaching up to 93.7% accuracy on the RAVDESS dataset. These results indicate that the harmonic-synthesis approach captures subtle emotional cues and yields more robust and precise emotion detection. When benchmarked against existing state-of-the-art models, the harmonic-synthesis architectures consistently outperformed them across a range of metrics. Open-source implementations are provided to support further work in this area. Together, these findings underscore the potential of harmonic-synthesis techniques to advance emotional audio analysis and to serve as a foundation for future research and real-world applications.
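The abstract describes the feature-extraction pipeline only at a high level. The sketch below is a minimal, hypothetical illustration of such a pipeline, assuming librosa for signal processing and PyTorch for the LSTM stage; the function names, parameter values, and feature choices (log-magnitude spectrogram plus chroma standing in for the "harmonic features") are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the pipeline named in the abstract:
# preprocessing -> STFT -> harmonic detection -> harmonic feature extraction
# -> LSTM temporal dynamics modeling -> feature aggregation -> normalization.
# librosa/PyTorch and all parameter values are assumptions, not from the paper.
import librosa
import numpy as np
import torch
import torch.nn as nn


def harmonic_features(path: str, sr: int = 22050, n_fft: int = 1024, hop: int = 256) -> np.ndarray:
    # Preprocessing: load mono audio at a fixed sample rate and trim leading/trailing silence.
    y, sr = librosa.load(path, sr=sr, mono=True)
    y, _ = librosa.effects.trim(y)

    # Short-time Fourier transform.
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)

    # Harmonic detection: isolate the harmonic component via harmonic-percussive separation.
    harmonic, _ = librosa.decompose.hpss(stft)

    # Harmonic feature extraction: log-magnitude spectrogram of the harmonic part
    # plus chroma, which summarizes pitch-class (chord-related) energy per frame.
    log_mag = librosa.amplitude_to_db(np.abs(harmonic), ref=np.max)
    chroma = librosa.feature.chroma_stft(S=np.abs(harmonic) ** 2, sr=sr)

    # Frame-level feature matrix with shape (time, features).
    return np.concatenate([log_mag, chroma], axis=0).T.astype(np.float32)


class TemporalAggregator(nn.Module):
    """LSTM over frame-level harmonic features, mean-pooled into one clip-level vector."""

    def __init__(self, in_dim: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(frames)                       # temporal dynamics modeling
        pooled = out.mean(dim=1)                         # feature aggregation over time
        return nn.functional.normalize(pooled, dim=-1)   # normalization (L2)


# Usage sketch (the file path is a placeholder):
# feats = torch.from_numpy(harmonic_features("clip.wav")).unsqueeze(0)
# clip_vec = TemporalAggregator(in_dim=feats.shape[-1])(feats)

The resulting clip-level vector would then feed an emotion classifier such as the harmonic-synthesis CNN evaluated in the paper; that classifier is not sketched here.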

Article Details

How to Cite
[1] Matthew T. Tull, Dorothea Quack, Kathryn Zumberg‐Smith, and Hussain Seam, “EmoHarmonix: Innovating emotional audio insights with harmonic-synthesis deep learning architectures”, Int. J. Comput. Eng. Res. Trends, vol. 11, no. 2, pp. 40–49, Feb. 2024.
Section
Research Articles

