EmoHarmonix: Innovating emotional audio insights with harmonic-synthesis deep learning architectures
Abstract
Emotional audio analysis holds significant potential in applications such as human-computer interaction, mental health monitoring, and customer service. However, current methods often fail to capture the intricate harmonic structures within audio signals, resulting in suboptimal emotion classification accuracy. To address these challenges, this study proposes novel harmonic-synthesis deep learning architectures that leverage the interplay between chords and musical harmonies for a more nuanced understanding of emotional content in audio. The methodology includes the introduction of these architectures, the creation of a carefully curated emotional audio dataset encompassing diverse emotional states, and advanced feature extraction techniques that exploit harmonic properties. The harmonic-synthesis-based feature extraction algorithm involves several steps: preprocessing, short-time Fourier transform (STFT), harmonic detection, harmonic feature extraction, temporal dynamics modeling with a long short-term memory (LSTM) network, feature aggregation, and normalization. Extensive evaluations show that the proposed models achieve significantly higher classification accuracy than traditional methods, with the harmonic-synthesis CNN reaching up to 93.7% accuracy on the RAVDESS dataset. The results indicate that the harmonic-synthesis approach effectively captures subtle emotional cues, enabling more robust and precise emotion detection systems. By providing open-source implementations, this research fosters further advancement of the field. Additionally, the harmonic-synthesis architectures were benchmarked against existing state-of-the-art models, consistently outperforming them across various metrics. These findings underscore the potential of harmonic-synthesis techniques to significantly advance emotional audio analysis, providing a solid foundation for future research and real-world applications.
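
To make the feature-extraction pipeline named in the abstract concrete, the minimal sketch below shows one plausible way to chain the listed steps (preprocessing, STFT, harmonic detection, harmonic feature extraction, LSTM temporal modeling, aggregation, and normalization). The use of librosa and PyTorch, all parameter values, and the HarmonicLSTM class are illustrative assumptions; this is not the authors' open-source implementation.

# Illustrative sketch of the described pipeline; library choices and
# hyperparameters are assumptions, not the paper's released code.
import librosa
import numpy as np
import torch.nn as nn


def harmonic_features(path, sr=16000, n_fft=1024, hop=512):
    # Preprocessing: load at a fixed sample rate and trim leading/trailing silence.
    y, _ = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y)

    # Short-time Fourier transform (complex spectrogram).
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)

    # Harmonic detection: keep the harmonic component, discard percussive energy.
    H, _ = librosa.decompose.hpss(S)

    # Harmonic feature extraction: chroma (pitch-class energy) and MFCCs
    # computed from the harmonic spectrogram only.
    power = np.abs(H) ** 2
    chroma = librosa.feature.chroma_stft(S=power, sr=sr)
    mel = librosa.feature.melspectrogram(S=power, sr=sr)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)

    # Frame-level feature matrix of shape (time_steps, feature_dim).
    return np.vstack([chroma, mfcc]).T.astype(np.float32)


class HarmonicLSTM(nn.Module):
    """Temporal dynamics modeling: per-frame normalization, an LSTM over the
    harmonic feature sequence, mean-pooling aggregation, and a linear
    emotion classifier (a hypothetical stand-in for the paper's models)."""

    def __init__(self, feat_dim=25, hidden=128, n_emotions=8):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)               # feature normalization
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, x):                                # x: (batch, time, feat_dim)
        out, _ = self.lstm(self.norm(x))
        pooled = out.mean(dim=1)                         # aggregate over time
        return self.head(pooled)                         # emotion logits

In this sketch the chroma and MFCC features (12 + 13 = 25 dimensions per frame) are stacked and fed to the LSTM; the paper's harmonic-synthesis CNN would replace or augment this recurrent stage.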
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
IJCERT Policy:
The published work presented in this paper is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This means that the content of this paper can be shared, copied, and redistributed in any medium or format, as long as the original author is properly attributed. Additionally, any derivative works based on this paper must also be licensed under the same terms. This licensing agreement allows for broad dissemination and use of the work while maintaining the author's rights and recognition.
By submitting this paper to IJCERT, the author(s) agree to these licensing terms and confirm that the work is original and does not infringe on any third-party copyright or intellectual property rights.