Peer-Reviewed

Music Audio Sentiment Classification Based on Improved Vision Transformer

Received: 17 February 2023    Accepted: 27 March 2023    Published: 31 March 2023
Abstract

Common neural network models suffer from low accuracy and low efficiency in music sentiment classification tasks. To further mine the sentiment information contained in the audio spectrum and improve the accuracy of music sentiment classification, an improved Vision Transformer model is proposed. Because no public data set meets the requirements of the music sentiment classification task, this paper constructs a four-category music sentiment data set. The audio is first preprocessed, and the resulting audio features are reshaped to fit the input structure of the Vision Transformer before training. The model's position parameters preserve the relationships between audio features, and its encoder structure fully learns both local and global features. Because training this model is slow, a SoftPool pooling layer is introduced; it better retains emotional features and speeds up the model's computation while preserving its accuracy. Experimental results show that the Vision Transformer model reaches a classification accuracy of 86.5%, a better result than neural networks such as ResNet. Meanwhile, the improved Vision Transformer reduces training time by 10.4% at a cost of only 0.3% in accuracy. On the public GTZAN data set, the model reaches an accuracy of 90.7%.
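
The article itself includes no code, but the pipeline the abstract describes (preprocess the audio into a spectrogram, then reshape it into the token sequence a Vision Transformer expects) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sampling rate, mel-band count, clip length, patch size, and embedding dimension are assumed values, and librosa/PyTorch are assumed tooling.

    import numpy as np
    import librosa
    import torch
    import torch.nn as nn

    def audio_to_logmel(path, sr=22050, n_mels=128, duration=30.0):
        """Load a clip and convert it to a log-mel spectrogram -- the
        2-D 'image' that the Vision Transformer consumes."""
        y, _ = librosa.load(path, sr=sr, duration=duration)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        return librosa.power_to_db(mel, ref=np.max)   # (n_mels, frames)

    class SpectrogramPatchEmbed(nn.Module):
        """Cut the spectrogram into patch tokens and add learned position
        embeddings, so the encoder keeps track of where each patch sat in
        time and frequency (the 'position parameters' of the abstract).
        All sizes here are illustrative assumptions."""
        def __init__(self, n_mels=128, frames=1280, patch=16, dim=768):
            super().__init__()
            # A strided convolution both cuts patches and projects them.
            self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
            n_tokens = (n_mels // patch) * (frames // patch) + 1  # +1 [CLS]
            self.cls = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))

        def forward(self, x):                 # x: (batch, 1, n_mels, frames)
            tok = self.proj(x).flatten(2).transpose(1, 2)   # (batch, N, dim)
            cls = self.cls.expand(x.size(0), -1, -1)
            return torch.cat([cls, tok], dim=1) + self.pos  # encoder input

The position embeddings added in the last line are what lets the encoder preserve the connection between audio features while its self-attention layers learn both local and global structure.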
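The claimed speed-up comes from introducing SoftPool (Stergiou et al., 2021), which reduces each pooling window to the softmax-weighted average of its activations: strong activations dominate, as in max pooling, but weaker ones still contribute, which is why the abstract argues it retains the emotional features. Below is a minimal PyTorch sketch of the published definition; the kernel size and where the layer is placed in the network are assumptions.

    import torch
    import torch.nn.functional as F

    def soft_pool2d(x, kernel_size=2, stride=None):
        """SoftPool: softmax-weighted average over each pooling window."""
        stride = stride or kernel_size
        w = torch.exp(x)  # per-element weights; for very large activations,
                          # subtract x's window max first for stability
        num = F.avg_pool2d(w * x, kernel_size, stride)  # mean of w*x
        den = F.avg_pool2d(w, kernel_size, stride)      # mean of w
        # The two means share a window size, so their ratio equals
        # sum(w*x) / sum(w): the softmax-weighted average.
        return num / den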

Published in American Journal of Computer Science and Technology (Volume 6, Issue 1)
DOI 10.11648/j.ajcst.20230601.16
Page(s) 42-49
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2023. Published by Science Publishing Group

Keywords

Vision Transformer, Musical Sentiment, Sentiment Classification

Cite This Article
  • APA Style

Chen Zhen, Liu Changhui. (2023). Music Audio Sentiment Classification Based on Improved Vision Transformer. American Journal of Computer Science and Technology, 6(1), 42-49. https://doi.org/10.11648/j.ajcst.20230601.16


    ACS Style

Chen Zhen; Liu Changhui. Music Audio Sentiment Classification Based on Improved Vision Transformer. Am. J. Comput. Sci. Technol. 2023, 6(1), 42-49. doi: 10.11648/j.ajcst.20230601.16


    AMA Style

Chen Zhen, Liu Changhui. Music Audio Sentiment Classification Based on Improved Vision Transformer. Am J Comput Sci Technol. 2023;6(1):42-49. doi: 10.11648/j.ajcst.20230601.16


Author Information
  • Chen Zhen, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, China

  • Liu Changhui, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, China
