Peer-Reviewed

Music Audio Sentiment Classification Based on Improved Vision Transformer

Received: 17 February 2023    Accepted: 27 March 2023    Published: 31 March 2023
Abstract

Common neural network models suffer from low accuracy and low efficiency in music sentiment classification tasks. To further mine the sentiment information contained in the audio spectrum and improve the accuracy of music sentiment classification, an improved Vision Transformer model is proposed. Because no public data set meets the requirements of the music sentiment classification task, this paper constructs a four-category music sentiment data set. The audio is first preprocessed, and the resulting audio features are reshaped to fit the input structure of the Vision Transformer before training. The model's position parameters preserve the relationships between audio features, and its encoder structure fully learns both local and global features. Because training this model is slow, a SoftPool pooling layer is introduced; it better retains emotional features and speeds up the model's computation while preserving its accuracy. Experimental results show that the Vision Transformer model reaches a classification accuracy of 86.5%, a better result than neural networks such as ResNet. Meanwhile, the improved Vision Transformer reduces training time by 10.4% at a cost of only 0.3% in accuracy. On the public GTZAN data set, the model reaches an accuracy of 90.7%.
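
The article itself includes no code, but the pipeline the abstract describes (preprocess the audio into a spectrogram, then reshape it into the token sequence a Vision Transformer expects) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sampling rate, mel-band count, clip length, patch size, and embedding dimension are assumed values, and librosa/PyTorch are assumed tooling.

    import numpy as np
    import librosa
    import torch
    import torch.nn as nn

    def audio_to_logmel(path, sr=22050, n_mels=128, duration=30.0):
        """Load a clip and convert it to a log-mel spectrogram -- the
        2-D 'image' that the Vision Transformer consumes."""
        y, _ = librosa.load(path, sr=sr, duration=duration)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        return librosa.power_to_db(mel, ref=np.max)   # (n_mels, frames)

    class SpectrogramPatchEmbed(nn.Module):
        """Cut the spectrogram into patch tokens and add learned position
        embeddings, so the encoder keeps track of where each patch sat in
        time and frequency (the 'position parameters' of the abstract).
        All sizes here are illustrative assumptions."""
        def __init__(self, n_mels=128, frames=1280, patch=16, dim=768):
            super().__init__()
            # A strided convolution both cuts patches and projects them.
            self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
            n_tokens = (n_mels // patch) * (frames // patch) + 1  # +1 [CLS]
            self.cls = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))

        def forward(self, x):                 # x: (batch, 1, n_mels, frames)
            tok = self.proj(x).flatten(2).transpose(1, 2)   # (batch, N, dim)
            cls = self.cls.expand(x.size(0), -1, -1)
            return torch.cat([cls, tok], dim=1) + self.pos  # encoder input

The position embeddings added in the last line are what lets the encoder preserve the connection between audio features while its self-attention layers learn both local and global structure.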
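The claimed speed-up comes from introducing SoftPool (Stergiou et al., 2021), which reduces each pooling window to the softmax-weighted average of its activations: strong activations dominate, as in max pooling, but weaker ones still contribute, which is why the abstract argues it retains the emotional features. Below is a minimal PyTorch sketch of the published definition; the kernel size and where the layer is placed in the network are assumptions.

    import torch
    import torch.nn.functional as F

    def soft_pool2d(x, kernel_size=2, stride=None):
        """SoftPool: softmax-weighted average over each pooling window."""
        stride = stride or kernel_size
        w = torch.exp(x)  # per-element weights; for very large activations,
                          # subtract x's window max first for stability
        num = F.avg_pool2d(w * x, kernel_size, stride)  # mean of w*x
        den = F.avg_pool2d(w, kernel_size, stride)      # mean of w
        # The two means share a window size, so their ratio equals
        # sum(w*x) / sum(w): the softmax-weighted average.
        return num / den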

Published in American Journal of Computer Science and Technology (Volume 6, Issue 1)
DOI 10.11648/j.ajcst.20230601.16
Page(s) 42-49
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2023. Published by Science Publishing Group

Keywords

Vision Transformer, Musical Sentiment, Sentiment Classification

Cite This Article
  • APA Style

Chen Zhen, Liu Changhui. (2023). Music Audio Sentiment Classification Based on Improved Vision Transformer. American Journal of Computer Science and Technology, 6(1), 42-49. https://doi.org/10.11648/j.ajcst.20230601.16


    ACS Style

Chen Zhen; Liu Changhui. Music Audio Sentiment Classification Based on Improved Vision Transformer. Am. J. Comput. Sci. Technol. 2023, 6(1), 42-49. doi: 10.11648/j.ajcst.20230601.16


    AMA Style

Chen Zhen, Liu Changhui. Music Audio Sentiment Classification Based on Improved Vision Transformer. Am J Comput Sci Technol. 2023;6(1):42-49. doi: 10.11648/j.ajcst.20230601.16


Author Information
  • Chen Zhen, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, China

  • Liu Changhui, School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, China
