Resource building and classification of Mizo folk songs

Esther Ramdinmawii; Sanghamitra Nath

doi:10.1017/nlp.2024.23

Published online by Cambridge University Press: 23 May 2024

Esther Ramdinmawii*: Affiliation:
Department of Computer Science & Engineering, Tezpur University, Napaam, Sonitpur, Assam, India
Sanghamitra Nath: Affiliation:
Department of Computer Science & Engineering, Tezpur University, Napaam, Sonitpur, Assam, India
*: Corresponding author: Esther Ramdinmawii; Email: [email protected]

Abstract

Folk culture represents the social, ethnic, and traditional livelihood of people belonging to a certain tribe or community and is important in keeping their culture and tradition alive. The Mizo people are a Tibeto-Burmese ethnic group, native to the Indian state of Mizoram and neighboring regions of Northeast India. Mizo folk culture is an amalgamation of festivity, celebration, liveliness, kinship, brotherhood, and merriment, and above all, preserves the deeply entrenched ethnicity of this tribal community. Unfortunately, the Mizos are fast giving up their old customs and adopting a new mode of life that is greatly influenced by western culture. This makes it all the more crucial to preserve the intangible cultural heritage of this ethnic tribe, whose folk cultures are vanishing day by day. To the best of our knowledge, this work is the first attempt at preservation and classification of Mizo folk songs. The first part of this paper presents a literature survey on preservation, analysis, and classification of folk songs. The second part presents the methodology for preliminary classification of Mizo folk songs. Three categories of Mizo folk songs, namely Hunting chants (Hlado), Children’s songs (Pawnto hla), and Elderly songs (Pi pu zai), are used in this study. A total of 29 acoustic features are used. A long short-term memory network with a custom attention layer is proposed for classification, and its results are compared with four supervised models (Support Vector Machines, K-Nearest Neighbor, Naive Bayes, and Ensemble). Experimental results from the proposed model are promising, with an implication of scope for future research in acoustic analysis and classification of Mizo folk songs using recent unsupervised methods.

Keywords

low-resource regional songs; acoustic analysis; Mizo folk song resources; folk song classification

Type: Article
Information: Natural Language Processing, Volume 31, Special Issue 2: Natural Language Processing Applications for Low-Resource Languages, March 2025, pp. 655-673

DOI: https://doi.org/10.1017/nlp.2024.23
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: © The Author(s), 2024. Published by Cambridge University Press

1. Introduction

Folk music embodies a profound legacy and diversity while also transcending cultural boundaries with its universal language. It serves as a means to safeguard our cultural heritage and pass down our history to future generations, ensuring the preservation of our rich legacy. Folk music in India encompasses a vast array of traditions, each reflecting the unique cultural heritage of its respective region. The Northeastern states of India exhibit a particularly diverse and rich folk music tradition. These cultures have a vibrant tapestry of folk songs that are deeply rooted in their history, customs, and rituals. These songs provide insights into the local customs, beliefs, and values of the communities they belong to.

One such community is the Mizo community from Mizoram, which is the southernmost state among the Northeastern states of India, as shown in Figure 1. The Mizo tribe has a significant population, with over 8,40,000 speakers (as per the 2011 census) within Mizoram as well as its neighboring states such as Manipur, Tripura, and Assam. There are also Mizo-speaking populations in certain parts of Bangladesh, toward the western border of Mizoram, as well as in Myanmar, on the eastern border of Mizoram. Various sub-tribes like Thado, Paite, Lusei, Pawi, and others reside within the Mizo community, with the Lusei dialect adopted as the lingua franca of modern-day Mizoram. The Mizo language belongs to the Tibeto-Burman language family (Weidert 1975), specifically the Kuki-Chin subgroup. The Mizo tribe and their language, with their vibrant cultural traditions and linguistic heritage, contribute to the diverse tapestry of North-East India’s rich ethnic and linguistic mosaic.

The Mizos have a rich and diverse cultural heritage. Their folklore is naturally passed down from previous generations orally. Mizo folk songs depict stories of the Mizo society, tradition, and culture, at a certain time in history. They reflect the glorious past of the Mizos, including their way of farming and harvesting, hunting, war, natural disasters, romance and nuptials, place of females in social strata, place of males in society, etc.

Figure 1. Speaker population of Mizo language^{Footnote a} (marked in green dots).

According to the works of Thanmawia (1998), Lalthangliana (2005), and Lalremruati (2012, 2019), Mizo folk songs can be categorized under five main themes depending on their purpose and use: hunting and war chants, lamentations, satire, love, and nature themes. Songs are also categorized depending on their type of tune, called thlûk (Lalthangliana 1993; Khiangte 2001, 2002; Lalzarzova 2016). This is termed ‘Hlabu’. Those having the same tune or melody are called ‘hlabu khat’. Different categories from these earlier works are discussed in brief as follows:

1. War chants: Bawh hla (war chants) are chanted solo by warriors after a successful war or raid where they have taken the head of an enemy. These personal and subjective chants are spontaneous and convey the singer’s emotions and mood, reflecting a sense of pride and ego (Lalremruati 2019).
2. Hunting chants: Hlado (hunting chants) share the same melody as Bawh hla and are spontaneously composed and chanted after a triumphant hunt. They typically emphasize the singer’s supremacy over the common man, employing words that express the singer’s ego and pride (Lalremruati 2019).
3. Lamentations: Songs of this nature are traditionally chanted during times of adversity. The Mizos experienced famines during their settlement in the Than ranges, leading to significant loss of life (Lalremruati 2019). These songs, known as ‘ṭhuthmun zai’ (songs sung while sitting), emerged as a way for people to offer condolences and gather together, sitting and singing these songs in solidarity (Thanmawia 1998; Lalremruati 2012, 2019).
4. Satire: These songs serve the purpose of ridicule and can be both aggressive and offensive, but they also encompass cheerful and humorous elements. Known as ‘intuk hla’, they are utilized to lighten the mood in gatherings that are otherwise heavy and tense (Thanmawia 1998; Lalthangliana 2005; Lalremruati 2019).
5. Nature themed: Mizo folk songs frequently depict the beauty of nature and its influence on society, emphasizing the Mizo people’s reliance on and connection with the natural world, including its flora and fauna. These songs often draw parallels between elements of nature and the affection shared between couples, intertwining themes of nature and love (Lalremruati 2019).
6. Couplet and triplet: Mizo folk songs can be categorized based on the number of lines they contain. The first form of folk song, known as a couplet (tlar hnih zai), is believed to consist of two lines (Lalthangliana 1993). It is further believed that earlier songs primarily comprised couplets and triplets (Khiangte 2001).
7. Songs named after individuals: Mizo folk songs are also categorized according to the names of their original composers. Subsequently, other composers utilize the same tune to create different lyrics, a practice often done as a tribute to honor the original composer (Lalremruati 2012; Lalzarzova 2016).
8. Songs named after merry and festive occasions: The Mizos celebrate various festivals, many of which are connected to the agricultural season. Among the most common ones are Chapchar Kût, Mim Kût, and Ṭhalfavang Kût. Chapchar Kût marks the joyous completion of rice plantation, Ṭhalfavang Kût celebrates the harvest, while Mim Kût is a solemn festival dedicated to the souls of the deceased, featuring rituals, feasting, and mournful singing and dancing (Khiangte 2002).

The Indian Government has implemented heritage preservation schemes that aim to preserve and promote oral traditions, performing arts, social and ritual events, etc., from various states. Projects by the All India Radio and the Indira Gandhi National Centre for the Arts^{Footnote b} focus on preserving dying folk songs and classical Indian music; however, Mizo folk songs and the Mizo language have not yet been included in the Technology Development for Indian Languages^{Footnote c} repository. It is crucial to protect the language, culture, and traditions of the Mizo community, as Mizo is classified as ‘vulnerable’ on the 2010 UNESCO list of endangered languages (Moseley 2010).

The cultural development of Mizo society has been significantly influenced by globalization, which, through linguistic changes in the Mizo language, has gradually diminished the significance of traditional folk songs. The vocabulary of these songs differs from spoken Mizo and includes borrowings from the Paite dialect, making it challenging to sing or comprehend the lyrics. As a result, passing down this cultural heritage to the younger generation has become increasingly difficult amidst the rapid social changes, depriving them of access to, and practice of, folk tales, which are vital for maintaining cultural roots. Thus, preserving folk songs has become even more crucial.

Hence, this work aims to address this issue by proposing a framework for preservation and classification of Mizo folk songs. The main contributions of this work are listed below:

Creation of Mizo folk song database for research in music processing.
Utilization of the database toward identification of unique acoustic characteristics of the Mizo folk songs from a speech processing point of view.
Acoustic classification of Mizo folk song categories using a long short-term memory (LSTM) network with custom attention layer (LSTM-attn).

The paper is structured as follows. Section 2 presents a brief summary of existing literature on Music Information Retrieval (MIR), existing analysis methods, features, and classification methods of folk songs in other languages. Section 3 discusses the methodology including data used, acoustic features employed, as well as detailed discussion of the proposed LSTM-attn model. Section 4 discusses the experiments and results. Section 5 concludes with a summarization of the work, its limitations, and future scope.

2. Literature survey

2.1 Music information retrieval and recent techniques

MIR deals with problems of music access, filtering, tool development, and retrieval (Orio 2006). According to Orio (2006), the applications of MIR are intended to help users find specific music in a large collection by a particular similarity matching technique and criteria. Major tasks in MIR include (i) audio fingerprinting, (ii) audio-textual alignment, (iii) cover song identification, (iv) music genre identification and classification, and (v) music recommendation, among others (Srinivasa Murthy and Koolagudi 2018; Blaß and Bader 2019). Our proposed framework focuses on the tasks of music identification and classification. The basic framework of an MIR system is shown in Figure 2.

Figure 2. Block diagram of a typical MIR system.

In Deruty et al. (2022), music production of contemporary pop music is carried out using AI tools. Different musical instrument sound generation tools are utilized with automatic music labeling features in the form of symbolic representations and the coupling of composition with sound editing and mixing. In Shah et al. (2022), music genre classification is carried out using machine learning models such as Support Vector Machines (SVM), Random Forest, Extreme Gradient Boosting, and Convolutional Neural Networks (CNN). They used the popular GTZAN dataset for training and testing. It is seen that CNN performs the best compared to the traditional models. Deep learning models are also used in Mersy (2021), wherein a depth-wise separable CNN is trained on electronic dance music and its performance is validated against a CNN tested on a source-separated spectrogram and a normal spectrogram. The source-separated spectrogram proves better in terms of classification performance for a limited dataset. Genre classification on the GTZAN dataset and the Free Music Archive dataset is also undertaken by Ashraf et al. (2020), using deep learning models such as CNN, Recurrent Neural Network (RNN), and CNN-RNN models with global layer regularization (GLR), with Mel-spectrograms used to evaluate the performances. In the GLR technique, every hidden unit of a layer shares the same normalization terms. It is seen that CNN-RNN networks performed better on the two datasets due to the utilization of deep features.

In recent years, retrieval of music information from real-time embeddings has been seen in Stefani and Turchet (2022). Twelve acoustic guitar techniques are compiled, and the onset of such musical instances is detected. Cepstral features are used to train and test Deep Neural Network (DNN) models, which are deployed to a Raspberry Pi-based embedded computer, with an accuracy of 99.1%. Image embeddings and acoustic embeddings are used in Dogan et al. (2022) for zero-shot audio classification. Similarly, in Lazzari, Poltronieri, and Presutti (2023), the pitch class from music structures is embedded into continuous vectors using existing methods and custom encodings using LSTM neural networks. This performs better than those techniques that use chord symbolic annotations.

2.2 Classification methods for identification of folk song and music

In music and singing processing, songs that share the same tune or melodies and have similar acoustic components are grouped together. It is usually done for efficient storage and retrieval, where ordering is necessary for ease of access. Classification of songs is also important in finding out the geographical origin of folk songs/music, for music recommendation, etc.

A few decades back, classification methods based on the ending notes, number of lines in the song, number of syllables in a line, etc., were adopted in Elschekova (1966), Keller (1984), Bohlman (1988), Umapathy, Krishnan, and Jimaa (2005), and Van Kranenburg et al. (2007), but not without problems and limitations (Keller 1984). In recent years, machine learning models have been heavily employed for musical classification, mainly based on the genre. Music genre classification techniques are found in Jiang et al. (2002), Aucouturier and Pachet (2003), Umapathy et al. (2005), Orio (2006), Meng et al. (2007), Lee et al. (2009), and Fu et al. (2010), which are based on different temporal and spectral methods. Several approaches and models have emerged over the years. Conditional Random Fields (CRFs) have been employed by Liu et al. (2007) and Li et al. (2019) along with GMM and Restricted Boltzmann Machine. In Li, Ding, and Yang (2017), it is seen that CRF-GMM outperformed traditional classification models by approximately 4.6%–18.13%.

An attention neural network-based architecture for folk song classification is also explored (Arronte-Alvarez and Gomez-Martin 2019). Musical motif embedding is introduced to represent folk songs in different languages. For motif embedding, the Word2Vec model (Mikolov et al. 2013) is used and later employed in the ANN architecture. They classified folk songs of Chinese, Swedish, and German origins. Their results are comparable to those in existing studies (Cuthbert, Ariza, and Friedland 2011; Le and Mikolov 2014). In Loh and Emmanuel (2006), the extreme learning machine (ELM) is used for music genre classification, wherein features are extracted from 160 songs of four different genres: classical, pop, rock, and dance music. ELM and SVM have been used to classify these songs.

In the Indian context, classification of Punjabi folk musical instruments from audio segments is carried out by Singh and Koolagudi (2017). Although vocal singing is not considered in their paper, the classification methods they used for polyphonic musical signals could be employed for the classification of vocal singing with multiple singers. Classification accuracy of 71% is achieved using the J48 classifier, which is increased to 91% by further improvement of input data samples. Feature selection is performed based on the performance of the J48 classifier. The selected features are then supplied to eight additional classifiers, where the highest classification rate is achieved by the logistic classifier (95%). In Das, Bhattacharyya, and Debbarma (2021), a classification system for Kokborok music using traditional machine learning techniques is developed. A computational method to minimize the errors for each class is developed, with an alpha ( $\alpha$ ) value defined to estimate better accuracy, which successfully improved the original classification accuracy. In Das et al. (2023), music source separation is carried out for Hunting chants of Mizo folk songs using techniques like REpeating Pattern Extraction Technique (REPET), Robust Principal Component Analysis (RPCA), and Non-negative Matrix Factorization (NMF). It is seen that RPCA obtained the best signal-to-distortion ratio and signal-to-noise ratio for separation of vocals and musical accompaniments, followed by REPET and NMF.

Based on the literature survey, the following research gaps are noted:

Folk songs have received relatively less attention compared to mainstream or commercial music genres. Consequently, there is a limited pool of research studies and resources dedicated to under-resourced folk songs. This limitation affects the depth and breadth of research findings and the development of specialized tools and techniques for analysis.
Folk songs are often part of an oral tradition, passed down through generations without written documentation. This poses challenges in preserving and documenting these songs, leading to the risk of songs being lost or forgotten over time.
The oral nature of folk songs and the lack of standardized annotation and metadata for these songs make it difficult to compare and analyze them systematically.

3. Methodology

In Figure 3, the methodology followed for Mizo folk song classification is depicted. Firstly, the dataset is built by collecting folk songs from different sources. Then pre-processing is carried out for extraction of acoustic features. Next, the extracted features are used for training and testing of the classification models. These different stages of the framework are detailed in the subsequent sections.

Figure 3. Methodology for classification of Mizo folk songs.

3.1 Mizo folk song dataset

The dataset for this study is collected from three sources, namely, publicly available Mizo folk songs of performances in cultural events and competitions, which are sourced from the internet, songs provided by the Art & Culture Department (A&C), Mizoram, and songs collected in field recordings. The internet data included audio from YouTube videos, mainly from documentaries, competition performances, and recordings made by educational institutes. The sampling frequency is 48,000 samples per second in mono channel.

Songs obtained from the A&C Department have vocals accompanied by cow-hide drums. The recordings were made with a Shure SM58 dynamic vocal microphone at a sampling rate of 44,100 samples per second and a bit rate of 1,411 kbps, in stereo. The songs were originally recorded for use in a folk song competition by the technical staff at the A&C Department, and later shared with the authors of this study. Field data is recorded from a male singer who performed several categories of folk songs. Recording is carried out in a quiet room by the authors of this work. A Zoom H1n portable recorder has been used, with a sampling frequency of 44,100 samples per second and a bit rate of 1,411 kbps, in stereo mode. The recorder is mounted on a tripod and placed approximately 1 ft. from the singer.

Data imbalance is observed mainly due to singers being more familiar with certain song categories than others. From all the collected songs, Hunting chants (Hlado), Children’s songs (Pawnto hla), and Elderly songs (Pi pu zai) have the largest number of song samples and the longest durations. Here, the Hlado songs obtained from the internet have been chanted in an open field, and the ones from field data are recorded in a quiet room. Pawnto hla has been sung by kids in an open playground. Pi pu zai has been recorded in a room full of people who sang the songs together in a group. The data distribution can be seen in Table 1. Apart from these three chosen categories, the other categories contain between 1 and 20 song samples each. This large imbalance makes classification over all available categories infeasible. Hence, for the purpose of this paper, the said three categories have been chosen.

Table 1. Categories of Mizo folk song dataset used in this study

Data pre-processing tasks such as cleaning, segmentation, noise removal, and normalization are carried out on the dataset. Unwanted segments like background noise, coughing sounds, swallowing sounds, and tongue clicking in the recordings are removed. However, in order to avoid aliasing and windowing effects, regions of silence of about 0.5–1 sec are left uncut at the start and end of each audio clip. Amplitude normalization is performed by dividing the song signal by its absolute maximum amplitude, which keeps the amplitudes in the range −1 to +1.
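
The amplitude normalization step can be illustrated with a minimal Python sketch. It assumes that a clip is held as a plain list of float samples; the function name `peak_normalize` and the toy values are hypothetical and are not taken from the tooling used in this work.

```python
def peak_normalize(samples):
    """Divide every sample by the clip's absolute maximum amplitude."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # A silent or empty clip has nothing to scale; return it unchanged.
        return list(samples)
    return [s / peak for s in samples]


clip = [0.1, -0.5, 0.25, 0.0]      # toy samples, not real song data
normalized = peak_normalize(clip)  # largest magnitude becomes 1.0
```

After this scaling, every sample lies between −1 and +1, and the loudest sample sits exactly on one of those bounds.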

A consistent naming structure is maintained for each category of the data: songcategory_source_genderSpeakerNo_speechNo. So, for a folk song type Hlado, the first song, performed by the second male singer obtained from A&C Department, can be written as: hlado_src2_m2_0001.wav. This dataset will be made available upon request to the authors or through the Natural Language Processing Laboratory, Department of Computer Science & Engineering, Tezpur University.
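
As an illustration of the naming structure, the following Python sketch builds and parses such file names. The four-digit zero padding, the single-letter gender code, and the lowercase category token are assumptions inferred from the one example above; the helper names are hypothetical.

```python
import re

# Assumed layout: <category>_src<source>_<gender><singer>_<4-digit song>.wav
NAME_PATTERN = re.compile(
    r"^(?P<category>[a-z]+)_src(?P<source>\d+)"
    r"_(?P<gender>[mf])(?P<singer>\d+)_(?P<song>\d{4})\.wav$"
)


def build_clip_name(category, source_no, gender, singer_no, song_no):
    """Compose a file name following the dataset's naming structure."""
    return f"{category}_src{source_no}_{gender}{singer_no}_{song_no:04d}.wav"


def parse_clip_name(name):
    """Split a file name back into its fields; raise on unexpected names."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    fields = match.groupdict()
    return {
        "category": fields["category"],
        "source": int(fields["source"]),
        "gender": fields["gender"],
        "singer": int(fields["singer"]),
        "song": int(fields["song"]),
    }


name = build_clip_name("hlado", 2, "m", 2, 1)  # "hlado_src2_m2_0001.wav"
fields = parse_clip_name(name)
```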

3.2 Acoustic feature extraction

For the experiments, Matlab (MATLAB 2022) and Praat (Boersma 2001) tools are used. For the purpose of feature extraction, the songs are sampled at 48,000 samples per second, and frame-wise extraction is carried out. A frame size of 25 msec and a frameshift of 10 msec are used (Paliwal, Lyons, and Wójcicki 2010; O’Shaugnessy 1987, p. 179). The following acoustic features are used:

1. Fundamental Frequency (F0): F0 is the frequency at which the vocal folds vibrate during voice production. In this work, F0 is extracted using the autocorrelation function, which is computed as:
\begin{equation} R(i) = \sum_{n=i}^{N-1} x(n)\,x(n-i) \tag{1} \end{equation}
where $1 \le i \le p$ for a finite-duration signal $x(n)$ of $N$ samples, and $p$ is the largest lag value considered (O’Shaugnessy 1987, p. 196; Huang et al. 2001, p. 321). Six parameters of F0 (minimum, maximum, mean, range, standard deviation, and median) are extracted.
2. Signal energy: The energy of a continuous-time signal x(t) is obtained by integrating the squared amplitude of x over time (Haykin and Van Veen 2007, p. 20; O’Shaugnessy 1987, p. 180). It is computed as
\begin{equation} E_x = \int_{-\infty}^{\infty} x^2(t)\, dt \tag{2} \end{equation}
3. Zero crossing rate: The rate at which a signal crosses the zero axis, i.e. changes sign, is known as the zero-crossing rate (ZCR). For a sampled signal $x$, the ZCR of the frame at position $i$ is computed as
\begin{equation} ZCR(i) = \frac{1}{2M} \sum_{j} \left|\mathrm{sgn}(x(j)) - \mathrm{sgn}(x(j-1))\right| W(i-j) \tag{3} \end{equation}
where $W$ represents a window of size $M$ samples, sgn is the signum function, and the factor $1/(2M)$ keeps the ZCR in the range [0, 1] (O’Shaugnessy 1987, p. 182). A higher ZCR value implies higher frequency content in the signal (Lerch 2012, p. 62).
4. Strength of excitation (SoE): SoE is the relative strength of impulse-like excitation at the Glottal Closure Instants. In this work, SoE is extracted using the zero frequency filtering (ZFF) method (Yegnanarayana and Murty 2009). The slope of the ZFF signal at each epoch is the SoE (Mittal 2016; Kadiri and Alku 2020).
5. Cepstral peak prominence (CPP): CPP is a commonly used acoustic measure of voice quality in different applications of speech analysis, such as singing voice studies (Baker et al. 2022) and speech dysphonia (Fraile and Godino-Llorente 2014). We have extracted CPP both with and without voice detection, following Murton, Hillman, and Mehta (2020).
6. Mel frequency cepstral coefficients (MFCC): MFCCs describe the overall spectral envelope of a signal (Lerch 2012; O’Shaugnessy 1987). The $i^{th}$ coefficient, as in Lerch (2012, p. 51), is computed as
\begin{equation} \mathrm{MFCC}_i(n) = \sum_{k'=1}^{K'} \log|X'(k',n)| \cdot \cos\left(i\left(k' - \frac{1}{2}\right)\frac{\pi}{K'}\right) \tag{4} \end{equation}
where $|X'(k',n)|$ is the Mel spectrum at frame block $n$ and $K'$ is the number of Mel bands. In this work, 13 MFCC coefficients are used.
7. Formant frequencies: Acoustic resonances in the vocal tract are called formants (O’Shaugnessy 1987). They are crucial in examining the articulatory response of the vocal tract (Ladefoged and Johnson 2014). A 10th-order linear prediction analysis is used to estimate the first four formants, with the songs resampled to 10,000 samples per second.
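
The frame-wise extraction and the three temporal features above (F0 by autocorrelation, energy, and ZCR) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the Matlab/Praat pipeline used in this work: the F0 search range of 75–600 Hz, the rectangular window, the absence of a voicing decision, the use of the population standard deviation, and all function names are hypothetical choices. The energy function is the discrete-time counterpart of Equation (2).

```python
import math
import statistics


def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split samples into overlapping frames, keeping full frames only."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, shift)]


def autocorrelation(frame, max_lag):
    """R(i) = sum_{n=i}^{N-1} x(n) x(n-i) for i = 0..max_lag (Equation 1)."""
    n_samples = len(frame)
    return [
        sum(frame[n] * frame[n - i] for n in range(i, n_samples))
        for i in range(max_lag + 1)
    ]


def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=600.0):
    """Return sample_rate / lag for the lag with the largest R(i).

    The search range is an assumed one; no voicing decision is made.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    r = autocorrelation(frame, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda i: r[i])
    return sample_rate / best_lag


def f0_statistics(f0_values):
    """The six F0 parameters: min, max, mean, range, std, and median."""
    lowest, highest = min(f0_values), max(f0_values)
    return {
        "min": lowest,
        "max": highest,
        "mean": statistics.mean(f0_values),
        "range": highest - lowest,
        "std": statistics.pstdev(f0_values),  # population std (assumption)
        "median": statistics.median(f0_values),
    }


def frame_energy(frame):
    """Discrete-time counterpart of Equation (2): sum of squared samples."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Sign changes between neighbouring samples, scaled by 1 / (2M)."""
    def sgn(value):
        return 1 if value >= 0 else -1

    changes = sum(
        abs(sgn(frame[k]) - sgn(frame[k - 1])) for k in range(1, len(frame))
    )
    return changes / (2 * len(frame))


# Synthetic 200 Hz tone, 0.2 s long at 8 kHz (toy data, not a song clip).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]
frames = frame_signal(tone, sample_rate)
f0_track = [estimate_f0(frame, sample_rate) for frame in frames]
f0_summary = f0_statistics(f0_track)
```

With the 48,000 samples per second used in this work, the same framing gives 1,200 samples per 25 msec frame and a shift of 480 samples.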

3.3 Proposed models

At present, it is still difficult to implement a fully unsupervised learning model for audio, since the singing signal is highly non-linear. Moreover, sufficient data to implement an unsupervised model is currently unavailable for Mizo folk songs. Hence, an approach using a supervised deep learning model, LSTM, is proposed in the following subsection.

3.3.1 LSTM with attention mechanism (LSTM-attn)

An LSTM model with an attention mechanism is proposed. This attention mechanism enhances the ability of the LSTM to focus on specific regions of the input acoustic feature vector at each time step. It computes attention scores based on the similarity between the data points in the feature vector, and assigns weights to different regions of the input vector, giving ‘attention’ to the most relevant data points. LSTM-attn in this work is computed as in Algorithm 1, using the following parameters:

One-hot encoded input sequence, x, with dimension 2102 $\times$ 29 $\times$ 1 (for the 29 selected acoustic features)
Two weight parameters, $Q_w$ and $K_w$ , are defined as learnable weight matrices for Query projection and Key projection, respectively.
Output sequence, y, which is an attention-weighted sequence with the same dimension as x.

Algorithm 1. Attention mechanism for LSTM

The model summary of LSTM-attn is shown in Figure 4. The first LSTM layer in this figure is the input layer, which takes input of shape 2102 $\times$ 29 $\times$ 1. This layer has 64 units with ReLU (Rectified Linear Unit) activation to introduce non-linearity, and it captures the temporal differences in the input feature sequence. The output produced has a shape of 32 $\times$ 29 $\times$ 64, as batchSize = 32.

Figure 4. Summary of the proposed LSTM-attn model with custom attention layer.

This is then passed to the attention layer, where attn_scores are computed based on the importance of the data points, as per the algorithm mentioned above. The dimensions of the weight matrices $Q_w$ and $K_w$ are set to 64 $\times$ 29, taking the size of both the time axis and the feature axis from the input vector, rather than weighting on the time axis alone as done in conventional attention mechanisms. This projection of the input data to a higher dimensionality for the query vector allows the model to concentrate on the most relevant features in the batch, while the key weight is set to retain the dimension of the input vector. Moreover, this customization allows for pairwise relationships between features within the input sequence while preserving all input features. The shape of the output is maintained from the previous layer. This setting was seen to improve model performance compared with allowing the model to assign random weights.
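
The body of Algorithm 1 is not reproduced in this text, so the following Python sketch is not the authors' algorithm. It shows a generic dot-product self-attention with learnable query and key projections, under the assumptions that the scores are scaled by the square root of the projection width and normalized with a row-wise softmax; the toy dimensions (3 positions, 2 values each) and all names are hypothetical. It only illustrates how attention weights derived from `Q_w` and `K_w` can yield an output with the same shape as the input.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, q_w, k_w):
    """Return (attention-weighted x, attention weights)."""
    q = matmul(x, q_w)                      # query projection
    k = matmul(x, k_w)                      # key projection
    k_t = [list(col) for col in zip(*k)]    # transpose of the keys
    scale = math.sqrt(len(q[0]))            # assumed scaling factor
    scores = [[v / scale for v in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, x), weights      # same shape as x


# Toy input: 3 positions with 2 values each (hypothetical numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]         # stand-in for learned weights
y, attn_weights = self_attention(x, identity, identity)
```

Each row of `attn_weights` sums to one, so every output row is a weighted average of the input rows.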

Next is another LSTM layer with 128 ReLU units, whose output sequence has shape 32 $\times$ 29 $\times$ 128. Then, the output is flattened to get a vector of shape 32 $\times$ 3712. A fully connected dense layer with 128 units and ReLU activation is then added, which reduces the shape of the vector to 32 $\times$ 128. Subsequently, the softmax layer follows with 3 units (i.e. the number of classes, which in our case is the number of song categories) to produce the 32 $\times$ 3 output as class probabilities.
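
The layer-by-layer shapes described above can be checked with a small Python sketch. It only traces tensor shapes per batch of 32 and does not build or train a network; the layer labels are hypothetical, and both LSTM layers are assumed to return full sequences, as implied by their three-dimensional outputs.

```python
def trace_shapes(batch_size=32, time_steps=29, channels=1,
                 lstm1_units=64, lstm2_units=128, dense_units=128,
                 n_classes=3):
    """List (layer label, output shape) pairs for the LSTM-attn stack."""
    return [
        ("input", (batch_size, time_steps, channels)),
        ("lstm_1", (batch_size, time_steps, lstm1_units)),
        # The attention layer keeps the shape of the previous layer.
        ("attention", (batch_size, time_steps, lstm1_units)),
        ("lstm_2", (batch_size, time_steps, lstm2_units)),
        # Flattening merges the last two axes: 29 * 128 = 3712.
        ("flatten", (batch_size, time_steps * lstm2_units)),
        ("dense", (batch_size, dense_units)),
        ("softmax", (batch_size, n_classes)),
    ]


shapes = dict(trace_shapes())
```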

3.3.2 Machine learning models

In addition to the proposed LSTM-attn model above, four commonly used supervised machine learning models, SVM, K-Nearest Neighbor (KNN), Naive Bayes, and Ensemble learning, are employed for comparison with the results obtained from the LSTM-attn model. These models have been found to have the highest classification rates as compared to other models for shorter segments of speech (Grimaldi, Cunningham, and Kokaram 2003; Huang et al. 2014).

4. Experiments and discussion of results

4.1 Experiments

In this work, a total of 29 acoustic features and parameters have been extracted and divided into four feature sets, which are also combined incrementally. This is done to find out which groups of features are relevant for the classification task based on their acoustic properties. The features are grouped as follows: set-1: Temporal features (F0, Energy, ZCR); set-2: Source feature (SoE); set-3: Source-system features (CPP, MFCC); set-4: System features (Formants); set-1 + 2: Temporal + source features (F0, Energy, ZCR, SoE); set-1 + 2 + 3: Temporal + source + source-system features (F0, Energy, ZCR, SoE, CPP, MFCC); and set-1 + 2 + 3 + 4: all feature sets (F0, Energy, ZCR, SoE, CPP, MFCC, Formants). Class labels 1, 2, and 3 are assigned to the hunting chants, children’s songs, and elderly songs, respectively.
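The grouping above can be written down as a small lookup structure. The sketch below only records the feature families named in the text; the dictionary and helper names are hypothetical, and the number of columns each family contributes to the 29-dimensional vector (for example, how many MFCC coefficients or formants) is not specified here, so it is left out.

```python
# Feature families per set, as listed in the text.
FEATURE_SETS = {
    "set-1": ["F0", "Energy", "ZCR"],   # temporal features
    "set-2": ["SoE"],                   # source feature
    "set-3": ["CPP", "MFCC"],           # source-system features
    "set-4": ["Formants"],              # system features
}

def combine(*set_names):
    """Concatenate the feature families of the given sets, in order."""
    features = []
    for name in set_names:
        features.extend(FEATURE_SETS[name])
    return features

# The seven feature combinations used in the experiments.
COMBINATIONS = {
    "set-1": combine("set-1"),
    "set-2": combine("set-2"),
    "set-3": combine("set-3"),
    "set-4": combine("set-4"),
    "set-1 + 2": combine("set-1", "set-2"),
    "set-1 + 2 + 3": combine("set-1", "set-2", "set-3"),
    "set-1 + 2 + 3 + 4": combine("set-1", "set-2", "set-3", "set-4"),
}

# Class labels assigned to the three song categories.
CLASS_LABELS = {1: "hunting chants", 2: "children's songs", 3: "elderly songs"}

for name, features in COMBINATIONS.items():
    print(name, features)
```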

In total, seven feature combinations are used for classification of the folk songs. Performing the classification with such combinations helps to identify which acoustic features are relevant for the classification of Mizo folk songs. The songs in the three categories, whose typical length is 1–5 min, are divided into manageable chunks of 3 sec. From the original 93 song files, a total of 2948 samples are thus generated.
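A minimal sketch of such fixed-length segmentation is given below. The sample rate of 16 kHz, the non-overlapping chunks, and the decision to drop a trailing remainder shorter than 3 sec are assumptions for this example; the function name `split_into_chunks` is hypothetical.

```python
import numpy as np

def split_into_chunks(signal, sample_rate, chunk_seconds=3.0):
    """Split a 1-D signal into non-overlapping chunks of chunk_seconds each.

    Any trailing remainder shorter than one chunk is dropped (an assumption).
    """
    chunk_len = int(round(chunk_seconds * sample_rate))
    n_chunks = len(signal) // chunk_len
    trimmed = signal[: n_chunks * chunk_len]
    return trimmed.reshape(n_chunks, chunk_len)

sample_rate = 16000                      # assumed sample rate in Hz
song = np.zeros(10 * sample_rate)        # a 10-second placeholder signal
chunks = split_into_chunks(song, sample_rate)
print(chunks.shape)                      # 3 full chunks of 48000 samples each
```

On average, 2948 samples from 93 files corresponds to roughly 32 chunks per song.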

After removal of NaN and zero values, the feature vector is one-hot encoded to reshape it and make it compatible with the model. The shape of the vector becomes 2102 $\times$ 29 $\times$ 1. Using the seven feature set combinations, experiments are carried out wherein the shape of the input vector changes depending on the number of features considered. For these experiments, the ‘adam’ optimizer is used with a constant batch size of 32, and the model is trained for 10, 20, 30, 40, and 50 epochs. Only the best-performing setting, 10 epochs, is reported in this study; given the small size of the feature vector, 10 epochs is deemed sufficient for this work. For the four ML models, after eliminating NaN values, the dimension of the final feature vector becomes 2183 $\times$ 29. The models are trained using k-fold cross validation (k = 5) on 80% of the data, and 20% is set aside for testing.
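The data preparation for the ML models can be sketched as follows: rows containing NaN are dropped, 20% of the samples are held out for testing, and the remaining 80% are divided into five folds. This is an illustration under stated assumptions (random shuffling, a fixed seed, rounding of the test-set size, and the hypothetical helper names `drop_invalid_rows` and `holdout_and_folds`), not the exact procedure used in the experiments.

```python
import numpy as np

def drop_invalid_rows(features):
    """Remove rows that contain at least one NaN value."""
    keep = ~np.isnan(features).any(axis=1)
    return features[keep]

def holdout_and_folds(n_samples, test_fraction=0.2, k=5, seed=0):
    """Shuffle indices, hold out a test set, and split the rest into k folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    test_idx, train_idx = order[:n_test], order[n_test:]
    folds = np.array_split(train_idx, k)
    return train_idx, test_idx, folds

# Toy feature matrix with one corrupted row.
toy = np.random.default_rng(1).standard_normal((10, 29))
toy[2, 5] = np.nan
clean = drop_invalid_rows(toy)

# Index split for a matrix with 2183 rows, as for the ML models.
train_idx, test_idx, folds = holdout_and_folds(2183)
print(clean.shape, len(train_idx), len(test_idx), [len(f) for f in folds])
```

In each cross-validation round, one fold serves as the validation set and the other four are used for training; the held-out 20% is used only for the final test.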

4.2 Discussions

In Table 2, the accuracy results of the proposed LSTM-attn model are shown. The input data are split 80:20 into training and testing sets. Although slightly lower, the performance of the model for each feature set is comparable to that of the existing ML models used in this study. As the classes are rather distinct from one another, class-2 and class-3 exhibit minimal misclassification between them, while a higher number of misclassifications is seen between class-1 and class-3.

Table 2. Macro-averages of long short-term memory with attention layer model for classification of three categories of Mizo folk songs, with 20% testing data

Interestingly, it is observed that LSTM-attn performance deteriorates as the feature set combines more acoustic features. The accuracy plots of the feature sets are shown in Figure 5. Set-1 provides the best accuracy (95.01%) among the individual feature sets. However, as more sets are combined, the performance of the model gradually drops, reaching 91.21% for set-1 + 2 + 3 + 4 with all the features. This is attributed to the fact that LSTMs are sequential models that work well in capturing patterns and dependencies in sequential data. As the combined feature sets are not inherently sequential by nature, the performance of the LSTM-attn is seen to deteriorate.

Figure 5. Accuracy plots for LSTM-attn with 10 epochs for different feature sets (Accuracy $\times$ 100).

Out of the four supervised classifiers employed, it can be seen from Table 3 that the Ensemble method achieves the best accuracy of 97.71% for the temporal features in set-1 and for all features in set-1 + 2 + 3 + 4, with 66 incorrectly classified data points. It can also be observed from Figure 6 that there is hardly any misclassification between class-3 (elderly songs) and class-2 (children’s songs). This is because children’s and adults’ voices are clearly distinct, so the models are able to train and predict well. Misclassification is highest between the hunting chants and the elderly songs. Although the rhythm and tempo of these songs are not similar, both are sung and performed by adults. As such, the two categories may show some similarity in terms of excitation features.
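For readers who wish to reproduce this kind of analysis, the sketch below shows how a three-class confusion matrix and macro-averaged precision, recall, and f1-score can be computed. The label vectors are made-up toy values chosen only to mimic the pattern described above (confusion between class-1 and class-3, none between class-2 and class-3); they are not taken from the experiments, and the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels=(1, 2, 3)):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

def macro_scores(cm):
    """Return accuracy and macro-averaged precision, recall, and f1-score."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # per predicted class
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # per true class
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean(), f1.mean()

# Toy labels (1: hunting chants, 2: children's songs, 3: elderly songs).
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 3, 1, 2, 2, 2, 3, 1, 3]

cm = confusion_matrix(y_true, y_pred)
accuracy, macro_p, macro_r, macro_f1 = macro_scores(cm)
print(cm)
print(accuracy, macro_p, macro_r, macro_f1)
```

Macro-averaging gives each class equal weight regardless of how many samples it contains, which matters when the song categories are not equally represented.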

Table 3. Classifier performances for three categories of Mizo folk songs, with different combinations of acoustic features (5-fold cross validation with 20% data for testing)

Figure 6. Confusion matrices of the four ML models and LSTM-attn with different feature sets.

4.2.1 A comparative analysis of classification using different feature sets

In set-1, the LSTM-attn model yields a testing accuracy of 95.01%, which, despite a slight decrease, is considered a fairly good performance given the limited data size. The accuracy plot in Figure 5(a) shows slight improvements. The precision, recall, and f1-score are also the highest among all the feature sets, as shown in Table 2. The Ensemble model also performs well, achieving an accuracy of 97.71%. The KNN, SVM, and Naive Bayes classifiers obtained testing accuracies of 96.79%. Overall, this feature set with temporal features demonstrates the best accuracy when considering both the machine learning models and the proposed LSTM-attn model.

In set-2, LSTM-attn achieves the highest performance (71.73%) despite the challenge of a reduced feature set, with its f1-score dipping to 0.74. There is no misclassification of class-3 as class-2 for any model except Naive Bayes, as seen in Figure 6(b). The Ensemble model performs the best among the ML models (64.91%). The use of a single acoustic feature (SoE) in this set affects the performance. It may also be due to the fact that the excitation strength is estimated using a ZFF model (Yegnanarayana and Gangashetty 2011), which uses a fixed window length in the trend removal of zero-frequency resonators. This fixed windowing does not work well for singing voice and expressive voice due to high source-filter system interaction (Kadiri and Yegnanarayana 2015; Kadiri, Alku, and Yegnanarayana 2020).

In set-3, the source-system features produce better classification accuracy than set-2. LSTM-attn obtains 91.69% accuracy with an f1-score of 0.89. The Ensemble method performs the best (92.66%), while SVM has the lowest accuracy (89.68%) and the highest classification error. The accuracy plot in Figure 5(c) shows higher accuracies with little improvement over 10 epochs.

In set-4, LSTM-attn obtains the best performance (69.88%). The challenges with feature set-4 become evident as the models struggle to effectively categorize folk songs. As can be seen in Figure 6(d), the misclassification ratio closely mirrors the classification accuracy, which can be attributed to the influence of musical notes in vocal singing, causing shifts in formant frequencies (Heaton 2010). Although relatively steady, the accuracy plot also shows a slight dip toward the last epoch in Figure 5(d).

With set-1 + 2, the classification accuracy improves when the temporal features are combined with the source feature, which, as observed above, does not perform well on its own. The LSTM-attn obtains 93.11% accuracy, although the ML models achieve better accuracy.

With feature set-1 + 2 + 3, improvements in classification accuracy are observed, with fewer classification errors, as shown in Figure 6(f). The f1-score is at 0.92 and the accuracy plot in Figure 5(f) shows a steady curve between training and testing data. The KNN model classifies best, with the fewest misclassified data points. The LSTM-attn achieves the lowest accuracy of 93.11% with an f1-score of 0.93.

Finally, for set-1 + 2 + 3 + 4, adding the system features (formants) to the previous three types of features does not improve the classification accuracy for LSTM-attn, KNN, SVM, and Naive Bayes. However, for the Ensemble model, the accuracy is improved and misclassification is reduced. It is observed that the performance of LSTM-attn does not necessarily improve with more diversified feature combinations. It is seen to have a lower accuracy than with sets whose features are of the same feature type.

4.2.2 Comparison with existing works

Given the absence of prior research on the acoustic analysis and classification of Mizo folk songs, this study draws comparisons with similar research on folk songs in other low-resource Indian languages. Because these languages are under-resourced, the existing works do not typically employ more recent deep learning techniques.

Table 4 shows different works on classification of Indian folk songs, out of which Kokborok (Das et al. 2021) seems to be closest to Mizo in terms of language family (Tibeto-Burman language family) and geographical location. Although the experimental setup is dissimilar, in the case of Kokborok, 63% classification accuracy has been achieved by using statistical computational methods to improve the classification error of the feature sets.

Table 4. Classifier performance compared with existing studies of under-resourced folk songs

For existing work with a similar experimental setup, folk songs of the categories Gais, Rais, and Phag are classified by Pandey and Dutta (2014). A 5-fold cross validation and an 80:20 training-testing ratio were employed, achieving 91.3% using an SVM classifier. Although the performance of our proposed LSTM-attn model is lower than that of the four existing ML models used in this study, it shows a slight improvement over the existing works.

5. Summary and conclusion

Mizo is a low-resource language that lacks the tools and technology required for the archival of its folk music. It has been observed that very few acoustic studies exist for Indian folk songs and music in spite of their richness in cultural and regional diversity. A survey of literature on Mizo folk songs as well as on recent methods of folk song and music classification has been carried out.

This work proposes an LSTM-attn model consisting of an LSTM layer, a custom attention layer, a fully connected dense layer, and a softmax layer. Its performance is compared with that of existing machine learning models, namely SVM, KNN, Naive Bayes, and Ensemble models. Three categories of Mizo folk songs are used as the dataset for classification: Hunting chants (Hlado), Children’s songs (Pawnto hla), and Elderly songs (Pi pu zai), with the total duration of the songs being approximately 2 hrs. A total of 29 acoustic features grouped into temporal features, source features, source-system features, and vocal tract filter features are extracted from the Mizo folk songs. Classification is carried out with 20% of the data segregated for testing. The highest accuracy achieved for the LSTM-attn is 95.01% (for temporal features), while it achieved 91.21% for all features combined. The results are comparable to existing studies of folk song classification in other Indian languages.

Our work is constrained by the relatively small dataset, which necessitates the segmentation of song samples into 3-sec segments. This approach may result in the loss of contextual information and potential discontinuities. Consequently, important audio events or transitions that span longer durations could be divided across different segments, posing challenges in capturing the complete audio content. Moreover, the frame-wise analysis employed might have overlooked important tonal characteristics of the Mizo language present in these folk songs. Performance of the proposed LSTM-attn model could be improved with a larger sample size in each class. A comprehensive evaluation of the model will be undertaken in future. Additionally, analysis will be conducted to address the issue of lower accuracy in diverse acoustic feature sets. Exploration of the tone-tune relationship in the Mizo language will also be undertaken, building upon previous studies by Ramdinmawii and Nath (2022) and Gogoi and Nath (2023).

This work would significantly contribute to India’s efforts in preserving intangible cultural heritage, benefiting Mizoram’s Art & Culture Department, currently engaged in archiving the state’s heritage. Additionally, this method can have broader applications in MIR, not only in Mizo but also in other Tibeto-Burman languages like Tani (Arunachal Pradesh), Meitei (Manipur), and Garo (Meghalaya).

Acknowledgement

The authors thank the Director and Technician, Department of Art & Culture, Mizoram, for their contribution in sharing their prerecorded songs. The authors are also grateful to the late Pu Lalkhuma (Sialsuk) for his valuable contribution to the dataset. Lastly, great appreciation goes to the owners of YouTube channels who permitted us to use their content for our dataset in this work.

Footnotes

Special Issue on ‘Natural Language Processing Applications for Low-Resource Languages’

^a https://www.kamat.com/kalranga/nindia/mizoram/map.htm

^b https://ignca.gov.in/regional-centers/southern-regional-centre/

^c https://tdil-dc.in/index.php?option=com_vertical&parentid=58&lang=en

References

Arronte-Alvarez, A. and Gomez-Martin, F. (2019). An attentional neural network architecture for folk song classification, arXiv preprint arXiv: 1904.

Ashraf, M., Geng, G., Wang, X., Ahmad, F. and Abid, F. (2020). A globally regularized joint neural architecture for music classification. IEEE Access 8, 220980–220989.

Aucouturier, J.-J. and Pachet, F. (2003). Representing musical genre: A state of the art. Journal of New Music Research 32(1), 83–93.

Baker, C. P., Sundberg, J., Purdy, S. C., de, SLeão, S., H., et al. (2022). CPPS and voice-source parameters: Objective analysis of the singing voice. Journal of Voice 38(3), 549–560.

Betsy, S. and Bhalke, D. (2015). Genre classification of Indian Tamil music using mel-frequency cepstral coefficients. International Journal of Engineering Research & Technology 4(12), 423–427.

Bhatt, M. and Patalia, T. (2017). Neural network based Indian folk dance song classification using MFCC and LPC. International Journal of Intelligent Engineering and Systems 10(3), 173–183.

Blaß, M. and Bader, R. (2019). Content-based music retrieval and visualization system for ethnomusicological music archives. Computational Phonogram Archiving, 5, 145–173.

Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International 5(9), 341–345.

Bohlman, P. V. (1988). The Study of Folk Music in the Modern World. Bloomington, Indiana, U.S.: Indiana University Press.

Cuthbert, M. S., Ariza, C. and Friedland, L. (2011). Feature extraction and machine learning on symbolic music using the music21 toolkit. In ISMIR, pp. 387–392.

Das, N., Ramdinmawii, E., Kumar, A. and Nath, S. (2023). Vocal singing and music separation of Mizo folk songs. In 2023 4th International Conference on Computing and Communication Systems (I3CS), IEEE, pp. 1–6.

Das, S., Bhattacharyya, B. K. and Debbarma, S. (2021). Building a computational model for mood classification of music by integrating an asymptotic approach with the machine learning techniques. Journal of Ambient Intelligence and Humanized Computing 12(6), 5955–5967.

Deruty, E., Grachten, M., Lattner, S., Nistal, J. and Aouameur, C. (2022). On the development and practice of AI technology for contemporary popular music production. Transactions of the International Society for Music Information Retrieval 5(1), 35.

Dogan, D., Xie, H., Heittola, T. and Virtanen, T. (2022). Zero-shot audio classification using image embeddings. In 2022 30th European Signal Processing Conference (EUSIPCO), IEEE, pp. 1–5.

Elschekova, A. (1966). Methods of classification of folk-tunes. Journal of the International Folk Music Council 18, 56–76.

Fraile, R. and Godino-Llorente, J. I. (2014). Cepstral peak prominence: A comprehensive analysis. Biomedical Signal Processing and Control 14, 42–54.

Fu, Z., Lu, G., Ting, K. M. and Zhang, D. (2010). A survey of audio-based music classification and annotation. IEEE Transactions on Multimedia 13(2), 303–319.

Gogoi, J. and Nath, S. (2023). Analysing word stress and its effects on Assamese and Mizo using machine learning. In 2023 2nd International Conference on Paradigm Shifts in Communications Embedded Systems, Machine Learning and Signal Processing (PCEMS), IEEE, pp. 1–6.

Grimaldi, M., Cunningham, P. and Kokaram, A. (2003). A wavelet packet representation of audio signals for music genre classification using different ensemble and feature selection techniques. In Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, pp. 102–108.

Haykin, S. and Van Veen, B. (2007). Signals and Systems. Daryaganj, New Delhi, India: Wiley India Pvt. Ltd.

Heaton, E. M. (2010). Formant Changes in Amateur Singers After Instruction in a Vowel Equalization Technique. Ann Arbor, Michigan, U.S.: ProQuest LLC.

Huang, X., Acero, A., Hon, H.-W. and Reddy, R. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, vol. 95. Hoboken, New Jersey, U.S.: Prentice Hall.

Huang, Y.-F., Lin, S.-M., Wu, H.-Y. and Li, Y.-S. (2014). Music genre classification based on local feature selection using a self-adaptive harmony search algorithm. Data & Knowledge Engineering 92, 60–76.

Jiang, D.-N., Lu, L., Zhang, H.-J., Tao, J.-H. and Cai, L.-H. (2002). Music type classification by spectral contrast feature. In Proceedings. IEEE International Conference on Multimedia and Expo, IEEE, vol. 1, pp. 113–116.

Kadiri, S. R. and Alku, P. (2020). Excitation features of speech for speaker-specific emotion detection. IEEE Access 8, 60382–60391.

Kadiri, S. R., Alku, P. and Yegnanarayana, B. (2020). Analysis and classification of phonation types in speech and singing voice. Speech Communication 118, 33–47.

Kadiri, S. R. and Yegnanarayana, B. (2015). Analysis of singing voice for epoch extraction using zero frequency filtering method. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 4260–4264.

Keller, M. S. (1984). The problem of classification in folksong research: A short history. Folklore 95(1), 100–104.

Khiangte, L. (2001). Mizo folk literature. Indian Literature 45(1), 72–83.

Khiangte, L. (2002). Mizo Songs and Folk Tales. New Delhi, India: Sahitya Akademi.

Ladefoged, P. and Johnson, K. (2014). A Course in Phonetics. Stamford, Connecticut, U.S.: Cengage Learning.

Lalremruati, R. (2012). Oral literature: a study of Mizo folk songs, PhD thesis. Mizoram University.

Lalremruati, R. (2019). Narratives of Mizo traditional songs: A thematic analysis. International Journal of Research and Analytical Reviews 6(2), 422–425.

Lalthangliana, B. (1993). History of Mizo Literature. Aizawl: R.T.M. Press.

Lalthangliana, B. (2005). Culture and Folklore of Mizoram. New Delhi, India: Publications Division, Ministry of Information and Broadcasting, Govt. of India.

Lalzarzova (2016). Thanglunghnemi zai bihchianna. Mizo Studies VIII(2), 27–35.

Lazzari, N., Poltronieri, A. and Presutti, V. (2023). Pitchclass2vec: Symbolic music structure segmentation with chord embeddings, arXiv preprint arXiv: 2303.15306.

Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, PMLR, pp. 1188–1196.

Lee, C.-H., Shih, J.-L., Yu, K.-M. and Lin, H.-S. (2009). Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features. IEEE Transactions on Multimedia 11(4), 670–682.

Lerch, A. (2012). An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics. Hoboken, New Jersey, U.S.: John Wiley & Sons, Inc.

Li, J., Ding, J. and Yang, X. (2017). The regional style classification of Chinese folk songs based on GMM-CRF model. In Proceedings of the 9th International Conference on Computer and Automation Engineering, pp. 66–72.

Li, J., Luo, J., Ding, J., Zhao, X. and Yang, X. (2019). Regional classification of Chinese folk songs based on CRF model. Multimedia Tools and Applications 78(9), 11563–11584.

Liu, Y., Xu, J., Wei, L. and Tian, Y. (2007). The study of the classification of Chinese folk songs by regional style. In International Conference on Semantic Computing (ICSC 2007), IEEE, pp. 657–662.

Loh, Q.-J. B. and Emmanuel, S. (2006). ELM for the classification of music genres. In 2006 9th International Conference on Control, Automation, Robotics and Vision, IEEE, pp. 1–6.

MATLAB (2022). Version 9.12.0 (R2022b). Natick, Massachusetts: The MathWorks Inc.

Meng, A., Ahrendt, P., Larsen, J. and Hansen, L. K. (2007). Temporal feature integration for music genre classification. IEEE Transactions on Audio, Speech, and Language Processing 15(5), 1654–1664.

Mersy, G. (2021). Efficient robust music genre classification with depthwise separable convolutions and source separation. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35(18), pp. 15972–15973.

Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). Efficient estimation of word representations in vector space, arXiv preprint arXiv: 1301.3781.

Mittal, V. K. (2016). Discriminating features of infant cry acoustic signal for automated detection of cause of crying. In 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), IEEE, pp. 1–5.

Moseley, C. (2010). Atlas of the World’s Languages in Danger. Paris, France: UNESCO Publishing.

Murton, O., Hillman, R. and Mehta, D. (2020). Cepstral peak prominence values for clinical voice evaluation. American Journal of Speech-Language Pathology 29(3), 1596–1607.

Orio, N. (2006). Music retrieval: A tutorial and review. Foundations and Trends in Information Retrieval 1(1), 1–90.

O’Shaughnessy, D. (1987). Speech Communication: Human and Machine. Boston, Massachusetts, U.S.: Addison-Wesley Publishing Company.

Paliwal, K. K., Lyons, J. G. and Wójcicki, K. K. (2010). Preference for 20-40 ms window duration in speech analysis. In 2010 4th International Conference on Signal Processing and Communication Systems, IEEE, pp. 1–4.

Pandey, A. and Dutta, I. (2014). Bundeli folk-song genre classification with KNN and SVM. In Proceedings of the 11th International Conference on Natural Language Processing, pp. 133–138.

Ramdinmawii, E. and Nath, S. (2022). A preliminary analysis on the correlates of stress and tones in Mizo. ACM Transactions on Asian and Low-Resource Language Information Processing 22(2), 1–15.

Rege, A. and Sindal, R. (2021). Audio classification for music information retrieval of Hindustani vocal music. Indonesian Journal of Electrical Engineering and Computer Science 24(3), 1481.

Shah, M., Pujara, N., Mangaroliya, K., Gohil, L., Vyas, T. and Degadwala, S. (2022). Music genre classification using deep learning. In 2022 6th International Conference on Computing Methodologies and Communication (ICCMC), IEEE, pp. 974–978.

Singh, I. and Koolagudi, S. G. (2017). Classification of Punjabi folk musical instruments based on acoustic features. In Proceedings of the International Conference on Data Engineering and Communication Technology, Springer, pp. 445–454.

Srinivasa Murthy, Y. V. and Koolagudi, S. G. (2018). Content-based music information retrieval (CB-MIR) and its applications toward the music industry: A review. ACM Computing Surveys 51(3), 1–46.

Stefani, D. and Turchet, L. (2022). On the challenges of embedded real-time music information retrieval. In Proceedings of the 25-th International Conference on Digital Audio Effects (DAFx20in22), vol. 3, pp. 177–184.

Thanmawia, R. (1998). Mizo Poetry. Publications Division, Ministry of Information & Broadcasting, Government of India. Aizawl, Mizoram, India: Din Din Heaven.

Umapathy, K., Krishnan, S. and Jimaa, S. (2005). Multigroup classification of audio signals using time-frequency parameters. IEEE Transactions on Multimedia 7(2), 308–315.

Van Kranenburg, P., Garbers, J., Volk, A., Wiering, F., Grijp, L. and Veltkamp, R. (2007). Towards integration of music information retrieval and folk song research. In Proceedings of the 8th International Conference on Music Information Retrieval, pp. 505–508.

Weidert, A. (1975). Componential Analysis of Lushai Phonology, vol. 2. Amsterdam, The Netherlands: John Benjamins Publishing Company.

Yegnanarayana, B. and Gangashetty, S. V. (2011). Epoch-based analysis of speech signals. Sadhana 36(5), 651–697.

Yegnanarayana, B. and Murty, K. S. R. (2009). Event-based instantaneous fundamental frequency estimation from speech signals. IEEE Transactions on Audio, Speech, and Language Processing 17(4), 614–624.

Figure 1. Speaker population of Mizo language^a (marked in green dots).

Figure 2. Block diagram of a typical MIR system.

Figure 3. Methodology for classification of Mizo folk songs.

Table 1. Categories of Mizo folk song dataset used in this study

Algorithm 1. Attention mechanism for LSTM

Figure 4. Summary of the proposed LSTM-attn model with custom attention layer.
