1. Introduction
Sarcasm is an ironic, stinging, sour, or cutting statement or comment that indicates the reverse of what the speaker truly intends to express (Riloff et al. 2013). Sarcastic language is resentment concealed as humor, intended to provoke, annoy, or convey contempt. In the context of expression analysis, sarcasm is a form of language in which the intended meaning of a remark differs from its literal interpretation. Because of this intrinsic ambiguity, it poses a distinct challenge and opportunity for investigation. Expression analysis attempts to decipher the complexities of sarcasm by evaluating verbal indicators such as tone, context, and grammatical structure. By interpreting the contrast between overt and implicit meanings, it seeks to capture the nuances of sarcasm, shedding light on how language is used to transmit complex ideas and emotions. This provides useful insights into the complexity of human communication, particularly into how people use linguistic strategies to convey thoughts that go beyond the surface text. Because the intention of sarcasm is often vague and misleading, people cannot always discriminate between a true story and satire or irony (Riloff et al. 2013). Despite its sometimes vague or complex meanings, sarcasm fulfills various functions in communication: it is a rhetorical strategy that communicates a message by exploiting the gap between the literal and intended meanings of words, and its use can be ascribed to a variety of circumstances, including humor or irony, expression of emotion, and criticism. Facebook, YouTube, and Twitter are influential social media platforms where people nowadays share their judgments, thoughts, opinions, and sentiments (Hussain, Mahmud, and Akthar 2018). This large amount of available data opens broad scope for research in natural language processing (NLP).
Sarcasm detection in low-resource languages is a very narrow research area in NLP. Sarcasm detection is a subset of sentiment analysis where the focus is on recognizing sarcasm rather than identifying sentiment across the board (Eke et al. 2020). Sarcasm detection research is available for high-resource languages such as English. But, despite Bengali being the world's seventh most spoken language with 240 million native speakers (Hossain et al. 2021), research on sarcasm detection in Bengali remains unexplored and overlooked. Due to limited resources and the scarcity of large-scale sarcasm data, identifying sarcasm in Bengali text is currently a difficult challenge for NLP researchers (Romim et al. 2021).
Facebook is a popular free social networking website that allows registered users to upload photos and videos, send messages, and keep in touch with friends, family, and colleagues. As of January 2021, Bangladesh had 41 million Facebook users. People socialize in the Facebook comment section to express their perspectives, judgments, and opinions on the content of a post. Any automatic detection system that uses machine learning depends on a large-scale dataset, as it requires rigorous training and testing. As far as we have noticed, there is no available Bengali text corpus for sarcasm detection. We have therefore constructed a corpus named 'Ben-Sarc' that contains Facebook comments written in Bengali. Furthermore, we have classified the Bengali texts as sarcastic and non-sarcastic and proposed a sarcasm detection model using machine learning.
In the next section, we highlight the objective of our research. Then, we briefly discuss related work on sarcasm detection in high- and low-resource languages in Section 3. Section 4 describes the dataset creation along with the annotation process. Section 5 explains the proposed methodology. Section 6 contains the experimental results and their analysis, while Section 7 contains the conclusion and future work.
2. Research objectives and our contribution
The purpose of this research is discussed in this section. Due to limited resources and the lack of a high-quality dataset, sarcasm detection from Bengali text is totally unexplored. To the best of our knowledge, there is no large-scale self-annotated dataset for sarcasm detection from Bengali text. Moreover, maintaining the quality of such a dataset is required to produce satisfactory results. In this research, we introduce a large-scale self-annotated dataset, 'Ben-Sarc', for Bengali sarcasm detection, whose high quality is maintained through human evaluation. Then, we conduct a detailed experiment using machine learning, deep learning, and transfer learning to set a benchmark result on this dataset.
Our main contributions in this paper are summarized as follows:
-
At first, we construct a large-scale self-annotated Bengali corpus for sarcasm detection. The corpus can be found at https://github.com/sanzanalora/Ben-Sarc.
-
Then, we conduct a comprehensive experiment on this corpus to detect sarcasm from Bengali texts with the help of traditional machine learning, deep learning, and transfer learning approaches to set a baseline for future researchers.
3. Related works
The increasing engagement of social media users influences the quantitative and qualitative analysis of available data. Though most of the research is on the English language, sarcasm detection for low-resource languages such as Indonesian (Lunando and Purwarianti 2013), Hindi (Bharti, Babu, and Raman 2017; Baruah et al. 2020; Pawar and Bhingarkar 2020), Czech (Ptáček et al. 2014), and Japanese (Hiai and Shimada 2018) is available. We discuss some of the related approaches in the following literature review.
3.1. English language
Pawar and Bhingarkar (2020) experimented with traditional machine learning algorithms such as Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Random Forest (RF) on 9104 tweets from Twitter. Sentamilselvan et al. (2021) worked on sarcasm and irony detection separately: SVM, Naive Bayes (NB), Decision Tree (DT), and RF were applied to the irony detection dataset, whereas SVM and RF were applied to the sarcasm detection dataset. The segregated experiments attained 64% accuracy on irony and 76% accuracy on sarcasm detection.
There exist a few models that use contextual information about tweets on Twitter to detect sarcasm. Bamman and Smith (2021) focused on the context of authors and audiences of Twitter posts to identify sarcastic content with 85.1% accuracy; binary logistic regression was applied to train the model on 19,534 tweets. Wang et al. (2015) also relied on context for identifying sarcasm accurately. They collected 1500 tweets and derived 6774 history-based, 453 conversation-based, and 2618 topic-based contextual tweets. Their sequential SVM classifier exhibited a decent accuracy of 69.13%. Khatri and P. (2020) extracted 5000 tweets that include texts, labels, and contexts and analyzed the dataset with linear SVC, Logistic Regression (LR), Gaussian NB, and RF classifiers, utilizing Bidirectional Encoder Representations from Transformers (BERT) and Global Vectors for Word Representation (GloVe) embeddings. Logistic Regression with GloVe embeddings attained 69% accuracy on the dataset that includes context.
Hashtags play a meaningful role in Twitter content. Pawar and Bhingarkar (2020) extracted 9104 tweets containing hashtags such as '#sarcasm' and '#not' in Hindi and English. They implemented SVM, KNN, and RF classifiers; RF achieved 81% accuracy on sarcasm detection.
Riloff et al. (2013) considered the impact of positive and negative situations on different sentiments to analyze sarcasm. They used a supervised SVM classifier and an N-gram classifier. To increase accuracy, they optimized the RBF kernel, cost, and gamma parameters over 35,000 tweets.
Lemmens et al. (2020) applied four models to 9400 samples collected from Reddit and Twitter: bidirectional long short-term memory (LSTM), LSTM combined with a Convolutional Neural Network (CNN), SVM, and a multi-layer perceptron. Each model used 10-fold cross-validation, and an ensemble of the models achieved the best F1 score. Very few research works have executed deep learning models alongside transformer models to improve the prediction accuracy of sarcasm detection models.
Joshi et al. (2016) contended that sarcasm cannot be detected using then-current methods because they are unable to detect nuanced kinds of context incongruity. They suggested improving on earlier work by utilizing semantic similarity/discordance between word embeddings. They examined four different word embeddings and found that sarcasm recognition improved. The authors concluded that LSA and GloVe are less effective for sarcasm identification than Word2Vec and dependency weight-based features.
Ghosh, Fabbri, and Muresan (2018) explained how crucial it is to spot irony and sarcasm in user-generated material on social media networks. They emphasized that recognizing these uses of figurative language is essential for appreciating people's true thoughts and opinions. The research offers computational models for sarcasm identification in social media conversations and investigates the efficacy of modeling conversation context for sarcasm detection.
3.2. Bengali language
Recently, emotion analysis and specific sentiment analysis tasks like abusive text detection, toxicity detection, sarcasm detection, and hateful speech detection from Bengali text have attracted increasing attention from researchers in the Bengali Language Processing area. In this subsection, we discuss these research areas on Bengali text, as all of them are subsets of sentiment analysis.
Tripto and Ali (2018) presented a deep-learning approach to detect sentiment labels and emotions from Bengali, Romanized Bengali, and English YouTube comments. Skip-gram and continuous bag-of-words Word2Vec models were used to obtain the word embedding representations for CNN and LSTM models.
Ishmam and Sharmin (2019) presented a machine learning-based model and a Gated Recurrent Unit (GRU)-based deep neural network model to detect hateful speech in comments from public Facebook pages, where the GRU obtained 70.10% accuracy. They collected 5126 comments, annotated them, and divided them into six classes.
Emon et al. (2019) reported a deep-learning approach for detecting abusive Bengali comments; using an RNN on 4700 Bengali text documents, they achieved an accuracy of 82%. Hussain et al. (2018) used 300 Facebook comments without any predictive algorithm to detect abusive Bengali text. Chakraborty and Seddiqui (2019) used Multinomial NB (MNB), SVM, and Linear SVM to identify offensive text from 5644 posts and comments with emoticons, where Linear SVM achieved 78% accuracy. Awal, Rahman, and Rabbi (2018) collected 2665 English texts from YouTube and translated them into Bengali to build an abusive text dataset; the NB classifier achieved 80.57% accuracy with a 39% F1 score using 10-fold cross-validation.
Baruah et al. (2020) identified aggression and misogynistic aggression from English, Hindi, and Bengali texts. They utilized En-BERT, RoBERTa, DistilRoBERTa, and SVM for English, but Multilingual BERT (M-BERT), XLM-RoBERTa, and SVM for Bengali and Hindi. Akhter et al. (2018) detected cyberbullying from Bengali text with NB, KNN, and SVM using 2400 Bengali texts collected from Facebook and Twitter. Banik and Rahman (2019) tried machine learning and deep learning models for toxicity detection using 4255 Bengali comments.
Lora et al. (2020) investigated deep learning algorithms for emotion recognition from Facebook comments. The distinction between positive and negative emotions was successfully produced using stacked LSTM, stacked LSTM with 1D convolution, CNN with pre-trained word embeddings, and RNN with pre-trained word embedding models, with the latter showing the greatest accuracy at 98.3%.
The limitations of all these works in the area of Bengali sentiment analysis point to the unavailability of a large-scale Bengali text corpus. For this reason, Ahmed et al. (2021) constructed a dataset containing 44,001 comments from public Facebook posts to help researchers detect online harassment.
There are limited contributions in the area of satire, irony, or sarcasm detection. Sharma, Mridul, and Islam (2019) detected satire in Bengali documents. They created their own Word2Vec model and achieved an accuracy of 96.4% using a CNN model, but the dataset had insufficient data. Das and Clark (2018) identified sarcasm from 41,350 Facebook posts considering public reactions, interactive comments, and images. They utilized machine learning algorithms and a CNN-based model to detect sarcasm from images. Though the dataset is adequately large, the annotation process should have received special attention.
Memes have lately gained popularity as a means of information dissemination on social media. A meme is an idea, habit, or style that circulates throughout a community through mimicry and frequently bears symbolic significance referring to a specific occurrence or topic, and information can circulate through memes in a sarcastic way. Therefore, many researchers are interested in working on memes. Hossain, Sharif, and Hoque (2022a) developed a multimodal hate speech dataset with 4158 memes featuring Bengali and code-mixed captions. Hossain et al. (2022b) proposed a methodology for identifying multilingual offense and trolling in social media memes that uses a weighted ensemble technique to apply weights to the contributing graphical, textual, and multimodal models. Table 1 shows a comparative analysis of our study and the existing literature.
As far as we have seen, there is no research work on sarcasm detection from Bengali text, as there is no publicly available dataset and no comprehensive study that utilizes machine learning, deep learning, and transfer learning to detect sarcasm from Bengali text. Therefore, in this paper, we present a comprehensive approach that includes machine learning, deep learning, and transfer learning. Besides, we introduce a large-scale human-annotated dataset named 'Ben-Sarc' containing 25,636 comments written in Bengali collected from Facebook.
4. Dataset construction
As far as we have seen, there is no available labeled dataset for sarcasm detection in Bengali, so we felt the need to create our own. We named our dataset the Bengali Sarcasm dataset (Ben-Sarc). Dataset construction took approximately three months. In the following subsections, we discuss the features of the Ben-Sarc dataset in detail.
4.1. Content source
As Facebook is one of the major sources of textual data (Salloum et al. 2017), we targeted public Facebook pages to construct the Ben-Sarc dataset. We collected Bengali Facebook comments from 14 different public pages from Bangladesh and India, dated from 2013 to 2021. The content of the pages is shown in Table 2.
4.2. Content search
The Facebook comment section usually consists of users' reactions to a post. The commenters on the targeted pages are mostly Bengali speakers, and there are many comments written in Bengali, English, and Romanized Bengali. We have taken only the Bengali comments. All the comments were scraped manually by the authors of this paper.
4.3. Text cleaning and noise removal
Text preprocessing is generally a vital phase of NLP problems (Hemalatha, Varma, and Govardhan 2012). It converts text into a convenient format. The comment section of Facebook is very noisy and mostly contains errors and useless information (Salloum et al. 2017). A list of preprocessing steps has been executed on the collected texts to enrich the Ben-Sarc dataset: removing non-Bengali words, duplicate texts, emojis, links, and URLs; replacing #hashtags and all symbols and special characters (e.g. '\n', '%', '$', '&', '@') with a single space; and replacing multiple punctuation marks (e.g. '?', '|', ';', '!', ',') with a single one.
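A minimal Python sketch of cleaning steps of this kind is given below. The regular expressions and the `clean_comment` helper are illustrative assumptions rather than the exact rules used to build Ben-Sarc; the special-character replacement is folded into the final non-Bengali filter, and duplicate removal happens at the corpus level.

```python
import re

# Unicode block for Bengali script, kept so that Bengali text survives cleaning.
BENGALI = "\u0980-\u09FF"

def clean_comment(text):
    """Illustrative cleaning pipeline for one comment."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove links and URLs
    text = re.sub(r"#\w+", " ", text)                    # remove hashtags
    text = re.sub(r"([?|;!,])\1+", r"\1", text)          # multiple punctuation -> single
    text = re.sub(fr"[^{BENGALI}?|;!,\s]", " ", text)    # drop non-Bengali words, emojis, symbols
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace

# Duplicate comments can then be dropped at the corpus level, e.g. via a set.
```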
4.4. Annotation process
The contrast between statements meant to communicate a genuine or literal meaning and those intended to convey an opposite or ironic meaning is at the basis of the sarcastic and non-sarcastic labels in the sarcasm detection research challenge. In other words, sarcastic labels are assigned to utterances that are meant to be regarded as the inverse of their literal or surface-level meaning, whereas non-sarcastic labels are applied to utterances that are meant to be interpreted as their literal or surface-level meaning. For this reason, each text in the Ben-Sarc dataset has been annotated manually by the authors using '0' and '1', as we intend to work on a binary classification problem, sarcasm detection. '0' denotes non-sarcastic comments and '1' denotes sarcastic comments. Each text in the Ben-Sarc dataset has been annotated by five annotators, and the final choice on the polarity of a single text has been made by majority voting over the five annotations. The decision to employ five annotators stems from a desire for robustness: having several annotators independently analyze each text improves the annotations, and their diverse viewpoints and judgments give a more thorough and well-rounded picture of the sarcasm instances in the dataset. Facebook comments are frequently filled with harsh and filthy phrases, slang, and personal attacks (Hussain et al. 2018; Ahmed et al. 2021; Akhter et al. 2018). As a result, we made sure that all annotators were adults with domain knowledge.
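The majority-voting step itself is simple; the following sketch shows one way to resolve five 0/1 annotations into a final label. The `majority_label` helper is ours, for illustration only.

```python
from collections import Counter

def majority_label(annotations):
    """Resolve five independent 0/1 annotations into one final label.

    With an odd number of annotators, a tie is impossible, so majority
    voting always yields a decision."""
    assert len(annotations) == 5, "Ben-Sarc uses five annotators per text"
    return Counter(annotations).most_common(1)[0][0]

print(majority_label([1, 0, 1, 1, 0]))  # three of five said sarcastic -> 1
```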
4.5. Human evaluation of Ben-Sarc dataset
To maintain the quality of a labeled dataset, evaluation is a necessary step. We have tried to ensure that the data in the Ben-Sarc dataset is not labeled vaguely, so that researchers can use it for further applications without hesitation. The assessment has been carefully carried out by two external human evaluators, experts in this field with 3 to 5 years of experience in dataset annotation and validation. Each evaluator is an adult, a native Bengali speaker proficient in Bengali, and active on social media with a habit of reading sarcastic comments. Each evaluator was asked to assess the quality of the dataset by replying 'Yes' or 'No' to the questions stated below:
Q1. Is the text ironic, caustic, or biting without emoji and emoticons?
Q2. If Q1 is 'Yes', is the text written in dialect, or does it contain spelling mistakes or manipulated traditional phrases, sentences, songs, or poems?
Q3. If Q1 is ‘Yes’, is there any totally opposite context in that text?
Q4. If Q1 is 'Yes', is there any information in the text that causes confusion in deciding whether the text is sarcastic or not?
The motivation for designing the questions for human evaluation of the Ben-Sarc dataset comes from Hasan et al. (2021). The recent advancement in quality estimation of neural language generation (NLG) models has inspired these characteristics. Belinkov and Bisk (2017) demonstrated that NLG models are sensitive to low-quality training samples; thus, it is critical to evaluate the quality of comments using the characteristics of Q1. Moreover, the characteristics of Q2 and Q3 have been designed to verify actual uniformity and fidelity, whereas Q4 determines whether there is any ambiguous text or confusion in deciding the polarity of the text. For example, a text whose translation reads 'Please give me your address, I will post a letter with obscenities' creates confusion because someone may take it as abusive text or a threat, which makes it non-sarcastic, whereas others may take it as a joke, which makes it sarcastic.
The inter-annotator agreement is measured using Cohen's kappa coefficient (Cohen 1960) in Table 3. Cohen's kappa measures annotator agreement and determines how well one annotator agrees with another. To evaluate the conventional inter-annotator agreement, a pairwise kappa coefficient is computed using Equation (1).
(1) \begin{equation} \kappa = \frac{P_o - P_e}{1 - P_e} \end{equation}
where $P_o$ represents the relative observed agreement and $P_e$ denotes the hypothesized probability of chance agreement. The quality assessment of Ben-Sarc is done on 5000 random samples of Ben-Sarc data. In most cases, the evaluators agree that the text seems ironic without any emoticons. Besides, a high percentage for Q2 indicates that dialect, manipulation of traditional poems and songs, and spelling mistakes also express sarcasm in the text, whereas a low percentage for Q3 indicates that a totally opposite context is relatively uncommon. However, Q4 raises ambiguity in deciding whether a text is sarcastic or not. In our situation, Q3 and Q4 should both have very low percentages, but the percentage of Q4 is comparatively higher than that of Q3 according to the inter-annotator agreement.
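As a concrete illustration, pairwise kappa can be computed with scikit-learn; the evaluator answers below are hypothetical, not taken from the actual assessment.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 'Yes'/'No' answers from the two evaluators for one question
# over a handful of samples (the real assessment uses 5000 samples).
evaluator_1 = ["Yes", "Yes", "No", "Yes", "No", "Yes"]
evaluator_2 = ["Yes", "No",  "No", "Yes", "No", "Yes"]

# cohen_kappa_score implements kappa = (P_o - P_e) / (1 - P_e) directly.
print(cohen_kappa_score(evaluator_1, evaluator_2))
```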
4.6. Dataset description
A detailed description of our Ben-Sarc dataset is presented in this section. The dataset contains a total of 25,636 Bengali comments, of which 12,818 are sarcastic and 12,818 are non-sarcastic. The data distribution according to the labels is visualized in Figure 1.
Table 4 presents a short overview of our labeled dataset construction. The maximum length of a text in the Ben-Sarc dataset is 395 words and the minimum is three words; the average length of a comment is fifteen words. The length-frequency distribution of the whole dataset is shown in Figure 2. For better visualization, the text length is limited to 100. The overall summary of the Ben-Sarc dataset, including the number of comments, words, and unique words per class, is shown in Table 5. The statistics of the Ben-Sarc dataset are visualized in Figure 3.
5. Proposed methodology
In this section, we provide a concise overview of our proposed methodology for detecting sarcasm. Figure 4 represents our proposed approach, which is divided into five phases. The first phase comprises dataset construction. The second phase involves dataset preprocessing using a few NLP techniques like punctuation removal and tokenization. The third phase incorporates the feature selection process, which includes Term Frequency-Inverse Document Frequency (TF-IDF) and n-grams for traditional machine learning models, word embeddings for deep learning models, and pre-trained transformer-based models for transfer learning. The fourth phase is the training phase, in which we employ traditional machine learning models, deep learning models, and transfer learning to classify text as sarcastic or non-sarcastic. We examine the performance of each classifier and present the best-performing classifier in the last phase. The details of all phases are discussed in the following subsections.
5.1. Phase I—Dataset construction
We have collected 25,636 Facebook comments written in Bengali. The overall dataset construction process is described in Section 4.
5.2. Phase II—Preprocessing
A few preprocessing steps have been executed before model training: punctuation removal (e.g. '!', '?') and tokenization.
Elongated words often carry sentiment information. For example, a comment translated as "Veryyyy funnnnny" emphasizes a more positive sentiment than "Very funny" (Tripto and Ali 2018). So, we have not applied stemming or lemmatization, in order to preserve the actual sense of elongated words.
5.3. Phase III—Feature selection
Feature selection is the third phase of our proposed model. We have used three feature extraction approaches: n-grams, TF-IDF, and word embeddings. For traditional machine learning classifiers, we have used the TF-IDF and n-gram methods. TF-IDF is the most extensively utilized traditional feature extraction approach in classification applications (Kumari, Jain, and Bhatia 2016). It is a statistic that reveals how essential a term is to a document in a collection. A word's TF-IDF value increases proportionally with the number of times the term appears in the document but is offset by the frequency of the term in the corpus, which helps to balance out terms that occur more commonly in general. It is computed using Equation (2).
(2) \begin{equation} \textit{tf-idf}_{t,d} = tf_{t,d} \times \log \frac{N}{df_t} \end{equation}
where $tf_{t,d}$ indicates the frequency of term $t$ in document $d$, $df_t$ denotes the total number of documents containing term $t$, and $N$ is the total number of documents. To pick features for the deep learning models, different pre-trained word embeddings for Bengali are used; all are explained in detail in Section 5.4.2. The selected features for transfer learning are pre-trained language models, described in detail in Section 5.4.3. The use of n-grams, pre-trained word embeddings, and transformers in our task creates a connection with the field of grammar in NLP. N-grams capture sequential word patterns, which frequently reflect syntactic structures in language; these patterns can encode grammatical relations, assisting in the identification of sentence structure and context. Pre-trained word embeddings use massive language datasets to represent words in a semantic space, capturing complex semantic and syntactic relationships; in doing so, they implicitly represent some grammatical connections based on co-occurrences. Transformers, with their attention mechanisms, excel at capturing long-range relationships and complex contextual connections in text, an ability that extends to the complicated sentence forms that reflect grammatical syntax.
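A minimal sketch of TF-IDF n-gram feature extraction with scikit-learn is shown below. The placeholder corpus is illustrative, and scikit-learn's variant of the formula adds smoothing and normalization on top of Equation (2).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder tokens standing in for preprocessed Ben-Sarc comments.
comments = ["w1 w2 w3 w4", "w2 w3 w5 w1", "w1 w4 w5 w2"]

# ngram_range=(1, 2) extracts unigrams and bigrams; each column of X holds
# the tf-idf weight of one n-gram.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(comments)
print(X.shape)  # (number of comments, number of n-gram features)
```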
5.4. Phase IV—Training
To classify whether a text is sarcastic or not, we have investigated traditional classifiers, deep learning classifiers, and transfer learning techniques. A comprehensive description of all the models is manifested in the following subsections.
5.4.1. Traditional classifiers
We have initiated the sarcasm detection system by investigating traditional classifiers. We have used LR, DT, RF, MNB, KNN, Linear SVM, and Kernel SVM as traditional machine learning classifiers. Furthermore, we have applied all possible combinations of unigrams, bigrams, and trigrams, extracting features with TF-IDF, for both 5-fold and 10-fold cross-validation. A minimal sketch of such a pipeline is shown below.
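The sketch below shows one way such a pipeline can be assembled; the tiny placeholder corpus stands in for Ben-Sarc, and the bigram MNB configuration mirrors the best-performing traditional setting reported later.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Tiny placeholder corpus; in practice texts/labels come from Ben-Sarc.
texts = ["w1 w2 w3", "w2 w4 w1", "w5 w6 w2", "w1 w6 w3", "w4 w5 w1",
         "w7 w8 w9", "w8 w9 w7", "w9 w7 w8", "w7 w9 w8", "w8 w7 w9"]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# TF-IDF bigram features feeding a Multinomial Naive Bayes classifier,
# evaluated with k-fold cross-validation as in Experiment I.
model = make_pipeline(TfidfVectorizer(ngram_range=(2, 2)), MultinomialNB())
scores = cross_val_score(model, texts, labels, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f}")
```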
Traditional classifiers are incapable of capturing the sequential information present in text, and they are not well suited to improving performance as the amount of data grows. So, we later experiment with deep learning models on the Ben-Sarc dataset.
5.4.2. Deep learning models
Recurrent neural networks (RNNs) are deep learning neural networks that are specially built to learn data sequences and are mostly used for textual data categorization. The learning process is carried out at hidden recurrent nodes based on their prior layers of nodes. However, when dealing with long sequences of data, RNNs suffer from the vanishing gradient problem. LSTM (Hochreiter and Schmidhuber 1997) networks are a form of recurrent neural network capable of learning order dependency in sequence prediction applications. LSTM introduced a solution to the vanishing gradient problem and has proven efficient in various NLP-related applications. So, LSTM is chosen as our baseline model; the LSTM model is then expanded to understand the network's behavior.
5.4.2.1. Required basic components for sarcasm detection models
In this subsection, the basic components of the sarcasm detection models are explained. Readers familiar with these components may skip this subsection.
-
LSTM: LSTMs interpret input sequences as pairs $(x_1, y_1), \ldots, (x_z, y_z)$. An LSTM maintains a hidden vector $\textbf{h}_t$ and a memory vector $\textbf{m}_t$ for each pair $(x_i, y_i)$ and at each time step $t$, which are responsible for regulating state updates and outputs to create a target output $y_i$ depending on the previous state and the input $x_i$. At time step $t$, the computations are as follows (Graves 2013; Kalchbrenner, Danihelka, and Graves 2015):
(3) \begin{equation} h_t = f(Wx_t + Uh_{t-1} + b) \end{equation}
(4) \begin{equation} i_t = \sigma (W^ix_t + U^ih_{t-1} + b^i) \end{equation}
(5) \begin{equation} f_t = \sigma (W^fx_t + U^fh_{t-1} + b^f) \end{equation}
(6) \begin{equation} o_t = \sigma (W^ox_t + U^oh_{t-1} + b^o) \end{equation}
(7) \begin{equation} g_t = \tanh (W^gx_t + U^gh_{t-1} + b^g) \end{equation}
(8) \begin{equation} c_t = f_t \odot c_{t-1} + i_t \odot g_t \end{equation}
(9) \begin{equation} h_t = o_t \odot \tanh (c_t) \end{equation}
where $\sigma$ indicates the sigmoid function and $\odot$ indicates element-wise multiplication. $W^i$, $U^i$, and $b^i$ are the two weight matrices and the bias vector for the input gate $i$; the notation is analogous for the forget gate $f$, output gate $o$, tanh layer $g$, memory cell $c$, and hidden state $h$. The forget gate selects which past information should be forgotten, whereas the input gate decides what new information should be placed in the memory cell. Finally, the output gate determines how much information from the internal memory cell is revealed. These gate units assist an LSTM model in remembering important information over numerous time steps.
-
CNN: A CNN (Kim 2014) is mainly made up of convolutional layers and pooling layers. The convolutional layers contain weights that must be learned, whereas the pooling layers transform the activation using a fixed function.
-
– Convolutional layer: A convolutional layer is made up of a number of kernels whose parameters must be learned. It is a local feature extractor layer with well-trained kernels for weight modification utilizing the back-propagation approach (Rumelhart, Hinton, and Williams 1986). The kernels' height and width are smaller than those of the input volume. Every filter is convolved with the input volume to generate a neuron activation map. The convolutional layer's output volume is calculated by stacking the activation maps of all filters along the depth dimension. The convolution output is calculated by convolving an input $I$ with a number of filters as follows.
(10) \begin{equation} x_k = I*W_k + b_k;\ k = 1,2,3,\ldots,F \end{equation}
where $F$ is the number of filters, $x_k$ is the output corresponding to the $k$th convolution filter, $W_k$ contains the weights of the $k$th filter, and $b_k$ is the $k$th bias.
-
– Global max-pooling layer: A pooling layer is an additional layer inserted after the convolutional layer. Pooling layers provide a method for downsampling feature maps by summarizing the presence of features in patches of the feature map. Maximum pooling, or max pooling (Boureau, Ponce, and LeCun 2010), is a pooling operation that calculates the maximum value in each patch of each feature map. The global max-pooling layer is a form of pooling layer in which the pool size is fixed to the size of the input, so that the maximum of the whole input is returned as the output value.
-
-
Embedding layer: An embedding layer is learned alongside a neural network model on a particular NLP application, such as language modeling or text categorization. Given an input sentence $s_i$, the word sequence $w_1, w_2, w_3, \ldots, w_i$ of this sentence is fed into a word embedding layer to produce embedding vectors $x_1, x_2, x_3, \ldots, x_i$ before being sent to the next layer. The embedding layer is defined by an embedding matrix $E \in \mathbb{R}^{K \times |V|}$, where $K$ indicates the embedding dimension and $|V|$ indicates the vocabulary size.
-
Pre-trained word embeddings: Pre-trained word embeddings are embeddings learned on one task and then applied to another related problem. In this paper, we have used the following pre-trained word embeddings available for Bengali; a sketch of loading them appears after this list.
-
– GloVe (Pennington, Socher, and Manning 2014) creates feature vectors based on global and local word counts, word-word co-occurrence, and the local context around the center word. GloVe extracts semantic and syntactic features effectively; however, owing to matrix factorization, it takes a long time to train. In our task, we have used Bengali-GloVe.
-
– Word2Vec (Mikolov et al. 2013) is a prediction-based embedding approach that generates an embedding vector from the center word to the context words or vice versa. In this paper, we have used Bengali-Word2Vec.
-
– BPEmb (Heinzerling and Strube 2017) is based on Byte-Pair Encoding and provides a collection of pre-trained subword embedding models for 275 languages, including Bengali.
-
– FastText (Grave et al. 2018) is a prediction-based embedding approach that conveys subword information. We have used the FastText models created for 157 languages, including Bengali.
-
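The sketch below shows, under stated assumptions, how such embeddings can be loaded and turned into an embedding-layer weight matrix. The BPEmb vocabulary size and dimension, the FastText file name, and the `build_embedding_matrix` helper are illustrative choices, not the paper's exact configuration.

```python
import numpy as np
from bpemb import BPEmb
from gensim.models import KeyedVectors

# BPEmb downloads pre-trained Bengali subword embeddings on first use;
# the vocabulary size (vs) and dimension (dim) here are illustrative.
bpemb_bn = BPEmb(lang="bn", vs=50000, dim=100)
print(bpemb_bn.vectors.shape)  # subword embedding matrix, (50000, 100)

# FastText Bengali vectors (e.g. cc.bn.300.vec from the fastText release)
# must be downloaded separately before loading with gensim.
fasttext_bn = KeyedVectors.load_word2vec_format("cc.bn.300.vec")

def build_embedding_matrix(word_index, vectors, dim=300):
    """Build a Keras embedding-layer weight matrix by looking up each word
    of the tokenizer's vocabulary in the pre-trained vectors; words missing
    from the vectors keep a zero row."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix
```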
5.4.2.2 Sarcasm detection models architecture
A detailed description of all deep-learning models for sarcasm detection is provided below.
-
a. LSTM: A single hidden LSTM layer is followed by a typical feedforward output layer in the original LSTM model. After preprocessing, texts are passed through a tokenizer, and one-hot encoded vectors of length 100 are generated because Facebook comments are usually long. These vectors are then fed into the embedding layer, whose output is fed into the LSTM layer. Finally, a dense layer with a sigmoid activation function is added; the number of nodes in the dense layer is two because of the binary classification task. The vocabulary size is 10000. The architecture of this model is shown in Figure 5, and a minimal Keras sketch of this baseline follows this list.
-
b. LSTM + CNN: This model is the combination of the LSTM and CNN models. The architecture of the LSTM and the input of the embedding layer are the same as in the model described in Subsection 5.4.2.2(a). After the embedding layer, a 1D convolutional layer with 100 filters and kernel size 4 is added to reduce the otherwise longer training time. Next, a global max-pooling layer with pool size 5 is used to extract the maximum value from each filter, and its output is the input of the LSTM layer. This vector is directly passed to a dense output layer with a sigmoid activation function, where the number of output nodes is the number of labels in the dataset.
-
c. LSTM + CNN + Pre-trained word embedding: The architecture of this model is the same as the model mentioned in Subsection 5.4.2.2(b). The weights of the embedding layer are initialized with the weights of pre-trained word embedding. A dropout layer and then a dense layer are added after the embedding layer. After that, a 1D convolutional layer and a global max-pooling layer are added and the output is the input of the LSTM layer. This vector is directly passed to the output layer with the sigmoid activation function as mentioned in Subsection 5.4.2.2(b). The architecture of this model is shown in Figure 6.
-
d. Stacked LSTM + CNN + Pre-trained word embedding: This model is the combination of stacked LSTM and CNN models. The stacked LSTM is a variation of the LSTM model that includes multiple hidden LSTM layers, each of which contains multiple memory cells. The architecture of CNN is the same as the model mentioned in Subsection 5.4.2.2(c). The output of the LSTM with 1D convolution is passed to another LSTM layer before being used as the input of a dense layer. The obtained vector is directly passed to a dense layer, which is the output layer with a sigmoid activation function and the number of output nodes is the number of labels in the dataset. The architecture of this model is shown in Figure 7.
-
e. BiLSTM + CNN + Pre-trained word embedding: A bidirectional LSTM (Schuster and Paliwal 1997), often known as BiLSTM, is a sequence processing model that consists of two LSTMs, one taking the input forward and the other backward. BiLSTMs effectively increase the amount of information available to the network, allowing the algorithm to understand the context better (knowing which words immediately follow and precede a word in a sentence). The architecture of the model remains the same as the model described in Subsection 5.4.2.2(c); only the LSTM layer is replaced with a BiLSTM layer.
-
f. BiLSTM + Pre-trained word embedding: The architecture of the model remains the same as the model mentioned in Subsection 5.4.2.2(e) by dropping the CNN portion of the model. The architecture of this model is shown in Figure 8.
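A minimal Keras sketch of the baseline model in (a) is given below. The embedding dimension is an assumed value, while the vocabulary size, input length, LSTM units, output layer, and loss follow the description above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 10000  # vocabulary size stated in (a)
MAX_LEN = 100       # encoded input length stated in (a)
EMBED_DIM = 128     # embedding dimension (an assumed value)

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),
    LSTM(100),                       # 100 LSTM units, as in Experiment II
    Dense(2, activation="sigmoid"),  # two output nodes for the binary task
])
model.build(input_shape=(None, MAX_LEN))
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.summary()
```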
Deep learning models require a longer training time as they process input sequences token by token. To save computational cost, we later examine how the Ben-Sarc dataset performs with transfer learning.
5.4.3. Transfer learning
Transfer learning is a machine learning procedure in which an already-trained model for a similar task is the starting point for a new task (Torrey and Shavlik 2010). Transfer learning approaches have been effectively used for speech recognition, document categorization, and sentiment analysis in NLP (Wang and Zheng 2015). Figure 9 illustrates the transfer learning approach.
In transfer learning, we can utilize available pre-trained source models to develop new models. A plethora of transformer-based models for various NLP tasks have recently emerged. A significant improvement of transformer-based models over RNN-based models is that they accept the complete sequence as input all at once instead of analyzing an input sequence token by token. For this reason, we have utilized BERT. But BERT is a large neural network architecture with a massive number of parameters, varying from 100 million to over 300 million, so training a BERT model from scratch on a limited dataset would result in overfitting. Moreover, the computational cost of pre-training a BERT model is very high. It is therefore preferable to start from a pre-trained BERT model trained on a large dataset and further train it on our relatively small dataset; this approach is called fine-tuning, and it is why we have used transfer learning.
5.4.3.1 BERT
BERT (Devlin et al. 2019) is one of the most prevalent transformer-based models used for pre-training a transformer (Vaswani et al. 2017). BERT generates deep bidirectional word representations in unlabeled text based on the words' contextual relationships to their surroundings. Depending on its vocabulary, it generates word-piece embeddings. BERT pre-training is carried out using a masked language model, which randomly masks words that the model must estimate, computing the loss, and a next-sentence prediction task, in which the model predicts the next sentence from the present sentence.
Let $a_1, a_2, \ldots, a_6$ be the words of a sentence, with $a_5$ randomly masked by the $[MASK]$ token. The outputs for the sentence's words are then $b_1, b_2, \ldots, b_6$. The outputs are routed through a block that includes two fully connected layers, a GELU layer, and a normalization layer. The block outputs the sentence and the predicted value of the masked token. Three pre-trained BERT-based transformer language models from the Hugging Face Transformers library are used as source models, as these are widely used in downstream tasks like text classification; a sketch of loading them follows the list below. The transformer language models are:
-
Bangla BERT (base) (Sarker 2020), a pre-trained Bengali language model based on masked language modeling that has been pre-trained on the Bengali Wikipedia dump dataset and a large Bengali corpus taken from the Open Super-large Crawled Aggregated coRpus (OSCAR). The model follows the BERT-base-uncased architecture: 12 layers, 768 hidden units, 12 attention heads, and 110 M parameters.
-
Indic-Transformers Bengali BERT (Jain et al. 2020), a BERT language model pre-trained on about 3 GB of monolingual training corpus, majorly taken from OSCAR. It has achieved state-of-the-art performance on Bengali text classification tasks.
-
M-BERT (Devlin et al. 2019), a model pre-trained on the 102 languages with the largest Wikipedias, including Bengali. We have used the 'BERT-base-multilingual-uncased' model. It has 12 layers, 768 hidden units, 12 multi-headed attention layers, and 110 M parameters.
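The sketch below shows how such source models can be loaded from the Hugging Face hub. The hub IDs are our assumptions about which checkpoints correspond to the three models and should be verified against the model hub before use.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed Hugging Face hub IDs for the three source models described above.
MODEL_IDS = {
    "bangla-bert": "sagorsarker/bangla-bert-base",
    "indic-bn-bert": "neuralspace-reverie/indic-transformers-bn-bert",
    "m-bert": "bert-base-multilingual-uncased",
}

tokenizer = AutoTokenizer.from_pretrained(MODEL_IDS["m-bert"])
encoder = AutoModel.from_pretrained(MODEL_IDS["m-bert"])
print(encoder.config.hidden_size)  # 768 for all three BERT-base variants
```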
5.4.3.2 Architecture of our model
The detailed architecture of our model is shown in Figure 10. The fine-tuning strategies of our new models can be divided as follows:
-
a. Selecting a pre-trained source model: As explained earlier, the BERT-based transformer models mentioned in Section 5.4.3.1 are taken as pre-trained source models for this experiment to observe how they work with transfer learning.
-
b. Freezing the entire architecture of the source model: Before fine-tuning, all layers of each pre-trained language model are frozen by freezing BERT's weights. This prevents the model weights from being updated during fine-tuning.
-
c. Attaching our own neural network architecture: A number of dense layers with different activations, as detailed in Section 6.4.3, and a softmax output layer of our own are appended to the architecture to train the new model. Softmax can be expressed as
(11) \begin{equation} a_i = \frac{e^{z_i}}{\sum _{k=1}^{c} e^{z_k}}, \qquad \text{where } \sum _{i=1}^{c} a_i = 1 \end{equation}
The weights of the appended layers are updated during model training. Different optimizers and learning rates are experimented with to obtain the optimized hyperparameters, as explained in Section 6.4.3. A minimal sketch of this freeze-and-attach strategy is shown below.
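The following PyTorch sketch illustrates the strategy, assuming M-BERT as the source model; the head's layer sizes, activation, and dropout are one illustrative configuration from the search space in Table 13, not the single best setting.

```python
import torch.nn as nn
from transformers import AutoModel

class SarcasmClassifier(nn.Module):
    """Frozen BERT encoder plus a small trainable classification head."""

    def __init__(self, model_id="bert-base-multilingual-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_id)
        for param in self.bert.parameters():  # freeze the entire source model
            param.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(768, 512), nn.Tanh(), nn.Dropout(0.1),
            nn.Linear(512, 2),
            nn.LogSoftmax(dim=1),  # pairs with the negative log-likelihood loss
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head(cls)              # only the head's weights get updated
```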
5.5. Phase V—Evaluation
In the last phase, we measure the performance of all models from Phase IV. The achieved results are compared and the best-performing model is reported. The details of measuring model performance are discussed in the experiment section.
6. Experimental evaluation
6.1. Experimental setup
The Python Keras framework with a TensorFlow backend is used to implement all deep learning models, and the PyTorch library is used for training, tuning, and testing the transfer learning models. The experimental evaluation was conducted on a machine with an Intel Core i5 processor with a 2.71 GHz clock speed and 4 GB RAM. TensorFlow-based experiments can utilize GPU instructions. Google Colaboratory has been used for developing all the models described in later sections of this paper.
6.2. Experiments
Our experiments are categorized into three parts. Experiment I is concerned with the experiments on traditional classifiers. Experiment II focuses on the experiments on deep learning classifiers and experiment III reports on the experiments on transfer learning approaches.
To judge the effectiveness of the models, accuracy, precision, recall, and F1-score are taken into account. After hyperparameter tuning, a range of results has been obtained for each model, so only the best-performing model from each experiment is carried into the Result Analysis section.
6.2.1. Experiment I
In this experiment, the performance of the traditional machine learning classifiers mentioned in Section 5.4.1 has been evaluated on the Ben-Sarc dataset. For this experiment, 20% of our data is used for testing and the rest for training. A full overview of the performance with the necessary evaluation metrics for experiment I is given in Table 6, and the process of choosing hyperparameters for experiment I is discussed in Section 6.4.1. Table 6 shows that the MNB classifier achieved the highest accuracy among all traditional classifiers with the bigram technique for both 5-fold and 10-fold cross-validation, as it works well with high-dimensional text data by taking advantage of its probabilistic formulation: 72.01% for 5-fold and 72.36% for 10-fold cross-validation.
6.2.2. Experiment II
In this experiment, the performance of the different deep learning classifiers mentioned in Section 5.4.2 has been evaluated on the Ben-Sarc dataset. For this experiment, 20% of our data is used for testing; the remainder is split into 60% of the whole dataset for training and 20% for validation. A full overview of the performance with the necessary evaluation metrics for experiment II is given in Table 7.
For the LSTM models, the number of LSTM units has been set to 100, and for the stacked LSTM, to 128 and 64. One hundred epochs have been used for all deep learning models; although the epoch count was set to 100, training stopped earlier due to the early stopping criterion, which monitors for two epochs with no improvement in the model's performance. Binary cross-entropy is used as the loss function in all cases, as the task is a binary classification problem. In all cases, the dropout probability and recurrent dropout probability have been set to 0.2. The hyperparameter settings are shown in Table 8, and a sketch of this training configuration is shown below. The procedure for picking hyperparameters for experiment II is covered in Section 6.4.2.
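The sketch below illustrates the training configuration; the arrays are random placeholders for the tokenized splits, and `model` is assumed to be the compiled network from the earlier Keras sketch.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical

# Placeholder arrays standing in for the tokenized Ben-Sarc splits; `model`
# is the compiled network from the earlier LSTM sketch.
X_train = np.random.randint(0, 10000, (64, 100))
y_train = to_categorical(np.random.randint(0, 2, 64), 2)
X_val = np.random.randint(0, 10000, (16, 100))
y_val = to_categorical(np.random.randint(0, 2, 16), 2)

# Training stops after two consecutive epochs without improvement, per the
# early-stopping criterion described above.
early_stop = EarlyStopping(monitor="val_loss", patience=2)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, batch_size=16, callbacks=[early_stop])
```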
From Table 7, it is clear that LSTM without pre-trained word embeddings has achieved the highest accuracy, 72.48%. When an extra CNN and max-pooling layer is added to this model, the performance decreases slightly, and the performance of this LSTM + CNN model decreases further when pre-trained word embeddings are used. After that, the performance of the other models decreases gradually as certain parts of the model are added or removed. The reason accuracy decreases with pre-trained word embeddings is that they are mainly trained on large datasets like Wikipedia, where the language is very formal; in the Ben-Sarc dataset, however, 87.65% of the texts are written in dialect, with manipulated phrases and sentences and spelling mistakes, which mark a text as sarcastic, as mentioned in Section 4.5.
6.2.3. Experiment III
In this experiment, the performance of transfer learning techniques mentioned in Section 5.4.3 for the Ben-Sarc dataset has been evaluated. A full overview of the performance with necessary evaluation metrics for experiment III has been demonstrated in Table 9. The hyperparameter setting is shown in Table 10. The approach for optimizing hyperparameters for experiment III is outlined in Section 6.4.3.
Here, negative log-likelihood loss (Platt 1999) has been used in all cases, as it is the classic loss function for classification tasks (Ruan et al. 2020). Softmax activation has been used in the output layer in all cases. From Table 9, it can be concluded that the transfer learning approach with the Indic-Transformers Bengali BERT pre-trained model has obtained the highest accuracy among all pre-trained models: 75.05%, using seven hidden layers and the settings mentioned above.
For the transfer learning approach, we first took the M-BERT transformer model as the pre-trained model, but its overall performance was not satisfactory. We then replaced the pre-trained model with Indic-Transformers Bengali BERT, keeping the same hyperparameter settings; a significant increase in all performance metrics was observed, with accuracy improving by almost 9% from changing only the pre-trained model. Finally, we experimented with another pre-trained model, Bangla BERT, but the performance degraded significantly.
6.3. Result analysis
The highest accuracy from each experiment is summarized in Table 11. From Table 11, it can be concluded that the performance of experiment III, the transfer learning approach, is slightly better than that of the traditional machine learning and deep learning classifiers. Using Indic-Transformers Bengali BERT as the pre-trained model, transfer learning obtained the highest accuracy of 75.05% on the Ben-Sarc dataset, whereas LSTM without pre-trained word embeddings among the deep learning classifiers and MNB among the traditional classifiers achieved at most 72.48% and 72.36% accuracy, respectively.
6.4. Hyperparameter tuning
Hyperparameter tuning is a necessary stage of each experiment to boost performance. It has been carried out for all of experiments I, II, and III, and a thorough explanation is provided in the following subsections.
6.4.1. For experiment I
For experiment I, we applied 5-fold and 10-fold cross-validation to the seven traditional classifiers listed in Table 6, with unigram, bigram, and trigram techniques for each classifier. The best results from each classifier are shown in Table 6 for both cross-validation settings. The MNB classifier with the unigram technique attained 71.85% accuracy for 5-fold cross-validation and 72.13% for 10-fold cross-validation; these are the second-best results in each setting, and no other classifier surpasses them.
6.4.2. For experiment II
For experiment II, all hyperparameters that have been tuned for several combinations across all models are mentioned in Table 12. We have experimented with all possible combinations of pre-trained word embedding, dense layer size, batch size, activation in the hidden layer, optimizer, and learning rate mentioned in Table 12 for all deep learning models for hyperparameter tuning.
Among them, LSTM without pre-trained word embeddings achieved 71.37% with dense layer size 1000, two LSTM layers, batch size 16, tanh activation in the hidden layer, and the Nadam optimizer with a 0.0001 learning rate. This is the second-highest accuracy for experiment II; no other combination obtains better results.
6.4.3. For experiment III
For experiment III, all hyperparameters that have been tuned for several combinations for all models are mentioned in Table 13. We have experimented with all possible combinations of the pre-trained source model, batch size, number of hidden layers, number of nodes in hidden layers, activation, dropout, optimizer, and learning rate mentioned in Table 13 for all transfer learning models for hyperparameter tuning.
The second-highest accuracy in experiment III is 74.00%, achieved by two settings: the first with 3 hidden layers of 512, 256, and 128 nodes, tanh hidden layer activation, 0.1 dropout, the SGD optimizer with a 0.01 learning rate, and batch size 8 with 30 epochs; the second with 4 hidden layers of 512, 256, 128, and 64 nodes, sigmoid hidden layer activation, 0.2 dropout, the Adam optimizer with a 0.001 learning rate, and batch size 4 with 30 epochs.
6.5. Error analysis
In this subsection, we present an error analysis of the three best-performing models from experiments I, II, and III (MNB, LSTM, and the Indic-Transformers Bengali-BERT-based transfer learning model, as shown in Table 11) on four types of sample input, to demonstrate which types of input deep learning can predict but traditional machine learning cannot. Table 14 shows the sample inputs and outputs with the predicted score of each best-performing model.
For SI 1, all models predict the text as sarcastic, as all predicted scores are greater than 0.5 in this binary classification problem. Input SI 1 is clearly a funny text and there is no confusion about it; the rhythm of a poem is used to express sarcasm, which is very common in Bengali sarcasm. Input SI 2, in contrast, receives a very low sarcasm score and is therefore correctly classified as non-sarcastic by all models, as the input text is clearly non-sarcastic.
For input SI 3, the MNB model misclassifies the text as non-sarcastic, whereas the LSTM model classifies it correctly, though with a marginal score. The MNB model fails here because traditional classifiers are unable to capture the sequential information present in the text; as LSTM is a sequential model, it classifies the text correctly even though the predicted score is not high. The Indic-Transformers Bengali-BERT-based transfer learning model also classifies it correctly, as it can use the knowledge and advantages of the transformer model.
For input SI 4, all models fail to classify it correctly as sarcastic. The text seems non-sarcastic, but it contains one deliberately misspelled word: a word meaning 'poisonous' in place of the intended word meaning 'very'. By intentionally introducing this spelling mistake, the text signals sarcasm. Therefore, MNB and LSTM fail to predict it correctly. Though the transfer learning model also misclassifies it, its prediction score almost reaches the binary classification threshold.
6.6. Statistical analysis
To determine whether the variations among the predictions produced by the traditional machine learning, deep learning, and transfer learning models are statistically significant, we performed a statistical test. We utilized the Friedman test (Friedman 1937) to inspect whether the observed differences in performance across models were statistically significant or merely due to chance. The obtained p-values for the evaluated paradigms (machine learning: 0.4232, deep learning: 0.4159, transfer learning: 0.3679) offer interesting observations on the potential differences between these methodologies. All three p-values exceed the conventional threshold of 0.05, indicating a lack of solid evidence for significant performance differences between approaches within the provided dataset and experimental setting. The findings indicate that, while there might be differences in performance outcomes, they are not large enough to reach statistical significance at this threshold. A sketch of this test is shown below.
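As an illustration, the test can be run with SciPy as below; the per-fold scores are hypothetical, not the actual experimental values.

```python
from scipy.stats import friedmanchisquare

# Hypothetical per-fold accuracy scores for three models within one
# paradigm; the test checks whether their differences are significant.
model_a = [0.72, 0.71, 0.73, 0.72, 0.70]
model_b = [0.71, 0.72, 0.71, 0.70, 0.71]
model_c = [0.70, 0.69, 0.72, 0.71, 0.70]

stat, p_value = friedmanchisquare(model_a, model_b, model_c)
print(f"p = {p_value:.3f}")  # p > 0.05 -> differences not significant
```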
7. Conclusion and future work
In this paper, we have presented a benchmark dataset of Bengali sarcastic comments on Facebook to advance work on Bengali, a low-resource language. We then demonstrated a thorough and comprehensive strategy utilizing different machine learning, deep learning, and transfer learning models. This is an attempt to contribute to the discipline of sentiment analysis in the Bengali language domain, benefiting consumer research, opinion mining, branding, and so on. In the future, we wish to improve our work by increasing the size of the Ben-Sarc dataset with data from various social media sources like YouTube and Twitter, and with comments from newspapers or product websites. Besides, emojis and emoticons play a vital role in articulating the actual connotation of a comment on social media, so we will consider emojis and emoticons along with the text.