Stance Detection
Fine-tuning Large Language Models for argument stance detection in unseen domains
Stance detection is the task of determining whether an argument is in favor of, against, or neutral towards a given topic. This has significant applications in social media analysis, misinformation detection, and political discourse understanding. Our work explores how well Large Language Models (LLMs) can be fine-tuned for this task and, importantly, how well they generalize to unseen datasets.
This project is part of CommPass, a larger initiative aimed at creating awareness about media polarity by providing readers with visualizations showing where content sits in the “space” of media events like the Russia-Ukraine war or COVID-19.
Methods
Models and Fine-Tuning Approach
We experimented with three LLMs:
- Mistral-7B - 7 billion parameters, with architectural features such as Grouped-Query Attention and Sliding-Window Attention
- Llama-2-7B - 7 billion parameters
- Phi-1.5 - A smaller 1.3 billion parameter model trained primarily on textbook-quality data
Rather than fine-tuning all parameters (which would be computationally expensive), we used Low-Rank Adaptation (LoRA$^{[1]}$) - a parameter-efficient technique that inserts trainable rank decomposition matrices into selected layers while freezing the pre-trained weights. This dramatically reduces the number of trainable parameters while maintaining performance.
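To make the setup concrete, here is a minimal sketch of how LoRA adapters can be attached with Hugging Face's `peft` library. The rank, scaling factor, and target modules shown are illustrative assumptions, not our exact training configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# Load the frozen base model (Mistral-7B shown; Llama-2-7B and Phi-1.5 work the same way).
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# LoRA injects small trainable low-rank matrices into selected projections
# while the original weights stay frozen. These hyperparameters are illustrative.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                # rank of the low-rank update
    lora_alpha=32,                       # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()       # typically well under 1% of all parameters
```

Only the injected adapter weights receive gradients, which is what keeps fine-tuning 7-billion-parameter models feasible on modest hardware.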
Datasets
We trained and evaluated on two distinct datasets to test generalization:
- SemEval2016 - Twitter data focusing on six targets (Abortion, Atheism, Climate Change, Feminist Movement, Hillary Clinton, Donald Trump) with three labels: Favor, Neutral, Against
- IBM-Debater - Claims and evidence from Wikipedia articles covering 33 controversial topics, with only two labels: PRO and CON
A key difference: SemEval uses short targets (1-2 words) while IBM-Debater uses complete sentences.
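To fine-tune a single causal LLM on both formats, each example can be cast into a shared instruction-style prompt. The template, example texts, and label strings below are a hypothetical sketch, not the exact prompt or data we used.

```python
# Hypothetical prompt template; only the label set changes between datasets
# (Favor/Neutral/Against for SemEval, PRO/CON for IBM-Debater).
PROMPT = (
    "Determine the stance of the argument towards the target.\n"
    "Target: {target}\n"
    "Argument: {text}\n"
    "Stance:"
)

semeval_example = {
    "target": "Climate Change",                                  # short 1-2 word target
    "text": "We cannot keep ignoring what the science tells us.",
    "label": "Favor",
}

ibm_example = {
    "target": "We should abandon the use of school uniforms.",   # full-sentence topic
    "text": "Uniforms suppress students' individuality.",
    "label": "PRO",
}

def to_training_text(example: dict) -> str:
    """Concatenate the filled-in prompt and the gold label into one training string."""
    return PROMPT.format(**example) + " " + example["label"]
```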
Experiments
Finding the Right LoRA Rank
We tested LoRA ranks from 1 to 64 on the full SemEval dataset. The results showed that Mistral consistently outperformed Llama and Phi, but interestingly, there was no clear trend with rank size - lower ranks performed just as well as higher ones.
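In code, the sweep simply repeats the same fine-tuning recipe while varying the rank `r`; the rank grid and the alpha heuristic below are assumptions for illustration.

```python
from peft import LoraConfig, TaskType

# Sweep over LoRA ranks; everything else in the training recipe stays fixed.
for rank in (1, 2, 4, 8, 16, 32, 64):
    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=rank,
        lora_alpha=2 * rank,   # scaling alpha with the rank is a common heuristic (assumption)
        target_modules=["q_proj", "v_proj"],
    )
    # get_peft_model(base_model, config) plus the training/evaluation loop would go here,
    # identical to the LoRA setup sketched above.
    print(f"prepared LoRA config with rank {rank}")
```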
Low-Data Regimes
We tested how well models perform when fine-tuned on limited data (1%, 10%, and 50% of the training set). Mistral again proved superior, especially in low-data scenarios. We found that rank choice depends on data volume - rank 1 works better with less data, while rank 8 improves with more data (likely because higher ranks overfit small datasets).
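Simulating these regimes is a matter of subsampling the training split before fine-tuning. The sketch below assumes the data is available as a CSV and uses the Hugging Face `datasets` library; the file path is a placeholder.

```python
from datasets import load_dataset

# Placeholder path; the real SemEval training data would be loaded here.
train = load_dataset("csv", data_files={"train": "path/to/semeval2016_train.csv"})["train"]

for fraction in (0.01, 0.10, 0.50):
    subset = train.shuffle(seed=42).select(range(int(fraction * len(train))))
    print(f"{fraction:.0%}: {len(subset)} training examples")
    # Fine-tune a fresh LoRA adapter (e.g. rank 1 vs. rank 8) on `subset` here.
```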
Main Results
Our best model - Mistral with LoRA rank 16, trained on 70% of both SemEval and IBM-Debater datasets - significantly outperformed all baselines:
Performance Table
| Model | Abortion | Atheism | Climate Change | Feminist Movement | Hillary Clinton | SemEval (avg) | IBM (avg) |
|---|---|---|---|---|---|---|---|
| BERTweet (baseline) | 0.65 | 0.76 | 0.79 | 0.65 | 0.69 | 0.70 | - |
| RoBERTa (baseline) | 0.54 | 0.79 | 0.80 | 0.64 | 0.71 | 0.68 | - |
| StanceBERTa (baseline) | - | - | - | - | - | - | 0.61 |
| Mistral Zero-shot | 0.54 | 0.33 | 0.55 | 0.57 | 0.66 | 0.54 | 0.44 |
| Mistral Fine-tuned (Ours) | 0.71 | 0.73 | 0.84 | 0.76 | 0.80 | 0.76 | 0.92 |
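For completeness, a fine-tuned adapter like this would typically be loaded back onto the frozen base model for prediction roughly as follows; this is a sketch assuming `peft`'s `PeftModel`, and the adapter path and prompt are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Attach the trained LoRA weights; "path/to/adapter" is a placeholder.
model = PeftModel.from_pretrained(base, "path/to/adapter")
model.eval()

prompt = (
    "Determine the stance of the argument towards the target.\n"
    "Target: Feminist Movement\n"
    "Argument: Equal pay should not even be up for debate.\n"
    "Stance:"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=3)

# Decode only the newly generated tokens, i.e. the predicted stance label.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```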
Surprising Findings
- Cross-dataset generalization: Models fine-tuned on SemEval alone generalized remarkably well to IBM-Debater, outperforming the baseline despite never seeing that data format during training.
- Training on both datasets improved neutral class recall on SemEval, even though IBM-Debater has no neutral labels - suggesting the model learned more nuanced representations.
- Fine-tuning on SemEval and extrapolating to IBM-Debater might even lead to better results than fine-tuning on IBM-Debater directly.
References
- [1] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685. https://arxiv.org/abs/2106.09685