Stance Detection

Fine-tuning Large Language Models for argument stance detection in unseen domains

Stance detection is the task of determining whether an argument is in favor of, against, or neutral towards a given topic. This has significant applications in social media analysis, misinformation detection, and political discourse understanding. Our work explores how well Large Language Models (LLMs) can be fine-tuned for this task and, importantly, how well they generalize to unseen datasets.

This project is part of CommPass, a larger initiative aimed at creating awareness about media polarity by providing readers with visualizations showing where content sits in the “space” of media events like the Russia-Ukraine war or COVID-19.

Methods

Models and Fine-Tuning Approach

We experimented with three LLMs:

  • Mistral-7B - 7 billion parameters with advanced features like Grouped Query Attention
  • Llama-2-7B - 7 billion parameters
  • Phi-1.5 - A smaller 1.3 billion parameter model trained primarily on textbook data

Rather than fine-tuning all parameters, which would be computationally expensive, we used Low-Rank Adaptation (LoRA) [1], a parameter-efficient technique that freezes the pre-trained weights and inserts trainable rank-decomposition matrices into selected layers. This cuts the number of trainable parameters to a small fraction of the full model while maintaining performance.
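
To make the parameter saving concrete, here is a minimal pure-Python sketch of the LoRA update from [1]: the frozen weight W0 is augmented by a scaled low-rank product B·A, and only A and B would be trained. The function names, the 4096 x 4096 layer size, and the scaling factor `alpha` are illustrative assumptions, not the project's actual configuration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w0, a, b, x, alpha, rank):
    """Compute h = W0 x + (alpha / rank) * B (A x).

    w0: frozen pre-trained weight, shape (d_out, d_in)
    a:  trainable down-projection, shape (rank, d_in)
    b:  trainable up-projection, shape (d_out, rank)
    """
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Return (weights in the frozen full matrix, trainable LoRA weights)."""
    return d_out * d_in, rank * (d_out + d_in)


if __name__ == "__main__":
    # Assumed layer size for illustration: one square 4096 x 4096 projection.
    for rank in (1, 8, 16, 64):
        full, lora = lora_param_counts(4096, 4096, rank)
        print(f"rank={rank:>2}: {lora:>9,} trainable vs {full:,} frozen "
              f"({100 * lora / full:.3f}%)")
```

When B is initialized to zero, as in the LoRA paper, the adapted layer starts out identical to the pre-trained one, so fine-tuning begins from the original model's behavior.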

Datasets

We trained and evaluated on two distinct datasets to test generalization:

  1. SemEval2016 - Twitter data focusing on six targets (Abortion, Atheism, Climate Change, Feminist Movement, Hillary Clinton, Donald Trump) with three labels: Favor, Neutral, Against

  2. IBM-Debater - Claims and evidence from Wikipedia articles covering 33 controversial topics, with only two labels: PRO and CON

A key difference: SemEval uses short targets (1-2 words) while IBM-Debater uses complete sentences.
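
Training on both datasets requires a shared label space and input format. The sketch below shows one way this could be done. The mapping of PRO/CON onto Favor/Against, the `StanceExample` type, and the prompt template are all assumptions made for illustration, since the label reconciliation used in the project is not described here.

```python
from dataclasses import dataclass

# Assumed shared label space; IBM-Debater has no neutral class.
LABEL_MAP = {
    "semeval": {"FAVOR": "favor", "AGAINST": "against", "NEUTRAL": "neutral"},
    "ibm": {"PRO": "favor", "CON": "against"},
}


@dataclass(frozen=True)
class StanceExample:
    dataset: str  # "semeval" or "ibm"
    target: str   # short target (SemEval) or full-sentence topic (IBM-Debater)
    text: str     # tweet or claim
    label: str    # dataset-specific label


def to_prompt(example):
    """Render an example as a (prompt, unified_label) pair."""
    label = LABEL_MAP[example.dataset][example.label.upper()]
    prompt = (
        f"Target: {example.target}\n"
        f"Text: {example.text}\n"
        "Stance (favor, against, or neutral):"
    )
    return prompt, label
```

Using one template for both datasets means the short SemEval targets and the full-sentence IBM-Debater topics occupy the same slot in the input, which is the format difference the model has to bridge.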

Experiments

Finding the Right LoRA Rank

We tested LoRA ranks from 1 to 64 on the full SemEval dataset. Mistral consistently outperformed Llama and Phi, but rank itself showed no clear effect: lower ranks performed just as well as higher ones.

Assessing the role of LoRA rank: no significant differences between lower and higher ranks across models.

Low-Data Regimes

We tested how well the models perform when fine-tuned on limited data (1%, 10%, and 50% of the training set). Mistral again proved superior, especially in low-data scenarios. The best rank depends on data volume: rank 1 works better with less data, while rank 8 improves as more data becomes available, likely because higher ranks overfit small datasets.

Fine-tuning LLMs in different data regimes shows Mistral's robustness even with limited training data.
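
The low-data subsets can be drawn in several ways. The following is a minimal sketch of one option, a seeded stratified subsample that preserves label proportions. Stratification, the function name, and the choice to keep at least one example per label are assumptions, not a description of how the experiments were actually sampled.

```python
import random
from collections import defaultdict


def stratified_subsample(examples, fraction, seed=0):
    """Keep roughly `fraction` of `examples`, preserving label proportions.

    `examples` is a list of (text, label) pairs. At least one example per
    label is kept so that no class disappears at very small fractions.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        keep = max(1, round(fraction * len(group)))
        subset.extend(rng.sample(group, keep))
    rng.shuffle(subset)
    return subset
```

Calling it with `fraction` set to 0.01, 0.1, and 0.5 would produce subsets matching the three regimes above. Fixing the seed keeps the subsets identical across models and ranks, so differences in score are not caused by different samples.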

Main Results

Our best model - Mistral with LoRA rank 16, trained on 70% of both the SemEval and IBM-Debater datasets - outperformed the baselines on both dataset averages and on every SemEval target except Atheism:

F1-scores of Mistral trained on different dataset combinations. Models with rank 1 used 10% of training data, while rank 8 and 16 used 70%.

Performance Table

| Model | Abortion | Atheism | Climate Change | Feminist Movement | Hillary Clinton | SemEval (avg) | IBM (avg) |
|---|---|---|---|---|---|---|---|
| BERTweet (baseline) | 0.65 | 0.76 | 0.79 | 0.65 | 0.69 | 0.70 | - |
| RoBERTa (baseline) | 0.54 | 0.79 | 0.80 | 0.64 | 0.71 | 0.68 | - |
| StanceBERTa (baseline) | - | - | - | - | - | - | 0.61 |
| Mistral Zero-shot | 0.54 | 0.33 | 0.55 | 0.57 | 0.66 | 0.54 | 0.44 |
| Mistral Fine-tuned (Ours) | 0.71 | 0.73 | 0.84 | 0.76 | 0.80 | 0.76 | 0.92 |
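
To read the table at a glance, the short script below computes the gap between the fine-tuned model and the best available baseline in each column. The scores are copied from the table. The helper name and data layout are just for illustration.

```python
# F1-scores copied from the table above; None marks a missing entry ("-").
COLUMNS = ["Abortion", "Atheism", "Climate Change", "Feminist Movement",
           "Hillary Clinton", "SemEval (avg)", "IBM (avg)"]
BASELINES = {
    "BERTweet": [0.65, 0.76, 0.79, 0.65, 0.69, 0.70, None],
    "RoBERTa": [0.54, 0.79, 0.80, 0.64, 0.71, 0.68, None],
    "StanceBERTa": [None, None, None, None, None, None, 0.61],
}
FINE_TUNED = [0.71, 0.73, 0.84, 0.76, 0.80, 0.76, 0.92]


def gain_over_best_baseline():
    """Return {column: fine-tuned F1 minus the best available baseline F1}."""
    gains = {}
    for i, column in enumerate(COLUMNS):
        best = max(s[i] for s in BASELINES.values() if s[i] is not None)
        gains[column] = round(FINE_TUNED[i] - best, 2)
    return gains


if __name__ == "__main__":
    for column, gain in gain_over_best_baseline().items():
        print(f"{column:<18} {gain:+.2f}")
```

The gain is positive in every column except Atheism (-0.06) and is largest on IBM-Debater (+0.31).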

Surprising Findings

  1. Cross-dataset generalization: Models fine-tuned on SemEval alone generalized remarkably well to IBM-Debater, outperforming the StanceBERTa baseline despite never seeing that data format during training.

  2. Training on both datasets improved neutral class recall on SemEval, even though IBM-Debater has no neutral labels - suggesting the model learned more nuanced representations.

  3. Fine-tuning on SemEval and transferring to IBM-Debater may yield better results than fine-tuning directly on IBM-Debater alone.

Confusion matrices showing how training dataset combinations affect classification performance on SemEval test set.
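
For reference, per-class recall, the quantity discussed in finding 2, can be read off a confusion matrix as the diagonal entry divided by its row sum. The sketch below assumes rows are true labels and columns are predictions. The counts are toy numbers chosen for illustration and are not the project's results.

```python
LABELS = ["favor", "neutral", "against"]


def per_class_recall(confusion):
    """Recall per class from a confusion matrix.

    confusion[i][j] = number of examples with true label i predicted as j.
    """
    recalls = {}
    for i, label in enumerate(LABELS):
        support = sum(confusion[i])  # all examples whose true label is i
        recalls[label] = confusion[i][i] / support if support else 0.0
    return recalls


# Toy counts for illustration only; these are NOT the project's results.
toy_confusion = [
    [40, 5, 5],    # true favor
    [10, 20, 10],  # true neutral
    [5, 5, 90],    # true against
]

if __name__ == "__main__":
    print(per_class_recall(toy_confusion))
```

In this toy matrix the neutral row is the weakest: half of the neutral examples are misclassified as favor or against. Higher neutral recall would show up as more mass on that row's diagonal entry.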

References

  1. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685