Dr Carolina Scarton

BSc, MSc, PhD

School of Computer Science

Senior Lecturer in Natural Language Processing

Research Innovation Grant Support Lead

Member of the Natural Language Processing research group

c.scarton@sheffield.ac.uk
+44 114 222 1892

Regent Court (CS)

Full contact details

Dr Carolina Scarton
School of Computer Science
Regent Court (CS)
211 Portobello
Sheffield
S1 4DP

Profile

Carolina Scarton is a Senior Lecturer in Natural Language Processing at the Department of Computer Science, University of Sheffield, UK. She is a member of the Natural Language Processing group and part of the GATE team.

Previously, she worked as an Academic Fellow (from September 2019 to November 2021) and as a Research Associate for the WeVerify (from March 2019 to August 2019) and SIMPATICO (from July 2016 to February 2019) European projects.

Qualifications

In 2017, she was awarded a PhD degree in Computer Science from the University of Sheffield, under the supervision of Professor Lucia Specia. Her PhD was funded by the EXPERT project (a Marie Curie ITN network).

She also holds an MSc and a BSc from the University of São Paulo, Brazil (awarded in 2013).

Her MSc supervisor was Dr Sandra Aluísio, and she was a member of the Interinstitutional Center for Computational Linguistics (NILC). She has been the Secretary of the European Association for Machine Translation (EAMT) since 2018.

Research interests

Dr Scarton's research area is Natural Language Processing (NLP). She is particularly interested in text adaptation, machine translation, online misinformation detection and verification, evaluation of NLP task outputs, NLP applied to healthcare and robotics, and dialog systems.

Publications

Books

Pinheiro V, Gamallo P, Amaro R, Scarton C, Batista F, Silva D, Magro C & Pinto H (2022) Preface.
Specia L, Scarton C, Paetzold GH & Hirst G (2018) Quality Estimation for Machine Translation.
Specia L, Scarton C & Paetzold GH (2018) Quality Estimation for Machine Translation. Springer International Publishing.

Journal articles

Leite JA, Razuvayevskaya O, Bontcheva K & Scarton C (2026) LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems. CoRR, abs/2601.16890.
Zareie A, Bakir ME, Greenwood MA, Bontcheva K & Scarton C (2025) Identifying coordination in online social networks through anomalous sharing behaviour. Online Social Networks and Media, 50, 100341.
Moreira DAB, Ferreira AI, Silva J, Santos GOD, Bonil G, Gondim JM, Santos VBD, Maia HA, Hashiguti ST, Silva NFFD, Scarton C et al (2025) CACARA: Cross-Modal Alignment Leveraging a Text-Centric Approach for Cost-Effective Multimodal and Multilingual Learning. CoRR, abs/2512.00496.
Li Y, Zhao Z & Scarton C (2025) It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs. CoRR, abs/2508.19089.
He W, Vieira TK, Garcia M, Scarton C, Idiart M & Villavicencio A (2025) Investigating idiomaticity in word representations. Computational Linguistics, 51(2), 505-555.
Li Y, Vasilakes J, Zhao Z & Scarton C (2025) SCRum-9: Multilingual Stance Classification over Rumours on Social Media. CoRR, abs/2505.18916.
Singh I, Scarton C & Bontcheva K (2025) GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification. CoRR, abs/2505.22867.
Leite JA, Razuvayevskaya O, Bontcheva K & Scarton C (2025) Weakly supervised veracity classification with LLM-predicted credibility signals. EPJ Data Science, 14.
Goldsack T, Scarton C & Lin C (2025) Leveraging Large Language Models for Zero-shot Lay Summarisation in Biomedicine and Beyond. CoRR, abs/2501.05224.
Leite JA, Razuvayevskaya O, Bontcheva K & Scarton C (2024) EUvsDisinfo: a dataset for multilingual detection of pro-Kremlin disinformation in news articles. CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 5380-5384.
Berto MVV, Freitas BL, Scarton C, Machado-Neto JA & Almeida TA (2024) Accelerating discoveries in medicine using distributed vector representations of words. Expert Systems with Applications, 250, 123566.
Razuvayevskaya O, Wu B, Leite JA, Heppell F, Srba I, Scarton C, Bontcheva K & Song X (2024) Comparison between parameter-efficient techniques and full fine-tuning: a case study on multilingual news article classification. PLoS ONE, 19(5).
Mu Y, Wu BP, Thorne W, Robinson A, Aletras N, Scarton C, Bontcheva K & Song X (2024) Navigating prompt complexity for zero-shot classification: a study of large language models in computational social science. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 12074-12086.
Scarton C, Oakley C, Prescott C, Wright J, Bayliss C, Wrigley S & Song X (2024) Message from the Organising Committee. Proceedings of the 25th Annual Conference of the European Association for Machine Translation, EAMT 2024, 2, iv-v.
He W, Idiart M, Scarton C & Villavicencio A (2024) Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss. Findings of the Association for Computational Linguistics ACL 2024, 12473-12485.
Gow-Smith E, Phelps D, Tayyar Madabushi H, Scarton C & Villavicencio A (2024) Word Boundary Information Isn’t Useful for Encoder Language Models. Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024), 118-135.
Zhang Z, Goldsack T, Scarton C & Lin C (2024) ATLAS: Improving Lay Summarisation with Attribute-based Control. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 337-345.
Vincent S, Prescott C, Bayliss C, Oakley C & Scarton C (2024) A Case Study on Contextual Machine Translation in a Professional Scenario of Subtitling. Proceedings of the 25th Annual Conference of the European Association for Machine Translation, EAMT 2024, 1, 561-572.
Li Y, Zhao Z & Scarton C (2024) Label Set Optimization via Activation Distribution Kurtosis for Zero-shot Classification with Generative Models. CoRR, abs/2410.19195.
Scarton C, Prescott C, Bayliss C, Oakley C, Wright J, Wrigley S & Song X (2024) Message from the Organising Committee. Proceedings of the 25th Annual Conference of the European Association for Machine Translation, EAMT 2024, 1, iv-v.
Leal SE, Duran MS, Scarton CE, Hartmann NS & Aluísio SM (2024) NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese. Language Resources and Evaluation, 58(1), 73-110.
Singh I, Scarton C & Bontcheva K (2023) UTDRM: unsupervised method for training debunked-narrative retrieval models. EPJ Data Science, 12(1).
Wu B, Li Y, Mu Y, Scarton C, Bontcheva K & Song X (2023) Don’t waste a single annotation: improving single-label classifiers through soft labels. Findings of the Association for Computational Linguistics: EMNLP 2023, 5347-5355.
Li Y, Scarton C, Song X & Bontcheva K (2023) Classifying COVID-19 vaccine narratives. Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, 648-657.
Mu Y, Jin M, Grimshaw C, Scarton C, Bontcheva K & Song X (2023) VaxxHesitancy: A dataset for studying hesitancy towards COVID-19 vaccination on Twitter. Proceedings of the International AAAI Conference on Web and Social Media, 17(1), 1052-1062.
Goldsack T, Zhang Z, Tang C, Scarton C & Lin C (2023) Enhancing Biomedical Lay Summarisation with External Knowledge Graphs. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 8016-8032.
Vincent ST, Flynn R & Scarton C (2023) MTCue: Learning Zero-Shot Control of Extra-Textual Attributes by Leveraging Unstructured Context in Neural Machine Translation. CoRR, abs/2305.15904.
Li Y & Scarton C (2023) Evaluating the Role of Target Arguments in Rumour Stance Classification. CoRR, abs/2303.12665.
Vincent ST, Sumner R, Dowek A, Blundell C, Preston E, Bayliss C, Oakley C & Scarton C (2023) Personalised Language Modelling of Screen Characters Using Rich Metadata Annotations. CoRR, abs/2303.16618.
Singh I, Scarton C & Bontcheva K (2021) Multistage BiCross encoder for multilingual access to COVID-19 health information. PLoS ONE, 16(9).
Alva-Manchego F, Scarton C & Specia L (2021) The (un)suitability of automatic evaluation metrics for text simplification. Computational Linguistics, 47(4), 861-889.
Mejova Y, Petrocchi M & Scarton C (2021) Special Issue on Disinformation, Hoaxes and Propaganda within Online Social Networks and Media. Online Social Networks and Media, 23, 100132.
Leite JA, Silva DF, Bontcheva K & Scarton C (2020) Toxic language detection in social media for Brazilian Portuguese: new dataset and multilingual analysis. Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 914-924.
Scarton C, Silva DF & Bontcheva K (2020) Measuring what counts: the case of rumour stance classification. Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 925-932.
Alva-Manchego F, Scarton C & Specia L (2020) Data-driven sentence simplification: Survey and benchmark. Computational Linguistics, 46(1), 135-187.
Scarton C (2020) Horacio Saggion, Automatic Text Simplification. Synthesis lectures on human language technologies, April 2017. Natural Language Engineering, 26, 489-492.
Toledo CM, Cunha A, Scarton C & Aluísio S (2014) Automatic classification of written descriptions by healthy adults: An overview of the application of natural language processing and machine learning techniques to clinical discourse analysis. Dement Neuropsychol, 8(3), 227-235.
Srba I, Razuvayevskaya O, Leite JA, Moro R, Schlicht IB, Tonelli S, García FM, Lottmann SB, Teyssou D, Porcellini V, Scarton C et al () A Survey on Automatic Credibility Assessment Using Textual Credibility Signals in the Era of Large Language Models. ACM Transactions on Intelligent Systems and Technology.
Haouari F, Scarton C, Faggiani N, Nikolaidis N, Kotseva B, Abu Farha I, Linge J & Bontcheva K () UKElectionNarratives: A Dataset of Misleading Narratives Surrounding Recent UK General Elections. Proceedings of the International AAAI Conference on Web and Social Media, 19, 2477-2495.
He W, Vieira TK, Gonzalez MG, Scarton C, Idiart M & Villavicencio A () Finding Idiomaticity in Word Representations. Computational Linguistics.
Leite JA, Scarton C & Silva DF () Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks. Proceedings of the Conference Recent Advances in Natural Language Processing - Large Language Models for Natural Language Processings, 631-640.
Scarton C & Specia L () A Quantitative Analysis of Discourse Phenomena in Machine Translation. Discours(16).

Book chapters

Zareie A, Bontcheva K & Scarton C (2025) A lightweight approach for user and keyword classification in controversial topics. In Aiello LM, Chakraborty T & Gaito S (Eds.), Lecture Notes in Computer Science (pp. 243-253). Springer Nature Switzerland.

Book reviews

Scarton C (2019) Horacio Saggion, automatic text simplification. Synthesis lectures on human language technologies, April 2017. 137 pages, ISBN:1627058680 9781627058681. Natural Language Engineering, 26(4), 489-492.

Conference proceedings

Leite JA, Razuvayevskaya O, Scarton C & Bontcheva K (2025) A Cross-Domain Study of the Use of Persuasion Techniques in Online Disinformation. Companion Proceedings of the ACM on Web Conference 2025 (pp 1100-1103)
Singh I, Scarton C, Song X & Bontcheva K (2025) Breaking Language Barriers with MMTweets: Advancing Cross-Lingual Debunked Narrative Retrieval for Fact-Checking. CEUR Workshop Proceedings, Vol. 4070 (pp 1-19)
Haouari F, Scarton C, Faggiani N, Nikolaidis N, Kotseva B, Farha IA, Linge JP & Bontcheva K (2025) UKElectionNarratives: A Dataset of Misleading Narratives Surrounding Recent UK General Elections. ICWSM (pp 2477-2495)
Li Y, Zhao Z & Scarton C (2025) Label Set Optimization via Activation Distribution Kurtosis for Zero-Shot Classification with Generative Models. EMNLP (pp 31724-31741)
Li Y, Zhao Z & Scarton C (2025) It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs. EMNLP (pp 29544-29559)
Vasilakes J, Zhao Z, Vykopal I, Gregor M, Hyben M & Scarton C (2024) ExU: AI models for examining multilingual disinformation narratives and understanding their spread. Proceedings of the 25th Annual Conference of the European Association for Machine Translation, EAMT 2024, Vol. 2 (pp 39-40). Sheffield, United Kingdom, 24 June 2024.
Zareie A, Bontcheva K & Scarton C (2024) A Lightweight Approach for User and Keyword Classification in Controversial Topics. ASONAM (2), Vol. 15212 (pp 243-253)
Vincent S, Dowek A, Sumner R, Prescott C, Preston E, Bayliss C, Oakley C & Scarton C (2024) Reference-less Analysis of Context Specificity in Translation with Personalised Language Models. 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Main Conference Proceedings (pp 13769-13784)
Mu Y, Wu BP, Thorne W, Robinson A, Aletras N, Scarton C, Bontcheva K & Song X (2024) Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science. LREC/COLING (pp 12074-12086)
Li Y & Scarton C (2024) Can We Identify Stance Without Target Arguments? A Study for Rumour Stance Classification. 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Main Conference Proceedings (pp 2844-2851)
He W, Idiart M, Scarton C & Villavicencio A (2024) Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss. ACL (Findings) (pp 12473-12485)
Goldsack T, Scarton C, Shardlow M & Lin C (2024) Overview of the BioLaySumm 2024 Shared Task on the Lay Summarization of Biomedical Research Articles. Proceedings of the 23rd Workshop on Biomedical Natural Language Processing (pp 122-131), August 2024.
Zhang Z, Goldsack T, Scarton C & Lin C (2024) ATLAS: Improving Lay Summarisation with Attribute-based Control. ACL (Short Papers) (pp 337-345)
Goldsack T, Scarton C, Shardlow M & Lin C (2024) Overview of the BioLaySumm 2024 Shared Task on the Lay Summarization of Biomedical Research Articles. BioNLP@ACL (pp 122-131)
(2024) Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2), EAMT 2024, Sheffield, UK, June 24-27, 2024. EAMT (2)
Spillane B, Scarton C, Moro R, Ivanov P, Tagarev A, Smiko J, Farha IA, Munnelly G, Uhlárik F & Heppell F (2024) Multilinguality in the VIGILANT project. Proceedings of the 25th Annual Conference of the European Association for Machine Translation, EAMT 2024, Vol. 2 (pp 41-42)
Vincent ST, Prescott C, Bayliss C, Oakley C & Scarton C (2024) A Case Study on Contextual Machine Translation in a Professional Scenario of Subtitling. EAMT (1) (pp 561-572)
(2024) Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), EAMT 2024, Sheffield, UK, June 24-27, 2024. EAMT (1)
Leite JA, Razuvayevskaya O, Bontcheva K & Scarton C (2024) EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles. CIKM (pp 5380-5384)
Jiang Y, Song X, Scarton C, Singh I, Aker A & Bontcheva K (2023) Categorising fine-to-coarse grained misinformation: an empirical study of the COVID-19 infodemic. Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing (pp 556-567). Varna, Bulgaria, 8 September 2023.
Goldsack T, Zhang Z, Lin C & Scarton C (2023) Domain-Driven and Discourse-Guided Scientific Summarisation (pp 361-376)
Wu B, Razuvayevskaya O, Heppell F, Leite JA, Scarton C, Bontcheva K & Song X (2023) SheffieldVeraAI at SemEval-2023 Task 3: Mono and Multilingual Approaches for News Genre, Topic and Persuasion Technique Classification. Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023) (pp 1995-2008), July 2023.
(2023) Proceedings of the 24th Annual Conference of the European Association for Machine Translation, EAMT 2023, Tampere, Finland, 12-15 June 2023. EAMT
Vincent S, Flynn R & Scarton C (2023) MTCue: learning zero-shot control of extra-textual attributes by leveraging unstructured context in neural machine translation. Findings of the Association for Computational Linguistics: ACL 2023 (pp 8210-8226). Toronto, Canada, 9 July 2023.
Goldsack T, Luo Z, Xie Q, Scarton C, Shardlow M, Ananiadou S & Lin C (2023) BioLaySumm 2023 Shared Task: Lay Summarisation of Biomedical Research Articles. The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks (pp 468-477), July 2023.
Goldsack T, Luo Z, Xie Q, Scarton C, Shardlow M, Ananiadou S & Lin C (2023) Overview of the BioLaySumm 2023 Shared Task on Lay Summarization of Biomedical Research Articles. Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp 468-477)
Heppell F, Bontcheva K & Scarton C (2023) Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp 5729-5741), December 2023.
Mu Y, Jiang Y, Heppell F, Singh I, Scarton C, Bontcheva K & Song X (2023) A Large-Scale Comparative Study of Accurate COVID-19 Information versus Misinformation. CoRR, Vol. abs/2304.04811
Leite JA, Scarton C & Silva DF (2023) Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks. RANLP (pp 631-640)
Li Y, Scarton C, Song X & Bontcheva K (2023) Classifying COVID-19 Vaccine Narratives. RANLP (pp 648-657)
Goldsack T, Zhang Z, Tang C, Scarton C & Lin C (2023) Enhancing Biomedical Lay Summarisation with External Knowledge Graphs. EMNLP (pp 8016-8032)
Wu B, Li Y, Mu Y, Scarton C, Bontcheva K & Song X (2023) Don't waste a single annotation: improving single-label classifiers through soft labels. EMNLP (Findings) (pp 5347-5355)
Singh I, Bontcheva K, Song X & Scarton C (2022) Comparative analysis of engagement, themes, and causality of Ukraine-related debunks and disinformation. Social Informatics: 13th International Conference, SocInfo 2022, Glasgow, UK, October 19–21, 2022, Proceedings (pp 128-143). Glasgow, UK, 19 October 2022.
Madabushi HT, Gow-Smith E, Garcia M, Scarton C, Idiart M & Villavicencio A (2022) SemEval-2022 Task 2: multilingual idiomaticity detection and sentence embedding. Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) (pp 107-121). Seattle, WA, USA, 14 July 2022.
Phelps D, Fan X-R, Gow-Smith E, Madabushi HT, Scarton C & Villavicencio A (2022) Sample efficient approaches for idiomaticity detection. Proceedings of The 18th Workshop on Multiword Expressions @LREC2022 (pp 105-111). Marseille, France, 20 June 2022.
Vincent ST, Barrault L & Scarton C (2022) Controlling Extra-Textual Attributes about Dialogue Participants: A Case Study of English-to-Polish Neural Machine Translation. EAMT 2022: Proceedings of the 23rd Annual Conference of the European Association for Machine Translation (pp 121-130)
Vincent S, Barrault L & Scarton C (2022) Controlling Formality in Low-Resource NMT with Domain Adaptation and Re-Ranking: SLT-CDT-UoS at IWSLT2022. Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022) (pp 341-350), May 2022.
Singh I, Li Y, Thong M & Scarton C (2022) GateNLP-UShef at SemEval-2022 Task 8: Entity-Enriched Siamese Transformer for Multilingual News Article Similarity. Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) (pp 1121-1128), July 2022.
(2022) Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, EAMT 2022, Ghent, Belgium, June 1-3, 2022. EAMT
Madabushi HT, Gow-Smith E, García M, Scarton C, Idiart M & Villavicencio A (2022) SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding. SemEval@NAACL (pp 107-121)
Gow-Smith E, Tayyar Madabushi H, Scarton C & Villavicencio A (2022) Improving Tokenisation by Alternative Treatment of Spaces. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (pp 11430-11443), December 2022.
Goldsack T, Zhang Z, Lin C & Scarton C (2022) Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (pp 10589-10604), December 2022.
Singh I, Bontcheva K, Song X & Scarton C (2022) Comparative Analysis of Engagement, Themes, and Causality of Ukraine-Related Debunks and Disinformation. SocInfo, Vol. 13618 (pp 128-143)
(2022) Computational Processing of the Portuguese Language - 15th International Conference, PROPOR 2022, Fortaleza, Brazil, March 21-23, 2022, Proceedings. PROPOR, Vol. 13208
Singh I, Li Y, Thong M & Scarton C (2022) GateNLP-UShef at SemEval-2022 Task 8: Entity-Enriched Siamese Transformer for Multilingual News Article Similarity. SemEval@NAACL (pp 1121-1128)
Goldsack T, Zhang Z, Lin C & Scarton C (2022) Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature. EMNLP (pp 10589-10604)
Garcia M, Kramer Vieira T, Scarton C, Idiart MAP & Villavicencio A (2021) Assessing idiomaticity representations in vector models with a noun compound dataset labeled at type and token levels. Proceedings of ACL-IJCNLP 2021 (pp 2730-2741). Bangkok, Thailand, 1 August 2021.
Garcia M, Vieira TK, Scarton C, Idiart M & Villavicencio A (2021) Probing for idiomaticity in vector space models. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (pp 3551-3564). Virtual conference, 19 April 2021.
Tayyar Madabushi H, Gow-Smith E, Scarton C & Villavicencio A (2021) AStitchInLanguageModels: dataset and methods for the exploration of idiomaticity in pre-trained language models. Findings of the Association for Computational Linguistics: EMNLP 2021 (pp 3464-3477). Punta Cana, Dominican Republic, 7 November 2021.
Madabushi HT, Gow-Smith E, Scarton C & Villavicencio A (2021) AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models. EMNLP (Findings) (pp 3464-3477)
Scarton C & Li Y (2021) Cross-lingual Rumour Stance Classification: a First Study with BERT and Machine Translation. TTO (pp 50-59)
Scarton C, Madhyastha P & Specia L (2020) Deciding when, how and for whom to simplify. ECAI 2020, Vol. 325 (pp 2172-2179). Santiago de Compostela, Spain, 29 August 2020.
Wick-Pedro G, Santos RLS, Vale OA, Pardo TAS, Bontcheva K & Scarton C (2020) Linguistic analysis model for monitoring user reaction on satirical news for Brazilian Portuguese. Computational Processing of the Portuguese Language, Vol. 12037 (pp 313-320). Evora, Portugal, 2 March 2020.
Santos RLS, Wick-Pedro G, Leal S, Vale OA, Pardo TAS, Bontcheva K & Scarton C (2020) Measuring the impact of readability features in fake news detection. LREC 2020: 12th International Conference on Language Resources and Evaluation, Conference Proceedings (pp 1404-1413)
Alva-Manchego F, Martin L, Bordes A, Scarton C, Sagot B & Specia L (2020) ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp 4668-4679)
Leite JA, Silva D, Bontcheva K & Scarton C (2020) Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis. Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (pp 914-924), December 2020.
Scarton C, Silva D & Bontcheva K (2020) Measuring What Counts: The Case of Rumour Stance Classification. Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (pp 925-932), December 2020.
Alva-Manchego F, Martin L, Scarton C & Specia L (2019) EASSE: easier automatic sentence simplification evaluation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations (pp 49-54). Hong Kong, China, 3 November 2019.
Alva-Manchego F, Scarton C & Specia L (2019) Cross-Sentence Transformations in Text Simplification. WNLP@ACL (pp 181-184)
Alva-Manchego F, Martin L, Scarton C & Specia L (2019) EASSE: Easier Automatic Sentence Simplification Evaluation. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019): Proceedings of System Demonstrations (pp 49-54)
Scarton C, Forcada ML, Esplà-Gomis M & Specia L (2019) Estimating post-editing effort: a study on human judgements, task-based and reference-based metrics of MT quality. IWSLT
Forcada ML, Scarton C, Specia L, Haddow B & Birch A (2018) Exploring gap filling as a cheaper alternative to reading comprehension questionnaires when evaluating machine translation for gisting. Proceedings of the Third Conference on Machine Translation, Vol. 1 (pp 192-203). Brussels, Belgium, 31 October 2018.
Forcada ML, Scarton C, Specia L, Haddow B & Birch A (2018) Exploring gap filling as a cheaper alternative to reading comprehension questionnaires when evaluating machine translation for gisting. Proceedings of the Third Conference on Machine Translation: Research Papers (pp 192-203), October 2018.
Ive J, Scarton C, Blain F & Specia L (2018) Sheffield Submissions for the WMT18 Quality Estimation Shared Task. Proceedings of the Third Conference on Machine Translation: Shared Task Papers (pp 794-800), October 2018.
Lala C, Madhyastha PS, Scarton C & Specia L (2018) Sheffield Submissions for WMT18 Multimodal Translation Shared Task. Proceedings of the Third Conference on Machine Translation: Shared Task Papers (pp 624-631), October 2018.
Scarton C, Paetzold GH & Specia L (2018) Text simplification from professionally produced corpora. LREC 2018: 11th International Conference on Language Resources and Evaluation (pp 3504-3510)
Scarton C, Paetzold GH & Specia L (2018) SimPA: A sentence-level simplification corpus for the public administration domain. LREC 2018: 11th International Conference on Language Resources and Evaluation (pp 4333-4338)
Scarton C & Specia L (2018) Learning Simplifications for Specific Target Audiences. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp 712-718), July 2018.
Lala C, Madhyastha P, Scarton C & Specia L (2018) Sheffield Submissions for WMT18 Multimodal Translation Shared Task. WMT 2018: 3rd Conference on Machine Translation, Proceedings of the Conference, Vol. 2 (pp 624-631)
Ive J, Scarton C, Blain F & Specia L (2018) Sheffield Submissions for the WMT18 Quality Estimation Shared Task. WMT 2018: 3rd Conference on Machine Translation, Proceedings of the Conference, Vol. 2 (pp 794-800)
Blain F, Scarton C & Specia L (2017) Bilexical Embeddings for Quality Estimation. Proceedings of the Second Conference on Machine Translation (pp 545-550), September 2017.
Alva-Manchego F, Bingel J, Paetzold G, Scarton C & Specia L (2017) Learning How to Simplify From Explicit Labeling of Complex-Simplified Text Pairs. IJCNLP(1) (pp 295-305)
Scarton C, Specia L, Aprosio AP, Tonelli S & Wanton TM (2017) MUSST: A Multilingual Syntactic Simplification Tool. 8th International Joint Conference on Natural Language Processing: Proceedings of the IJCNLP 2017 System Demonstrations (pp 25-28)
Graham Y, Ma Q, Baldwin T, Liu Q, Parra C & Scarton C (2017) Improving Evaluation of Document-level Machine Translation Quality Estimation. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers (pp 356-361), April 2017.
Aluísio S, Cunha A & Scarton C (2016) Evaluating Progression of Alzheimer’s Disease by Regression and Classification Methods in a Narrative Language Test in Portuguese (pp 109-114)
Bojar O, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Jimeno Yepes A, Koehn P, Logacheva V, Monz C, Negri M et al (2016) Findings of the 2016 Conference on Machine Translation. Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers (pp 131-198), August 2016.
Scarton C, Paetzold GH & Specia L (2016) Quality estimation for language output applications. COLING 2016: 26th International Conference on Computational Linguistics, Proceedings of COLING 2016 Tutorial Abstracts (pp 14-17)
Scarton C & Specia L (2016) A reading comprehension corpus for machine translation evaluation. Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (pp 3652-3658)
Scarton C, Beck D, Shah K, Sim Smith K & Specia L (2016) Word embeddings and discourse information for Quality Estimation. Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers (pp 831-837), August 2016.
Tan L, Scarton C, Specia L & van Genabith J (2016) SAARSHEFF at SemEval-2016 Task 1: Semantic Textual Similarity with Machine Translation Evaluation Metrics and (eXtreme) Boosted Tree Ensembles. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016) (pp 628-633), June 2016.
Scarton C, Beck D, Shah K, Smith KS & Specia L (2016) Word embeddings and discourse information for Machine Translation Quality Estimation. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Vol. 2 (pp 831-837)
Specia L, Paetzold GH & Scarton C (2015) Multi-level translation quality prediction with QuEst++. Proceedings of ACL-IJCNLP 2015 System Demonstrations (pp 115-120). Beijing, China, 26 July 2015 - 26 July 2015. View this article in WRRO
Scarton C, Zampieri M, Vela M, van Genabith J & Specia L (2015) Searching for context: A study on document-level labels for translation quality estimation. Eamt 2015 Proceedings of the 18th Annual Conference of the European Association for Machine Translation (pp 121-128)
Scarton C, Tan L & Specia L (2015) USHEF and USAAR-USHEF participation in the WMT15 QE shared task. Proceedings of the Tenth Workshop on Statistical Machine Translation (pp 336-341), September 2015 - September 2015.
Tan L, Scarton C, Specia L & van Genabith J (2015) USAAR-SHEFFIELD: Semantic Textual Similarity with Deep Regression and Machine Translation Evaluation Metrics. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015) (pp 85-89), June 2015 - June 2015.
Scarton C (2015) Discourse and Document-level Information for Evaluating Language Output Tasks. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop (pp 118-125), June 2015.
Bojar O, Chatterjee R, Federmann C, Haddow B, Huck M, Hokamp C, Koehn P, Logacheva V, Monz C, Negri M, Post M et al (2015) Findings of the 2015 Workshop on Statistical Machine Translation. Proceedings of the Tenth Workshop on Statistical Machine Translation (pp 1-46), September 2015.
Scarton C, Tan L & Specia L (2015) USHEF and USAAR-USHEF participation in the WMT15 quality estimation shared task. 10th Workshop on Statistical Machine Translation WMT 2015 at the 2015 Conference on Empirical Methods in Natural Language Processing EMNLP 2015 Proceedings (pp 336-341)
Bojar O, Chatterjee R, Federmann C, Haddow B, Hokamp C, Huck M, Logacheva V, Pecina P, Koehn P, Monz C, Negri M et al (2015) Preface. 10th Workshop on Statistical Machine Translation WMT 2015 at the 2015 Conference on Empirical Methods in Natural Language Processing EMNLP 2015 Proceedings (pp III)
Scarton C, Sanches Duran M & Aluísio SM (2014) Using Cross-Linguistic Knowledge to Build VerbNet-Style Lexicons: Results for a (Brazilian) Portuguese VerbNet (pp 149-160)
Scarton C & Specia L (2014) Exploring Consensus in Machine Translation for Quality Estimation. Proceedings of the Ninth Workshop on Statistical Machine Translation (pp 342-347), June 2014.
Scarton C, Sun L, Kipper-Schuler K, Duran MS, Palmer M & Korhonen A (2014) Verb Clustering for Brazilian Portuguese (pp 25-39)
Scarton C & Specia L (2014) Document-level translation quality estimation: Exploring discourse and pseudo-references. Proceedings of the 17th Annual Conference of the European Association for Machine Translation EAMT 2014 (pp 101-108)
Scarton C, Sun L, Kipper-Schuler K, Duran MS, Palmer M & Korhonen A (2014) Verb clustering for Brazilian Portuguese. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 8403 LNCS(PART 1) (pp 25-39)
(2014) Computational Processing of the Portuguese Language
Duran MS, Scarton CE, Aluísio SM & Ramisch C (2013) Identifying Pronominal Verbs: Towards Automatic Disambiguation of the Clitic se in Portuguese. Proceedings of the 9th Workshop on Multiword Expressions MWE 2013 in Conjunction with the 2013 Conference of the North American Chapter of the Association for Computational Linguistics Human Language Technologies NAACL HLT 2013 (pp 93-100)
Scarton C, Gasperin C & Aluisio S (2010) Revisiting the readability assessment of texts in Portuguese. Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, Vol. 6433 LNAI (pp 306-315)
Scarton C, de Oliveira M, Candido A, Gasperin C & Aluísio SM (2010) SIMPLIFICA: A tool for authoring simplified texts in Brazilian Portuguese guided by readability assessments. NAACL HLT 2010 Human Language Technologies 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics Demonstration Session (pp 41-44)
Scarton CE, De Almeida DM & Aluísio SM (2009) Text readability analysis with natural language processing tools: The adaptation of Coh-Metrix metrics for Portuguese. STIL 2009, 7th Brazilian Symposium in Information and Human Language Technology (pp 53-62)

Theses

Scarton CE VerbNet.Br: construção semiautomática de um léxico verbal online e independente de domínio para o português do Brasil.

Preprints

Leite JOA, Razuvayevskaya O, Bontcheva K & Scarton C (2026) LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems, arXiv.
Moreira DAB, Ferreira AI, Silva J, Santos GOD, Bonil G, Gondim JO, Santos MD, Maia H, Hashiguti S, da Silva ND, Scarton C et al (2025) CACARA: Cross-Modal Alignment Leveraging a Text-Centric Approach for Cost-Effective Multimodal and Multilingual Learning, arXiv.
Leite JOA, Arora A, Gargova S, Luz JO, Sampaio G, Roberts I, Scarton C & Bontcheva K (2025) A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation, arXiv.
Li Y, Vasilakes J, Zhao Z & Scarton C (2025) SCRum-9: Multilingual Stance Classification over Rumours on Social Media, arXiv.
Li Y, Zhao Z & Scarton C (2025) Label Set Optimization via Activation Distribution Kurtosis for Zero-shot Classification with Generative Models, arXiv.
Li Y, Zhao Z & Scarton C (2025) It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs, arXiv.
Singh I, Scarton C & Bontcheva K (2025) GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification, arXiv.
Haouari F, Scarton C, Faggiani N, Nikolaidis N, Kotseva B, Farha IA, Linge J & Bontcheva K (2025) UKElectionNarratives: A Dataset of Misleading Narratives Surrounding Recent UK General Elections, arXiv.
Zareie A, Bontcheva K & Scarton C (2025) A Lightweight Approach for User and Keyword Classification in Controversial Topics, arXiv.
Goldsack T, Scarton C & Lin C (2025) Leveraging Large Language Models for Zero-shot Lay Summarisation in Biomedicine and Beyond, arXiv.
Leite JA, Razuvayevskaya O, Bontcheva K & Scarton C (2024) Weakly Supervised Veracity Classification with LLM-Predicted Credibility Signals, Springer Science and Business Media LLC.
He W, Vieira TK, Garcia M, Scarton C, Idiart M & Villavicencio A (2024) Investigating Idiomaticity in Word Representations, arXiv.
Goldsack T, Scarton C, Shardlow M & Lin C (2024) Overview of the BioLaySumm 2024 Shared Task on the Lay Summarization of Biomedical Research Articles, arXiv.
Vincent S, Prescott C, Bayliss C, Oakley C & Scarton C (2024) A Case Study on Contextual Machine Translation in a Professional Scenario of Subtitling, arXiv.
He W, Idiart M, Scarton C & Villavicencio A (2024) Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss, arXiv.
Leite JA, Razuvayevskaya O, Bontcheva K & Scarton C (2024) EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles, arXiv.
Zhang Z, Goldsack T, Scarton C & Lin C (2024) ATLAS: Improving Lay Summarisation with Attribute-based Control, arXiv.
Vasilakes J, Zhao Z, Vykopal I, Gregor M, Hyben M & Scarton C (2024) ExU: AI Models for Examining Multilingual Disinformation Narratives and Understanding their Spread, arXiv.
Gow-Smith E, Phelps D, Madabushi HT, Scarton C & Villavicencio A (2024) Word Boundary Information Isn't Useful for Encoder Language Models, arXiv.
Wu B, Li Y, Mu Y, Scarton C, Bontcheva K & Song X (2023) Don't Waste a Single Annotation: Improving Single-Label Classifiers Through Soft Labels, arXiv.
Goldsack T, Zhang Z, Tang C, Scarton C & Lin C (2023) Enhancing Biomedical Lay Summarisation with External Knowledge Graphs, arXiv.
Heppell F, Bontcheva K & Scarton C (2023) Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study, arXiv.
Goldsack T, Luo Z, Xie Q, Scarton C, Shardlow M, Ananiadou S & Lin C (2023) Overview of the BioLaySumm 2023 Shared Task on Lay Summarization of Biomedical Research Articles, arXiv.
Leite JA, Razuvayevskaya O, Bontcheva K & Scarton C (2023) Weakly Supervised Veracity Classification with LLM-Predicted Credibility Signals, arXiv.
Razuvayevskaya O, Wu B, Leite JA, Heppell F, Srba I, Scarton C, Bontcheva K & Song X (2023) Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification, arXiv.
Singh I, Scarton C, Song X & Bontcheva K (2023) Breaking Language Barriers with MMTweets: Advancing Cross-Lingual Debunked Narrative Retrieval for Fact-Checking, arXiv.
Leite JA, Scarton C & Silva DF (2023) Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks, arXiv.
Vincent S, Flynn R & Scarton C (2023) MTCue: Learning Zero-Shot Control of Extra-Textual Attributes by Leveraging Unstructured Context in Neural Machine Translation, arXiv.
Mu Y, Wu BP, Thorne W, Robinson A, Aletras N, Scarton C, Bontcheva K & Song X (2023) Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science, arXiv.
Mu Y, Jiang Y, Heppell F, Singh I, Scarton C, Bontcheva K & Song X (2023) A Large-Scale Comparative Study of Accurate COVID-19 Information versus Misinformation, arXiv.
Vincent S, Dowek A, Sumner R, Blundell C, Preston E, Bayliss C, Oakley C & Scarton C (2023) Reference-less Analysis of Context Specificity in Translation with Personalised Language Models, arXiv.
Li Y & Scarton C (2023) Can We Identify Stance Without Target Arguments? A Study for Rumour Stance Classification, arXiv.
Wu B, Razuvayevskaya O, Heppell F, Leite JA, Scarton C, Bontcheva K & Song X (2023) SheffieldVeraAI at SemEval-2023 Task 3: Mono and multilingual approaches for news genre, topic and persuasion technique classification, arXiv.
Mu Y, Jin M, Grimshaw C, Scarton C, Bontcheva K & Song X (2023) VaxxHesitancy: A Dataset for Studying Hesitancy towards COVID-19 Vaccination on Twitter, arXiv.
Singh I, Bontcheva K, Song X & Scarton C (2022) Comparative Analysis of Engagement, Themes, and Causality of Ukraine-Related Debunks and Disinformation, arXiv.
Goldsack T, Zhang Z, Lin C & Scarton C (2022) Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature, arXiv.
Li Y, Scarton C, Song X & Bontcheva K (2022) Classifying COVID-19 vaccine narratives, arXiv.
Singh I, Li Y, Thong M & Scarton C (2022) GateNLP-UShef at SemEval-2022 Task 8: Entity-Enriched Siamese Transformer for Multilingual News Article Similarity, arXiv.
Vincent ST, Barrault L & Scarton C (2022) Controlling Formality in Low-Resource NMT with Domain Adaptation and Re-Ranking: SLT-CDT-UoS at IWSLT2022, arXiv.
Vincent ST, Barrault L & Scarton C (2022) Controlling Extra-Textual Attributes about Dialogue Participants -- A Case Study of English-to-Polish Neural Machine Translation, arXiv.
Jiang Y, Song X, Scarton C, Singh I, Aker A & Bontcheva K (2022) Categorising Fine-to-Coarse Grained Misinformation: An Empirical Study of the COVID-19 Infodemic, Research Square Platform LLC.
Singh I, Bontcheva K & Scarton C (2021) The False COVID-19 Narratives That Keep Being Debunked: A Spatiotemporal Analysis, arXiv.
Jiang Y, Song X, Scarton C, Aker A & Bontcheva K (2021) Categorising Fine-to-Coarse Grained Misinformation: An Empirical Study of COVID-19 Infodemic, arXiv.
Singh I, Scarton C & Bontcheva K (2021) Multistage BiCross encoder for multilingual access to COVID-19 health information, arXiv.
Leite JA, Silva DF, Bontcheva K & Scarton C (2020) Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis, arXiv.
Scarton C, Silva DF & Bontcheva K (2020) Measuring What Counts: The case of Rumour Stance Classification, arXiv.
Alva-Manchego F, Martin L, Bordes A, Scarton C, Sagot B & Specia L (2020) ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations, arXiv.
Scarton C, Forcada ML, Esplà-Gomis M & Specia L (2019) Estimating post-editing effort: a study on human judgements, task-based and reference-based metrics of MT quality, arXiv.
Alva-Manchego F, Martin L, Scarton C & Specia L (2019) EASSE: Easier Automatic Sentence Simplification Evaluation, arXiv.
Forcada ML, Scarton C, Specia L, Haddow B & Birch A (2018) Exploring Gap Filling as a Cheaper Alternative to Reading Comprehension Questionnaires when Evaluating Machine Translation for Gisting, arXiv.
Madabushi HT, Gow-Smith E, Scarton C & Villavicencio A () AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models.

Grants

Longitudinal, Multilingual, and Multi-format Investigation and Detection of LLM-Generated Disinformation, EPSRC, 12/2025 - 11/2028, £552,355, as Co-PI
Narrative analysis AI-based models for countering disinformation at scale, Research England, 11/2025 - 03/2026, £14,872, as PI
ExU: AI Models for Examining Multilingual Disinformation Narratives and Understanding their Spread, European Media and Information Fund, 11/2023 - 04/2025, €399,926, as PI
VIGILANT: Equipping Police Authorities with an Ecosystem of Advanced Research Tools to Identify and Investigate Disinformation, Horizon Europe, 11/2022 - 10/2025, £555,747, as PI
vera.ai (Verification assisted by AI), Horizon Europe, 09/2022 - 11/2025, £901,250, as Co-I
XAIvsDisinfo: eXplainable AI Methods for Categorisation and Analysis of COVID-19 Vaccine Disinformation and Online Debates, UKRI, 06/2021 - 03/2023, £288,337, as Co-I
Modeling Idiomaticity in Human and Artificial Language Processing, EPSRC, 12/2020 - 11/2024, £446,163, as Co-PI
Modelling the link between working memory and language deficits in schizophrenia, Royal Society, 12/2020 - 11/2022, £74,000, as Co-PI
