Keynote Lectures

Tanja Samardžić, Language and Space Lab, University of Zurich:

Subword tokenization as a method for discovering and comparing linguistic structures

Subword tokenization is unsupervised surface segmentation of words, applied as a
preprocessing step when text is given as input to neural networks. For example, the
word coworking can be split into co, work, and ing, and each part is assigned a vector
representation (embedding). All pretrained large language models apply some kind of
subword tokenization, but the decisions on how this step should be performed remain
largely arbitrary, with little reference to the structure of words.
A popular algorithm for subword tokenization is Byte-Pair Encoding (BPE), originally a
general-purpose compression algorithm; applied to text, it improves machine
translation and other end-user tasks. Despite its usefulness in language processing,
this method is commonly judged as not linguistically relevant, since its output is hard
to align with any morphological analysis. The misalignment between BPE and linguistic
analysis is puzzling: to compress language data efficiently, BPE needs to find subword
patterns that reduce text redundancy. These patterns might not correspond to usual
morphological analyses, but they are structural elements.
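The core of BPE described above — repeatedly fusing the most frequent adjacent symbol pair, thereby removing redundancy — can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study; the function name `bpe_merges` is hypothetical:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Represent each word type as a tuple of symbols, weighted by its frequency.
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab
```

On a toy corpus such as "low low low lower lowest", the first merges fuse the shared stem (l+o, then lo+w), exactly the kind of recurrent subword pattern that reduces text redundancy.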
In this talk, I will show that a systematic analysis of subword units identified by BPE
across a set of around 50 typologically diverse languages reveals linguistically
relevant patterns. The types of units that have the strongest impact on compression are
an indicator of morphological typology: for languages with richer inflectional
morphology there is a preference for highly productive units, while for languages with
less inflectional morphology, idiosyncratic units are more prominent. The features of
BPE subword units can thus distinguish automatically between different morphological
types of languages using only raw text.
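One simple way to operationalize the productivity of a subword unit is to count how many distinct word types it occurs in — highly productive units recur across many words, idiosyncratic units in few. This is a toy illustration with a hypothetical function name, not the talk's actual measure:

```python
from collections import defaultdict

def unit_productivity(segmented_words):
    """For each subword unit, count the distinct word types it appears in.

    segmented_words: mapping from word type to its list of subword units.
    """
    types = defaultdict(set)
    for word, units in segmented_words.items():
        for unit in units:
            types[unit].add(word)
    return {unit: len(words) for unit, words in types.items()}
```

For example, in {coworking: co+work+ing, working: work+ing, cooking: cook+ing}, the unit ing occurs in three word types and is the most productive.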
By monitoring the outcome of successive compression steps, we can track cross-linguistic
differences in what kinds of redundancy are gradually removed in different languages,
which opens a new possibility for describing and comparing languages. For instance,
the output of BPE allows us to study the relative length of subword units, revisiting the
famous Menzerath–Altmann law on a wide scale. The results of one such analysis show
that the length of subword units identified by BPE tends to be rather evenly
distributed: as the length of words increases, the length of subword units decreases
evenly and not only on average. Cross-linguistic variation in the degree of evenness in
subword units turns out to be a good criterion for deciding what kinds of languages
should be taken into consideration for cross-lingual transfer of pretrained language
models, making quantitative linguistic analysis highly relevant to contemporary
multilingual natural language processing.
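The Menzerath–Altmann-style analysis described above — relating word length to the length of the word's constituent subword units — can be sketched as follows. This is a minimal sketch under the assumption that words are given as lists of subword units; the function name is hypothetical:

```python
from collections import defaultdict
from statistics import mean

def mean_unit_length_by_word_length(segmented_words):
    """Group words by their length in characters and average the lengths
    of the subword units that make up words of each length."""
    by_word_length = defaultdict(list)
    for units in segmented_words:
        word_length = sum(len(u) for u in units)
        by_word_length[word_length].extend(len(u) for u in units)
    return {wl: mean(lengths) for wl, lengths in sorted(by_word_length.items())}
```

Plotting mean unit length against word length for each language then makes the cross-linguistic differences in evenness directly comparable.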

George Mikros, Dept. of Middle Eastern Studies, College of Humanities and Social Sciences (CHSS), Hamad Bin Khalifa University (HBKU):

Detection of AI-Generated Texts and Quantitative Analysis of Large Language Model Outputs

In recent years, there has been a seismic shift in the landscape of Natural Language Understanding
(NLU) and Language Generation (LG) tasks, precipitated by the advent of Large Language Models
(LLMs). These models, notably OpenAI’s GPT-4 and Anthropic’s Claude, have been recognized for
their ability to produce high-quality, coherent, and context-specific textual content (Brown et al.,
2020). The sophistication of these models is such that their written outputs frequently mirror
human-produced text to an extent that eludes detection by most current AI-writing detectors.
In this lecture, we intend to present a quantitative analysis of the textual outputs generated by these
two leading-edge LLMs, focusing on discerning linguistic features that distinguish them from human
text production. We will scrutinize an extensive array of stylometric and linguistic characteristics and
investigate the interrelations among these features utilizing a broad range of statistical
methodologies and visualization techniques. Moreover, we will explore the latest advancements in
detecting AI-generated writing. However, we argue that, given the current state of the
technology, reliable detection is not feasible, especially in real-world educational scenarios.
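The kind of stylometric characteristics mentioned above can be as simple as lexical-richness and length statistics computed from raw text. The following is a toy illustration of such a feature profile — a small hypothetical subset, not the feature set used in the lecture:

```python
import re

def stylometric_profile(text):
    """Compute a few simple stylometric features of a text (toy subset)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # Lexical richness: distinct word types per running word.
        "type_token_ratio": len(set(words)) / len(words),
        # Average word length in characters.
        "mean_word_length": sum(len(w) for w in words) / len(words),
        # Average sentence length in words.
        "mean_sentence_length": len(words) / len(sentences),
    }
```

Profiles of this kind, computed for human-written and LLM-generated corpora, are what the statistical comparison and visualization operate on.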
Our wider objective in this talk is to develop a deeper understanding of the stochastic nature of AI-
generated writing and to distinguish it from human text production. By doing so, we aim to shed
light on the nuanced distinctions between machine-generated and human-generated writing,
thereby offering new insights into the evolving field of AI-assisted text production.