Time & Location: every 2-3 weeks on Tuesdays from 10:15 am to 12:00 pm in room BIN-2-A.10.
Please note that the room has changed from the previous semester.
Online participation via the MS Teams Team CL Colloquium is also possible.
Responsible: Marius Huber
Embedding models are fundamental components of semantic search engines and other Natural Language Processing (NLP) systems, as they provide powerful vectorized representations of text ("embeddings"). But how can we judge whether one embedding model is better than another, or diagnose avenues for its improvement? While for English and even English-X language pairs the situation appears mostly clear thanks to large-scale benchmarks, we still know little about the robustness of embeddings to the extremely heterogeneous texts we encounter "in the wild", i.e., texts that may come from a different language or a different period, contain transcription errors, and/or mix codes, to name just a few common phenomena. To test such an open setting, we plan to build a testbed for embedding models from the IMPRESSO corpus, which contains millions of digitized, multilingual, and temporally and spatially distributed news texts spanning more than two centuries. Are current embedding models up to the challenge?
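A minimal sketch of the kind of robustness probe such a testbed could run, assuming the sentence-transformers library and an arbitrarily chosen multilingual model (neither is specified by the talk): embed a clean reference sentence and several "in the wild" variants, then check how far the similarities drop.

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is an illustrative assumption, not the one evaluated in the talk.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

clean = "The parliament approved the new railway budget yesterday."
variants = [
    "Das Parlament hat gestern das neue Eisenbahnbudget genehmigt.",  # other language
    "The parliament approued the nevv railway budget yefterday.",     # OCR-style noise
    "Le parlement a approuvé hier le nouveau budget ferroviaire.",    # other language
]

# A robust model should keep these similarities high despite the surface variation.
ref_emb = model.encode(clean, convert_to_tensor=True)
var_embs = model.encode(variants, convert_to_tensor=True)
for text, sim in zip(variants, util.cos_sim(ref_emb, var_embs)[0]):
    print(f"{sim:.3f}  {text}")
```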
Applying NLP methods to sign languages is challenging, primarily due to data scarcity and the absence of a well-established methodology. While it is still unclear whether an end-to-end or a pipeline approach will take the lead, we see more basic problems to solve in sign language processing, including segmentation, alignment, and representation. On the one hand, we are working on releasing more and better-quality publicly available data. On the other, we draw inspiration from recent advances in LLMs and deep pretrained models to guide our research in tackling the basic problems mentioned above.
The highest-probability sequences of most neural language generation models tend to be degenerate in some way, a problem known as the inadequacy of the mode. While many approaches exist for tackling particular aspects of the problem, such as overly short sequences or excessive repetition, explanations of why it occurs in the first place are rarer and do not agree with each other. In this talk we will discuss the current attempts at explaining this phenomenon and why we believe they do not paint a full picture. We will also offer an alternative hypothesis that links the inadequacy of the mode to the need for our models to generalise to previously unseen contexts.
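A toy illustration of the phenomenon, not taken from the talk: approximating the mode with wide beam search on a small off-the-shelf model (GPT-2 via Hugging Face transformers is an arbitrary choice here) typically yields a repetitive, degenerate continuation, while sampling from the very same distribution does not.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The history of the city is", return_tensors="pt")

# Approximate the modal sequence with wide beam search (no sampling).
beam = model.generate(**inputs, num_beams=16, max_new_tokens=40, do_sample=False)
# Ancestral sampling from the same model.
sample = model.generate(**inputs, do_sample=True, top_p=0.95, max_new_tokens=40)

print("beam  :", tok.decode(beam[0], skip_special_tokens=True))
print("sample:", tok.decode(sample[0], skip_special_tokens=True))
```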
If we view linguistic communication as information transmission between cognitive agents, successful language production can be understood as an act of reducing the uncertainty over future states that a comprehender may be anticipating. When an individual utters a sentence, they narrow down the comprehender's expectations, and they do so by an amount proportional to the contextual predictability of the utterance. I will discuss two recent studies that demonstrate how we can empirically estimate utterance uncertainty and predictability by simulating potential upcoming linguistic contributions with neural text generators. The first study introduces a statistical framework to quantify utterance uncertainty as production variability, and evaluates the alignment of language generators with the production variability observed in humans. We find that different types of production tasks exhibit distinct levels of lexical, syntactic, and semantic variability, and that neural text generators generally achieve satisfactory calibration of uncertainty. In the second study, we use this statistical framework to define a novel measure of utterance predictability, which we term information value. Information value quantifies predictability by measuring an utterance's distance from contextually plausible alternatives, and offers advantages over traditional measures by disentangling various dimensions of uncertainty and being less influenced by surface-form competition. Psycholinguistic experiments demonstrate that information value is a superior predictor of utterance acceptability in written and spoken dialogue compared to token-level surprisal aggregates, and that it complements surprisal in predicting eye-tracked reading times.
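A rough sketch of the underlying idea, with the model names, the context, and the embedding-based distance all chosen for illustration rather than taken from the studies: simulate alternative contributions with a text generator, then score the observed utterance by its average distance from those sampled alternatives.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

gen_tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in text generator
generator = AutoModelForCausalLM.from_pretrained("gpt2")
embedder = SentenceTransformer("all-MiniLM-L6-v2")       # assumed distance function

context = "A: Are you coming to the meeting tomorrow?\nB:"
observed = "I can't, I have a doctor's appointment."

# Simulate alternative contributions a comprehender might anticipate in this context.
inputs = gen_tok(context, return_tensors="pt")
outs = generator.generate(**inputs, do_sample=True, top_p=0.9,
                          max_new_tokens=20, num_return_sequences=10)
alternatives = [gen_tok.decode(o[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
                for o in outs]

# Information value, roughly: how far the observed utterance lies from the alternatives.
obs_emb = embedder.encode(observed, convert_to_tensor=True)
alt_embs = embedder.encode(alternatives, convert_to_tensor=True)
info_value = (1 - util.cos_sim(obs_emb, alt_embs)).mean().item()
print(f"toy information value estimate: {info_value:.3f}")
```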
Hate speech detection models are only as good as the data they are trained on. Datasets sourced from social media suffer from systematic gaps and biases, leading to unreliable models with simplistic decision boundaries. Adversarial datasets, collected by exploiting model weaknesses, promise to fix this problem. However, adversarial data collection can be slow and costly, and individual annotators have limited creativity. In this paper, we introduce GAHD, a new German Adversarial Hate speech Dataset comprising ca. 11k examples. During data collection, we explore new strategies for supporting annotators, to create more diverse adversarial examples more efficiently and provide a manual analysis of annotator disagreements for each strategy. Our experiments show that the resulting dataset is challenging even for state-of-the-art hate speech detection models, and that training on GAHD clearly improves model robustness.
"Who does what to whom?" The goal of a graph-based meaning representation (in short: MR) is to represent the meaning of a text in a structured format. With an MR, we can explicate the meaning of a text, describe occurring events and entities, and their semantic relations. Thus, a metric of MRs would measure a distance (or similarity) between MRs. A main hypothesis of my PhD thesis was that such a meaning-focused similarity measurement can be useful for several important AI tasks, for instance, testing the capability of systems to produce meaningful output (system evaluation), or when searching for similar texts (information retrieval). Moreover, due to the natural explicitness of MRs, I hypothesized that MR metrics could provide us with valuable explainability of their similarity measurement. Indeed, if texts reside in a space where their meaning has been isolated and structured, we might directly see in which aspects two texts are actually similar (or dissimilar).In this talk, I'll give a brief overview of some findings of my thesis, showing the usefulness of MR metrics to important AI applications, including explainable NLG evaluation and semantic search.
Language models have transitioned into ubiquitous commercial web APIs, with recent research highlighting their proficiency in multilingual applications. These APIs operate on a token-based pricing system, where the definition of a token varies depending on the specific model and training data, resulting in varying cost efficiencies across languages. Previous studies have identified several drawbacks of tokenization in multilingual settings, including increased costs, latency, and limitations in contextual learning. This talk discusses an ongoing project aimed at identifying critical factors influencing tokenization parity across languages.
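A small sketch of how tokenization parity can be measured, assuming the tiktoken library and hand-written parallel sentences purely for illustration: encode the same content in several languages with one tokenizer and compare the token counts, which translate directly into API cost.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI API models

parallel = {  # invented parallel sentences, not a real benchmark
    "English": "The weather will be sunny tomorrow.",
    "German":  "Das Wetter wird morgen sonnig sein.",
    "Finnish": "Huomenna sää on aurinkoinen.",
}

baseline = len(enc.encode(parallel["English"]))
for lang, sent in parallel.items():
    n = len(enc.encode(sent))
    # A premium above 1.0 means the language pays more tokens (and money) for the same content.
    print(f"{lang:8s} {n:3d} tokens  premium x{n / baseline:.2f}")
```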
Multimodal learning involves integrating information from various modalities or sources to enhance learning and comprehension. Fusing data from different modalities can improve performance in identity recognition scenarios. In our paper, we compare three modality fusion approaches in identity identification and verification scenarios by processing two modalities: voice and face. We explore sensor fusion, feature fusion, and score fusion approaches. Our evaluation, conducted on the VoxCeleb2 dataset using K-fold cross-validation, shows that the feature fusion strategy achieves the highest performance, with an accuracy of 98.33% in identity identification and an equal error rate (EER) of 0.62% in verification.
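A hedged sketch of the feature-fusion strategy, with random stand-in embeddings and an arbitrary classifier instead of the paper's actual encoders and models: concatenate the voice and face vectors into one joint feature vector, train a single classifier on it, and evaluate with K-fold cross-validation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_ident, per_ident = 10, 20

# Random stand-ins for speaker-encoder (192-d) and face-encoder (512-d) embeddings,
# with an identity-specific offset so the toy task is learnable.
voice = (np.repeat(rng.normal(size=(n_ident, 192)), per_ident, axis=0)
         + rng.normal(scale=0.5, size=(n_ident * per_ident, 192)))
face = (np.repeat(rng.normal(size=(n_ident, 512)), per_ident, axis=0)
        + rng.normal(scale=0.5, size=(n_ident * per_ident, 512)))
labels = np.repeat(np.arange(n_ident), per_ident)

fused = np.concatenate([voice, face], axis=1)  # feature fusion: one joint vector per sample

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, fused, labels, cv=5)  # K-fold evaluation
print(f"identification accuracy on toy data: {scores.mean():.2f}")
```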
Reading is a highly complex cognitive skill that relies on the efficient interplay of several underlying factors. Visual search is a task where this interplay is especially relevant, as visual search performance has been used to predict reading ability, while visual search behavior is simultaneously influenced by reading experience. In this talk, I will present an experiment in which we investigated the influence of stimulus type and reading experience on reaction times in a visual search task, revealing a letter familiarity effect. This work is part of an ongoing longitudinal eye-tracking study on reading development ("Lesen im Blick"), on which I will give an update.
There is currently significant interest in utilizing automated metrics to rate the quality of generated texts. Learned metrics, which are trained to replicate human quality ratings, have gained prominence due to their promise of providing reliable evaluation outcomes at a reduced time and cost compared to human evaluation. However, these automated metrics are not without flaws and can introduce errors, such as assigning low ratings to high-quality generations or vice versa. Such errors can significantly impact the evaluation process, especially when averaging scores to measure system-level performance. We propose a model that explicitly accounts for these inaccuracies. This model integrates both human ground-truth ratings and automated metric scores to derive system-level performance estimates that more closely align with human evaluations. We will discuss insights and observations from this error-focused approach and highlight open problems and potential extensions.
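A deliberately simplified sketch of the general idea rather than the proposed model: learn how the metric deviates from human ratings on the small subset where both are available, and use that mapping to correct the metric-only scores before averaging them into a system-level estimate. All data below are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic scenario: a biased, noisy metric observed for all 1,000 outputs,
# human ratings available for only the first 100.
human_all = rng.uniform(1, 5, size=1000)
metric_all = 0.7 * human_all + 1.2 + rng.normal(scale=0.4, size=1000)
labelled = slice(0, 100)

# Learn the metric -> human mapping on the labelled subset.
calib = LinearRegression().fit(metric_all[labelled, None], human_all[labelled])

naive_estimate = metric_all.mean()                     # trusts the metric blindly
corrected = calib.predict(metric_all[:, None]).mean()  # error-aware system-level estimate

print(f"naive metric mean : {naive_estimate:.2f}")
print(f"corrected estimate: {corrected:.2f}")
print(f"true human mean   : {human_all.mean():.2f}")
```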
In the context of the SMILE II and IICT projects, a Swiss German Sign Language (DSGS) learning and assessment tool has been developed with the objective of integrating it into an existing sign language learning platform. This presentation focuses on a usability study conducted to validate the tool's effectiveness. We recruited 36 DSGS learners who completed a one-hour task with the tool and answered an online survey. DSGS learners expressed a desire for explanations alongside visual feedback for their errors. To address this, we conducted a pilot study leveraging GPT-4 to transform sign language annotations into textual explanations as feedback, with the aim of including this approach in the tool to facilitate the comprehension of error information. To validate the tool’s capability of assessing learners, we conducted a rater study involving 23 native signers/early learners of DSGS to score the learner performances collected during the usability study.
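A hedged sketch of what the GPT-4 pilot could look like in code; the annotation format, prompt, and error example are invented for illustration, and the call uses the standard OpenAI Python client rather than the project's actual setup.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical gloss-level error annotation for one produced sign.
annotation = {
    "target_gloss": "HAUS",
    "error_type": "handshape",
    "observed": "flat hand",
    "expected": "bent hand",
}

prompt = (
    "You are a Swiss German Sign Language (DSGS) tutor. In one or two encouraging "
    f"sentences, explain the following production error to a learner: {annotation}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```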
Despite the recent success of end-to-end (E2E) automatic speech recognition (ASR) models, recognizing out-of-vocabulary (OOV), rare, and specialized words, as well as adapting to new domains, remains a challenge. If parallel (audio-transcription) data is available for a new domain, an ASR model can be adapted by continuing to train it on this data. However, transcribing data is expensive, and often we only have access to text data. Using text to improve an E2E model trained on parallel data is not straightforward. In my presentation, I will discuss various methods for improving the ASR model with text data and present some preliminary results.
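One common way of using text-only data, shown here as a sketch rather than as the methods discussed in the talk: rescore the ASR system's n-best hypotheses with an external language model trained on in-domain text. GPT-2 stands in for that LM, and the n-best list and its scores are invented.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

lm_tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for an LM trained on domain text
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def lm_logprob(text: str) -> float:
    """Total log-probability of a hypothesis under the external LM."""
    ids = lm_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item() * ids.shape[1]

# Invented n-best list: (hypothesis, score from the E2E ASR decoder).
nbest = [
    ("the patient has nephritis", -12.3),
    ("the patient has new fright is", -11.9),
]

lam = 0.3  # LM weight, tuned on a development set in practice
best = max(nbest, key=lambda h: h[1] + lam * lm_logprob(h[0]))
print("selected hypothesis:", best[0])
```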
Eye movements in reading have been utilized for a variety of purposes: studying psycholinguistic phenomena, investigating surprisal theory, and exploring the cognitive plausibility and enhancement of neural language models. I will present EMTeC, a corpus of eye movements in reading machine-generated texts, which not only supports the study of human language processing in psycholinguistics and the evaluation of machine language processing, but also enables direct investigation of the alignment between neural language processing and human language comprehension. I will outline the processes of stimuli generation, data collection, and data pre-processing, along with the challenges encountered along the way, and present preliminary data analyses.
In the context of the interdisciplinary NRP77 project "Monitoring Task and Skill Profiles in the Digital Economy," NLP techniques have been applied to a broad range of tasks on a longitudinal data set of job postings. This effort aims to facilitate social science research on the evolving nature of work and required skills amidst the digital transformation of the last 30 years. This presentation will focus on the extraction of job tasks from Swiss job postings in German and their mapping to the detailed work activities classification in the US labor market ontology O*NET. For this very fine-grained (multi-label) classification challenge, we employ SBERT models to perform semantic similarity comparisons between job tasks and O*NET work activities, training on ontological data with a Multiple Negatives Ranking loss. We experiment with subspan markup, leveraging non-hierarchical ontological relationships, and contextualizing both training data and similarity-lookup queries to enhance classification accuracy.
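A sketch of this fine-tuning and lookup setup, with a handful of invented German task / English work-activity pairs and an arbitrarily chosen multilingual SBERT checkpoint standing in for the project's actual data and models: train with Multiple Negatives Ranking loss (in-batch negatives), then classify new tasks by semantic-similarity lookup against all work activities.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative checkpoint

# Invented (job task, O*NET detailed work activity) pairs; real pairs come from the ontology.
train_pairs = [
    InputExample(texts=["Kundendossiers pflegen", "Maintain client records"]),
    InputExample(texts=["Offerten erstellen", "Prepare cost estimates or proposals"]),
]
loader = DataLoader(train_pairs, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch pairs serve as negatives
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)

# Multi-label classification as a semantic-similarity lookup against all work activities.
activities = ["Maintain client records", "Prepare cost estimates or proposals",
              "Schedule appointments"]
act_emb = model.encode(activities, convert_to_tensor=True)
task_emb = model.encode("Terminvereinbarungen mit Kunden koordinieren", convert_to_tensor=True)
for hit in util.semantic_search(task_emb, act_emb, top_k=2)[0]:
    print(activities[hit["corpus_id"]], round(hit["score"], 3))
```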