Bachelor Thesis of Richard Socher

Abstract

This work investigates and extends a bootstrapping approach for enlarging high-quality lexical resources with the help of large corpora. The emphasis lies on the extraction of lexical-semantic information and word meaning, which are fundamental components of advanced applications such as information retrieval, text summarization and the Semantic Web.

The approach is based on co-occurrences of verbs with nouns in a specific context such as object, subject or certain theta-roles. The experiments use a large parsed corpus and are compared to past investigations with adjectives and nouns in order to find out whether adjective modifiers or certain verb-noun relations are more suitable for classifying nouns with respect to their semantic characteristics. The algorithm starts with several seed words whose characteristics are known and which stand in certain relations to the respective verb or adjective in the sentence. Unknown nouns that co-occur in the same context then inherit some of the characteristics of these seed words.
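The seed-and-inherit idea described above can be illustrated with a minimal Python sketch. It is not the algorithm of the thesis: the function name `bootstrap`, the parameters `min_votes` and `max_rounds`, the simple voting rule and the toy English data are all assumptions chosen for illustration. Each context is a hypothetical (verb, relation) pair, and an unknown noun takes the label that the known nouns sharing its contexts vote for most often.

```python
from collections import Counter, defaultdict


def bootstrap(pairs, seeds, min_votes=2, max_rounds=5):
    """Propagate seed labels to unknown nouns via shared contexts.

    pairs: iterable of (context, noun) co-occurrences (hypothetical format).
    seeds: dict mapping seed nouns to a known semantic label.
    """
    labels = dict(seeds)
    contexts_of = defaultdict(set)  # noun -> contexts it occurs in
    nouns_of = defaultdict(set)     # context -> nouns occurring in it
    for context, noun in pairs:
        contexts_of[noun].add(context)
        nouns_of[context].add(noun)

    for _ in range(max_rounds):
        new_labels = {}
        for noun, contexts in contexts_of.items():
            if noun in labels:
                continue
            # Every labelled noun that shares a context casts one vote.
            votes = Counter()
            for context in contexts:
                for other in nouns_of[context]:
                    if other in labels:
                        votes[labels[other]] += 1
            ranked = votes.most_common(2)
            if not ranked:
                continue
            best_label, best = ranked[0]
            runner_up = ranked[1][1] if len(ranked) > 1 else 0
            # Assumed acceptance rule: enough votes and a strict lead.
            if best >= min_votes and best > runner_up:
                new_labels[noun] = best_label
        if not new_labels:
            break  # nothing new was learned in this round
        labels.update(new_labels)
    return labels


# Toy data, invented for illustration only.
pairs = [
    (("eat", "obj"), "apple"), (("eat", "obj"), "bread"),
    (("eat", "obj"), "soup"), (("cook", "obj"), "soup"),
    (("cook", "obj"), "bread"),
    (("drive", "obj"), "car"), (("drive", "obj"), "truck"),
    (("repair", "obj"), "car"), (("repair", "obj"), "truck"),
]
seeds = {"apple": "food", "bread": "food", "car": "vehicle"}
result = bootstrap(pairs, seeds)
```

In this toy run, "soup" and "truck" start without a label and receive one from the seed nouns that appear as objects of the same verbs. Labels accepted in one round serve as evidence in the next, which is what makes the procedure a bootstrapping loop.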

The aim is to find the most effective relations for enhancing semantic resources for nouns in general and to apply the findings to the German lexicon HaGenLex in an automatic, weakly supervised manner. The findings of this work will help to extend existing lexicons. This is necessary since there are still no sufficiently large semantic lexicons for German. The first chapter outlines computational linguistics and corpus linguistics and explains the semantic structure that is used in the lexicon. Furthermore, the basics of bootstrapping and of similar approaches are provided to clarify the scientific context of this work and to show the general applicability of such an approach.

In chapter 2, the pre-processing steps are presented and the algorithm is explained with theory and examples. Parameters that can change the results substantially are discussed.

Chapter 3 describes the experiments in detail. Experiments with adjectives have been carried out before and are compared here to the new ones, whereas the extension to verbal relations such as subject, object and theta-roles has hitherto not been examined for German. By means of extensive experiments, effective relations for bootstrapping are discovered and optimal new parameter combinations are found. The chapter ends with the combination of the three main relations, which outperforms the separately obtained solutions and increases precision significantly.

In the last chapter, an outlook with suggestions for further improvements and extensions is given, and a novel approach that combines genetic algorithms and bootstrapping is outlined.