Atila Meeting - November 14, 2007

Location

Address

Het Pand
Onderbergen 1
9000 Ghent

Program

09.30-10.00

Welcome coffee

10.00-10.15

Opening

10.15-12.30

Work in progress in Predicting Personality from Text
Kim Luyckx & Walter Daelemans

We describe a machine learning approach to predicting the personality of an author on the basis of linguistic properties of the text he or she wrote. We collected a corpus of 145 essays written by BA-level students on a single topic and had each student take a personality test, providing us with a Myers-Briggs personality profile. The focus of this study is a systematic evaluation of the effectiveness of lexical and syntactic features that have proven useful in the field of stylometry. Syntactic features like part-of-speech n-grams are generally accepted as not being under the author's conscious control and therefore provide good clues for predicting gender or authorship. We want to test whether these and similar features are helpful for personality detection.
The approach we took is an automatic text categorization approach. First, a document representation is constructed based on feature selection from the linguistically analysed texts. For the linguistic analysis we used the Memory-Based Shallow Parser (MBSP). The document representations using these features are associated with each of the four components of the Myers-Briggs Type Indicator (Introverted-Extraverted, Sensing-Intuitive, Thinking-Feeling, Judging-Perceiving). This produces four binary classification tasks that are learned using memory-based learning. The results are not unequivocal, however, since ten-fold cross-validation experiments exhibit a rather high degree of variance in the data.
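As a rough illustration of the setup described above, the following Python sketch splits a four-letter profile into four binary labels and counts part-of-speech n-grams as document features. The function names, the label keys and the toy tag sequence are assumptions made for this example; they are not taken from the authors' system, and MBSP itself is not called here.

```python
from collections import Counter

# The four dimensions, each a binary choice between two letters.
DIMENSIONS = [("I", "E"), ("S", "N"), ("T", "F"), ("J", "P")]


def binary_labels(profile):
    """Split a four-letter profile such as 'INTJ' into four binary labels."""
    profile = profile.upper()
    if len(profile) != 4:
        raise ValueError("expected a four-letter profile")
    labels = {}
    for letter, (first, second) in zip(profile, DIMENSIONS):
        if letter not in (first, second):
            raise ValueError(f"unexpected letter {letter!r}")
        labels[first + second] = letter
    return labels


def pos_ngrams(tags, n=3):
    """Count contiguous n-grams over a sequence of part-of-speech tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


# Hypothetical tag sequence; a real one would come from a shallow parser.
features = pos_ngrams(["DET", "N", "V", "DET", "N"], n=2)
labels = binary_labels("INTJ")
```

Each of the four labels would then be the target of a separate binary classifier trained on the same feature counts.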

Reasoning about Fuzzy Temporal and Spatial Information from the Web
Steven Schockaert

As the concepts of time and space are paramount in our perception of the world, much of the information users are looking for is subject to temporal and spatial constraints. Current IR systems, on the other hand, do not have access to the temporal and spatial information needed to support such constraints. Moreover, due to the vagueness of many real-world temporal and spatial concepts, non-traditional reasoning frameworks are required. In this talk, we will discuss how (vague) temporal and spatial information can be extracted from web documents using redundancy-based, rather than linguistic, techniques.
We will furthermore show how fuzzy reasoning can be used to deal with inconsistencies in the extracted knowledge base, and to improve the overall effectiveness of the resulting retrieval systems.

Entity Recognition with Wikipedia
Erik Tjong Kim Sang

Wikipedia is a fast-growing online encyclopedia. In this talk we will explore the opportunities it offers for entity recognition. We will examine two tasks. First, we will look at applying knowledge extracted from Wikipedia for increasing the coverage of a machine learner trained to identify Dutch names. Second, we will apply Wikipedia for a more challenging task: identifying arbitrary entities and linking them to the appropriate article in the encyclopedia.

Automatic post-correction of OCR'ed Cultural Heritage corpora
Martin Reynaert

On behalf of the Dutch Royal Library - The Hague we have studied and partially solved the problem of OCR-induced typographical variation. Text-Induced Corpus Clean-up or TICCL (pronounced 'tickle') focuses on high-frequency words derived from the corpus to be cleaned and exhaustively gathers all typographical variants for any particular focus word that lie within a predefined Levenshtein distance (henceforth: LD).
It next employs effective text-induced filtering techniques to retain as many as possible of the true positives, while discarding as many as possible of the false positives.
TICCL has been evaluated on a contemporary OCR-ed text corpus, the Staten-Generaal Digitaal 1989-1995 (SGD), and on a corpus of historical newspaper articles, i.e. 'Het Volk 1918' (HV). The latter presents greater challenges: its OCR quality is far lower and it is in the older Dutch spelling 'De Vries-Te Winkel'. We have annotated representative samples of typographical variants from both corpora, allowing us not only to evaluate our system, but also to draw conclusions about adapting the correction mechanism to OCR-error resolution.
TICCL obtains a cumulative F-score of around 95% at LD 2, with recall at around 99%. If one looks no further than 2 edits, these scores mean that for the SGD almost 89% and for HV almost 55% of the undesirable OCR-induced typographical variation present can be removed fully automatically, as these are the summed percentages of LD 1 and 2 errors observed in our 5,047 SGD and 3,799 HV error samples.
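To make the LD threshold concrete, here is a minimal Python sketch of the Levenshtein distance and of a brute-force search for variants of a focus word within a given LD. This is only an illustration of the distance measure; it is not a description of how TICCL gathers or filters its candidates, and the word list is invented for the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def variants_within(focus, vocabulary, max_ld=2):
    """Naive scan: keep every other word within max_ld edits of focus."""
    return [word for word in vocabulary
            if word != focus and levenshtein(focus, word) <= max_ld]


# Invented examples of OCR-like confusions around one focus word.
words = ["regering", "regeriug", "rcgcring", "minister"]
candidates = variants_within("regering", words, max_ld=2)
```

A scan like this compares the focus word against the whole vocabulary, so it only scales to small word lists; it is meant to show what "within LD 2" selects, nothing more.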

Aligning linguistically motivated phrases
Lieve Macken & Walter Daelemans

In this talk, we describe a sub-sentential alignment system that links linguistically motivated phrases in parallel texts. Sub-sentential alignments are used, among other things, to create phrase tables for statistical phrase-based machine translation (SMT) systems. However, a stand-alone sub-sentential alignment module is also useful for human translators if incorporated in CAT tools, e.g. sophisticated bilingual concordance systems, or in sub-sentential translation memory systems.
In existing SMT systems, a phrase is not linguistically motivated: it can be any contiguous sequence of words. We expect that the use of linguistically relevant phrases can improve the performance of phrase-based SMT systems.
We present the first results of our sub-sentential alignment system, which links linguistically motivated chunks based on lexical clues (word alignments) and syntactic similarity measures.

 

12.30-13.30

Lunch

13.30-14.30

Invited speaker

Data-driven Machine Translation: Conceptualisations and Implementations
Michael Carl

This talk summarizes the essence of an ESSLLI course on Example-based Machine Translation, held in Dublin in 2007. I will give a historical review of the first ideas on how manually translated texts could be re-used for automatic machine translation. Three apparently fundamentally different ideas emerged in the 1980s: the translator's amanuensis, statistical machine translation and example-based machine translation. I briefly trace the first implementations of example-based MT and recent achievements in statistical MT. I conclude that the fundamental questions raised in the early 1990s are still basic research topics today. What formerly seemed to be distinctive features of example-based and statistical approaches to machine translation can now be perceived as axes in multi-dimensional MT model spaces, which allow a better understanding of the essential differences and of system component implementation. A thorough integration of fully automatic, data-driven translation methods with Translation Memories is yet to be investigated and remains a research topic for the future.

 

14.30-15.00

AnTiGen, AnTiGeL, TAnGent, GAnTile, GenTiLa, GATila, GAnTila, TAG, GAT, AnTiGua, AnTiGoon?

 

15.00-15.30

Coffee

15.30-16.45

Graphical Language Processing: combining shallow parsing and domain modeling
Vincent Van Asch & Walter Daelemans

Advances in robust automatic analysis of text and in graphical domain modeling and reasoning have made possible new applications in which natural language input to graphical design software is interpreted, augmented and translated to graphical output. We present preliminary results in the GRAVITAL project, which aims at developing a concrete end-to-end implementation of such an approach. The application takes natural language descriptions of a design as input and analyzes them using a memory-based shallow parser and domain-specific ontologies and semantic rules. The linguistic representation is then translated into instructions in the Python-based graphical programming language NodeBox and executed. We also describe a recent redirection in the project in which information extraction techniques are used to translate free-text-based briefings of graphical designers into templates for further processing by a graphical expert system.

Protein Relation Extraction using Full Parsing Information
Timur Fayruzov, Martine De Cock, Chris Cornelis and Véronique Hoste

Studying protein interactions is an essential task in biomedical research, hence a lot of effort is devoted to constructing interaction knowledge bases. However, as the biomedical domain is very dynamic, manual maintenance of such knowledge bases is highly labour-intensive. Here we present a new approach to (semi-)automatically mine protein relations from scientific texts, based on syntactic information. Our approach aims at supporting humans in finding relevant information, rather than excluding them entirely from the data processing flow.
Traditional classification-based algorithms for learning from texts such as Support Vector Machines (SVM) or Hidden Markov Models (HMM) require large annotated training data sets, which unfortunately are not available for the problem at hand.
Hence, we abstract from pure linguistic data and concentrate on more general language structures such as parsing and dependency information. First, for each sentence we extract a dependency tree and consider all linked chains between every two proteins in the dependency tree as potential reactions (we assume that proteins were recognized somewhere else). Secondly, we build a parse tree for the sentence and compute the depth and number of nested subordinate sentences for every protein in the tree. Using both information sources, we build a feature vector expressing the potential interaction.
This abstract representation reduces the number of variants for expressing the same fact, which in turn allows smaller datasets to be used to train classification algorithms in detecting positive and negative examples of protein reactions.
For our experiments we used the AImed corpus and the LLL05 dataset. Using WEKA, we tested rule induction machine-learning algorithms such as decision trees and RIPPER, as well as the statistical BayesNet classifier.
All algorithms were run on standalone datasets and on the combination of datasets.
For standalone datasets, we used 10-fold cross-validation; for combined datasets, we used one dataset for training and one for evaluation. The best results were obtained on the LLL05 dataset with the C4.5 decision tree algorithm: a recall of 0.74 at a precision of 0.8, which is competitive with some state-of-the-art kernel-based and genetic algorithms.
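For readers who want to compare these figures with systems that report a single balanced score, the quoted precision and recall can be combined into an F-score as sketched below. The F-score is not reported in the abstract; it is derived here purely as an illustration, and the function name and `beta` parameter are choices made for this example.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall as quoted for C4.5 on the LLL05 dataset.
best = f_score(precision=0.8, recall=0.74)
print(round(best, 3))
```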

From expedition field books to a knowledge base
Piroska Lendvai & Steve Hunt

We describe the process of turning flat cultural heritage data of a museum collection into a searchable knowledge base, using domain-independent machine learning techniques and XML structure. First, digitised expedition field notes are automatically segmented into a domain-specific database. In order to enter only perfect data into the mBase knowledge base, an annotation interface combined with selective sampling is created in which users can validate a labelled field note entry. Next, the records in mBase are marked up with semi-automatically derived secondary metadata. Work in progress includes exploiting the metadata layers for advanced querying, the results of which are additionally visualised using maps and photos.

 

16.45-18.00

Data collection, constitution and exploitation in the Sawa Corpus Project
Guy De Pauw & Peter Waiganjo Wagacha (University of Nairobi)

This presentation describes on-going work in the Sawa Corpus Project (2007-2008), which aims to construct a parallel English-Kiswahili corpus for the purpose of contrastive linguistic analysis, projection of annotation and machine translation. In this talk, we will focus on the inherent fuzziness of the boundaries between the data collection phase (manual translation), the data constitution phase (word alignment) and the data exploitation phase. The talk is a recap of an IPRA 2007 panel discussion on "Data constitution in African languages", where researchers for the first time discussed these issues for corpus collection for African languages.

Memory-based machine translation
Antal van den Bosch

Memory-based machine translation (MBMT) is an approach to MT that is complementary to example-based MT and statistical MT, while being an exponent of both types of MT. Its key strength is its use of (a fast approximation of) the k-NN classifier, a machine learning algorithm insensitive to the number of classes, for the selection of translation candidate n-grams. Rather than leaving this selection open and spending all effort on selecting the most likely translation through the target language model, as an SMT system does, MBMT uses context in the source language to reduce the number of possible translations of n-grams. Stroppa, Van den Bosch, and Way (2007) showed that a phrase-based SMT (PB-SMT) system can profit from integrating this source-language-sensitive filtering of target language n-grams. We review these results. We also present new results obtained with the pure MBMT system introduced in Van den Bosch, Stroppa, and Way (2007), which selects translations either without resorting to a language model or using a new memory-based language model.

IGForest: From tree to forest
Herman Stehouwer

A well-known algorithm in TiMBL is IGTree. IGTree is a fast trie-based approximation of k-NN. Because of its trie-based nature, IGTree can mismatch on a feature early in the trie, resulting in sub-optimal classification compared to IB1.
In this talk an early version of IGForest is presented that tries to lessen the impact of this problem. The performance of IGForest is compared to both IGTree and IB1 on a diverse set of NLP problems.
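The mismatch problem can be illustrated with a much simplified trie classifier in Python. The sketch assumes the features of every instance are already ordered from most to least informative (IGTree orders them by information gain) and it omits the compression that IGTree applies; the class and function names and the toy data are invented for this example and do not reflect TiMBL's implementation or IGForest.

```python
from collections import Counter


class Node:
    """Trie node that remembers the class distribution of the training
    instances passing through it."""

    def __init__(self):
        self.children = {}
        self.class_counts = Counter()

    @property
    def default(self):
        # Most frequent class seen at this node.
        return self.class_counts.most_common(1)[0][0]


def build(instances):
    """Insert each (features, label) pair along one path of the trie."""
    root = Node()
    for features, label in instances:
        node = root
        node.class_counts[label] += 1
        for value in features:
            node = node.children.setdefault(value, Node())
            node.class_counts[label] += 1
    return root


def classify(root, features):
    """Descend while feature values match; at the first mismatch, stop and
    return the default class of the last matching node."""
    node = root
    for value in features:
        child = node.children.get(value)
        if child is None:
            break  # every remaining feature is ignored from here on
        node = child
    return node.default


training = [
    (("a", "x", "p"), "A"),
    (("a", "x", "q"), "A"),
    (("a", "y", "p"), "B"),
    (("b", "x", "p"), "B"),
    (("b", "y", "q"), "B"),
]
trie = build(training)
```

With this toy data, a query such as `("c", "x", "p")` mismatches on the very first feature, so the two remaining features play no role in the decision, whereas a full nearest-neighbour search would still take them into account.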

Semeval: Machine learning of semantic relations with shallow features and almost no data
Iris Hendrickx, Roser Morante, Caroline Sporleder & Antal van den Bosch

We summarize our approach to the Semeval 2007 shared task on "Classification of Semantic Relations between Nominals". Our overall strategy is to develop machine-learning classifiers making use of a few easily computable and effective features, selected independently for each classifier in wrapper experiments. We train two types of classifiers for each of the seven relations: with and without any WordNet information.

Towards a robust semantic role labeling system
Roser Morante

This talk presents two memory-based semantic role labeling (SRL) systems. Semantic role labeling is a sentence-level natural-language processing (NLP) task in which semantic roles are assigned to all arguments of a predicate. Memory-based language processing is based on the idea that NLP problems can be solved by storing annotated examples of the problem in their literal form in memory, and applying similarity-based reasoning on these examples in order to solve new ones. One of the SRL systems uses manually annotated syntactic information and performs at state-of-the-art level. The other system does not use any syntactic information and its current performance is lower, but it frames the SRL task in a more realistic scenario by avoiding the parsing step.
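The memory-based idea described above (store examples literally, classify new cases by similarity) can be sketched as a small nearest-neighbour classifier in Python. The overlap distance, the feature tuples and the role labels below are assumptions chosen for illustration; they are not the features or settings of the two SRL systems presented in the talk.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions at which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(memory, query, k=1):
    """Return the majority label among the k stored examples closest to
    the query. Ties in distance are resolved by order in memory."""
    ranked = sorted(memory, key=lambda example: overlap_distance(example[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy memory of (features, label) pairs with invented values.
memory = [
    (("a", "x"), "R1"),
    (("a", "y"), "R2"),
    (("b", "y"), "R2"),
]
label = knn_classify(memory, ("c", "y"), k=3)
```

Nothing is abstracted away at training time in this scheme: "training" amounts to storing the examples, and all the work happens when a new instance is compared against memory.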

18.30-23.00 Amuse Gueule Route in Ghent

Registration Fee

Full day

65 Euro

Full day + dinner

125 Euro

Participants

CNTS (UA)

Walter Daelemans
Kim Luyckx
Vincent Van Asch
Guy De Pauw
Iris Hendrickx
Eric Van Horenbeeck

ILK (UvT)

Antal van den Bosch
Roser Morante
Erik Tjong Kim Sang
Martin Reynaert
Toine Bogers
Herman Stehouwer
Piroska Lendvai
Jeroen Geertzen
Peter Berck
Steve Hunt

Language Technology and Computational Intelligence (Associatie UGent)

Veronique Hoste
Martine De Cock
Lieve Macken
Els Lefever
Klaartje Vanopstal
Timur Fayruzov
Steven Schockaert
Chris Cornelis
Sofie Niemegeers
Kathelijne Denturck