| 10.15-12.30 |
Work in progress in Predicting Personality from Text
Kim Luyckx & Walter Daelemans
We describe a machine learning approach to the
prediction of the personality of an author on the basis of linguistic
properties of the text he or she wrote. We collected a corpus of 145
essays written by BA-level students on a single topic and had each student
take a personality test providing us with a Myers-Briggs personality
profile. The study is a systematic evaluation of the
effectiveness of lexical and syntactic features that have been proven
useful in the field of stylometry. Syntactic features like part-of-speech
n-grams are generally accepted as not being under the author's conscious
control and therefore provide good clues for predicting gender or
authorship. We want to test whether these and similar features are helpful
for personality detection. We treat the problem as
an automatic text categorization task. First, a document representation
is constructed based on feature selection from the linguistically analysed
texts. For the linguistic analysis we used the Memory-Based Shallow Parser
(MBSP). The document representations using these features are associated
with each of the four components of the Myers-Briggs Type Indicator
(Introverted-Extraverted, Sensing-Intuitive, Thinking-Feeling,
Judging-Perceiving). This produces four binary classification tasks that
are learned using memory-based learning. The results are not yet
conclusive, however, since ten-fold cross-validation experiments show a rather
high degree of variance in the data.
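The decomposition into four binary tasks and the use of part-of-speech n-grams as features can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `mbti_to_binary` and `pos_ngrams`, the 0/1 label encoding and the example tag names are assumptions made for the example, and the tags would in practice come from a shallow parser such as MBSP.

```python
from collections import Counter

# The four dimensions named in the abstract, as letter pairs.
# Assumed encoding: first letter of a pair -> 0, second letter -> 1.
DIMENSIONS = [("I", "E"), ("S", "N"), ("T", "F"), ("J", "P")]


def mbti_to_binary(profile):
    """Split a four-letter profile such as 'INTJ' into four binary labels."""
    profile = profile.upper()
    if len(profile) != 4:
        raise ValueError("expected a four-letter profile")
    labels = {}
    for letter, (first, second) in zip(profile, DIMENSIONS):
        if letter not in (first, second):
            raise ValueError(f"unexpected letter {letter!r}")
        labels[first + second] = int(letter == second)
    return labels


def pos_ngrams(tags, n):
    """Count part-of-speech n-grams in a sequence of tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


labels = mbti_to_binary("INTJ")
features = pos_ngrams(["DET", "NOUN", "VERB", "DET", "NOUN"], 2)
```

Each of the four labels would then be learned by a separate binary classifier over the same n-gram counts.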
Reasoning about Fuzzy Temporal and Spatial Information from the Web
Steven Schokaert
As the concepts of time and space are paramount in our
perception of the world, much of
the information users are looking for is subject to temporal
and spatial constraints. Current IR systems, on the other hand,
do not have access to the temporal and spatial information needed to support
such constraints. Moreover, due to the vagueness of many real-world temporal
and spatial concepts, non-traditional reasoning frameworks are required. In
this talk, we will discuss how (vague) temporal and spatial information can be
extracted from web documents using redundancy-based rather than linguistic
techniques.
We will furthermore show how fuzzy reasoning can be used to deal with
inconsistencies in the extracted knowledge base, and to improve the
overall effectiveness of the resulting retrieval systems.
Entity Recognition with Wikipedia
Erik Tjong Kim Sang
Wikipedia is a fast-growing online encyclopedia. In this talk we will explore
the opportunities it offers for entity recognition. We will examine two tasks.
First, we will look at applying knowledge extracted from Wikipedia for
increasing the coverage of a machine learner trained to identify Dutch names.
Second, we will apply Wikipedia for a more challenging task: identifying
arbitrary entities and linking them to the appropriate article in the
encyclopedia.
Automatic post-correction of OCR'ed Cultural Heritage corpora
Martin Reynaert
On behalf of the Dutch Royal Library - The Hague we have studied
and partially solved the problem of OCR-induced typographical variation.
Text-Induced Corpus Clean-up or TICCL (pronounced 'tickle') focuses on
high-frequency words derived from the corpus to be cleaned and exhaustively
gathers all typographical variants for any particular focus word that lie
within a predefined Levenshtein distance (henceforth LD).
It next employs effective text-induced filtering techniques to retain as many
as possible of the true positives, while discarding as many as possible of the
false positives.
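The variant-gathering step can be illustrated with a brute-force sketch. This is not TICCL's actual implementation, which the abstract does not describe: the function names, the example vocabulary and the exhaustive scan over the word list are assumptions made for illustration only.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def variants_within(focus_word, vocabulary, max_ld=2):
    """Return all vocabulary items within max_ld edits of the focus word."""
    return sorted(word for word in vocabulary
                  if word != focus_word
                  and levenshtein(focus_word, word) <= max_ld)


# Hypothetical corpus vocabulary with OCR-like misreadings.
vocabulary = {"regering", "regeriug", "rcgcring", "regeeringen", "minister"}
candidates = variants_within("regering", vocabulary, max_ld=2)
```

A scan like this returns every string within the distance bound, including legitimate words that merely resemble the focus word, which is why a filtering step is needed afterwards.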
TICCL has been evaluated on a contemporary OCR-ed text corpus, the
Staten-Generaal Digitaal 1989-1995 (SGD) and on a corpus of historical
newspaper articles, namely 'Het Volk 1918' (HV). The latter presents greater
challenges: its OCR quality is far lower and it is in the older Dutch spelling
'De Vries-Te Winkel'. We have annotated representative samples of typographical
variants from both corpora, allowing us not only to evaluate our system, but
also to draw conclusions about adapting the correction
mechanism to OCR-error resolution.
TICCL obtains a cumulative F-score of around 95% at LD 2, with recall at around
99%. If one looks no further than 2 edits, these scores mean that for the
SGD almost 89% and for HV almost 55% of the undesirable OCR-induced
typographical variation present can be removed fully automatically, as these
are the summed percentages of LD 1 and 2 errors observed in our 5,047 SGD and
3,799 HV error samples.
Aligning linguistically motivated phrases
Lieve Macken & Walter Daelemans
In this talk, we describe a sub-sentential
alignment system that links linguistically motivated phrases in parallel texts.
Sub-sentential alignments are used, among other things, to create phrase tables for statistical
phrase-based machine translation (SMT) systems. However, a stand-alone
sub-sentential alignment module is also useful for human translators if
incorporated in CAT-tools, e.g. sophisticated bilingual concordance systems, or
in sub-sentential translation memory systems.
In existing SMT systems, a phrase is not linguistically motivated: it can be
any contiguous sequence of words. We expect that the use of linguistically
relevant phrases can improve the performance of phrase-based SMT systems.
We present the first results of our sub-sentential alignment system, which
links linguistically motivated chunks based on lexical clues (word alignments)
and syntactic similarity measures.
|
| 13.30-14.30 |
Invited speaker
Data-driven Machine Translation: Conceptualisations and Implementations
Michael Carl
This talk summarises the essence of an ESSLLI course on Example-based Machine
Translation, held in Dublin in 2007. I will give a historical review of the first
ideas of how manually translated texts could be re-used for automatic machine
translation. Three apparently fundamentally different ideas emerged in the
1980s: the translator's amanuensis, statistical machine translation and
example-based machine translation. I briefly trace the first implementations of example-based MT and
recent achievements in statistical MT. I conclude that the fundamental
questions raised in the early 1990s are still basic research topics today. What
formerly seemed to be distinctive features of example-based and statistical
approaches to machine translation can now be perceived as axes in
multi-dimensional MT model spaces, which make it easier to understand the
essential differences and the implementation of system components. A thorough
integration of fully-automatic, data-driven translation methods with
Translation Memories is yet to be investigated and remains a research topic for
the future.
|
| 14.30-15.00 |
AnTiGen,
AnTiGeL,
TAnGent,
GAnTile,
GenTiLa,
GATila,
GAnTila,
TAG,
GAT,
AnTiGua, AnTiGoon?
|
| 15.30-16.45 |
Graphical Language Processing: combining shallow parsing and domain modeling
Vincent Van Asch & Walter Daelemans
Advances in robust automatic analysis of text and in graphical domain
modeling and reasoning have made possible new applications in which natural
language input to graphical design software is interpreted, augmented and
translated to graphical output. We present preliminary results in the GRAVITAL
project, which aims at developing a concrete end-to-end implementation of such
an approach. The application takes natural language descriptions of a design as
input and analyzes them using a memory-based shallow parser and domain-specific
ontologies and semantic rules. The linguistic representation is then translated
into instructions in the Python-based graphical programming language NodeBox
and executed. We also describe a recent redirection in the project in which
information extraction techniques are used to translate free-text-based
briefings of graphical designers into templates for further processing by a
graphical expert system.
Protein Relation Extraction using Full Parsing Information
Timur Fayruzov, Martine De Cock, Chris Cornelis and Véronique Hoste
Studying protein interactions is an essential task in biomedical research, hence a
lot of effort is devoted to constructing interaction knowledge bases.
However, as the biomedical domain is very dynamic, manual maintenance of
such knowledge bases is highly labour-intensive. Here we present a new
approach to (semi)automatically mine protein relations from scientific
texts, based on syntactic information. Our approach aims at supporting
humans in finding relevant information, rather than excluding them
entirely from the data processing flow. Traditional
classification-based algorithms for learning from texts such as Support
Vector Machines (SVM) or Hidden Markov Models (HMM) require large
annotated training data sets, which unfortunately are not available for
the problem at hand. Hence, we abstract from pure linguistic data and
concentrate on more general language structures such as parsing and
dependency information. First, for each sentence we extract a dependency
tree and consider all linked chains between every two proteins in the
dependency tree as potential reactions (we assume that the proteins have
already been recognized elsewhere). Secondly, we build a parse tree for
the sentence and compute the depth and number of nested subordinate
sentences for every protein in the tree. Using both information sources,
we build a feature vector expressing the potential interaction. This
abstract representation reduces the number of variants for expressing the
same fact, which in turn allows smaller datasets to be used to train
classification algorithms to detect positive and negative examples of
protein reactions. For our experiments we used the AImed corpus and the
LLL05 dataset. Using WEKA, we tested rule induction machine-learning
algorithms such as decision trees and RIPPER, as well as the statistical BayesNet
classifier. All algorithms were run on standalone datasets and on the
combination of datasets. For standalone datasets, we used 10-fold
cross-validation; for combined datasets, we used one dataset for training
and one for evaluation. The best results were obtained on the LLL05 dataset
with the C4.5 decision tree algorithm (recall 0.74 at precision 0.8),
which is competitive with some state-of-the-art kernel-based and genetic
algorithms.
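For comparison with systems that report a single figure, the reported precision and recall can be combined into an F-score. The sketch below uses the standard F-beta definition; the abstract itself does not report an F-score, so the value computed here is derived from the two published numbers and is not a result stated by the authors.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision 0.8 and recall 0.74, as reported for C4.5 on LLL05.
f1 = f_score(0.8, 0.74)
```

With these inputs the balanced F-score comes out at roughly 0.77.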
From expedition field books to a knowledge base
Piroska Lendvai & Steve Hunt
We describe the process of turning flat cultural heritage data of a museum collection into
a searchable knowledge base, using domain-independent machine learning
techniques and XML structure. First, digitised expedition field notes are
automatically segmented into a domain-specific database. In order to enter only perfect
data into the mBase knowledge base, an annotation interface combined with
selective sampling is created in which users can validate a labelled field
note entry. Next, the records
in mBase are marked up with semi-automatically derived secondary metadata.
Work in progress includes exploiting the metadata layers for advanced
querying, the results of which are additionally visualised using maps and
photos.
|
| 16.45-18.00 |
Data collection, constitution and exploitation in the Sawa Corpus Project
Guy De Pauw & Peter Waiganjo Wagacha (University of Nairobi)
This presentation describes on-going work in the Sawa Corpus Project
(2007-2008), which aims to construct an English-Kiswahili parallel corpus for
the purpose of contrastive linguistic analysis, projection of annotation and
machine translation. In this talk, we will focus on the inherent fuzziness of
the boundaries between the data collection phase (manual translation), the data
constitution phase (word alignment) and the data exploitation phase. The talk
is a recap of an IPRA 2007 panel discussion on "Data constitution in African
languages", where researchers for the first time discussed these issues for
corpus collection for African languages.
Memory-based machine translation
Antal van den Bosch
Memory-based machine translation (MBMT) is an approach to MT that is complementary to
example-based MT and statistical MT, while being an exponent of both types of MT. Its
key strength is its use of (a fast
approximation of) the k-NN classifier, a machine learning algorithm
insensitive to the number of classes, for the selection of translation candidate n-grams.
Rather than leaving this selection open and spending all effort on
selecting the most likely translation through the target language model,
as an SMT system does, MBMT uses context in the source language to reduce
the number of possible translations of n-grams. Stroppa, Van den Bosch, and
Way (2007) showed that a phrase-based SMT (PB-SMT) system can profit from integrating this
source-language-sensitive filtering of target language n-grams. We review
these results. We also present new results obtained with the pure MBMT
system introduced in Van den Bosch, Stroppa, and Way (2007) that performs
selection of translations either without resorting to a language model, or using
a new memory-based language model.
IGForest: From tree to forest
Herman Stehouwer
A well-known algorithm in TiMBL is IGTree, a fast trie-based
approximation of k-NN. Because of its trie-based nature, IGTree can
mismatch on a feature early on, resulting in sub-optimal classification
compared to IB1. In this talk an early version of
IGForest is presented that tries to lessen the impact of this problem. The
performance of IGForest is compared to both IGTree and IB1 on a diverse
set of NLP problems.
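The mismatch behaviour of a trie-based classifier can be shown with a simplified sketch. It is not TiMBL's IGTree: the feature order is taken as given (assumed to be already sorted by information gain), no pruning is applied, and the names `Node`, `build` and `classify` are hypothetical.

```python
from collections import Counter


class Node:
    """Trie node holding class counts for all examples passing through it."""

    def __init__(self):
        self.counts = Counter()
        self.children = {}

    @property
    def default(self):
        # Most frequent class among the examples that reached this node.
        return self.counts.most_common(1)[0][0]


def build(examples):
    """Build a trie from (features, label) pairs; features share one order."""
    root = Node()
    for features, label in examples:
        node = root
        node.counts[label] += 1
        for value in features:
            node = node.children.setdefault(value, Node())
            node.counts[label] += 1
    return root


def classify(root, features):
    """Follow matching feature values; stop at the first mismatch."""
    node = root
    for value in features:
        child = node.children.get(value)
        if child is None:
            break  # mismatch: all remaining features are ignored
        node = child
    return node.default


examples = [
    (("a", "x"), "A"), (("a", "y"), "A"),
    (("b", "x"), "B"), (("b", "y"), "B"),
    (("c", "x"), "A"),
]
trie = build(examples)
```

An unseen value on the first feature, as in `classify(trie, ("d", "y"))`, falls back to the root's default class without ever consulting the second feature, whereas a full k-NN search such as IB1 would still compare every feature.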
Semeval: Machine learning of semantic relations with shallow features and almost no data
Iris Hendrickx, Roser Morante, Caroline Sporleder & Antal van den Bosch
We summarize our approach to the Semeval 2007 shared task on "Classification
of Semantic Relations between Nominals". Our overall strategy is to develop
machine-learning classifiers making use of a few easily computable and
effective features, selected independently for each classifier in wrapper
experiments. We train two types of classifiers for each of the seven relations:
with and without any WordNet information.
Towards a robust semantic role labeling system
Roser Morante
This talk presents two memory-based
semantic role labeling (SRL) systems. Semantic role labeling is a
sentence-level natural-language processing (NLP) task in which semantic
roles are assigned to all arguments of a predicate. Memory-based language
processing is based on the idea that NLP problems can be solved by storing
annotated examples of the problem in their literal form in memory, and
applying similarity-based reasoning on these examples in order to solve
new ones. One of the SRL systems uses manually annotated syntactic
information and performs at state-of-the-art level. The other system does
not use any syntactic information and the current performance is lower,
but it frames the SRL task in a more realistic scenario by avoiding the
parsing step.
|