Christof Monz

I'm currently involved in the following research projects:

MataHari: Machine Translation with Harvested Internet Resources
Duration: 2008-2011
Role: Principal Investigator
Summary: The main objective of the proposed research is to take the first step toward a machine translation framework that achieves truly global translation capabilities by covering a large number of languages. To this end, this project will investigate a number of languages that existing research has covered only to a small extent or not at all.

The methods investigated in this project fall under the paradigm of statistical machine translation, which takes a parallel corpus, i.e., a set of documents paired with translations produced by professional translators, and automatically learns translation rules from it.
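To illustrate what learning translation rules from a parallel corpus can look like, here is a minimal sketch of word-level expectation-maximization training in the style of IBM Model 1. This is a standard textbook method chosen for illustration, not necessarily the approach used in this project; the function name `train_ibm1` and the toy German-English sentence pairs are assumptions, and the NULL source word is omitted for brevity.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate word translation probabilities t(target | source) with EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict-like mapping (target_word, source_word) -> probability.
    """
    target_vocab = {t for _, tgt in pairs for t in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start from a uniform distribution over target words.
    t_prob = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of each source word

        # E-step: distribute each target word over the source words
        # of the same sentence pair, in proportion to the current model.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(t_prob[(t, s)] for s in src)
                for s in src:
                    frac = t_prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac

        # M-step: re-estimate the probabilities from the expected counts.
        for (t, s), c in count.items():
            t_prob[(t, s)] = c / total[s]

    return t_prob


# Hypothetical toy corpus (not from the project).
toy_pairs = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
model = train_ibm1(toy_pairs)
```

On this toy corpus the model should come to prefer "book" over "the" as a translation of "buch", even though no word alignments were given. With very little parallel text such estimates become unreliable, which is one reason limited-resource languages are challenging.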

As the proposed project focuses on languages that have received little coverage so far, it has to address novel challenges and goes beyond existing academic and commercial research in a number of ways. There is hardly any readily available bilingual training data for the languages considered here, unlike for Arabic or Chinese, where sizable parallel corpora are distributed by the Linguistic Data Consortium (LDC). This means that we have to acquire the necessary training data ourselves.

To this end we will utilize internet resources to learn translation models. By exploiting online resources for machine translation this project will address a number of vital research issues:

  1. How can multi-lingual resources be automatically identified and harvested?
  2. How can translation rules be learned from smaller and only partially translated resources?
  3. How do existing search strategies for finding the most likely translation have to be adapted to cope with limited resources?
  4. How can one rapidly build evaluation benchmarks for languages with limited resources?

The MataHari project has a fully funded open PhD position. Go here for more information.

Information Retrieval for Data Selection in Machine Translation
Duration: 2006-2008
Funder: Nuffield Organization
Role: Principal Investigator
Summary: Statistical MT approaches require very large amounts of training data to perform well. The challenge is to select those subsets of the training data that are most likely to be relevant for a given document or sentence that needs to be translated. This project investigates how information retrieval techniques can be used to build more context-sensitive methods for identifying the training data used to build language models for machine translation.
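
As a rough illustration of retrieval-based data selection, the sketch below ranks training sentences by TF-IDF cosine similarity to the sentence that needs to be translated and keeps the top matches. TF-IDF with cosine similarity is an assumed stand-in for whichever retrieval model the project uses; the function names, the smoothing in the IDF formula, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would do more.
    return text.lower().split()


def build_idf(corpus_tokens):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    return {term: math.log((n + 1) / (c + 1)) + 1.0 for term, c in df.items()}


def tfidf(tokens, idf):
    # Terms unseen in the training corpus are ignored.
    tf = Counter(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def cosine(a, b):
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_training_data(query, corpus, k=2):
    """Return up to k training sentences most similar to the query."""
    corpus_tokens = [tokenize(sentence) for sentence in corpus]
    idf = build_idf(corpus_tokens)
    query_vec = tfidf(tokenize(query), idf)
    scored = [
        (cosine(query_vec, tfidf(tokens, idf)), i)
        for i, tokens in enumerate(corpus_tokens)
    ]
    # Highest score first; ties are broken by corpus order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [corpus[i] for score, i in scored[:k] if score > 0]


# Hypothetical toy training data.
training = [
    "the bank raised interest rates",
    "the river bank was flooded",
    "parliament voted on the budget",
    "interest rates fell sharply",
]
selected = select_training_data("interest rates at the central bank", training)
```

The selected subset could then be used to train a language model adapted to the input, rather than one trained on the full, mostly unrelated data.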