Information Retrieval (IR), for the purpose of this course, is the study of the indexing, processing, and querying of textual data. The growing importance of the Web means that IR has acquired added significance in recent years. The course will also look at how models of language similar to those used in IR can be applied to the problem of Machine Translation (MT), which is becoming increasingly important as more and more non-English text appears on the Web.
The aims of the course are to provide an introduction to the basic principles and techniques used in IR; to demonstrate how statistical models of language can be used to solve the document retrieval problem; to consider specific IR applications such as cross-language retrieval; and to show how statistical models of language can be used to develop Machine Translation systems.
Basics of information retrieval:
- Text representation and processing
- Retrieval models (Boolean, vector space, language model)
- Indexing
- Evaluation
Advanced IR topics:
- Relevance feedback - real feedback, pseudo-relevance feedback
- Document and concept clustering - hierarchical clustering, k-means
- Web retrieval - PageRank, difficulties of Web retrieval
- Cross-language retrieval - queries in one language, documents in another
- Distributional and semantic similarity - automatic thesaurus construction
Basics of statistical machine translation (MT):
- Language models for MT
- Estimation from parallel texts
- Decoding (finding the most probable translation)
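As a rough illustration of how the three MT topics above fit together, the widely used noisy-channel formulation can be written as below. This is a sketch under assumptions: the course may present a different model, and the symbols f (source sentence) and e (candidate translation) are notation chosen here, not taken from the course material.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(e)\, P(f \mid e)
```

In this reading, P(e) is the language model, P(f | e) is the translation model whose parameters are estimated from parallel texts, and the search for the maximising e is the decoding step. The second equality follows from Bayes' rule, since P(f) does not depend on e.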