Wednesday, July 4, 2007

The Long Road from Text to Meaning

I stumbled upon this very interesting lecture by Adam Kilgarriff on Google Video.

Key points from the talk:

  • Approaches to language study: Rationalist vs. Empiricist

  • Lemmatizers and Part-of-speech tagging

  • Word sense, use and meaning

  • Word sketching and thesaurus creation from corpora, along with important problems such as representation and ambiguity

  • Using Google as an NLP tool. Very interesting perspective!

Abstract: Computers have given us a new way of thinking about language. Given a large sample of language, or corpus, and computational tools to process it, we can approach language as physicists approach forces and chemists approach chemicals. This approach is noteworthy for missing out what, from a language-user's point of view, is important about a piece of language: its meaning.

I shall present this empiricist approach to the study of language and show how, as we develop accurate tools for lemmatisation, part-of-speech tagging and parsing, we move from the raw input -- a character stream -- to an analysis of that stream in increasingly rich terms: words, lemmas, grammatical structures, Fillmore-style frames. Each step on the journey builds on a large corpus accurately analysed at the previous levels. A distributional thesaurus provides generalisations about lexical behaviour which can then feed into an analysis at the ‘frames’ level. The talk will be illustrated with work done within the ‘Sketch Engine’ tool.

For much NLP and linguistic theory, meaning is a given. Thus formal semantics assumes meanings for words, in order to address questions of how they combine, and WSD (word sense disambiguation) typically takes a set of meanings (as found in a dictionary) as a starting point and sets itself the challenge of identifying which meaning applies. But, since the birth of philosophy, meaning has been problematic. In our approach meaning is an eventual output of the research programme, not an input.


Adam Kilgarriff is a research scientist working at the intersection of computational linguistics, corpus linguistics, and dictionary-making. Following a PhD on "Polysemy" from Sussex University, he has worked at Longman Dictionaries, Oxford University Press, and the University of Brighton, and is now Director of two companies, Lexicography MasterClass and Lexical Computing Ltd, which provide software, training and consultancy in these research areas.

Sketch Engine (SkE, also known as Word Sketch Engine) is a Corpus Query System incorporating word sketches, grammatical relations, and a distributional thesaurus. A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour. You can try it using a free trial account.
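To get a feel for what a distributional thesaurus does, here is a minimal sketch in Python. It is not how Sketch Engine works internally: as an assumption for illustration, it uses a plain word window and raw counts instead of grammatical relations and proper association scores, and the toy corpus and all function names are made up. The idea it shows is only that words which appear in similar contexts end up with similar context vectors.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus (hypothetical); a real thesaurus is built from a very large corpus.
CORPUS = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the cat",
    "the car needs fuel",
]


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def thesaurus_entry(word, vectors, top_n=3):
    """Return the `top_n` words whose contexts are most similar to those of `word`."""
    target = vectors.get(word, Counter())
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    vectors = context_vectors(CORPUS)
    for neighbour, score in thesaurus_entry("cat", vectors):
        print(f"{neighbour}\t{score:.3f}")
```

On this tiny corpus "dog" comes out as the closest neighbour of "cat", simply because both words occur next to "drinks" and "chases". With raw counts a frequent word like "the" dominates every vector, which is one reason real systems weight contexts rather than just counting them.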

Google Sets creates a list of similar items from a few examples. For instance, the set {apple, banana, strawberry} will expand into a larger set of different fruits.
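I don't know how Google Sets actually does this, so the following is only a guess at the general flavour of set expansion, written as a small Python sketch. It assumes we already have a collection of lists harvested from somewhere (the `LISTS` data and the `expand_set` function are hypothetical), and it ranks candidates by how strongly the lists they appear in overlap with the seed items.

```python
from collections import Counter

# Hypothetical lists, standing in for lists harvested from web pages.
LISTS = [
    ["apple", "banana", "strawberry", "cherry"],
    ["apple", "banana", "mango"],
    ["apple", "microsoft", "google"],
    ["strawberry", "cherry", "raspberry"],
    ["hammer", "saw", "drill"],
]


def expand_set(seeds, lists, top_n=5):
    """Rank non-seed items by the total seed overlap of the lists they occur in."""
    seeds = set(seeds)
    scores = Counter()
    for items in lists:
        members = set(items)
        overlap = len(seeds & members)
        if overlap == 0:
            continue  # this list shares nothing with the seeds
        for item in members - seeds:
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(expand_set(["apple", "banana", "strawberry"], LISTS))
```

With these made-up lists, "cherry" and "mango" rank at the top, while "microsoft" and "google" trail behind because they share a list with only one seed ("apple"), a small example of the ambiguity problem mentioned in the talk.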
