Computational Linguistics

Ling. 110/210

Susanna Cumming

April 8, 2002

Tagging

Assume:

Tags are usually represented in the corpus text file with some kind of "markup language": slash tags (word/TAG) or SGML.
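
To make the slash-tag format concrete, here is a minimal sketch that reads one line of slash-tagged text into (word, tag) pairs. The function name parse_slash_tagged and the sample sentence are made up for illustration; the tags follow Penn Treebank conventions, and real corpora differ in details.

```python
def parse_slash_tagged(line):
    """Split a slash-tagged line into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        if "/" not in item:
            # No tag attached to this item: keep the word, mark the tag as missing.
            pairs.append((item, None))
            continue
        # rpartition splits on the LAST slash, so a word that itself
        # contains a slash (e.g. "1/2/CD") keeps the slash in the word part.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

# Hypothetical example sentence.
example = "The/DT dog/NN barks/VBZ ./."
print(parse_slash_tagged(example))
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('.', '.')]
```

Splitting on the last slash rather than the first is a design choice for this sketch; it assumes that tags never contain a slash.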

Computational problem:

· Preprocessing: "tokenization" (finding word boundaries), normalizing words, morphological parsing (not covered in this chapter)

· Match every word in the text to a set of part-of-speech tags from the tagset by dictionary lookup

· If there's more than one possibility, choose a tag. (Or, retain the whole list for later disambiguation.)
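
The three steps above can be sketched roughly as follows. This is a toy illustration, not a real tagger: the lexicon, the tag names, the "UNK" tag for unknown words, and the rule "pick the first tag listed" are all assumptions made for the example.

```python
import re

# Toy lexicon (hypothetical): each word maps to its possible tags,
# with the tag assumed to be most frequent listed first.
LEXICON = {
    "the": ["DT"],
    "can": ["MD", "NN", "VB"],
    "dog": ["NN", "VB"],
    "bark": ["VB", "NN"],
    ".": ["."],
}

def tokenize(text):
    # Crude tokenization: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def candidate_tags(token):
    # Normalize by lowercasing, then look the word up in the dictionary.
    # Unknown words get the placeholder tag "UNK" (an assumption of this sketch).
    return LEXICON.get(token.lower(), ["UNK"])

def tag(text, keep_all=False):
    """Tag each token; either choose one tag or retain the whole candidate list."""
    result = []
    for tok in tokenize(text):
        tags = candidate_tags(tok)
        result.append((tok, list(tags) if keep_all else tags[0]))
    return result

print(tag("The dog can bark."))
print(tag("The dog can bark.", keep_all=True))
```

With keep_all=True the full candidate list is retained for later disambiguation; otherwise the first-listed tag is chosen, which stands in for whatever disambiguation method a real tagger would use.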

Only 11.5% of English word types in the Brown corpus are ambiguous (can take more than one tag), but over 40% of the tokens are. This is because the most frequent items are the most likely to be ambiguous.
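
The difference between the type rate and the token rate can be seen in a small sketch. The six-token corpus below is invented for illustration and has nothing to do with the actual Brown corpus figures; the function assumes a non-empty list of (word, tag) pairs.

```python
from collections import Counter, defaultdict

def ambiguity_rates(tagged_tokens):
    """Return (share of ambiguous types, share of ambiguous tokens)."""
    tags_seen = defaultdict(set)   # word type -> set of tags it occurs with
    counts = Counter()             # word type -> number of tokens
    for word, tag in tagged_tokens:
        w = word.lower()
        tags_seen[w].add(tag)
        counts[w] += 1
    # A type is ambiguous if it occurs with more than one tag.
    ambiguous = {w for w, tags in tags_seen.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_seen)
    token_rate = sum(counts[w] for w in ambiguous) / sum(counts.values())
    return type_rate, token_rate

# Invented toy corpus: "can" is frequent and ambiguous, the rest are not.
corpus = [("the", "DT"), ("can", "MD"), ("can", "NN"),
          ("can", "MD"), ("dog", "NN"), ("cat", "NN")]
print(ambiguity_rates(corpus))
# (0.25, 0.5)
```

Here one type out of four is ambiguous, but it accounts for three of the six tokens, so the token rate is twice the type rate.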

Taggers can achieve 96-97% "correct" tags (relative to hand-tagging).

Humans have a 96-97% agreement rate (up to 100% if they can discuss the tags).
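
Both figures are proportions of tokens that receive the same tag in two taggings of the same text (tagger vs. hand-tagging, or one human vs. another). A minimal sketch, assuming the two tag sequences are already aligned token by token; the tag sequences shown are invented:

```python
def agreement(tags_a, tags_b):
    """Proportion of positions where two aligned tag sequences match."""
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must be aligned token by token")
    if not tags_a:
        raise ValueError("tag sequences must not be empty")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)

# Invented example: the tagger disagrees with the hand-tagging on one token.
hand_tagged = ["DT", "NN", "MD", "VB"]
tagger_out  = ["DT", "NN", "NN", "VB"]
print(agreement(hand_tagged, tagger_out))
# 0.75
```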

Lexical categories ("parts of speech")

The following list is given by Jurafsky and Martin, and is based on categories commonly used in linguistics. Note that this is a rather different list from the actual tagsets given in the text. Tagsets distinguish morphological forms and include tags for punctuation and symbols; certain very common items have their own tags.

Theoretical problems:

Note that this approach makes several questionable assumptions: