Tokenizer

The Tokenizer splits a text into logical units (tokens), i.e. words and punctuation marks that a computer system can treat as recognizable units. Each token is marked by an opening and a closing tag. Word boundaries are determined with the algorithm specified in the guidelines of the Unicode Consortium. The elements used to mark words and characters, as well as pre-defined tokens such as abbreviations, proper nouns or regular expressions (e.g. for date formats), can be defined in the tool configuration. Tokenized texts can be processed further, for instance with the Lemmatizer.
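
The idea of marking tokens with tags can be illustrated with a minimal Python sketch. It is not the Tokenizer's implementation: it uses a simplified regular expression instead of the Unicode word-boundary algorithm, it ignores pre-defined tokens such as abbreviations, and the tag names `w` (word) and `c` (character) are assumptions standing in for whatever the tool configuration defines.

```python
import re
from xml.sax.saxutils import escape

# Simplified pattern: a run of word characters, or a single character that is
# neither a word character nor whitespace (treated here as punctuation).
# This approximates, but does not implement, Unicode word segmentation.
TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<punct>[^\w\s])")


def tokenize(text, word_tag="w", char_tag="c"):
    """Wrap words and punctuation marks in opening and closing tags.

    The default tag names are hypothetical placeholders for the elements
    that the real tool reads from its configuration.
    """
    parts = []
    pos = 0
    for match in TOKEN_RE.finditer(text):
        # Copy the whitespace between tokens unchanged.
        parts.append(escape(text[pos:match.start()]))
        tag = word_tag if match.lastgroup == "word" else char_tag
        parts.append(f"<{tag}>{escape(match.group())}</{tag}>")
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


print(tokenize("Hello, world!"))
# <w>Hello</w><c>,</c> <w>world</w><c>!</c>
```

Whitespace is left outside the tags so that the original text can be recovered by stripping the markup.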

The Tokenizer web service (SOAP) accepts two parameters:

  • indata (xs:string): the XML-encoded text data to be tokenized
  • config (xs:string): the configuration in XML syntax
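
To show how the two string parameters might be carried in a SOAP request, here is a hedged Python sketch that only builds the message and sends nothing. The service namespace `urn:example:tokenizer`, the operation name `tokenize` and the sample configuration are hypothetical; the real values would come from the service's WSDL, and within TextGridLab the call is made by the workflow tools rather than by hand.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace (standard).
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace; the real one is defined in the WSDL.
SERVICE_NS = "urn:example:tokenizer"


def build_request(indata, config, operation="tokenize"):
    """Return a SOAP envelope carrying `indata` and `config` as strings.

    The operation name is an assumption made for illustration.
    """
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    # Both parameters are xs:string, so the XML they contain is sent as
    # escaped text; ElementTree performs the escaping on serialization.
    ET.SubElement(call, "indata").text = indata
    ET.SubElement(call, "config").text = config
    return ET.tostring(envelope, encoding="unicode")


request = build_request("<text>Hello, world!</text>", "<config/>")
print(request)
```

Because both parameters are typed as strings, the XML text and the XML configuration travel as escaped character data rather than as nested elements.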

The Tokenizer can only be used via the Workflow tools.

Further information is available here:

R2.3: User's Manual TextGrid-Tools (pages 75-77)