Concept information

Preferred term

Tokenization  

Type

  • Model function

  • Named individual

Broader concept

Entry terms

  • Segmentation
  • Segmentation and Tokenization

Scope note

  • Tokenization is commonly seen as an independent process of linguistic analysis, in which the input stream of characters is segmented into an ordered sequence of word-like units, usually called tokens, which serve as input items for subsequent steps of linguistic processing. Tokens may correspond to words, numbers, punctuation marks or even proper names. The recognized tokens are usually classified according to their syntax (see the sketch below). Since the notion of tokenization means different things to different people, some tokenization tools also perform additional tasks such as sentence boundary detection, handling of end-of-line hyphenation, or splitting of conjoined clitics and contractions.
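The process described above can be pictured with a short, self-contained sketch. The regular expression, category names and sample sentence below are illustrative assumptions, not the implementation of any particular tokenization tool:

```python
import re

# Illustrative tokenizer sketch (not any specific tool's implementation):
# it segments an input character stream into an ordered sequence of tokens
# and classifies each token with a simple syntactic category.
TOKEN_PATTERN = re.compile(
    r"(?P<NUMBER>\d+(?:\.\d+)?)"   # integers and decimals
    r"|(?P<WORD>\w+(?:-\w+)*)"     # words, including hyphenated forms
    r"|(?P<PUNCT>[^\w\s])"         # single punctuation characters
)

def tokenize(text: str) -> list[tuple[str, str]]:
    """Return (token, category) pairs in reading order."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]

if __name__ == "__main__":
    sample = "Dr. Smith paid 3.50 euros; he didn't complain."
    for token, category in tokenize(sample):
        print(f"{category:6} {token}")
```

The sketch deliberately omits the additional tasks mentioned in the scope note: it performs no sentence boundary detection (so "Dr." is not distinguished from a sentence-final period), does not rejoin end-of-line hyphenation, and splits contractions such as "didn't" naively rather than handling clitics explicitly.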

Description

  • The task/process of recognizing and tagging tokens (words, punctuation marks, digits, etc.) in a text.

URI

http://w3id.org/meta-share/omtd-share/Tokenization
