TOKENIZE(<X>, **kwargs): This operation tokenizes long texts into windows of tokens, each of a given length and offset from the previous window by a given stride. This is an element-wise column operation. X should be a VARCHAR column. Each row is tokenized independently as its own long text.

Alternatively, you can specify the exact number of splits to produce for each long text. This operation returns an array-valued column, which can be exploded into one row per element with an unnesting operation such as UNNEST.

Args

window
int (literal)
default:"1"

The length of each tokenization window, in tokens.

stride
int (literal)
default:"1"

The stride of the tokenization, that is, how far the window advances between consecutive windows.

splits
int (literal)
default:"-1"

The exact number of splits the tokenizer produces for each text. This is an alternative to specifying window and stride.

Examples

-- This returns an array-valued column.
-- Each row contains the tokens for one input row.
SELECT TOKENIZE(call_transcript) FROM calls_table;
-- This explodes each token into an individual row.
-- Therefore, this returns a table with more rows than calls_table.
SELECT UNNEST(TOKENIZE(call_transcript)) FROM calls_table;
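-- This tokenizes each transcript into windows of length 8 with a stride of 4.
SELECT TOKENIZE(call_transcript, window=8, stride=4) FROM calls_table;
-- This tokenizes each transcript into exactly 8 splits.
SELECT TOKENIZE(call_transcript, splits=8) FROM calls_table;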
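
To build intuition for how window, stride, and splits interact, the following Python sketch mimics the two modes on a list of tokens. It is an illustration only and not the implementation behind TOKENIZE: the function names sliding_windows and fixed_splits are hypothetical, and splitting on whitespace, dropping a trailing partial window, and making the splits near-equal in size are all assumptions that the reference above does not specify.

```python
from typing import List


def sliding_windows(tokens: List[str], window: int = 1, stride: int = 1) -> List[List[str]]:
    """Return windows of `window` tokens, advancing `stride` tokens per step."""
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        # A text shorter than one window yields a single (short) window.
        return [tokens] if tokens else []
    # Assumption: a trailing partial window is dropped rather than padded.
    last_start = len(tokens) - window
    return [tokens[i:i + window] for i in range(0, last_start + 1, stride)]


def fixed_splits(tokens: List[str], splits: int) -> List[List[str]]:
    """Divide tokens into exactly `splits` contiguous chunks of near-equal size."""
    if splits < 1:
        raise ValueError("splits must be positive")
    base, extra = divmod(len(tokens), splits)
    chunks, start = [], 0
    for i in range(splits):
        # Assumption: the first `extra` chunks each take one additional token.
        size = base + (1 if i < extra else 0)
        chunks.append(tokens[start:start + size])
        start += size
    return chunks


# Assumption: tokens are whitespace-separated words.
tokens = "a b c d e f".split()

print(sliding_windows(tokens, window=4, stride=2))
# [['a', 'b', 'c', 'd'], ['c', 'd', 'e', 'f']]

print(fixed_splits(tokens, splits=4))
# [['a', 'b'], ['c', 'd'], ['e'], ['f']]
```

Under these assumptions, the defaults window=1 and stride=1 put every token in its own window, which matches the first example above, where unnesting yields one row per token. A stride smaller than the window makes consecutive windows overlap.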