TOKENIZE

TOKENIZE(<X>, **kwargs): Tokenizes long texts into token windows of a given length (window), advancing by a given stride. This is an element-wise column operation: X should be a VARCHAR column, and each row is tokenized as an independent long text. Alternatively, splits can be used to specify the exact number of splits to tokenize the long texts into. The operation returns an array-valued column, which can be exploded into individual rows with an unnesting operation.

Args

window

int (literal)

default:"1"

The length of the tokenization window.

stride

int (literal)

default:"1"

The stride of the tokenization.

splits

int (literal)

default:"-1"

An alternative to window and stride: the exact number of splits the tokenizer produces.

Examples

-- This returns an array-valued column.
-- Each row contains the tokens for one input row.
SELECT TOKENIZE(call_transcript) FROM calls_table;

-- This explodes each token into an individual row.
-- Therefore, this returns a table with more rows than calls_table.
SELECT UNNEST(TOKENIZE(call_transcript)) FROM calls_table;

SELECT TOKENIZE(call_transcript, window=8, stride=4) FROM calls_table;

SELECT TOKENIZE(call_transcript, splits=8) FROM calls_table;
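
The interaction between window and stride can be pictured with a small Python sketch. This is not the engine's implementation: the helper name `sliding_windows`, the whitespace tokenization, and the choice to drop a trailing partial window are all assumptions made purely for illustration.

```python
def sliding_windows(text: str, window: int = 1, stride: int = 1) -> list[list[str]]:
    """Illustrative only: group whitespace tokens into fixed-length windows.

    Assumptions (not confirmed by the TOKENIZE documentation):
    - tokens are whitespace-separated words;
    - a trailing window shorter than `window` is dropped.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive integers")

    tokens = text.split()
    if not tokens:
        return []

    # Last index at which a full window still fits (0 if the text is shorter).
    last_start = max(len(tokens) - window, 0)
    return [tokens[start:start + window] for start in range(0, last_start + 1, stride)]


# Hypothetical input standing in for one row of call_transcript.
example = sliding_windows("a b c d e f g h i j", window=4, stride=3)
print(example)
```

With a stride smaller than the window, consecutive windows overlap (here by one token); with the defaults of 1 and 1, every window holds a single token.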
