CREATE MODEL: This statement allows you to invoke Drifto’s autoML capabilities. CREATE MODEL creates a new model that is stored in your model registry with the given name. By default, Drifto uses an internal MLflow instance, but can be configured to use an external instance as well. Like regular create statements, one should use CREATE OR REPLACE MODEL to overwrite an existing model with the same name. You may also use this statement to copy a pre-existing ML model (see examples below).

The SELECT subquery in the create statement should return the full training dataset (held-out sets are automatically subsampled but can be manually specified as well).

Each row in the dataset corresponds to one data point for ML training and each column is a feature. The WITH clause is optional but for ML training and each column is a feature. The WITH clause is optional but the target column to be called target.

The ML training routine will train and validate a variety of ML architectures. You can control this automated model search with argument flags.

Syntax

CREATE MODEL <MODEL> WITH (**kwargs) AS <SELECT>;

Keyword Args

target
column or string literal
default:"target"

The name of the target column.

algo
string literal
default:"auto"

The training algorithm. Options = {'auto', 'gbm', 'linear', 'dnn', 'drf'}.

  • The 'auto' option selects an autoML routine that will try multiple different algorithms and hyperparameters and select the best one.
  • The 'gbm' option selects a gradient-boosting machine (GBM) algorithm.
  • The 'linear' option selects a generalized linear model (GLM).
  • The 'dnn' option selects a fully-connected feed-forward deep neural net.
  • The 'drf' option selects a distributed random forest (DRF) architecture that includes extreme random forests (XRF).
ensemble
boolean (literal)
default:"false"

Allow autoML algorithm to generate model ensembles.

num_folds
int (literal)
default:"0"

The number of cross validation folds used in autoML. When set to 0, autoML does not use cross validation and instead will subsample a 10% validation and a 10% test set for validation and testing.

max_models
int (literal)
default:"3"

Maximum number of models to search for in an ML training routine.

seed
int (literal)
default:"1"

Random seed to use.

sort_metric
str (literal)
default:"accuracy or mae"

The figure-of-merit used to select the best model. The default is metric is accuracy for classification tasks and mae for regression tasks. Options (Classification) = {'auc', 'accu', 'logloss'}. Options (Regression) = {'mse', 'mae', 'r2'}.

Examples

CREATE MODEL my_model 
WITH (target=y, max_models=10, sort_metric='logloss') 
AS SELECT * FROM mydata;
-- This will overwrite an existing model.
CREATE MODEL OR REPLACE my_other_model 
WITH (max_models=1, algo='linear') 
AS SELECT * FROM mydata;
-- This will overwrite an existing model.
CREATE MODEL OR REPLACE my_other_model 
WITH (max_models=1, algo='linear') 
AS SELECT * FROM mydata;
-- This will create a copy of my_model.
-- Notice that here we've directly selected a model name.
CREATE MODEL my_model_copy 
AS SELECT my_model;