CREATE MODEL
CREATE MODEL
: This statement allows you to invoke Drifto’s autoML
capabilities. CREATE MODEL
creates a new model that is stored in your model
registry with the given name. By default, Drifto uses an internal
MLflow instance, but can be configured to use an external
instance as well. Like regular create statements, one should use
CREATE OR REPLACE MODEL
to overwrite an existing model with the same name.
You may also use this statement to copy a pre-existing ML model (see examples below).
The SELECT
subquery in the create statement should return the full training
dataset (held-out sets are automatically subsampled but can be manually specified
as well).
Each row in the dataset corresponds to one data point for ML training and
each column is a feature. The WITH
clause is optional but for ML training
and each column is a feature. The WITH
clause is optional but the target
column to be called target
.
The ML training routine will train and validate a variety of ML architectures. You can control this automated model search with argument flags.
Syntax
Keyword Args
The name of the target column.
The training algorithm.
Options = {'auto', 'gbm', 'linear', 'dnn', 'drf'}
.
- The
'auto'
option selects an autoML routine that will try multiple different algorithms and hyperparameters and select the best one. - The
'gbm'
option selects a gradient-boosting machine (GBM) algorithm. - The
'linear'
option selects a generalized linear model (GLM). - The
'dnn'
option selects a fully-connected feed-forward deep neural net. - The
'drf'
option selects a distributed random forest (DRF) architecture that includes extreme random forests (XRF).
Allow autoML algorithm to generate model ensembles.
The number of cross validation folds used in autoML. When set to 0, autoML does not use cross validation and instead will subsample a 10% validation and a 10% test set for validation and testing.
Maximum number of models to search for in an ML training routine.
Random seed to use.
The figure-of-merit used to select the best model. The default is metric is
accuracy for classification tasks and mae for regression tasks.
Options (Classification) = {'auc', 'accu', 'logloss'}
.
Options (Regression) = {'mse', 'mae', 'r2'}
.