LudwigModel

LudwigModel class

ludwig.api.LudwigModel(
  config,
  logging_level=40,
  backend=None,
  gpus=None,
  gpu_memory_limit=None,
  allow_parallel_threads=True,
  callbacks=None
)

Class that provides access to Ludwig's high-level functionality.

Inputs

  • config (Union[str, dict]): in-memory representation of config or string path to a YAML config file.
  • logging_level (int, default: 40): log level that will be sent to stderr.
  • backend (Union[Backend, str]): Backend or string name of backend to use to execute preprocessing / training steps.
  • gpus (Union[str, int, List[int]], default: None): GPUs to use (uses the same syntax as CUDA_VISIBLE_DEVICES).
  • gpu_memory_limit (float, default: None): maximum memory fraction [0, 1] allowed to allocate per GPU device.
  • allow_parallel_threads (bool, default: True): allow Torch to use multithreading parallelism to improve performance at the cost of determinism.
  • callbacks (list, default: None): a list of ludwig.callbacks.Callback objects that provide hooks into the Ludwig pipeline.
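
As a rough illustration of the gpus parameter, the sketch below renders each of the three accepted value types as a CUDA_VISIBLE_DEVICES-style string. The helper name to_cuda_visible_devices is hypothetical and is not part of Ludwig; how Ludwig handles the value internally is not shown here.

```python
from typing import List, Optional, Union


def to_cuda_visible_devices(gpus: Union[str, int, List[int], None]) -> Optional[str]:
    """Hypothetical helper (not a Ludwig function): format `gpus` like CUDA_VISIBLE_DEVICES."""
    if gpus is None:
        # No restriction requested.
        return None
    if isinstance(gpus, int):
        # A single device index.
        return str(gpus)
    if isinstance(gpus, str):
        # Already a comma-separated string; only normalize whitespace.
        return ",".join(part.strip() for part in gpus.split(","))
    # A list of device indices.
    return ",".join(str(g) for g in gpus)


print(to_cuda_visible_devices([0, 1]))  # 0,1
print(to_cuda_visible_devices(2))       # 2
```

All three forms end up as the same kind of comma-separated device list, which is why the parameter can accept any of them.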

Example usage:

from ludwig.api import LudwigModel

Train a model:

config = {...}
ludwig_model = LudwigModel(config)
train_stats, _, _ = ludwig_model.train(dataset=file_path)

or

train_stats, _, _ = ludwig_model.train(dataset=dataframe)

If you have already trained a model, you can load it and use it to predict:

ludwig_model = LudwigModel.load(model_dir)

Predict:

predictions, _ = ludwig_model.predict(dataset=file_path)

or

predictions, _ = ludwig_model.predict(dataset=dataframe)

Evaluation:

eval_stats, _, _ = ludwig_model.evaluate(dataset=file_path)

or

eval_stats, _, _ = ludwig_model.evaluate(dataset=dataframe)

PublicAPI: This API is stable across Ludwig releases.


LudwigModel methods

collect_activations

collect_activations(
  layer_names,
  dataset,
  data_format=None,
  split='full',
  batch_size=128
)

Loads a pre-trained model and input data to collect the values of the activations contained in the tensors.

Inputs

  • layer_names (list): list of strings for layer names in the model to collect activations.
  • dataset (Union[str, Dict[str, list], pandas.DataFrame]): source containing the data to make predictions.
  • data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
  • split (str, default: 'full'): if the input dataset contains a split column, this parameter indicates which split of the data to use. Possible values are 'full', 'training', 'validation', 'test'.
  • batch_size (int, default: 128): size of batch to use when making predictions.

Return

  • return (list): list of collected tensors.
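
The data_format parameter is "inferred automatically if not specified". One plausible way to picture that inference for file paths is a lookup on the file extension, sketched below. The mapping and the function name guess_data_format are assumptions made for illustration; Ludwig's real inference logic may cover more cases and behave differently.

```python
import os
from typing import Optional

# Assumed extension-to-format mapping, limited to formats whose usual
# extension matches the format name listed above.
EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".parquet": "parquet",
    ".feather": "feather",
    ".hdf5": "hdf5",
    ".html": "html",
}


def guess_data_format(path: str) -> Optional[str]:
    """Hypothetical helper: guess a data_format value from a file extension."""
    extension = os.path.splitext(path)[1].lower()
    # Returns None when the extension is not in the assumed mapping.
    return EXTENSION_TO_FORMAT.get(extension)


print(guess_data_format("data/train.csv"))      # csv
print(guess_data_format("data/train.parquet"))  # parquet
```

Passing data_format explicitly avoids any dependence on this kind of guess.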

collect_weights

collect_weights(
  tensor_names=None
)

Load a pre-trained model and collect the tensors with a specific name.

Inputs

  • tensor_names (list, default: None): List of tensor names to collect weights

Return

  • return (list): List of tensors

create_model

create_model(
  config_obj,
  random_seed=42
)

Instantiates BaseModel object.

Inputs

  • config_obj (Union[Config, dict]): Ludwig config object
  • random_seed (int, default: 42): random seed used for weights initialization, splits and any other random function.

Return

  • return (ludwig.models.BaseModel): Instance of the Ludwig model object.

evaluate

evaluate(
  dataset=None,
  data_format=None,
  split='full',
  batch_size=None,
  skip_save_unprocessed_output=True,
  skip_save_predictions=True,
  skip_save_eval_stats=True,
  collect_predictions=False,
  collect_overall_stats=False,
  output_directory='results',
  return_type=pandas.DataFrame
)

This function is used to predict the output variables given the input variables using the trained model and compute test statistics like performance measures, confusion matrices and the like.

Inputs

  • dataset (Union[str, dict, pandas.DataFrame]): source containing the entire dataset to be evaluated.
  • data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
  • split (str, default: 'full'): if the input dataset contains a split column, this parameter indicates which split of the data to use. Possible values are 'full', 'training', 'validation', 'test'.
  • batch_size (int, default: None): size of batch to use when making predictions. Defaults to model config eval_batch_size
  • skip_save_unprocessed_output (bool, default: True): if this parameter is False, predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped.
  • skip_save_predictions (bool, default: True): skips saving test predictions CSV files.
  • skip_save_eval_stats (bool, default: True): skips saving test statistics JSON file.
  • collect_predictions (bool, default: False): if True collects post-processed predictions during eval.
  • collect_overall_stats (bool, default: False): if True collects overall stats during eval.
  • output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
  • return_type (Union[str, dict, pd.DataFrame], default: pandas.DataFrame): indicates the format of the returned predictions.

Return

  • return (evaluation_statistics, predictions, output_directory): evaluation_statistics is a dictionary containing evaluation performance statistics, predictions contains the postprocessed predicted values, and output_directory is the location where results are stored.

experiment

experiment(
  dataset=None,
  training_set=None,
  validation_set=None,
  test_set=None,
  training_set_metadata=None,
  data_format=None,
  experiment_name='experiment',
  model_name='run',
  model_resume_path=None,
  eval_split='test',
  skip_save_training_description=False,
  skip_save_training_statistics=False,
  skip_save_model=False,
  skip_save_progress=False,
  skip_save_log=False,
  skip_save_processed_input=False,
  skip_save_unprocessed_output=False,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  skip_collect_predictions=False,
  skip_collect_overall_stats=False,
  output_directory='results',
  random_seed=42
)

Trains a model on a dataset's training and validation splits and uses it to predict on the test split. It saves the trained model and the statistics of training and testing.

Inputs

  • dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
  • training_set (Union[str, dict, pandas.DataFrame], default: None): source containing training data.
  • validation_set (Union[str, dict, pandas.DataFrame], default: None): source containing validation data.
  • test_set (Union[str, dict, pandas.DataFrame], default: None): source containing test data.
  • training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
  • data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
  • experiment_name (str, default: 'experiment'): name for the experiment.
  • model_name (str, default: 'run'): name of the model that is being used.
  • model_resume_path (str, default: None): resumes training of the model from the path specified. The config is restored. In addition to config, training statistics and loss for epoch and the state of the optimizer are restored such that training can be effectively continued from a previously interrupted training process.
  • eval_split (str, default: 'test'): split on which to perform evaluation. Valid values are 'training', 'validation' and 'test'.
  • skip_save_training_description (bool, default: False): disables saving the description JSON file.
  • skip_save_training_statistics (bool, default: False): disables saving training statistics JSON file.
  • skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch the validation metric improves, but if the model is really big that can be time consuming. If you do not want to keep the weights and just find out what performance a model can get with a set of hyperparameters, use this parameter to skip it, but the model will not be loadable later on and the returned model will have the weights obtained at the end of training, instead of the weights of the epoch with the best validation performance.
  • skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch to enable resuming of training, but if the model is really big that can be time consuming and will use twice as much space. Use this parameter to skip it, but training cannot be resumed later on.
  • skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for the TensorBoard, but if it is not needed turning it off can slightly increase the overall speed.
  • skip_save_processed_input (bool, default: False): if input dataset is provided it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
  • skip_save_unprocessed_output (bool, default: False): by default predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped.
  • skip_save_predictions (bool, default: False): skips saving test predictions CSV files
  • skip_save_eval_stats (bool, default: False): skips saving test statistics JSON file
  • skip_collect_predictions (bool, default: False): skips collecting post-processed predictions during eval.
  • skip_collect_overall_stats (bool, default: False): skips collecting overall stats during eval.
  • output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
  • random_seed (int, default: 42): random seed used for weights initialization, splits and any other random function.

Return

  • return (Tuple[dict, dict, tuple, str]): (evaluation_statistics, training_statistics, preprocessed_data, output_directory). evaluation_statistics is a dictionary with evaluation performance statistics on the test_set. training_statistics is a nested dictionary of dataset -> feature_name -> metric_name -> list of metrics, with one value per training checkpoint. preprocessed_data is a tuple containing the preprocessed (training_set, validation_set, test_set). output_directory is the filepath string to where results are stored.
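
The split-column convention described for the dataset parameter (0 for train, 1 for validation, 2 for test) can be pictured with the plain-Python sketch below. The column name "split", the row layout and the function partition_by_split are assumptions made for illustration; this is not Ludwig's preprocessing code.

```python
from typing import Dict, List

# Split codes as described in the dataset parameter.
SPLIT_NAMES = {0: "training", 1: "validation", 2: "test"}


def partition_by_split(rows: List[dict], split_column: str = "split") -> Dict[str, List[dict]]:
    """Hypothetical helper: group rows by the integer code in their split column."""
    parts: Dict[str, List[dict]] = {name: [] for name in SPLIT_NAMES.values()}
    for row in rows:
        parts[SPLIT_NAMES[int(row[split_column])]].append(row)
    return parts


# Made-up rows: "text" is an illustrative feature column.
rows = [
    {"text": "a", "split": 0},
    {"text": "b", "split": 2},
    {"text": "c", "split": 1},
    {"text": "d", "split": 0},
]
parts = partition_by_split(rows)
print({name: len(group) for name, group in parts.items()})
# {'training': 2, 'validation': 1, 'test': 1}
```

When no split column is present, the documentation above states that the dataset is split randomly instead.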

forecast

forecast(
  dataset,
  data_format=None,
  horizon=1,
  output_directory=None,
  output_format='parquet'
)

free_gpu_memory

free_gpu_memory(
)

Manually moves the model to CPU to force GPU memory to be freed.

For more context: https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/35


is_merge_and_unload_set

is_merge_and_unload_set(
)

Check whether the encapsulated model is of type LLM and is configured to merge_and_unload QLoRA weights.

Return

  • return (bool): whether merge_and_unload should be done.


load

load(
  model_dir,
  logging_level=40,
  backend=None,
  gpus=None,
  gpu_memory_limit=None,
  allow_parallel_threads=True,
  callbacks=None
)

This function allows for loading pretrained models.

Inputs

  • model_dir (str): path to the directory containing the model. If the model was trained by the train or experiment command, the model is in results_dir/experiment_dir/model.
  • logging_level (int, default: 40): log level that will be sent to stderr.
  • backend (Union[Backend, str]): Backend or string name of backend to use to execute preprocessing / training steps.
  • gpus (Union[str, int, List[int]], default: None): GPUs to use (uses the same syntax as CUDA_VISIBLE_DEVICES).
  • gpu_memory_limit (float, default: None): maximum memory fraction [0, 1] allowed to allocate per GPU device.
  • allow_parallel_threads (bool, default: True): allow Torch to use multithreading parallelism to improve performance at the cost of determinism.
  • callbacks (list, default: None): a list of ludwig.callbacks.Callback objects that provide hooks into the Ludwig pipeline.

Return

  • return (LudwigModel): a LudwigModel object

Example usage

ludwig_model = LudwigModel.load(model_dir)

load_weights

load_weights(
  model_dir
)

Loads weights from a pre-trained model.

Inputs

  • model_dir (str): filepath string to location of a pre-trained model

Return

  • return (None): None

Example usage

ludwig_model.load_weights(model_dir)

predict

predict(
  dataset=None,
  data_format=None,
  split='full',
  batch_size=128,
  generation_config=None,
  skip_save_unprocessed_output=True,
  skip_save_predictions=True,
  output_directory='results',
  return_type=pandas.DataFrame,
  callbacks=None
)

Using a trained model, make predictions from the provided dataset.

Inputs

  • dataset (Union[str, dict, pandas.DataFrame]): source containing the entire dataset to be evaluated.
  • data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
  • split (str, default: 'full'): if the input dataset contains a split column, this parameter indicates which split of the data to use. Possible values are 'full', 'training', 'validation', 'test'.
  • batch_size (int, default: 128): size of batch to use when making predictions.
  • generation_config (Dict, default: None): config for the generation of the predictions. If None, the config that was used during model training is used. This is only used if the model type is LLM. Otherwise, this parameter is ignored. See Large Language Models under "Generation" for an example generation config.
  • skip_save_unprocessed_output (bool, default: True): if this parameter is False, predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped.
  • skip_save_predictions (bool, default: True): skips saving test predictions CSV files.
  • output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
  • return_type (Union[str, dict, pandas.DataFrame], default: pd.DataFrame): indicates the format of the returned predictions.
  • callbacks (Optional[List[Callback]], default: None): optional list of callbacks to use during this predict operation. Any callbacks already registered to the model will be preserved.

Return

  • return (Tuple[Union[dict, pd.DataFrame], str]): (predictions, output_directory). predictions contains the predictions from the provided dataset, output_directory is the filepath string to where data was stored.


preprocess

preprocess(
  dataset=None,
  training_set=None,
  validation_set=None,
  test_set=None,
  training_set_metadata=None,
  data_format=None,
  skip_save_processed_input=True,
  random_seed=42
)

This function is used to preprocess data.

Inputs

  • dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
  • training_set (Union[str, dict, pandas.DataFrame], default: None): source containing training data.
  • validation_set (Union[str, dict, pandas.DataFrame], default: None): source containing validation data.
  • test_set (Union[str, dict, pandas.DataFrame], default: None): source containing test data.
  • training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
  • data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
  • skip_save_processed_input (bool, default: True): if input dataset is provided it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
  • random_seed (int, default: 42): a random seed that will be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling

Return

  • return (PreprocessedDataset): data structure containing (proc_training_set, proc_validation_set, proc_test_set, training_set_metadata).

Raises:

  • RuntimeError: An error occurred while preprocessing the data. Examples include training dataset being empty after preprocessing, lazy loading not being supported with RayBackend, etc.

save

save(
  save_path
)

This function saves the model to disk.

Inputs

  • save_path (str): path to the directory where the model is going to be saved. Both a JSON file containing the model architecture hyperparameters and checkpoint files containing model weights will be saved.

Return

  • return (None): None

Example usage

ludwig_model.save(save_path)

save_config

save_config(
  save_path
)

Save config to specified location.

Inputs

  • save_path (str): filepath string to save config as a JSON file.

Return

  • return (None): None

save_torchscript

save_torchscript(
  save_path,
  model_only=False,
  device=None
)

Saves the TorchScript model to disk.

Inputs

  • save_path (str): the path to the directory where the model will be saved.
  • model_only (bool, optional): if True, only the ECD model will be converted to TorchScript. Else, the preprocessing and postprocessing steps will also be converted to TorchScript.
  • device (TorchDevice, optional): if None, the model will be converted to TorchScript on the same device to ensure maximum model parity.

Return

  • return (None): None

set_logging_level

set_logging_level(
  logging_level
)

Sets level for log messages.

Inputs

  • logging_level (int): set or update the logging level. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR.

Return

  • return (None): None
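
The integer defaults that appear in the signatures on this page line up with the constants of Python's standard logging module. The short check below shows the correspondence; level_name is a hypothetical convenience wrapper, not a Ludwig function.

```python
import logging

# 40 is the default logging_level of LudwigModel(...) and LudwigModel.load(...).
assert logging.ERROR == 40
# 20 is the default logging_level of kfold_cross_validate(...).
assert logging.INFO == 20


def level_name(logging_level: int) -> str:
    """Hypothetical helper: translate a numeric level into its standard name."""
    return logging.getLevelName(logging_level)


print(level_name(40))  # ERROR
print(level_name(20))  # INFO
```

Passing the named constant (for example logging.INFO) rather than the bare number keeps calls to set_logging_level easier to read.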

to_torchscript

to_torchscript(
  model_only=False,
  device=None
)

Converts the trained model to TorchScript.

Inputs

  • model_only (bool, optional): if True, only the ECD model will be converted to TorchScript. Else, preprocessing and postprocessing steps will also be converted to TorchScript.
  • device (TorchDevice, optional): if None, the model will be converted to TorchScript on the same device to ensure maximum model parity.

Returns

  • return (torch.jit.ScriptModule): a torch.jit.ScriptModule that can be used to predict on a dictionary of inputs.

train

train(
  dataset=None,
  training_set=None,
  validation_set=None,
  test_set=None,
  training_set_metadata=None,
  data_format=None,
  experiment_name='api_experiment',
  model_name='run',
  model_resume_path=None,
  skip_save_training_description=False,
  skip_save_training_statistics=False,
  skip_save_model=False,
  skip_save_progress=False,
  skip_save_log=False,
  skip_save_processed_input=False,
  output_directory='results',
  random_seed=42
)

This function is used to perform a full training of the model on the specified dataset.

During training, if the skip parameters are False, the model and statistics are saved in a directory [output_dir]/[experiment_name]_[model_name]_n, where each bracketed name resolves to the user-specified value and n is an increasing number starting from 0 used to differentiate repeated runs.
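
The directory naming scheme described above can be sketched as follows. The function next_run_directory is hypothetical and only mirrors the description in this section; Ludwig's own logic for choosing the directory may differ in its details.

```python
import os
import tempfile


def next_run_directory(output_directory: str, experiment_name: str, model_name: str) -> str:
    """Hypothetical helper: first [experiment_name]_[model_name]_n path not yet on disk."""
    base = f"{experiment_name}_{model_name}"
    n = 0
    # Increase n until the candidate directory does not exist.
    while os.path.exists(os.path.join(output_directory, f"{base}_{n}")):
        n += 1
    return os.path.join(output_directory, f"{base}_{n}")


# Simulate two repeated runs inside a throwaway results directory.
with tempfile.TemporaryDirectory() as results:
    first = next_run_directory(results, "api_experiment", "run")
    os.makedirs(first)
    second = next_run_directory(results, "api_experiment", "run")
    run_names = [os.path.basename(first), os.path.basename(second)]

print(run_names)  # ['api_experiment_run_0', 'api_experiment_run_1']
```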

Inputs

  • dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
  • training_set (Union[str, dict, pandas.DataFrame], default: None): source containing training data.
  • validation_set (Union[str, dict, pandas.DataFrame], default: None): source containing validation data.
  • test_set (Union[str, dict, pandas.DataFrame], default: None): source containing test data.
  • training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
  • data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
  • experiment_name (str, default: 'api_experiment'): name for the experiment.
  • model_name (str, default: 'run'): name of the model that is being used.
  • model_resume_path (str, default: None): resumes training of the model from the path specified. The config is restored. In addition to config, training statistics, loss for each epoch and the state of the optimizer are restored such that training can be effectively continued from a previously interrupted training process.
  • skip_save_training_description (bool, default: False): disables saving the description JSON file.
  • skip_save_training_statistics (bool, default: False): disables saving training statistics JSON file.
  • skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch the validation metric improves, but if the model is really big that can be time consuming. If you do not want to keep the weights and just find out what performance a model can get with a set of hyperparameters, use this parameter to skip it, but the model will not be loadable later on and the returned model will have the weights obtained at the end of training, instead of the weights of the epoch with the best validation performance.
  • skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch to enable resuming of training, but if the model is really big that can be time consuming and will use twice as much space. Use this parameter to skip it, but training cannot be resumed later on.
  • skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for the TensorBoard, but if it is not needed turning it off can slightly increase the overall speed.
  • skip_save_processed_input (bool, default: False): if input dataset is provided it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
  • output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
  • random_seed (int, default: 42): a random seed that will be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling
  • kwargs (dict, default: {}): a dictionary of optional parameters.

Return

  • return (Tuple[Dict, Union[Dict, pd.DataFrame], str]): tuple containing (training_statistics, preprocessed_data, output_directory). training_statistics is a nested dictionary of dataset -> feature_name -> metric_name -> list of metrics, with one value per training checkpoint. preprocessed_data is the tuple containing these three data sets (training_set, validation_set, test_set). output_directory is the filepath to where training results are stored.
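
To make the nested shape of training_statistics concrete, here is a small sketch that walks a dictionary of the form dataset -> feature_name -> metric_name -> list of metrics and picks the best checkpoint. The keys ("training", "validation", "label", "loss"), the numbers and the function best_checkpoint are made up for illustration; real key names depend on the config and on the Ludwig version.

```python
from typing import Dict, List, Tuple

# dataset -> feature_name -> metric_name -> one value per checkpoint
TrainingStats = Dict[str, Dict[str, Dict[str, List[float]]]]


def best_checkpoint(
    stats: TrainingStats,
    dataset: str,
    feature: str,
    metric: str,
    minimize: bool = True,
) -> Tuple[int, float]:
    """Hypothetical helper: index and value of the best checkpoint for one metric."""
    values = stats[dataset][feature][metric]
    pick = min if minimize else max
    index = pick(range(len(values)), key=values.__getitem__)
    return index, values[index]


# Made-up statistics with three checkpoints.
example_stats: TrainingStats = {
    "training": {"label": {"loss": [0.9, 0.6, 0.4]}},
    "validation": {"label": {"loss": [0.95, 0.7, 0.75]}},
}

print(best_checkpoint(example_stats, "validation", "label", "loss"))  # (1, 0.7)
```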

train_online

train_online(
  dataset,
  training_set_metadata=None,
  data_format='auto',
  random_seed=42
)

Performs one epoch of training of the model on dataset.

Inputs

  • dataset (Union[str, dict, pandas.DataFrame]): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
  • training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
  • data_format (str, default: 'auto'): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
  • random_seed (int, default: 42): a random seed that is going to be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling

Return

  • return (None): None

upload_to_hf_hub

upload_to_hf_hub(
  repo_id,
  model_path,
  repo_type='model',
  private=False,
  commit_message='Upload trained [Ludwig](https://ludwig.ai/latest/) model weights',
  commit_description=None
)

Uploads trained model artifacts to the HuggingFace Hub.

Inputs

  • repo_id (str): a namespace (user or an organization) and a repo name separated by a /.
  • model_path (str): the path of the saved model. This is the top level directory where the model's weights as well as other associated training artifacts are saved.
  • private (bool, optional, default: False): whether the model repo should be private.
  • repo_type (str, optional, default: 'model'): set to "dataset" or "space" if uploading to a dataset or space, None or "model" if uploading to a model.
  • commit_message (str, optional): the summary / title / first line of the generated commit. Defaults to 'Upload trained [Ludwig](https://ludwig.ai/latest/) model weights'.
  • commit_description (str, optional): the description of the generated commit.

Returns

  • return (bool): True for success, False for failure.
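
The repo_id format (a namespace and a repo name separated by a /) can be checked before calling upload_to_hf_hub. The sketch below is one simple way to do that; split_repo_id is a hypothetical helper and "my-org/my-model" is a placeholder value, neither of which comes from Ludwig or the HuggingFace Hub client.

```python
from typing import Tuple


def split_repo_id(repo_id: str) -> Tuple[str, str]:
    """Hypothetical helper: split 'namespace/name' and reject anything else."""
    namespace, separator, name = repo_id.partition("/")
    if not separator or not namespace or not name or "/" in name:
        raise ValueError(f"expected 'namespace/repo_name', got {repo_id!r}")
    return namespace, name


print(split_repo_id("my-org/my-model"))  # ('my-org', 'my-model')
```

This only checks the overall shape; the Hub may impose further naming rules that are not covered here.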

Module functions


kfold_cross_validate

ludwig.api.kfold_cross_validate(
  num_folds,
  config,
  dataset=None,
  data_format=None,
  skip_save_training_description=False,
  skip_save_training_statistics=False,
  skip_save_model=False,
  skip_save_progress=False,
  skip_save_log=False,
  skip_save_processed_input=False,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  skip_collect_predictions=False,
  skip_collect_overall_stats=False,
  output_directory='results',
  random_seed=42,
  gpus=None,
  gpu_memory_limit=None,
  allow_parallel_threads=True,
  backend=None,
  logging_level=20
)

Performs k-fold cross validation and returns result data structures.

Inputs

  • num_folds (int): number of folds to create for the cross-validation
  • config (Union[dict, str]): model specification required to build a model. Parameter may be a dictionary or string specifying the file path to a yaml configuration file. Refer to the User Guide for details.
  • dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used for k_fold processing.
  • data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'. Currently hdf5 format is not supported for k_fold cross validation.
  • skip_save_training_description (bool, default: False): disables saving the description JSON file.
  • skip_save_training_statistics (bool, default: False): disables saving training statistics JSON file.
  • skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch the validation metric improves, but if the model is really big that can be time consuming. If you do not want to keep the weights and just find out what performance a model can get with a set of hyperparameters, use this parameter to skip it, but the model will not be loadable later on and the returned model will have the weights obtained at the end of training, instead of the weights of the epoch with the best validation performance.
  • skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch to enable resuming of training, but if the model is really big that can be time consuming and will use twice as much space. Use this parameter to skip it, but training cannot be resumed later on.
  • skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for the TensorBoard, but if it is not needed turning it off can slightly increase the overall speed.
  • skip_save_processed_input (bool, default: False): if input dataset is provided it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
  • skip_save_predictions (bool, default: False): skips saving test predictions CSV files.
  • skip_save_eval_stats (bool, default: False): skips saving test statistics JSON file.
  • skip_collect_predictions (bool, default: False): skips collecting post-processed predictions during eval.
  • skip_collect_overall_stats (bool, default: False): skips collecting overall stats during eval.
  • output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
  • random_seed (int, default: 42): Random seed used for weights initialization, splits and any other random function.
  • gpus (list, default: None): list of GPUs that are available for training.
  • gpu_memory_limit (float, default: None): maximum memory fraction [0, 1] allowed to allocate per GPU device.
  • allow_parallel_threads (bool, default: True): allow Torch to use multithreading parallelism to improve performance at the cost of determinism.
  • backend (Union[Backend, str]): Backend or string name of backend to use to execute preprocessing / training steps.
  • logging_level (int, default: INFO): log level to send to stderr.

Return

  • return (tuple(kfold_cv_statistics, kfold_split_indices), dict): a tuple of two dictionaries. kfold_cv_statistics contains the metrics from the cross-validation run; kfold_split_indices contains the indices used to split the training data into a training fold and a test fold.
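
To make the idea of per-fold split indices concrete, the sketch below partitions row indices into folds in plain Python. It is not Ludwig's implementation: `make_kfold_indices` and the `train_indices` / `test_indices` keys are hypothetical names chosen for illustration, and the actual layout of kfold_split_indices may differ.

```python
import random


def make_kfold_indices(num_rows, num_folds, seed=42):
    """Hypothetical helper: partition row indices into num_folds test folds."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    splits = {}
    for fold in range(num_folds):
        # Every num_folds-th shuffled index goes to this fold's test set.
        test = sorted(indices[fold::num_folds])
        test_set = set(test)
        # All remaining rows form this fold's training set.
        train = [i for i in range(num_rows) if i not in test_set]
        splits[f"fold_{fold + 1}"] = {"train_indices": train, "test_indices": test}
    return splits


splits = make_kfold_indices(num_rows=10, num_folds=5)
for name, fold in splits.items():
    print(name, "test rows:", fold["test_indices"])
```

Each row lands in exactly one test fold, so every example is used for evaluation once across the run.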

PublicAPI: This API is stable across Ludwig releases.


hyperopt

ludwig.hyperopt.run.hyperopt(
  config,
  dataset=None,
  training_set=None,
  validation_set=None,
  test_set=None,
  training_set_metadata=None,
  data_format=None,
  experiment_name='hyperopt',
  model_name='run',
  resume=None,
  skip_save_training_description=False,
  skip_save_training_statistics=False,
  skip_save_model=False,
  skip_save_progress=False,
  skip_save_log=False,
  skip_save_processed_input=True,
  skip_save_unprocessed_output=False,
  skip_save_predictions=False,
  skip_save_eval_stats=False,
  skip_save_hyperopt_statistics=False,
  output_directory='results',
  gpus=None,
  gpu_memory_limit=None,
  allow_parallel_threads=True,
  callbacks=None,
  tune_callbacks=None,
  backend=None,
  random_seed=42,
  hyperopt_log_verbosity=3
)

This method performs hyperparameter optimization.

Inputs

  • config (Union[str, dict]): config which defines the different parameters of the model, features, preprocessing and training. If str, filepath to yaml configuration file.
  • dataset (Union[str, dict, pandas.DataFrame], default: None): source containing the entire dataset to be used in the experiment. If it has a split column, it will be used for splitting (0 for train, 1 for validation, 2 for test), otherwise the dataset will be randomly split.
  • training_set (Union[str, dict, pandas.DataFrame], default: None): source containing training data.
  • validation_set (Union[str, dict, pandas.DataFrame], default: None): source containing validation data.
  • test_set (Union[str, dict, pandas.DataFrame], default: None): source containing test data.
  • training_set_metadata (Union[str, dict], default: None): metadata JSON file or loaded metadata. Intermediate preprocessed structure containing the mappings of the input dataset created the first time an input file is used in the same directory with the same name and a '.meta.json' extension.
  • data_format (str, default: None): format to interpret data sources. Will be inferred automatically if not specified. Valid formats are 'auto', 'csv', 'df', 'dict', 'excel', 'feather', 'fwf', 'hdf5' (cache file produced during previous training), 'html' (file containing a single HTML <table>), 'json', 'jsonl', 'parquet', 'pickle' (pickled Pandas DataFrame), 'sas', 'spss', 'stata', 'tsv'.
  • experiment_name (str, default: 'hyperopt'): name for the experiment.
  • model_name (str, default: 'run'): name of the model that is being used.
  • resume (bool, default: None): If True, continue hyperopt from the state of the previous run in the output directory with the same experiment name. If False, create new trials and ignore any previous state, even if it exists in the output_directory. By default (None), attempt to resume if an experiment with the same name already exists, and create new trials if not.
  • skip_save_training_description (bool, default: False): disables saving the description JSON file.
  • skip_save_training_statistics (bool, default: False): disables saving training statistics JSON file.
  • skip_save_model (bool, default: False): disables saving model weights and hyperparameters each time the model improves. By default Ludwig saves model weights after each epoch in which the validation metric improves, which can be time-consuming for very large models. Set this parameter to True if you only want to find out what performance a model can reach with a set of hyperparameters and do not need to keep the weights. In that case the model will not be loadable later, and the returned model will have the weights obtained at the end of training instead of the weights of the epoch with the best validation performance.
  • skip_save_progress (bool, default: False): disables saving progress each epoch. By default Ludwig saves weights and stats after each epoch so that training can be resumed, but for very large models this can be time-consuming and uses twice as much space. Set this parameter to True to skip it; training then cannot be resumed later.
  • skip_save_log (bool, default: False): disables saving TensorBoard logs. By default Ludwig saves logs for the TensorBoard, but if it is not needed turning it off can slightly increase the overall speed.
  • skip_save_processed_input (bool, default: True): if an input dataset is provided, it is preprocessed and cached by saving HDF5 and JSON files to avoid running the preprocessing again. If this parameter is True, the HDF5 and JSON files are not saved.
  • skip_save_unprocessed_output (bool, default: False): by default predictions and their probabilities are saved in both raw unprocessed numpy files containing tensors and as postprocessed CSV files (one for each output feature). If this parameter is True, only the CSV ones are saved and the numpy ones are skipped.
  • skip_save_predictions (bool, default: False): skips saving test predictions CSV files.
  • skip_save_eval_stats (bool, default: False): skips saving test statistics JSON file.
  • skip_save_hyperopt_statistics (bool, default: False): skips saving hyperopt stats file.
  • output_directory (str, default: 'results'): the directory that will contain the training statistics, TensorBoard logs, the saved model and the training progress files.
  • gpus (list, default: None): list of GPUs that are available for training.
  • gpu_memory_limit (float, default: None): maximum memory fraction [0, 1] allowed to allocate per GPU device.
  • allow_parallel_threads (bool, default: True): allow PyTorch to use multithreading parallelism to improve performance at the cost of determinism.
  • callbacks (list, default: None): a list of ludwig.callbacks.Callback objects that provide hooks into the Ludwig pipeline.
  • backend (Union[Backend, str]): Backend or string name of backend to use to execute preprocessing / training steps.
  • random_seed (int, default: 42): random seed used for weights initialization, splits and any other random function.
  • hyperopt_log_verbosity (int, default: 3): controls verbosity of Ray Tune log messages. Valid values: 0 = silent, 1 = only status updates, 2 = status and brief trial results, 3 = status and detailed trial results.
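
The split-column convention described for the dataset argument (0 for train, 1 for validation, 2 for test) can be illustrated with a small column-oriented dict. This is a sketch only: `partition_by_split` is a hypothetical helper, the feature names are made up, and the column name `split` is assumed here for illustration.

```python
# Split codes as described for the `dataset` argument.
SPLIT_NAMES = {0: "train", 1: "validation", 2: "test"}


def partition_by_split(columns, split_column="split"):
    """Hypothetical helper: group row indices by their split code."""
    groups = {name: [] for name in SPLIT_NAMES.values()}
    for row_index, code in enumerate(columns[split_column]):
        groups[SPLIT_NAMES[code]].append(row_index)
    return groups


# A tiny column-oriented dataset; feature names are made up for illustration.
dataset = {
    "text": ["a", "b", "c", "d", "e"],
    "label": [0, 1, 0, 1, 1],
    "split": [0, 0, 1, 2, 0],
}

groups = partition_by_split(dataset)
print(groups)
```

Rows 0, 1 and 4 go to training, row 2 to validation and row 3 to test. Without such a column, the dataset is split randomly, as described above.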

Return

  • return (List[dict]): List of results for each trial, ordered by descending performance on the target metric.
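
A minimal sketch of a call to this function, using only arguments from the signature above. The contents of the config dict are assumptions: the feature names, the keys inside the `hyperopt` section and the search range are illustrative and should be checked against the Ludwig configuration documentation. `build_hyperopt_config` and `run_hyperopt_sketch` are hypothetical helper names.

```python
def build_hyperopt_config():
    """Return an illustrative config dict; every name and range is assumed."""
    return {
        "input_features": [{"name": "text", "type": "text"}],
        "output_features": [{"name": "label", "type": "category"}],
        # Assumed layout of the hyperopt section; verify against the docs.
        "hyperopt": {
            "goal": "minimize",
            "metric": "loss",
            "output_feature": "label",
            "parameters": {
                "trainer.learning_rate": {
                    "space": "loguniform",
                    "lower": 0.0001,
                    "upper": 0.1,
                },
            },
        },
    }


def run_hyperopt_sketch(dataset_path):
    """Call hyperopt with arguments taken from the signature above."""
    # Imported lazily so the config helper works without Ludwig installed.
    from ludwig.hyperopt.run import hyperopt

    return hyperopt(
        build_hyperopt_config(),
        dataset=dataset_path,
        experiment_name="hyperopt",
        output_directory="results",
        skip_save_model=True,  # only compare scores; do not keep weights
        random_seed=42,
        hyperopt_log_verbosity=1,  # status updates only
    )
```

Setting skip_save_model=True fits the use case described above, where only the performance of each hyperparameter set matters and the weights are not needed afterwards.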