Supported Formats

File Formats

Ludwig is able to read UTF-8 encoded data from 14 file formats. Supported formats are:

  • Comma Separated Values (csv)
  • Excel Workbooks (excel)
  • Feather (feather)
  • Fixed Width Format (fwf)
  • Hierarchical Data Format 5 (hdf5)
  • Hypertext Markup Language (html) Note: limited to a single table in the file.
  • JavaScript Object Notation (json and jsonl)
  • Parquet (parquet)
  • Pickled Pandas DataFrame (pickle)
  • SAS data sets in XPORT or SAS7BDAT format (sas)
  • SPSS file (spss)
  • Stata file (stata)
  • Tab Separated Values (tsv)

Ludwig uses Pandas and Dask under the hood to read the UTF-8 encoded dataset files, which is what allows it to support CSV, Excel, Feather, fwf, HDF5, HTML (containing a <table>), JSON, JSONL, Parquet, pickle (pickled Pandas DataFrame), SAS, SPSS, Stata and TSV. Ludwig tries to identify the format automatically from the file extension.
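To illustrate what extension-based detection can look like, here is a minimal sketch that maps a file extension to a Pandas reader. It is not Ludwig's actual code: the `READERS` table, the `guess_reader` helper, and the specific extensions listed are assumptions made for this example, and the extensions Ludwig recognizes may differ.

```python
from pathlib import Path

import pandas as pd

# Hypothetical extension-to-reader table. The extensions are assumptions for
# illustration only; the pandas reader functions themselves are real.
READERS = {
    ".csv": "read_csv",
    ".tsv": "read_csv",       # would also need sep="\t" when called
    ".xlsx": "read_excel",
    ".feather": "read_feather",
    ".fwf": "read_fwf",
    ".h5": "read_hdf",
    ".html": "read_html",
    ".json": "read_json",
    ".jsonl": "read_json",    # would also need lines=True when called
    ".parquet": "read_parquet",
    ".pkl": "read_pickle",
    ".sas7bdat": "read_sas",
    ".xpt": "read_sas",
    ".sav": "read_spss",
    ".dta": "read_stata",
}


def guess_reader(path):
    """Return the pandas reader function matching the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Cannot infer a format from extension {suffix!r}")
    return getattr(pd, READERS[suffix])
```

For instance, `guess_reader("train.parquet")` returns `pd.read_parquet`, while a file with an unknown extension raises a `ValueError` instead of guessing.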

When a delimiter-separated file (CSV or TSV) is provided, Ludwig tries to identify the separator (generally ,) from the data. The default escape character is \. For example, if , is the column separator and a value in one of your data columns contains a ,, Pandas would fail to load the data properly. To handle such cases, the values in the columns are expected to be escaped with backslashes (replace , in the data with \,).
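The escaping convention can be tried out directly in Pandas. The sketch below writes a small DataFrame with an embedded comma using a backslash escape and reads it back. It shows Pandas behaviour only and assumes Ludwig passes an equivalent escape character to its reader; the column names and values are made up for the example.

```python
import csv
import io

import pandas as pd

df = pd.DataFrame({"name": ["alice"], "comment": ["hello, world"]})

# Write without quoting, so the embedded comma is escaped as "\," instead.
text = df.to_csv(index=False, quoting=csv.QUOTE_NONE, escapechar="\\")

# Read it back, telling Pandas that "\" is the escape character.
restored = pd.read_csv(io.StringIO(text), escapechar="\\")
```

After the round trip, `restored` has the same two columns as `df`, and the `comment` value keeps its comma rather than being split into an extra column.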

Hugging Face Datasets

Ludwig also supports importing Hugging Face datasets directly, using the following syntax. Not every Hugging Face dataset has a dataset_subset, so omit it (along with the -- separator) when it does not apply.

"hf://{dataset_name}--{dataset_subset}"

For example:

train_stats, _, _ = ludwig_model.train(dataset="hf://mbpp")
train_stats, _, _ = ludwig_model.train(dataset="hf://Open-Orca/OpenOrca")
train_stats, _, _ = ludwig_model.train(dataset="hf://gsm8k--main")

Please note that "subset" is not the same as "split". Make sure that you include the subset name, not the split name, when specifying the dataset.
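If the dataset string is assembled programmatically, a small helper keeps the optional subset handling in one place. This is a sketch only: `hf_dataset_uri` is a hypothetical function written for this example and is not part of Ludwig.

```python
def hf_dataset_uri(dataset_name, dataset_subset=None):
    """Build an "hf://" dataset string, appending the subset only if given.

    Hypothetical helper, not a Ludwig API. Pass the subset name here,
    never the split name.
    """
    if dataset_subset:
        return f"hf://{dataset_name}--{dataset_subset}"
    return f"hf://{dataset_name}"
```

With this helper, `hf_dataset_uri("gsm8k", "main")` yields the same string as the third example above, and `hf_dataset_uri("mbpp")` yields the first.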