Datamodule
muppet.benchmark.datasets.datamodule
Data module for handling different data types in the MUPPET benchmark framework.
This module provides generic dataset classes for loading and preprocessing datasets across multiple modalities (image, tabular, time series) using Hydra-based configuration.
Classes:
- DataModule – Main data module for handling multi-modal datasets
- ImageDataset – PyTorch dataset for image data
- TabularDataset – PyTorch dataset for tabular data
- TimeseriesDataset – PyTorch dataset for time series data
Classes
DataModule
DataModule(
name,
type,
loader,
transform=None,
test_size=None,
train_size=None,
dataloader_kwargs={},
labels_mapping=None,
)
A generic data module for handling different data types (image, tabular, timeseries), using Hydra-based configuration for dynamic data loading and preprocessing.
The loader argument carries a function path (under 'func') that Hydra resolves,
and that function loads the data as either a directory or a DataFrame/Series, depending on the data type.
The module also splits the data into train and test sets automatically and provides convenient
access to benchmark data for evaluation purposes.
Supported types:
- image: Expects the loader to return a directory path containing image files.
- tabular: Expects the loader to return a tuple (features: pd.DataFrame, target: pd.Series).
- timeseries: Placeholder for future implementation.
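To make the 'func' mechanism concrete, here is a minimal sketch of how a dotted function path in a loader dictionary could be resolved and called. It uses importlib as a stand-in for Hydra's resolution, and `call_loader` is a hypothetical helper, not part of MUPPET. A standard-library function plays the role of the dataset loader.

```python
import importlib
from typing import Any


def call_loader(loader: dict[str, Any]) -> Any:
    """Resolve the dotted path under 'func' and call it with the remaining keys.

    Hypothetical helper: the real module delegates this step to Hydra.
    """
    kwargs = dict(loader)  # copy so the caller's config is not mutated
    module_path, _, attr = kwargs.pop("func").rpartition(".")
    func = getattr(importlib.import_module(module_path), attr)
    return func(**kwargs)


# Stand-in loader config: a standard-library function instead of a dataset loader.
example_loader = {"func": "statistics.mean", "data": [1, 2, 3]}
result = call_loader(example_loader)
```

In a real configuration, 'func' would point at a function returning a directory path (image) or a (features, target) tuple (tabular), as listed above.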
After calling prepare_data(), the train_loader and test_loader
become available. The benchmark_data property provides access to all the input
data from the test_loader, concatenated into a single tensor on the target device.
This is useful, for example, for benchmarking different explainers
on the entire test set at once.
Note
Accessing benchmark_data before calling prepare_data() will raise a RuntimeError.
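The prepare-then-access behaviour described above can be sketched as follows. This is an illustration of the pattern only, not the module's implementation: `PreparedData` is a hypothetical class, plain lists stand in for tensors and DataLoaders, and the batching scheme is an assumption.

```python
class PreparedData:
    """Hypothetical stand-in showing the prepare-then-access pattern."""

    def __init__(self, test_samples):
        self._test_samples = list(test_samples)
        self.test_loader = None  # populated by prepare_data()

    def prepare_data(self, batch_size=2):
        # Chunk the samples into batches, roughly as a DataLoader would.
        self.test_loader = [
            self._test_samples[i:i + batch_size]
            for i in range(0, len(self._test_samples), batch_size)
        ]

    @property
    def benchmark_data(self):
        if self.test_loader is None:
            raise RuntimeError("Call prepare_data() before accessing benchmark_data.")
        # Concatenate every batch into one sequence (a single tensor in the real module).
        return [x for batch in self.test_loader for x in batch]


dm = PreparedData([1, 2, 3, 4, 5])
dm.prepare_data()
all_inputs = dm.benchmark_data
```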
Initialize the DataModule instance.
Parameters:
- name (str) – Name of the dataset.
- type (Literal['image', 'tabular', 'timeseries']) – Type of data to load.
- loader (dict[str, Any]) – A dictionary with a 'func' key holding the data loader function as a string path (resolved by Hydra), plus any additional arguments. The function may download the data if it does not already exist, or simply load it from a path if it does.
- transform (callable, default: None) – Transformation to apply to the data. It can be a torch transformation object or an sklearn preprocessing object.
- test_size (int | None, default: None) – Number of samples to use as the test set.
- train_size (int, default: None) – Number of samples to use as the training set. Can only be used when test_size is also given as an integer. When provided, the dataset is first reduced to train_size + test_size total samples, then split according to the specified sizes. Both train_size and test_size must be positive integers less than the total dataset size, and their sum must not exceed the dataset size.
- dataloader_kwargs (dict[str, Any], default: {}) – Additional keyword arguments for DataLoader.
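The size rules for train_size and test_size can be illustrated with a small index-splitting sketch. `split_indices` is a hypothetical helper, not the module's code. The seeded shuffle and the fallback of using all remaining samples for training when train_size is None are assumptions made for the example.

```python
import random


def split_indices(n_total, test_size, train_size=None, seed=0):
    """Return (train_idx, test_idx) following the documented size rules."""
    if not (isinstance(test_size, int) and 0 < test_size < n_total):
        raise ValueError("test_size must be a positive integer below the dataset size")
    if train_size is None:
        # Assumption: without train_size, everything outside the test set is used.
        train_size = n_total - test_size
    elif not (isinstance(train_size, int) and 0 < train_size < n_total):
        raise ValueError("train_size must be a positive integer below the dataset size")
    if train_size + test_size > n_total:
        raise ValueError("train_size + test_size must not exceed the dataset size")

    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # assumption: a seeded shuffle before splitting
    kept = indices[:train_size + test_size]  # reduce to train_size + test_size samples
    return kept[:train_size], kept[train_size:]


train_idx, test_idx = split_indices(n_total=100, test_size=20, train_size=30)
```

With 100 samples, this keeps 50 of them and splits those into 30 training and 20 test indices. A request such as train_size=6 and test_size=6 on a 10-sample dataset is rejected because the sum exceeds the dataset size.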
Source code in muppet/benchmark/datasets/datamodule.py
Functions
prepare_data
Prepares the datasets and initializes the train and test DataLoaders based on the type of data. Loads the data using the Hydra-specified function from the loader config.
ImageDataset
Bases: Dataset
PyTorch Dataset for loading image files. A simple dataset class for loading images from file paths with optional transformations.
Initialize ImageDataset instance.
Parameters:
- image_files (List[str]) – List of image file paths.
- transform (callable, default: None) – Transformations to apply to each image.
TabularDataset
Bases: Dataset
PyTorch Dataset for loading tabular data.
A PyTorch dataset for structured data that processes features and targets, applies optional transformations, and handles categorical target encoding.
Initialize the TabularDataset instance.
Parameters:
- data (Tuple[DataFrame, Series]) – Tuple containing (features, target), where features is a pandas DataFrame with the input variables and target is a pandas Series with the prediction target values.
- transform (callable, default: None) – Optional transformation pipeline (e.g., sklearn preprocessors) to apply to the features. If provided, it is fitted and applied to the feature data.
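As an illustration of what categorical target encoding involves, the sketch below maps string labels to integer class indices in plain Python. It shows the general technique only. `encode_targets` and its `mapping` argument are hypothetical names, and the sorted label order is an assumption, not necessarily what TabularDataset does internally.

```python
def encode_targets(targets, mapping=None):
    """Map categorical labels to integer class indices.

    Hypothetical helper: builds a mapping from the sorted unique labels
    unless an explicit one is supplied.
    """
    if mapping is None:
        mapping = {label: idx for idx, label in enumerate(sorted(set(targets)))}
    return [mapping[t] for t in targets], mapping


encoded, mapping = encode_targets(["cat", "dog", "cat", "bird"])
```

Returning the mapping alongside the encoded values lets the same encoding be reused later, for example to decode model predictions back into label names.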
TimeseriesDataset
Bases: TabularDataset
Dataset class for time series data that extends TabularDataset.
A specialized dataset implementation for time series data that inherits from TabularDataset but adds specific handling for temporal data structures.
Initialize TimeseriesDataset instance.
Parameters:
- data (tuple[DataFrame, Series]) – A tuple containing time series features (DataFrame) and target (Series).
- transform (callable, default: None) – Optional transformations to apply to the features.