merlin.datasets.utils module
- merlin.datasets.utils.get_venv_data_dir()
Get the data directory within the current virtual environment. Creates a ‘datasets’ directory in the site-packages folder.
- Returns:
Dataset cache directory path.
- Return type:
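One way to build such a directory is sketched below. It assumes the site-packages folder is located with `sysconfig`. The name `venv_data_dir_sketch` and its `base` parameter are hypothetical and only keep the example testable; this is not necessarily how `get_venv_data_dir` is implemented.

```python
import sysconfig
from pathlib import Path
from typing import Optional


def venv_data_dir_sketch(base: Optional[Path] = None) -> Path:
    """Return a 'datasets' directory under site-packages, creating it if needed.

    `base` is a hypothetical override for the site-packages location.
    """
    if base is not None:
        site_packages = Path(base)
    else:
        # "purelib" is the site-packages path of the active interpreter,
        # which lives inside the virtual environment when one is active.
        site_packages = Path(sysconfig.get_paths()["purelib"])
    data_dir = site_packages / "datasets"
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir
```

Calling `venv_data_dir_sketch()` without arguments writes into the active environment, so it needs write permission there.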
- merlin.datasets.utils.url_to_filename(url)
Convert a URL to a filename, using a hash to ensure uniqueness while keeping the name readable.
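A minimal sketch of the idea, assuming the readable part is the last path segment of the URL and the unique part is a short SHA-256 prefix. The function name, the hash algorithm, the prefix length, and the `<hash>_<basename>` layout are all assumptions and may differ from what `url_to_filename` actually produces.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_filename_sketch(url: str, hash_len: int = 8) -> str:
    """Build '<short-hash>_<basename>' from a URL (hypothetical layout)."""
    # Readable part: the last path segment, or a placeholder if there is none.
    basename = PurePosixPath(urlparse(url).path).name or "download"
    # Unique part: a hash of the full URL, so equal basenames on
    # different hosts or paths do not collide.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:hash_len]
    return f"{digest}_{basename}"
```

Hashing the full URL rather than the basename means two mirrors serving a file with the same name get separate cache entries.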
- merlin.datasets.utils.fetch(url, data_dir=None, force=False)
Fetch a file and cache it locally.
If the file is gzipped, it is extracted before returning the local path.
- Parameters:
url (str) – URL to fetch.
data_dir (pathlib.Path | None) – Optional override for the cache directory.
force (bool) – Whether to re-download the file even if it is already cached.
- Returns:
Path to the downloaded or extracted file.
- Return type:
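The download, cache, and extract behaviour described for `fetch` can be sketched with the standard library alone. Everything below is an assumption for illustration, not the module's code: the name `fetch_sketch`, the default cache location, the cache filename scheme, and the `.gz` suffix check.

```python
import gzip
import hashlib
import shutil
import urllib.request
from pathlib import Path
from typing import Optional


def fetch_sketch(url: str, data_dir: Optional[Path] = None, force: bool = False) -> Path:
    """Download `url` into a cache directory and gunzip it if needed."""
    # Hypothetical default cache location; the real module uses its own.
    cache_dir = Path(data_dir) if data_dir is not None else Path.home() / ".cache" / "datasets_sketch"
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical cache name: short hash of the URL plus its last segment.
    basename = url.rstrip("/").rsplit("/", 1)[-1] or "download"
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    local = cache_dir / f"{digest}_{basename}"

    # Download only when forced or when no cached copy exists.
    if force or not local.exists():
        with urllib.request.urlopen(url) as response, open(local, "wb") as out:
            shutil.copyfileobj(response, out)

    # Extract gzipped files next to the archive and return the extracted path.
    if local.suffix == ".gz":
        extracted = local.with_suffix("")
        if force or not extracted.exists():
            with gzip.open(local, "rb") as src, open(extracted, "wb") as dst:
                shutil.copyfileobj(src, dst)
        return extracted
    return local
```

With `force=False`, a second call for the same URL returns the cached path without touching the network.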
- merlin.datasets.utils.read_idx(filepath)
Read an IDX file as used by MNIST-style datasets.
- Parameters:
filepath (pathlib.Path) – Path to the IDX file.
- Returns:
- Tuple[numpy.ndarray, dict]: Tuple containing:
numpy array with the data
metadata dictionary with magic number, data type, and dimensions
- Return type:
Tuple[numpy.ndarray, dict]
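For orientation, an IDX file starts with a 4-byte magic number (two zero bytes, a data-type code, and the number of dimensions), followed by one big-endian 32-bit size per dimension and then the raw data. The sketch below parses that layout; the function name and the metadata keys (`magic_number`, `data_type`, `dimensions`) are assumptions, not necessarily those returned by `read_idx`.

```python
import struct
from pathlib import Path

import numpy as np

# IDX data-type codes mapped to big-endian NumPy dtypes.
_IDX_DTYPES = {
    0x08: np.dtype(">u1"),  # unsigned byte
    0x09: np.dtype(">i1"),  # signed byte
    0x0B: np.dtype(">i2"),  # short
    0x0C: np.dtype(">i4"),  # int
    0x0D: np.dtype(">f4"),  # float
    0x0E: np.dtype(">f8"),  # double
}


def read_idx_sketch(filepath):
    """Parse an IDX file into (array, metadata); key names are hypothetical."""
    raw = Path(filepath).read_bytes()
    zero, type_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or type_code not in _IDX_DTYPES:
        raise ValueError(f"not a valid IDX header: {raw[:4]!r}")

    header_len = 4 + 4 * ndim
    dims = struct.unpack(f">{ndim}I", raw[4:header_len])
    dtype = _IDX_DTYPES[type_code]
    count = int(np.prod(dims, dtype=np.int64))
    data = np.frombuffer(raw, dtype=dtype, count=count, offset=header_len).reshape(dims)

    metadata = {
        "magic_number": struct.unpack(">I", raw[:4])[0],
        "data_type": dtype.name,
        "dimensions": dims,
    }
    return data, metadata
```

The array returned by `np.frombuffer` is read-only; call `.copy()` on it if the data must be modified in place.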
- merlin.datasets.utils.df_to_xy(df, feature_cols=None, label_cols=None)
Convert a pandas DataFrame to feature and label arrays.
- Parameters:
df (pandas.DataFrame) – Input DataFrame.
feature_cols (list | None) – Column names to use as features. If None, all non-label columns are used.
label_cols (list | None) – Column names to use as labels. If None, the last column is treated as the label.
- Returns:
Feature matrix X and label array y.
- Return type:
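The default rules for `feature_cols` and `label_cols` can be illustrated with a short sketch. The name `df_to_xy_sketch` is hypothetical, and flattening a single label column to a 1-D array is an assumption; the documentation does not state the shape of `y`.

```python
import pandas as pd


def df_to_xy_sketch(df, feature_cols=None, label_cols=None):
    """Split a DataFrame into a feature matrix X and a label array y."""
    # Default: the last column is the label.
    if label_cols is None:
        label_cols = [df.columns[-1]]
    # Default: every column that is not a label is a feature.
    if feature_cols is None:
        feature_cols = [c for c in df.columns if c not in label_cols]

    X = df[list(feature_cols)].to_numpy()
    y = df[list(label_cols)].to_numpy()
    # Assumption: a single label column is returned as a 1-D array.
    if y.shape[1] == 1:
        y = y.ravel()
    return X, y


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0], "target": [0, 1, 0]})
X, y = df_to_xy_sketch(df)  # features: a, b; label: target
```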
- merlin.datasets.utils.read_mnist_images(filepath)
Read an MNIST image file.
- Parameters:
filepath (pathlib.Path) – Path to the MNIST image file.
- Returns:
Image array with shape (n_images, 28, 28).
- Return type:
- merlin.datasets.utils.read_mnist_labels(filepath)
Read an MNIST label file.
- Parameters:
filepath (pathlib.Path) – Path to the MNIST label file.
- Returns:
Label array.
- Return type:
- merlin.datasets.utils.get_data_generic(subset, url_images, url_labels, metadata)
Load an IDX-based image dataset split and wrap its metadata.
- Parameters:
- Returns:
Images, labels, and structured dataset metadata.
- Return type: