merlin.datasets.utils module

merlin.datasets.utils.get_venv_data_dir()

Get the data directory within the current virtual environment. Creates a ‘datasets’ directory in the site-packages folder.

Returns:

Dataset cache directory path.

Return type:

pathlib.Path
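A minimal sketch of how such a directory might be resolved, assuming the site-packages location is taken from `sysconfig`. The helper name `venv_data_dir` and its `base` parameter are hypothetical and are not part of `merlin.datasets.utils`.

```python
import sysconfig
from pathlib import Path
from typing import Optional


def venv_data_dir(base: Optional[Path] = None) -> Path:
    """Return (and create if needed) a 'datasets' directory under site-packages.

    `base` is a hypothetical override so the sketch can be tried without
    writing into a real environment.
    """
    if base is None:
        # 'purelib' is the site-packages directory of the active interpreter.
        base = Path(sysconfig.get_paths()["purelib"])
    data_dir = base / "datasets"
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir
```

Because the directory lives inside the environment, a cache built this way would be discarded when the virtual environment is deleted or recreated.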

merlin.datasets.utils.url_to_filename(url)

Convert a URL to a filename, using a hash to ensure uniqueness while keeping the name readable.

Parameters:

url (str) – URL to convert.

Returns:

Filename derived from the URL.

Return type:

str
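One way a readable yet unique name could be derived is sketched below. The hash algorithm (SHA-256), the eight-character digest length, and the `<stem>_<hash><extension>` layout are all assumptions for illustration. The actual scheme used by `url_to_filename` may differ.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_filename_sketch(url: str, hash_len: int = 8) -> str:
    """Build '<stem>_<short-hash><extensions>' from a URL (hypothetical layout)."""
    name = PurePosixPath(urlparse(url).path).name or "download"
    # Hash the full URL so two URLs sharing a basename do not collide.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:hash_len]
    # Split at the first dot so multi-part extensions such as '.csv.gz' survive.
    stem, dot, ext = name.partition(".")
    return f"{stem}_{digest}{dot}{ext}"
```

Keeping the original basename and extension makes cached files easy to recognise on disk, while the digest separates files that share a name but come from different hosts or paths.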

merlin.datasets.utils.fetch(url, data_dir=None, force=False)

Fetch a file and cache it locally.

If the file is gzipped, it is extracted before returning the local path.

Parameters:
  • url (str) – URL to fetch.

  • data_dir (pathlib.Path | None) – Optional override for the cache directory.

  • force (bool) – Whether to re-download the file even if it is already cached.

Returns:

Path to the downloaded or extracted file.

Return type:

pathlib.Path
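The download, cache, and extract behaviour described above can be sketched with the standard library alone. This is not the library's implementation: `fetch_sketch` is a hypothetical name, `data_dir` is required here rather than optional, and the cache key is assumed to be the URL's basename.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def fetch_sketch(url: str, data_dir: Path, force: bool = False) -> Path:
    """Download `url` into `data_dir` once and gunzip '.gz' files beside it."""
    data_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the cached file is named after the URL's basename.
    local = data_dir / PurePosixPath(urlparse(url).path).name

    # Download only when forced or when nothing is cached yet.
    if force or not local.exists():
        with urllib.request.urlopen(url) as response, open(local, "wb") as out:
            shutil.copyfileobj(response, out)

    if local.suffix != ".gz":
        return local

    extracted = local.with_suffix("")  # 'file.idx.gz' -> 'file.idx'
    if force or not extracted.exists():
        with gzip.open(local, "rb") as src, open(extracted, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return extracted
```

In this sketch `force=True` refreshes both the download and the extracted copy, so a stale extraction cannot outlive a re-downloaded archive.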

merlin.datasets.utils.read_idx(filepath)

Read an IDX file as used by MNIST-style datasets.

Parameters:

filepath (pathlib.Path) – Path to the IDX file.

Returns:

Tuple containing:
  • numpy array with the data

  • metadata dictionary with magic number, data type, and dimensions

Return type:

Tuple[numpy.ndarray, dict]
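For orientation, the IDX layout is a 4-byte magic number (two zero bytes, a type code, and the number of dimensions), followed by one big-endian 32-bit size per dimension, followed by the data. The sketch below decodes that layout with `struct` only and returns a flat list. The function name `parse_idx` and the metadata keys are assumptions, not the names used by `read_idx`.

```python
import struct
from typing import Tuple

# IDX element type codes: code -> (struct format character, size in bytes).
IDX_DTYPES = {
    0x08: ("B", 1),  # unsigned byte
    0x09: ("b", 1),  # signed byte
    0x0B: ("h", 2),  # 16-bit integer
    0x0C: ("i", 4),  # 32-bit integer
    0x0D: ("f", 4),  # 32-bit float
    0x0E: ("d", 8),  # 64-bit float
}


def parse_idx(raw: bytes) -> Tuple[list, dict]:
    """Decode IDX bytes into a flat list of values plus header metadata."""
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code not in IDX_DTYPES:
        raise ValueError("not a valid IDX header")
    # One big-endian unsigned 32-bit size per dimension.
    dims = struct.unpack(f">{ndim}I", raw[4:4 + 4 * ndim])
    count = 1
    for d in dims:
        count *= d
    fmt, size = IDX_DTYPES[dtype_code]
    offset = 4 + 4 * ndim
    values = list(struct.unpack(f">{count}{fmt}", raw[offset:offset + count * size]))
    metadata = {
        "magic": struct.unpack(">I", raw[:4])[0],
        "dtype_code": dtype_code,
        "dims": dims,
    }
    return values, metadata
```

With NumPy available, the same payload could instead be read with `numpy.frombuffer` using a big-endian dtype and the same byte offset, then reshaped to `dims`.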

merlin.datasets.utils.df_to_xy(df, feature_cols=None, label_cols=None)

Convert a pandas DataFrame to feature and label arrays.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame.

  • feature_cols (list | None) – Column names to use as features. If None, all non-label columns are used.

  • label_cols (list | None) – Column names to use as labels. If None, the last column is treated as the label.

Returns:

Feature matrix X and label array y.

Return type:

tuple[numpy.ndarray, numpy.ndarray]
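The default rules for `feature_cols` and `label_cols` can be illustrated on plain column names. The helper below is a hypothetical sketch of that selection logic only. It does not touch pandas and is not the library's code.

```python
from typing import List, Optional, Sequence, Tuple


def resolve_columns(
    columns: Sequence[str],
    feature_cols: Optional[List[str]] = None,
    label_cols: Optional[List[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Apply the documented defaults to pick feature and label column names."""
    if label_cols is None:
        # Default: the last column is treated as the label.
        label_cols = [columns[-1]]
    if feature_cols is None:
        # Default: every column that is not a label.
        feature_cols = [c for c in columns if c not in label_cols]
    return list(feature_cols), list(label_cols)
```

Given a DataFrame `df`, the resolved lists could then be used as `df[feature_cols].to_numpy()` and `df[label_cols].to_numpy()`. Whether `df_to_xy` flattens a single label column to one dimension is not stated here.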

merlin.datasets.utils.read_mnist_images(filepath)

Read an MNIST image file.

Parameters:

filepath (pathlib.Path) – Path to the MNIST image file.

Returns:

Image array with shape (n_images, 28, 28).

Return type:

numpy.ndarray

merlin.datasets.utils.read_mnist_labels(filepath)

Read an MNIST label file.

Parameters:

filepath (pathlib.Path) – Path to the MNIST label file.

Returns:

Label array.

Return type:

numpy.ndarray

merlin.datasets.utils.get_data_generic(subset, url_images, url_labels, metadata)

Load an IDX-based image dataset split and wrap its metadata.

Parameters:
  • subset (str) – Split name, for example "train" or "test".

  • url_images (str) – URL of the image IDX file.

  • url_labels (str) – URL of the label IDX file.

  • metadata (dict) – Dataset metadata dictionary to enrich with split-specific fields.

Returns:

Images, labels, and structured dataset metadata.

Return type:

tuple[numpy.ndarray, numpy.ndarray, DatasetMetadata]