merlin.datasets.utils module

merlin.datasets.utils.df_to_xy(df, feature_cols=None, label_cols=None)

Convert a pandas DataFrame to NumPy arrays for features (X) and labels (y).

Return type:

tuple[ndarray, ndarray]

Args:

df: Input DataFrame

feature_cols: List of column names to use as features. If None, uses all columns except label_cols

label_cols: List of column names to use as labels. If None, assumes last column is label

Returns:

X: numpy array of features

y: numpy array of labels
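
The column defaults described above can be illustrated with a small stand-alone sketch. It does not import merlin and is not the library's implementation: `df_to_xy_sketch` is a hypothetical name, and returning `y` as a 2-D array of shape (n_rows, n_label_cols) is an assumption, since the documentation does not state the output shape.

```python
import numpy as np
import pandas as pd


def df_to_xy_sketch(df, feature_cols=None, label_cols=None):
    """Hypothetical re-creation of the documented defaults, for illustration only."""
    # Documented default: if no label columns are given, the last column is the label.
    if label_cols is None:
        label_cols = [df.columns[-1]]
    # Documented default: features are all columns that are not label columns.
    if feature_cols is None:
        feature_cols = [c for c in df.columns if c not in label_cols]
    X = df[feature_cols].to_numpy()
    # Assumption: y keeps one column per label, so it is 2-D even for one label.
    y = df[label_cols].to_numpy()
    return X, y


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "label": [0, 1, 0]})

# Defaults: "a" and "b" become features, "label" becomes the label.
X, y = df_to_xy_sketch(df)

# Explicit selection: one feature column and one label column.
X2, y2 = df_to_xy_sketch(df, feature_cols=["a"], label_cols=["b"])
```

With the real function, the equivalent call would presumably be `df_to_xy(df)` or `df_to_xy(df, feature_cols=[...], label_cols=[...])`, following the signature above.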

merlin.datasets.utils.fetch(url, data_dir=None, force=False)

Fetch a file from a URL and store it in the virtual environment’s data directory. If the file already exists, return its path without downloading again, unless force=True. If the file is gzipped, extract it.

Return type:

Path

Args:

url: URL to fetch the file from

data_dir: Optional override for the data directory

force: If True, re-download even if file exists

Returns:

Path: Path to the downloaded (and potentially extracted) file
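
The download-once, extract-if-gzipped behavior described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `fetch_sketch` is a hypothetical name, the file name is taken from the last URL segment, and `data_dir` is required here because the way merlin locates the virtual environment's data directory is not documented in this section.

```python
import gzip
import shutil
import tempfile
import urllib.request
from pathlib import Path


def fetch_sketch(url, data_dir, force=False):
    """Hypothetical sketch of a cached download with optional gzip extraction."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: the local file name is the last segment of the URL.
    download_path = data_dir / url.rsplit("/", 1)[-1]
    is_gzip = download_path.suffix == ".gz"
    final_path = download_path.with_suffix("") if is_gzip else download_path

    # Reuse the cached file unless the caller forces a new download.
    if final_path.exists() and not force:
        return final_path

    with urllib.request.urlopen(url) as response, open(download_path, "wb") as out:
        shutil.copyfileobj(response, out)

    if is_gzip:
        # Decompress next to the archive, dropping the ".gz" suffix.
        with gzip.open(download_path, "rb") as src, open(final_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    return final_path


# Demo without network access: a local gzip file served through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    remote = Path(tmp) / "remote" / "sample.txt.gz"
    remote.parent.mkdir()
    with gzip.open(remote, "wb") as f:
        f.write(b"hello")

    cache_dir = Path(tmp) / "cache"
    first = fetch_sketch(remote.as_uri(), cache_dir)   # downloads and extracts
    second = fetch_sketch(remote.as_uri(), cache_dir)  # served from the cache
    content = first.read_bytes()
```

With the real function, a call would presumably look like `fetch(url)` or `fetch(url, force=True)`, returning a `Path` as documented.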

merlin.datasets.utils.read_idx(filepath)

Read a file in the IDX format used by the MNIST dataset.

Return type:

tuple[ndarray, dict]

Args:

filepath: Path to the IDX file

Returns:
Tuple[np.ndarray, dict]: Tuple containing:
  • numpy array with the data
  • metadata dictionary with magic number, data type, and dimensions
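
For readers unfamiliar with IDX, the layout is: a 4-byte magic number (two zero bytes, one data type code, one byte giving the number of dimensions), then one big-endian 32-bit size per dimension, then the data in big-endian order. The sketch below parses that layout. It is an illustration, not the library's implementation: `read_idx_sketch` and the metadata keys `magic`, `dtype_code`, and `dims` are hypothetical, since the documentation does not name the dictionary keys.

```python
import struct
import tempfile
from pathlib import Path

import numpy as np

# IDX data type codes mapped to big-endian NumPy dtype strings.
IDX_DTYPES = {
    0x08: ">u1",  # unsigned byte
    0x09: ">i1",  # signed byte
    0x0B: ">i2",  # 16-bit integer
    0x0C: ">i4",  # 32-bit integer
    0x0D: ">f4",  # 32-bit float
    0x0E: ">f8",  # 64-bit float
}


def read_idx_sketch(filepath):
    """Hypothetical IDX parser returning (array, metadata)."""
    with open(filepath, "rb") as f:
        header = f.read(4)
        zero, dtype_code, ndim = struct.unpack(">HBB", header)
        if zero != 0:
            raise ValueError("not an IDX file: first two bytes must be zero")
        dims = struct.unpack(f">{ndim}I", f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=IDX_DTYPES[dtype_code]).reshape(dims)

    # Assumption: these key names are illustrative, not merlin's actual keys.
    meta = {
        "magic": struct.unpack(">I", header)[0],
        "dtype_code": dtype_code,
        "dims": dims,
    }
    return data, meta


# Demo: write a tiny 2 x 3 unsigned-byte IDX file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tiny.idx"
    payload = struct.pack(">HBB", 0, 0x08, 2) + struct.pack(">II", 2, 3) + bytes(range(6))
    path.write_bytes(payload)
    data, meta = read_idx_sketch(path)
```

With the real function, the call would presumably be `data, meta = read_idx(filepath)`, matching the documented return type.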