conmo.splitters.SklearnSplitter

class conmo.splitters.SklearnSplitter(splitter: Union[GroupKFold, GroupShuffleSplit, KFold, LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut, PredefinedSplit, RepeatedKFold, RepeatedStratifiedKFold, ShuffleSplit, StratifiedKFold, StratifiedShuffleSplit, TimeSeriesSplit], groups: Optional[Iterable[int]] = None)[source]

__init__(splitter: Union[GroupKFold, GroupShuffleSplit, KFold, LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut, PredefinedSplit, RepeatedKFold, RepeatedStratifiedKFold, ShuffleSplit, StratifiedKFold, StratifiedShuffleSplit, TimeSeriesSplit], groups: Optional[Iterable[int]] = None) → None[source]
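The `splitter` argument accepts the scikit-learn cross-validation objects listed in the signature. As a minimal sketch of what such objects produce on their own (outside Conmo), the example below calls scikit-learn's `split` method directly. The `sequences` array and the `groups` list are hypothetical stand-ins for a dataset's sequence identifiers and group ids; they are not taken from Conmo.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical sequence identifiers standing in for a dataset's sequences.
sequences = np.arange(6)

# Plain k-fold without shuffling: each sequence is in the test set exactly once.
kfold = KFold(n_splits=3)
kfold_folds = list(kfold.split(sequences))

# Group-aware k-fold: sequences sharing a group id stay on the same side.
groups = [0, 0, 1, 1, 2, 2]  # hypothetical group id per sequence
group_kfold = GroupKFold(n_splits=3)
group_folds = list(group_kfold.split(sequences, groups=groups))

for fold, (train_idx, test_idx) in enumerate(kfold_folds):
    print(f"fold {fold}: train={sequences[train_idx]}, test={sequences[test_idx]}")
```

Going by the constructor signature above, such an object would presumably be passed as `SklearnSplitter(splitter=GroupKFold(n_splits=3), groups=groups)`. Whether Conmo forwards `groups` to the splitter's `split` call in exactly this way is an assumption.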

already_splitted(df: DataFrame) → bool

Checks whether the dataset has already been split.

Parameters

df (pandas DataFrame) – Input dataset.

Returns

True if the dataset has already been split, False otherwise.

Return type

bool

Raises

RuntimeError – If the dataset has not been split and does not follow Conmo’s format.

extract_fold(df: DataFrame, sequences: ndarray, fold: int, train_idx: ndarray, test_idx: ndarray) → (ndarray, ndarray)[source]

load_input(in_dir: str) → (DataFrame, DataFrame)

Reads the Parquet data and labels files of the chosen dataset.

Parameters

in_dir (str) – Input directory where the files are located.

Returns

data (pandas DataFrame) – Loaded data file.

labels (pandas DataFrame) – Loaded labels file.

Raises

If data and labels have different sequence values.
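The consistency check described under Raises can be illustrated with a small sketch. This is not Conmo's implementation: the helper name `check_same_sequences`, the index level name `"sequence"`, and the use of `ValueError` are all assumptions made for illustration, since the exception type is not stated above.

```python
import pandas as pd


def check_same_sequences(data: pd.DataFrame, labels: pd.DataFrame,
                         level: str = "sequence") -> None:
    """Raise if data and labels do not cover the same sequence values.

    Hypothetical helper: the level name and exception type are assumptions.
    """
    data_seqs = set(data.index.get_level_values(level))
    label_seqs = set(labels.index.get_level_values(level))
    if data_seqs != label_seqs:
        # Report only the sequences that appear on one side.
        mismatch = sorted(data_seqs ^ label_seqs)
        raise ValueError(f"Data and labels differ on sequences: {mismatch}")


# Hypothetical data: two sequences with three time steps each.
data_index = pd.MultiIndex.from_product([[1, 2], [0, 1, 2]],
                                        names=["sequence", "time"])
data = pd.DataFrame({"value": range(6)}, index=data_index)

# Hypothetical labels: one label per sequence.
labels = pd.DataFrame({"anomaly": [0, 1]},
                      index=pd.Index([1, 2], name="sequence"))

check_same_sequences(data, labels)  # passes silently: both cover {1, 2}
```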

save_output(out_dir: str, data: DataFrame, labels: DataFrame) → None

Saves the split dataset in Parquet format.

Parameters

out_dir (str) – Output directory where the results will be saved.

data (pandas DataFrame) – Split data.

labels (pandas DataFrame) – Split labels.

show_start_message()

Prints the name of the selected splitter to the terminal.

split(in_dir: str, out_dir: str) → None[source]

Splits both the data and the labels of the dataset.

Parameters

in_dir (str) – Input directory containing the output of the previous step.

out_dir (str) – Output directory where the split data will be stored.
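To make the idea of splitting a dataset by sequence concrete, here is a rough sketch that applies a scikit-learn splitter to the sequence identifiers of an in-memory DataFrame and tags the resulting rows by fold and set. It is not Conmo's implementation and skips reading and writing directories entirely. The function name `split_by_sequence`, the index level names `"sequence"` and `"time"`, and the `fold`/`set` output layout are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical dataset: 4 sequences with 2 time steps each.
index = pd.MultiIndex.from_product([[10, 11, 12, 13], [0, 1]],
                                   names=["sequence", "time"])
data = pd.DataFrame({"value": np.arange(8.0)}, index=index)


def split_by_sequence(df: pd.DataFrame, splitter) -> pd.DataFrame:
    """Split rows by whole sequences and tag them with fold and set.

    Hypothetical helper: the output layout is an assumption, not Conmo's.
    """
    seq_level = df.index.get_level_values("sequence")
    sequences = seq_level.unique().to_numpy()
    parts = {}
    # The splitter works on sequence positions, so a sequence is never
    # divided between the train and test sets of a fold.
    for fold, (train_idx, test_idx) in enumerate(splitter.split(sequences)):
        parts[(fold, "train")] = df[seq_level.isin(sequences[train_idx])]
        parts[(fold, "test")] = df[seq_level.isin(sequences[test_idx])]
    # Tuple keys become two extra index levels in front of the original ones.
    return pd.concat(parts, names=["fold", "set"])


result = split_by_sequence(data, KFold(n_splits=2))
print(result)
```

Splitting on sequence identifiers rather than on individual rows keeps every time step of a sequence on the same side of a fold, which is the usual reason for working at sequence granularity.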

to_dataframe(df: DataFrame, data: ndarray, index: ndarray) → DataFrame[source]

Methods

`__init__`(splitter[, groups])
`already_splitted`(df)	Checks whether the dataset has already been split.
`extract_fold`(df, sequences, fold, train_idx, ...)
`load_input`(in_dir)	Reads the Parquet data and labels files of the chosen dataset.
`save_output`(out_dir, data, labels)	Saves the split dataset in Parquet format.
`show_start_message`()	Prints the name of the selected splitter to the terminal.
`split`(in_dir, out_dir)	Splits both the data and the labels of the dataset.
`to_dataframe`(df, data, index)