conmo.preprocesses.SklearnPreprocess
- class conmo.preprocesses.SklearnPreprocess(to_data: Union[bool, Iterable[str]], to_labels: Union[bool, Iterable[str]], test_set: bool, preprocess: Union[Binarizer, FunctionTransformer, KBinsDiscretizer, KernelCenterer, LabelBinarizer, LabelEncoder, MultiLabelBinarizer, MaxAbsScaler, MinMaxScaler, Normalizer, OneHotEncoder, OrdinalEncoder, PolynomialFeatures, PowerTransformer, QuantileTransformer, RobustScaler, StandardScaler])[source]
Class used to wrap an existing preprocess from the Scikit-Learn library. It also allows this preprocess to be applied only to certain columns of the dataset.
- __init__(to_data: Union[bool, Iterable[str]], to_labels: Union[bool, Iterable[str]], test_set: bool, preprocess: Union[Binarizer, FunctionTransformer, KBinsDiscretizer, KernelCenterer, LabelBinarizer, LabelEncoder, MultiLabelBinarizer, MaxAbsScaler, MinMaxScaler, Normalizer, OneHotEncoder, OrdinalEncoder, PolynomialFeatures, PowerTransformer, QuantileTransformer, RobustScaler, StandardScaler]) None [source]
- apply(in_dir: str, out_dir: str) None
Applies the preprocess to the given dataset.
- Parameters
in_dir (str) – Input directory where the files are located. Usually, this is the output directory of the splitter step.
out_dir (str) – Output directory where the files will be saved.
- extract_columns(df: DataFrame, columns: Union[bool, Iterable[str]]) Iterable[str]
Returns a list containing the names of all the columns of the data.
- Parameters
df (Pandas Dataframe) – Dataframe containing the data.
columns (Union[bool, Iterable[str]]) – Either a bool value indicating whether the dataframe has columns, or the list of columns.
- Returns
columns – List containing the names of the dataframe’s columns.
- Return type
Iterable[str]
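As a rough illustration of how a `Union[bool, Iterable[str]]` column selector might be resolved, here is a minimal sketch. The helper `resolve_columns` is hypothetical and not part of conmo, and the bool semantics (`True` selects every column, `False` selects none) are an assumption, not something this page confirms.

```python
from typing import Iterable, List, Union

import pandas as pd


def resolve_columns(df: pd.DataFrame, columns: Union[bool, Iterable[str]]) -> List[str]:
    """Hypothetical stand-in for extract_columns; the bool semantics are assumed."""
    if columns is True:
        # Assumption: True selects every column of the dataframe
        return list(df.columns)
    if columns is False:
        # Assumption: False selects no columns at all
        return []
    # Otherwise treat the argument as an explicit list of column names
    return list(columns)


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
all_columns = resolve_columns(df, True)
no_columns = resolve_columns(df, False)
some_columns = resolve_columns(df, ["b"])
```

Under these assumptions, the same argument type can serve both `to_data` and `to_labels`: a bool switches the preprocess on or off for the whole frame, while a list restricts it to the named columns.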
- load_input(in_dir: str) (DataFrame, DataFrame)
Read parquet data and labels files of the chosen dataset before it’s split.
- Parameters
in_dir (str) – Input directory where the files are located.
- Returns
data (Pandas Dataframe) – Loaded data file.
labels (Pandas Dataframe) – Loaded labels file.
- save_output(out_dir: str, data: DataFrame, labels: DataFrame) None
Save preprocessed dataset to parquet format.
- Parameters
out_dir (str) – Output directory where the results will be saved.
data (Pandas Dataframe) – Preprocessed data.
labels (Pandas Dataframe) – Preprocessed labels.
- show_start_message() None
Simple method to print on the terminal the name of the selected splitter.
- transform(df: DataFrame, columns: Iterable[str]) DataFrame [source]
Performs the preprocess over the dataframe with the given columns.
- Parameters
df (Pandas Dataframe) – Dataframe containing the data or the labels of the dataset.
columns (Iterable[str]) – List of columns that will be used in the preprocess. These are also the columns of the final dataframe.
- Returns
The preprocessed dataframe.
- Return type
Pandas Dataframe
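The column-wise behaviour described for `transform` can be sketched with plain pandas and Scikit-Learn. The helper `transform_columns` below is hypothetical and does not reproduce conmo's implementation. In particular, fitting and transforming in a single `fit_transform` call is an assumption, since this page does not say how fitting is handled or how `test_set` affects it.

```python
from typing import Iterable

import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def transform_columns(df: pd.DataFrame, columns: Iterable[str], preprocess) -> pd.DataFrame:
    """Hypothetical helper: apply `preprocess` to the selected columns only."""
    columns = list(columns)
    # Assumption: fit and transform in one step on the selected columns
    values = preprocess.fit_transform(df[columns])
    # Rebuild a dataframe holding only the selected columns, keeping the original index
    return pd.DataFrame(values, columns=columns, index=df.index)


df = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [1.0, 2.0, 3.0], "c": [7.0, 8.0, 9.0]})
scaled = transform_columns(df, ["a", "b"], MinMaxScaler())
print(scaled)
```

Column `c` is absent from the result because only the requested columns are carried into the output frame, matching the note that the given columns are also the columns of the final dataframe.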
Methods
- __init__(to_data, to_labels, test_set, ...)
- apply(in_dir, out_dir) – Applies the preprocess to the given dataset.
- extract_columns(df, columns) – Returns a list containing the names of all the columns of the data.
- load_input(in_dir) – Read parquet data and labels files of the chosen dataset before it's split.
- save_output(out_dir, data, labels) – Save preprocessed dataset to parquet format.
- show_start_message() – Simple method to print on the terminal the name of the selected splitter.
- transform(df, columns) – Performs the preprocess over the dataframe with the given columns.