conmo.preprocesses.SklearnPreprocess
- class conmo.preprocesses.SklearnPreprocess(to_data: Union[bool, Iterable[str]], to_labels: Union[bool, Iterable[str]], test_set: bool, preprocess: Union[Binarizer, FunctionTransformer, KBinsDiscretizer, KernelCenterer, LabelBinarizer, LabelEncoder, MultiLabelBinarizer, MaxAbsScaler, MinMaxScaler, Normalizer, OneHotEncoder, OrdinalEncoder, PolynomialFeatures, PowerTransformer, QuantileTransformer, RobustScaler, StandardScaler])[source]
Class that wraps an existing preprocessing transformer from the scikit-learn library. The wrapped preprocess can also be restricted to selected columns of the dataset.
- __init__(to_data: Union[bool, Iterable[str]], to_labels: Union[bool, Iterable[str]], test_set: bool, preprocess: Union[Binarizer, FunctionTransformer, KBinsDiscretizer, KernelCenterer, LabelBinarizer, LabelEncoder, MultiLabelBinarizer, MaxAbsScaler, MinMaxScaler, Normalizer, OneHotEncoder, OrdinalEncoder, PolynomialFeatures, PowerTransformer, QuantileTransformer, RobustScaler, StandardScaler]) None[source]
- apply(in_dir: str, out_dir: str) None
Applies the preprocess to the given dataset.
- Parameters
in_dir (str) – Input directory where the files are located. Usually, this is the output directory of the splitter step.
out_dir (str) – Output directory where the files will be saved.
- extract_columns(df: DataFrame, columns: Union[bool, Iterable[str]]) Iterable[str]
Returns a list containing the names of the data's columns.
- Parameters
df (Pandas Dataframe) – Dataframe containing the data.
columns (Union[bool, Iterable[str]]) – Either a boolean indicating whether the dataframe's columns are used, or an explicit list of column names.
- Returns
columns – List containing the names of the dataframe’s columns.
- Return type
Iterable[str]
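The bool-or-list convention described above can be sketched in plain Python. This is an illustration only, not Conmo's implementation: the helper name resolve_columns is hypothetical, and the meaning given to the boolean (True selects every column, False selects none) is an assumption that this page does not confirm.

```python
from typing import Iterable, List, Union


def resolve_columns(available: Iterable[str],
                    columns: Union[bool, Iterable[str]]) -> List[str]:
    """Hypothetical stand-in for the column resolution step (not Conmo's code)."""
    if isinstance(columns, bool):
        # Assumption: True selects every available column, False selects none.
        return list(available) if columns else []
    # An explicit iterable of names is returned as a plain list.
    return list(columns)


names = ["sensor_1", "sensor_2", "sensor_3"]
print(resolve_columns(names, True))
print(resolve_columns(names, False))
print(resolve_columns(names, ["sensor_2"]))
```

Under these assumptions, a caller can pass the same kind of value that to_data or to_labels accepts and always get back a concrete list of names.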
- load_input(in_dir: str) (DataFrame, DataFrame)
Read parquet data and labels files of the chosen dataset before it’s split.
- Parameters
in_dir (str) – Input directory where the files are located.
- Returns
data (Pandas Dataframe) – Loaded data file.
labels (Pandas Dataframe) – Loaded labels file.
- save_output(out_dir: str, data: DataFrame, labels: DataFrame) None
Save preprocessed dataset to parquet format.
- Parameters
out_dir (str) – Output directory where the results will be saved.
data (Pandas Dataframe) – Preprocessed data.
labels (Pandas Dataframe) – Preprocessed labels.
- show_start_message() None
Simple method to print on the terminal the name of the selected splitter.
- transform(df: DataFrame, columns: Iterable[str]) DataFrame[source]
Applies the preprocess to the given columns of the dataframe.
- Parameters
df (Pandas Dataframe) – Dataframe containing the data or the labels of the dataset.
columns (Iterable[str]) – List of columns used in the preprocess. These are also the columns of the resulting dataframe.
- Returns
Preprocessed dataframe.
- Return type
Pandas Dataframe
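As a rough illustration of what a column-restricted transform amounts to, the sketch below applies a scikit-learn transformer to selected columns of a pandas dataframe and returns a dataframe holding only those columns. It is not Conmo's code: the helper name transform_columns and the sample column names are hypothetical, and fitting and transforming in a single call is an assumption (the real class may fit on one split and reuse the fit on another, depending on test_set).

```python
from typing import Iterable

import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def transform_columns(df: pd.DataFrame, columns: Iterable[str], preprocess) -> pd.DataFrame:
    """Hypothetical helper: fit `preprocess` on the chosen columns and transform them."""
    columns = list(columns)
    # Assumption: fit and transform happen together on the same dataframe.
    values = preprocess.fit_transform(df[columns])
    # Rebuild a dataframe that keeps the original index and only the selected columns.
    return pd.DataFrame(values, index=df.index, columns=columns)


data = pd.DataFrame({
    "temp": [10.0, 20.0, 30.0],
    "pressure": [1.0, 3.0, 5.0],
    "id": [7, 8, 9],
})
scaled = transform_columns(data, ["temp", "pressure"], MinMaxScaler())
print(scaled)
```

Here MinMaxScaler rescales each selected column to the range 0 to 1, and the id column is left out of the result because it was not listed.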
Methods
__init__(to_data, to_labels, test_set, ...)
apply(in_dir, out_dir) – Applies the preprocess to the given dataset.
extract_columns(df, columns) – Returns a list containing the names of the data's columns.
load_input(in_dir) – Read parquet data and labels files of the chosen dataset before it's split.
save_output(out_dir, data, labels) – Save preprocessed dataset to parquet format.
show_start_message() – Simple method to print on the terminal the name of the selected splitter.
transform(df, columns) – Performs the preprocess over the dataframe with the given columns.