Welcome to Conmo documentation!

Release v1.0.1.

What is Conmo?

Conmo is a framework developed in Python whose main objective is to facilitate the execution and comparison of experiments, mainly in the fields of Anomaly Detection and Condition Monitoring. These experiments consist of a series of concatenated stages forming a pipeline architecture, i.e. the output of one stage is the input of the next. The framework aims to provide a way to standardise machine learning experiments, making it possible to reconstruct the result tables of scientific papers.

User Guide

Quickstart Guide

Requirements

Conmo was developed under Python 3.7.11, so it should work with similar or more recent versions; however, this has not been fully verified, so we recommend using the same version. To use Conmo you need a Python interpreter and the following libraries installed on your computer:

If you want to contribute by modifying the code and documentation, you also need to include these libraries:

We suggest creating a new virtual environment with the Conda package manager and installing all dependencies there.
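
For example (the environment name conmo_env is just a placeholder):

conda create -n conmo_env python=3.7
conda activate conmo_env

Once the environment is active, Conmo and its dependencies can be installed with pip as described in the Installation section below.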

Installation

The fastest way to get started with Conmo is to install it via the pip command.

pip install conmo

Then you can open a Python interpreter and try running:

import conmo

Some TensorFlow warnings might come up if your computer does not have a GPU installed, but that is not a problem for running Conmo.

You can also install Conmo manually by downloading the source code from the Github repository:

git clone https://github.com/MyM-Uniovi/conmo.git
cd conmo

Then, if you have not prepared a Conda environment manually, you can execute the shell script install_conmo_conda.sh to install all the dependencies and create a Conda environment with Python 3.7.

cd scripts
./install_conmo_conda.sh conda_env_name

If your operating system is macOS, please check the Known Issues & Limitations section for more information about the compatibility of Conmo with Apple M1 and M2 CPUs.

If your operating system is not Unix-like and you are using Windows 10/11, you can create the Conda environment manually or use the Windows Subsystem for Linux (WSL). For more information about its installation, please refer to Microsoft's official documentation.

To check whether the Conda environment is activated, you should see (conda_env_name) in your command line. If it is not activated, you can activate it using:

conda activate conda_env_name

Overview

The experiments in Conmo have a pipeline-based architecture. A pipeline consists of a chain of processes connected in such a way that the output of each element of the chain is the input of the next, thus creating a data flow. Each of these processes represents one of the typical generic steps in Machine Learning experiments:

Datasets

Defines the dataset used in the experiment which will be the starting data of the chain. Here the dataset will be loaded and parsed to a standard format.

Splitters

Typically in Machine Learning problems the data has to be split into train and test data. Cross-validation techniques can also be applied here.

Preprocesses

Defines the sequence of preprocesses to be applied over the dataset to manipulate the data before any algorithm is executed.

Algorithms

Defines the algorithms to be executed over the same input data stream (the result of the previous stage). There can be one or several.

Metrics

Defines the different metrics that can be used to evaluate the results obtained from the algorithms.

[Figure: Conmo pipeline architecture]

Further details and documentation about modules, functions and parameters are provided in the API Reference.

Running an experiment

Here is a brief example of how to use the different Conmo modules to reproduce an experiment, in this case with the predefined splitter of the Server Machine Dataset, Sklearn's MinMaxScaler as preprocessing, PCAMahalanobis as the algorithm and Accuracy as the metric.

  1. Import the Conmo modules and other dependencies:

      from sklearn.model_selection import PredefinedSplit
      from sklearn.preprocessing import MinMaxScaler

      from conmo import Experiment, Pipeline
      from conmo.algorithms import PCAMahalanobis
      from conmo.datasets import ServerMachineDataset
      from conmo.metrics import Accuracy
      from conmo.preprocesses import SklearnPreprocess
      from conmo.splitters import SklearnSplitter
    
  2. Configure the different stages of the pipeline:

      dataset = ServerMachineDataset('1-01')
      splitter = SklearnSplitter(splitter=PredefinedSplit(dataset.sklearn_predefined_split()))
      preprocesses = [
          SklearnPreprocess(to_data=True, to_labels=False,
                            test_set=True, preprocess=MinMaxScaler()),
      ]
      algorithms = [
          PCAMahalanobis()
      ]
      metrics = [
          Accuracy()
      ]
      pipeline = Pipeline(dataset, splitter, preprocesses, algorithms, metrics)
    
  3. Create an experiment with the configured pipeline. The first parameter is a list of the pipelines that will be included in the experiment; it can be one or more. The second parameter is for statistical testing between results, but this part is still under development and therefore cannot be used yet:

      experiment = Experiment([pipeline], [])
    
  4. Start running the experiment by calling the launch() method:

      experiment.launch()
    
  5. As a result of the execution of the experiment a specific folder structure will be created in ~/conmo:

/data

This directory contains the various datasets that have already been imported (downloaded and parsed) and are therefore already available for use. They are stored in parquet format for better compression. For each of the subdatasets included in each dataset, there will be a data file and a labels file.

/experiments

This directory contains all the executions of an experiment in Conmo in chronological order. Each directory corresponds to an experiment and has in its name a timestamp with the time and day when that experiment was run. Within each experiment directory there will be another directory for each pipeline, and within this one there will be as many directories as steps the pipeline contains. These folders hold the input and output data used by each step of the pipeline. They are also stored in parquet format, in the same way as the datasets in the /data folder.
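
Since both the imported datasets and the intermediate results of each pipeline step are stored as parquet files, they can be inspected directly with Pandas. A minimal sketch follows; the path below is purely illustrative, so look inside your own ~/conmo directory for the real folder and file names:

import os

import pandas as pd

# Illustrative path: replace it with an actual data or labels file found under ~/conmo
file_path = os.path.expanduser('~/conmo/data/some_dataset/some_file')

# Every file is stored in parquet format, so it can be loaded into a dataframe
df = pd.read_parquet(file_path)
print(df.head())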

Examples

A handful of example experiments can be found in the “examples” directory of the repository. These are listed below:

NASA TurboFan Degradation

This example can be found in the nasa_cmapss.py file. The chosen dataset is NASA's Turbofan engine degradation simulation data set, which is widely used in multivariate time series anomaly detection and condition monitoring problems. The splitter used is the Sklearn Predefined Split; for more information see the Scikit-Learn documentation. Regarding preprocessing, several steps are used: the Savitzky-Golay filter, RUL Imputation and Binarizer are already implemented in Conmo, while the MinMaxScaler is a Sklearn preprocessing (more information here) that has been wrapped using SklearnPreprocess. Finally, two custom preprocesses for data cleaning and label renaming have been defined using the CustomPreprocess wrapper. To create such a preprocess, just define a function that takes the data and labels Pandas dataframes as parameters. The algorithms used are dimensionality reduction with PCA together with Mahalanobis distance calculation, and One Class Support Vector Machine. Finally, the metric used is Accuracy.

import pandas as pd
from sklearn.model_selection import PredefinedSplit
from sklearn.preprocessing import MinMaxScaler

from conmo import Experiment, Pipeline
from conmo.algorithms import OneClassSVM, PCAMahalanobis
from conmo.datasets import NASATurbofanDegradation
from conmo.metrics import Accuracy
from conmo.preprocesses import (Binarizer, CustomPreprocess, RULImputation,
                                SavitzkyGolayFilter, SklearnPreprocess)
from conmo.splitters import SklearnSplitter


# First custom preprocess definition
def data_cleanup(data: pd.DataFrame, labels: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):
    # Reduce columns
    columns = ['T30', 'T50', 'P30']
    sub_data = data.loc[:, columns]

    # Rename columns
    sub_data = sub_data.rename(columns={'T50': 'TGT'})

    # Calculate FF
    sub_data.loc[:, 'FF'] = data.loc[:, 'Ps30'] * data.loc[:, 'phi']

    return sub_data, labels


# Second custom preprocess definition
def rename_labels(data: pd.DataFrame, labels: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):
    # Rename labels from 'rul' to 'anomaly'
    labels.rename(columns={'rul': 'anomaly'}, inplace=True)

    return data, labels


# Select FD001 subdataset of NASA Turbofan Degradation dataset
dataset = NASATurbofanDegradation(subdataset="FD001")

# Split dataset using the predefined dataset split
splitter = SklearnSplitter(splitter=PredefinedSplit(dataset.sklearn_predefined_split()))

# Preprocesses definition
preprocesses = [
    CustomPreprocess(data_cleanup),
    SklearnPreprocess(to_data=True, to_labels=False,
                      test_set=True, preprocess=MinMaxScaler()),
    SavitzkyGolayFilter(to_data=True, to_labels=False,
                        test_set=True, window_length=7, polyorder=2),
    RULImputation(threshold=125),
    Binarizer(to_data=False, to_labels=['rul'],
              test_set=True, threshold=50),
    CustomPreprocess(rename_labels)
]

# Algorithms definition with default parameters
algorithms = [
    PCAMahalanobis(),
    OneClassSVM()
]

metrics = [
    Accuracy()
]

# Pipeline with all steps
pipeline = Pipeline(dataset, splitter, preprocesses, algorithms, metrics)

# Experiment definition and launch
experiment = Experiment([pipeline], [])
experiment.launch()

Batteries Degradation

This experiment can be found in the file batteries_degradation.py and reproduces the results obtained in a paper that estimates the level of degradation of some types of lithium batteries. The dataset used is Batteries Degradation. It is not a time series, although it is somewhat similar, since it measures different types of degradation in three types of batteries as they are gradually used. It is a local dataset, so it is necessary to pass the path where it is located, as well as the battery chemistry to be selected (LFP) and the test set, in this case 1. The splitter used is the Sklearn Predefined Split, and there is no preprocessing since the data is already normalised when the local files are parsed to the Conmo format. The algorithms used are the same as those used in the paper: Random Forest, Multilayer Perceptron and Convolutional Neural Network. In all cases pre-trained models are used, so it is necessary to pass the path to the model files as a parameter. The metric used is Root Mean Square Percentage Error (RMSPE).

from sklearn.model_selection import PredefinedSplit

from conmo import Experiment, Pipeline
from conmo.algorithms import PretrainedRandomForest, PretrainedCNN1D, PretrainedMultilayerPerceptron
from conmo.datasets import BatteriesDataset
from conmo.metrics import RMSPE
from conmo.splitters import SklearnSplitter

# Pipeline definition
# Change the path to your local dataset files, specify the chemistry of the batteries (LFP, NCA, NMC) and the test set
dataset = BatteriesDataset('/path/to/batteries/dataset/', 'LFP', 1)
splitter = SklearnSplitter(splitter=PredefinedSplit(dataset.sklearn_predefined_split()))
preprocesses = None
# Change the paths to the files where the pre-trained models are stored (usually h5, h5py or joblib formats)
algorithms = [
    PretrainedRandomForest(pretrained=True, path='/path/to/saved/model-RF.joblib'),
    PretrainedMultilayerPerceptron(pretrained=True, input_len=128, path='/path/to/saved/model-MLP.h5'),
    PretrainedCNN1D(pretrained=True, input_len=128, path='/path/to/saved/model-CNN.h5')
]
metrics = [
    RMSPE()
]
pipeline = Pipeline(dataset, splitter, preprocesses, algorithms, metrics)

# Experiment definition and launch
experiment = Experiment([pipeline], [])
experiment.launch()

Server Machine Dataset with PCAMahalanobis

This experiment can be found in the file omni_anomaly_smd.py. The Server Machine Dataset used in this experiment was obtained from the OmniAnomaly repository. In their Github you can find more information about the dataset, as well as the implementation of other anomaly detection and time series data mining algorithms. The splitter used is the Sklearn Predefined Split and the preprocessing is the MinMaxScaler from Sklearn. The algorithm is PCA with Mahalanobis distance. Finally, the metric is Accuracy.

from sklearn.model_selection import PredefinedSplit
from sklearn.preprocessing import MinMaxScaler

from conmo import Experiment, Pipeline
from conmo.algorithms import PCAMahalanobis
from conmo.datasets import ServerMachineDataset
from conmo.metrics import Accuracy
from conmo.preprocesses import SklearnPreprocess
from conmo.splitters import SklearnSplitter

# Pipeline definition
dataset = ServerMachineDataset('1-01')
splitter = SklearnSplitter(splitter=PredefinedSplit(dataset.sklearn_predefined_split()))
preprocesses = [
    SklearnPreprocess(to_data=True, to_labels=False,
                      test_set=True, preprocess=MinMaxScaler()),
]
algorithms = [
    PCAMahalanobis()
]
metrics = [
    Accuracy()
]
pipeline = Pipeline(dataset, splitter, preprocesses, algorithms, metrics)

# Experiment definition and launch
experiment = Experiment([pipeline], [])
experiment.launch()

API Reference

This is the API Reference documentation of the package, including modules, classes and functions.

conmo.experiments

This is the main submodule of the package. It is responsible for creating the intermediate directories of the experiment and for creating and executing the configured pipeline.

experiment.Experiment(pipelines, analytics)

experiment.Pipeline(dataset, splitter, ...)

conmo.datasets

The conmo.datasets submodule takes care of downloading datasets and parsing them to Conmo's format.

datasets.dataset.Dataset(name)

Abstract base class for a Dataset.

datasets.dataset.RemoteDataset(url, ...)

Abstract base class for a RemoteDataset (downloadable).

datasets.dataset.LocalDataset(path)

Abstract base class for a LocalDataset (loadable).

datasets.MarsScienceLaboratoryMission(channel)

datasets.SoilMoistureActivePassiveSatellite(channel)

datasets.ServerMachineDataset(subdataset)

datasets.NASATurbofanDegradation(subdataset)

datasets.BatteriesDataset(path, chemistry, ...)

This is a dataset obtained from measurements of certain types of degradation of three types of batteries. Since it is a local dataset, to launch any experiment with it, it must be stored on disk with the following directory structure:

  • DTW-Li-ion-Diagnosis
    • data: data and labels for the three types of batteries are stored here
      • mat:
        • LFP:
          • diagnosis:
            • V.mat
          • test:
            • V_references.mat
            • x_test_0.mat
            • x_test_1.mat
            • x_test_2.mat
            • x_test_3.mat
            • y_test.mat
        • NCA:
          • diagnosis
          • test
        • NMC:
          • the same as NCA and LFP
          • Q.mat

conmo.splitters

Once the dataset has been loaded, it is necessary to separate the training and test parts. The conmo.splitters submodule allows you to create new splitters or use predefined ones from the Scikit-Learn library.

splitters.splitter.Splitter()

splitters.SklearnSplitter(splitter[, groups])

conmo.preprocesses

The aim of the conmo.preprocesses submodule is to apply a series of transformations to the dataset before it is used as input to the algorithms. The preprocesses implemented so far are commonly used in time series anomaly detection problems.

preprocesses.preprocess.Preprocess()

Abstract base class for a Preprocess.

preprocesses.preprocess.ExtendedPreprocess(...)

Specific class to implement preprocessing which consists of applying certain transformations on some columns of the dataset.

preprocesses.Binarizer(to_data, to_labels, ...)

preprocesses.CustomPreprocess(fn)

Core class used to implement self-created preprocess.

preprocesses.RULImputation(threshold)

preprocesses.SavitzkyGolayFilter(to_data, ...)

preprocesses.SklearnPreprocess(to_data, ...)

Class used to wrap existing preprocess in the Scikit-Learn library.

conmo.algorithms

The conmo.algorithms submodule contains everything related to algorithms in Conmo, from the abstract classes used to introduce new algorithms to the implementations of the algorithms used in the example experiments.

algorithms.algorithm.Algorithm()

algorithms.algorithm.AnomalyDetectionThresholdBasedAlgorithm(...)

algorithms.algorithm.AnomalyDetectionClassBasedAlgorithm()

algorithms.PCAMahalanobis([n_components, ...])

algorithms.OneClassSVM([kernel, degree, ...])

algorithms.KerasAutoencoder([encoding_dim, ...])

algorithms.PretrainedRandomForest(pretrained)

algorithms.PretrainedMultilayerPerceptron(...)

algorithms.PretrainedCNN1D(pretrained, input_len)

conmo.metrics

The conmo.metrics submodule contains everything necessary to add new ways of measuring the effectiveness of the implemented algorithms. Accuracy and RMSPE are currently implemented.

metrics.metric.Metric()

metrics.Accuracy([normalize])

metrics.RMSPE([normalize])

Development Guide

Possibilities of Conmo

The Conmo framework has been designed to be user-friendly not only for recreating and evaluating experiments, but also for adding new algorithms, datasets, preprocesses, etc. This section explains the possibilities offered by the framework when implementing new components. We believe that using and contributing to Conmo can benefit all types of users, as well as help to standardise comparisons between results from different scientific articles.

If you still have doubts about how to implement new components for the framework, you can take a look at the API Reference, the examples, or contact the developers.

Add a new dataset

Dataset is the core abstract class for every dataset in Conmo and contains basic methods and attributes that are common to all datasets. Two classes inherit from it and differ according to where the original data is stored:

  • LocalDataset:

    This is the abstract class in charge of handling datasets that are stored locally on the computer where Conmo will be running. The main method of this class is LocalDataset.load(), which parses the original dataset files to Conmo's format and moves them to the data folder. It is an abstract method which needs to be implemented in every local dataset. There is also an abstract method feed_pipeline() to copy the selected data to the pipeline step folder.

  • RemoteDataset:

    If the dataset to be implemented is originally located on a web server, a Git repository or other remote hosting, the RemoteDataset class is available in Conmo. Among its methods, the most notable is RemoteDataset.download(), which downloads the dataset from a remote URL.

To add a new local dataset to the framework, you need to create a new class that inherits from LocalDataset and override the following methods:

  • LocalDataset.__init__():

    This is the constructor of the class. Here you call the constructor of the parent class to assign the path to the original dataset. You can also define some attributes of the class, such as the label column names and feature names, and select the subdataset that you want to instantiate.

  • LocalDataset.dataset_files():

    This method must return a list with all the files (data and labels) that make up the dataset.

  • LocalDataset.load():

    This method must convert all raw dataset files to the appropriate format for Conmo's pipeline. First read and load the data and labels into Pandas dataframes, then concatenate them (e.g. train data and test data are concatenated into one dataframe, and the same is done for the labels) and finally save them in parquet format (see the sketch after this list). Some considerations to take into account:

    • Data and labels dataframes will have at least a multi-index for sequence and time. You can consult more information in the Pandas documentation.

    • The column index must start at 1.

    • If the dataset is only split into train and test, then there will be 2 sequences, one per set.

    • If the dataset is a time series with several sequences, the train sequences go after the test sequences.

  • LocalDataset.feed_pipeline():

    This method is used to copy the dataset from the data directory to the Dataset directory of the experiment.

  • LocalDataset.sklearn_predefined_split():

    If you plan to use the Predefined Split from the Sklearn library, your class must implement this method. It must generate an array of indices of the same length as the number of sequences, to be used with PredefinedSplit. The indices must start at 0.
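
The following sketch illustrates the index layout described above and the kind of array sklearn_predefined_split() is expected to return. The column and index names are illustrative; the CSV dataset template at the end of this guide shows a complete implementation using Conmo's own constants:

import pandas as pd

# Toy frames standing in for the train and test parts of a subdataset
train = pd.DataFrame({1: [0.1, 0.2, 0.3], 2: [1.0, 1.1, 1.2]})
test = pd.DataFrame({1: [0.4, 0.5], 2: [1.3, 1.4]})

# The row index must start at 1
train.index += 1
test.index += 1

# Two-level multi-index: an outer 'sequence' level and an inner 'time' level
# (the train part is placed in the first sequence, as in the CSV template below)
data = pd.concat([train, test], keys=[1, 2], names=['sequence', 'time'])

# With one train sequence and one test sequence, the array returned by
# sklearn_predefined_split() could be [-1, 0]:
#   -1 -> the sequence is excluded from the test fold
#    0 -> the sequence forms the test fold
predefined_split = [-1, 0]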

To add a new remote dataset to the framework, the procedure is almost identical to that of a local dataset. You need to create a new class that inherits from RemoteDataset and override the following methods (a minimal skeleton is sketched after the list):

  • RemoteDataset.__init__():

    This is the constructor of the class. Here you call the constructor of the parent class to assign the URL of the original dataset. You can also define some attributes of the class, such as the label column names, feature names, file format, URL and checksum, and select the subdataset that you want to instantiate.

  • RemoteDataset.dataset_files():

    This method must return a list with all the files (data and labels) that make up the dataset.

  • RemoteDataset.parse_to_package():

    Almost identical to LocalDataset.load().

  • RemoteDataset.feed_pipeline():

    This method is used to copy the dataset from the data directory to the Dataset directory of the experiment.

  • RemoteDataset.sklearn_predefined_split():

    If you plan to use the Predefined Split from the Sklearn library, your class must implement this method. It must generate an array of indices of the same length as the number of sequences, to be used with PredefinedSplit. The indices must start at 0.
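
A bare-bones skeleton of a remote dataset could look like the following. The constructor arguments, class constants and comments are assumptions based on the method descriptions above, not the exact signatures of the real base class; check an existing remote dataset such as ServerMachineDataset for the actual interface:

from typing import Iterable

from conmo.datasets.dataset import RemoteDataset


class MyRemoteDataset(RemoteDataset):
    # URL and checksum of the raw files (illustrative values)
    URL = 'https://example.com/my_dataset.zip'
    CHECKSUM = 'md5-checksum-of-the-archive'

    def __init__(self, subdataset: str) -> None:
        # The exact arguments expected by the parent constructor may differ
        super().__init__(self.URL)
        self.subdataset = subdataset

    def dataset_files(self) -> Iterable:
        # Return every data and labels file that makes up the dataset
        return []

    def parse_to_package(self) -> None:
        # Parse the downloaded raw files into Conmo's parquet format,
        # as LocalDataset.load() does for local datasets
        pass

    def feed_pipeline(self, out_dir: str) -> None:
        # Copy the selected subdataset files into the pipeline's Dataset folder
        pass

    def sklearn_predefined_split(self) -> Iterable[int]:
        # One index per sequence: -1 -> excluded from the test fold, 0 -> test fold
        return [-1, 0]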

Add a new algorithm

Conmo provides a core abstract class named Algorithm that contains the basic methods needed by any algorithm, mainly training on a training set, predicting over the test set, and loading and saving input and output data. Depending on how the algorithm to be implemented operates, the following subclasses are available:

  • AnomalyDetectionThresholdBasedAlgorithm:

    If your algorithm needs to calculate a threshold to determine which samples are anomalous it must inherit from this class. For example: PCA Mahalanobis.

  • AnomalyDetectionClassBasedAlgorithm:

    If your algorithm distinguishes normal sequences from anomalous ones by class, it must inherit from this class. For example: One Class SVM.

  • PretrainedAlgorithm:

    Use this class if your algorithm was pre-trained prior to running an experiment, i.e. it does not need to be trained during the experiment. You must be able to provide the path where the pre-trained model is stored on disk.

To add a new algorithm to the framework, you need to create a new class that inherits from one of these classes, depending on the type of the algorithm, and override the following methods (a rough sketch is given below):

  • __init__():

    Constructor of the class. Here you can initialise all the hyperparameters needed by the algorithm. You can also fix the random seeds of Tensorflow, Numpy, etc. here for reproducibility purposes.

  • fit_predict():

    Method responsible for building the model, training it with the training data and testing it with the test set. If your algorithm is threshold-based, it will be necessary to check whether each output in the test set exceeds the threshold to determine that it is anomalous. In the case of a class-based algorithm, the output has to be inspected to identify whether each item is normal or an anomaly. Finally, the output dataframe has to be generated with the labels by sequence or by time.

  • find_anomaly_threshold():

    If the algorithm is threshold-based, the threshold selection can be customised by overriding this method.

You can add auxiliary methods for model construction, weights loading, etc. in case the model structure is very complex.
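
As a rough sketch of what a new threshold-based algorithm could look like: the base-class name and module path come from the API Reference above, but the exact signatures of fit_predict() and find_anomaly_threshold() are assumptions, so check PCAMahalanobis in the source code for the real interface:

import numpy as np
import pandas as pd

from conmo.algorithms.algorithm import AnomalyDetectionThresholdBasedAlgorithm


class MeanDistanceDetector(AnomalyDetectionThresholdBasedAlgorithm):
    # Toy detector: a test sample is anomalous when its distance to the
    # training mean exceeds a threshold estimated on the training data.
    # Depending on the base class, a call to super().__init__() with its
    # expected arguments may also be required (see PCAMahalanobis).

    def __init__(self, percentile: float = 99.0) -> None:
        # Hyperparameters (and random seeds, if any) are initialised here
        self.percentile = percentile

    def find_anomaly_threshold(self, train_scores: np.ndarray) -> float:
        # Threshold selection can be customised; here a simple percentile is used
        return float(np.percentile(train_scores, self.percentile))

    def fit_predict(self, train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
        # "Training": compute the feature-wise mean of the training data
        center = train.mean(axis=0)

        # Score both sets with the Euclidean distance to the training mean
        train_scores = np.linalg.norm(train - center, axis=1)
        test_scores = np.linalg.norm(test - center, axis=1)

        # Samples whose score exceeds the threshold are labelled as anomalies
        threshold = self.find_anomaly_threshold(train_scores)
        return pd.DataFrame({'anomaly': test_scores > threshold}, index=test.index)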

Add a new splitter

The core abstract class is Splitter, which provides methods to load inputs, save outputs and check whether the input has already been split. To add a new splitter, you must create a new class that inherits from Splitter and implements the transform() method. If the splitter you want to use is already available in the Scikit-Learn library, we provide the SklearnSplitter class: wrapping the Scikit-Learn splitter with it allows you to use it in your experiment, as shown below.
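
For example, any Scikit-Learn splitter can be wrapped in the same way the examples above wrap PredefinedSplit. KFold is used here purely as an illustration; whether a given splitter suits your dataset depends on how its sequences are indexed:

from sklearn.model_selection import KFold

from conmo.splitters import SklearnSplitter

# Wrap a Scikit-Learn splitter so that it can be used as a pipeline step
splitter = SklearnSplitter(splitter=KFold(n_splits=5))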

Add a new preprocess

The ExtendedPreprocess class is used to implement new preprocesses in the pipeline. ExtendedPreprocess inherits from the core abstract class Preprocess and provides a constructor to define which parts of the dataset will be modified by the preprocessing: labels, data, test or train. It also permits applying the preprocess to a specific set of columns. To define a new preprocess, you only need to create a new class that inherits from ExtendedPreprocess and implements the transform() method, where the preprocessing is applied to the dataset. If the preprocess you want to use is already available in the Sklearn library, we provide the SklearnPreprocess class, which wraps the Sklearn preprocessing so it can be used in your experiment. To make things easier, the CustomPreprocess class is available to build a preprocess from a function, which is passed as an argument to the constructor (see the sketch below). For additional information you can have a look at the nasa_cmapss.py example.
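
As a quick illustration of the CustomPreprocess route: the function below is a made-up preprocessing step, and only its signature (taking and returning the data and labels dataframes) follows the pattern used in nasa_cmapss.py:

import numpy as np
import pandas as pd

from conmo.preprocesses import CustomPreprocess


# Any function that takes the data and labels dataframes and returns both
# can be wrapped as a pipeline preprocess
def log_scale(data: pd.DataFrame, labels: pd.DataFrame):
    # Apply a log transform to the data columns (assumed to be non-negative)
    return np.log1p(data), labels


preprocess = CustomPreprocess(log_scale)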

Add a new metric

You can add a new metric by creating a new class that inherits from the abstract class Metric.

The only method you have to take care of is the following (a sketch is given after the list):

  • calculate():

    Based on the outputs of the algorithms and the number of folds, the results are computed and the metrics dataframe is created and stored.
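
A minimal sketch of a new metric follows. The calculate() signature below is an assumption based on the description above, so check the implementation of Accuracy for the real interface:

import pandas as pd

from conmo.metrics.metric import Metric


class Precision(Metric):
    # Hypothetical metric: precision = true positives / predicted positives

    def calculate(self, truth: pd.DataFrame, predictions: pd.DataFrame) -> pd.DataFrame:
        # Compare predictions against the ground truth column by column
        true_positives = ((predictions == 1) & (truth == 1)).sum()
        predicted_positives = (predictions == 1).sum()
        return pd.DataFrame({'precision': true_positives / predicted_positives})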

CSV dataset import example

A very common use case that Conmo users may encounter is adding a new dataset that is stored in CSV format. For this case we have developed this small guide, which includes a template as an example. The dataset is stored locally, so it will inherit from LocalDataset. It contains three subdatasets stored in different directories; each of them contains CSV files for data and labels, both for train and test:

[Figure: directory layout of the example CSV dataset]

The template:

import os
import shutil
from os import path
from typing import Iterable

import pandas as pd

from conmo.conf import File, Index, Label
from conmo.datasets.dataset import LocalDataset


class CSV_Dataset(LocalDataset):
    # ------------------------------------------------------------------------------------ #
    # Define constants here ...                                                             #
    # ------------------------------------------------------------------------------------ #
    #
    EX_CONST = 22
    EX_SUBDATASETS = ['01', '02', '03']
    EX_COL_NAMES = ['A', 'B', 'C']

    # ------------------------------------------------------------------------------------ #
    # Constructor of the class                                                              #
    # Call the superclass constructor to pass the path where the raw dataset is stored     #
    # Here you can initialize attributes with the passed values                            #
    # and the specific subdataset to be used when instantiating                            #
    # ------------------------------------------------------------------------------------ #
    #
    def __init__(self, path: str, subdataset: str) -> None:
        super().__init__(path)
        self.path = path
        self.subdataset = subdataset

    # ------------------------------------------------------------------------------------ #
    # Loads the original CSV files into Pandas dataframes,                                 #
    # gives them the appropriate format and finally saves them to disk.                    #
    # ------------------------------------------------------------------------------------ #
    #
    def load(self) -> None:
        # SOME CONSIDERATIONS:
        # - You can use Pandas' read_csv() utility
        # - The index must start at 1, not 0
        # - Generate only one file for data and another for labels
        # - A multi-index with two levels is necessary: an outer level of sequences and an inner level of time
        # - If there is both train and test data, each of them shall form a sequence

        # Iterate over the subdirectories where the original local data is stored
        for subdataset in os.listdir(self.path):
            # ------------------------------------------------------------------------------------ #
            # Read the data CSVs and generate dataframes
            train_data = pd.read_csv(path.join(
                self.path, subdataset, 'train_data.csv'), sep=',', header=None, names=self.EX_COL_NAMES)
            test_data = pd.read_csv(path.join(
                self.path, subdataset, 'test_data.csv'), sep=',', header=None, names=self.EX_COL_NAMES)

            # Reset the index to start from 1 (Conmo's format)
            train_data.index += 1
            test_data.index += 1

            # Concatenate train and test data into one dataframe (always train data first)
            # 'Time' is an old name and needs to be updated, but the purpose is the same as a normal index
            data = pd.concat([train_data, test_data], keys=[
                             1, 2], names=[Index.SEQUENCE, Index.TIME])

            # Sort the index after concatenating
            data.sort_index(inplace=True)

            # ------------------------------------------------------------------------------------ #
            # Read the labels CSVs and generate dataframes
            train_labels = pd.read_csv(path.join(
                self.path, subdataset, 'train_labels.csv'), sep=',', header=None, names=[Label.ANOMALY])
            test_labels = pd.read_csv(path.join(
                self.path, subdataset, 'test_labels.csv'), sep=',', header=None, names=[Label.ANOMALY])

            # Reset the index to start from 1 (Conmo's format)
            train_labels.index += 1
            test_labels.index += 1

            # Concatenate train and test labels into one dataframe (always train labels first)
            labels = pd.concat([train_labels, test_labels], keys=[
                               1, 2], names=[Index.SEQUENCE, Index.TIME])

            # Sort the index after concatenating
            labels.sort_index(inplace=True)

            # ------------------------------------------------------------------------------------ #
            # Finally save the dataframes to disk in /home/{username}/conmo/data/... in parquet format
            data.to_parquet(path.join(self.dataset_dir, '{}_{}'.format(
                subdataset, File.DATA)), compression='gzip', index=True)
            labels.to_parquet(path.join(self.dataset_dir, '{}_{}'.format(
                subdataset, File.LABELS)), compression='gzip', index=True)

    # ------------------------------------------------------------------------------------ #
    # Method for adding to a list the different files that                                 #
    # belong to the dataset                                                                #
    # Usually iterates over the subdatasets                                                #
    # ------------------------------------------------------------------------------------ #
    #
    def dataset_files(self) -> Iterable:
        files = []
        for key in self.EX_SUBDATASETS:
            # Data
            files.append(path.join(self.dataset_dir,
                                   "{}_{}".format(key, File.DATA)))
            # Labels
            files.append(path.join(self.dataset_dir,
                                   "{}_{}".format(key, File.LABELS)))
        return files

    # ------------------------------------------------------------------------------------ #
    # Method for feeding the pipeline step folder                                          #
    # Copies data and labels from dataset_dir to out_dir                                   #
    # ------------------------------------------------------------------------------------ #
    #
    def feed_pipeline(self, out_dir: str) -> None:
        # Data
        shutil.copy(path.join(self.dataset_dir, "{}_{}".format(
            self.subdataset, File.DATA)), path.join(out_dir, File.DATA))
        # Labels
        shutil.copy(path.join(self.dataset_dir, "{}_{}".format(
            self.subdataset, File.LABELS)), path.join(out_dir, File.LABELS))

    # ------------------------------------------------------------------------------------ #
    # OPTIONAL: Only implement if you plan to use the                                      #
    # PredefinedSplit method of the Scikit-Learn library.                                  #
    # Returns the indexes of the sequences:                                                #
    #   -1 -> the sequence will be excluded from the test set                              #
    #    0 -> test set                                                                     #
    # ------------------------------------------------------------------------------------ #
    #
    def sklearn_predefined_split(self) -> Iterable[int]:
        return [-1, 0]

Once the class is ready, the respective import has to be added to the __init__ file and the class name to the __all__ list as follows:

from conmo.datasets.mars_science_laboratory_mission import MarsScienceLaboratoryMission
from conmo.datasets.nasa_turbofan_degradation import NASATurbofanDegradation
from conmo.datasets.server_machine_dataset import ServerMachineDataset
from conmo.datasets.soil_moisture_active_passive_satellite import SoilMoistureActivePassiveSatellite
from conmo.datasets.batteries_degradation import BatteriesDataset
# ----------------------------------------
# Add the import to the __init__ file of the module
from conmo.datasets.csv_dataset import CSV_Dataset
# ----------------------------------------

__all__ = [
    'NASATurbofanDegradation',
    'ServerMachineDataset',
    'SoilMoistureActivePassiveSatellite',
    'MarsScienceLaboratoryMission',
    'BatteriesDataset',
    # ----------------------------------------
    # Add the class name here
    'CSV_Dataset'
    # ----------------------------------------
]

Finally the dataset is ready to be used in an experiment:

import os
import sys

# Add the package to the path (only needed if you have downloaded Conmo from the Github repository)
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

from sklearn.model_selection import PredefinedSplit
from sklearn.preprocessing import MinMaxScaler

from conmo.experiment import Experiment, Pipeline
from conmo.algorithms import OneClassSVM
from conmo.datasets import CSV_Dataset
from conmo.metrics import Accuracy
from conmo.preprocesses import SklearnPreprocess
from conmo.splitters import SklearnSplitter

# Pipeline definition
dataset = CSV_Dataset('/home/lucas/conmo_test_csv', '01')
splitter = SklearnSplitter(splitter=PredefinedSplit(dataset.sklearn_predefined_split()))
preprocesses = [
    SklearnPreprocess(to_data=True, to_labels=False,
                      test_set=True, preprocess=MinMaxScaler()),
]
algorithms = [
    OneClassSVM()
]
metrics = [
    Accuracy()
]
pipeline = Pipeline(dataset, splitter, preprocesses, algorithms, metrics)

# Experiment definition and launch
experiment = Experiment([pipeline], [])
experiment.launch()

Coding conventions

The following tools are used to ensure that new software being added to Conmo meets minimum quality and format requirements:

  • Autopep8: We use this tool to automatically format our Python code to conform to the PEP 8 style guide. It uses the pycodestyle utility to determine which parts of the code need to be formatted.

  • Isort: We use this library to sort imports alphabetically and automatically separate them into sections and by type.

  • Pytest: To ensure that the output format of a newly designed step (algorithm, dataset, etc.) is correct, we use the Pytest framework to test the new code. This testing framework is easy to use and supports complex testing at the same time. At the moment we are finishing the implementation of tests on the existing code, so some parts may be modified in future updates. Typical invocations of all three tools are shown below.
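
For reference (the paths are examples; check each tool's documentation for other options):

autopep8 --in-place --recursive conmo/
isort conmo/
pytest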

Known Issues & Limitations

We are aware that Conmo is still at a very early stage of development, so it is likely that various bugs will appear as its use increases. Detected bugs will be published on this page to make it easier for users to avoid them. The Conmo development team is actively looking for and fixing any detected bugs. If you find a bug or issue that does not appear on this list, we would be grateful if you could email us at mym.inv.uniovi@gmail.com or post an issue on our Github. Thanks in advance.

  • 001_split (Severity: Low): There are some problems with the use of Scikit-Learn's Time Series Splitter in the experiments. We are working on resolving them.

  • 002_rul (Severity: Medium): The rul_rve.py example seems to fail during the metric calculation step.

  • 003_tf (Severity: Medium): If your computer has one of the new Apple processors (M1 or M2) with ARM-based architecture, it is likely that the Tensorflow dependency will fail when you try to use Conmo. To fix this temporarily, you can install Conmo without dependencies ('pip install --no-deps conmo') and then manually install tensorflow-macos, the branch provided by Google for ARM architectures.

Frequently Asked Questions

How can I contribute to Conmo?

Depending on your profile and your intended use, you can contribute in different ways. The simplest way to contribute to Conmo is to use it to reproduce some experiments and then cite it. However, you can also contribute by implementing new algorithms, datasets, etc., which can then be used by everyone to perform experiments. Finally, reporting bugs in the functioning of Conmo can also be considered a way of collaborating with the project.

I don’t have a great knowledge of programming, can I still use Conmo?

Conmo intends to focus on all types of scientists, regardless of their specialisation. Generally speaking, we can distinguish two types of people who will use Conmo:

  1. People who just want to reproduce experiments that are already integrated only need basic programming knowledge, since Python is a simple programming language and most of the complexity has been encapsulated.

  2. People who want to collaborate by adding new algorithms, datasets, etc. need more in-depth programming knowledge, in particular of the Python language and object-oriented programming. However, the Conmo development team is actively looking for ways to simplify this kind of task.

Release Notes

Conmo 1.0.1

Small documentation and CI errors fixed.

Conmo 1.0.0

First publicly available version.