conmo.datasets.dataset.RemoteDataset

class conmo.datasets.dataset.RemoteDataset(url: str, file_format: str, checksum: str, checksum_format: str)[source]

Abstract base class for a remote (downloadable) dataset.

__init__(url: str, file_format: str, checksum: str, checksum_format: str) None[source]

Main constructor of the class.

Parameters

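  • url (str) – URL from which the dataset is downloaded.

  • file_format (str) – Format of the downloaded file (for example, zip).

  • checksum (str) – Expected checksum of the downloaded file.

  • checksum_format (str) – Algorithm used to compute the checksum (currently only md5 is supported).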

check_checksum(response: object) bool[source]

Checks whether the checksum of the downloaded file matches the one provided to the class, for security and integrity purposes. Currently only the md5 algorithm is supported.

Parameters

response (Object) – Response object returned by the get method of the Requests library.

Return type

bool – True if the hash of the downloaded file matches the provided checksum, False otherwise.
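The comparison described above can be sketched with the standard library. This is a minimal illustration and not the Conmo implementation: the helper name `md5_matches` is hypothetical, and it works on raw bytes, which with the Requests library would typically come from `response.content`.

```python
import hashlib


def md5_matches(content: bytes, expected_checksum: str) -> bool:
    """Return True if the MD5 hex digest of `content` equals `expected_checksum`.

    Hypothetical helper for illustration; not part of the Conmo API.
    """
    computed = hashlib.md5(content).hexdigest()
    # Normalise the expected value, since hex digests are sometimes
    # published in upper case or with surrounding whitespace.
    return computed == expected_checksum.strip().lower()


# With Requests, the payload bytes would be `response.content`.
payload = b"hello"
print(md5_matches(payload, "5d41402abc4b2a76b9719d911017c592"))
```

MD5 detects accidental corruption of a download, but it is not collision resistant, so it should not be relied on as protection against deliberate tampering.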

abstract dataset_files() Iterable

Iterable of files included in the dataset.

download(out_dir: str) None[source]

Download a Dataset from a remote URL.

extract_data(response: object, out_dir: str) None[source]

Extracts the contents of a compressed file in zip format.

Parameters
  • response (Object) – Response object returned by the get method of the Requests library.

  • out_dir (str) – Directory where the zip file will be unzipped.
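As a rough sketch of this step, a zip archive held in memory can be unpacked with the standard `zipfile` module. The function name `extract_zip_bytes` is hypothetical and the code is an assumption about how such extraction could be done, not the library's own code; it takes raw bytes, which with Requests would typically be `response.content`.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(content: bytes, out_dir: str) -> list:
    """Unzip an in-memory zip archive into `out_dir` and return the member names.

    Hypothetical helper for illustration; not part of the Conmo API.
    """
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Build a tiny archive in memory to stand in for a downloaded payload.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/sample.csv", "a,b\n1,2\n")

with tempfile.TemporaryDirectory() as tmp:
    names = extract_zip_bytes(buffer.getvalue(), tmp)
    print(names)
```

Wrapping the bytes in `io.BytesIO` avoids writing the archive to disk before extracting it.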

abstract feed_pipeline(out_dir: str) None[source]

Copy selected data file to pipeline step folder.

fetch(out_dir: str) None[source]

Fetch data to feed the pipeline.

Parameters

out_dir (str) – Directory where the dataset will be stored.

is_dataset_ready() bool

Check whether the dataset has already been loaded/downloaded and parsed to package format.

abstract parse_to_package(raw_dir: str) None[source]

Parse raw dataset to package format. Data and labels must be saved in parquet format. More information about parquet format: https://parquet.apache.org/

Parameters

raw_dir (str) – Directory where the dataset was downloaded from its source.

show_start_message() None

Show starting step info message.

Methods

__init__(url, file_format, checksum, ...)

Main constructor of the class.

check_checksum(response)

Checks if the checksum of the downloaded file corresponds to the one provided in the class.

dataset_files()

Iterable of files included in the dataset.

download(out_dir)

Download a Dataset from a remote URL.

extract_data(response, out_dir)

Extracts the contents of a compressed file in zip format.

feed_pipeline(out_dir)

Copy selected data file to pipeline step folder.

fetch(out_dir)

Fetch data to feed the pipeline.

is_dataset_ready()

Check whether the dataset has already been loaded/downloaded and parsed to package format.

parse_to_package(raw_dir)

Parse raw dataset to package format.

show_start_message()

Show starting step info message.