TwirlFlake

Homological Time Series Analysis of Sensor Signals from Power Plants

License · Code style: black

In this paper, we use topological data analysis techniques to construct a suitable neural network classifier for the task of learning sensor signals of entire power plants according to their reference designation system. We use representations of persistence diagrams to derive necessary preprocessing steps and visualize the large amounts of data. We derive architectures with deep one-dimensional convolutional layers combined with stacked long short-term memories as residual networks suitable for processing the persistence features. We combine three separate sub-networks, obtaining as input the time series itself and a representation of the persistent homology for the zeroth and first dimension. We give a mathematical derivation for most of the used hyper-parameters. For validation, numerical experiments were performed with sensor data from four power plants of the same construction type.

Keywords: Power Plants · Time Series Analysis · Signal processing · Geometric embeddings · Persistent homology · Topological data analysis.

Citation

@inproceedings{ML4ITS/MelodiaL21,
  author    = {Luciano Melodia and
               Richard Lenz},
  editor    = {Kamp M. et al.},
  title     = {Homological Time Series Analysis of Sensor Signals from Power Plants},
  booktitle = {Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2021},
  series    = {Communications in Computer and Information Science},
  volume    = {1524},
  pages     = {283--299},
  publisher = {Springer},
  year      = {2021},
  url       = {https://doi.org/10.1007/978-3-030-93736-2_22},
  doi       = {10.1007/978-3-030-93736-2_22},
}

Contents

  1. Time Series Helper timeSeriesHelper.py
  2. Time Series Converter timeSeriesConverter.py - persistence_giotto_to_matplotlib
  3. Time Series Generator timeSeriesGenerator.py
  4. Time Series Embedding timeSeriesEmbedding.py
  5. Time Series Homology timeSeriesHomology.py
  6. Time Series Visualisation timeSeriesVisualisation.py
  7. Time Series Models timeSeriesModels.py
  8. Time Series Classifier timeSeriesClassifier.py

timeSeriesHelper

zip_to_csv

zip_to_csv(path: str)

Converts a packed sql file to a csv file.

This function unpacks the specified zip file in place and calls sql_to_csv on the extracted sql file, which has the same name as the zip file.

zip_to_npy

zip_to_npy(path: str)

Converts a packed sql file to an npy file.

This function unpacks the specified zip file in place and calls sql_to_npy on the extracted sql file, which has the same name as the zip file.

sql_to_csv

sql_to_csv(path: str, delimiter: str = "\n")

Converts a set of INSERT statements to csv format.

This function extracts the data from a set of INSERT statements stored in a sql file and converts it into a csv file. Each non-INSERT line is stored in a separate pickle file, and the data of the INSERT statements is written line by line, with the specified delimiter at the end of each line.

sql_to_npy

sql_to_npy(path: str, delimiter: str = ",")

Converts a set of INSERT statements into a numpy array.

Similar to the csv function, this function also stores unused data in a pickle file and creates a new file with the extracted data, this time in npy format. Here the delimiter must be the delimiter used in the sql file, plus an additional missing_values string used to represent missing data.

csv_to_sql

csv_to_sql(path: str, delimiter: str = "\n")

Converts a csv file into a set of INSERT statements.

This function converts each record of a csv file, separated by the specified delimiter character, into an INSERT statement. It also inserts the data stored in a pickle file of the same name as the csv file as a comment at the beginning, so as not to interfere with functionality.

csv_to_npy

csv_to_npy(path: str, delimiter: str = ",")

Converts a csv file to a Numpy array representation.

This function converts a csv file into a 2-dimensional Numpy representation, where each record separated by the specified delimiter is interpreted as a new line.

npy_to_sql

npy_to_sql(path: str)

Converts an npy file into a series of INSERT statements.

This function is the reverse of sql_to_npy, and if you use the two in conjunction you end up with the same file you started with.

gen_GAF

gen_GAF(path: str)

Generate a Gramian Angular Field with user input.

This function receives user input via the console to generate either a Gramian Angular Summation Field or a Gramian Angular Difference Field from the data of a Numpy array using the function gen_GAF_exec.

gen_GAF_exec

gen_GAF_exec(
    data: list,
    sample_range: None or tuple = (-1, 1),
    method: str = "summation",
    null_value: str = "0",
)

Generate a Gramian Angular Field.

This is the function that actually generates a Gramian Angular Field from the data of a Numpy array. It takes several parameters that determine how the field is scaled, what size it has, and whether it is a summation or a difference field.
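For illustration, such a field can be produced in a few lines with the pyts library; this is only a sketch and an assumption about the tooling, not necessarily how gen_GAF_exec is implemented:

import numpy as np
from pyts.image import GramianAngularField  # assumed helper library

# one sensor signal, shape (n_samples, n_timestamps)
signal = np.sin(np.linspace(0, 4 * np.pi, 200)).reshape(1, -1)

gaf = GramianAngularField(sample_range=(-1, 1), method="summation")
field = gaf.fit_transform(signal)  # shape (1, 200, 200)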

false_input

false_input(path: str)

Output error information and return to main.

This function prints an error message to the console and invokes main.

exit

exit()

Print message and exit program.

This function prints a message on the console and terminates the program.

switchoption

switchoption(n: int, path: str)

Call a function.

This function calls one of the functions of this program according to the number n and passes it the path as input.

checkpath

checkpath(path: str)

Check whether the path is relative or absolute.

This function removes all quotes from a path and checks whether it is relative or absolute. It returns a cleaned path which is the absolute representation of the given path.

split_csv_file

split_csv_file(
    path: str, header: bool = False, column: int = 1, delimiter: str = ";"
) -> bool

Splits a csv file according to a column.

Takes as input a csv file path. Groups the file according to a certain label and stores the data into multiple files named after the label.
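A minimal sketch of this kind of split, assuming pandas is available and the label is found in the given column (the actual implementation may well read the file line by line instead):

import pandas as pd

def split_by_column(path: str, column: int = 1, delimiter: str = ";", header: bool = False):
    # read the csv and group its rows by the values found in `column`
    df = pd.read_csv(path, sep=delimiter, header=0 if header else None)
    for label, group in df.groupby(df.columns[column]):
        group.to_csv(f"{label}.csv", sep=delimiter, index=False, header=header)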

create_datasets

create_datasets(root: str)

Splits the various files into directories.

This function recursively lists all files in a directory and divides them into the hard-coded folder structure for persistence diagrams, heat kernels, the embeddings, the persistent silhouette, and the Betti curves.

timeSeriesConverter

persistence_giotto_to_matplotlib

persistence_giotto_to_matplotlib(
    diagram: np.ndarray, plot: bool = True, tikz: bool = True
) -> np.ndarray

Helper function to convert a giotto-tda persistence diagram into a matplotlib plot.

giotto-tda uses plotly in a proprietary way. The plotting function is part of the pipeline and not accessible as an object. We use the coordinates returned by the function and create our own matplotlib plot. Currently the scales are lost.
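A bare-bones version of such a conversion could look as follows; it assumes the diagram has already been reduced to a 2D array of (birth, death, homology_dimension) triples, which is the point convention used by giotto-tda:

import numpy as np
import matplotlib.pyplot as plt

def plot_diagram_mpl(diagram: np.ndarray):
    # diagram: array of (birth, death, homology_dimension) triples
    for dim in np.unique(diagram[:, 2]):
        points = diagram[diagram[:, 2] == dim]
        plt.scatter(points[:, 0], points[:, 1], s=10, label=f"H{int(dim)}")
    lim = diagram[:, :2].max()
    plt.plot([0, lim], [0, lim], "k--", linewidth=0.5)  # the diagonal
    plt.xlabel("birth")
    plt.ylabel("death")
    plt.legend()
    plt.show()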

timeSeriesGenerator

chunkIt

chunkIt(seq, num)

Chunks a list into a partition of specified size.

numpy_to_img

def numpy_to_img(
    directory: str,
    target: str,
    filetype: str = ".npy",
    target_filetype: str = ".png",
    color_mode: str = "RGB",
)

Converts a set of numpy arrays with the form (x,x,3) to RGB images.

Example:

    numpy_to_img("./data","./images")

create_timeseries_dataset_persistence_images

create_timeseries_dataset_persistence_images(
    path_to_data: str,
    target: str = cfg.paths["data"],
    label_delimiter: str = "@",
    file_extension_marker: str = ".",
    embedding_dimension: int = 3,
    embedding_time_delay: int = 1,
    stride: int = 1,
    delimiter: str = ",",
    n_jobs: int = -1,
    n_bins: int = 100,
    column: int = 3,
    homology_dimensions: tuple = (0, 1, 2),
    delete_old: bool = True,
    window_size: int = 200,
    filtering_epsilon: float = 0.23,
)

Creates, from a directory with time series .csv data, a directory structure for use with Keras.

We assume the following folder structure: in a root folder there are several subfolders, and in each of these subfolders there are .csv files. The subfolders themselves have no meaning for the classification or the assignment of the names. The files are named as they will be labeled later. Optionally the filename can be extended; in that case the label should be placed after a @ in the file name. The files are loaded and transformed into an n-dimensional torus using Takens' embedding. We create artificial examples starting from this embedding. The persistent homology and the resulting persistence images are then generated for each example from a time series. A corresponding folder is created in which the generated persistence image is stored. The file is numbered and stored in the folder under its name.
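The core of such a pipeline can be sketched with giotto-tda; the parameter names below mirror the function signature, while the windowing, labeling and storage logic of the actual implementation is omitted and the input file is hypothetical:

import numpy as np
from gtda.time_series import SingleTakensEmbedding
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import PersistenceImage

signal = np.load("example_signal.npy")  # hypothetical 1D sensor signal

# embed the signal into a point cloud (the torus-like embedding mentioned above)
embedder = SingleTakensEmbedding(parameters_type="fixed", dimension=3, time_delay=1, stride=1)
point_cloud = embedder.fit_transform(signal)

# compute persistent homology of the embedded signal
vr = VietorisRipsPersistence(homology_dimensions=(0, 1, 2), n_jobs=-1)
diagrams = vr.fit_transform(point_cloud[None, :, :])  # add the sample axis expected by gtda

# turn the diagrams into persistence images, one per homology dimension
pi = PersistenceImage(n_bins=100, n_jobs=-1)
images = pi.fit_transform(diagrams)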

create_timeseries_dataset_ts_betti_curves

create_timeseries_dataset_ts_betti_curves(
    path_to_data: str,
    target: str = cfg.paths["data"],
    sample_size: int = 200,
    minimum_ts_size: int = 2 * 10 ** 3,
    class_size: int = 10 ** 3,
    label_delimiter: str = "@",
    file_extension_marker: str = ".",
    delimiter: str = ",",
    n_jobs: int = -1,
    column: int = 3,
    homology_dimensions: tuple = (0, 1, 2),
    color_mode: str = "L",
    saveasnumpy: bool = False,
)

Creates, from a directory of time series .csv data, a directory for use with the Keras ImageDataGenerator.

This function generates from a dataset consisting of .csv files a pixel-oriented collection of multivariate time series, each consisting of the original signal section and the persistent Betti curve of this section. Each line of the image corresponds to a time series, i.e. the first line corresponds to the original signal and the second line to the topological progression in the form of the Betti curve. These are grouped in folders named after their classes. The classes must be stored as values in a separate column for each entry within the csv file. A corresponding folder is then created, which is compatible with the ImageDataGenerator of Keras.

Example of usage:

create_timeseries_dataset_ts_betti_curves(cfg.paths["split_ordered"], cfg.paths["data"])
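How a Betti curve row for such an image could be produced is sketched below with giotto-tda; the stacking of signal and curve into one image is simplified compared to the real function, and the input files are hypothetical:

import numpy as np
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import BettiCurve

window = np.load("window.npy")             # hypothetical embedded window, shape (n_points, dim)
vr = VietorisRipsPersistence(homology_dimensions=(0, 1, 2))
diagram = vr.fit_transform(window[None, :, :])

betti = BettiCurve(n_bins=200)             # one value per filtration step
curves = betti.fit_transform(diagram)      # shape (1, n_homology_dimensions, 200)

# stack the raw signal section and one Betti curve as two rows of a grayscale image
signal_row = np.load("signal_window.npy")  # hypothetical raw signal section of length 200
image = np.vstack([signal_row, curves[0, 0]])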

timeSeriesEmbedding

list_files_in_dir

list_files_in_dir(directory: str)

Lists all files inside a directory.

Simple function that lists all files inside a directory as str.

read_csv_data

read_csv_data(path: str, delimiter: str = ",") -> np.ndarray

Reads .csv files into an np.ndarray.

Convert all columns of a .csv file into an np.ndarray.

fit_embedder

fit_embedder(y: np.ndarray, embedder: callable, verbose: bool = True) -> tuple

Fits a Takens embedding and displays optimal search parameters.

Determines the optimal parameters for the searched embedding according to the theory of toroidal embeddings resulting from the discrete Fourier transform.
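With giotto-tda, fitting such an embedder and reading off the found parameters could look roughly like this; the verbose reporting of the real helper is reduced to two prints and the input file is hypothetical:

import numpy as np
from gtda.time_series import SingleTakensEmbedding

signal = np.load("signal.npy")  # hypothetical 1D time series

embedder = SingleTakensEmbedding(parameters_type="search",
                                 dimension=11, time_delay=3, stride=2)
embedded = embedder.fit_transform(signal)

# after fitting, the parameters found by the search are available as attributes
print("optimal embedding dimension:", embedder.dimension_)
print("optimal time delay:", embedder.time_delay_)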

get_single_signal

get_single_signal(index: int, file: int, plot: bool = False) -> np.ndarray

Gets a column directly from the .csv file.

Extracts a column from a .csv file according to the index and returns it as an np.ndarray for further processing.

get_sliding_window_embedding

get_sliding_window_embedding(
    index: int,
    file: int,
    width: int = 2,
    stride: int = 3,
    plot: bool = False,
    L: float = 1,
    B: float = 0,
    plot_signal: bool = False,
) -> np.ndarray

Sliding window embedding in a commutative Lie group.

This is an embedding which provably yields a commutative Lie group as an embedding space, which is a smooth manifold with group structure. It is a connected manifold, so it has suitable properties to infer the dimension of homology groups. It is intuitive and can be used to detect periodicities, since it has direct connections to the theory of Fourier series.
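A plain sliding window (delay) embedding can be sketched in a few lines of numpy; the roles of the L and B parameters of the actual function are not reproduced here, so the construction below is only an assumption about its core step:

import numpy as np

def sliding_window_embedding(signal: np.ndarray, width: int = 2, stride: int = 3) -> np.ndarray:
    # each row is one window of `width` consecutive samples, taken every `stride` steps
    n_windows = (len(signal) - width) // stride + 1
    return np.stack([signal[i * stride : i * stride + width] for i in range(n_windows)])

signal = np.sin(np.linspace(0, 6 * np.pi, 300))
windows = sliding_window_embedding(signal, width=2, stride=3)  # shape (n_windows, 2)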

get_periodic_embedding

get_periodic_embedding(
    index: int = 3,
    file: int = 1,
    plot_signal: bool = False,
    parameters_type: str = "fixed",
    max_time_delay: int = 3,
    max_embedding_dimension: int = 11,
    stride: int = 2,
    plot: bool = False,
    store: bool = False,
    fourier_transformed: bool = False,
)

Fits a single Takens embedder and displays optimal search parameters.

This function uses a search algorithm to obtain optimal parameters for a time series embedding. The search can be skipped if the parameters_type parameter is set to fixed. The time delay and the embedding dimension are determined by the algorithm. Optionally, the embedded time series signal can be stored as a .npy file by setting the store parameter to True.

count_dimension_distribution

count_dimension_distribution(
    path: str = cfg.paths["split"] + "**/*",
    recurr: bool = True,
    dimensions_dict: dict = {"1": 0, "2": 0, "3": 0, "4": 0, "5": 0},
    delays_dict: dict = {"1": 0, "2": 0, "3": 0, "4": 0, "5": 0},
    keyword: str = "_embedded_",
    position_dimension: int = -7,
    position_delay: int = -5,
)

Determine the dimension used for embedding from the filenames and count them.

We encode the embedding dimension and the time delay in the filename of the processed files representing the time series. The values for dimension and time delay determined for each signal by our algorithm are passed along this way. This function returns a tuple of dictionaries with these counts.

timeSeriesHomology

compute_persistence_representations

compute_persistence_representations(
    path: str,
    parameters_type: str = "fixed",
    filetype: str = ".csv",
    delimiter: str = ",",
    n_jobs: int = -1,
    embedding_dimension: int = 3,
    embedding_time_delay: int = 5,
    stride: int = 10,
    index: int = 3,
    enrico_betti: bool = True,
    enrico_silhouette: bool = True,
    enrico_heatkernel: bool = True,
    n_bins: int = 100,
    store: bool = True,
    truncate: int = 3000,
) -> np.ndarray

This procedure computes persistent homology representations for some data.

This is a collection of representations from giotto-tda. The examined folder structure has two sublevels. We find each file in this two-level folder structure and compute all desired persistence representations for a fixed hyperparameter setting. The hyperparameters must be estimated beforehand. Optionally the persistence diagram, the silhouette, a persistence heat kernel, and the persistent Betti curve are stored together with the embedded signal. The embedding dimension and time delay are encoded in the filename as embedding dimension - time delay.
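The representations correspond to standard giotto-tda transformers; a condensed sketch, without the directory traversal and storage logic of the real procedure, could look like this:

from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import BettiCurve, Silhouette, HeatKernel

def representations_for(point_cloud, homology_dimensions=(0, 1, 2), n_bins=100):
    # point_cloud: embedded signal of shape (n_points, embedding_dimension)
    vr = VietorisRipsPersistence(homology_dimensions=homology_dimensions, n_jobs=-1)
    diagrams = vr.fit_transform(point_cloud[None, :, :])
    return {
        "diagram": diagrams,
        "betti": BettiCurve(n_bins=n_bins).fit_transform(diagrams),
        "silhouette": Silhouette(n_bins=n_bins).fit_transform(diagrams),
        "heatkernel": HeatKernel(n_bins=n_bins).fit_transform(diagrams),
    }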

timeSeriesVisualisation

export_cmap_to_cpt

export_cmap_to_cpt(
    cmap,
    vmin: float = 0,
    vmax: float = 1,
    N: int = 255,
    filename: str = "test.cpt",
    **kwargs
)

Exports a custom matplotlib color map to files.

Generates a color map for matplotlib on a desired normalized interval. The map is not returned, but saved as a text file. The default name for this file is test.cpt. This file contains the information for the matplotlib color map and can then be loaded.

plot_embedding3D

plot_embedding3D(path: str)

Plots embeddings with dimension 3 and time delay 1 iteratively within a directory.

Plots a set of embeddings always with three dimensions and a time delay of 1. This can be changed arbitrarily, according to the estimated parameters.

mean_heatkernel2D

mean_heatkernel2D(
    directory: str,
    homgroup: int,
    limit: int = 0,
    store: bool = False,
    plot: bool = False,
    filename: str = "figure",
    filetype: str = "svg",
    colormap: str = "Reds",
)

Calculates a mean heat kernel over a large collection of files in a directory.

Calculates a mean heat kernel map from a folder full of .npy files with heat maps. This can optionally be saved or displayed as a plot in the browser.
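Averaging the stored kernels amounts to a running mean over the arrays; a minimal sketch, ignoring the homology-group filtering and the plotting of the real function, might be:

import glob
import os
import numpy as np

def mean_heatkernel(directory: str, limit: int = 0) -> np.ndarray:
    files = sorted(glob.glob(os.path.join(directory, "*.npy")))
    if limit:
        files = files[:limit]
    # accumulate the sum of all kernels and divide by their number
    total = sum(np.load(f) for f in files)
    return total / len(files)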

massive_surface_plot3D

massive_surface_plot3D(
    directory: str,
    homgroup: int,
    title: str = "Default",
    limit: int = 45000,
    store: bool = False,
    plotting: bool = False,
    filename: str = "figure",
    filetype: str = "svg",
    colormap: str = "Reds",
)

Calculates a solid surface from curves.

Calculates a surface from a directory full of .npy files of curves (intended for Betti curves from persistence diagrams). For the x and y coordinates, the corresponding indices of the Betti curves themselves and the filtration index are selected. The value of the function is then visible on the z axis. Optionally, these can be displayed as a graph in the browser or also saved.

Example:

massive_surface_plot3D(
    "/home/lume/documents/siemens_samples/power_plant_silhouette/",
    homgroup=1,
    store=True,
    plotting=True,
    limit=1000
)

massive_line_plot3D

massive_line_plot3D(
    directory: str,
    homgroup: int,
    resolution: int = 300,
    k: int = 3,
    limit: int = 0,
    elev: float = 20,
    azim: int = 135,
    KKS: str = "LBB",
    fignum: int = 0,
    plot: bool = False,
)

Creates a massive line plot from a directory of .npy files that contain the data.

This function creates a line plot from a set of .npy files. The plot is three-dimensional and each line is drawn along the z axis, while the other two axes represent the plot index and the time step. It is assumed that each .npy file stores a one-dimensional array. The method iterates over a directory of .npy files, each of which contains a one-dimensional time series.

Examples:

massive_line_plot3D(
    directory="/home/lume/documents/siemens_samples/kraftwerk_betticurve/", homgroup=0
)

Example of multiple plots of Betti curves / persistence silhouettes:

count = 0
for i in cfg.pptcat:
    massive_line_plot3D(
        directory="/home/lume/documents/siemens_kraftwerk_samples/kraftwerk_bettikurve/",
        homgroup=0,
        KKS=i,
        fignum=count,
    )
    count += 1
plt.show()
plt.close()

timeSeriesModels

convolution_layer

convolution_layer(
    x,
    name: str,
    i: int,
    filters: int = cfg.cnnmodel["filters"],
    kernel_size: int = cfg.cnnmodel["kernel_size"],
    activation: str = cfg.cnnmodel["activation"],
    padding: str = cfg.cnnmodel["padding"],
    l1_regu: float = cfg.cnnmodel["l1"],
    l2_regu: float = cfg.cnnmodel["l2"],
    batchnorm: bool = True,
)

This is a convolution layer with batch normalization.

Here we implement a convolutional layer with activation function, padding, optional filter and stride parameters, and some regularizers on top of the Keras framework. This function is used to recursively define a neural network architecture.
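A rough stand-alone equivalent of such a layer in Keras could look like this; the cfg defaults and the naming scheme of the repository are replaced by plain arguments, so this is a sketch rather than the actual implementation:

import tensorflow as tf

def conv_block(x, name: str, i: int, filters: int = 64, kernel_size: int = 3,
               activation: str = "relu", padding: str = "same",
               l1_regu: float = 0.0, l2_regu: float = 0.0, batchnorm: bool = True):
    # 1D convolution with L1/L2 weight regularization
    x = tf.keras.layers.Conv1D(
        filters, kernel_size, padding=padding,
        kernel_regularizer=tf.keras.regularizers.l1_l2(l1=l1_regu, l2=l2_regu),
        name=f"{name}_conv_{i}")(x)
    if batchnorm:
        x = tf.keras.layers.BatchNormalization(name=f"{name}_bn_{i}")(x)
    return tf.keras.layers.Activation(activation, name=f"{name}_act_{i}")(x)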

lstm_layer

lstm_layer(
    x,
    name: str,
    i: int,
    units: int = cfg.lstmmodel["units"],
    return_state: bool = cfg.lstmmodel["return_state"],
    go_backwards: bool = cfg.lstmmodel["go_backwards"],
    stateful: bool = cfg.lstmmodel["stateful"],
    l1_regu: float = cfg.lstmmodel["l1"],
    l2_regu: float = cfg.lstmmodel["l2"],
)

This is a CuDNNLSTM layer.

We implement here a CuDNNLSTM layer with configurable units, state handling, direction and some regularizers on top of the Keras framework. This function is used to recursively define a neural network architecture.

create_lstm

create_lstm(
    x,
    name: str,
    layers: int,
    differential: int,
    max_pool: list,
    avg_pool: list,
    dropout_rate: float = cfg.lstmmodel["dropout_rate"],
    pooling_size: int = cfg.lstmmodel["pooling_size"],
)

This function returns one of the subnetworks of the residual CuDNNLSTM.

This function generates the core neural network according to the theory of Ck-differentiable neural networks. It applies residual connections according to the degree of differentiability k and adds dropout and pooling layers at the desired locations in the architecture. Some of the specifications can be passed as arguments in order to perform hyperparameter optimization.
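A heavily reduced sketch of such a residual LSTM stack in Keras, with a skip connection every `differential` layers and the dropout and pooling positions omitted, might be:

import tensorflow as tf

def residual_lstm_stack(x, name: str, layers: int, differential: int, units: int = 64):
    shortcut = x
    for i in range(layers):
        x = tf.keras.layers.LSTM(units, return_sequences=True,
                                 name=f"{name}_lstm_{i}")(x)
        # add a residual (skip) connection every `differential` layers
        if (i + 1) % differential == 0:
            shortcut = tf.keras.layers.Dense(units, name=f"{name}_proj_{i}")(shortcut)
            x = tf.keras.layers.Add(name=f"{name}_add_{i}")([x, shortcut])
            shortcut = x
    return x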

create_subnet

create_subnet(
    x,
    name: str,
    layers: int,
    differential: int,
    dropouts: list,
    max_pool: list,
    avg_pool: list,
    dropout_rate: float = cfg.cnnmodel["dropout_rate"],
    pooling_size: int = cfg.cnnmodel["pooling_size"],
    batchnorm: bool = True,
)

This function returns one of the subnets of the residual CNN.

This function generates the core neural network according to the theory of Ck-differentiable neural networks. It applies residual connections according to the degree of differentiability k and adds dropout and pooling layers at the desired locations in the architecture. Some of the specifications can be passed as arguments in order to perform hyperparameter optimization. Currently it is only implemented for C1-architectures.

create_c2_1DCNN_model

create_three_nets(shape, classes_number, image_size) -> callable

The CNN architecture for classifying the power plant sensor data.

This function builds three subnets with the same architecture. The purpose is to process the raw data in one of the subnets, and in the other subnets the Betti curves of the zeroth and first homology group, which are generated during the filtration of the sample. The results are summed within a final layer and a decision for a class is made by a usual vanilla dense layer.
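Structurally this corresponds to a three-input Keras model whose branches are merged before the classifier. The sketch below reduces each branch to a single convolution and pooling step and is only schematic, not the architecture derived in the paper:

import tensorflow as tf

def three_branch_model(shape, classes_number):
    inputs, branches = [], []
    for branch_name in ("raw_signal", "betti_h0", "betti_h1"):
        inp = tf.keras.Input(shape=shape, name=branch_name)
        x = tf.keras.layers.Conv1D(32, 3, padding="same", activation="relu",
                                   name=f"{branch_name}_conv")(inp)
        x = tf.keras.layers.GlobalAveragePooling1D(name=f"{branch_name}_gap")(x)
        inputs.append(inp)
        branches.append(x)
    merged = tf.keras.layers.Add(name="merge_sum")(branches)  # the branch outputs are summed
    out = tf.keras.layers.Dense(classes_number, activation="softmax", name="classifier")(merged)
    return tf.keras.Model(inputs=inputs, outputs=out)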

timeSeriesClassifier

recall

recall(y_true, y_pred)

Compute recall of the classification task.

Calculates the number of true positives among the classified samples divided by the sum of true positives and false negatives.

precision

precision(y_true, y_pred)

Compute precision of the classification task.

Calculates the number of true positives among the classified samples divided by the sum of true positives and false positives.

f1

f1(y_true, y_pred)

Compute f1-score of the classification task.

Calculates the f1 score, which is two times the precision times the recall divided by the precision plus the recall. A smoothing constant is used to avoid division by zero in the denominator.
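These metrics are commonly implemented directly on the Keras backend; the sketch below follows that common pattern and is not necessarily identical to the repository's code:

import tensorflow.keras.backend as K

def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())

def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r + K.epsilon())  # epsilon avoids division by zero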

check_files

check_files(folder_path: str)

This function checks the compatibility of the images with pillow.

We read the files with pillow and check for errors. If this function traverses the whole directory without stopping, the dataset will work fine with the given version of pillow and the ImageDataGenerator class of Keras.
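A minimal version of such a check with Pillow; printing the offending files instead of stopping is a simplification of the behaviour described above:

import os
from PIL import Image

def check_files(folder_path: str):
    for root, _, files in os.walk(folder_path):
        for name in files:
            path = os.path.join(root, name)
            try:
                with Image.open(path) as img:
                    img.verify()  # raises if the file is truncated or not a valid image
            except Exception as exc:
                print(f"Incompatible file: {path} ({exc})")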

train_neuralnet_images

train_neuralnet_images(
    directory: str,
    model_name: str,
    checkpoint_path=None,
    classes=None,
    seed=None,
    dtype=None,
    color_mode: str = "grayscale",
    class_mode: str = "categorical",
    interpolation: str = "bicubic",
    save_format: str = "tiff",
    image_size: tuple = cfg.cnnmodel["image_size"],
    batch_size: int = cfg.cnnmodel["batch_size"],
    validation_split: float = cfg.cnnmodel["validation_split"],
    follow_links: bool = False,
    save_to_dir: bool = False,
    save_prefix: bool = False,
    shuffle: bool = True,
    nb_epochs: int = cfg.cnnmodel["epochs"],
    learning_rate: float = cfg.cnnmodel["learning_rate"],
    reduction_factor: float = 0.2,
    reduction_patience: int = 10,
    min_lr: float = 1e-6,
    stopping_patience: int = 100,
    verbose: int = 1,
)

Main training procedure for neural networks on image datasets.

Conventional function based on the Keras implementation in the Tensorflow package for classifying image data. Works for RGB and grayscale images. The dataset of images to be classified must be organized as a folder structure, so that each class is in its own folder. Optionally test datasets can be created; in that case a two-level hierarchical folder structure is needed: in the first level there are two folders, one for the training and one for the test dataset, and in the second level the folders with the examples for the individual classes. Adam is used as the optimization algorithm. As metrics, accuracy, precision, recall, f1-score and AUC are calculated. All other parameters can be set optionally. The layers of the neural network are all labeled.
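The data feeding and learning-rate handling described here follow the standard Keras pattern; a compressed sketch, with the model construction and most options left out, could be:

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def train_sketch(directory, model, image_size=(200, 200), batch_size=32,
                 validation_split=0.2, learning_rate=1e-3, nb_epochs=100):
    gen = ImageDataGenerator(rescale=1.0 / 255, validation_split=validation_split)
    train = gen.flow_from_directory(directory, target_size=image_size, color_mode="grayscale",
                                    class_mode="categorical", batch_size=batch_size, subset="training")
    val = gen.flow_from_directory(directory, target_size=image_size, color_mode="grayscale",
                                  class_mode="categorical", batch_size=batch_size, subset="validation")
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    callbacks = [
        tf.keras.callbacks.ReduceLROnPlateau(factor=0.2, patience=10, min_lr=1e-6),
        tf.keras.callbacks.EarlyStopping(patience=100),
    ]
    model.fit(train, validation_data=val, epochs=nb_epochs, callbacks=callbacks)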

Example of usage:

train_neuralnet_images(
    "./data",
    "./model_weights/C" + str(cfg.lstmmodel["differential"]) + "_CNNCuDNNLSTM_Betticurves_" + str(cfg.cnnmodel["filters"]) + "_" + str(cfg.lstmmodel["units"]) + "_" + str(cfg.cnnmodel["layers_x"]) + "_" + str(cfg.lstmmodel["layers"]) + "layers",
    "./model_weights/C" + str(cfg.lstmmodel["differential"]) + "_CNNCuDNNLSTM_Betticurves_" + str(cfg.cnnmodel["filters"]) + "_" + str(cfg.lstmmodel["units"]) + "_" + str(cfg.cnnmodel["layers_x"]) + "_" + str(cfg.lstmmodel["layers"]) + "layers",
)