In this paper, we use topological data analysis techniques to construct a suitable neural network classifier for the task of learning sensor signals of entire power plants according to their reference designation system. We use representations of persistence diagrams to derive necessary preprocessing steps and to visualize the large amounts of data. We derive architectures with deep one-dimensional convolutional layers combined with stacked long short-term memories, arranged as residual networks, suitable for processing the persistence features. We combine three separate sub-networks, which take as input the time series itself and representations of the persistent homology in dimensions zero and one. We give a mathematical derivation for most of the used hyper-parameters. For validation, numerical experiments were performed with sensor data from four power plants of the same construction type.
Keywords: Power Plants · Time Series Analysis · Signal Processing · Geometric Embeddings · Persistent Homology · Topological Data Analysis.
@inproceedings{ML4ITS/MelodiaL21,
author = {Luciano Melodia and
Richard Lenz},
editor = {Michael Kamp and others},
title = {Homological Time Series Analysis of Sensor Signals from Power Plants},
booktitle = {Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2021},
series = {Communications in Computer and Information Science},
volume = {1524},
pages = {283--299},
publisher = {Springer},
year = {2021},
url = {https://doi.org/10.1007/978-3-030-93736-2_22},
doi = {10.1007/978-3-030-93736-2_22},
}
timeSeriesHelper.py
timeSeriesConverter.py
timeSeriesGenerator.py
timeSeriesEmbedding.py
timeSeriesHomology.py
timeSeriesVisualisation.py
timeSeriesModels.py
timeSeriesClassifier.py
zip_to_csv(path: str)
Converts a packed sql file to a csv file.
This function unpacks the specified zip file at its location and calls sql_to_csv on the contained sql file, which has the same name as the zip file.
:param path: path to the zip file with the sql file in it, type str
zip_to_npy(path: str)
Converts a packed sql file to an npy file.
This function unpacks the specified zip file at its location and calls sql_to_npy on the contained sql file, which has the same name as the zip file.
:param path: path to the zip file with the sql file in it, type str
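A hypothetical call sequence for the two unpacking helpers; the importing module and the archive contents are assumptions, since the file list above does not pin down which module exports them:
from timeSeriesHelper import zip_to_csv, zip_to_npy  # assumed location
zip_to_csv("./dumps/plant.zip")  # expects plant.sql inside, produces plant.csv
zip_to_npy("./dumps/plant.zip")  # same archive, produces plant.npy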
sql_to_csv(path: str, delimiter: str = "\n")
Converts a set of INSERT statements to csv format.
Extracts the data from a set of INSERT statements stored in a sql file and converts it into a csv file: each non-INSERT line is stored in a separate pickle file, and the data of the INSERT statements is stored line by line, with the specified delimiter at the end of each line.
:param path: path to the sql file, type str
:param delimiter: the delimiter placed at the end of each line, type str
sql_to_npy(path: str, delimiter: str = ",")
Converts a set of INSERT statements into a numpy array.
Similar to the csv function, this function also stores unused data in a pickle file and creates a new file with the extracted data, this time in npy format. Here the delimiter must be the delimiter used in the sql file, plus an additional missing_values string used to represent missing data.
:param path: path to the sql file, type str
:param delimiter: the delimiter used in the sql file to separate the data, type str
csv_to_sql(path: str, delimiter: str = "\n")
Converts a csv file into a set of INSERT statements.
This function converts each set of data separated by the specified delimiter in a csv file into an INSERT statement. It also inserts the data stored in a pickle file with the same name as the csv file as a comment at the beginning, so as not to interfere with functionality.
:param path: path to the csv file, type str
:param delimiter: type str
csv_to_npy(path: str, delimiter: str = ",")
Converts a csv file to a numpy array representation.
This function converts a csv file into a two-dimensional numpy representation, where each record separated by the specified delimiter is interpreted as a new line.
:param path: path to the csv file, type str
:param delimiter: type str
npy_to_sql(path: str)
Converts an npy file into a series of INSERT statements.
This function is the inverse of sql_to_npy: if you use the two in conjunction, you end up with the same file you started with.
:param path: type str
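Since npy_to_sql is described as the inverse of sql_to_npy, a round trip should reproduce the original dump; a sketch with hypothetical file names and an assumed import location:
from timeSeriesHelper import sql_to_npy, npy_to_sql  # assumed location
sql_to_npy("./dumps/plant.sql")  # extracts the INSERT data into plant.npy plus a pickle file
npy_to_sql("./dumps/plant.npy")  # reconstructs the INSERT statements from the array and the pickle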
gen_GAF(path: str)
Generates a Gramian Angular Field with user input.
This function receives user input via the console and generates either a Gramian Angular Summation Field or a Gramian Angular Difference Field from the data of a numpy array, using gen_GAF_exec.
:param path: type str
gen_GAF_exec(
    data: list,
    sample_range: None or tuple = (-1, 1),
    method: str = "summation",
    null_value: str = "0",
)
Generates a Gramian Angular Field.
This is the actual function that generates a Gramian Angular Field from the data of a numpy array. It takes several parameters that determine how the field is scaled, what size it has, and whether it is a summation or a difference field.
:param data: type list
:param size: the size of the square output image, type int or float
:param sample_range: type tuple
:param method: summation or difference, type str
:param null_value: type str
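For reference, a minimal numpy sketch of what a Gramian Angular Summation/Difference Field is; this is an independent reimplementation for illustration, not the repository's code:
import numpy as np
def gaf(x: np.ndarray, method: str = "summation") -> np.ndarray:
    # rescale the series into [-1, 1] and encode the values as angles
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    phi = np.arccos(np.clip(x, -1, 1))
    if method == "summation":
        return np.cos(phi[:, None] + phi[None, :])  # GASF
    return np.sin(phi[:, None] - phi[None, :])      # GADF
field = gaf(np.sin(np.linspace(0, 4 * np.pi, 64)))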
false_input(path: str)
Outputs error information and returns to main.
This function prints an error message to the console and invokes main.
exit()
Prints a message and exits the program.
This function prints a message on the console and terminates the program.
switchoption(n: int, path: str)
Calls a function.
This function calls one of the functions of this program, selected according to n, and gives it the path as input.
:param n: type int
:param path: type str
checkpath(path: str)
Checks whether a path is relative or absolute.
This function removes all quotes from a path and checks whether it is relative or absolute; it returns a cleaned path which is the absolute representation of the given path.
:param path: type str
:return: str
split_csv_file(
    path: str, header: bool = False, column: int = 1, delimiter: str = ";"
) -> bool
Splits a csv file according to a column.
Takes as input a csv file path, groups the file according to a certain label, and stores the data in multiple files named after the label.
:param path: path to the csv file, type str
:param header: type bool
:param column: type int
:param delimiter: delimiter of the csv file, default ;, type str
:return: bool
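A hedged pandas sketch of the described grouping behaviour; the output naming and the return value are assumptions:
import pandas as pd
def split_csv_file(path: str, header: bool = False, column: int = 1, delimiter: str = ";") -> bool:
    df = pd.read_csv(path, sep=delimiter, header=0 if header else None)
    # one output file per label value found in the chosen column
    for label, group in df.groupby(df.columns[column]):
        group.to_csv(f"{label}.csv", sep=delimiter, index=False, header=header)
    return True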
create_datasets(root: str)
Splits the various files into directories.
This function recursively lists all files in a directory and divides them into the hard-coded folder structure for persistence diagrams, heat kernels, embeddings, persistence silhouettes and Betti curves.
:param root: type str
persistence_giotto_to_matplotlib(
    diagram: np.ndarray, plot: bool = True, tikz: bool = True
) -> np.ndarray
Helper function to convert a giotto-tda persistence diagram into a matplotlib one.
giotto-tda uses plotly in a proprietary way: the plotting function is part of the pipeline and not accessible as an object. We use the coordinates returned by the function and create our own matplotlib plot. Currently the scales are lost.
:param diagram: type np.ndarray
:param plot: type bool
:param tikz: type bool
:return: np.ndarray
chunkIt(seq, num)
Chunks a list into a partition of specified size.
:param seq: type list
:param num: type int
:return: list
numpy_to_img(
    directory: str,
    target: str,
    filetype: str = ".npy",
    target_filetype: str = ".png",
    color_mode: str = "RGB",
)
Converts a set of numpy arrays of shape (x, x, 3) to RGB images.
Example:
numpy_to_img("./data", "./images")
:param directory: type str
:param target: type str
create_timeseries_dataset_persistence_images(
path_to_data: str,
target: str = cfg.paths["data"],
label_delimiter: str = "@",
file_extension_marker: str = ".",
embedding_dimension: int = 3,
embedding_time_delay: int = 1,
stride: int = 1,
delimiter: str = ",",
n_jobs: int = -1,
n_bins: int = 100,
column: int = 3,
homology_dimensions: tuple = (0, 1, 2),
delete_old: bool = True,
window_size: int = 200,
filtering_epsilon: float = 0.23,
)
Creates from a directory with time series .csv data a directory for use with Keras.
We assume the following folder structure: in a root folder there are several subfolders, and in each of these subfolders there are .csv files. The subfolders themselves have no meaning for the classification or for the assignment of the names. The files are named as they will be labeled later. Optionally the filename can be extended; then the label should be placed after a @ in the file name. The files are loaded and transformed into an n-dimensional torus using Takens' embedding. Starting from this embedding we create artificial examples. The persistent homology and the resulting persistence images are then generated for each example from a time series. A corresponding folder is created in which the generated persistence image is stored; the file is numbered and stored in the folder under its name.
:param path_to_data: path to the .csv files containing the time series, type str
:param target: type str
:param label_delimiter: type str
:param file_extension_marker: type str
:param embedding_dimension: type int
:param embedding_time_delay: type int
:param stride: type int
:param delimiter: type str
:param n_jobs: type int
:param n_bins: type int
:param column: type int
:param homology_dimensions: type tuple
:param delete_old: type bool
:param window_size: type int
:param filtering_epsilon: type float
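Example of usage (by analogy with create_timeseries_dataset_ts_betti_curves below; the concrete paths are an assumption):
create_timeseries_dataset_persistence_images(cfg.paths["split_ordered"], cfg.paths["data"])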
create_timeseries_dataset_ts_betti_curves(
path_to_data: str,
target: str = cfg.paths["data"],
sample_size: int = 200,
minimum_ts_size: int = 2 * 10 ** 3,
class_size: int = 10 ** 3,
label_delimiter: str = "@",
file_extension_marker: str = ".",
delimiter: str = ",",
n_jobs: int = -1,
column: int = 3,
homology_dimensions: tuple = (0, 1, 2),
color_mode: str = "L",
saveasnumpy: bool = False,
)
Creates from a directory of time series .csv data a directory for use with the Keras ImageDataGenerator.
This function generates from a dataset consisting of .csv files a pixel-oriented collection of multivariate time series, consisting of the original signal section and the persistent Betti curve of this section. Each line of the image corresponds to a time series: the first line corresponds to the original signal and the second line to the topological progression in the form of the Betti curve. These are grouped in folders named after their classes. The classes must be stored as values in a separate column for each entry within the csv file. A corresponding folder is then created, which is compatible with the ImageDataGenerator of Keras.
Example of usage:
create_timeseries_dataset_ts_betti_curves(cfg.paths["split_ordered"], cfg.paths["data"])
:param path_to_data: path to the .csv files containing the time series, type str
:param target: type str
:param sample_size: type int
:param minimum_ts_size: type int
:param class_size: type int
:param label_delimiter: type str
:param file_extension_marker: type str
:param delimiter: type str
:param n_jobs: type int
:param column: type int
:param homology_dimensions: type tuple
:param color_mode: grayscale or rgb, type str
:param saveasnumpy: type bool
list_files_in_dir(directory: str)
Lists all files inside a directory.
Simple function that lists all files inside a directory as str.
:param directory: type str
:return: list
read_csv_data(path: str, delimiter: str = ",") -> np.ndarray
Reads .csv files into an np.ndarray.
Converts all columns of a .csv file into an np.ndarray.
:param path: path to the .csv file, type str
:param delimiter: type str
:return: np.ndarray
fit_embedder(y: np.ndarray, embedder: callable, verbose: bool = True) -> tuple
Fits a Takens embedding and displays optimal search parameters.
Determines the optimal parameters for the searched embedding according to the theory of toroidal embeddings resulting from the discrete Fourier transform.
:param y: type np.ndarray
:param embedder: type callable
:param verbose: type bool
:return: tuple
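A sketch of such a fit with giotto-tda's SingleTakensEmbedding, which is presumably the embedder passed in; the signal is a stand-in:
import numpy as np
from gtda.time_series import SingleTakensEmbedding
y = np.sin(np.linspace(0, 20 * np.pi, 2000))  # stand-in periodic signal
embedder = SingleTakensEmbedding(parameters_type="search", dimension=5, time_delay=10)
y_embedded = embedder.fit_transform(y)
print(embedder.dimension_, embedder.time_delay_)  # the searched optimal parameters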
get_single_signal(index: int, file: int, plot: bool = False) -> np.ndarray
Gets a column directly from the .csv file.
Extracts a column from a .csv file according to the index and returns it as an np.ndarray for further processing.
:param index: type int
:param file: type int
:param plot: type bool
:return: np.ndarray
get_sliding_window_embedding(
index: int,
file: int,
width: int = 2,
stride: int = 3,
plot: bool = False,
L: float = 1,
B: float = 0,
plot_signal: bool = False,
) -> np.ndarray
Sliding window embedding in a commutative Lie group.
This is an embedding which provably yields a commutative Lie group as embedding space, i.e. a smooth manifold with group structure. It is a connected manifold, so it has suitable properties for inferring the dimension of homology groups. It is intuitive and can be used to detect periodicities, since it has direct connections to the theory of Fourier series.
:param index: type int
:param file: type int
:param width: the window has size width+1, type int
:param stride: type int
:param plot: type bool
:param L: type float
:param B: type float
:param plot_signal: type bool
:return: np.ndarray
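For intuition, a plain-numpy sketch of the underlying windowing step, ignoring the scaling parameters L and B whose exact semantics are not documented here:
import numpy as np
def sliding_window(x: np.ndarray, width: int = 2, stride: int = 3) -> np.ndarray:
    # each row is one window of width+1 consecutive samples
    n = (len(x) - (width + 1)) // stride + 1
    return np.stack([x[i * stride : i * stride + width + 1] for i in range(n)])
windows = sliding_window(np.sin(np.linspace(0, 8 * np.pi, 500)))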
get_periodic_embedding(
index: int = 3,
file: int = 1,
plot_signal: bool = False,
parameters_type: str = "fixed",
max_time_delay: int = 3,
max_embedding_dimension: int = 11,
stride: int = 2,
plot: bool = False,
store: bool = False,
fourier_transformed: bool = False,
)
Fits a single Takens embedder and displays the optimal search parameters.
This function uses a search algorithm to obtain optimal parameters for a time series embedding. The search can be skipped if the parameters_type parameter is set to fixed. The time delay and the embedding dimension are determined by the algorithm. Optionally, the embedded time series signal can be stored as a .npy file by setting the store parameter to True.
:param index: type int
:param file: number of the .csv file, type int
:param plot_signal: type bool
:param parameters_type: search for optimal parameters or take them as fixed, type str
:param max_time_delay: type int
:param max_embedding_dimension: type int
:param stride: type int
:param plot: type bool
:param store: type bool
:param fourier_transformed: type bool
:return: np.ndarray
count_dimension_distribution(
path: str = cfg.paths["split"] + "**/*",
recurr: bool = True,
dimensions_dict: dict = {"1": 0, "2": 0, "3": 0, "4": 0, "5": 0},
delays_dict: dict = {"1": 0, "2": 0, "3": 0, "4": 0, "5": 0},
keyword: str = "_embedded_",
position_dimension: int = -7,
position_delay: int = -5,
)
Determines the dimension used for embedding from the filenames and counts them.
We encode the embedding dimension and the time delay in the filenames of the processed files representing the time series. We need to pass the determined values for dimension and time delay per signal through our algorithm. This function returns a tuple of dictionaries with these counts.
:param path: type str
:param recurr: type bool
:param dimensions_dict: type dict
:param delays_dict: type dict
:param keyword: type str
:param position_dimension: type int
:param position_delay: type int
:return: tuple
compute_persistence_representations(
path: str,
parameters_type: str = "fixed",
filetype: str = ".csv",
delimiter: str = ",",
n_jobs: int = -1,
embedding_dimension: int = 3,
embedding_time_delay: int = 5,
stride: int = 10,
index: int = 3,
enrico_betti: bool = True,
enrico_silhouette: bool = True,
enrico_heatkernel: bool = True,
n_bins: int = 100,
store: bool = True,
truncate: int = 3000,
) -> np.ndarray
Computes the persistent homology representations for some data.
This is a collection of representations from giotto-tda. The examined folder structure has two sublevels; we find each file in this two-level folder structure and compute all desired persistence representations for a fixed hyperparameter setting. The hyperparameters must be estimated beforehand. Optionally, the persistence diagram, the silhouette, a persistence heat kernel and the persistent Betti curve are stored together with the embedded signal. The embedding dimension and time delay are encoded in the filename as embedding dimension - time delay.
:param path: type str
:param parameters_type: type str
:param filetype: default .csv, type str
:param delimiter: type str
:param n_jobs: type int
:param embedding_dimension: type int
:param embedding_time_delay: type int
:param stride: type int
:param index: column of the .csv file, type int
:param enrico_betti: type bool
:param enrico_silhouette: type bool
:param enrico_heatkernel: type bool
:param n_bins: type int
:param store: whether to store the .npy files or not, type bool
:param truncate: type int
:return: np.ndarray
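A condensed giotto-tda sketch of the representations this procedure wraps; the point cloud is a stand-in and the hyperparameters are defaults from the signature above:
import numpy as np
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import BettiCurve, Silhouette, HeatKernel
X = np.random.rand(1, 200, 3)  # one embedded signal: (n_samples, n_points, dimension)
diagrams = VietorisRipsPersistence(homology_dimensions=(0, 1, 2)).fit_transform(X)
betti = BettiCurve(n_bins=100).fit_transform(diagrams)
silhouette = Silhouette(n_bins=100).fit_transform(diagrams)
heat = HeatKernel(n_bins=100).fit_transform(diagrams)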
export_cmap_to_cpt(
cmap,
vmin: float = 0,
vmax: float = 1,
N: int = 255,
filename: str = "test.cpt",
**kwargs
)
Exports a custom matplotlib color map to a file.
Generates a color map for matplotlib on a desired normalized interval of choice. The map is not returned but saved as a text file. The default name for this file is test.cpt. This file contains the information for the color map in matplotlib and can be loaded later.
:param cmap: type str
:param vmin: type float
:param vmax: type float
:param N: RGB resolution, type int
:param filename: type str
:param kwargs: B, F or N for color definition, type str
plot_embedding3D(path: str)
Plots 3-1 embeddings iteratively within a directory.
Plots a set of embeddings, always with three dimensions and a time delay of 1. This can be changed arbitrarily, according to the estimated parameters.
:param path: path to the .npy files, type str
mean_heatkernel2D(
directory: str,
homgroup: int,
limit: int = 0,
store: bool = False,
plot: bool = False,
filename: str = "figure",
filetype: str = "svg",
colormap: str = "Reds",
)
Calculates a mean heat kernel over a large collection of files in a directory.
Calculates a mean heat kernel map from a folder full of .npy files containing heat maps. The result can optionally be saved or displayed as a plot in the browser.
:param directory: directory of the .npy files, type str
:param homgroup: type int
:param limit: type int
:param store: type bool
:param plot: type bool
:param filename: type str
:param filetype: type str
:param colormap: type str
:return: plotly.graph_objects.Figure
massive_surface_plot3D(
directory: str,
homgroup: int,
title: str = "Default",
limit: int = 45000,
store: bool = False,
plotting: bool = False,
filename: str = "figure",
filetype: str = "svg",
colormap: str = "Reds",
)
Calculates a surface plot from curves.
Calculates a surface from a directory full of .npy files of curves (intended for Betti curves from persistence diagrams). For the x and y coordinates, the corresponding indices of the Betti curves themselves and the filtration index are selected. The value of the function is then visible on the z axis. Optionally, the result can be displayed as a graph in the browser or saved.
Example:
massive_surface_plot3D(
"/home/lume/documents/siemens_samples/power_plant_silhouette/",
homgroup=1,
store=True,
plotting=True,
limit=1000
)
:param directory: directory of the .npy files, type str
:param homgroup: type int
:param title: type str
:param limit: type int
:param store: type bool
:param plotting: type bool
:param filename: type str
:param filetype: type str
:param colormap: type str
:return: plotly.graph_objects.Figure
massive_line_plot3D(
directory: str,
homgroup: int,
resolution: int = 300,
k: int = 3,
limit: int = 0,
elev: float = 20,
azim: int = 135,
KKS: str = "LBB",
fignum: int = 0,
plot: bool = False,
)
Creates a massive line chart from a directory of .npy files that contain the data.
This function creates a line graph from a set of .npy files. The line graph is three-dimensional: each line is plotted along the z axis, while the other two axes represent the plot index and the time step. It is assumed that each .npy file stores a one-dimensional array. The method iterates over a directory of .npy files, each of which contains a one-dimensional time series.
Examples:
massive_line_plot3D(
directory="/home/lume/documents/siemens_samples/kraftwerk_betticurve/", homgroup=0
)
Example of multiple plots of Betti curves / persistence silhouettes:
count = 0
for i in cfg.pptcat:
massive_line_plot3D(
directory="/home/lume/documents/siemens_kraftwerk_samples/kraftwerk_bettikurve/",
homgroup=0,
KKS=i,
fignum=count,
)
count += 1
plt.show()
plt.close()
:param directory: directory of the .npy files for the line plots, type str
:param homgroup: type int
:param resolution: type int
:param k: type int
:param limit: type int
:param elev: type float
:param azim: type int
:param KKS: type str
:param fignum: type int
:param plot: type bool
convolution_layer(
x,
name: str,
i: int,
filters: int = cfg.cnnmodel["filters"],
kernel_size: int = cfg.cnnmodel["kernel_size"],
activation: str = cfg.cnnmodel["activation"],
padding: str = cfg.cnnmodel["padding"],
l1_regu: float = cfg.cnnmodel["l1"],
l2_regu: float = cfg.cnnmodel["l2"],
batchnorm: bool = True,
)
This is a convolution layer with batch normalization.
Here we implement a convolutional layer with activation function, padding, optional filter and stride parameters, and some regularizers on the Keras framework. This function is to be used to recursively define a neural network architecture.
:param x: type tf.tensor
:param name: type str
:param i: type int
:param filters: type int
:param kernel_size: type int
:param activation: type str
:param padding: type str
:param l1_regu: type float
:param l2_regu: type float
:param batchnorm: type bool
:return: tf.keras.layers
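A hedged Keras sketch of such a block; the default values are placeholders, not the values from cfg:
from tensorflow.keras import layers, regularizers
def convolution_layer(x, name: str, i: int, filters: int = 32, kernel_size: int = 3,
                      activation: str = "relu", padding: str = "same",
                      l1_regu: float = 1e-5, l2_regu: float = 1e-4, batchnorm: bool = True):
    # convolution with combined L1/L2 weight regularization
    x = layers.Conv1D(filters, kernel_size, padding=padding,
                      kernel_regularizer=regularizers.l1_l2(l1=l1_regu, l2=l2_regu),
                      name=f"{name}_conv{i}")(x)
    if batchnorm:
        x = layers.BatchNormalization(name=f"{name}_bn{i}")(x)
    return layers.Activation(activation, name=f"{name}_act{i}")(x)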
lstm_layer(
x,
name: str,
i: int,
units: int = cfg.lstmmodel["units"],
return_state: bool = cfg.lstmmodel["return_state"],
go_backwards: bool = cfg.lstmmodel["go_backwards"],
stateful: bool = cfg.lstmmodel["stateful"],
l1_regu: float = cfg.lstmmodel["l1"],
l2_regu: float = cfg.lstmmodel["l2"],
)
This is a CuDNNLSTM layer.
We implement here a CuDNNLSTM layer with optional unit, state and direction parameters, and some regularizers on the Keras framework. This function is to be used to recursively define a neural network architecture.
:param x: type tf.tensor
:param name: type str
:param i: type int
:param units: type int
:param return_state: type bool
:param go_backwards: if True, the input sequence is processed backwards and the reversed sequence is returned, type bool
:param stateful: if True, the last state for each sample at index i in a batch is used as the initial state for the sample at index i in the following batch, type bool
:param l1_regu: type float
:param l2_regu: type float
:return: tf.keras.layers
create_lstm(
x,
name: str,
layers: int,
differential: int,
max_pool: list,
avg_pool: list,
dropout_rate: float = cfg.lstmmodel["dropout_rate"],
pooling_size: int = cfg.lstmmodel["pooling_size"],
)
This function returns one of the subnetworks of the residual CuDNNLSTM.
This function generates the core neural network according to the theory of Ck-differentiable neural networks. It applies residual connections according to the degree k of differentiability and adds dropout and pooling layers at the desired locations in the architecture. Some of the specifications can be passed as arguments to perform hyperparameter optimization.
:param x: type tf.tensor
:param name: type str
:param layers: type int
:param differential: type int
:param max_pool: type list
:param avg_pool: type list
:param dropout_rate: type float
:param pooling_size: type int
:return: tf.tensor
create_subnet(
x,
name: str,
layers: int,
differential: int,
dropouts: list,
max_pool: list,
avg_pool: list,
dropout_rate: float = cfg.cnnmodel["dropout_rate"],
pooling_size: int = cfg.cnnmodel["pooling_size"],
batchnorm: bool = True,
)
This function returns one of the subnetworks of the residual CNN.
This function generates the core neural network according to the theory of Ck-differentiable neural networks. It applies residual connections according to the degree k of differentiability and adds dropout and pooling layers at the desired locations in the architecture. Some of the specifications can be passed as arguments to perform hyperparameter optimization. Currently this is implemented only for C1-architectures.
:param x: type tf.tensor
:param name: type str
:param layers: type int
:param differential: type int
:param dropouts: type list
:param max_pool: type list
:param avg_pool: type list
:param dropout_rate: type float
:param pooling_size: type int
:param batchnorm: type bool
:return: tf.tensor
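A minimal sketch of the residual pattern for the C1 case, one skip connection bridging each block; the layer choices are assumptions, not the repository's exact architecture:
from tensorflow.keras import layers
def residual_block(x, filters: int = 32, kernel_size: int = 3):
    y = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    if x.shape[-1] != filters:  # match channels so the addition is defined
        x = layers.Conv1D(filters, 1, padding="same")(x)
    return layers.Activation("relu")(layers.Add()([x, y]))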
create_three_nets(shape, classes_number, image_size) -> callable
The CNN architecture for classifying the power plant sensor data.
This function builds three subnets with the same architecture. The purpose is to process the raw data within one of the subnets, and in the other two subnets to process the Betti curves of the zeroth and first homology group, which are generated during the filtration of the sample. The results are summed within a final layer and a corresponding decision for a class is made by a usual vanilla dense layer.
:param shape: type tuple
:param classes_number: type int
:param image_size: type tuple
:return: tf.keras.Model
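A hedged sketch of the described three-subnet layout; sizes, layer choices and names are assumptions. One input takes the raw series, two take the Betti curves of the zeroth and first homology group:
from tensorflow.keras import layers, Model
def three_nets_sketch(shape=(200, 1), classes_number=10):
    inputs, branches = [], []
    for name in ("raw", "betti_h0", "betti_h1"):
        inp = layers.Input(shape=shape, name=name)
        x = layers.Conv1D(32, 3, padding="same", activation="relu")(inp)
        x = layers.GlobalAveragePooling1D()(x)
        inputs.append(inp)
        branches.append(x)
    x = layers.Add()(branches)  # sum the subnet outputs
    out = layers.Dense(classes_number, activation="softmax")(x)
    return Model(inputs, out)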
recall(y_true, y_pred)
Computes the recall of the classification task.
Calculates the proportion of true positives among the classified samples: true positives divided by the sum of true positives and false negatives.
:param y_true: type tf.tensor
:param y_pred: type tf.tensor
precision(y_true, y_pred)
Computes the precision of the classification task.
Calculates the proportion of true positives among the classified samples: true positives divided by the sum of true positives and false positives.
:param y_true: type tf.tensor
:param y_pred: type tf.tensor
f1(y_true, y_pred)
Computes the f1-score of the classification task.
Calculates the f1-score: two times the precision times the recall, divided by the sum of precision and recall. A smoothing constant is used to avoid division by zero in the denominator.
:param y_true: type tf.tensor
:param y_pred: type tf.tensor
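The three metrics as conventionally written with the Keras backend, with K.epsilon() as the smoothing constant; this sketch is consistent with the descriptions above but not necessarily the repository's exact code:
from tensorflow.keras import backend as K
def recall(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())
def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())
def f1(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r + K.epsilon())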
check_files(folder_path: str)
This function checks the compatibility of the images with pillow.
We read the files with pillow and check for errors. If this function traverses the whole directory without stopping, the dataset will work fine with the given version of pillow and the ImageDataGenerator class of Keras.
:param folder_path: type str
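A sketch of the described check, assuming every file in the tree is supposed to be an image:
import os
from PIL import Image
def check_files(folder_path: str):
    for root, _, files in os.walk(folder_path):
        for name in files:
            with Image.open(os.path.join(root, name)) as img:
                img.verify()  # raises an exception on broken or unsupported files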
train_neuralnet_images(
directory: str,
model_name: str,
checkpoint_path=None,
classes=None,
seed=None,
dtype=None,
color_mode: str = "grayscale",
class_mode: str = "categorical",
interpolation: str = "bicubic",
save_format: str = "tiff",
image_size: tuple = cfg.cnnmodel["image_size"],
batch_size: int = cfg.cnnmodel["batch_size"],
validation_split: float = cfg.cnnmodel["validation_split"],
follow_links: bool = False,
save_to_dir: bool = False,
save_prefix: bool = False,
shuffle: bool = True,
nb_epochs: int = cfg.cnnmodel["epochs"],
learning_rate: float = cfg.cnnmodel["learning_rate"],
reduction_factor: float = 0.2,
reduction_patience: int = 10,
min_lr: float = 1e-6,
stopping_patience: int = 100,
verbose: int = 1,
)
Main training procedure for neural networks on image datasets.
Conventional function based on the Keras implementation in the Tensorflow package for classifying image data. Works for RGB and grayscale images. The dataset of images to be classified must be organized as a folder structure, so that each class is in its own folder. Optionally, test datasets can be created; then a two-level hierarchical folder structure is needed: on the first level there are two folders, one for the training and one for the test dataset, and on the second level the folders with the examples for the individual classes. Adam is used as the optimization algorithm. As metrics, accuracy, precision, recall, f1-score and AUC are calculated. All other parameters can be set optionally. The layers of the neural network are all labeled.
Example of usage:
train_neuralnet_images(
"./data",
"./model_weights/C" + str(cfg.lstmmodel["differential"]) + "_CNNCuDNNLSTM_Betticurves_" + str(cfg.cnnmodel["filters"]) + "_" + str(cfg.lstmmodel["units"]) + "_" + str(cfg.cnnmodel["layers_x"]) + "_" + str(cfg.lstmmodel["layers"]) + "layers",
"./model_weights/C" + str(cfg.lstmmodel["differential"]) + "_CNNCuDNNLSTM_Betticurves_" + str(cfg.cnnmodel["filters"]) + "_" + str(cfg.lstmmodel["units"]) + "_" + str(cfg.cnnmodel["layers_x"]) + "_" + str(cfg.lstmmodel["layers"]) + "layers",
)
:param directory: type str
:param model_name: type str
:param checkpoint_path: type str
:param classes: optional list of classes (e.g. ['dogs', 'cats']). Default is None. If not specified, the list of classes is automatically derived from the y_col (the classes mapped to the label indices will be alphanumeric). The dictionary containing the mapping of class names to class indices can be obtained from the class_indices attribute, dtype list
:param seed: dtype float
:param dtype: dtype str
:param color_mode: one of grayscale, rgb, rgba. Default: rgb. Whether the images are converted to have 1, 3, or 4 channels, dtype str
:param class_mode: one of binary, categorical, input, multi_output, raw, sparse or None, dtype str
:param interpolation: default bilinear. Supports bilinear, nearest, bicubic, area, lanczos3, lanczos5, gaussian, mitchellcubic, dtype str
:param save_format: png or jpeg, dtype str
:param batch_size: default 32, dtype int
:param image_size: dtype tuple
:param validation_split: dtype float
:param follow_links: dtype bool
:param save_to_dir: dtype bool
:param save_prefix: dtype bool
:param shuffle: dtype bool
:param nb_epochs: dtype int
:param learning_rate: dtype float
:param reduction_factor: dtype float
:param reduction_patience: dtype int
:param min_lr: dtype float
:param stopping_patience: dtype int
:param verbose: dtype int
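For orientation, a minimal sketch of the Keras input pipeline this training procedure is built on; paths and sizes are placeholders:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)
train = datagen.flow_from_directory("./data", target_size=(200, 200), color_mode="grayscale",
                                    class_mode="categorical", interpolation="bicubic", subset="training")
val = datagen.flow_from_directory("./data", target_size=(200, 200), color_mode="grayscale",
                                  class_mode="categorical", interpolation="bicubic", subset="validation")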