This post did not end up quite the way I’d imagined. A quick follow-up on the recent Time series prediction with
FNN-LSTM, it was supposed to demonstrate how noisy time series (so common in
practice) could profit from a change in architecture: Instead of FNN-LSTM, an LSTM autoencoder regularized by false nearest
neighbors (FNN) loss, use FNN-VAE, a variational autoencoder constrained by the same. However, FNN-VAE did not seem to handle
noise better than FNN-LSTM. No plot, no post, then?
On the other hand – this is not a scientific study, with hypothesis and experimental setup all preregistered; all that really
matters is whether there is something useful to report. And it looks like there is.
Firstly, FNN-VAE, while on par performance-wise with FNN-LSTM, is far superior in that other meaning of “performance”:
Training goes a lot faster for FNN-VAE.
Secondly, while we don’t see much difference between FNN-LSTM and FNN-VAE, we do see a clear impact of using FNN loss. Adding in FNN loss strongly reduces mean squared error with respect to the underlying (denoised) series – especially in the case of VAE, but for LSTM as well. This is of particular interest with VAE, since it comes with a regularizer
out-of-the-box – namely, Kullback-Leibler (KL) divergence.
Of course, we don’t claim that similar results will always be obtained on other noisy series; nor did we tune any of
the models “to death.” For what could be the intent of such a post but to show our readers interesting (and promising) ideas
to pursue in their own experimentation?
The context
This post is the third in a mini-series.
In Deep attractors: Where deep learning meets chaos, we
explained, with a substantial detour into chaos theory, the idea of FNN loss, introduced in (Gilpin 2020). Please consult
that first post for theoretical background and intuitions behind the technique.
The following post, Time series prediction with FNN-LSTM, showed
how to use an LSTM autoencoder, constrained by FNN loss, for forecasting (as opposed to reconstructing an attractor). The results were stunning: In multi-step prediction (12-120 steps, with that number varying by
dataset), the short-term forecasts were drastically improved by adding in FNN regularization. See that second post for
experimental setup and results on four very different, non-synthetic datasets.
Today, we show how to replace the LSTM autoencoder by a – convolutional – VAE. In light of the experimentation results,
already hinted at above, it is entirely plausible that the “variational” part is not even that important here – that a
convolutional autoencoder with just MSE loss would have performed just as well on these data. In fact, to find out, it is
enough to remove the call to reparameterize()
and multiply the KL component of the loss by 0. (We leave this to the
reader, to keep the post at reasonable length.)
One last piece of context, in case you haven’t read the two previous posts and would like to jump in here directly. We’re
doing time series forecasting; so why this talk of autoencoders? Shouldn’t we just be comparing an LSTM (or some other kind of
RNN, for that matter) to a convnet? In fact, the necessity of a latent representation is due to the very idea of FNN: The
latent code is supposed to reflect the true attractor of a dynamical system. That is, if the attractor of the underlying
system is approximately two-dimensional, we hope to find that just two of the latent variables have considerable variance. (This
reasoning is explained in lots of detail in the previous posts.)
FNN-VAE
So, let’s start with the code for our new model.
The encoder takes the time series, of format batch_size x num_timesteps x num_features
just like in the LSTM case, and
produces a flat, 10-dimensional output: the latent code, which FNN loss is computed on.
library(tensorflow)
library(keras)
library(tfdatasets)
library(tfautograph)
library(reticulate)
library(purrr)
vae_encoder_model <- function(n_timesteps,
                              n_features,
                              n_latent,
                              name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$conv1 <- layer_conv_1d(kernel_size = 3,
                                filters = 16,
                                strides = 2)
    self$act1 <- layer_activation_leaky_relu()
    self$batchnorm1 <- layer_batch_normalization()
    self$conv2 <- layer_conv_1d(kernel_size = 7,
                                filters = 32,
                                strides = 2)
    self$act2 <- layer_activation_leaky_relu()
    self$batchnorm2 <- layer_batch_normalization()
    self$conv3 <- layer_conv_1d(kernel_size = 9,
                                filters = 64,
                                strides = 2)
    self$act3 <- layer_activation_leaky_relu()
    self$batchnorm3 <- layer_batch_normalization()
    self$conv4 <- layer_conv_1d(
      kernel_size = 9,
      filters = n_latent,
      strides = 2,
      activation = "linear"
    )
    self$batchnorm4 <- layer_batch_normalization()
    self$flat <- layer_flatten()

    function (x, mask = NULL) {
      x %>%
        self$conv1() %>%
        self$act1() %>%
        self$batchnorm1() %>%
        self$conv2() %>%
        self$act2() %>%
        self$batchnorm2() %>%
        self$conv3() %>%
        self$act3() %>%
        self$batchnorm3() %>%
        self$conv4() %>%
        self$batchnorm4() %>%
        self$flat()
    }
  })
}
The decoder starts from this – flat – representation and decompresses it into a time sequence. In both encoder and decoder
(de-)conv layers, parameters are chosen to handle a sequence length (num_timesteps
) of 120, which is what we’ll use for
prediction below.
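A quick aside on the arithmetic, assuming the default “valid” padding: in the encoder, each convolution maps a sequence of length L to floor((L - kernel_size) / strides) + 1, so the 120 input steps shrink as 120 → 59 → 27 → 10 → 1, leaving a single step with n_latent channels that the final layer flattens into the latent code. In the decoder below, each transposed convolution maps length L to (L - 1) * strides + kernel_size (+ output_padding), expanding 1 → 15 → 53 → 114 → 120.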
vae_decoder_model <- function(n_timesteps,
                              n_features,
                              n_latent,
                              name = NULL) {
  keras_model_custom(name = name, function(self) {
    self$reshape <- layer_reshape(target_shape = c(1, n_latent))
    self$conv1 <- layer_conv_1d_transpose(kernel_size = 15,
                                          filters = 64,
                                          strides = 3)
    self$act1 <- layer_activation_leaky_relu()
    self$batchnorm1 <- layer_batch_normalization()
    self$conv2 <- layer_conv_1d_transpose(kernel_size = 11,
                                          filters = 32,
                                          strides = 3)
    self$act2 <- layer_activation_leaky_relu()
    self$batchnorm2 <- layer_batch_normalization()
    self$conv3 <- layer_conv_1d_transpose(
      kernel_size = 9,
      filters = 16,
      strides = 2,
      output_padding = 1
    )
    self$act3 <- layer_activation_leaky_relu()
    self$batchnorm3 <- layer_batch_normalization()
    self$conv4 <- layer_conv_1d_transpose(
      kernel_size = 7,
      filters = 1,
      strides = 1,
      activation = "linear"
    )
    self$batchnorm4 <- layer_batch_normalization()

    function (x, mask = NULL) {
      x %>%
        self$reshape() %>%
        self$conv1() %>%
        self$act1() %>%
        self$batchnorm1() %>%
        self$conv2() %>%
        self$act2() %>%
        self$batchnorm2() %>%
        self$conv3() %>%
        self$act3() %>%
        self$batchnorm3() %>%
        self$conv4() %>%
        self$batchnorm4()
    }
  })
}
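As a quick sanity check on these shapes – a minimal, hypothetical snippet not taken from the experiments reported here, assuming the settings used below (n_timesteps = 120, n_features = 1, n_latent = 10) – we can pass a dummy batch through both models:

# hypothetical shape check, not part of the training code
enc <- vae_encoder_model(120, 1, 10)
dec <- vae_decoder_model(120, 1, 10)

dummy <- k_random_normal(shape = c(4, 120, 1))  # a fake batch of 4 series
code <- enc(dummy)   # expected shape: (4, 10), the flat latent code
recon <- dec(code)   # expected shape: (4, 120, 1), back to a 120-step sequence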
Note that even though we called these constructors vae_encoder_model()
and vae_decoder_model()
, there is nothing
variational about these models per se; they are really just an encoder and a decoder, respectively. The metamorphosis into a VAE will
happen in the training procedure; in fact, the only two things that will make this a VAE are the
reparameterization of the latent layer and the added-in KL loss.
Speaking of training, these are the routines we’ll call. The function to compute FNN loss, loss_false_nn()
, can be found in
both of the abovementioned predecessor posts; we kindly ask the reader to copy it from one of those places.
# to reparameterize encoder output before calling decoder
reparameterize <- function(mean, logvar = 0) {
  eps <- k_random_normal(shape = n_latent)
  eps * k_exp(logvar * 0.5) + mean
}
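# (note: with the default logvar = 0, reparameterize() simplifies to z = mean + eps, eps ~ N(0, 1))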
# loss has 3 components: NLL, KL, and FNN
# otherwise, this is just normal TF2-style training
train_step_vae <- function(batch) {
  with (tf$GradientTape(persistent = TRUE) %as% tape, {
    code <- encoder(batch[[1]])
    z <- reparameterize(code)
    prediction <- decoder(z)

    l_mse <- mse_loss(batch[[2]], prediction)
    # see loss_false_nn in the 2 previous posts
    l_fnn <- loss_false_nn(code)
    # KL divergence to a standard normal
    l_kl <- -0.5 * k_mean(1 - k_square(z))
    # overall loss is a weighted sum of all 3 components
    loss <- l_mse + fnn_weight * l_fnn + kl_weight * l_kl
  })
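  # persistent = TRUE above because tape$gradient() is called twice (encoder and decoder)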
  encoder_gradients <-
    tape$gradient(loss, encoder$trainable_variables)
  decoder_gradients <-
    tape$gradient(loss, decoder$trainable_variables)

  optimizer$apply_gradients(purrr::transpose(list(
    encoder_gradients, encoder$trainable_variables
  )))
  optimizer$apply_gradients(purrr::transpose(list(
    decoder_gradients, decoder$trainable_variables
  )))

  train_loss(loss)
  train_mse(l_mse)
  train_fnn(l_fnn)
  train_kl(l_kl)
}
# wrap it all in autograph
training_loop_vae <- tf_function(autograph(function(ds_train) {
  for (batch in ds_train) {
    train_step_vae(batch)
  }

  tf$print("Loss: ", train_loss$result())
  tf$print("MSE: ", train_mse$result())
  tf$print("FNN loss: ", train_fnn$result())
  tf$print("KL loss: ", train_kl$result())

  train_loss$reset_states()
  train_mse$reset_states()
  train_fnn$reset_states()
  train_kl$reset_states()
}))
To finish up the model section, here is the actual training code. It is nearly identical to what we did for FNN-LSTM before.
n_latent <- 10L
n_features <- 1

encoder <- vae_encoder_model(n_timesteps,
                             n_features,
                             n_latent)

decoder <- vae_decoder_model(n_timesteps,
                             n_features,
                             n_latent)

mse_loss <-
  tf$keras$losses$MeanSquaredError(reduction = tf$keras$losses$Reduction$SUM)

train_loss <- tf$keras$metrics$Mean(name = 'train_loss')
train_fnn <- tf$keras$metrics$Mean(name = 'train_fnn')
train_mse <- tf$keras$metrics$Mean(name = 'train_mse')
train_kl <- tf$keras$metrics$Mean(name = 'train_kl')

fnn_multiplier <- 1 # default value used in nearly all cases (see text)
fnn_weight <- fnn_multiplier * nrow(x_train)/batch_size

kl_weight <- 1

optimizer <- optimizer_adam(lr = 1e-3)

for (epoch in 1:100) {
  cat("Epoch: ", epoch, " -----------\n")
  training_loop_vae(ds_train)

  test_batch <- as_iterator(ds_test) %>% iter_next()
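  # inspect per-latent-variable variance on part of the test set, to see how many
  # latent dimensions actually get used (cf. the discussion in the results section)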
  encoded <- encoder(test_batch[[1]][1:1000])
  test_var <- tf$math$reduce_variance(encoded, axis = 0L)
  print(test_var %>% as.numeric() %>% round(5))
}
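As an aside, the ablation mentioned in the introduction – removing the call to reparameterize() and multiplying the KL component by 0 – would amount to a train step along the following lines (a hypothetical sketch we did not run for this post):

# hypothetical, untested variant: a plain (non-variational) convolutional autoencoder,
# still regularized by FNN loss
train_step_ae <- function(batch) {
  with (tf$GradientTape(persistent = TRUE) %as% tape, {
    code <- encoder(batch[[1]])
    prediction <- decoder(code)          # no reparameterize()
    l_mse <- mse_loss(batch[[2]], prediction)
    l_fnn <- loss_false_nn(code)
    loss <- l_mse + fnn_weight * l_fnn   # KL component weighted by 0, i.e., dropped
  })
  encoder_gradients <- tape$gradient(loss, encoder$trainable_variables)
  decoder_gradients <- tape$gradient(loss, decoder$trainable_variables)
  optimizer$apply_gradients(purrr::transpose(list(
    encoder_gradients, encoder$trainable_variables
  )))
  optimizer$apply_gradients(purrr::transpose(list(
    decoder_gradients, decoder$trainable_variables
  )))
  train_loss(loss)
  train_mse(l_mse)
  train_fnn(l_fnn)
}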
Experimental setup and data
The basic idea was to add white noise to a deterministic series. This time, the Roessler
system was chosen, mainly for the prettiness of its attractor, apparent
even in its two-dimensional projections:
Like we did for the Lorenz system in the first part of this series, we use deSolve
to generate data from the Roessler
equations.
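For reference – and matching the derivatives computed in the code below – the Roessler equations are:

$$
\begin{aligned}
\dot{x} &= -y - z \\
\dot{y} &= x + a y \\
\dot{z} &= b + z (x - c)
\end{aligned}
$$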
library(deSolve)
library(tibble) # for as_tibble()

parameters <- c(a = .2,
                b = .2,
                c = 5.7)

initial_state <-
  c(x = 1,
    y = 1,
    z = 1.05)

roessler <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dx <- -y - z
    dy <- x + a * y
    dz <- b + z * (x - c)

    list(c(dx, dy, dz))
  })
}

times <- seq(0, 2500, length.out = 20000)

roessler_ts <-
  ode(
    y = initial_state,
    times = times,
    func = roessler,
    parms = parameters,
    method = "lsoda"
  ) %>% unclass() %>% as_tibble()

n <- 10000
roessler <- roessler_ts$x[1:n]

roessler <- scale(roessler)
Then, noise is added, to the desired degree, by drawing from a normal distribution centered at zero, with standard deviations
varying between 1 and 2.5.
# add noise
noise <- 1 # also used 1.5, 2, 2.5
roessler <- roessler + rnorm(10000, mean = 0, sd = noise)
Here you can compare the effects of adding no noise (left), standard-deviation-1 (middle), and standard-deviation-2.5 Gaussian noise:
Otherwise, preprocessing proceeds as in the previous posts. In the upcoming results section, we’ll compare forecasts not just
to the “real,” after-noise-addition test split of the data, but also to the underlying Roessler system – that is, the thing
we’re really interested in. (Just that in the real world, we can’t do that check.) This second test set is prepared for
forecasting just like the other one; to avoid duplication we don’t reproduce the code.
n_timesteps <- 120
batch_size <- 32

gen_timesteps <- function(x, n_timesteps) {
  do.call(rbind,
          purrr::map(seq_along(x),
                     function(i) {
                       start <- i
                       end <- i + n_timesteps - 1
                       out <- x[start:end]
                       out
                     })
  ) %>%
    na.omit()
}
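# for illustration (not from the original experiments): gen_timesteps(1:5, 3) yields
# the overlapping windows
#   1 2 3
#   2 3 4
#   3 4 5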
train <- gen_timesteps(roessler[1:(n/2)], 2 * n_timesteps)
test <- gen_timesteps(roessler[(n/2):n], 2 * n_timesteps)

dim(train) <- c(dim(train), 1)
dim(test) <- c(dim(test), 1)

x_train <- train[ , 1:n_timesteps, , drop = FALSE]
y_train <- train[ , (n_timesteps + 1):(2*n_timesteps), , drop = FALSE]

ds_train <- tensor_slices_dataset(list(x_train, y_train)) %>%
  dataset_shuffle(nrow(x_train)) %>%
  dataset_batch(batch_size)

x_test <- test[ , 1:n_timesteps, , drop = FALSE]
y_test <- test[ , (n_timesteps + 1):(2*n_timesteps), , drop = FALSE]

ds_test <- tensor_slices_dataset(list(x_test, y_test)) %>%
  dataset_batch(nrow(x_test))
Results
The LSTM used for comparison with the VAE described above is identical to the architecture employed in the previous post.
While with the VAE, an fnn_multiplier
of 1 yielded sufficient regularization for all noise levels, some additional experimentation
was needed for the LSTM: At noise levels 2 and 2.5, that multiplier was set to 5.
As a result, in all cases there was one latent variable with high variance and a second one of minor importance. For all
others, variance was close to 0.
In all cases here means: in all cases where FNN regularization was used. As already hinted at in the introduction, the main
regularizing factor providing robustness to noise here seems to be FNN loss, not KL divergence. So for all noise levels,
besides the FNN-regularized LSTM and VAE models we also tested their non-constrained counterparts.
Low noise
Seeing how all models did beautifully on the original deterministic series, a noise level of 1 can almost be treated as
a baseline. Here you see sixteen 120-timestep predictions from both regularized models, FNN-VAE (dark blue) and FNN-LSTM
(orange). The noisy test data, both input (x
, 120 steps) and output (y
, 120 steps), are displayed in (blueish) grey. In
green, also spanning the whole sequence, we have the actual Roessler data, the way they would look had no noise been added.
Despite the noise, forecasts from both models look excellent. Is this due to the FNN regularizer?
Looking at forecasts from their unregularized counterparts, we have to admit these do not look any worse. (For better
comparability, the sixteen sequences to forecast were initially picked at random, but used to test all models and
conditions.)
What happens when we start to add noise?
Substantial noise
Between noise levels 1.5 and 2, something changed, or at least became noticeable upon visual inspection. Let’s jump directly to the
highest level used, though: 2.5.
Here, first, are predictions obtained from the unregularized models.
Both LSTM and VAE get “distracted” a bit too much by the noise, the latter to an even higher degree. This leads to cases
where predictions strongly “overshoot” the underlying non-noisy rhythm. This is not surprising, of course: They were trained
on the noisy version; predicting fluctuations is what they learned.
Do we see the same with the FNN models?
Interestingly, we now see a much better fit to the underlying Roessler system! Especially the VAE model, FNN-VAE, surprises
with a whole new smoothness of predictions; but FNN-LSTM turns out much smoother forecasts as well.
“Smooth, fitting the system…” – by now you may be wondering, when are we going to come up with more quantitative
assertions? If quantitative implies “mean squared error” (MSE), and if MSE is taken to be some divergence between forecasts
and the true target from the test set, the answer is that this MSE doesn’t differ much between any of the four architectures.
Put differently, it is mostly a function of noise level.
However, we could argue that what we are really interested in is how well a model forecasts the underlying process. And there,
we see differences.
In the following plot, we contrast the MSEs obtained for the four model types (grey: VAE; orange: LSTM; dark blue: FNN-VAE; green:
FNN-LSTM). The rows reflect noise levels (1, 1.5, 2, 2.5); the columns represent MSE in relation to the noisy (“real”) target
(left) on the one hand, and in relation to the underlying system (right) on the other. For better visibility of the effect,
MSEs were normalized as fractions of the maximum MSE in a category.
So, if we want to predict signal plus noise (left), it is not terribly important whether we use FNN or not. But if we want to
predict the signal only (right), FNN loss becomes increasingly effective as noise in the data grows. This effect is far
stronger for VAE vs. FNN-VAE than for LSTM vs. FNN-LSTM: The distance between the grey line (VAE) and the dark blue one
(FNN-VAE) becomes larger and larger as we add more noise.
Summing up
Our experiments show that when noise is likely to obscure measurements from an underlying deterministic system, FNN
regularization can strongly improve forecasts. This is the case especially for convolutional VAEs, and probably for convolutional
autoencoders in general. And if an FNN-constrained VAE performs as well as an LSTM for time series prediction, there is a
strong incentive to use the convolutional model: It trains significantly faster.
With that, we conclude our mini-series on FNN-regularized models. As always, we’d love to hear from you if you were able to
make use of this in your own work!
Thanks for reading!