A bit more than a year ago, in his beautiful guest post, Nick Strayer showed how to classify a set of everyday activities using smartphone-recorded gyroscope and accelerometer data. Accuracy was very good, but Nick went on to inspect classification results more closely. Were some activities more prone to misclassification than others? And what about those erroneous results: did the network report them with equal, or less, confidence than the ones that were correct?
Technically, when we speak of confidence in that way, we’re referring to the score obtained for the “winning” class after softmax activation. If that winning score is 0.9, we might say “the network is sure that’s a gentoo penguin”; if it’s 0.2, we’d instead conclude “to the network, neither option seemed fitting, but cheetah looked best.”
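To make this concrete: the softmax simply normalizes the network’s raw outputs (logits) into scores that sum to one. A purely illustrative example, with made-up numbers:

# illustration only: softmax turns raw scores (logits) into "confidence" values
softmax <- function(x) exp(x) / sum(exp(x))
softmax(c(2.2, 0.3, -1.1))
# [1] 0.843 0.126 0.031   -> the "winning" class scores ~0.84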
This use of “confidence” is convincing, but it has nothing to do with confidence – or credibility, or prediction, what have you – intervals. What we’d really like to be able to do is put distributions over the network’s weights and make it Bayesian. Using tfprobability’s variational Keras-compatible layers, this is something we actually can do.
Adding uncertainty estimates to Keras models with tfprobability shows how to use a variational dense layer to obtain estimates of epistemic uncertainty. In this post, we modify the convnet used in Nick’s post to be variational throughout. Before we start, let’s quickly summarize the task.
The task
To create the Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set (Reyes-Ortiz et al. 2016), the researchers had subjects walk, sit, stand, and transition from one of those activities to another. Meanwhile, two types of smartphone sensors were used to record motion data: accelerometers measure linear acceleration in three dimensions, while gyroscopes are used to track angular velocity around the coordinate axes. Nick’s original post displays plots of the respective raw sensor data for the six types of activities.
Just like Nick, we’re going to zoom in on those six types of activity, and try to infer them from the sensor data. Some data wrangling is needed to get the dataset into a form we can work with; here we’ll build on Nick’s post, and effectively start from the data nicely pre-processed and split up into training and test sets:
Observations: 289
Variables: 6
$ experiment    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 17, 18, 19, 2…
$ userId        <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 7, 7, 9, 9, 10, 10, 11…
$ activity      <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7…
$ data          <list> [<data.frame>, <data.frame>, …]
$ activityName  <fct> STAND_TO_SIT, STAND_TO_SIT, STAND_TO_SIT, STAND_TO_S…
$ observationId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 17, 18, 19, 2…
Observations: 69
Variables: 6
$ experiment    <int> 11, 12, 15, 16, 32, 33, 42, 43, 52, 53, 56, 57, 11, …
$ userId        <int> 6, 6, 8, 8, 16, 16, 21, 21, 26, 26, 28, 28, 6, 6, 8,…
$ activity      <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8…
$ data          <list> [<data.frame>, <data.frame>, …]
$ activityName  <fct> STAND_TO_SIT, STAND_TO_SIT, STAND_TO_SIT, STAND_TO_S…
$ observationId <int> 11, 12, 15, 16, 31, 32, 41, 42, 51, 52, 55, 56, 71, …
The code required to arrive at this stage (copied from Nick’s post) may be found in the appendix at the bottom of this page.
Training pipeline
The dataset in question is small enough to fit in memory – but yours might not be, so it can’t hurt to see some streaming in action. Besides, it’s probably safe to say that with TensorFlow 2.0, tfdatasets pipelines are the way to feed data to a model.
Once the code listed in the appendix has run, the sensor data is to be found in train_data$data, a list column containing data.frames where each row corresponds to a point in time and each column holds one of the measurements. However, not all time series (recordings) are of the same length; we thus follow the original post in padding all series to length pad_size (= 338). The expected shape of training batches will then be (batch_size, pad_size, 6).
We initially create our training dataset:
library(tfdatasets)

train_x <- train_data$data %>%
  map(as.matrix) %>%
  pad_sequences(maxlen = pad_size, dtype = "float32") %>%
  tensor_slices_dataset()

train_y <- train_data$activity %>%
  one_hot_classes() %>%
  tensor_slices_dataset()

train_dataset <- zip_datasets(train_x, train_y)
train_dataset
Then shuffle and batch it:
n_train <- nrow(train_data)

# the highest possible batch size for this dataset
# chosen because it yielded the best performance
# alternatively, experiment with e.g. different learning rates, ...
batch_size <- n_train

train_dataset <- train_dataset %>%
  dataset_shuffle(n_train) %>%
  dataset_batch(batch_size)

train_dataset
Same for the test data.
test_x <- test_data$data %>%
  map(as.matrix) %>%
  pad_sequences(maxlen = pad_size, dtype = "float32") %>%
  tensor_slices_dataset()

test_y <- test_data$activity %>%
  one_hot_classes() %>%
  tensor_slices_dataset()

n_test <- nrow(test_data)
test_dataset <- zip_datasets(test_x, test_y) %>%
  dataset_batch(n_test)
Using tfdatasets does not mean we cannot run a quick sanity check on our data:
first <- test_dataset %>%
  reticulate::as_iterator() %>%
  # get first batch (= whole test set, in our case)
  reticulate::iter_next() %>%
  # predictors only
  .[[1]] %>%
  # first item in batch
  .[1, , ]
first
tf.Tensor(
[[ 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. ]
...
[ 1.00416672 0.2375 0.12916666 -0.40225476 -0.20463985 -0.14782938]
[ 1.04166663 0.26944447 0.12777779 -0.26755899 -0.02779437 -0.1441642 ]
[ 1.0250001 0.27083334 0.15277778 -0.19639318 0.35094208 -0.16249016]],
shape=(338, 6), dtype=float64)
Now let’s build the network.
A variational convnet
We build on the simple convolutional architecture from Nick’s post, just making minor modifications to kernel sizes and numbers of filters. We also throw out all dropout layers; no additional regularization is needed on top of the priors applied to the weights.
Note the following about the “Bayesified” network.
- Every layer is variational in nature, the convolutional ones (layer_conv_1d_flipout) as well as the dense layers (layer_dense_flipout).
- With variational layers, we can specify the prior weight distribution as well as the form of the posterior; here the defaults are used, resulting in a standard normal prior and a default mean-field posterior.
- Likewise, the user may influence the divergence function used to assess the mismatch between prior and posterior; in this case, we actually take some action: we scale the (default) KL divergence by the number of samples in the training set.
- One last thing to note is the output layer. It is a distribution layer, that is, a layer wrapping a distribution – where wrapping means: training the network is business as usual, but predictions are distributions, one for each data point.
library(tfprobability)
library(tensorflow)

num_classes <- 6

# scale the KL divergence by the number of training examples
n <- n_train %>% tf$cast(tf$float32)
kl_div <- function(q, p, unused)
  tfd_kl_divergence(q, p) / n

model <- keras_model_sequential()
model %>%
  layer_conv_1d_flipout(
    filters = 12,
    kernel_size = 3,
    activation = "relu",
    kernel_divergence_fn = kl_div
  ) %>%
  layer_conv_1d_flipout(
    filters = 24,
    kernel_size = 5,
    activation = "relu",
    kernel_divergence_fn = kl_div
  ) %>%
  layer_conv_1d_flipout(
    filters = 48,
    kernel_size = 7,
    activation = "relu",
    kernel_divergence_fn = kl_div
  ) %>%
  layer_global_average_pooling_1d() %>%
  layer_dense_flipout(
    units = 48,
    activation = "relu",
    kernel_divergence_fn = kl_div
  ) %>%
  layer_dense_flipout(
    num_classes,
    kernel_divergence_fn = kl_div,
    name = "dense_output"
  ) %>%
  layer_one_hot_categorical(event_size = num_classes)
We tell the network to minimize the negative log likelihood:

nll <- function(y, model) - (model %>% tfd_log_prob(y))
This will become part of the loss. The way we set up this example, it is not its most substantial part though. Here, what dominates the loss is the sum of the KL divergences, added (automatically) to model$losses.
In a setup like this, it is interesting to monitor both parts of the loss separately. We can do this by means of two metrics:
# the KL part of the loss
kl_part <- function(y_true, y_pred) {
  kl <- tf$reduce_sum(model$losses)
  kl
}

# the NLL part
nll_part <- function(y_true, y_pred) {
  cat_dist <- tfd_one_hot_categorical(logits = y_pred)
  nll <- - (cat_dist %>% tfd_log_prob(y_true) %>% tf$reduce_mean())
  nll
}
We train considerably longer than Nick did in the original post, allowing for early stopping though.
model %>% compile(
  optimizer = "rmsprop",
  loss = nll,
  metrics = c("accuracy",
              custom_metric("kl_part", kl_part),
              custom_metric("nll_part", nll_part)),
  experimental_run_tf_function = FALSE
)

train_history <- model %>% fit(
  train_dataset,
  epochs = 1000,
  validation_data = test_dataset,
  callbacks = list(
    callback_early_stopping(patience = 10)
  )
)
While the overall loss declines linearly (and probably would for many more epochs), this is not the case for classification accuracy or the NLL part of the loss.
Final accuracy is not as high as in the non-variational setup, though still not bad for a six-class problem. We see that without any additional regularization, there is very little overfitting to the training data.
Now, how do we obtain predictions from this model?
Probabilistic predictions
Although we gained’t go into this right here, it’s good to know that we entry extra than simply the output distributions; via their kernel_posterior
attribute, we are able to entry the hidden layers’ posterior weight distributions as effectively.
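For illustration, here is a minimal sketch of how one might inspect the first convolutional layer’s posterior. We assume the flipout layers expose a kernel_posterior attribute just like their TensorFlow Probability Python counterparts do, so verify against your installed versions:

# a sketch: inspect the learned posterior over the first conv layer's kernel
conv1_posterior <- model$layers[[1]]$kernel_posterior

# posterior means and standard deviations, one per kernel weight
post_means <- conv1_posterior %>% tfd_mean()
post_sds <- conv1_posterior %>% tfd_stddev()

# draw one concrete kernel from the posterior
kernel_sample <- conv1_posterior %>% tfd_sample()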
Given the small size of the test set, we compute all predictions at once. The predictions are now categorical distributions, one for each sample in the batch:
test_data_all <- dataset_collect(test_dataset) %>% { .[[1]][[1]] }

one_shot_preds <- model(test_data_all)

one_shot_preds
tfp.distributions.OneHotCategorical(
"sequential_one_hot_categorical_OneHotCategorical_OneHotCategorical",
batch_shape=[69], event_shape=[6], dtype=float32)
We prefixed these predictions with one_shot to indicate their noisy nature: these are predictions obtained on a single pass through the network, all layer weights being sampled from their respective posteriors.
From the predicted distributions, we calculate means and standard deviations per (test) sample. The standard deviations thus obtained could be said to reflect the overall predictive uncertainty. We can estimate another kind of uncertainty, called epistemic, by making a number of passes through the network and then calculating, again per test sample, the standard deviations of the predicted means.
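Here is a minimal sketch of how these quantities could be computed; helper names (n_mc, preds) and the number of Monte Carlo passes are our choices, introduced for illustration:

# per-sample means and standard deviations from the single-pass predictions
means <- one_shot_preds %>% tfd_mean() %>% as.array()
sds <- one_shot_preds %>% tfd_stddev() %>% as.array()

# epistemic uncertainty: repeat the forward pass (each pass samples fresh
# weights from the posteriors), then take per-sample sds of the predicted means
n_mc <- 100
mc_means <- map(1:n_mc, function(i)
  model(test_data_all) %>% tfd_mean() %>% as.array())
mc_sds <- mc_means %>% simplify2array() %>% apply(c(1, 2), sd)

# assemble into long form: one row per observation-class pairing
preds <- tibble(
  obs = rep(1:nrow(means), each = num_classes),
  class = rep(paste0("V", 1:num_classes), nrow(means)),
  mean = as.vector(t(means)),
  sd = as.vector(t(sds)),
  mc_sd = as.vector(t(mc_sds))
) %>%
  left_join(one_hot_to_label, by = "class")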
Putting it all together, we have:
# A tibble: 414 x 6
obs class mean sd mc_sd label
1 1 V1 0.945 0.227 0.0743 STAND_TO_SIT
2 1 V2 0.0534 0.225 0.0675 SIT_TO_STAND
3 1 V3 0.00114 0.0338 0.0346 SIT_TO_LIE
4 1 V4 0.00000238 0.00154 0.000336 LIE_TO_SIT
5 1 V5 0.0000132 0.00363 0.00164 STAND_TO_LIE
6 1 V6 0.0000305 0.00553 0.00398 LIE_TO_STAND
7 2 V1 0.993 0.0813 0.149 STAND_TO_SIT
8 2 V2 0.00153 0.0390 0.102 SIT_TO_STAND
9 2 V3 0.00476 0.0688 0.108 SIT_TO_LIE
10 2 V4 0.00000172 0.00131 0.000613 LIE_TO_SIT
# … with 404 more rows
Comparing predictions to the ground truth:
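Again as a sketch, building on the preds tibble from above (eval_table is our name; we also assume the batched test dataset preserves the row order of test_data, which it should, since we never shuffled it):

# per observation, keep the class with the highest predicted probability,
# then compare against the ground truth label
eval_table <- preds %>%
  group_by(obs) %>%
  filter(mean == max(mean)) %>%
  ungroup() %>%
  rename(maxprob = mean, maxprob_sd = sd,
         maxprob_mc_sd = mc_sd, predicted = label) %>%
  mutate(
    # activities 7-12 correspond to rows 1-6 of one_hot_to_label
    truth = one_hot_to_label$label[test_data$activity - 6],
    correct = predicted == truth
  ) %>%
  select(obs, maxprob, maxprob_sd, maxprob_mc_sd, predicted, truth, correct)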
# A tibble: 69 x 7
obs maxprob maxprob_sd maxprob_mc_sd predicted truth correct
1 1 0.945 0.227 0.0743 STAND_TO_SIT STAND_TO_SIT TRUE
2 2 0.993 0.0813 0.149 STAND_TO_SIT STAND_TO_SIT TRUE
3 3 0.733 0.443 0.131 STAND_TO_SIT STAND_TO_SIT TRUE
4 4 0.796 0.403 0.138 STAND_TO_SIT STAND_TO_SIT TRUE
5 5 0.843 0.364 0.358 SIT_TO_STAND STAND_TO_SIT FALSE
6 6 0.816 0.387 0.176 SIT_TO_STAND STAND_TO_SIT FALSE
7 7 0.600 0.490 0.370 STAND_TO_SIT STAND_TO_SIT TRUE
8 8 0.941 0.236 0.0851 STAND_TO_SIT STAND_TO_SIT TRUE
9 9 0.853 0.355 0.274 SIT_TO_STAND STAND_TO_SIT FALSE
10 10 0.961 0.195 0.195 STAND_TO_SIT STAND_TO_SIT TRUE
11 11 0.918 0.275 0.168 STAND_TO_SIT STAND_TO_SIT TRUE
12 12 0.957 0.203 0.150 STAND_TO_SIT STAND_TO_SIT TRUE
13 13 0.987 0.114 0.188 SIT_TO_STAND SIT_TO_STAND TRUE
14 14 0.974 0.160 0.248 SIT_TO_STAND SIT_TO_STAND TRUE
15 15 0.996 0.0657 0.0534 SIT_TO_STAND SIT_TO_STAND TRUE
16 16 0.886 0.318 0.0868 SIT_TO_STAND SIT_TO_STAND TRUE
17 17 0.773 0.419 0.173 SIT_TO_STAND SIT_TO_STAND TRUE
18 18 0.998 0.0444 0.222 SIT_TO_STAND SIT_TO_STAND TRUE
19 19 0.885 0.319 0.161 SIT_TO_STAND SIT_TO_STAND TRUE
20 20 0.930 0.255 0.271 SIT_TO_STAND SIT_TO_STAND TRUE
# … with 49 more rows
Are standard deviations higher for misclassifications?
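A grouped summary over the eval_table sketched above answers this:

eval_table %>%
  group_by(correct) %>%
  summarise(
    count = n(),
    avg_mean = mean(maxprob),
    avg_sd = mean(maxprob_sd),
    avg_mc_sd = mean(maxprob_mc_sd)
  )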
# A tibble: 2 x 5
correct count avg_mean avg_sd avg_mc_sd
1 FALSE 19 0.775 0.380 0.237
2 TRUE 50 0.879 0.264 0.183
They are; though perhaps not to the extent we might wish.
With just six classes, we can also inspect standard deviations at the level of individual prediction-target pairings.
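One way to produce such a breakdown, again building on our eval_table:

eval_table %>%
  group_by(truth, predicted) %>%
  summarise(
    cnt = n(),
    avg_mean = mean(maxprob),
    avg_sd = mean(maxprob_sd),
    avg_mc_sd = mean(maxprob_mc_sd)
  ) %>%
  mutate(correct = truth == predicted) %>%
  arrange(desc(correct), desc(cnt))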
# A tibble: 14 x 7
# Groups: truth [6]
truth predicted cnt avg_mean avg_sd avg_mc_sd correct
1 SIT_TO_STAND SIT_TO_STAND 12 0.935 0.205 0.184 TRUE
2 STAND_TO_SIT STAND_TO_SIT 9 0.871 0.284 0.162 TRUE
3 LIE_TO_SIT LIE_TO_SIT 9 0.765 0.377 0.216 TRUE
4 SIT_TO_LIE SIT_TO_LIE 8 0.908 0.254 0.187 TRUE
5 STAND_TO_LIE STAND_TO_LIE 7 0.956 0.144 0.132 TRUE
6 LIE_TO_STAND LIE_TO_STAND 5 0.809 0.353 0.227 TRUE
7 SIT_TO_LIE STAND_TO_LIE 4 0.685 0.436 0.233 FALSE
8 LIE_TO_STAND SIT_TO_STAND 4 0.909 0.271 0.282 FALSE
9 STAND_TO_LIE SIT_TO_LIE 3 0.852 0.337 0.238 FALSE
10 STAND_TO_SIT SIT_TO_STAND 3 0.837 0.368 0.269 FALSE
11 LIE_TO_STAND LIE_TO_SIT 2 0.689 0.454 0.233 FALSE
12 LIE_TO_SIT STAND_TO_SIT 1 0.548 0.498 0.0805 FALSE
13 SIT_TO_STAND LIE_TO_STAND 1 0.530 0.499 0.134 FALSE
14 LIE_TO_SIT LIE_TO_STAND 1 0.824 0.381 0.231 FALSE
Again, we see higher standard deviations for wrong predictions, but not to a high degree.
Conclusion
We’ve shown how to build, train, and obtain predictions from a fully variational convnet. Evidently, there is room for experimentation: alternative layer implementations exist; a different prior could be specified; the divergence could be calculated differently; and the usual neural network hyperparameter tuning options apply.
Then, there is the question of consequences (or: decision making). What is going to happen in high-uncertainty cases, and what even is a high-uncertainty case? Naturally, questions like these are out of scope for this post, yet of essential importance in real-world applications.
Thanks for reading!
Appendix
To be executed before running this post’s code. Copied from Classifying physical activity from smartphone data.
library(keras)
library(tidyverse)

activity_labels <- read.table("data/activity_labels.txt",
                              col.names = c("number", "label"))

one_hot_to_label <- activity_labels %>%
  mutate(number = number - 7) %>%
  filter(number >= 0) %>%
  mutate(class = paste0("V", number + 1)) %>%
  select(-number)

labels <- read.table(
  "data/RawData/labels.txt",
  col.names = c("experiment", "userId", "activity", "startPos", "endPos")
)
dataFiles <- list.files("data/RawData")
dataFiles %>% head()

fileInfo <- data_frame(
  filePath = dataFiles
) %>%
  filter(filePath != "labels.txt") %>%
  separate(filePath, sep = '_',
           into = c("type", "experiment", "userId"),
           remove = FALSE) %>%
  mutate(
    experiment = str_remove(experiment, "exp"),
    userId = str_remove_all(userId, "user|.txt")
  ) %>%
  spread(type, filePath)
# Read contents of a single file into a dataframe with accelerometer and gyro data.
readInData <- function(experiment, userId){
  genFilePath = function(type) {
    paste0("data/RawData/", type, "_exp", experiment, "_user", userId, ".txt")
  }
  bind_cols(
    read.table(genFilePath("acc"), col.names = c("a_x", "a_y", "a_z")),
    read.table(genFilePath("gyro"), col.names = c("g_x", "g_y", "g_z"))
  )
}
# Function to read a given file and get the observations contained along
# with their classes.
loadFileData <- function(curExperiment, curUserId) {

  # load sensor data from file into dataframe
  allData <- readInData(curExperiment, curUserId)

  extractObservation <- function(startPos, endPos){
    allData[startPos:endPos, ]
  }

  # get observation locations in this file from the labels dataframe
  dataLabels <- labels %>%
    filter(userId == as.integer(curUserId),
           experiment == as.integer(curExperiment))

  # extract observations as dataframes and save as a column in the dataframe
  dataLabels %>%
    mutate(
      data = map2(startPos, endPos, extractObservation)
    ) %>%
    select(-startPos, -endPos)
}
# scan through all experiment and userId combinations and gather data into a dataframe
allObservations <- map2_df(fileInfo$experiment, fileInfo$userId, loadFileData) %>%
  right_join(activity_labels, by = c("activity" = "number")) %>%
  rename(activityName = label)

write_rds(allObservations, "allObservations.rds")

allObservations <- readRDS("allObservations.rds")
desiredActivities <- c(
  "STAND_TO_SIT", "SIT_TO_STAND", "SIT_TO_LIE",
  "LIE_TO_SIT", "STAND_TO_LIE", "LIE_TO_STAND"
)

filteredObservations <- allObservations %>%
  filter(activityName %in% desiredActivities) %>%
  mutate(observationId = 1:n())
# get all users
userIds <- allObservations$userId %>% unique()

# randomly choose 24 (80% of 30 individuals) for training
set.seed(42) # seed for reproducibility
trainIds <- sample(userIds, size = 24)

# set the rest of the users to the testing set
testIds <- setdiff(userIds, trainIds)
# filter data
# note S.K.: renamed to train_data for consistency with
# variable naming used in this post
train_data <- filteredObservations %>%
  filter(userId %in% trainIds)

# note S.K.: renamed to test_data for consistency with
# variable naming used in this post
test_data <- filteredObservations %>%
  filter(userId %in% testIds)
# note S.K.: renamed to pad_size for consistency with
# variable naming used in this post
pad_size <- train_data$data %>%
  map_int(nrow) %>%
  quantile(p = 0.98) %>%
  ceiling()

# note S.K.: renamed to one_hot_classes for consistency with
# variable naming used in this post
one_hot_classes <- . %>%
  {. - 7} %>%        # bring integers down to 0-6 from 7-12
  to_categorical()   # one-hot encode