Velloso et al. [1] developed a technique to detect how well a user performs a particular exercise. The authors propose a model-based solution that learns from biometric sensor data. They trained a model on 5 classes: class 'A' represents the "correct" technique for an exercise, and classes 'B', 'C', 'D' and 'E' represent 4 "incorrect" ways of performing it. The goal is to develop an application that provides feedback to a user while they perform an exercise, in order to reduce gym-related injuries caused by incorrect technique.
In this project, we use R to build a classifier using the sensor data generously shared by the authors of this paper. The data consists of a training set containing over 19,000 samples, each with 152 measurement variables and a `classe` outcome variable with the value 'A', 'B', 'C', 'D' or 'E'. The testing set consists of 20 samples without the `classe` outcome variable. The goal is to build a classifier from the training data that predicts the `classe` of the testing data.
In a nutshell, here is a summary of the analysis performed in this paper:

* We removed the 6 sensor data columns that contain only missing (NA) values.
* Our best model achieves 95.8% accuracy using only 36 features.

First, we explore the training data to look for missing (NA) sensor measurements. We assume that the values `"NA"`, `""` (empty string) and `"#DIV/0!"` all represent missing sensor measurements. Table 1 summarizes our findings. We remove the 6 sensor data columns in which every value is missing (NA). We also found that 19,405 samples have at least 1 missing measurement. During model selection, we impute these missing measurements prior to building a classifier.
| statistic | value |
|---|---|
| total_variables_containing_measurement_data | 152 |
| total_variables_all_missing | 6 |
| total_rows | 19622 |
| total_rows_with_at_least_one_missing_measurement | 19405 |
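The sentinel values above can be mapped to NA at load time via `read.csv`'s `na.strings` argument. The snippet below is a self-contained sketch on toy data (the column names are made up for illustration), not the report's actual loading code:

```r
# Map "NA", "" and "#DIV/0!" to NA while reading (toy CSV for illustration)
csv_text <- 'roll_belt,yaw_belt,classe
1.2,#DIV/0!,A
NA,2.5,B
3.3,,C'
df <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))
colSums(is.na(df))   # roll_belt: 1 missing, yaw_belt: 2, classe: 0
```

Counting NAs per column this way is how the all-missing columns and the rows with at least one missing measurement can be identified.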
We calculate feature importance using the `varImp` function provided by the caret R package. First, we build a random forest model on the entire training set. Then we extract the variable importance data from the model. Figure 1 shows the top 15 features sorted by their Gini values. The Gini value measures the decrease in node impurity contributed by splits on a variable m, summed over all trees in the forest.
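As a sketch of this step, the snippet below fits a random forest on R's built-in `iris` data and extracts the mean-decrease-in-Gini importances via `randomForest::importance` (the report applies the same idea to the full training set; `iris` is only a stand-in here):

```r
library(randomForest)

set.seed(42)
# Toy stand-in for the report's model: iris instead of the PML training set
rf <- randomForest(Species ~ ., data = iris, ntree = 100)
# type = 2: mean decrease in Gini impurity, summed over all trees
imp <- importance(rf, type = 2)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
```

Sorting the importances in decreasing order is what produces the top-N feature ranking shown in Figure 1.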
First, we split the training data into a 75% training set and a 25% testing set. Then we build several random forest models on the training split and measure accuracy on the testing split.
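A minimal base-R sketch of such a 75/25 split (the report may instead use caret's `createDataPartition`, which additionally stratifies by class; `iris` stands in for the PML training data):

```r
set.seed(123)
n <- nrow(iris)                        # stand-in for the PML training data
train_idx <- sample(n, size = floor(0.75 * n))
train_split <- iris[train_idx, ]       # 75% used to fit models
test_split  <- iris[-train_idx, ]      # 25% held out to measure accuracy
c(train = nrow(train_split), test = nrow(test_split))   # 112 / 38
```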
Before we can build a random forest classifier, we must first impute the missing values in the training set. We discovered this when we tried to perform prediction on 100 test samples but `predict()` only returned 40 predictions. This happens because `predict.randomForest` skips samples with missing values (see the `predict.randomForest.R` source).
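The `na.roughfix` function we use for imputation fills numeric NAs with the column median (and factor NAs with the most frequent level). A tiny self-contained example:

```r
library(randomForest)   # provides na.roughfix

df <- data.frame(x = c(1, NA, 3, 100), y = c(NA, 2, 2, 2))
na.roughfix(df)
# x's NA becomes median(1, 3, 100) = 3; y's NA becomes 2
```

Note that the median is robust to the outlier 100 in `x`, which is part of why this "rough fix" is a reasonable default for sensor data.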
Once we imputed all the missing values, we built several random forest models with all 152 features using 10, 50 and 100 trees. The model with the best accuracy used 100 trees and 10-fold cross-validation. Its prediction accuracy is 95.4%. Figure 2 shows the confusion matrix for this model.
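A hedged sketch of training a random forest with 10-fold cross-validation using caret's `train` and `trainControl` (again with `iris` standing in for the PML data; the report's exact tuning parameters may differ):

```r
library(caret)

set.seed(7)
# 10-fold cross-validation, as used for model selection in the report
ctrl <- trainControl(method = "cv", number = 10)
fit <- train(Species ~ ., data = iris, method = "rf",
             trControl = ctrl, ntree = 100)
# Confusion matrix of predictions vs. actual classes
confusionMatrix(predict(fit, iris), iris$Species)
```

`confusionMatrix` reports overall accuracy alongside the per-class table, which is how figures like Figure 2 and Figure 3 are produced.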
Next, we built random forest models with 100 trees using the top variables found during feature selection. We found the best performance with 100 trees and 36 features, with an accuracy of 95.8%. The model training used 10-fold cross-validation. Figure 3 shows the confusion matrix for this model. The prediction accuracy is nearly identical to that of the model trained on the entire feature set.
The testing dataset contains several columns with missing data. We summarize our findings in Table 2. We use the imputed training data from the section "Impute PML training set" to replace the missing (NA) values using the `na.roughfix` function from the randomForest package. This function replaces missing (NA) values in numeric columns with column medians.
| statistic | value |
|---|---|
| total_variables_containing_measurement_data | 152 |
| total_variables_all_missing | 100 |
| total_rows | 20 |
| total_rows_with_at_least_one_missing_measurement | 20 |
Once we cleaned the data, we selected the 36 columns corresponding to the variables found during feature selection and performed our predictions. See Appendix 1 for code and predictions.
In this project, we used R to build a classifier on the Qualitative Activity Recognition sensor data. We cleaned the testing and training data sets by replacing missing (NA) values. Our final random forest model achieves 95.8% accuracy using only 36 of the 152 features. Finally, we predicted the `classe` of the 20 test samples using this model.
This is the code used to preprocess the pml_testing.csv
and perform predictions using our best model.
# load required packages
library(randomForest)  # na.roughfix and random forest models
library(dplyr)         # filter
# load stored testing csv data
pml_testing <- readRDS("data/pml_testing_csv.rds")
# read data frame with variable importance
vi <- readRDS("data/rf_variable_importance_df.rds")
# read the model trained with the most important features
modFit_vi <- readRDS("data/rf_fit_36_features.rds")
# read the training data.frame with imputed NA values.
df_imputed <- readRDS("data/rf_imputed_training_df.rds")
# Filter variables for the feature importance threshold
gini_threshold = 1
rf_important_varnames <- vi[vi$Overall > gini_threshold,]$varname
# pml-testing data for the 'most important' predictors
reduced_pml_testing <- subset(pml_testing, select=c(as.vector(rf_important_varnames)))
reduced_pml_testing$problem_id <- pml_testing$problem_id
# training data for 'most important' predictors
reduced_df_imputed <- subset(df_imputed, select=c(as.vector(rf_important_varnames)))
reduced_df_imputed$classe <- df_imputed$classe
###########################################
# replace NA's in the test data.
# simply replace with the median for the column
# For numeric variables, NAs are replaced with column medians
reduced_pml_testing$is_test_data <- TRUE
reduced_df_imputed$is_test_data <- FALSE
# combine the data frames, remove 'classe' and 'problem_id' cols
combined_data <- rbind(subset(reduced_df_imputed, select=-c(classe)),
subset(reduced_pml_testing, select=-c(problem_id)))
combined_rough <- na.roughfix(combined_data[,-length(combined_data)])
# get the roughfix data
combined_rough$is_test_data <- combined_data$is_test_data
pml_imputed <- filter(combined_rough, is_test_data==TRUE)
# predict on imputed pml-testing data
predict(modFit_vi, pml_imputed)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Source code: https://github.com/telvis07/practical_machine_learning_peer_review
[1] Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science. , pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.
Read more: http://groupware.les.inf.puc-rio.br/har