In this project, my goal is to use data (the Weight Lifting Exercise Dataset) from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly (class A) and incorrectly in 4 different ways (classes B, C, D and E). This data set is unusual in that it quantifies how well an activity is done, rather than only the more commonly measured how much of it is done. The Human Activity Recognition group (http://groupware.les.inf.puc-rio.br/har) has kindly provided this data set (cite 3). An edited version of this data set was downloaded for training from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv and for testing from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv.
## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: partykit
## Warning: package 'partykit' was built under R version 3.1.3
## Loading required package: grid
## Loading required package: rpart
## Loading required package: rpart.plot
## Warning: package 'rpart.plot' was built under R version 3.1.2
train <- read.csv('pml-training.csv', na.strings=c("#DIV/0!", "", "NA"),
stringsAsFactors=FALSE)
test <- read.csv('pml-testing.csv', na.strings ="NA")
Initially \(train\) has 19622 rows and 160 columns. Many columns consist mostly of blanks, divide-by-zero entries or NA (all treated as NA by \(read.csv()\) above). Those columns hold numbers only in the rows where \(train\$new\_window\) is equal to \(yes\); these are only a few rows, and deleting them made little difference in the results. We remove every column that contains any NA.
train_noNA <- train[,colSums(is.na(train)) == 0]
test_noNA <- test[,colSums(is.na(test)) == 0]
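As a minimal sketch of what this filter does, the following toy example (the data frame `toy` and its values are hypothetical, only the column-name style is borrowed from the data set) counts the fraction of missing values per column and then keeps the complete columns. The `drop = FALSE` argument is an addition here so that a single surviving column stays a data frame.

```r
# Toy data frame (hypothetical values): one complete column, two sparse ones.
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.45, 1.48),
  kurtosis_belt = c(NA, NA, NA, -1.2),   # filled only in one summary row
  max_roll_belt = c(NA, NA, NA, 5.3)
)

# Fraction of missing values in each column.
na_fraction <- colMeans(is.na(toy))

# Keep only the columns without any missing value, as in the filter above.
toy_noNA <- toy[, colSums(is.na(toy)) == 0, drop = FALSE]

print(na_fraction)
print(names(toy_noNA))
```

Looking at `na_fraction` before filtering is one way to confirm that the discarded columns really are almost empty, rather than missing only a handful of entries.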
Coerce the column classes to match in the two data sets train and test.
for (i in 1:ncol(test_noNA)) {
  if (names(test_noNA)[i] != names(train_noNA)[i]) { # check column alignment
    message('name ', i, '!!! ', names(test_noNA)[i], " != ",
            names(train_noNA)[i])
  }
  if (class(test_noNA[,i]) != class(train_noNA[,i])) { # convert type
    message('class ', i, ' ', names(test_noNA)[i], ' ',
            class(test_noNA[,i]), "!=", class(train_noNA[,i]))
    if (class(train_noNA[,i]) == 'character') {
      test_noNA[,i] <- as.character(test_noNA[,i])
    }
    if (class(test_noNA[,i]) == 'integer') {
      train_noNA[,i] <- as.integer(train_noNA[,i])
    }
  }
}
## class 2 user_name factor!=character
## class 5 cvtd_timestamp factor!=character
## class 6 new_window factor!=character
## class 46 magnet_dumbbell_z integer!=numeric
## class 58 magnet_forearm_y integer!=numeric
## class 59 magnet_forearm_z integer!=numeric
## name 60!!! problem_id != classe
## class 60 problem_id integer!=character
Make the dates POSIX. Make the column \(classe\) a factor variable.
train_noNA$cvtd_timestamp <- as.POSIXct(strptime(train_noNA$cvtd_timestamp, "%d/%m/%Y %H:%M"))
test_noNA$cvtd_timestamp <- as.POSIXct(strptime(test_noNA$cvtd_timestamp, "%d/%m/%Y %H:%M"))
train_noNA$classe <- as.factor(train_noNA$classe)
Remove additional columns which should not be relevant, i.e. the non-numeric columns (\(user\_name\), \(new\_window\)), the sequential columns (\(X\), \(num\_window\)) and the time stamps.
#names(train_noNA)[1:7]
#class(train_noNA)
#summary(train_noNA)
train_noNA <- subset(train_noNA, select = -X)
train_noNA <- subset(train_noNA, select = -user_name)
train_noNA <- subset(train_noNA, select = -raw_timestamp_part_1)
train_noNA <- subset(train_noNA, select = -raw_timestamp_part_2)
train_noNA <- subset(train_noNA, select = -new_window)
train_noNA <- subset(train_noNA, select = -cvtd_timestamp)
train_noNA <- subset(train_noNA, select = -num_window)
test_noNA <- subset(test_noNA, select = -X)
test_noNA <- subset(test_noNA, select = -user_name)
test_noNA <- subset(test_noNA, select = -raw_timestamp_part_1)
test_noNA <- subset(test_noNA, select = -raw_timestamp_part_2)
test_noNA <- subset(test_noNA, select = -new_window)
test_noNA <- subset(test_noNA, select = -cvtd_timestamp)
test_noNA <- subset(test_noNA, select = -num_window)
#ncol(train_noNA)
#names(train_noNA)
Note that if column “X” (1, 2, 3, …) is not removed, it will be the only column the decision tree routine uses to fit classe, because the file is sorted in order of A, B, C, D, E. The result will be an implausibly perfect fit!
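The leak through the row index can be illustrated on made-up data. This is only a sketch under stated assumptions: the data frame `toy`, the `noise` column and the 40 rows per class are hypothetical, and rpart() is called directly with its default settings instead of through train().

```r
library(rpart)

set.seed(1)
n_per_class <- 40
# Hypothetical toy data: rows sorted by class, X is just the row number.
toy <- data.frame(
  X      = seq_len(5 * n_per_class),
  noise  = rnorm(5 * n_per_class),   # carries no class information
  classe = factor(rep(c("A", "B", "C", "D", "E"), each = n_per_class))
)

# Tree that is allowed to see the row index.
fit_with_X <- rpart(classe ~ X + noise, data = toy, method = "class")
acc_with_X <- mean(predict(fit_with_X, toy, type = "class") == toy$classe)

# Tree restricted to the uninformative predictor.
fit_without_X <- rpart(classe ~ noise, data = toy, method = "class")
acc_without_X <- mean(predict(fit_without_X, toy, type = "class") == toy$classe)

print(c(with_X = acc_with_X, without_X = acc_without_X))
```

With the index available, four cuts on `X` are enough to separate the five sorted blocks, so the fit looks perfect even though `X` would be useless on new, unsorted data.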
The remaining variables:
vector variables
scalar variables
There are numerous 3-vectors of the form (roll, pitch, yaw) and (x, y, z), along with summary scalars such as the total acceleration. We chose to remove the 3-D x-y-z vectors for acceleration, assuming that they are well represented by the total accelerations (total_accel_belt, total_accel_arm, total_accel_dumbbell and total_accel_forearm).
train_noNA <- subset(train_noNA, select = -accel_belt_x)
train_noNA <- subset(train_noNA, select = -accel_belt_y)
train_noNA <- subset(train_noNA, select = -accel_belt_z)
train_noNA <- subset(train_noNA, select = -accel_arm_x)
train_noNA <- subset(train_noNA, select = -accel_arm_y)
train_noNA <- subset(train_noNA, select = -accel_arm_z)
train_noNA <- subset(train_noNA, select = -accel_dumbbell_x)
train_noNA <- subset(train_noNA, select = -accel_dumbbell_y)
train_noNA <- subset(train_noNA, select = -accel_dumbbell_z)
train_noNA <- subset(train_noNA, select = -accel_forearm_x)
train_noNA <- subset(train_noNA, select = -accel_forearm_y)
train_noNA <- subset(train_noNA, select = -accel_forearm_z)
We look at the correlations (\(symnum(cor(train_noNA))\)) to see if additional columns should be eliminated. We discard “duplicate” columns at the 1.0 and 0.95 correlation levels, since removing them has little effect on the resulting accuracy.
train_noNA <- subset(train_noNA, select = -roll_belt) # to 1.0 correlation
train_noNA <- subset(train_noNA, select = -gyros_arm_x)# to 1.0 correlation
train_noNA <- subset(train_noNA, select = -gyros_dumbbell_x)# to 0.95 correlation
train_noNA <- subset(train_noNA, select = -gyros_dumbbell_z)# to 0.95 correlation
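One way to list candidate “duplicate” columns without reading the symnum() display by eye is sketched below in base R. The toy predictors `a`, `b` and `c` are hypothetical, and the 0.95 cutoff is taken from the text. This is not necessarily how the columns above were picked.

```r
set.seed(42)
n <- 200
base_signal <- rnorm(n)
# Hypothetical predictors: b nearly duplicates a, c is independent of both.
toy <- data.frame(
  a = base_signal,
  b = base_signal + rnorm(n, sd = 0.05),
  c = rnorm(n)
)

cor_mat <- cor(toy)

# Look only above the diagonal so that each pair is reported once.
high <- which(abs(cor_mat) > 0.95 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

Each reported pair is a candidate for dropping one of its two members. caret also offers findCorrelation() for the same purpose, which may be more convenient on the full data set.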
The original file “pml-testing.csv” (which appears in this file as the data frame test_noNA) has only 20 rows and is too small to act as a test set. We therefore split “pml-training.csv” (data frame train_noNA) into 70% “training” and 30% “testing” sets. Sorry for any confusion with the naming! “By default, createDataPartition does a stratified random split of the data.” (cite 2) This is good, since the data is sorted into long stretches of A’s, then B’s, etc.
nr <- nrow(train_noNA)
indexTrain <- createDataPartition(y=train_noNA$classe,p=0.7, list=FALSE)
training <- train_noNA[indexTrain,]
testing <- train_noNA[-indexTrain,]
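To make “stratified” concrete, here is a base R sketch of a 70% split drawn separately within each class. It is an illustration of the idea only, not caret's implementation. The label vector and its class sizes are hypothetical.

```r
set.seed(2015)
# Hypothetical labels, sorted into long runs like the real file.
classe <- factor(rep(c("A", "B", "C", "D", "E"),
                     times = c(280, 190, 170, 160, 200)))

# Draw 70% of the row indices within each class separately.
idx_by_class <- split(seq_along(classe), classe)
index_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE))

# Class proportions in the full data and in the training part.
full_props  <- prop.table(table(classe))
train_props <- prop.table(table(classe[index_train]))
print(round(rbind(full = full_props, train = train_props), 3))
```

Because the sampling happens inside each class, the training part keeps the class proportions of the full data, which a plain cut of a sorted file at 70% would not.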
Without any depth of knowledge, we try the default values of the method “rpart” in train(). “rpart” stands for Recursive PARTitioning and uses binary tree models. “By default, rpart will conduct as many splits as possible, then use 10–fold (default 10) cross–validation to prune the tree.” (cite ref. 1) We choose to use this default, under the assumption that it is a relatively robust, safe method. Note that train() itself resamples by bootstrapping (25 repetitions), as its output below reports.
We do not select any special preprocessing.
#ncol(training)
#preProc <- preProcess(training, method="pca", thresh=80)
preProc <- training # skip PCA for now.
#ncol(preProc)
#
set.seed(12345) # Seeds the resampling in train(); runs still differ because the createDataPartition() split above is not seeded.
tuneLen <- 57
ecurl <- train(classe ~ ., data = preProc, method = 'rpart',
               tuneLength = tuneLen)
ecurl # accuracy vs. complexity factor
## CART
##
## 13737 samples
## 36 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
##
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.001085003 0.8808540 0.8492132 0.009537772 0.01207030
## 0.001118910 0.8791355 0.8470318 0.010091274 0.01276924
## 0.001220629 0.8741583 0.8407266 0.010404739 0.01317296
## 0.001322348 0.8696684 0.8350361 0.011339719 0.01435252
## 0.001424067 0.8660921 0.8305116 0.011059922 0.01398661
## 0.001525786 0.8625939 0.8260728 0.011096681 0.01404242
## 0.001627505 0.8594769 0.8221037 0.010198019 0.01289416
## 0.001729224 0.8553365 0.8168581 0.010035058 0.01263050
## 0.001830943 0.8522367 0.8129559 0.010185987 0.01280033
## 0.001932662 0.8481686 0.8078164 0.008923042 0.01121950
## 0.002034381 0.8433763 0.8017161 0.009616024 0.01210722
## 0.002136100 0.8403938 0.7979522 0.010364174 0.01304447
## 0.002237819 0.8369607 0.7936318 0.010421257 0.01309219
## 0.002339538 0.8325711 0.7880685 0.009608944 0.01211139
## 0.002441257 0.8297570 0.7845054 0.010394060 0.01309671
## 0.002542976 0.8271648 0.7812273 0.010578594 0.01331584
## 0.002644695 0.8233028 0.7763662 0.010734706 0.01347792
## 0.002695555 0.8218472 0.7745183 0.011072188 0.01391836
## 0.002746414 0.8201766 0.7723869 0.010332279 0.01301941
## 0.002848133 0.8169554 0.7683313 0.010191694 0.01280363
## 0.002949853 0.8143870 0.7650662 0.011757800 0.01482560
## 0.003153291 0.8087807 0.7579492 0.011892688 0.01498567
## 0.003356729 0.8042734 0.7522312 0.011734823 0.01482389
## 0.003458448 0.8012604 0.7484137 0.012846560 0.01626398
## 0.003611026 0.7978328 0.7440567 0.012035850 0.01522795
## 0.003865324 0.7936912 0.7387999 0.013065084 0.01653196
## 0.004068762 0.7888658 0.7326677 0.011848130 0.01505912
## 0.004373919 0.7817661 0.7237014 0.012214945 0.01540532
## 0.004577357 0.7754863 0.7158288 0.015370042 0.01929532
## 0.004780795 0.7716600 0.7110039 0.016052833 0.02015046
## 0.004984234 0.7677888 0.7060790 0.014920467 0.01873674
## 0.005289391 0.7634386 0.7006102 0.017063733 0.02137203
## 0.005391110 0.7621584 0.6989706 0.017330616 0.02172068
## 0.006408300 0.7470547 0.6798613 0.016506540 0.02072862
## 0.007018615 0.7382331 0.6687370 0.017483611 0.02187545
## 0.007069474 0.7377465 0.6681162 0.017796415 0.02229953
## 0.007120334 0.7373286 0.6675896 0.017916996 0.02247066
## 0.007425491 0.7339008 0.6632985 0.019037621 0.02385369
## 0.007628929 0.7318842 0.6607947 0.018309636 0.02291759
## 0.007730648 0.7305628 0.6591350 0.018060609 0.02259936
## 0.009154715 0.7121037 0.6355773 0.020926726 0.02620143
## 0.009358153 0.7080305 0.6303466 0.019639851 0.02459704
## 0.009561591 0.7062965 0.6280785 0.019588642 0.02462334
## 0.010985658 0.6870730 0.6031877 0.019287622 0.02434831
## 0.012104567 0.6690407 0.5803797 0.024687230 0.03091297
## 0.012206286 0.6661275 0.5766940 0.023689708 0.02969428
## 0.012816601 0.6611544 0.5704193 0.020977665 0.02625231
## 0.013223477 0.6555380 0.5634306 0.020348295 0.02558808
## 0.013579493 0.6492097 0.5555795 0.019564540 0.02446947
## 0.013935510 0.6467380 0.5524711 0.020041587 0.02506769
## 0.014647543 0.6381176 0.5419425 0.017854023 0.02197053
## 0.019733496 0.5751783 0.4581687 0.035222927 0.05157210
## 0.020496389 0.5624660 0.4408688 0.031291701 0.04725406
## 0.021157563 0.5458416 0.4174330 0.029495192 0.04683469
## 0.029498525 0.4937783 0.3429248 0.020443241 0.03450669
## 0.030312277 0.4919614 0.3404424 0.020880368 0.03536750
## 0.066320822 0.3889187 0.1732551 0.096701363 0.15678399
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001085003.
The results are vectors of length 57, corresponding to tuneLength. The “best” model from the resampling (25 bootstrap repetitions) is the one with a complexity parameter of cp = 0.001085, so the best resampled accuracy is 0.880854 \(\pm\) 0.0095378 and the best kappa is 0.8492132 \(\pm\) 0.0120703. Since this is the first row, i.e. the smallest cp in the grid, we are a bit suspicious: an even smaller cp might score higher still.
We have used the train() option \(tuneLength\), which defaults to 3. Values as high as about 90 “maximize” the training accuracy. However, with increasing values the tree gets more and more branches and leaves, and we suspect that we may be over-fitting. Following a suggestion from community TA Ronny Restrepo, we compared the accuracy on the “train” and “test” data sets. The accuracy is usually a bit lower for the test data set, but the difference between the two increases as the model overfits the train data and the test accuracy stagnates (see table below). We felt a good compromise is tuneLength = 57. The table below shows our exploration. For a fixed tuneLength, there is a fair amount of jiggle from run to run due to the random sampling (the data split and the resampling within train()).
## tuneLen diffAccur test_Accur train_Accur
## 1 7 -0.0100 0.607 0.606
## 2 7 -0.0100 0.608 0.606
## 3 15 -0.0200 0.694 0.702
## 4 30 -0.0100 0.774 0.778
## 5 37 -0.0153 0.781 0.789
## 6 45 -0.0254 0.839 0.859
## 7 52 -0.0208 0.856 0.871
## 8 57 -0.0283 0.897 0.921
## 9 60 -0.0240 0.908 0.927
## 10 63 -0.0342 0.898 0.928
## 11 67 -0.0354 0.900 0.931
## 12 75 -0.0358 0.910 0.942
## 13 75 -0.0409 0.902 0.939
## 14 90 -0.0416 0.903 0.941
## 15 90 -0.0362 0.912 0.945
## 16 120 -0.0400 0.907 0.945
## 17 57 -0.0179 0.884 0.897
## 18 57 -0.0289 0.881 0.905
## 19 57 -0.0330 0.868 0.896
## 20 57 -0.0207 0.886 0.901
## 21 57 -0.0206 0.873 0.888
## 22 57 -0.0259 0.864 0.885
## 23 57 -0.0305 0.867 0.892
## 24 57 -0.0188 0.892 0.906
## 25 57 -0.0257 0.872 0.893
## 26 57 -0.0232 0.881 0.899
## 27 57 -0.0189 0.883 0.897
## 28 57 -0.0340 0.871 0.900
## 29 57 -0.0119 0.886 0.893
## 30 57 -0.0211 0.878 0.894
## 31 57 -0.0216 0.885 0.902
## 32 57 -0.0263 0.877 0.898
## 33 57 -0.0200 0.877 0.892
## 34 57 -0.0254 0.889 0.909
## 35 57 -0.0213 0.885 0.901
This plot shows the training accuracy as a function of the complexity parameter, with more complex trees on the left. The best point for this example is the left-most point.
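The train-versus-test comparison tabulated above can be sketched as a small loop. This is an assumption-laden illustration, not the code that produced the table: the built-in iris data stands in for the lifting data, rpart() is called directly with a hand-picked cp grid instead of going through train() and tuneLength, and the helper `accuracy()` is hypothetical.

```r
library(rpart)

set.seed(123)
# iris stands in for the lifting data; 70/30 split as in the text.
in_train <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
training_toy <- iris[in_train, ]
testing_toy  <- iris[-in_train, ]

# Fraction of rows whose predicted class matches the true class.
accuracy <- function(fit, data) {
  mean(predict(fit, data, type = "class") == data$Species)
}

cp_grid <- c(0.1, 0.01, 0.001)   # decreasing cp allows larger trees
gap_table <- do.call(rbind, lapply(cp_grid, function(cp) {
  fit <- rpart(Species ~ ., data = training_toy, method = "class",
               control = rpart.control(cp = cp))
  train_acc <- accuracy(fit, training_toy)
  test_acc  <- accuracy(fit, testing_toy)
  data.frame(cp = cp, diffAccur = test_acc - train_acc,
             test_Accur = test_acc, train_Accur = train_acc)
}))
print(gap_table)
```

The training accuracy cannot go down as cp shrinks, because each smaller cp only extends the same tree. A test accuracy that stops following it is the over-fitting signal discussed above.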
Below is our best decision tree in text format, which is rather long. Sorry! The plot of the tree has been left out, since the number of branches and leaves makes it unreadable.
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)
## 2) pitch_forearm< -33.55 1106 9 A (0.99 0.0081 0 0 0) *
## 3) pitch_forearm>=-33.55 12631 9822 A (0.22 0.21 0.19 0.18 0.2)
## 6) magnet_belt_y>=557.5 11586 8778 A (0.24 0.23 0.21 0.18 0.15)
## 12) magnet_dumbbell_y< 439.5 9697 6948 A (0.28 0.18 0.24 0.17 0.12)
## 24) roll_forearm< 120.5 6068 3611 A (0.4 0.18 0.18 0.15 0.092)
## 48) magnet_dumbbell_z< -27.5 2041 673 A (0.67 0.2 0.014 0.067 0.054)
## 96) roll_forearm>=-136.5 1701 368 A (0.78 0.16 0.014 0.012 0.032)
## 192) roll_forearm< 113.5 1572 277 A (0.82 0.12 0.015 0.011 0.034)
## 384) magnet_dumbbell_y< 379.5 1287 101 A (0.92 0.061 0.0031 0.0016 0.012)
## 768) gyros_dumbbell_y>=-0.525 1271 85 A (0.93 0.05 0.0031 0.0016 0.013)
## 1536) gyros_belt_z< 0.16 1257 71 A (0.94 0.05 0.0032 0.0016 0.0016) *
## 1537) gyros_belt_z>=0.16 14 0 E (0 0 0 0 1) *
## 769) gyros_dumbbell_y< -0.525 16 0 B (0 1 0 0 0) *
## 385) magnet_dumbbell_y>=379.5 285 176 A (0.38 0.36 0.07 0.053 0.13)
## 770) pitch_belt>=17.45 53 0 B (0 1 0 0 0) *
## 771) pitch_belt< 17.45 232 123 A (0.47 0.22 0.086 0.065 0.16)
## 1542) roll_dumbbell>=-71.91276 156 51 A (0.67 0.23 0 0.096 0)
## 3084) pitch_arm>=-39.55 139 34 A (0.76 0.14 0 0.11 0) *
## 3085) pitch_arm< -39.55 17 0 B (0 1 0 0 0) *
## 1543) roll_dumbbell< -71.91276 76 38 E (0.053 0.18 0.26 0 0.5)
## 3086) pitch_belt>=13.65 38 18 C (0.11 0.37 0.53 0 0)
## 6172) pitch_belt>=14.7 18 4 B (0.22 0.78 0 0 0) *
## 6173) pitch_belt< 14.7 20 0 C (0 0 1 0 0) *
## 3087) pitch_belt< 13.65 38 0 E (0 0 0 0 1) *
## 193) roll_forearm>=113.5 129 42 B (0.29 0.67 0 0.031 0)
## 386) magnet_belt_x>=25.5 18 0 A (1 0 0 0 0) *
## 387) magnet_belt_x< 25.5 111 24 B (0.18 0.78 0 0.036 0) *
## 97) roll_forearm< -136.5 340 211 B (0.1 0.38 0.012 0.34 0.16)
## 194) gyros_arm_y>=0.985 121 32 B (0 0.74 0 0 0.26)
## 388) yaw_belt< 2.4 92 3 B (0 0.97 0 0 0.033) *
## 389) yaw_belt>=2.4 29 0 E (0 0 0 0 1) *
## 195) gyros_arm_y< 0.985 219 103 D (0.16 0.18 0.018 0.53 0.11)
## 390) yaw_arm< -71.6 30 1 B (0.033 0.97 0 0 0) *
## 391) yaw_arm>=-71.6 189 73 D (0.18 0.058 0.021 0.61 0.13)
## 782) gyros_belt_z< 0.075 167 51 D (0.2 0.066 0.024 0.69 0.012)
## 1564) yaw_belt>=-4.76 46 17 A (0.63 0.065 0.087 0.17 0.043) *
## 1565) yaw_belt< -4.76 121 13 D (0.041 0.066 0 0.89 0) *
## 783) gyros_belt_z>=0.075 22 0 E (0 0 0 0 1) *
## 49) magnet_dumbbell_z>=-27.5 4027 2938 A (0.27 0.17 0.27 0.18 0.11)
## 98) yaw_belt>=169.5 472 59 A (0.88 0.064 0 0.061 0)
## 196) pitch_belt>=-45.05 433 20 A (0.95 0.046 0 0 0) *
## 197) pitch_belt< -45.05 39 10 D (0 0.26 0 0.74 0) *
## 99) yaw_belt< 169.5 3555 2476 C (0.19 0.18 0.3 0.2 0.13)
## 198) pitch_belt< -43.15 346 59 B (0.02 0.83 0.069 0.049 0.032) *
## 199) pitch_belt>=-43.15 3209 2154 C (0.21 0.11 0.33 0.22 0.14)
## 398) yaw_arm< -119.5 213 6 A (0.97 0.028 0 0 0) *
## 399) yaw_arm>=-119.5 2996 1941 C (0.15 0.12 0.35 0.23 0.15)
## 798) magnet_belt_z< -342.5 1057 746 E (0.28 0.17 0.2 0.058 0.29)
## 1596) magnet_forearm_z>=-101.5 506 235 A (0.54 0.004 0.059 0.02 0.38)
## 3192) magnet_forearm_x>=-191.5 373 102 A (0.73 0.0054 0.024 0.016 0.23)
## 6384) magnet_dumbbell_z< 333 300 38 A (0.87 0.0067 0.03 0.013 0.077) *
## 6385) magnet_dumbbell_z>=333 73 11 E (0.12 0 0 0.027 0.85) *
## 3193) magnet_forearm_x< -191.5 133 25 E (0 0 0.16 0.03 0.81)
## 6386) pitch_belt>=12.3 27 6 C (0 0 0.78 0.15 0.074) *
## 6387) pitch_belt< 12.3 106 0 E (0 0 0 0 1) *
## 1597) magnet_forearm_z< -101.5 551 367 C (0.044 0.32 0.33 0.093 0.21)
## 3194) yaw_dumbbell< -38.15102 168 19 C (0 0.054 0.89 0.054 0.006) *
## 3195) yaw_dumbbell>=-38.15102 383 218 B (0.063 0.43 0.091 0.11 0.31)
## 6390) roll_dumbbell< 36.13496 187 37 B (0.011 0.8 0.11 0.021 0.053)
## 12780) magnet_dumbbell_y>=145.5 161 11 B (0.012 0.93 0.037 0.012 0.0062) *
## 12781) magnet_dumbbell_y< 145.5 26 11 C (0 0 0.58 0.077 0.35) *
## 6391) roll_dumbbell>=36.13496 196 89 E (0.11 0.077 0.071 0.19 0.55)
## 12782) magnet_belt_z< -401.5 170 63 E (0.13 0.088 0.082 0.071 0.63)
## 25564) yaw_belt< -88.85 26 11 A (0.58 0.42 0 0 0) *
## 25565) yaw_belt>=-88.85 144 37 E (0.049 0.028 0.097 0.083 0.74)
## 51130) yaw_dumbbell< 53.19289 19 5 C (0.26 0 0.74 0 0) *
## 51131) yaw_dumbbell>=53.19289 125 18 E (0.016 0.032 0 0.096 0.86) *
## 12783) magnet_belt_z>=-401.5 26 0 D (0 0 0 1 0) *
## 799) magnet_belt_z>=-342.5 1939 1098 C (0.086 0.088 0.43 0.33 0.064)
## 1598) roll_dumbbell< 59.01136 1458 640 C (0.078 0.092 0.56 0.22 0.048)
## 3196) yaw_belt< -2.955 326 217 B (0.2 0.33 0.12 0.33 0.018)
## 6392) roll_forearm>=-97.9 120 54 A (0.55 0.44 0.0083 0 0)
## 12784) roll_dumbbell>=-50.12326 97 31 A (0.68 0.31 0.01 0 0)
## 25568) gyros_belt_z>=-0.12 82 16 A (0.8 0.18 0.012 0 0) *
## 25569) gyros_belt_z< -0.12 15 0 B (0 1 0 0 0) *
## 12785) roll_dumbbell< -50.12326 23 0 B (0 1 0 0 0) *
## 6393) roll_forearm< -97.9 206 100 D (0 0.27 0.18 0.51 0.029)
## 12786) magnet_forearm_y< -498 54 7 B (0 0.87 0.019 0.037 0.074) *
## 12787) magnet_forearm_y>=-498 152 48 D (0 0.059 0.24 0.68 0.013)
## 25574) gyros_arm_y>=0.86 38 7 C (0 0.079 0.82 0.053 0.053) *
## 25575) gyros_arm_y< 0.86 114 12 D (0 0.053 0.053 0.89 0) *
## 3197) yaw_belt>=-2.955 1132 353 C (0.042 0.022 0.69 0.19 0.057)
## 6394) yaw_belt< 163.5 811 165 C (0.047 0.0074 0.8 0.07 0.079)
## 12788) total_accel_dumbbell>=4.5 769 124 C (0.049 0.0078 0.84 0.021 0.083)
## 25576) yaw_arm>=89.35 39 1 A (0.97 0.026 0 0 0) *
## 25577) yaw_arm< 89.35 730 85 C (0 0.0068 0.88 0.022 0.088)
## 51154) magnet_belt_x< 171.5 663 37 C (0 0.006 0.94 0.023 0.027)
## 102308) gyros_belt_x< 0.25 651 25 C (0 0.0061 0.96 0.022 0.011) *
## 102309) gyros_belt_x>=0.25 12 1 E (0 0 0 0.083 0.92) *
## 51155) magnet_belt_x>=171.5 67 21 E (0 0.015 0.28 0.015 0.69)
## 102310) yaw_belt>=154 21 2 C (0 0.048 0.9 0.048 0) *
## 102311) yaw_belt< 154 46 0 E (0 0 0 0 1) *
## 12789) total_accel_dumbbell< 4.5 42 1 D (0 0 0.024 0.98 0) *
## 6395) yaw_belt>=163.5 321 162 D (0.031 0.059 0.41 0.5 0)
## 12790) magnet_belt_x>=165.5 134 35 C (0.052 0.052 0.74 0.16 0) *
## 12791) magnet_belt_x< 165.5 187 49 D (0.016 0.064 0.18 0.74 0)
## 25582) pitch_belt< -42.05 38 18 C (0.079 0.32 0.53 0.079 0)
## 51164) magnet_dumbbell_y>=296 17 6 B (0.18 0.65 0 0.18 0) *
## 51165) magnet_dumbbell_y< 296 21 1 C (0 0.048 0.95 0 0) *
## 25583) pitch_belt>=-42.05 149 14 D (0 0 0.094 0.91 0) *
## 1599) roll_dumbbell>=59.01136 481 167 D (0.11 0.075 0.048 0.65 0.11)
## 3198) roll_forearm>=43.3 49 6 A (0.88 0.12 0 0 0) *
## 3199) roll_forearm< 43.3 432 118 D (0.023 0.069 0.053 0.73 0.13)
## 6398) magnet_belt_x< 167.5 382 75 D (0.026 0.079 0.039 0.8 0.052)
## 12796) pitch_belt< -42.6 21 2 B (0.095 0.9 0 0 0) *
## 12797) pitch_belt>=-42.6 361 54 D (0.022 0.03 0.042 0.85 0.055)
## 25594) magnet_forearm_y>=-525 341 34 D (0.023 0.021 0.044 0.9 0.012) *
## 25595) magnet_forearm_y< -525 20 4 E (0 0.2 0 0 0.8) *
## 6399) magnet_belt_x>=167.5 50 15 E (0 0 0.16 0.14 0.7) *
## 25) roll_forearm>=120.5 3629 2409 C (0.08 0.19 0.34 0.22 0.18)
## 50) magnet_dumbbell_y< 291.5 2199 1154 C (0.092 0.14 0.48 0.15 0.14)
## 100) magnet_forearm_z< -248.5 153 21 A (0.86 0.046 0 0.013 0.078)
## 200) roll_forearm< 175.5 134 2 A (0.99 0.015 0 0 0) *
## 201) roll_forearm>=175.5 19 7 E (0 0.26 0 0.11 0.63) *
## 101) magnet_forearm_z>=-248.5 2046 1001 C (0.034 0.15 0.51 0.16 0.15)
## 202) pitch_belt>=26.15 169 29 B (0.1 0.83 0.03 0 0.041) *
## 203) pitch_belt< 26.15 1877 837 C (0.028 0.09 0.55 0.17 0.16)
## 406) yaw_belt< 2.855 1787 747 C (0.03 0.094 0.58 0.18 0.11)
## 812) pitch_forearm< 38.95 1424 492 C (0.021 0.091 0.65 0.1 0.13)
## 1624) magnet_dumbbell_z< 283.5 1261 348 C (0.0056 0.074 0.72 0.098 0.099)
## 3248) gyros_belt_z< 0.075 1199 286 C (0.0058 0.078 0.76 0.08 0.075)
## 6496) roll_forearm>=125.5 1153 240 C (0.0061 0.072 0.79 0.052 0.078)
## 12992) gyros_belt_y>=-0.105 1121 208 C (0.0062 0.074 0.81 0.053 0.053)
## 25984) magnet_forearm_z< 784.5 1026 158 C (0.0068 0.062 0.85 0.058 0.027) *
## 25985) magnet_forearm_z>=784.5 95 50 C (0 0.2 0.47 0 0.33)
## 51970) magnet_forearm_x< -85 59 17 C (0 0.25 0.71 0 0.034) *
## 51971) magnet_forearm_x>=-85 36 7 E (0 0.11 0.083 0 0.81) *
## 12993) gyros_belt_y< -0.105 32 1 E (0 0 0 0.031 0.97) *
## 6497) roll_forearm< 125.5 46 10 D (0 0.22 0 0.78 0) *
## 3249) gyros_belt_z>=0.075 62 27 E (0 0 0 0.44 0.56)
## 6498) magnet_forearm_z< 281 27 0 D (0 0 0 1 0) *
## 6499) magnet_forearm_z>=281 35 0 E (0 0 0 0 1) *
## 1625) magnet_dumbbell_z>=283.5 163 104 E (0.14 0.22 0.12 0.16 0.36)
## 3250) roll_dumbbell< 24.19839 77 43 B (0.3 0.44 0.22 0.039 0)
## 6500) roll_arm< 85.5 21 0 A (1 0 0 0 0) *
## 6501) roll_arm>=85.5 56 22 B (0.036 0.61 0.3 0.054 0)
## 13002) yaw_belt>=-89.3 43 9 B (0.047 0.79 0.093 0.07 0) *
## 13003) yaw_belt< -89.3 13 0 C (0 0 1 0 0) *
## 3251) roll_dumbbell>=24.19839 86 27 E (0 0.023 0.023 0.27 0.69)
## 6502) pitch_belt< 0.415 21 0 D (0 0 0 1 0) *
## 6503) pitch_belt>=0.415 65 6 E (0 0.031 0.031 0.031 0.91) *
## 813) pitch_forearm>=38.95 363 190 D (0.063 0.11 0.3 0.48 0.055)
## 1626) magnet_dumbbell_z>=-50 141 40 C (0 0.035 0.72 0.19 0.057)
## 3252) magnet_dumbbell_y>=-508.5 111 18 C (0 0.036 0.84 0.054 0.072) *
## 3253) magnet_dumbbell_y< -508.5 30 9 D (0 0.033 0.27 0.7 0) *
## 1627) magnet_dumbbell_z< -50 222 76 D (0.1 0.15 0.032 0.66 0.054)
## 3254) yaw_belt>=-4.215 51 26 B (0.35 0.49 0.12 0.02 0.02)
## 6508) magnet_forearm_x< -695.5 15 1 A (0.93 0.067 0 0 0) *
## 6509) magnet_forearm_x>=-695.5 36 12 B (0.11 0.67 0.17 0.028 0.028) *
## 3255) yaw_belt< -4.215 171 26 D (0.029 0.053 0.0058 0.85 0.064)
## 6510) gyros_belt_x>=-0.28 19 8 E (0 0.42 0 0 0.58) *
## 6511) gyros_belt_x< -0.28 152 7 D (0.033 0.0066 0.0066 0.95 0) *
## 407) yaw_belt>=2.855 90 0 E (0 0 0 0 1) *
## 51) magnet_dumbbell_y>=291.5 1430 965 D (0.063 0.25 0.12 0.33 0.24)
## 102) pitch_forearm< 23.65 871 564 B (0.054 0.35 0.17 0.11 0.31)
## 204) roll_dumbbell< 44.5777 240 50 B (0.05 0.79 0.021 0.062 0.075)
## 408) magnet_dumbbell_x>=-504.5 203 18 B (0.059 0.91 0 0.015 0.015) *
## 409) magnet_dumbbell_x< -504.5 37 22 E (0 0.14 0.14 0.32 0.41)
## 818) yaw_belt>=-93.35 25 10 E (0 0.2 0.2 0 0.6) *
## 819) yaw_belt< -93.35 12 0 D (0 0 0 1 0) *
## 205) roll_dumbbell>=44.5777 631 377 E (0.055 0.19 0.22 0.13 0.4)
## 410) roll_forearm< 132.5 138 42 C (0.051 0.24 0.7 0 0.014)
## 820) pitch_forearm< -23.15 32 7 B (0.22 0.78 0 0 0) *
## 821) pitch_forearm>=-23.15 106 10 C (0 0.075 0.91 0 0.019) *
## 411) roll_forearm>=132.5 493 241 E (0.057 0.17 0.091 0.17 0.51)
## 822) magnet_arm_y>=184.5 151 88 B (0 0.42 0.19 0.17 0.23)
## 1644) gyros_belt_y>=-0.04 127 64 B (0 0.5 0.22 0.2 0.079)
## 3288) yaw_forearm< 122 96 35 B (0 0.64 0.042 0.23 0.094)
## 6576) yaw_belt>=-93.25 83 22 B (0 0.73 0.048 0.11 0.11) *
## 6577) yaw_belt< -93.25 13 0 D (0 0 0 1 0) *
## 3289) yaw_forearm>=122 31 7 C (0 0.065 0.77 0.13 0.032) *
## 1645) gyros_belt_y< -0.04 24 0 E (0 0 0 0 1) *
## 823) magnet_arm_y< 184.5 342 124 E (0.082 0.061 0.05 0.17 0.64)
## 1646) magnet_forearm_z< -221 24 1 A (0.96 0 0 0 0.042) *
## 1647) magnet_forearm_z>=-221 318 101 E (0.016 0.066 0.053 0.18 0.68)
## 3294) gyros_belt_z< -0.16 40 7 D (0 0 0 0.83 0.18) *
## 3295) gyros_belt_z>=-0.16 278 68 E (0.018 0.076 0.061 0.09 0.76) *
## 103) pitch_forearm>=23.65 559 193 D (0.077 0.093 0.052 0.65 0.12)
## 206) pitch_forearm>=61.2 33 4 A (0.88 0.12 0 0 0) *
## 207) pitch_forearm< 61.2 526 160 D (0.027 0.091 0.055 0.7 0.13)
## 414) magnet_belt_z>=-327.5 495 129 D (0.028 0.093 0.059 0.74 0.081) *
## 415) magnet_belt_z< -327.5 31 2 E (0 0.065 0 0 0.94) *
## 13) magnet_dumbbell_y>=439.5 1889 994 B (0.031 0.47 0.035 0.21 0.25)
## 26) total_accel_dumbbell>=5.5 1308 490 B (0.045 0.63 0.049 0.019 0.26)
## 52) magnet_belt_z< -292.5 1141 341 B (0.052 0.7 0.053 0.021 0.17)
## 104) gyros_belt_z>=-0.29 1066 266 B (0.055 0.75 0.056 0.023 0.12)
## 208) gyros_belt_z< 0.075 995 196 B (0.059 0.8 0.06 0.024 0.053)
## 416) yaw_dumbbell< -65.87001 104 45 A (0.57 0.25 0.038 0.12 0.019)
## 832) roll_forearm< 112 59 0 A (1 0 0 0 0) *
## 833) roll_forearm>=112 45 19 B (0 0.58 0.089 0.29 0.044)
## 1666) total_accel_arm>=18 31 5 B (0 0.84 0.032 0.065 0.065) *
## 1667) total_accel_arm< 18 14 3 D (0 0 0.21 0.79 0) *
## 417) yaw_dumbbell>=-65.87001 891 118 B (0 0.87 0.063 0.012 0.057)
## 834) magnet_dumbbell_z< 85.5 550 10 B (0 0.98 0 0.013 0.0055) *
## 835) magnet_dumbbell_z>=85.5 341 108 B (0 0.68 0.16 0.012 0.14)
## 1670) pitch_forearm>=-1.45 299 66 B (0 0.78 0.15 0.013 0.06)
## 3340) yaw_arm>=-75.4 283 50 B (0 0.82 0.16 0.014 0.0071)
## 6680) yaw_belt< 165.5 218 19 B (0 0.91 0.06 0.018 0.0092) *
## 6681) yaw_belt>=165.5 65 31 B (0 0.52 0.48 0 0)
## 13362) roll_arm>=25.3 31 0 B (0 1 0 0 0) *
## 13363) roll_arm< 25.3 34 3 C (0 0.088 0.91 0 0) *
## 3341) yaw_arm< -75.4 16 0 E (0 0 0 0 1) *
## 1671) pitch_forearm< -1.45 42 12 E (0 0 0.29 0 0.71)
## 3342) roll_forearm< 139.5 12 0 C (0 0 1 0 0) *
## 3343) roll_forearm>=139.5 30 0 E (0 0 0 0 1) *
## 209) gyros_belt_z>=0.075 71 1 E (0 0.014 0 0 0.99) *
## 105) gyros_belt_z< -0.29 75 0 E (0 0 0 0 1) *
## 53) magnet_belt_z>=-292.5 167 23 E (0 0.11 0.024 0.006 0.86) *
## 27) total_accel_dumbbell< 5.5 581 206 D (0 0.13 0.0034 0.65 0.22)
## 54) pitch_belt>=13.2 462 87 D (0 0.17 0.0043 0.81 0.017)
## 108) yaw_belt< -2.825 73 0 B (0 1 0 0 0) *
## 109) yaw_belt>=-2.825 389 14 D (0 0.01 0.0051 0.96 0.021) *
## 55) pitch_belt< 13.2 119 0 E (0 0 0 0 1) *
## 7) magnet_belt_y< 557.5 1045 200 E (0.00096 0.012 0.0029 0.18 0.81)
## 14) magnet_dumbbell_z>=146.5 232 64 D (0 0.056 0.013 0.72 0.21)
## 28) magnet_belt_z< -444.5 170 3 D (0 0.018 0 0.98 0) *
## 29) magnet_belt_z>=-444.5 62 14 E (0 0.16 0.048 0.016 0.77) *
## 15) magnet_dumbbell_z< 146.5 813 16 E (0.0012 0 0 0.018 0.98)
## 30) yaw_dumbbell< -122.7058 13 1 D (0 0 0 0.92 0.077) *
## 31) yaw_dumbbell>=-122.7058 800 4 E (0.0013 0 0 0.0038 0.99) *
Bar plot of the frequency of each predicted class (A, B, etc.) for the testing set.
rpartPred_train <- predict(ecurl,training)
rpartPred <- predict(ecurl,testing)
#rpartPred <- predict(ecurl,testing,type="class")
#rpartPred # too big
#str(rpartPred)
plot(rpartPred)
Out of 5885 test samples, the number of misclassified samples is only
sum(rpartPred != testing$classe)
## [1] 653
# df <- data.frame(testing$class, rpartPred)
giving an out-of-sample accuracy of 0.8890399, i.e. an out-of-sample error of about 0.111. But this is just a “by hand” estimate.
Confusion matrices were computed for both the testing and training sets; only the one for the testing set is shown.
confusionMatrix(rpartPred, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1595 62 6 19 21
## B 41 937 34 25 28
## C 5 62 919 57 46
## D 15 30 33 821 27
## E 18 48 34 42 960
##
## Overall Statistics
##
## Accuracy : 0.889
## 95% CI : (0.8807, 0.897)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8596
## Mcnemar's Test P-Value : 0.0007313
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9528 0.8227 0.8957 0.8517 0.8872
## Specificity 0.9744 0.9730 0.9650 0.9787 0.9704
## Pos Pred Value 0.9366 0.8798 0.8439 0.8866 0.8711
## Neg Pred Value 0.9811 0.9581 0.9777 0.9712 0.9745
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2710 0.1592 0.1562 0.1395 0.1631
## Detection Prevalence 0.2894 0.1810 0.1850 0.1573 0.1873
## Balanced Accuracy 0.9636 0.8978 0.9304 0.9152 0.9288
The off-diagonal elements of the confusion matrix show how many samples were misidentified; the diagonal elements show how many were correctly identified. The adjoining “Overall Statistics” also give the accuracy (0.8890399), kappa (0.859579) and McNemar's test p-value (\(7.3133007 \times 10^{-4}\)) for the test set. In contrast, the training set has a “better” accuracy (0.9071122) and kappa (0.8824533), and a p-value of \(2.4405642 \times 10^{-13}\).
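As a cross-check, the accuracy and kappa can be recomputed by hand from the counts printed in the confusion matrix for the testing set. The sketch below copies those counts into a matrix. The variable names are ours, and kappa is computed with the usual Cohen's kappa formula, which we assume is what confusionMatrix() reports.

```r
# Confusion matrix for the testing set, copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(c(1595,  62,   6,  19,  21,
                 41, 937,  34,  25,  28,
                  5,  62, 919,  57,  46,
                 15,  30,  33, 821,  27,
                 18,  48,  34,  42, 960),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

n         <- sum(cm)
observed  <- sum(diag(cm)) / n                      # overall accuracy
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (observed - expected) / (1 - expected)

print(c(n = n, misclassified = n - sum(diag(cm)),
        accuracy = observed, kappa = kappa_hat))
```

The diagonal sums to 5232 of 5885 samples, which reproduces both the 653 misclassified samples counted earlier and the reported accuracy.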
The spread in the accuracies for the training and test sets is important for judging whether the model is over-fitting.
When this markdown file is run repeatedly, 80-90% of the predicted values for the 20-row test file stay the same, reflecting the accuracy of the model used. For a fixed tuneLength, there is a fair amount of jiggle in the predictions due to the random sampling in each run. Most of the variation is confined to certain test numbers. This table is not included in the rMarkdown output.
rpartPred_20 <- predict(ecurl,test_noNA)
#rpartPred_20 # Don't show prediction on the web
The decision tree (CART) model that we built has an accuracy of 89%, which is a moderate success for a first attempt. It is not as good as the 98% overall accuracy that the HAR group achieved with the same data (cite 3).
https://cran.r-project.org/web/packages/caret/vignettes/caret.pdf
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
Read more: http://groupware.les.inf.puc-rio.br/har#wle_paper_section