Introduction

In this project, my goal is to use data (the Weight Lifting Exercise Dataset) from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how an exercise was performed. The participants were asked to perform barbell lifts correctly (class A) and incorrectly in 4 different ways (classes B, C, D and E). This data set is unusual in that it quantifies how well an activity is done, rather than only the more commonly measured how much of it is done. The group Human Activity Recognition http://groupware.les.inf.puc-rio.br/har has kindly provided this data set (cite 3). An edited version of this data set was downloaded for training from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv and for testing from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv.

Libraries

## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: partykit
## Warning: package 'partykit' was built under R version 3.1.3
## Loading required package: grid
## Loading required package: rpart
## Loading required package: rpart.plot
## Warning: package 'rpart.plot' was built under R version 3.1.2

Reading Data

 train <- read.csv('pml-training.csv', na.strings=c("#DIV/0!", "", "NA"), 
                   stringsAsFactors=FALSE)
 test <- read.csv('pml-testing.csv', na.strings ="NA")

Clean data and reduce number of columns

Initially \(train\) has 19622 rows and 160 columns. Many columns consist mostly of blanks, divide-by-zero markers or NA (all treated as NA by \(read.csv()\) above). These columns contain numbers only in the few rows where \(train\$new\_window\) is equal to \(yes\). Because those rows are so few, we do not treat them separately (deleting them made little difference in the results) and simply remove every column that contains any NA.

train_noNA <- train[,colSums(is.na(train)) == 0] 
test_noNA <- test[,colSums(is.na(test)) == 0]  
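
As a minimal illustration of the column filter above, here is the same idiom on a made-up data frame. The name `toy` and its values are hypothetical and not taken from the data set. `colSums(is.na(...))` counts the missing values per column, and the comparison with zero keeps only the complete columns.

```r
# Hypothetical toy data: column b contains one NA, a and c are complete.
toy <- data.frame(a = 1:3, b = c(1, NA, 3), c = c("x", "y", "z"),
                  stringsAsFactors = FALSE)

na_per_col <- colSums(is.na(toy))    # number of NAs in each column
toy_noNA <- toy[, na_per_col == 0]   # keep only columns with zero NAs
names(toy_noNA)
```

Note that this rule is strict: a single NA is enough to drop a column.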

Coerce the column classes to match in the two data sets train and test.

  for (i in seq_len(ncol(test_noNA))) {
      if (names(test_noNA)[i] != names(train_noNA)[i]) {  # check column alignment
          message('name ', i, '!!! ', names(test_noNA)[i], " != ",
            names(train_noNA)[i])
      }
      if (class(test_noNA[,i]) != class(train_noNA[,i])) {  # convert type
          message('class ', i, ' ', names(test_noNA)[i], ' ',
            class(test_noNA[,i]), "!=", class(train_noNA[,i]))
          # train is character: make the test column character as well
          if (class(train_noNA[,i]) == 'character') {
              test_noNA[,i] <- as.character(test_noNA[,i])
          }
          # test is (still) integer: coerce the train column to integer;
          # note that as.integer() drops any fractional part
          if (class(test_noNA[,i]) == 'integer') {
              train_noNA[,i] <- as.integer(train_noNA[,i])
          }
      }
  }
## class 2 user_name factor!=character
## class 5 cvtd_timestamp factor!=character
## class 6 new_window factor!=character
## class 46 magnet_dumbbell_z integer!=numeric
## class 58 magnet_forearm_y integer!=numeric
## class 59 magnet_forearm_z integer!=numeric
## name 60!!! problem_id != classe
## class 60 problem_id integer!=character
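
The same class comparison can also be done without an explicit loop. The following is only a sketch on two hypothetical toy data frames (`toy_train` and `toy_test` are stand-ins, not the real data). It tabulates the first class of every column and lists the columns that disagree.

```r
# Hypothetical toy frames standing in for train_noNA and test_noNA.
toy_train <- data.frame(id = c(1.5, 2.5), name = c("a", "b"),
                        stringsAsFactors = FALSE)
toy_test  <- data.frame(id = c(1L, 2L), name = factor(c("a", "b")))

# First class of each column, side by side.
first_class <- function(x) class(x)[1]
cls <- data.frame(column = names(toy_train),
                  train  = vapply(toy_train, first_class, character(1)),
                  test   = vapply(toy_test,  first_class, character(1)),
                  stringsAsFactors = FALSE)

mismatch <- cls[cls$train != cls$test, ]   # columns whose classes differ
mismatch
```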

Make the dates POSIX. Make the column \(classe\) a factor variable.

train_noNA$cvtd_timestamp <- as.POSIXct(strptime(train_noNA$cvtd_timestamp, "%d/%m/%Y %H:%M"))
test_noNA$cvtd_timestamp <- as.POSIXct(strptime(test_noNA$cvtd_timestamp, "%d/%m/%Y %H:%M"))
train_noNA$classe <- as.factor(train_noNA$classe)
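
To check that the format string reads day before month, one can parse a single example string. The string and the UTC time zone below are assumptions chosen for illustration; the code above relies on the session's default time zone instead.

```r
# Example string in the day/month/year hour:minute layout used above.
stamp <- "05/12/2011 11:23"
parsed <- as.POSIXct(stamp, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Print in ISO order to confirm that day and month were not swapped.
format(parsed, "%Y-%m-%d %H:%M")
```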

Remove additional columns that should not be relevant: non-numeric (\(user\_name\), \(new\_window\)), sequential (\(X\), \(num\_window\)) or time-related (the timestamp columns).

  #names(train_noNA)[1:7]
  #class(train_noNA)
  #summary(train_noNA)
  # Row index, user, timestamp and window columns, dropped from both sets.
  drop_cols <- c("X", "user_name", "raw_timestamp_part_1",
                 "raw_timestamp_part_2", "new_window", "cvtd_timestamp",
                 "num_window")
  train_noNA <- train_noNA[, !(names(train_noNA) %in% drop_cols)]
  test_noNA  <- test_noNA[, !(names(test_noNA) %in% drop_cols)]

  #ncol(train_noNA)
  #names(train_noNA)

Note that if column “X” (1, 2, 3, …) is not removed, it will be the only column the tree routine uses to predict classe, because the file is sorted by class (A, B, C, D, E). The result will be an implausibly perfect fit!
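
The leak through a row index can be reproduced on a tiny made-up example. The labels below are hypothetical and the cut-point rule is a hand-written stand-in for what a tree would learn, not the rpart fit itself. When the labels are sorted, a few thresholds on the index classify every row correctly.

```r
# Hypothetical sorted labels and a running row index, as in column X.
classe <- factor(rep(c("A", "B", "C"), times = c(5, 3, 4)))
X <- seq_along(classe)

# Row numbers at which a new class starts.
starts <- which(diff(as.integer(classe)) != 0) + 1

# Predict the class purely from where X falls between those cut points.
pred <- levels(classe)[findInterval(X, starts) + 1]
mean(pred == classe)   # fraction classified correctly from X alone
```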

The remaining variables:

vector variables

  1. “roll_belt” “pitch_belt” “yaw_belt”
  2. “gyros_belt_x” “gyros_belt_y” “gyros_belt_z”
  3. “accel_belt_x” “accel_belt_y” “accel_belt_z”
  4. “magnet_belt_x” “magnet_belt_y” “magnet_belt_z”
  5. “roll_arm” “pitch_arm” “yaw_arm”
  6. “gyros_arm_x” “gyros_arm_y” “gyros_arm_z”
  7. “accel_arm_x” “accel_arm_y” “accel_arm_z”
  8. “magnet_arm_x” “magnet_arm_y” “magnet_arm_z”
  9. “roll_dumbbell” “pitch_dumbbell” “yaw_dumbbell”
  10. “gyros_dumbbell_x” “gyros_dumbbell_y” “gyros_dumbbell_z”
  11. “accel_dumbbell_x” “accel_dumbbell_y” “accel_dumbbell_z”
  12. “magnet_dumbbell_x” “magnet_dumbbell_y” “magnet_dumbbell_z”
  13. “roll_forearm” “pitch_forearm” “yaw_forearm”
  14. “gyros_forearm_x” “gyros_forearm_y” “gyros_forearm_z”
  15. “accel_forearm_x” “accel_forearm_y” “accel_forearm_z”
  16. “magnet_forearm_x” “magnet_forearm_y” “magnet_forearm_z”

scalar variables

  1. “total_accel_belt”
  2. “total_accel_arm”
  3. “total_accel_dumbbell”
  4. “total_accel_forearm”
  5. “classe” (Our result variable.)

There are numerous 3-vectors of the form (roll, pitch, yaw) and (x, y, z), plus scalar summaries of the acceleration. We chose to remove the 3-D x-y-z vectors for acceleration, on the assumption that they are well represented by the total accelerations (total_accel_belt, total_accel_arm, total_accel_dumbbell and total_accel_forearm).

# Build the 12 names accel_<sensor>_<axis> and drop them from train_noNA.
accel_xyz <- paste0("accel_",
                    rep(c("belt", "arm", "dumbbell", "forearm"), each = 3),
                    "_", c("x", "y", "z"))
train_noNA <- train_noNA[, !(names(train_noNA) %in% accel_xyz)]

Check for correlations

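Look at the correlations between the predictors (\(symnum(cor(train_noNA))\), with the factor column \(classe\) left out, since cor() needs numeric input) to see if additional columns should be eliminated. We discard one column from each “duplicate” pair at the 1.0 and 0.95 correlation levels, as dropping these has little effect on the resulting accuracy.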

train_noNA <- subset(train_noNA, select = -roll_belt)  # to 1.0 correlation
train_noNA <- subset(train_noNA, select = -gyros_arm_x)# to 1.0 correlation
train_noNA <- subset(train_noNA, select = -gyros_dumbbell_x)# to 0.95 correlation
train_noNA <- subset(train_noNA, select = -gyros_dumbbell_z)# to 0.95 correlation
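
Reading a large symnum() display by eye is error-prone, so the highly correlated pairs can also be listed programmatically. The sketch below uses base R on hypothetical data. The frame `toy`, its columns and the 0.95 cutoff are assumptions for illustration, and this is not the procedure used to pick the four columns above.

```r
set.seed(1)
# Hypothetical predictors: b is nearly a copy of a, c is independent noise.
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.01), c = rnorm(200))

cm <- abs(cor(toy))
cm[upper.tri(cm, diag = TRUE)] <- 0   # count each pair once, skip the diagonal
high <- which(cm >= 0.95, arr.ind = TRUE)

high_pairs <- data.frame(var1 = rownames(cm)[high[, "row"]],
                         var2 = colnames(cm)[high[, "col"]],
                         stringsAsFactors = FALSE)
high_pairs   # one row per pair with |correlation| >= 0.95
```

One member of each listed pair is then a candidate for removal.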

Partition the Data

The original file “pml-testing.csv” (held here in the data frame test_noNA) has only 20 rows, which is too small to act as a test set. We therefore split “pml-training.csv” (data frame train_noNA) into 70% “training” and 30% “testing” sets. Sorry for any confusion with the naming! “By default, createDataPartition does a stratified random split of the data.” (cite 2) This is good, since the data is sorted into long stretches of A’s, then B’s, etc.

nr <- nrow(train_noNA)
indexTrain <- createDataPartition(y=train_noNA$classe,p=0.7, list=FALSE)
training <- train_noNA[indexTrain,]
testing <- train_noNA[-indexTrain,]
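
To make the idea of a stratified split concrete, here is a base R sketch that samples 70% of the rows within each class. It is a hand-rolled stand-in and not the internals of createDataPartition(). The labels are hypothetical, and the seed is set before the split so that the partition can be repeated.

```r
set.seed(12345)  # seed BEFORE the random split so the partition is repeatable
# Hypothetical sorted labels, mimicking the long runs of A's, B's, ... in the file.
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class separately.
idx_by_class <- split(seq_along(classe), classe)
train_idx <- sort(unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.7 * length(i)))]
}), use.names = FALSE))

table(classe[train_idx])    # per-class counts in the training part
table(classe[-train_idx])   # per-class counts in the held-out part
```

Because sampling happens per class, both parts keep the class proportions of the full data even though the rows are sorted by class.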

Training with cross validation. Training Accuracy (in sample error)

Without deep prior knowledge of the method, we try the default values of the method “rpart” in train(). “rpart” stands for Recursive PARTitioning and uses binary tree models. “By default, rpart will conduct as many splits as possible, then use 10–fold (default 10) cross–validation to prune the tree.” (cite ref. 1) We choose to use this default, under the assumption that it is a relatively robust, safe method. Note that the train() summary below reports its own resampling as bootstrapped (25 reps).

We do not select any special preprocessing.

    #ncol(training)
    #preProc <- preProcess(training, method="pca", thresh=0.8)
    preProc <- training  # skip PCA for now.
    #ncol(preProc)
   #
  # Seeds train() only; createDataPartition() above ran unseeded, so the
  # partition, and with it the results, still change from run to run.
  set.seed(12345)
  tuneLen <- 57
  ecurl <- train(classe ~ ., data = preProc, method = 'rpart', 
                   tuneLength = tuneLen)

  ecurl  # accuracy vs. complexity factor
## CART 
## 
## 13737 samples
##    36 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ... 
## 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.001085003  0.8808540  0.8492132  0.009537772  0.01207030
##   0.001118910  0.8791355  0.8470318  0.010091274  0.01276924
##   0.001220629  0.8741583  0.8407266  0.010404739  0.01317296
##   0.001322348  0.8696684  0.8350361  0.011339719  0.01435252
##   0.001424067  0.8660921  0.8305116  0.011059922  0.01398661
##   0.001525786  0.8625939  0.8260728  0.011096681  0.01404242
##   0.001627505  0.8594769  0.8221037  0.010198019  0.01289416
##   0.001729224  0.8553365  0.8168581  0.010035058  0.01263050
##   0.001830943  0.8522367  0.8129559  0.010185987  0.01280033
##   0.001932662  0.8481686  0.8078164  0.008923042  0.01121950
##   0.002034381  0.8433763  0.8017161  0.009616024  0.01210722
##   0.002136100  0.8403938  0.7979522  0.010364174  0.01304447
##   0.002237819  0.8369607  0.7936318  0.010421257  0.01309219
##   0.002339538  0.8325711  0.7880685  0.009608944  0.01211139
##   0.002441257  0.8297570  0.7845054  0.010394060  0.01309671
##   0.002542976  0.8271648  0.7812273  0.010578594  0.01331584
##   0.002644695  0.8233028  0.7763662  0.010734706  0.01347792
##   0.002695555  0.8218472  0.7745183  0.011072188  0.01391836
##   0.002746414  0.8201766  0.7723869  0.010332279  0.01301941
##   0.002848133  0.8169554  0.7683313  0.010191694  0.01280363
##   0.002949853  0.8143870  0.7650662  0.011757800  0.01482560
##   0.003153291  0.8087807  0.7579492  0.011892688  0.01498567
##   0.003356729  0.8042734  0.7522312  0.011734823  0.01482389
##   0.003458448  0.8012604  0.7484137  0.012846560  0.01626398
##   0.003611026  0.7978328  0.7440567  0.012035850  0.01522795
##   0.003865324  0.7936912  0.7387999  0.013065084  0.01653196
##   0.004068762  0.7888658  0.7326677  0.011848130  0.01505912
##   0.004373919  0.7817661  0.7237014  0.012214945  0.01540532
##   0.004577357  0.7754863  0.7158288  0.015370042  0.01929532
##   0.004780795  0.7716600  0.7110039  0.016052833  0.02015046
##   0.004984234  0.7677888  0.7060790  0.014920467  0.01873674
##   0.005289391  0.7634386  0.7006102  0.017063733  0.02137203
##   0.005391110  0.7621584  0.6989706  0.017330616  0.02172068
##   0.006408300  0.7470547  0.6798613  0.016506540  0.02072862
##   0.007018615  0.7382331  0.6687370  0.017483611  0.02187545
##   0.007069474  0.7377465  0.6681162  0.017796415  0.02229953
##   0.007120334  0.7373286  0.6675896  0.017916996  0.02247066
##   0.007425491  0.7339008  0.6632985  0.019037621  0.02385369
##   0.007628929  0.7318842  0.6607947  0.018309636  0.02291759
##   0.007730648  0.7305628  0.6591350  0.018060609  0.02259936
##   0.009154715  0.7121037  0.6355773  0.020926726  0.02620143
##   0.009358153  0.7080305  0.6303466  0.019639851  0.02459704
##   0.009561591  0.7062965  0.6280785  0.019588642  0.02462334
##   0.010985658  0.6870730  0.6031877  0.019287622  0.02434831
##   0.012104567  0.6690407  0.5803797  0.024687230  0.03091297
##   0.012206286  0.6661275  0.5766940  0.023689708  0.02969428
##   0.012816601  0.6611544  0.5704193  0.020977665  0.02625231
##   0.013223477  0.6555380  0.5634306  0.020348295  0.02558808
##   0.013579493  0.6492097  0.5555795  0.019564540  0.02446947
##   0.013935510  0.6467380  0.5524711  0.020041587  0.02506769
##   0.014647543  0.6381176  0.5419425  0.017854023  0.02197053
##   0.019733496  0.5751783  0.4581687  0.035222927  0.05157210
##   0.020496389  0.5624660  0.4408688  0.031291701  0.04725406
##   0.021157563  0.5458416  0.4174330  0.029495192  0.04683469
##   0.029498525  0.4937783  0.3429248  0.020443241  0.03450669
##   0.030312277  0.4919614  0.3404424  0.020880368  0.03536750
##   0.066320822  0.3889187  0.1732551  0.096701363  0.15678399
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.001085003.

The results are vectors of length 57, corresponding to tuneLength. The “best” fit from the resampling is that with a complexity parameter of cp = 0.001085, so the best training accuracy is 0.881 \(\pm\) 0.010 and the best kappa is 0.849 \(\pm\) 0.012. Since this is the first row, i.e. the smallest cp in the grid, we are a bit suspicious: the optimum may lie at a still smaller cp outside the range tried.
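
The selection rule and the reason for suspicion can be checked on a few rows copied from the resampling table above. The data frame `res` and its column name `AccuracySD` are assumptions for this sketch; in a fitted caret model the full table would normally be found in `ecurl$results`.

```r
# A few rows copied from the resampling table (cp, Accuracy, Accuracy SD).
res <- data.frame(
  cp         = c(0.001085003, 0.001118910, 0.001220629, 0.066320822),
  Accuracy   = c(0.8808540, 0.8791355, 0.8741583, 0.3889187),
  AccuracySD = c(0.009537772, 0.010091274, 0.010404739, 0.096701363)
)

best <- res[which.max(res$Accuracy), ]   # row with the largest accuracy
best$cp

# TRUE when the winner is the smallest cp tried, i.e. on the edge of the grid.
at_grid_edge <- best$cp == min(res$cp)
at_grid_edge
```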

We have used the train() option \(tuneLength\), which defaults to 3. Values as high as about 90 “maximize” the training accuracy. However, with increasing values the tree gets more and more branches and leaves, and we suspect that we may be over-fitting. Following a suggestion from community TA Ronny Restrepo, we compared the accuracy on the “train” and “test” data sets. The accuracy is usually a bit lower for the test data set, but the difference between the two increases as the model overfits the train data while the test accuracy stagnates (see table below). We felt a good compromise is tuneLength = 57. The table below shows our exploration. For a fixed tuneLength, there is a fair amount of jiggle between runs due to different random initializations (the random partition and the resampling within train()).

##    tuneLen diffAccur test_Accur train_Accur
## 1        7   -0.0100      0.607       0.606
## 2        7   -0.0100      0.608       0.606
## 3       15   -0.0200      0.694       0.702
## 4       30   -0.0100      0.774       0.778
## 5       37   -0.0153      0.781       0.789
## 6       45   -0.0254      0.839       0.859
## 7       52   -0.0208      0.856       0.871
## 8       57   -0.0283      0.897       0.921
## 9       60   -0.0240      0.908       0.927
## 10      63   -0.0342      0.898       0.928
## 11      67   -0.0354      0.900       0.931
## 12      75   -0.0358      0.910       0.942
## 13      75   -0.0409      0.902       0.939
## 14      90   -0.0416      0.903       0.941
## 15      90   -0.0362      0.912       0.945
## 16     120   -0.0400      0.907       0.945
## 17      57   -0.0179      0.884       0.897
## 18      57   -0.0289      0.881       0.905
## 19      57   -0.0330      0.868       0.896
## 20      57   -0.0207      0.886       0.901
## 21      57   -0.0206      0.873       0.888
## 22      57   -0.0259      0.864       0.885
## 23      57   -0.0305      0.867       0.892
## 24      57   -0.0188      0.892       0.906
## 25      57   -0.0257      0.872       0.893
## 26      57   -0.0232      0.881       0.899
## 27      57   -0.0189      0.883       0.897
## 28      57   -0.0340      0.871       0.900
## 29      57   -0.0119      0.886       0.893
## 30      57   -0.0211      0.878       0.894
## 31      57   -0.0216      0.885       0.902
## 32      57   -0.0263      0.877       0.898
## 33      57   -0.0200      0.877       0.892
## 34      57   -0.0254      0.889       0.909
## 35      57   -0.0213      0.885       0.901
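
The run-to-run jiggle at tuneLength = 57 can be summarised from the table itself. The vector below copies the test_Accur values of the 20 runs with tuneLen = 57 (rows 8 and 17 to 35). The variable name `test_acc_57` is made up for this sketch.

```r
# test_Accur for the 20 runs with tuneLen = 57 (rows 8 and 17-35 of the table).
test_acc_57 <- c(0.897, 0.884, 0.881, 0.868, 0.886, 0.873, 0.864, 0.867,
                 0.892, 0.872, 0.881, 0.883, 0.871, 0.886, 0.878, 0.885,
                 0.877, 0.877, 0.889, 0.885)

length(test_acc_57)   # number of repeated runs
mean(test_acc_57)     # typical test accuracy
sd(test_acc_57)       # spread between runs
range(test_acc_57)    # worst and best run
```

The spread between runs gives a rough yardstick for judging whether a difference between two tuneLength settings is larger than the noise.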

The plot of the fitted model shows the training accuracy as a function of the complexity parameter, with more complex trees on the left. The selected point for this example is the left-most one.

Below is our best decision tree in text format, which is rather long. Sorry! The plot of the tree has been left out because, with this many branches and leaves, it is unreadable.

## n= 13737 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##      1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)  
##        2) pitch_forearm< -33.55 1106    9 A (0.99 0.0081 0 0 0) *
##        3) pitch_forearm>=-33.55 12631 9822 A (0.22 0.21 0.19 0.18 0.2)  
##          6) magnet_belt_y>=557.5 11586 8778 A (0.24 0.23 0.21 0.18 0.15)  
##           12) magnet_dumbbell_y< 439.5 9697 6948 A (0.28 0.18 0.24 0.17 0.12)  
##             24) roll_forearm< 120.5 6068 3611 A (0.4 0.18 0.18 0.15 0.092)  
##               48) magnet_dumbbell_z< -27.5 2041  673 A (0.67 0.2 0.014 0.067 0.054)  
##                 96) roll_forearm>=-136.5 1701  368 A (0.78 0.16 0.014 0.012 0.032)  
##                  192) roll_forearm< 113.5 1572  277 A (0.82 0.12 0.015 0.011 0.034)  
##                    384) magnet_dumbbell_y< 379.5 1287  101 A (0.92 0.061 0.0031 0.0016 0.012)  
##                      768) gyros_dumbbell_y>=-0.525 1271   85 A (0.93 0.05 0.0031 0.0016 0.013)  
##                       1536) gyros_belt_z< 0.16 1257   71 A (0.94 0.05 0.0032 0.0016 0.0016) *
##                       1537) gyros_belt_z>=0.16 14    0 E (0 0 0 0 1) *
##                      769) gyros_dumbbell_y< -0.525 16    0 B (0 1 0 0 0) *
##                    385) magnet_dumbbell_y>=379.5 285  176 A (0.38 0.36 0.07 0.053 0.13)  
##                      770) pitch_belt>=17.45 53    0 B (0 1 0 0 0) *
##                      771) pitch_belt< 17.45 232  123 A (0.47 0.22 0.086 0.065 0.16)  
##                       1542) roll_dumbbell>=-71.91276 156   51 A (0.67 0.23 0 0.096 0)  
##                         3084) pitch_arm>=-39.55 139   34 A (0.76 0.14 0 0.11 0) *
##                         3085) pitch_arm< -39.55 17    0 B (0 1 0 0 0) *
##                       1543) roll_dumbbell< -71.91276 76   38 E (0.053 0.18 0.26 0 0.5)  
##                         3086) pitch_belt>=13.65 38   18 C (0.11 0.37 0.53 0 0)  
##                           6172) pitch_belt>=14.7 18    4 B (0.22 0.78 0 0 0) *
##                           6173) pitch_belt< 14.7 20    0 C (0 0 1 0 0) *
##                         3087) pitch_belt< 13.65 38    0 E (0 0 0 0 1) *
##                  193) roll_forearm>=113.5 129   42 B (0.29 0.67 0 0.031 0)  
##                    386) magnet_belt_x>=25.5 18    0 A (1 0 0 0 0) *
##                    387) magnet_belt_x< 25.5 111   24 B (0.18 0.78 0 0.036 0) *
##                 97) roll_forearm< -136.5 340  211 B (0.1 0.38 0.012 0.34 0.16)  
##                  194) gyros_arm_y>=0.985 121   32 B (0 0.74 0 0 0.26)  
##                    388) yaw_belt< 2.4 92    3 B (0 0.97 0 0 0.033) *
##                    389) yaw_belt>=2.4 29    0 E (0 0 0 0 1) *
##                  195) gyros_arm_y< 0.985 219  103 D (0.16 0.18 0.018 0.53 0.11)  
##                    390) yaw_arm< -71.6 30    1 B (0.033 0.97 0 0 0) *
##                    391) yaw_arm>=-71.6 189   73 D (0.18 0.058 0.021 0.61 0.13)  
##                      782) gyros_belt_z< 0.075 167   51 D (0.2 0.066 0.024 0.69 0.012)  
##                       1564) yaw_belt>=-4.76 46   17 A (0.63 0.065 0.087 0.17 0.043) *
##                       1565) yaw_belt< -4.76 121   13 D (0.041 0.066 0 0.89 0) *
##                      783) gyros_belt_z>=0.075 22    0 E (0 0 0 0 1) *
##               49) magnet_dumbbell_z>=-27.5 4027 2938 A (0.27 0.17 0.27 0.18 0.11)  
##                 98) yaw_belt>=169.5 472   59 A (0.88 0.064 0 0.061 0)  
##                  196) pitch_belt>=-45.05 433   20 A (0.95 0.046 0 0 0) *
##                  197) pitch_belt< -45.05 39   10 D (0 0.26 0 0.74 0) *
##                 99) yaw_belt< 169.5 3555 2476 C (0.19 0.18 0.3 0.2 0.13)  
##                  198) pitch_belt< -43.15 346   59 B (0.02 0.83 0.069 0.049 0.032) *
##                  199) pitch_belt>=-43.15 3209 2154 C (0.21 0.11 0.33 0.22 0.14)  
##                    398) yaw_arm< -119.5 213    6 A (0.97 0.028 0 0 0) *
##                    399) yaw_arm>=-119.5 2996 1941 C (0.15 0.12 0.35 0.23 0.15)  
##                      798) magnet_belt_z< -342.5 1057  746 E (0.28 0.17 0.2 0.058 0.29)  
##                       1596) magnet_forearm_z>=-101.5 506  235 A (0.54 0.004 0.059 0.02 0.38)  
##                         3192) magnet_forearm_x>=-191.5 373  102 A (0.73 0.0054 0.024 0.016 0.23)  
##                           6384) magnet_dumbbell_z< 333 300   38 A (0.87 0.0067 0.03 0.013 0.077) *
##                           6385) magnet_dumbbell_z>=333 73   11 E (0.12 0 0 0.027 0.85) *
##                         3193) magnet_forearm_x< -191.5 133   25 E (0 0 0.16 0.03 0.81)  
##                           6386) pitch_belt>=12.3 27    6 C (0 0 0.78 0.15 0.074) *
##                           6387) pitch_belt< 12.3 106    0 E (0 0 0 0 1) *
##                       1597) magnet_forearm_z< -101.5 551  367 C (0.044 0.32 0.33 0.093 0.21)  
##                         3194) yaw_dumbbell< -38.15102 168   19 C (0 0.054 0.89 0.054 0.006) *
##                         3195) yaw_dumbbell>=-38.15102 383  218 B (0.063 0.43 0.091 0.11 0.31)  
##                           6390) roll_dumbbell< 36.13496 187   37 B (0.011 0.8 0.11 0.021 0.053)  
##                            12780) magnet_dumbbell_y>=145.5 161   11 B (0.012 0.93 0.037 0.012 0.0062) *
##                            12781) magnet_dumbbell_y< 145.5 26   11 C (0 0 0.58 0.077 0.35) *
##                           6391) roll_dumbbell>=36.13496 196   89 E (0.11 0.077 0.071 0.19 0.55)  
##                            12782) magnet_belt_z< -401.5 170   63 E (0.13 0.088 0.082 0.071 0.63)  
##                              25564) yaw_belt< -88.85 26   11 A (0.58 0.42 0 0 0) *
##                              25565) yaw_belt>=-88.85 144   37 E (0.049 0.028 0.097 0.083 0.74)  
##                                51130) yaw_dumbbell< 53.19289 19    5 C (0.26 0 0.74 0 0) *
##                                51131) yaw_dumbbell>=53.19289 125   18 E (0.016 0.032 0 0.096 0.86) *
##                            12783) magnet_belt_z>=-401.5 26    0 D (0 0 0 1 0) *
##                      799) magnet_belt_z>=-342.5 1939 1098 C (0.086 0.088 0.43 0.33 0.064)  
##                       1598) roll_dumbbell< 59.01136 1458  640 C (0.078 0.092 0.56 0.22 0.048)  
##                         3196) yaw_belt< -2.955 326  217 B (0.2 0.33 0.12 0.33 0.018)  
##                           6392) roll_forearm>=-97.9 120   54 A (0.55 0.44 0.0083 0 0)  
##                            12784) roll_dumbbell>=-50.12326 97   31 A (0.68 0.31 0.01 0 0)  
##                              25568) gyros_belt_z>=-0.12 82   16 A (0.8 0.18 0.012 0 0) *
##                              25569) gyros_belt_z< -0.12 15    0 B (0 1 0 0 0) *
##                            12785) roll_dumbbell< -50.12326 23    0 B (0 1 0 0 0) *
##                           6393) roll_forearm< -97.9 206  100 D (0 0.27 0.18 0.51 0.029)  
##                            12786) magnet_forearm_y< -498 54    7 B (0 0.87 0.019 0.037 0.074) *
##                            12787) magnet_forearm_y>=-498 152   48 D (0 0.059 0.24 0.68 0.013)  
##                              25574) gyros_arm_y>=0.86 38    7 C (0 0.079 0.82 0.053 0.053) *
##                              25575) gyros_arm_y< 0.86 114   12 D (0 0.053 0.053 0.89 0) *
##                         3197) yaw_belt>=-2.955 1132  353 C (0.042 0.022 0.69 0.19 0.057)  
##                           6394) yaw_belt< 163.5 811  165 C (0.047 0.0074 0.8 0.07 0.079)  
##                            12788) total_accel_dumbbell>=4.5 769  124 C (0.049 0.0078 0.84 0.021 0.083)  
##                              25576) yaw_arm>=89.35 39    1 A (0.97 0.026 0 0 0) *
##                              25577) yaw_arm< 89.35 730   85 C (0 0.0068 0.88 0.022 0.088)  
##                                51154) magnet_belt_x< 171.5 663   37 C (0 0.006 0.94 0.023 0.027)  
##                                 102308) gyros_belt_x< 0.25 651   25 C (0 0.0061 0.96 0.022 0.011) *
##                                 102309) gyros_belt_x>=0.25 12    1 E (0 0 0 0.083 0.92) *
##                                51155) magnet_belt_x>=171.5 67   21 E (0 0.015 0.28 0.015 0.69)  
##                                 102310) yaw_belt>=154 21    2 C (0 0.048 0.9 0.048 0) *
##                                 102311) yaw_belt< 154 46    0 E (0 0 0 0 1) *
##                            12789) total_accel_dumbbell< 4.5 42    1 D (0 0 0.024 0.98 0) *
##                           6395) yaw_belt>=163.5 321  162 D (0.031 0.059 0.41 0.5 0)  
##                            12790) magnet_belt_x>=165.5 134   35 C (0.052 0.052 0.74 0.16 0) *
##                            12791) magnet_belt_x< 165.5 187   49 D (0.016 0.064 0.18 0.74 0)  
##                              25582) pitch_belt< -42.05 38   18 C (0.079 0.32 0.53 0.079 0)  
##                                51164) magnet_dumbbell_y>=296 17    6 B (0.18 0.65 0 0.18 0) *
##                                51165) magnet_dumbbell_y< 296 21    1 C (0 0.048 0.95 0 0) *
##                              25583) pitch_belt>=-42.05 149   14 D (0 0 0.094 0.91 0) *
##                       1599) roll_dumbbell>=59.01136 481  167 D (0.11 0.075 0.048 0.65 0.11)  
##                         3198) roll_forearm>=43.3 49    6 A (0.88 0.12 0 0 0) *
##                         3199) roll_forearm< 43.3 432  118 D (0.023 0.069 0.053 0.73 0.13)  
##                           6398) magnet_belt_x< 167.5 382   75 D (0.026 0.079 0.039 0.8 0.052)  
##                            12796) pitch_belt< -42.6 21    2 B (0.095 0.9 0 0 0) *
##                            12797) pitch_belt>=-42.6 361   54 D (0.022 0.03 0.042 0.85 0.055)  
##                              25594) magnet_forearm_y>=-525 341   34 D (0.023 0.021 0.044 0.9 0.012) *
##                              25595) magnet_forearm_y< -525 20    4 E (0 0.2 0 0 0.8) *
##                           6399) magnet_belt_x>=167.5 50   15 E (0 0 0.16 0.14 0.7) *
##             25) roll_forearm>=120.5 3629 2409 C (0.08 0.19 0.34 0.22 0.18)  
##               50) magnet_dumbbell_y< 291.5 2199 1154 C (0.092 0.14 0.48 0.15 0.14)  
##                100) magnet_forearm_z< -248.5 153   21 A (0.86 0.046 0 0.013 0.078)  
##                  200) roll_forearm< 175.5 134    2 A (0.99 0.015 0 0 0) *
##                  201) roll_forearm>=175.5 19    7 E (0 0.26 0 0.11 0.63) *
##                101) magnet_forearm_z>=-248.5 2046 1001 C (0.034 0.15 0.51 0.16 0.15)  
##                  202) pitch_belt>=26.15 169   29 B (0.1 0.83 0.03 0 0.041) *
##                  203) pitch_belt< 26.15 1877  837 C (0.028 0.09 0.55 0.17 0.16)  
##                    406) yaw_belt< 2.855 1787  747 C (0.03 0.094 0.58 0.18 0.11)  
##                      812) pitch_forearm< 38.95 1424  492 C (0.021 0.091 0.65 0.1 0.13)  
##                       1624) magnet_dumbbell_z< 283.5 1261  348 C (0.0056 0.074 0.72 0.098 0.099)  
##                         3248) gyros_belt_z< 0.075 1199  286 C (0.0058 0.078 0.76 0.08 0.075)  
##                           6496) roll_forearm>=125.5 1153  240 C (0.0061 0.072 0.79 0.052 0.078)  
##                            12992) gyros_belt_y>=-0.105 1121  208 C (0.0062 0.074 0.81 0.053 0.053)  
##                              25984) magnet_forearm_z< 784.5 1026  158 C (0.0068 0.062 0.85 0.058 0.027) *
##                              25985) magnet_forearm_z>=784.5 95   50 C (0 0.2 0.47 0 0.33)  
##                                51970) magnet_forearm_x< -85 59   17 C (0 0.25 0.71 0 0.034) *
##                                51971) magnet_forearm_x>=-85 36    7 E (0 0.11 0.083 0 0.81) *
##                            12993) gyros_belt_y< -0.105 32    1 E (0 0 0 0.031 0.97) *
##                           6497) roll_forearm< 125.5 46   10 D (0 0.22 0 0.78 0) *
##                         3249) gyros_belt_z>=0.075 62   27 E (0 0 0 0.44 0.56)  
##                           6498) magnet_forearm_z< 281 27    0 D (0 0 0 1 0) *
##                           6499) magnet_forearm_z>=281 35    0 E (0 0 0 0 1) *
##                       1625) magnet_dumbbell_z>=283.5 163  104 E (0.14 0.22 0.12 0.16 0.36)  
##                         3250) roll_dumbbell< 24.19839 77   43 B (0.3 0.44 0.22 0.039 0)  
##                           6500) roll_arm< 85.5 21    0 A (1 0 0 0 0) *
##                           6501) roll_arm>=85.5 56   22 B (0.036 0.61 0.3 0.054 0)  
##                            13002) yaw_belt>=-89.3 43    9 B (0.047 0.79 0.093 0.07 0) *
##                            13003) yaw_belt< -89.3 13    0 C (0 0 1 0 0) *
##                         3251) roll_dumbbell>=24.19839 86   27 E (0 0.023 0.023 0.27 0.69)  
##                           6502) pitch_belt< 0.415 21    0 D (0 0 0 1 0) *
##                           6503) pitch_belt>=0.415 65    6 E (0 0.031 0.031 0.031 0.91) *
##                      813) pitch_forearm>=38.95 363  190 D (0.063 0.11 0.3 0.48 0.055)  
##                       1626) magnet_dumbbell_z>=-50 141   40 C (0 0.035 0.72 0.19 0.057)  
##                         3252) magnet_dumbbell_y>=-508.5 111   18 C (0 0.036 0.84 0.054 0.072) *
##                         3253) magnet_dumbbell_y< -508.5 30    9 D (0 0.033 0.27 0.7 0) *
##                       1627) magnet_dumbbell_z< -50 222   76 D (0.1 0.15 0.032 0.66 0.054)  
##                         3254) yaw_belt>=-4.215 51   26 B (0.35 0.49 0.12 0.02 0.02)  
##                           6508) magnet_forearm_x< -695.5 15    1 A (0.93 0.067 0 0 0) *
##                           6509) magnet_forearm_x>=-695.5 36   12 B (0.11 0.67 0.17 0.028 0.028) *
##                         3255) yaw_belt< -4.215 171   26 D (0.029 0.053 0.0058 0.85 0.064)  
##                           6510) gyros_belt_x>=-0.28 19    8 E (0 0.42 0 0 0.58) *
##                           6511) gyros_belt_x< -0.28 152    7 D (0.033 0.0066 0.0066 0.95 0) *
##                    407) yaw_belt>=2.855 90    0 E (0 0 0 0 1) *
##               51) magnet_dumbbell_y>=291.5 1430  965 D (0.063 0.25 0.12 0.33 0.24)  
##                102) pitch_forearm< 23.65 871  564 B (0.054 0.35 0.17 0.11 0.31)  
##                  204) roll_dumbbell< 44.5777 240   50 B (0.05 0.79 0.021 0.062 0.075)  
##                    408) magnet_dumbbell_x>=-504.5 203   18 B (0.059 0.91 0 0.015 0.015) *
##                    409) magnet_dumbbell_x< -504.5 37   22 E (0 0.14 0.14 0.32 0.41)  
##                      818) yaw_belt>=-93.35 25   10 E (0 0.2 0.2 0 0.6) *
##                      819) yaw_belt< -93.35 12    0 D (0 0 0 1 0) *
##                  205) roll_dumbbell>=44.5777 631  377 E (0.055 0.19 0.22 0.13 0.4)  
##                    410) roll_forearm< 132.5 138   42 C (0.051 0.24 0.7 0 0.014)  
##                      820) pitch_forearm< -23.15 32    7 B (0.22 0.78 0 0 0) *
##                      821) pitch_forearm>=-23.15 106   10 C (0 0.075 0.91 0 0.019) *
##                    411) roll_forearm>=132.5 493  241 E (0.057 0.17 0.091 0.17 0.51)  
##                      822) magnet_arm_y>=184.5 151   88 B (0 0.42 0.19 0.17 0.23)  
##                       1644) gyros_belt_y>=-0.04 127   64 B (0 0.5 0.22 0.2 0.079)  
##                         3288) yaw_forearm< 122 96   35 B (0 0.64 0.042 0.23 0.094)  
##                           6576) yaw_belt>=-93.25 83   22 B (0 0.73 0.048 0.11 0.11) *
##                           6577) yaw_belt< -93.25 13    0 D (0 0 0 1 0) *
##                         3289) yaw_forearm>=122 31    7 C (0 0.065 0.77 0.13 0.032) *
##                       1645) gyros_belt_y< -0.04 24    0 E (0 0 0 0 1) *
##                      823) magnet_arm_y< 184.5 342  124 E (0.082 0.061 0.05 0.17 0.64)  
##                       1646) magnet_forearm_z< -221 24    1 A (0.96 0 0 0 0.042) *
##                       1647) magnet_forearm_z>=-221 318  101 E (0.016 0.066 0.053 0.18 0.68)  
##                         3294) gyros_belt_z< -0.16 40    7 D (0 0 0 0.83 0.18) *
##                         3295) gyros_belt_z>=-0.16 278   68 E (0.018 0.076 0.061 0.09 0.76) *
##                103) pitch_forearm>=23.65 559  193 D (0.077 0.093 0.052 0.65 0.12)  
##                  206) pitch_forearm>=61.2 33    4 A (0.88 0.12 0 0 0) *
##                  207) pitch_forearm< 61.2 526  160 D (0.027 0.091 0.055 0.7 0.13)  
##                    414) magnet_belt_z>=-327.5 495  129 D (0.028 0.093 0.059 0.74 0.081) *
##                    415) magnet_belt_z< -327.5 31    2 E (0 0.065 0 0 0.94) *
##           13) magnet_dumbbell_y>=439.5 1889  994 B (0.031 0.47 0.035 0.21 0.25)  
##             26) total_accel_dumbbell>=5.5 1308  490 B (0.045 0.63 0.049 0.019 0.26)  
##               52) magnet_belt_z< -292.5 1141  341 B (0.052 0.7 0.053 0.021 0.17)  
##                104) gyros_belt_z>=-0.29 1066  266 B (0.055 0.75 0.056 0.023 0.12)  
##                  208) gyros_belt_z< 0.075 995  196 B (0.059 0.8 0.06 0.024 0.053)  
##                    416) yaw_dumbbell< -65.87001 104   45 A (0.57 0.25 0.038 0.12 0.019)  
##                      832) roll_forearm< 112 59    0 A (1 0 0 0 0) *
##                      833) roll_forearm>=112 45   19 B (0 0.58 0.089 0.29 0.044)  
##                       1666) total_accel_arm>=18 31    5 B (0 0.84 0.032 0.065 0.065) *
##                       1667) total_accel_arm< 18 14    3 D (0 0 0.21 0.79 0) *
##                    417) yaw_dumbbell>=-65.87001 891  118 B (0 0.87 0.063 0.012 0.057)  
##                      834) magnet_dumbbell_z< 85.5 550   10 B (0 0.98 0 0.013 0.0055) *
##                      835) magnet_dumbbell_z>=85.5 341  108 B (0 0.68 0.16 0.012 0.14)  
##                       1670) pitch_forearm>=-1.45 299   66 B (0 0.78 0.15 0.013 0.06)  
##                         3340) yaw_arm>=-75.4 283   50 B (0 0.82 0.16 0.014 0.0071)  
##                           6680) yaw_belt< 165.5 218   19 B (0 0.91 0.06 0.018 0.0092) *
##                           6681) yaw_belt>=165.5 65   31 B (0 0.52 0.48 0 0)  
##                            13362) roll_arm>=25.3 31    0 B (0 1 0 0 0) *
##                            13363) roll_arm< 25.3 34    3 C (0 0.088 0.91 0 0) *
##                         3341) yaw_arm< -75.4 16    0 E (0 0 0 0 1) *
##                       1671) pitch_forearm< -1.45 42   12 E (0 0 0.29 0 0.71)  
##                         3342) roll_forearm< 139.5 12    0 C (0 0 1 0 0) *
##                         3343) roll_forearm>=139.5 30    0 E (0 0 0 0 1) *
##                  209) gyros_belt_z>=0.075 71    1 E (0 0.014 0 0 0.99) *
##                105) gyros_belt_z< -0.29 75    0 E (0 0 0 0 1) *
##               53) magnet_belt_z>=-292.5 167   23 E (0 0.11 0.024 0.006 0.86) *
##             27) total_accel_dumbbell< 5.5 581  206 D (0 0.13 0.0034 0.65 0.22)  
##               54) pitch_belt>=13.2 462   87 D (0 0.17 0.0043 0.81 0.017)  
##                108) yaw_belt< -2.825 73    0 B (0 1 0 0 0) *
##                109) yaw_belt>=-2.825 389   14 D (0 0.01 0.0051 0.96 0.021) *
##               55) pitch_belt< 13.2 119    0 E (0 0 0 0 1) *
##          7) magnet_belt_y< 557.5 1045  200 E (0.00096 0.012 0.0029 0.18 0.81)  
##           14) magnet_dumbbell_z>=146.5 232   64 D (0 0.056 0.013 0.72 0.21)  
##             28) magnet_belt_z< -444.5 170    3 D (0 0.018 0 0.98 0) *
##             29) magnet_belt_z>=-444.5 62   14 E (0 0.16 0.048 0.016 0.77) *
##           15) magnet_dumbbell_z< 146.5 813   16 E (0.0012 0 0 0.018 0.98)  
##             30) yaw_dumbbell< -122.7058 13    1 D (0 0 0 0.92 0.077) *
##             31) yaw_dumbbell>=-122.7058 800    4 E (0.0013 0 0 0.0038 0.99) *

Test on 30% reserved testing set

Frequency histogram of the predicted classes (A, B, etc.) for the testing set.

  rpartPred_train <- predict(ecurl,training)
  rpartPred <- predict(ecurl,testing)
  #rpartPred <- predict(ecurl,testing,type="class")
  #rpartPred  # too big
  #str(rpartPred)   

  plot(rpartPred)
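
The same class frequencies can also be read as numbers rather than from the plot. The sketch below is only an illustration: `pred_counts` is a hypothetical name, and its values are the row totals of the test-set confusion matrix reported later in this report, not a fresh run of the model.

```r
# Predicted-class counts for the 5885 test samples, taken from the row
# totals of the test-set confusion matrix in this report.
# `pred_counts` is a hypothetical name used only for this sketch.
pred_counts <- c(A = 1703, B = 1065, C = 1089, D = 926, E = 1102)

# Share of the test set assigned to each class.
pred_share <- pred_counts / sum(pred_counts)
print(round(pred_share, 4))

# barplot(pred_counts) would draw the bar chart that plot() produces
# when it is given a factor of predicted classes.
```

With the real objects, `table(rpartPred)` should give the same counts directly, assuming `rpartPred` is a factor of predicted classes.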

Results: Calculate accuracy (and out-of-sample error) of the test group by hand.

Out of 5885 test samples, the number of misclassified samples is only

sum(rpartPred != testing$classe)
## [1] 653
# df <- data.frame(testing$classe, rpartPred)

giving an accuracy of 0.8890399, i.e. an out-of-sample error of 0.1109601. But this is just a “by hand” estimate.
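
The by-hand figure follows directly from the two counts quoted above. A minimal sketch of that arithmetic, using the hypothetical names `n_test` and `n_wrong` for the counts:

```r
# Counts quoted in the text above; the variable names are hypothetical.
n_test  <- 5885   # samples in the 30% reserved testing set
n_wrong <- 653    # sum(rpartPred != testing$classe)

error_rate <- n_wrong / n_test   # out-of-sample error estimate
accuracy   <- 1 - error_rate     # fraction classified correctly

print(round(c(accuracy = accuracy, error_rate = error_rate), 7))
```

With the real objects, `mean(rpartPred == testing$classe)` should give the accuracy in one step.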

Results: Confusion Matrix and Accuracy

Confusion matrices were computed for both the testing and training sets. The one shown is for the testing set.

  confusionMatrix(rpartPred, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1595   62    6   19   21
##          B   41  937   34   25   28
##          C    5   62  919   57   46
##          D   15   30   33  821   27
##          E   18   48   34   42  960
## 
## Overall Statistics
##                                          
##                Accuracy : 0.889          
##                  95% CI : (0.8807, 0.897)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8596         
##  Mcnemar's Test P-Value : 0.0007313      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9528   0.8227   0.8957   0.8517   0.8872
## Specificity            0.9744   0.9730   0.9650   0.9787   0.9704
## Pos Pred Value         0.9366   0.8798   0.8439   0.8866   0.8711
## Neg Pred Value         0.9811   0.9581   0.9777   0.9712   0.9745
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2710   0.1592   0.1562   0.1395   0.1631
## Detection Prevalence   0.2894   0.1810   0.1850   0.1573   0.1873
## Balanced Accuracy      0.9636   0.8978   0.9304   0.9152   0.9288

The off-diagonal elements of the confusion matrix show how many samples were misidentified; the diagonal elements show how many were correctly identified. The adjoining “Overall Statistics” also report the accuracy (0.8890399), kappa (0.859579) and the McNemar's test p-value (\(7.3133007 \times 10^{-4}\)) for the test set. In contrast, the training set has a “better” accuracy (0.9071122), kappa (0.8824533) and a p-value (\(2.4405642 \times 10^{-13}\)).
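
To make the link between the table and the summary statistics concrete, the sketch below recomputes accuracy, Cohen's kappa and per-class sensitivity from the test-set counts printed above. It uses base R only and is an illustration of the standard formulas, not a description of caret's internals; `cm`, `expected` and `kappa_hat` are hypothetical names.

```r
# Test-set confusion matrix copied from the confusionMatrix() output above
# (rows = predicted class, columns = reference class).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(1595,  62,   6,  19,  21,
                 41, 937,  34,  25,  28,
                  5,  62, 919,  57,  46,
                 15,  30,  33, 821,  27,
                 18,  48,  34,  42, 960),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)

# Accuracy: correctly classified samples (diagonal) over all samples.
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled to at most 1.
kappa_hat <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: correct predictions over reference totals.
sensitivity <- diag(cm) / colSums(cm)

print(round(c(accuracy = accuracy, kappa = kappa_hat), 4))
print(round(sensitivity, 4))
```

The off-diagonal total, `n - sum(diag(cm))`, is 653, the same misclassification count obtained by hand earlier.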

The spread between the training and test accuracies is important for judging whether the model is over-fitting. Here the gap is small (about 0.018).

Use 20 test cases file (from pml-testing.csv):

When this markdown file is run repeatedly, 80-90% of the predicted values stay the same, consistent with the accuracy of the model used. For a fixed tuneLength, there is a fair amount of jitter in the predictions due to different random initializations within tune(), most of it confined to certain test cases. This table is not included in the R Markdown output.

   rpartPred_20 <- predict(ecurl,test_noNA)
   #rpartPred_20   # Don't show prediction on the web
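
If the run-to-run variation in these predictions comes from random resampling during tuning, as described above, fixing the random seed before the training call is one common way to make reruns reproducible. The sketch below only demonstrates the mechanism in base R: `draw_folds` is a hypothetical stand-in for a randomized training step, and the seed value is arbitrary.

```r
# Hypothetical stand-in for a training step that relies on random
# resampling: assigns n samples to k folds in random order.
draw_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

set.seed(2015)                  # arbitrary seed, chosen for illustration
folds_a <- draw_folds(20, 5)

set.seed(2015)                  # same seed again
folds_b <- draw_folds(20, 5)

# Resetting the seed replays the same random draws.
print(identical(folds_a, folds_b))
```

In the report itself, the equivalent step would be a single `set.seed()` call placed just before the model is trained.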

Conclusions

The classification tree model that we built has an accuracy of 89%, which is a moderate success for a first attempt. It is not as good as the 98% overall accuracy that the HAR group achieved with the same data (cite 3).

References

  1. http://static1.squarespace.com/static/51156277e4b0b8b2ffe11c00/t/53ad86e5e4b0b52e4e71cfab/1403881189332/Applied_Predictive_Modeling_in_R.pdf

  2. https://cran.r-project.org/web/packages/caret/vignettes/caret.pdf

  3. Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
