“Your hands can’t hit what your eyes can’t see.” (Muhammad Ali)
In real-world data, there is often so much to be re-arranged, re-formatted and supplemented that it’s easy to overlook what isn’t there. I wrote about this recently, but decided it could be improved.
Some real world data, masked and anonymous
To illustrate some of the ways that a dataset can snooker you, I’ll be using a tibble with 22,471 rows and 38 columns. That’s 853,898 slots for values, far beyond eyeballing’s reach.
This observational data is to be used for a logistic model with VAR30 and VAR31 as the alternative dependent variables. The continuous variables are VAR1 (simply a record id), VAR2 and VAR7. All others are dichotomous, coded 1 for present and 0 for absent. VAR14-VAR29 are the principal independent variables of interest.
Here is the data layout.
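The layout below comes from str(). A minimal sketch of the call, assuming the tibble is named mydata (the name used later in the post):

str(mydata)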
## tibble [22,471 × 38] (S3: tbl_df/tbl/data.frame)
## $ VAR1 : int [1:22471] 420893 303936 191488 602555 319182 975171 327170 820841 385483 996417 ...
## $ VAR2 : num [1:22471] 36 22 17 39 64 54 46 45 44 16 ...
## $ VAR3 : num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR4 : num [1:22471] 0 0 1 0 0 0 0 0 0 1 ...
## $ VAR5 : num [1:22471] 1 1 0 1 1 1 1 1 1 0 ...
## $ VAR6 : num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR7 : num [1:22471] 2008 2008 2008 2008 2008 ...
## $ VAR8 : num [1:22471] 1 0 0 0 0 0 0 0 0 0 ...
## $ VAR9 : num [1:22471] 0 1 1 1 1 1 1 1 1 1 ...
## $ VAR10: num [1:22471] 1 0 1 1 0 1 0 1 1 1 ...
## $ VAR11: num [1:22471] 1 0 1 1 1 1 1 1 1 1 ...
## $ VAR12: num [1:22471] 0 1 0 0 0 0 0 0 0 0 ...
## $ VAR13: num [1:22471] 1 1 1 1 1 1 1 1 1 0 ...
## $ VAR14: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR15: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR16: num [1:22471] 0 0 0 0 0 0 1 0 0 1 ...
## $ VAR17: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR18: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR19: num [1:22471] 0 0 0 0 0 0 1 0 0 0 ...
## $ VAR20: num [1:22471] 0 0 0 0 0 0 0 0 0 1 ...
## $ VAR21: num [1:22471] 0 0 0 0 1 1 0 0 0 0 ...
## $ VAR22: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR23: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR24: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR25: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR26: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR27: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR28: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR29: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR30: num [1:22471] 1 0 0 1 1 1 1 0 1 1 ...
## $ VAR31: num [1:22471] 0 1 1 0 0 0 0 1 0 0 ...
## $ VAR32: num [1:22471] 1 0 1 1 0 0 0 0 0 0 ...
## $ VAR33: num [1:22471] 0 1 0 0 1 1 1 1 1 1 ...
## $ VAR34: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR35: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR36: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR37: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
## $ VAR38: num [1:22471] 0 0 0 0 0 0 0 0 0 0 ...
What’s missing?
The not applicables
To see if a data frame or tibble contains any missing values, the handy function is anyNA(). For our data, the result is TRUE. A quick way to identify how many missing values there are is CountCompCases{DescTools}. (There is also complete.cases{base}, but it returns a logical vector rather than a summary.)
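A minimal sketch of those calls, again assuming the tibble is named mydata:

library(DescTools)
anyNA(mydata)          # TRUE if any value anywhere in the tibble is NA
CountCompCases(mydata) # per-variable NA counts plus a complete-case summary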
##
## Total rows: 22471
## Complete Cases: 22052 (98.1%)
##
## vname nas nas_p cifnot cifnot_p
## 1 VAR1 0 0.0% 22052 98.1%
## 2 VAR2 0 0.0% 22052 98.1%
## 3 VAR3 0 0.0% 22052 98.1%
## 4 VAR4 0 0.0% 22052 98.1%
## 5 VAR5 0 0.0% 22052 98.1%
## 6 VAR6 0 0.0% 22052 98.1%
## 7 VAR7 0 0.0% 22052 98.1%
## 8 VAR8 0 0.0% 22052 98.1%
## 9 VAR9 0 0.0% 22052 98.1%
## 10 VAR10 0 0.0% 22052 98.1%
## 11 VAR11 0 0.0% 22052 98.1%
## 12 VAR12 0 0.0% 22052 98.1%
## 13 VAR13 0 0.0% 22052 98.1%
## 14 VAR14 0 0.0% 22052 98.1%
## 15 VAR15 0 0.0% 22052 98.1%
## 16 VAR16 0 0.0% 22052 98.1%
## 17 VAR17 0 0.0% 22052 98.1%
## 18 VAR18 0 0.0% 22052 98.1%
## 19 VAR19 0 0.0% 22052 98.1%
## 20 VAR20 0 0.0% 22052 98.1%
## 21 VAR21 0 0.0% 22052 98.1%
## 22 VAR22 0 0.0% 22052 98.1%
## 23 VAR23 0 0.0% 22052 98.1%
## 24 VAR24 0 0.0% 22052 98.1%
## 25 VAR25 0 0.0% 22052 98.1%
## 26 VAR26 0 0.0% 22052 98.1%
## 27 VAR27 0 0.0% 22052 98.1%
## 28 VAR28 0 0.0% 22052 98.1%
## 29 VAR29 0 0.0% 22052 98.1%
## 30 VAR30 0 0.0% 22052 98.1%
## 31 VAR31 0 0.0% 22052 98.1%
## 32 VAR32 419 1.9% 22052 98.1%
## 33 VAR33 419 1.9% 22052 98.1%
## 34 VAR34 419 1.9% 22052 98.1%
## 35 VAR35 419 1.9% 22052 98.1%
## 36 VAR36 419 1.9% 22052 98.1%
## 37 VAR37 419 1.9% 22052 98.1%
## 38 VAR38 0 0.0% 22052 98.1%
We see that 419 observations, about 1.9% of the total, have NA values, all of them in VAR32 through VAR37.
Why do we care?
NAs are squirrely when it comes to computation.1
- NA+1 = NA
- NA-1 = NA
- NA*1 = NA
- NA/1 = NA
- NA^1 = NA
- NA^0 = 1
- NA**2 = NA
More likely than not, leaving them in will propagate NAs through your results or trip up downstream modeling functions.
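A quick illustration of that propagation, using base R alone:

x <- c(1, 2, NA)
sum(x)               # NA
sum(x, na.rm = TRUE) # 3
mean(x)              # NA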
What can we do?
We have two alternatives: remove the rows with NAs or impute values. Because the NAs are only 1.9% of the data, I’m going to avoid all the skull sweat needed for good imputation (substituting some plausible value, such as a summary statistic, for each NA). An additional consideration is that the variables with the NAs are a bit dodgy to begin with, for reasons not germane to the subject. They also turned out not to contribute to the modeling.
How do we remove the NAs?
The dplyr filter() function takes a logical condition, and !is.na() provides one.
mydata_no_na <- mydata %>% filter(!is.na(VAR32))
A little sleight of hand was involved. I guessed that all of the variables with NAs would be missing together in the same observations, so that by filtering on one, I’d hit them all. Sure enough, anyNA() on the result confirmed my guess. If it hadn’t, of course, I would have needed to either run multiple filters or expand the condition to
mydata_no_na <- mydata %>% filter(!is.na(VAR32) & !is.na(VAR33) ...)
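For more than a couple of columns, a column-agnostic condition saves typing. A sketch, assuming dplyr 1.0 or later (and tidyr, if you want to drop rows with an NA anywhere):

# drop rows with an NA in any column
mydata_no_na <- mydata %>% tidyr::drop_na()
# or restrict the check to the columns known to carry NAs
mydata_no_na <- mydata %>% dplyr::filter(dplyr::if_all(VAR32:VAR37, ~ !is.na(.x)))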
But wait!
Independent variables of interest are all zero-valued for some observations
These data were assembled for a purpose – to determine the influence of a group of 16 variables (different types of medical trauma) on 2 variables considered separately (different types of internal injuries to the same organ). Observations in which none of the 16 variables were coded as present (1) shed no light on the problem.
How do we identify them?
Recall that each of the 16 variables is either valued as 1 or 0. If the sum of the variables for an observation is zero, we don’t need the observation for the main purposes of the model project. There are several possible ways to sum variables rowwise. I chose a little linear algebra jujitsu.
Linear algebra!!! Relax, I don’t know much about this subject either. But I know a piece that’s simple and well suited to this problem.
Starting with mydata_no_na, the first step is to create a character vector naming the 16 variables
interest_vars <- c("VAR14", "VAR15", "VAR16", "VAR17", "VAR18", "VAR19", "VAR20", "VAR21", "VAR22", "VAR23", "VAR24", "VAR25", "VAR26", "VAR27", "VAR28", "VAR29")
and then use it to subset the NA-free data
mydata_vars <- mydata_no_na %>% dplyr::select(all_of(interest_vars))
Here comes the trick: It’s easy to convert a tibble into a matrix.
m <- as.matrix(mydata_vars)
Why do we want a matrix? A matrix is an object that you can perform row operations on easily.
rowtots <- rowSums(m)
We now have the sum of the 16 variables for each observation.
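A toy example shows the idea:

toy <- matrix(c(0, 1, 0,
                1, 1, 0), nrow = 2, byrow = TRUE)
rowSums(toy) # returns 1 2 -- one total per row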
The return trip is equally easy.
vcode_count <- enframe(rowtots) %>% dplyr::select(value) %>% dplyr::rename(V_COUNT = value)
mydata_with_row_count <- as_tibble(cbind(mydata_no_na, vcode_count))
mydata_singlets <- mydata_with_row_count %>% filter(V_COUNT == 1)
In short, we’ve added a row-total field, V_COUNT, and filtered for sums of exactly 1.
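For comparison, the same row total can be added without leaving the tibble. A sketch using dplyr’s across() (dplyr 1.0 or later assumed):

mydata_with_row_count <- mydata_no_na %>%
  dplyr::mutate(V_COUNT = rowSums(dplyr::across(dplyr::all_of(interest_vars))))
mydata_singlets <- mydata_with_row_count %>% dplyr::filter(V_COUNT == 1)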
The results are dramatic. We started with 22,471 observations, removed the 419 with NAs, and then isolated the observations with exactly one of the 16 variables present, ending up with 9,114 observations.
Even more
We’ve dealt with the NAs and kept the observations with exactly one occurrence, setting aside those with zero or with more than one. The set-aside observations with more than one variable of interest present deserve a closer look, and they are easy to identify
mydata_multi <- mydata_with_row_count %>% filter(V_COUNT > 1)
and there are 2,954 observations. These are candidates for a separate model because there are potentially multiple interacting independent variables that influence the dependent variables. (Or they could represent inconsistencies in data collection.)
Flash cards
- Don’t eyeball
- Use anyNA()
- Check for “useless” observations
- Check for observations needing different treatment
“Numerical computations using NA will normally result in NA: a possible exception is where NaN is also involved, in which case either might result (which may depend on the R platform). Logical computations treat NA as a missing TRUE/FALSE value, and so may return TRUE or FALSE if the expression does not depend on the NA operand.” NA{base}