An S3 class that allows different flavors of missing values in numeric vectors.
One can divide measures into two groups: qualitative and quantitative. However, record formats often mix the two. Some of the values are simply interpreted as is: a 2 is a 2. Some of the values are codes which represent qualities instead of numbers: an 8 means the measure's not applicable. These are sometimes called "sentinel values." And, of course, some values are just plain missing.
When handling these data in R, a common idiom is to split the column in twain: a numeric vector for the quantitative and a factor for the qualitative. This is the simplest solution and will often work fine. But it does something risky: it separates linked data. The user must remember to keep them together, and usually does this with clever variable or column names.
Clever is bad. Code with my_data[, paste0(vars, c("_num", "_flag"))] is hard to read. Code with get is hard to follow.
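The split-column idiom described above can be sketched in base R. This is only an illustration under assumed names: `raw`, `codes`, `code_labels`, `mass_num`, and `mass_flag` are hypothetical and do not come from any package.

```r
# Hypothetical raw column: 98 and 99 are sentinel codes, NA is plain missing.
raw <- c(10, 20, 98, 99, NA)
codes <- c(98, 99)
code_labels <- c("refused", "not recorded")

# Quantitative half: sentinel codes become NA, real values pass through.
mass_num <- ifelse(raw %in% codes, NA_real_, raw)

# Qualitative half: a factor holding the reason; NA wherever no code applies.
mass_flag <- factor(raw, levels = codes, labels = code_labels)

# The two halves are linked only by position and by naming convention.
my_data <- data.frame(mass_num, mass_flag)

# Subsetting one half leaves the other behind unless the user remembers it.
kept <- my_data$mass_num[1:3]
```

Nothing in `kept` records why its third element is missing; that information stays in `mass_flag`, which has to be subset separately.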
The sentinel package offers the sentineled class, which bundles numeric values and categorized missing values into a single object.
library(sentinel)
x <- sentineled(
  c(10, 20, 98, 99, NA),
  sentinels = c(98, 99),
  labels = c("refused", "not recorded")
)
x
## [1] 10 20 <refused> <not recorded>
## [5] NA
## sentinel values: "" "refused" "not recorded"
The numbers are numbers, the categories are categorical, and the unknowns are just unknown.
A sentineled object is a vector. When subset, it remains a sentineled object with the same possible sentinel values.
x[1]
## [1] 10
## sentinel values: "" "refused" "not recorded"
x[1:2]
## [1] 10 20
## sentinel values: "" "refused" "not recorded"
x[[3]]
## [1] <refused>
## sentinel values: "" "refused" "not recorded"
x[x < 15]
## [1] 10 <refused> <not recorded> NA
## sentinel values: "" "refused" "not recorded"
A sentineled vector can be used in arithmetic, with all non-missing values acting like normal numeric values. If possible, a sentineled object with the appropriate sentinel values will be the result.
mean(x, na.rm = TRUE)
## [1] 15
x / 100
## [1] 0.1 0.2 <refused> <not recorded>
## [5] NA
## sentinel values: "" "refused" "not recorded"
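For contrast, here is a base R sketch of the same arithmetic on a plain numeric vector that still contains its codes. It does not use the sentinel package; `raw`, `codes`, `naive_mean`, and `masked_mean` are hypothetical names chosen for illustration.

```r
# Hypothetical raw column in which 98 and 99 are codes, not measurements.
raw <- c(10, 20, 98, 99, NA)
codes <- c(98, 99)

# Naive mean: the codes are averaged as if they were real values.
naive_mean <- mean(raw, na.rm = TRUE)

# Masked mean: codes are set to NA first, so only 10 and 20 contribute.
masked <- replace(raw, raw %in% codes, NA)
masked_mean <- mean(masked, na.rm = TRUE)
```

Here `naive_mean` is 56.75 while `masked_mean` is 15, the same value as `mean(x, na.rm = TRUE)` above. A sentineled vector treats its codes as missing in the same way, while keeping the labels attached.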
A sentineled vector can even be a column in a data.frame.
data.frame(
  element = c("argon", "boron", "chlorine"),
  mass = sentineled(c(3, "x", 8), "x", "scale malfunction")
)
##    element                mass
## 1    argon                   3
## 2    boron <scale malfunction>
## 3 chlorine                   8
The sentinel codes are treated as missing, but the different categories of missing are stored as a factor vector in the "sentinels" attribute of the object. Use the sentinels function to access them.
sentinels(x)
## [1] refused not recorded <NA>
## Levels: refused not recorded
x[sentinels(x) != "refused"]
## [1] 10 20 <not recorded> NA
## sentinel values: "" "refused" "not recorded"
Notice that, for the non-missing values in x, their respective sentinel codes are blanks ("").
as.character(sentinels(x))
## [1] "" "" "refused" "not recorded"
## [5] NA
It's recommended to use explanatory sentinel levels for all expected types of missing. That way, if a value is shown as just plain NA, it's a sign something went wrong in the analysis.
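One way to act on this recommendation is to look for entries with no sentinel level at all. The sketch below runs in base R without the package: `s` is a hand-built stand-in for the factor that `sentinels(x)` is shown returning above, and `unexplained` and `reason_counts` are hypothetical names.

```r
# Stand-in for the factor that sentinels(x) is shown returning above;
# built by hand so the sketch runs without the package.
s <- factor(
  c("", "", "refused", "not recorded", NA),
  levels = c("", "refused", "not recorded")
)

# Positions with no level at all: neither a real value ("") nor a labeled reason.
unexplained <- which(is.na(s))

# Counts per level, with unlabeled entries reported under <NA>.
reason_counts <- table(s, useNA = "ifany")
```

With this stand-in, `unexplained` picks out the fifth element, the plain NA. In a real analysis, a non-empty result would be the cue to check where the unlabeled missing values came from.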