19. Dealing with missing values in R
Real-world datasets often have missing values. Here are some tips for dealing with them in R.
The NA
In spreadsheets, missing data is typically represented by a blank cell. In R, missing data is given the symbol NA.
For an example, let’s make a vector called x and fill it with some numbers and some missing values (NA’s):
x = c( 1, NA, 5, NA, 10 )
Now let’s try to work with this data.
Stats functions don’t like NA’s
Now that we’ve got x, let’s do some simple statistics. Let’s calculate the mean:
mean(x)
R tells me that the mean is undefined:
> mean(x)
[1] NA
What’s going on? Well, by default, R will try to include the missing values in the calculation. Since missing values are undefined, the resulting mean is undefined.
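You can see the same behavior with ordinary arithmetic. Any calculation that involves an NA comes out as NA:
NA + 1    # NA
sum(x)    # NA, because x contains NA's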
We can fix the problem by telling R to remove the NA’s from the calculation. We do that by setting the na.rm option to TRUE:
mean(x, na.rm = TRUE)
(Note: if you prefer to be less verbose, you can substitute T for TRUE, as in na.rm = T.)
The na.rm option tells R to remove missing values when it calculates the mean. Now we get back a sensible result:
> mean(x, na.rm = TRUE)
[1] 5.333333
Other stats functions like sd (standard deviation), median, and sum work the same way. If you want them to ignore missing values, you have to use na.rm.
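For example, using the same x as above:
sd(x, na.rm = TRUE)      # standard deviation of 1, 5, 10 (about 4.51)
median(x, na.rm = TRUE)  # 5
sum(x, na.rm = TRUE)     # 16
Without na.rm = TRUE, each of these calls would return NA.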
(Side note: na.rm is an option with a default value. If you don’t specify it, R will assume you want na.rm = FALSE.)
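In other words, these two calls do the same thing, and both return NA:
mean(x)                  # uses the default, na.rm = FALSE
mean(x, na.rm = FALSE)   # spelled out explicitly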
summary handles NA’s by default
Interestingly, the summary function works differently. By default, it will remove missing values when it calculates summary statistics. Then, it will report how many NA’s are in the data.
Let’s run summary on x:
summary(x)
I get back:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 3.000 5.000 5.333 7.500 10.000 2
The last column tells us the number of missing values. As expected, there are 2.
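If all you want is the NA count, a common idiom is to combine sum with the is.na function (covered in more detail below):
sum( is.na(x) )   # TRUE counts as 1, so this returns 2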
Removing missing values from your data
When you pass the na.rm option to stats functions, your underlying data remains unchanged. Sometimes that’s what you want. Other times, you want to omit the NA’s from your source data.
In R, there are a few ways to do that. The most universal is a function called na.omit. Let’s run it on x:
na.omit(x)
Here’s what I get back:
[1] 1 5 10
attr(,"na.action")
[1] 2 4
attr(,"class")
[1] "omit"
That’s weird output.
What’s going on here is that na.omit is returning two things:
- a vector with NA’s removed from x
- metadata containing the indexes of the NA’s that were removed
I personally find this output an odd choice on the part of the R developers. When I use a function called na.omit, I expect it to return my data with the NA’s omitted. Interestingly, when we pass a data frame to na.omit, that’s exactly what happens. But for some reason, the function works differently on vectors.
At any rate, we can get rid of the unwanted metadata by putting the as.vector function around na.omit. That tells R to convert the output to a plain vector:
as.vector( na.omit(x) )
The output is what we expect:
> as.vector( na.omit(x) )
[1] 1 5 10
Remove NA’s using is.na
Because na.omit has unexpected behavior when applied to a vector, you’ll often see R programmers use the following code to remove NA’s:
x[ ! is.na(x) ]
The code is simple enough to copy and paste. But what does it mean?
To understand the code, we’ll start with the is.na function. Let’s run it on x:
is.na(x)
I get back:
[1] FALSE TRUE FALSE TRUE FALSE
The purpose of is.na is to test if the elements of a vector are NA. If is.na finds a missing value, it returns TRUE. If not, the function returns FALSE. The result is a vector of true/false values that correspond to the NA status of the elements in x.
Now, the interesting thing about R is that you can use true/false values to subset a vector. For example, suppose I have a vector z that contains the elements 1 and 2:
z = c(1, 2)
Now let’s make a logical vector that contains the values FALSE and TRUE:
index = c(FALSE, TRUE)
In R, we can pass these true/false indexes to z using the bracket operator []:
z[index]
What happens is that R will return the elements of z for which index is TRUE.
> z[index]
[1] 2
The code x[ ! is.na(x) ] does something similar, but with more nested pieces.
Tracking the nested logic
To keep track of what’s going on with nested code, I find it helpful to do things in steps. To break down our NA-removal code, let’s define a variable called index that will store the true/false output of is.na:
index = is.na(x)
Now we can tell R to subset x according to index:
x[index]
We’ll get back the values of x for which is.na(x) is TRUE. In other words, we’ll get back our NA’s:
> x[index]
[1] NA NA
That’s fun, but not what we want. Our goal is to remove missing values, not keep them. That’s where the ! comes in. In R, the ! is the logical ’not’ operator.
In the case of true/false data, the ! operator reverses the values, converting TRUE to FALSE, and vice versa.
Back to our index vector. To review, index contained:
> index
[1] FALSE TRUE FALSE TRUE FALSE
If we tell R to return !index, our logical values get flipped:
> !index
[1] TRUE FALSE TRUE FALSE TRUE
Now, if we pass !index back to x, we’ll get all of the values that are not NA’s:
x[ !index ]
The result is what we want, our vector x with the NA’s removed:
> x[ !index ]
[1] 1 5 10
Seasoned R programmers take this thinking and compress it into one line of code. Find the indexes that are not NA, and subset x using those values. Translated into code, we get:
# long version
index = is.na(x)
x[ !index ]
# nested version
x[ ! is.na(x) ]
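Both versions give the same result:
> x[ ! is.na(x) ]
[1] 1 5 10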
na.omit on data frames
Unlike on vectors, na.omit behaves as expected on data frames. It returns your data frame with the missing values removed.
To use na.omit, let’s first make an example dataset called d:
x = c(1, NA, 3)
y = c(2, NA, 5)
d = data.frame(x, y)
Here’s what’s in d:
x y
1 1 2
2 NA NA
3 3 5
Now let’s use na.omit to get rid of the missing values:
na.omit(d)
It works like a charm. We get:
> na.omit(d)
x y
1 1 2
3 3 5
Be careful when using na.omit on sparse data
Be warned that if your dataset is ‘sparse’, meaning there are lots of NA’s sprinkled in different places, na.omit can have unintended consequences.
To see what can happen, let’s make another data frame:
x = c(NA, 1, 3)
y = c(2, NA, 5)
z = c(4, 2, NA)
d = data.frame(x, y, z)
Here’s what’s in d:
x y z
1 NA 2 4
2 1 NA 2
3 3 5 NA
Notice that the missing values are not all in one row. Instead, they’re sprinkled across different rows. When we use na.omit on this data, we have a problem. We get back an empty data frame:
> na.omit(d)
[1] x y z
<0 rows> (or 0-length row.names)
Why did this happen?
The problem is that na.omit works row by row. If it finds a missing value in a row, it removes the whole row. The effect is to remove data that, in other columns, was not missing.
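One way to see this row-by-row behavior is with base R’s complete.cases function (not used elsewhere in this chapter), which returns TRUE only for rows that contain no missing values:
complete.cases(d)
# [1] FALSE FALSE FALSE
Every row of our sparse d has at least one NA, so na.omit drops them all.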
Often, that’s not what you want to do. For example, suppose that in our dataset, we’re only interested in x and y. In that case, we don’t care if an element of z is missing. So when we run na.omit on the whole dataset, we’re losing data for x and y.
How do we fix the problem?
Usually what I do is make a new data frame that keeps only the data that I want. Let’s define d_keep to have the x column (d$x) and the y column (d$y) from d:
d_keep = data.frame( x = d$x, y = d$y )
Here’s what’s in d_keep:
> d_keep
x y
1 NA 2
2 1 NA
3 3 5
Now we can run na.omit and keep rows with values for both x and y:
> na.omit(d_keep)
x y
3 3 5
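Side note, and not part of the approach above: if you’d rather keep the z column and only drop the rows where x or y is missing, you can subset the rows directly with is.na:
d[ !is.na(d$x) & !is.na(d$y), ]
#   x y  z
# 3 3 5 NA
Either way, the idea is the same: decide which columns matter before you start dropping rows.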
Clean, but not too clean
Removing missing values is a basic part of ‘cleaning’ a dataset. The strategy is to clean your data just the right amount — enough so the data is easy to work with, but not so much that you lose data that you want.
Whenever you run na.omit, it’s good practice to check that you haven’t unintentionally lost data. (I’ve learned this ‘good practice’ the hard way.)
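A quick check is to compare row counts before and after (a minimal sketch, assuming d is the data frame you’re cleaning):
nrow(d)              # rows before cleaning
nrow( na.omit(d) )   # rows after; a large drop is a warning sign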