Blair’s Science Desk

19. Dealing with missing values in R

· Blair Fix

Real-world datasets often have missing values. Here are some tips for dealing with them in R.

The NA

In spreadsheets, missing data is typically represented by a blank cell. In R, missing data is given the symbol NA.

As an example, let’s make a vector called x and fill it with some numbers and some missing values (NA’s):

x = c( 1, NA, 5, NA, 10 )

Now let’s try to work with this data.

Stats functions don’t like NA’s

Now that we’ve got x, let’s do some simple statistics. Let’s calculate the mean:

mean(x)

R tells me that the mean is undefined:

> mean(x)
[1] NA

What’s going on? Well, by default, R will try to include the missing values in the calculation. Since missing values are undefined, the resulting mean is undefined.

We can fix the problem by telling R to remove the NA’s from the calculation. We do that by setting the na.rm option to TRUE:

mean(x, na.rm = TRUE)

(Note: if you prefer to be less verbose, you can substitute T for TRUE, as in na.rm = T.)

The na.rm option tells R to remove missing values when it calculates the mean. Now we get back a sensible result:

> mean(x, na.rm = TRUE)
[1] 5.333333

Other stats functions like sd (standard deviation), median, and sum work the same way. If you want them to ignore missing values, you have to use na.rm.
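
For instance, here’s what the same option looks like with those functions, applied to our vector x:

sd(x, na.rm = TRUE)      # standard deviation, ignoring NA’s
median(x, na.rm = TRUE)  # median, ignoring NA’s
sum(x, na.rm = TRUE)     # sum, ignoring NA’s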

(Side note: na.rm is what we call a ‘default’ option. If you don’t specify it, R will assume you want na.rm = FALSE.)

summary handles NA’s by default

Interestingly, the summary function works differently. By default, it will remove missing values when it calculates summary statistics. Then, it will report how many NA’s are in the data.

Let’s run summary on x:

summary(x)

I get back:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  1.000   3.000   5.000   5.333   7.500  10.000       2

The last column tells us the number of missing values. As expected, there are 2.

Removing missing values from your data

When you pass the na.rm option to stats functions, your underlying data remains unchanged. Sometimes that’s what you want. Other times, you want to omit the NA’s from your source data.
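
To see the difference, print x after running mean(x, na.rm = TRUE). The NA’s are still there:

> x
[1]  1 NA  5 NA 10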

In R, there are a few ways to do that. The most universal is a function called na.omit. Let’s run it on x:

na.omit(x)

Here’s what I get back:

[1]  1  5 10
attr(,"na.action")
[1] 2 4
attr(,"class")
[1] "omit"

That’s weird output.

What’s going on here is that na.omit is returning two things:

  1. a vector with NA’s removed from x
  2. metadata containing the indexes of the NA’s that were removed

I personally find this output an odd choice on the part of R developers. When I use a function called na.omit, I expect it will return my data with the NA’s omitted. Interestingly, when we pass a data frame to na.omit, that’s exactly what happens. But for some reason, the function works differently on vectors.

At any rate, we can get rid of the unwanted metadata by putting the as.vector function around na.omit. That tells R to convert the output to a plain vector:

as.vector( na.omit(x) )

The output is what we expect:

> as.vector( na.omit(x) )
[1]  1  5 10

Remove NA’s using is.na

Because na.omit has unexpected behavior when applied to a vector, you’ll often see R programmers use the following code to remove NA’s:

x[ ! is.na(x) ]

The code is simple enough to copy and paste. But what does it mean?

To understand the code, we’ll start with the is.na function. Let’s run it on x:

is.na(x)

I get back:

[1] FALSE  TRUE FALSE  TRUE FALSE

The purpose of is.na is to test if the elements of a vector are NA. If is.na finds a missing value, it returns TRUE. If not, the function returns FALSE. The result is a vector of true/false values that correspond to the NA status of elements in x.
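
(As an aside, this gives us a quick way to count missing values: since TRUE counts as 1 and FALSE counts as 0, summing the output of is.na tells us how many NA’s there are.)

> sum( is.na(x) )
[1] 2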

Now, the interesting thing about R is that you can use true/false values to subset a vector. For example, suppose I have a vector z that contains the elements 1 and 2:

z = c(1, 2)

Now let’s make a logical vector that contains the values FALSE and TRUE:

index = c(FALSE, TRUE)

In R, we can pass these true/false indexes to z using the bracket operator []:

z[index]

What happens is that R will return the elements of z for which index is TRUE.

> z[index]
[1] 2

The code x[ ! is.na(x) ] does something similar, but with more nested pieces.

Tracking the nested logic

To keep track of what’s going on with nested code, I find it helpful to work in steps. To break down our NA-removal code, let’s define a variable called index that will store the true/false output of is.na:

index = is.na(x)

Now we can tell R to subset x according to index:

x[index]

We’ll get back the values of x for which is.na(x) is TRUE. In other words, we’ll get back our NA’s:

> x[index]
[1] NA NA

That’s fun, but not what we want. Our goal is to remove missing values, not keep them. That’s where the ! comes in. In R, the ! is the logical ‘not’ operator.

In the case of true/false data, the ! operator reverses the values, converting TRUE to FALSE, and vice versa.

Back to our index vector. To review, index contained:

> index
[1] FALSE  TRUE FALSE  TRUE FALSE

If we tell R to return !index, our logical values get flipped:

> !index
[1]  TRUE FALSE  TRUE FALSE  TRUE

Now, if we pass !index back to x, we’ll get all of the values that are not NA’s:

x[ !index ]

The results are what we want — our vector x with the NA’s removed:

> x[ !index ]
[1]  1  5 10

Seasoned R programmers compress this thinking into one line of code: find the elements that are not NA, and subset x to keep those values. Translated into code, we get:

# long version
index = is.na(x)
x[ !index ]

# nested version
x[ ! is.na(x) ]
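
If you use this pattern a lot, one option is to wrap it in a small helper function. Here’s a sketch (remove_na is just a name I’ve made up, not a built-in R function):

# a reusable function that drops NA’s from any vector
remove_na = function(v) {
  v[ !is.na(v) ]
}

> remove_na(x)
[1]  1  5 10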

na.omit on data frames

Unlike on vectors, na.omit behaves as expected on data frames. It returns your data frame with the missing values removed.

To use na.omit, let’s first make an example dataset called d:

x = c(1, NA, 3)
y = c(2, NA, 5)

d = data.frame(x, y)

Here’s what’s in d:

   x  y
1  1  2
2 NA NA
3  3  5

Now let’s use na.omit to get rid of the missing values:

na.omit(d)

It works like a charm. We get:

> na.omit(d)
  x y
1 1 2
3 3 5

Be careful when using na.omit on sparse data

Be warned that if your dataset is ‘sparse’ — meaning there are lots of NAs sprinkled in different places — na.omit can have unintended consequences.

To see what can happen, let’s make another data frame:

x = c(NA, 1, 3)
y = c(2, NA, 5)
z = c(4, 2, NA)

d = data.frame(x, y, z)

Here’s what’s in d:

   x  y  z
1 NA  2  4
2  1 NA  2
3  3  5 NA

Notice that the missing values are not in one row. Instead, they’re sprinkled in different rows. When we use na.omit on this data, we have a problem. We get back an empty data frame:

> na.omit(d)
[1] x y z
<0 rows> (or 0-length row.names)

Why did this happen?

The problem is that na.omit works row by row. If it finds a missing value anywhere in a row, it removes the whole row. The effect is to throw away values in other columns that were not missing.

Often, that’s not what you want to do. For example, suppose that in our dataset, we’re only interested in x and y. In that case, we don’t care if an element of z is missing. So when we run na.omit on the whole dataset, we’re losing data for x and y.

How do we fix the problem?

Usually what I do is make a new data frame that keeps only the data that I want. Let’s define d_keep to hold the x column (d$x) and the y column (d$y) from d:

d_keep = data.frame( x = d$x, y = d$y )

Here’s what’s in d_keep:

> d_keep
   x  y
1 NA  2
2  1 NA
3  3  5

Now we can run na.omit and keep rows with values for both x and y:

> na.omit(d_keep)
  x y
3 3 5
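
Another option builds on the is.na trick from earlier: keep all three columns, but drop only the rows where x or y is missing. (Here the & operator means ‘and’: a row is kept only when both conditions are TRUE. Note that z’s own NA stays in place.)

> d[ !is.na(d$x) & !is.na(d$y), ]
  x y  z
3 3 5 NA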

Clean, but not too clean

Removing missing values is a basic part of ‘cleaning’ a dataset. The strategy is to clean your data just the right amount — enough so the data is easy to work with, but not so much that you lose data that you want.

Whenever you run na.omit, it’s good practice to check that you haven’t unintentionally lost data. (I’ve learned this ‘good practice’ the hard way.)
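
For example, one quick sanity check is to compare the number of rows before and after cleaning:

nrow(d)           # rows before removing NA’s
nrow(na.omit(d))  # rows after removing NA’s

If the second number is much smaller than the first, it’s worth taking a closer look before moving on.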