13. R functions that I use all the time
Out of the box, R comes with many functions that make data analysis easy. In this post, I’ll review some of the functions that I use frequently, in whatever order rolls off the top of my head.
To get the ball rolling, we need a vector to work with. Let’s define a vector x that contains some random(ish) numbers:
> x = c(1, 4, -2, 10, 15, -5, -1, 2, 8, 0)
(Remember that the c function is how we combine numbers into vectors.)
Now on to the functions
length
Having filled a vector with some data, often the first thing I want to know is how much data I have. R has a function for that called length. It tells you how many elements are in a vector:
> length(x)
[1] 10
Nice! It looks like I put 10 numbers into x.
sort
When I see numbers that are in no particular order, I have an urge to sort them. R has a sort function for that:
> sort(x)
[1] -5 -2 -1 0 1 2 4 8 10 15
It just feels better seeing things in order!
mean
Now let’s get to some summary statistics. The mean is by far the most popular. R has a function for that called (drum roll) mean:
> mean(x)
[1] 3.2
median
Let’s not leave out our friendly median function, which returns the midpoint of the data:
> median(x)
[1] 1.5
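Hmm … the median is not the same as the mean. That suggests the data (that I made up) is skewed: the mean (3.2) sits above the median (1.5), which usually points to a few large values pulling the mean up. More on skewness sometime in the future.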
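If "midpoint" feels vague, here is a minimal sketch of what I understand median to be doing for our vector. With an even number of values there is no single middle element, so the median is the average of the two middle values of the sorted data. The names x_sort, middle and median_by_hand are just ones I made up for illustration.

```r
x = c(1, 4, -2, 10, 15, -5, -1, 2, 8, 0)

# Put the values in order and count them
x_sort = sort(x)
n = length(x_sort)

# n is even here (10), so take the two middle elements: positions 5 and 6
middle = x_sort[c(n / 2, n / 2 + 1)]

# The median is the average of those two values
median_by_hand = mean(middle)
median_by_hand    # 1.5, the same value median(x) gave above
```

This only handles an even-length vector. For an odd number of values, the median is simply the single middle element.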
standard deviation
The standard deviation is a workhorse stat that mathematicians love and the general public often misunderstands. Calculate it in R with sd:
> sd(x)
[1] 6.124632
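To take some of the mystery out of that number, here is a rough sketch of the same calculation done by hand. R's sd computes the sample standard deviation, which divides by n - 1 rather than n. The variable names (deviations, sample_variance, sd_by_hand) are my own, chosen for illustration.

```r
x = c(1, 4, -2, 10, 15, -5, -1, 2, 8, 0)

# How far each value sits from the mean
deviations = x - mean(x)

# Sample variance: sum of squared deviations divided by (n - 1)
sample_variance = sum(deviations^2) / (length(x) - 1)

# Standard deviation is the square root of the variance
sd_by_hand = sqrt(sample_variance)

# Compare with the built-in function (all.equal allows for rounding error)
all.equal(sd_by_hand, sd(x))
```

In practice I'd always just call sd. The point of the sketch is only to show what is under the hood.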
max/min
Want to know the high and/or low values in your data? Use the max and min functions:
> max(x)
[1] 15
> min(x)
[1] -5
summary
If you want to see all the summary statistics in one go, R has a function for that called summary:
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -5.00   -0.75    1.50    3.20    7.00   15.00
From left to right, you get the minimum value, the 1st quartile, the median, the mean, the third quartile, and the max. Nifty!
quantiles
Speaking of quartiles, R has a nice function for calculating ‘quantiles’. It’s called quantile. If that language is confusing, you can think of the function as returning percentiles.
For example, here’s how I’d get the 30th percentile in x:
> quantile(x, 0.3)
30%
-0.3
R gives me the percentile I’m calculating, followed by its value.
Here’s the 90th percentile:
> quantile(x, 0.9)
90%
10.5
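You might notice that neither -0.3 nor 10.5 actually appears in x. As far as I know, that's because quantile's default method (R calls it type 7) interpolates linearly between neighbouring sorted values. Here is a small sketch that asks for both percentiles in one call, then rebuilds the 30th percentile by hand. The names q, x_sort, p, h, lower and by_hand are mine, used only for illustration.

```r
x = c(1, 4, -2, 10, 15, -5, -1, 2, 8, 0)

# Several percentiles in one call: pass a vector of probabilities
q = quantile(x, c(0.3, 0.9))

# The 30th percentile by hand, interpolating between sorted values
x_sort = sort(x)
p = 0.3
h = 1 + (length(x) - 1) * p    # fractional position in the sorted data (about 3.7)
lower = floor(h)               # the sorted element just below that position
by_hand = x_sort[lower] + (h - lower) * (x_sort[lower + 1] - x_sort[lower])

# Compare with the built-in result (all.equal allows for rounding error)
all.equal(by_hand, unname(q["30%"]))
```

The position 3.7 falls between the 3rd sorted value (-1) and the 4th (0), so the result lands 70% of the way from -1 to 0.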
head and tail
The head and tail functions return the start and end of a vector.
Here are the first 4 values of x:
> head(x, 4)
[1] 1 4 -2 10
And here are the last 3 values of x:
> tail(x, 3)
[1] 2 8 0
N largest/smallest values
To get the single largest/smallest value, we’ve got the max and min functions. But what if I want to know the 3 largest/smallest values? How would I do that?
The answer is that we combine the sort function with the head/tail functions.
Suppose we want the 3 largest values in x. First we’d sort x:
> sort(x)
[1] -5 -2 -1 0 1 2 4 8 10 15
You can see that the 3 largest values live in the last 3 elements. So we’ll take the tail of the sorted values:
> x_sort = sort(x)
> tail(x_sort, 3)
[1] 8 10 15
Or suppose we want the 2 smallest values. Now we take the head of x_sort:
> x_sort = sort(x)
> head(x_sort, 2)
[1] -5 -2
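One variation, offered as a sketch rather than the approach used above: sort takes a decreasing argument, so you can flip the order and use head for the largest values too. That also returns them biggest-first, which is sometimes what you want. The name top3 is just an illustrative choice.

```r
x = c(1, 4, -2, 10, 15, -5, -1, 2, 8, 0)

# Sort from largest to smallest, then keep the first 3 values
top3 = head(sort(x, decreasing = TRUE), 3)
top3    # 15 10 8
```

Note the order differs from the tail version, which gave 8 10 15.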
That pretty much covers it
I’d say that the functions above cover 90% of the calculating that I do in my own research.
The hard part isn’t using these functions. (As you can see, they’re super easy.) The hard part is usually getting the data into a suitable form to apply these functions.
More on that in the future.