# 24. Linear regressions in R

· Blair Fix

The linear regression is the workhorse model in science. Today I’ll teach you how to use it in R.

### Make some random data

To get started, let’s make a linear relation between two variables, `x` and `y`. To do that, we’ll use the `rnorm` function, which generates random numbers from a normal distribution.

The code below generates 1000 random numbers from a normal distribution with a mean of 0 and standard deviation of 1. We’ll call this batch of numbers `x`:

```
x = rnorm( n = 1000, mean = 0, sd = 1 )
```

Next, let’s generate some statistical noise. Like `x`, our noise will be random numbers from a normal distribution. But we’ll make the standard deviation smaller than in `x`. Let’s make it `sd = 0.2`:

```
noise = rnorm( n = 1000, mean = 0, sd = 0.2 )
```

Finally, let’s make a linear relation. We’ll define the variable `y` as:

```
y = x + noise
```

Looking ahead, we’re going to fit a linear regression to our x-y data. But before we do that, it’s always best to plot the data to see what it looks like. Let’s do that now:

```
plot(x, y)
```

You should see a scatterplot with a nice linear relation between `x` and `y`.

Warning: if you plot your data and find that the trend is curved, stop now. It’s silly to fit a linear model to curvy data. You’ll need to do something more sophisticated, which I’ll discuss another time.
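To see what I mean by curvy data, here’s a quick sketch (my own addition, with a made-up variable `y_curved`) that generates a quadratic relation. Plot it and you’ll see a parabola that no straight line can capture:

```r
# generate x values and noise as before
x = rnorm( n = 1000, mean = 0, sd = 1 )
noise = rnorm( n = 1000, mean = 0, sd = 0.2 )

# a curved (quadratic) relation -- a straight line won't fit this
y_curved = x^2 + noise

plot(x, y_curved)
```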

### Correlation

Alright, we’ve got a linear relation. Before we run a regression, let’s start with its simpler cousin, the correlation.

To get the correlation coefficient between two sets of data, we use the `cor` function:

```
cor(x, y)
```

It will spit back the correlation coefficient between `x` and `y`. As expected, it’s high:

```
> cor(x, y)
[1] 0.9797063
```

For some context, the correlation coefficient is a number that varies from -1 to 1. It’s often given the symbol r:

• r = 0: no correlation
• r = 1: a perfect positive correlation
• r = -1: a perfect negative correlation

So our x-y relation has a correlation coefficient of r = 0.98 — a strong relation.
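If you want to see the negative end of the scale, here’s a quick sketch (my own addition, using a made-up variable `y_neg`). Flipping the sign of the relation flips the sign of r:

```r
# same setup as before
x = rnorm( n = 1000, mean = 0, sd = 1 )
noise = rnorm( n = 1000, mean = 0, sd = 0.2 )

# a negative relation between x and y
y_neg = -x + noise

# the correlation coefficient is now close to -1
cor(x, y_neg)
```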

### Linear regression

Now to our more complicated friend, the linear regression. To run a regression in R, we use the `lm` function. Here’s what it looks like:

```
lm( y ~ x )
```

Some notes. Yes, you need to put the dependent variable (here `y`) before the independent variable (here `x`). And if you’re wondering, the `~` symbol is code for ‘related to’.

Let’s look at the output of this code:

```
> lm( y ~ x )

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   0.009812     1.005313
```

Right away, you can see that it’s more complicated than the output of the correlation function. Here’s what it means.

First, R dumps back the formula that you gave it:

```
Call:
lm(formula = y ~ x)
```

Next, R tells you the coefficients of the linear regression.

```
Coefficients:
(Intercept)            x
   0.009812     1.005313
```

To make sense of these numbers, let’s review some high school math. The equation for a line is often taught as:

y = mx + b

Here m is the slope and b is the y-intercept. To confuse you a bit, statisticians usually flip the order of the equation, like this:

y = b + mx

Thinking like a statistician, R dumps the coefficients in this latter order. So the intercept (b) is 0.0098. And the slope (m) is 1.005. So our model is:

y = 0.0098 + 1.005 x

Now we can see what our noise did. Without it, we’d have a perfect linear relation, y = x. But the noise fudges things up, adding a non-zero intercept and making the slope differ slightly from 1.
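To convince yourself that the coefficients really define the fitted line, you can plug x back into y = b + mx by hand and compare against R’s own fitted values. (This snippet is my own addition; the variable names `fit`, `b`, `m`, and `y_hat` are mine.)

```r
# regenerate the data and fit the model
x = rnorm( n = 1000, mean = 0, sd = 1 )
noise = rnorm( n = 1000, mean = 0, sd = 0.2 )
y = x + noise

fit = lm( y ~ x )

# pull out the intercept (b) and slope (m)
b = coef(fit)[1]
m = coef(fit)[2]

# predictions from y = b + m*x match R's fitted values
y_hat = b + m * x
all.equal( as.numeric(y_hat), as.numeric( fitted(fit) ) )
```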

### Accessing the full regression menu

To recap, if we put `lm( y ~ x )` into the terminal, R will dump the coefficients of the regression. But under the hood, R can give us much more detail.

To see these details, we need to save our regression as a variable. Let’s call it `model`:

```
model = lm( y ~ x )
```

To get the gory details about our model, we can use the `summary` function:

```
summary( model )
```

We’ll get a data dump that looks like this:

```
Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.72652 -0.13974 -0.00039  0.13968  0.75094

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.009812   0.006531   1.502    0.133
x           1.005313   0.006511 154.412   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2065 on 998 degrees of freedom
Multiple R-squared:  0.9598,	Adjusted R-squared:  0.9598
F-statistic: 2.384e+04 on 1 and 998 DF,  p-value: < 2.2e-16
```

There’s a confusing amount of information here, so let’s wade through it.

First, R gives us a summary of the regression ‘residuals’. These are the distances between the y values in our data and the y values predicted by our model.

```
Residuals:
     Min       1Q   Median       3Q      Max
-0.72652 -0.13974 -0.00039  0.13968  0.75094
```
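If you want the residuals themselves (not just their summary), the `resid` function pulls them out. Here’s a quick check (my own addition) that they really are the gap between the data and the fitted values:

```r
# regenerate the data and fit the model
x = rnorm( n = 1000, mean = 0, sd = 1 )
noise = rnorm( n = 1000, mean = 0, sd = 0.2 )
y = x + noise

model = lm( y ~ x )

# the residuals: observed y minus the model's fitted y
r = resid(model)
all.equal( as.numeric(r), as.numeric( y - fitted(model) ) )

# a five-number summary of the residuals, like the one above
summary(r)
```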

Next, R dumps the coefficients of the linear model. The actual estimates are in the `Estimate` column. Next to these values, we get a bunch of statistics that I won’t wade through here.

```
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.009812   0.006531   1.502    0.133
x           1.005313   0.006511 154.412   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Finally, we get a summary of the regression fit, the most important of which is the R-squared value.

```
Multiple R-squared:  0.9598,	Adjusted R-squared:  0.9598
```

### Getting coefficients and the R-squared value

Having confused you with R’s data dump, let’s make things simpler.

When I run a regression, I typically want to save the regression coefficients. To get them, we first save our regression (here as `model`). Then we extract the model coefficients (the y-intercept and slope) using the `coef` function:

```
model = lm( y ~ x )
model_coefficients = coef(model)
```

Let’s have a look at them:

```
> model_coefficients
(Intercept)           x
0.009811796 1.005312922
```

Yep, still the same as before.

Now let’s extract the R-squared value. To do that, we use this code:

```
summary(model)$r.squared
```

I get back an R-squared of about 0.96 — nice and high:

```
> summary(model)$r.squared
[1] 0.9598244
```

But wait, what does an R-squared value mean? In simple terms, the R-squared tells us the proportion of the variation in y that is explained by our linear model. So in this example, a linear model between y and x explains 96% of the variation in y.
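A nice fact worth checking for yourself: in a simple regression with one independent variable, the R-squared value is just the correlation coefficient squared. (This snippet is my own addition.)

```r
# regenerate the data and fit the model
x = rnorm( n = 1000, mean = 0, sd = 1 )
noise = rnorm( n = 1000, mean = 0, sd = 0.2 )
y = x + noise

model = lm( y ~ x )

# with a single predictor, R-squared equals r^2
summary(model)$r.squared
cor(x, y)^2
```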

### Some homework

To get comfortable with linear regressions, here’s some homework.

First, try adjusting the standard deviation in the noise distribution. Here, for example, I’ve increased the noise to have `sd = 0.6`. See how that affects the regression.

```
x = rnorm( n = 1000, mean = 0, sd = 1 )
noise = rnorm( n = 1000, mean = 0, sd = 0.6 )
y = x + noise
```

You can also change the slope of the relation like this:

```
y = 3*x + noise
```

You might also try varying the sample size. To do that, change the `n` value in the `rnorm` function. Here’s a sample size of 10:

```
x = rnorm( n = 10, mean = 0, sd = 1 )
noise = rnorm( n = 10, mean = 0, sd = 0.6 )
y = x + noise
```

See what you find!