# 24. Linear regressions in R

The linear regression is the workhorse model in science. Today I’ll teach you how to use it in R.

### Make some random data

To get started, let’s make a linear relation between two variables, `x` and `y`. To do that, we’ll use the `rnorm` function, which generates random numbers from a normal distribution.

The code below generates 1000 random numbers from a normal distribution with a mean of 0 and standard deviation of 1. We’ll call this batch of numbers `x`:

```
x = rnorm( n = 1000, mean = 0, sd = 1 )
```

Next, let’s generate some statistical noise. Like `x`, our noise will be random numbers from a normal distribution. But we’ll make the standard deviation smaller than in `x`. Let’s make it `sd = 0.2`:

```
noise = rnorm( n = 1000, mean = 0, sd = 0.2 )
```

Finally, let’s make a linear relation. We’ll define the variable `y` as:

```
y = x + noise
```

### Plot your data first!

Looking ahead, we’re going to fit a linear regression to our x-y data. But before we do that, it’s always best to plot the data to see what it looks like. Let’s do that now:

```
plot(x, y)
```

The result looks like this — a nice linear relation between `x` and `y`:

Warning: if you plot your data and find that the trend is *curved*, stop now. It’s silly to fit a linear model to curvy data. You’ll need to do something more sophisticated, which I’ll discuss another time.
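To see what I mean, here’s a quick example of curvy data (the quadratic relation below is my own invention, not part of our dataset):

```
# a deliberately curved relation -- a straight line can't capture this
y_curved = x^2 + noise
plot(x, y_curved)
```

Plot it and you’ll see a clear U-shape that no single straight line can describe.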

### Correlation

Alright, we’ve got a linear relation. Before we run a regression, let’s start with its simpler cousin, the correlation.

To get the correlation coefficient between two sets of data, we use the `cor` function:

```
cor(x, y)
```

It will spit back the correlation coefficient between `x` and `y`. As expected, it’s high:

```
> cor(x, y)
[1] 0.9797063
```

For some context, the correlation coefficient is a number that varies from -1 to 1. It’s often given the symbol r:

- r = 0: no correlation
- r = 1: a perfect positive correlation
- r = -1: a perfect negative correlation

So our x-y relation has a correlation coefficient of r = 0.98 — a strong relation.
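If you’re curious where that number comes from, Pearson’s correlation coefficient is just the covariance of the two variables scaled by their standard deviations. You can reproduce `cor` by hand:

```
# Pearson's r: covariance scaled by both standard deviations
cov(x, y) / ( sd(x) * sd(y) )   # same value as cor(x, y)
```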

### Linear regression

Now to our more complicated friend, the linear regression. To run a regression in R, we use the `lm` function. Here’s what it looks like:

```
lm( y ~ x )
```

Some notes. Yes, you need to put the dependent variable (here `y`) *before* the independent variable (here `x`). And if you’re wondering, the `~` symbol is code for ‘related to’.
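The formula syntax can do a bit more than this, by the way. A couple of variations you may bump into (treat these as a teaser — they go beyond today’s lesson):

```
# drop the intercept, forcing the fitted line through the origin
lm( y ~ x + 0 )

# regress on two variables at once (assuming some second variable z exists):
# lm( y ~ x + z )
```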

Let’s look at the output of this code:

```
> lm( y ~ x )

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   0.009812     1.005313
```

Right away, you can see that it’s more complicated than the output of the correlation function. Here’s what it means.

First, R dumps back the formula that you gave it:

```
Call:
lm(formula = y ~ x)
```

Next, R tells you the coefficients of the linear regression.

```
Coefficients:
(Intercept)            x
   0.009812     1.005313
```

To make sense of these numbers, let’s review some high school math. The equation for a line is often taught as:

y = mx + b

Here m is the slope and b is the y intercept. To confuse you a bit, statisticians usually flip the order of the equation like this:

y = b + mx

Thinking like a statistician, R dumps the coefficients in this latter order. So the intercept (b) is 0.0098. And the slope (m) is 1.005. So our model is:

y = 0.0098 + 1.005 x

Now we can see what our noise did. Without it, we’d have a perfect linear relation, y = x. But the noise fudges things up, adding a non-zero intercept and making the slope differ slightly from 1.
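A nice way to see this is to draw the regression line on top of the scatter plot. The `abline` function accepts a fitted model directly, and for comparison we can dash in the noise-free relation y = x:

```
model = lm( y ~ x )
plot(x, y)
abline(model, col = "red")       # the fitted regression line
abline(a = 0, b = 1, lty = 2)    # the 'true' relation y = x, dashed
```

The two lines should sit almost exactly on top of each other.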

### Accessing the full regression menu

To recap, if we put `lm( y ~ x )` into the terminal, R will dump the coefficients of the regression. But under the hood, R can give us much more detail.

To see these details, we need to save our regression as a variable. Let’s call it `model`:

```
model = lm( y ~ x )
```

To get the gory details about our model, we can use the `summary` function:

```
summary( model )
```

We’ll get a data dump that looks like this:

```
Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.72652 -0.13974 -0.00039  0.13968  0.75094

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.009812   0.006531   1.502    0.133
x           1.005313   0.006511 154.412   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2065 on 998 degrees of freedom
Multiple R-squared:  0.9598,    Adjusted R-squared:  0.9598
F-statistic: 2.384e+04 on 1 and 998 DF,  p-value: < 2.2e-16
```

There’s a confusing amount of information here, so let’s wade through it.

First, R gives us a summary of the regression ‘residuals’. These are the differences between the y values in our data and the y values predicted by our model.

```
Residuals:
     Min       1Q   Median       3Q      Max
-0.72652 -0.13974 -0.00039  0.13968  0.75094
```
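If you want these residuals for yourself (say, to plot them), the `residuals` function extracts them from the fitted model:

```
res = residuals(model)
summary(res)   # gives the same five-number summary (plus the mean)
plot(x, res)   # a healthy fit scatters randomly around zero
```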

Next, R dumps the coefficients of the linear model. The actual estimates are in the `Estimate` column. Next to these values, we get a bunch of statistics that I won’t wade through here.

```
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.009812   0.006531   1.502    0.133
x           1.005313   0.006511 154.412   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Finally, we get a summary of the regression fit, the most important of which is the R-squared value.

```
Multiple R-squared: 0.9598, Adjusted R-squared: 0.9598
```

### Getting coefficients and the R-squared value

Having confused you with R’s data dump, let’s make things simpler.

When I run a regression, I typically want to save the regression coefficients. To get them, we first save our regression (here as `model`). Then we extract the model coefficients (the y-intercept and slope) using the `coef` function:

```
model = lm( y ~ x )
model_coefficients = coef(model)
```

Let’s have a look at them:

```
> model_coefficients
(Intercept)           x
0.009811796 1.005312922
```

Yep, still the same as before.
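One handy detail: `coef` returns a *named* vector, so you can pull out each coefficient by name:

```
model_coefficients["(Intercept)"]   # the y-intercept (b)
model_coefficients["x"]             # the slope (m)
```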

Now let’s extract the R-squared value. To do that, we use this code:

```
summary(model)$r.squared
```

I get back an R-squared of about 0.96 — nice and high:

```
> summary(model)$r.squared
[1] 0.9598244
```

But wait, what does an R-squared value mean? Well, in simple terms the R-squared tells us the proportion of the variation in y that is explained by our linear model. So in this example, a linear model between y and x explains 96% of the variation in y.
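A nice sanity check: for a simple regression with a single predictor, the R-squared is literally the square of the correlation coefficient we computed earlier:

```
cor(x, y)^2                # squares our r of 0.9797, matching the R-squared
summary(model)$r.squared
```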

### Some homework

To get comfortable with linear regressions, here’s some homework.

First, try adjusting the standard deviation in the noise distribution. Here, for example, I’ve increased the noise to have `sd = 0.6`. See how that affects the regression.

```
x = rnorm( n = 1000, mean = 0, sd = 1)
noise = rnorm( n = 1000, mean = 0, sd = 0.6 )
y = x + noise
```

You can also change the slope of the relation like this:

```
y = 3*x + noise
```

You might also try varying the sample size. To do that, change the `n` value in the `rnorm` function. Here’s a sample size of 10:

```
x = rnorm( n = 10, mean = 0, sd = 1)
noise = rnorm( n = 10, mean = 0, sd = 0.6 )
y = x + noise
```

See what you find!
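If retyping all that gets tedious, here’s one way (my own sketch, not part of the lesson) to wrap the whole experiment in a function, so you can vary the noise, slope, and sample size in a single line:

```
run_experiment = function(n, noise_sd, slope = 1) {
  x = rnorm( n = n, mean = 0, sd = 1 )
  noise = rnorm( n = n, mean = 0, sd = noise_sd )
  y = slope * x + noise
  summary( lm( y ~ x ) )$r.squared
}

# more noise should mean a lower R-squared:
run_experiment( n = 1000, noise_sd = 0.2 )
run_experiment( n = 1000, noise_sd = 0.6 )
```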