Blair’s Science Desk

15. Reading data into R

· Blair Fix

Today we’re going to talk about reading data into R. I’ll first show you the options that come with base R (meaning you don’t need any packages). Then I’ll talk about some improved versions of these functions.

Reading unformatted data

Sometimes the data that you’re working with consists of unformatted text. For example, maybe you’re analyzing the contents of a website. To read this kind of data into R, I use the readLines function. As the name suggests, this function takes the lines of a file and reads them into memory. The syntax works like this:

readLines(path_to_file)

By default, the readLines command will dump what it reads to your screen. If you want to actually use the data, you need to save it to a variable like this:

data = readLines(path_to_file)

Here’s an example. I’ve put a plain text file at the following url:

https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/raw_text.txt

The file contains this text:

This is a text file.
It is not formatted in any particular way.
The best way to read it into R is with the readLines() function.

The line above is blank.
That's all for now.

A nice feature of R’s reading functions is that they can accept a url as in input for the file path. So we could read in my example like this:

url = "https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/raw_text.txt"

data = readLines(url)

Let’s see what’s in out data variable:

> data
[1] "This is a text file."
[2] "It is not formatted in any particular way."
[3] "The best way to read it into R is with the readLines() function."
[4] ""
[5] "The line above is blank."
[6] "That's all for now."

So what happened is that each line of the file becomes an element in a character vector. Note that if there are empty lines in your file, R will give them an empty element, "".

If we wanted the text in the second line, for example, we’d input:

> data[2]
[1] "It is not formatted in any particular way."

If your file happens to contain numbers, readLines will interpret them as characters. So if you want to actually use the numbers (as numbers), you’ll have to do some data cleaning. More on that later.

For now, let’s move on to reading in formatted data.

Reading csv data

Formatted data can come in many shapes and forms. However, the most common is probably the ‘csv’ format — which stands for comma separated values.

As the name suggests, this is data that is separated by a comma. Here’s an example:

name,age,height
Sally,29,1.64
David,55,1.71
Brenda,38,1.56
Emily,70,1.67
Mark,44,1.91

This is a toy dataset containing the names of 5 people, plus their age and height. I’ve stored it at the url below:

https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/dataset.csv

To read this csv data into R, we’ll use the read.csv function. To get the data, we can just pass the url along:

url = "https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/dataset.csv"

data = read.csv(url)

Let’s see what’s in the data:

> data
    name age height
1  Sally  29   1.64
2  David  55   1.71
3 Brenda  38   1.56
4  Emily  70   1.67
5   Mark  44   1.91

The read.csv function will import the data as a data frame. We can access the columns using the $ syntax. Let’s get the data in the names column:

> data$name
[1] "Sally"  "David"  "Brenda" "Emily"  "Mark"

Note that name represents the heading of the column containing the data we want. (R has no idea that the name column actually contains people’s names.)

Let’s get the data in the height column:

> data$height
[1] 1.64 1.71 1.56 1.67 1.91

Notice that the heights are formatted as numbers (not characters). That’s because the read.csv function is smart enough to know that if a particular column contains only numbers, you’ll want the data formatted as a number.

Because we have numeric data, we can apply all of the standard stats functions that I talked about here. For example, let’s get the average (mean) height:

> mean(data$height)
[1] 1.698

Or how about the minimum age:

> min(data$age)
[1] 29

These are just some simple examples. Once you’ve read the data into R, the real work begins.

When you need speed

For small datasets, the read.csv function does the job just fine. But for big data, it’s too slow to be usable. Here are two alternatives that are faster:

fread

My preferred reading function is called fread. It comes from the data.table package, and is blazing fast. To use it, install the data.table package:

install.packages("data.table")

Load the library and fread something:

library(data.table)

url = "https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/dataset.csv"

data = fread(url)

Like the other reading functions, fread accepts a file path (or a url). fread is also smart and will guess how the data is formatted. So if the data happens to be tab separated (instead of comma separated) fread will know what to do.

Note that fread will import the data as a ‘data table’, which is slightly different than an R data frame. For simple analysis, though, nothing changes.

read_csv

The read_csv (note the underscore in the name) function comes from the readr package. It does much the same thing as the basic read.csv, except it does it much faster.

Install readr like this:

install.packages("readr")

Read in data like this:

library(readr)

url = "https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/dataset.csv"

data = read_csv(url)

Reading from you local computer

I’ve shown you how to read data from a url. Now you should try the following:

  1. Download my example files
  2. Stash them somewhere on your computer
  3. Read them into R

As an example, suppose I saved the dataset.csv file in my Downloads folder. On my computer, the filepath is:

/home/blair/Downloads

Here’s one way to get the data. First, set R’s working directory to the Downloads path. To do that we use the setwd command:

setwd("/home/blair/Downloads")

(Here’s a review of changing directories.)

Then read the file called dataset.csv:

data = read.csv("dataset.csv")

Yes, you need the quotes around the file name.

As practice, try putting the downloaded data in different locations on your computer and asking R to read it. If it can’t find the file, it will tell you:

> data = read.csv("dataset.csv")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'dataset.csv': No such file or directory

Happy reading!