Today we’re going to talk about reading data into R. I’ll first show you the options that come with base R (meaning you don’t need any packages). Then I’ll talk about some improved versions of these functions.
Reading unformatted data
Sometimes the data that you’re working with consists of unformatted text. For example, maybe you’re analyzing the contents of a website. To read this kind of data into R, I use the
readLines function. As the name suggests, this function takes the lines of a file and reads them into memory. The syntax works like this:
readLines(path_to_file)
By default, the
readLines command will dump what it reads to your screen. If you want to actually use the data, you need to save it to a variable like this:
data = readLines(path_to_file)
Here’s an example. I’ve put a plain text file at the following url:
https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/raw_text.txt
The file contains this text:
This is a text file.
It is not formatted in any particular way.
The best way to read it into R is with the readLines() function.

The line above is blank.
That's all for now.
A nice feature of R’s reading functions is that they can accept a url as an input for the file path. So we could read in my example like this:
url = "https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/raw_text.txt"
data = readLines(url)
Let’s see what’s in our data:
> data
[1] "This is a text file."
[2] "It is not formatted in any particular way."
[3] "The best way to read it into R is with the readLines() function."
[4] ""
[5] "The line above is blank."
[6] "That's all for now."
So what happened is that each line of the file became an element in a character vector. Note that if there are empty lines in your file, R will give each of them an empty element, "".
If we wanted the text in the second line, for example, we’d input:
> data[2]
[1] "It is not formatted in any particular way."
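If you want to try this indexing without a network connection, here is a minimal sketch. It assumes a temporary file (created with tempfile, my stand-in for the url) that holds the same six lines, then reads it back and picks out elements.

```r
# Write the example text to a temporary file (a stand-in for the url).
path = tempfile(fileext = ".txt")
writeLines(c(
  "This is a text file.",
  "It is not formatted in any particular way.",
  "The best way to read it into R is with the readLines() function.",
  "",
  "The line above is blank.",
  "That's all for now."
), path)

data = readLines(path)

length(data)       # number of lines read, including the blank one
data[2]            # the second line of the file
data[data != ""]   # every line except the blank ones
```

The last line is one simple way to drop the empty elements if you don’t need them.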
If your file happens to contain numbers,
readLines will interpret them as characters. So if you want to actually use the numbers (as numbers), you’ll have to do some data cleaning. More on that later.
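As a small preview of that cleaning step, here is a sketch of the simplest case. It assumes a hypothetical file with exactly one number per line (the file and its values are made up for illustration) and converts the text with as.numeric.

```r
# Hypothetical file with one number per line.
path = tempfile(fileext = ".txt")
writeLines(c("29", "55", "38"), path)

raw = readLines(path)
class(raw)       # "character": the numbers arrive as text

numbers = as.numeric(raw)
class(numbers)   # "numeric"
mean(numbers)    # arithmetic now works
```

Real files are rarely this tidy. If a line contains anything that isn’t a number, as.numeric returns NA for it and gives a warning.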
For now, let’s move on to reading in formatted data.
Reading csv data
Formatted data can come in many shapes and forms. However, the most common is probably the ‘csv’ format — which stands for comma separated values.
As the name suggests, this is data in which the values are separated by commas. Here’s an example:
name,age,height
Sally,29,1.64
David,55,1.71
Brenda,38,1.56
Emily,70,1.67
Mark,44,1.91
This is a toy dataset containing the names of 5 people, plus their age and height. I’ve stored it at the url below:
https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/dataset.csv
To read this csv data into R, we’ll use the
read.csv function. To get the data, we can just pass the url along:
url = "https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/dataset.csv"
data = read.csv(url)
Let’s see what’s in the data:
> data
    name age height
1  Sally  29   1.64
2  David  55   1.71
3 Brenda  38   1.56
4  Emily  70   1.67
5   Mark  44   1.91
The read.csv function will import the data as a data frame. We can access the columns using the
$ syntax. Let’s get the data in the name column:
> data$name
[1] "Sally"  "David"  "Brenda" "Emily"  "Mark"
Here, name represents the heading of the column containing the data we want. (R has no idea that the
name column actually contains people’s names.)
Let’s get the data in the height column:
> data$height
[1] 1.64 1.71 1.56 1.67 1.91
Notice that the heights are formatted as numbers (not characters). That’s because the
read.csv function is smart enough to know that if a particular column contains only numbers, you’ll want the data formatted as a number.
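To check which type R picked for each column, here is a minimal sketch. It passes the toy dataset inline through read.csv’s text argument (so no url is needed) and sets stringsAsFactors = FALSE explicitly, an assumption I’ve added so that the name column comes back as characters on older versions of R too.

```r
# The same toy dataset, passed inline instead of read from the url.
csv_text = "name,age,height
Sally,29,1.64
David,55,1.71
Brenda,38,1.56
Emily,70,1.67
Mark,44,1.91"

data = read.csv(text = csv_text, stringsAsFactors = FALSE)

# Ask R which type it chose for each column.
sapply(data, class)

# str() gives a compact summary of the whole data frame.
str(data)
```

Whole numbers like the ages come back as integers, while the heights, which have decimals, come back as general numerics.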
Because we have numeric data, we can apply all of the standard stats functions that I talked about here. For example, let’s get the average (mean) height:
> mean(data$height)
[1] 1.698
Or how about the minimum age:
> min(data$age)
[1] 29
These are just some simple examples. Once you’ve read the data into R, the real work begins.
When you need speed
For small datasets, the
read.csv function does the job just fine. But for big data, it can be too slow to be practical. Here are two alternatives that are faster:
My preferred reading function is called
fread. It comes from the
data.table package, and is blazing fast. To use it, install the data.table package:
install.packages("data.table")
Load the library and read in data like this:
library(data.table)
url = "https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/dataset.csv"
data = fread(url)
Like the other reading functions,
fread accepts a file path (or a url).
fread is also smart and will guess how the data is formatted. So if the data happens to be tab separated (instead of comma separated),
fread will know what to do.
fread will import the data as a ‘data table’, which is slightly different than an R data frame. For simple analysis, though, nothing changes.
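Here is a short sketch of both points. It assumes the data.table package is installed, and it uses a made-up, tab-separated cut of the toy dataset passed inline through fread’s text argument rather than read from a url.

```r
library(data.table)

# Tab-separated version of the toy dataset (first three people only).
tsv_text = "name\tage\theight\nSally\t29\t1.64\nDavid\t55\t1.71\nBrenda\t38\t1.56"

data = fread(text = tsv_text)   # no sep argument: fread guesses the separator

class(data)    # a data table is also a data frame
data$height    # the $ syntax works as before

# Convert to a plain data frame if other code expects one.
df = as.data.frame(data)
class(df)
```

The conversion at the end is optional. It is only worth doing if you run into code that behaves differently with a data table.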
The read_csv (note the underscore in the name) function comes from the
readr package. It does much the same thing as the basic
read.csv, except it does it much faster.
Install readr like this:
install.packages("readr")
Read in data like this:
library(readr)
url = "https://sciencedesk.economicsfromthetopdown.com/2022/10/read-data/example_data/dataset.csv"
data = read_csv(url)
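If you’d like to see the speed difference on your own machine, here is a rough sketch using system.time. The file is a throwaway csv filled with random numbers (made-up data, for timing only), and the faster readers are timed only if their packages are installed. Treat the timings as a rough guide; they will vary with your computer and the size of the file.

```r
# Build a throwaway csv with made-up numbers (for timing only).
n = 100000
path = tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:n, x = rnorm(n), y = runif(n)),
          path, row.names = FALSE)

# system.time() reports how long an expression takes to run.
# (Inside a function call, assignment has to use <- rather than =.)
print(system.time(d1 <- read.csv(path)))

# Time the faster readers only if their packages are installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  print(system.time(d2 <- data.table::fread(path)))
}
if (requireNamespace("readr", quietly = TRUE)) {
  print(system.time(d3 <- suppressMessages(readr::read_csv(path))))
}
```

Increase n if all three finish too quickly to compare.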
Reading from your local computer
I’ve shown you how to read data from a url. Now you should try the following:
- Download my example files
- Stash them somewhere on your computer
- Read them into R
As an example, suppose I saved the
dataset.csv file in my Downloads folder. On my computer, the filepath is:
Here’s one way to get the data. First, set R’s working directory to the Downloads path. To do that we use the setwd function.
(Here’s a review of changing directories.)
Then read the file called dataset.csv:
data = read.csv("dataset.csv")
Yes, you need the quotes around the file name.
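Here is a self-contained sketch of those two steps. Since your Downloads path will differ from mine, it assumes a throwaway folder inside R’s temporary directory (the folder name downloads_example is made up) and puts a small dataset.csv there first.

```r
# Create a throwaway folder holding dataset.csv (a stand-in for Downloads).
folder = file.path(tempdir(), "downloads_example")
dir.create(folder, showWarnings = FALSE)
writeLines(c("name,age,height", "Sally,29,1.64", "David,55,1.71"),
           file.path(folder, "dataset.csv"))

old_wd = setwd(folder)   # setwd() returns the previous working directory
getwd()                  # confirm where R is now looking

# A bare file name is looked up in the working directory.
data = read.csv("dataset.csv")

setwd(old_wd)            # put the working directory back
```

Saving the old working directory and restoring it at the end is optional, but it keeps the rest of your session unaffected.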
As practice, try putting the downloaded data in different locations on your computer and asking R to read it. If it can’t find the file, it will tell you:
> data = read.csv("dataset.csv")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'dataset.csv': No such file or directory
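When you hit this error, it usually helps to check where R is looking. Here is a minimal sketch that tests for the file with file.exists before reading it. The path is hypothetical and deliberately points at a folder that doesn’t exist, so the else branch runs.

```r
# Hypothetical path to a file that is not there.
path = file.path(tempdir(), "no_such_folder", "dataset.csv")

if (file.exists(path)) {
  data = read.csv(path)
} else {
  message("Can't find ", path)
  message("R is currently looking in: ", getwd())
}
```

Passing a full path to read.csv, as done here, also means you don’t need to change the working directory at all.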