2. Merging data in R
One of the frustrating things about doing empirical research is that the data you want lives in many different places. Sometimes you can download a beautifully formatted database that contains exactly the data you need. But more often, the data you want must be pulled from many corners of the internet.
Once you have the data, you need to put it together so you can work with it. If you’re working in R, you can use the merge
function. As the name suggests, the function merges two datasets.
What’s funny is that I was several years into using R before I discovered this function. I had been banging my head writing custom code to combine various datasets, only to learn that there was a base R function that did exactly what I want. The lesson here is that before you write custom code, you’re better off doing a quick search to see if there’s a pre-written function that will do what you want. There usually is.
Back to the merge
function. To understand how it works, imagine that you have a database called sales
that contains sales data for various companies:
sales:
year company sales profit
1: 1997 Apple 105 5
2: 1999 Ford 505 105
3: 1999 Ford 505 51
4: 2000 Microsoft 2450 181
5: 2001 Microsoft 2509 203
Imagine that we want to calculate the ‘markup’ of each company. To do that we need profit data. The trouble is that this data lives in a seperate database, profit
, that looks like this:
profit:
year company profit
1: 1999 Ford 105
2: 2001 Microsoft 203
3: 1997 Apple 5
4: 1999 Ford 51
5: 2000 Microsoft 181
To work with our sales and profit data, we need to combine it into one dataset. That’s whate the merge
function is designed for. It takes as an input two sets of data, x
and y
:
merge(x, y, by)
The by
value tells merge
the variables you would like to use as indexes. In our example, we want to merge the data by year
and company
. The function will then look for all the entries with the same year and company and merge them. Applying that to our sale and profit data, we’d enter:
all = merge(sales, profit, by = c("year", "company"))
Some comments. Yes, the quotes around "year"
and "company"
are required. I don’t know why … that’s just the way the function works. The c()
argument is how R combines elements. Note that by default, merge
will keep only the by elements that are common to both sets of data.
Here’s what the function will output:
all:
year company sales profit
1: 1997 Apple 105 5
2: 1999 Ford 505 105
3: 2000 Microsoft 2450 181
4: 2001 Microsoft 2509 203
Now we can calculate the markup for each company by dividing profits by sales:
all$markup = all$profit / all$sales * 100
all:
year company sales profit markup
1: 1997 Apple 105 5 4.761905
2: 1999 Ford 505 105 20.792079
3: 2000 Microsoft 2450 181 7.387755
4: 2001 Microsoft 2509 203 8.090873
So far, I haven’t talked about the type of data we’ve been working with. The nice thing about merge
is that it works both on data.frames — R’s basic data holder — and also on data.tables. If you’re working with large data, I recommend using data.tables as they are much much faster.
Happy merging!