18. Accessing data elements in R

Nov 14, 2022 · Blair Fix

Probably the most important part of learning R is figuring out how to access the elements of a database. In this post, we’ll play with some financial data from US public corporations. I’ll show you the quick way to access data using indexes. Then I’ll show you some better options that are ‘human readable’.

Read in the data

Every R analysis starts by reading in the data you want. Here, we’ll read in data from the following url:

url = "https://sciencedesk.economicsfromthetopdown.com/2022/11/r-access-elements/data/fdata.csv"

We’ll import the data using the read.csv command and dump it into a variable called fdata.

fdata = read.csv(url)

What’s in my data?

Alright, we’ve got some data contained in a variable called fdata. The first thing we’ll do is look at what’s there.

We have many options. The simplest approach is to just type our variable name (fdata) in the console and hit enter: This option is fine for small datasets. But it’s unwieldy if the dataset is large. The data dump will completely clog your terminal.

A better option is to use the head function. By default, it will dump the first 6 lines of the data:

head(fdata)

Here’s what I get back:

  year ticker   profit     sales
1 2010    AIR   69.826  1775.782
2 2010 ADCT.1   62.000  1156.600
3 2010    AAL -471.000 22170.000
4 2010   CECE    2.105   140.602
5 2010    AVX  244.003  1653.176
6 2010    PNW  350.053  3263.645

Alright, it looks like we have four columns of data: year, ticker, profit and sales. If you haven’t already guessed, we’re looking at financial data for individual companies, organized by year and the stock ticker of the company.

How much data do I have?

Once I’ve read in a dataset, usually the first thing I do is see how much data I have. For that, I use the nrow function, which tells us how many rows are in our database:

nrow(fdata)

Here’s what I get back:

[1] 9019

So our database has just over 9 thousand rows.

If we want to know how many columns we have, we can use the ncol function:

ncol(fdata)

R will return:

[1] 4

So our database has 4 columns (which we already knew).

Accessing data by index

Now that we’ve got our database, let’s talk about how we access its elements. One option is to use row and column indexes. The syntax looks like this:

fdata[ row_number, column_number ]

For example, to access the data in the first row and first column of fdata, I’d type:

fdata[ 1, 1 ]

I get back the first entry in the year column:

[1] 2010

Another example. Let’s get data from the 3rd row and 4th column:

fdata[ 3, 4 ]

I get back the third entry in the sales column:

[1] 22170

Accessing whole rows by index

If we want all of the data in a particular row, we leave the column entry blank. For example, here’s how to get all of the data in the second row of fdata:

fdata[ 2, ]

Here’s what we get back:

  year ticker profit  sales
2 2010 ADCT.1     62 1156.6

Accessing whole columns by index

If we want all of the data in a particular column, we leave the row entry blank. Here’s all the data in the third column of fdata:

fdata[ , 3 ]

R will dump out several thousand entries that begin like this:

 [1]     69.826     62.000   -471.000      2.105    244.003    350.053    118.376   4626.172

Don’t hardcode indexes

It’s good to understand how indexes work, because they are the most basic way of accessing data. That said, you want to avoid hard coding them into your analysis. Here’s why.

Suppose my code contains the following entry:

p = fdata[ 5, 3 ]

What does this code do? Well, it takes data from the 5th row and 3rd column of fdata and dumps it into a variable called p. But what is this data? Our code doesn’t tell us.

That’s a problem. Remember, when you write R code, the goal is to both analyze data and document what you’ve done. Sure, hard-coded indexes are a quick and easy way to analyze data. But since they contain no metadata, they lead to code that is horribly opaque.

Fortunately, R has many ‘human readable’ ways to access data. The most basic is probably the column operator, $.

Accessing columns by name

To understand the $ operator, we need to back up a bit and look at how R imports data. By default, R assumes that the first row of your dataset contains column names — names that describe the data below. When R reads in this data, it applies these names to each column of the imported data frame.

We can access these column names using the name function:

names(fdata)

R returns:

[1] "year"   "ticker" "profit" "sales"

The $ operator allows us to access columns by name. For example, if I wanted all of the data in the year column of fdata, I’d type:

fdata$year

Note that I could access the same data using indexes. The year data is in the first column, so I’d enter:

fdata[ , 1 ]

To R, the two operations are identical. But to humans, fdata$year is far more descriptive.

Accessing rows by condition

To review, the $ allows us to access columns by name. Is there a similar way to access rows by name?

The answer is no. Unlike columns, R does not (by default) attach names to each row of data. The assumption is that column names describe the data, and then everything that follows is part of the dataset. So instead of accessing rows by ’name’, we access them by ‘condition’.

Here’s an example. Suppose we want to find data for Apple. How would we do it? Well, the obvious option is to search the ticker column for Apple’s stock ticker “AAPL”.

R has many options for this type of search, but here we’ll use the subset function. Its syntax works like this:

subset( dataset, contition )

Here, our dataset is fdata. And the condition is that the data in the ticker column should equal "AAPL". To write this condition, we use the == symbol. The full request is:

subset( fdata, ticker == "AAPL" )

(Note: If you use the single equals sign (as in ticker = "AAPL"), R will throw an error. That’s because it thinks you’re defining the ticker variable, rather than searching for equality.)

Here’s what I get back from our subset request:

   year ticker profit sales
96 2010   AAPL  14013 65225

So it looks like there’s one line of AAPL data, which happens to be in row 96.

Getting Apple’s profit

Suppose that I want to get profit data for Apple. Here’s one way to do it. From the example above, I know that Apple’s data is in row 96. And profit data is in column 3. So I can get Apple’s profit by hard coding those indexes:

apple_profit = fdata[ 96, 3 ]

The code is succinct, but opaque. And if the database changed, my code would give me the wrong result.

Here’s a better way to get Apple’s profit data. First, I ask R to subset the data on the condition that the ticker column equals “AAPL”. Let’s call that result apple:

apple = subset( fdata, ticker == "AAPL" )

Then I use the $ operator to access the profit column of our apple data:

apple_profit = apple$profit

Let’s see what we found:

> apple_profit
[1] 14013

If all went well, it’s the same data as fdata[ 96, 3]. Let’s check:

> fdata[ 96, 3]
[1] 14013

Yep, it’s the same.

The coding payoff

At first, it may seem annoying to take the long, descriptive route to accessing elements of your data. But as you learn to use R, you’ll realize the advantages. First, your code will be self-documenting. Second, your code will ‘scale’ with ease. By ‘scale’, I mean that you can use the same code, regardless of the size of your database.

In our example, there was one line of Apple data. But the same code (using the subset function) would work if there were a million lines of Apple data. That’s the beauty of a coding language like R. For simple analysis, the code may seem cumbersome compared to what you’d do in a spreadsheet. But that same code will work on a enormous database — one that Excel couldn’t even load.

Because accessing elements of your data is such a basic part of analysis, R provides many many ways to do it. But that’s a topic for the future. For now, I recommend that you play with the subset function to see what you can do.

Here’s some examples to try:

# data with positive profit
subset( fdata, profit > 0 )

# data with negative profit
subset( fdata, profit < 0 )

# data in which sales are less than profit
subset( fdata, sales < profit )

Happy subsetting!