As a long-time student, I’ve spent most of my career working on cheap computers with minimal memory. That can make it difficult to work with big datasets.
When you work in R, for example, the analysis usually has 3 steps:
- Read the data into memory
- Analyze the data
- Write out some results
As long as you can read the whole database into your computer’s memory, this approach works well. But if the database is huge, then you can’t get past step 1. Here are some tricks to help get around this roadblock.
1. Open a Linux/Unix terminal
For the steps ahead, we’re going to use Linux/Unix commands that can be executed from a terminal. We’re doing this (instead of opening R or Python) because the Linux/Unix terminal has some fantastic commands for manipulating text files.
If you use Linux or macOS, just open your terminal and try out the commands below. If you’re a Windows user, you’ll first have to install the Windows Subsystem for Linux. Once you’ve got that running, you’ll have a Linux terminal at your disposal.
2. Get some data
Assuming you have a Linux/Unix terminal open, we’re going to use it to get some data.
Let’s download postal code data from geonames.org. To do that, we’ll use a command-line download tool such as wget.
If all goes well, you’ll now have a file called GB_full.csv.zip, which is a zip file containing postal codes for Great Britain. (The file will be located in the directory in which you ran the download command.)
Next, let’s unzip the data:
unzip GB_full.csv.zip
Now you should have a file called GB_full.txt. That’s got the data we’ll work with.
The GB_full.txt database is about 170 MB, so it’s not actually that large. You could easily read the whole file into memory on almost any computer. But let’s pretend it’s an enormous dataset that’s too big for your computer. Here are some tips for working with the data.
head to get a sense for the data
Whenever I download a new dataset, one of the first things I do is have a look at it. If the dataset is small, I typically read the whole thing into R and peruse it from there. But if the data is too big for that, I use the Linux head command, which prints out the top of the file. To see the head of our GB_full.txt file, we’d enter:
head GB_full.txt
By default, head prints the first 10 lines of the file. For our dataset, it looks like this (scroll right to see all the data):
GB AL3 8QE Slip End England ENG Bedfordshire Central Bedfordshire E06000056 51.8479 -0.4474 6
GB AL5 3NG Harpenden England ENG Bedfordshire Central Bedfordshire E06000056 51.8321 -0.383 6
GB AL5 3NS Hyde England ENG Bedfordshire Central Bedfordshire E06000056 51.8333 -0.3763 6
GB AL5 3QF Hyde England ENG Bedfordshire Central Bedfordshire E06000056 51.8341 -0.385 6
GB B10 0AB Birmingham England ENG West Midlands Birmingham District (B) E08000025 52.4706 -1.875 6
GB B10 0AD Birmingham England ENG West Midlands Birmingham District (B) E08000025 52.4691 -1.8737 6
GB B10 0AE Birmingham England ENG West Midlands Birmingham District (B) E08000025 52.4661 -1.8637 6
GB B10 0AF Birmingham England ENG West Midlands Birmingham District (B) E08000025 52.469 -1.859 6
GB B10 0AL Birmingham England ENG West Midlands Birmingham District (B) E08000025 52.4697 -1.8682 6
GB B10 0AR Birmingham England ENG West Midlands Birmingham District (B) E08000025 52.4654 -1.8604 6
What’s nice about head is that it is super fast, even if the database is enormous. And although it prints only a small portion of the data, that’s often enough to understand what’s in the dataset.
You can control how many lines head prints out using the -n flag. For example, to print out the first 4 lines of our file, we’d enter:
head -n 4 GB_full.txt
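Here’s a minimal sketch of the -n flag that runs on its own. It builds a five-line stand-in file (sample.txt is a made-up name, not part of the geonames data). With the real data, you’d swap in GB_full.txt.

```shell
# Work in a throwaway directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for GB_full.txt: five numbered lines.
printf 'line 1\nline 2\nline 3\nline 4\nline 5\n' > sample.txt

# Print only the first 2 lines; head stops reading once it has printed them.
first_two=$(head -n 2 sample.txt)
echo "$first_two"
```

Putting the flag before the file name (rather than after it) keeps the command working on both Linux and macOS.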
wc to tell how much data you have
The next thing you might want to know is how many lines are in your database. To find that out, we can use the wc command. By default, this command will print out the number of lines in the file, the number of words, and the number of bytes. If we only want the number of lines, we can use the -l flag:
wc -l GB_full.txt
The result tells us that our database has about 1.8 million lines. On the scale of big datasets, that’s fairly small. But it’s not something you’d want to open in Excel.
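If you want the line count as a bare number (say, to reuse in a script), one approach is to feed the file to wc through standard input, so the file name isn’t printed. The sketch below uses a made-up stand-in file, sample.txt, so it runs without the real data.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for GB_full.txt: four short lines.
printf 'a\nb\nc\nd\n' > sample.txt

# Reading from stdin makes wc print just the number, without the file name.
# The $(( )) wrapper strips the padding that some versions of wc add.
line_count=$(( $(wc -l < sample.txt) ))
echo "sample.txt has $line_count lines"
```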
grep the data you want
grep is a legendary command line tool created in 1973 by Unix developer Ken Thompson. It’s basically a tool to do command line searches.
grep is useful if you have a huge dataset but only want to use a small portion of it. For example, in our postal code data, suppose we only want the data for one city, say ‘Birmingham’. We can use the grep command to find all the lines of data that contain the word ‘Birmingham’:
grep "Birmingham" GB_full.txt
By default, grep will dump the results to your screen. If the dataset is large, this screen dump won’t be helpful.
A better option is to send the grep results to a new file. For that, you can use the > redirection operator. The code below will write our grep results into a file called birmingham.txt:
grep "Birmingham" GB_full.txt > birmingham.txt
If the birmingham.txt file is a reasonable size, you can then read it into memory and work with it.
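To see the redirect in action without the full dataset, here’s a small sketch. The files sample.txt and birmingham_sample.txt are made-up names for illustration, and the three rows are a cut-down imitation of the real data.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for GB_full.txt: three tab-separated rows.
printf 'GB\tAL5 3NG\tHarpenden\nGB\tB10 0AB\tBirmingham\nGB\tB10 0AD\tBirmingham\n' > sample.txt

# Keep only the matching lines and write them to a new file.
# Note that > overwrites the target file if it already exists.
grep "Birmingham" sample.txt > birmingham_sample.txt

# Check what ended up in the new file.
cat birmingham_sample.txt
```

One thing to watch for: because > overwrites without asking, it’s worth double-checking the output file name before you hit enter.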
Another option is to use grep to explore the data. To do that, you can use the | (pipe) operator to send the output of grep to another command. For example, we already reviewed how to use wc -l to get the number of lines in a file. We can also use it to count the number of lines in our grep results. This code would tell us how many lines of our database reference ‘Birmingham’:
grep "Birmingham" GB_full.txt | wc -l
When I run this, I get back about 24K entries for Birmingham.
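As a side note, grep can do this count on its own: the -c flag prints the number of matching lines instead of the lines themselves. The sketch below compares both approaches on a made-up stand-in file (sample.txt), so it runs without the real data.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for GB_full.txt: three tab-separated rows.
printf 'GB\tAL5 3NG\tHarpenden\nGB\tB10 0AB\tBirmingham\nGB\tB10 0AD\tBirmingham\n' > sample.txt

# Count matching lines by piping grep into wc -l ...
piped_count=$(( $(grep "Birmingham" sample.txt | wc -l) ))

# ... or let grep count the matching lines itself with -c.
direct_count=$(grep -c "Birmingham" sample.txt)

echo "pipe: $piped_count, grep -c: $direct_count"
```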
Another option is to pipe the grep results to the less command:
grep "Birmingham" GB_full.txt | less
less will allow you to scroll through the results in your terminal. (When you’re finished, press q to exit.)
Now, you may be wondering why we’re doing this kind of data exploration with the Linux command line, when we could be doing the same thing with R (or Python). Well, the advantage here is that unlike most R commands, these Linux commands don’t require that you read the whole file into memory. If the dataset is huge, that’s important.
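These commands also chain together. As a rough sketch, piping grep into head shows just the first few matches, which is handy when even the filtered results are long. The file sample.txt below is a made-up stand-in for GB_full.txt, so the example runs on its own.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for GB_full.txt: four tab-separated rows.
printf 'GB\tAL5 3NG\tHarpenden\nGB\tB10 0AB\tBirmingham\nGB\tB10 0AD\tBirmingham\nGB\tB10 0AE\tBirmingham\n' > sample.txt

# Show only the first 2 matching lines.
preview=$(grep "Birmingham" sample.txt | head -n 2)
echo "$preview"
```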
split to break up the dataset
Often, really big datasets are made up of many different files, each of which is small enough to be read into memory. That’s convenient. But even if the dataset comes as one giant file, you can easily break it up into smaller ones using the split command.
There are a number of ways to use split. You can use the -l flag to split by the number of lines. For example, the code below would split GB_full.txt into many files, each of which has no more than 100,000 lines.
split -l 100000 GB_full.txt
Be careful with this option! If you’re not paying attention, the command could return thousands and thousands of files. For example, our database contains about 1.8 million lines. If you decided to split every 100 lines, you’d get back 18,000 files. Not ideal!
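One way to avoid surprises is to estimate the number of output files before you split: divide the line count by the lines per file, and round up. Here’s a minimal sketch on a made-up 10-line file. The names sample.txt and chunk_ are just for illustration.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for GB_full.txt: a file with 10 lines.
seq 1 10 > sample.txt

# Estimate the number of output files first: lines / chunk size, rounded up.
lines=$(( $(wc -l < sample.txt) ))
lines_per_file=4
expected_files=$(( (lines + lines_per_file - 1) / lines_per_file ))
echo "split would create $expected_files files"

# Split, using the prefix "chunk_" instead of the default "x".
split -l "$lines_per_file" sample.txt chunk_
ls chunk_*
```

The optional last argument to split sets the prefix of the output files, which makes them easier to find (and delete) afterwards.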
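Because of these pitfalls, I prefer to use the -n flag, which tells split how many files to create. The code below would split our dataset into 10 different files. (The l/ prefix, a GNU split feature, keeps every line intact. A plain -n 10 divides the file by bytes and can cut a line in half at a file boundary.)
split -n l/10 GB_full.txt
By default, the files would be called xaa, xab, xac, and so on up to xaj.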
Now that we have our split data, we can work with it iteratively. The idea would be to read one file into memory at a time, and do the manipulation and/or analysis you want. Then move on to the next one.
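Here’s a rough sketch of that one-file-at-a-time pattern as a shell loop. Counting lines stands in for whatever analysis you’d really do on each piece, and sample.txt and the chunk_ prefix are made-up names for illustration.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for GB_full.txt, split into pieces of 4 lines each.
seq 1 10 > sample.txt
split -l 4 sample.txt chunk_

# Process one chunk at a time, keeping a running total across chunks.
total=0
for f in chunk_*; do
    n=$(( $(wc -l < "$f") ))
    echo "$f: $n lines"
    total=$(( total + n ))
done
echo "total: $total lines"
```

In R or Python, the idea is the same: loop over the file names, read one file, do the work, and keep only the summary before moving on.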
If you’re on a terminal, you can find out more about a command using the manual command, man. For example, man split opens the manual page for split.
Usually what I look for in the man page are the options for the command. In Unix-speak, command line options are often called ‘flags’, and are denoted with a dash (as in the -n and -l flags used above).
A moving target
What constitutes ‘big data’ depends on the specs of your computer. For example, if you have 64GB of RAM, you can read the vast majority of datasets into your memory without a problem. But if you have 8GB of memory, you need to be more careful.
Even if I can read a database into memory, I often use the commands above to get a quick sense of the data before I begin to work with it.