Hi. I’m Sharon Machlis at IDG Communications, here with Episode 50 of Do More With R: 5 things you may not know about data.table’s fread() function.
fread() imports external data files into R. If it’s a data.table function, you know it’s fast. Very fast! But there’s more to fread() than speed. It has several helpful options — some you might not know about. Or, maybe you knew about them once but since stopped using them. Let’s take a look.
I’ll start with a file of daily Covid-19 data from every county in the U.S.: a 12-megabyte file from the New York Times on GitHub. If you’d like to follow along, download the file from the URL on screen.
1: Is your file large? Would you like to examine its structure before importing the whole thing – without having to open it in a text editor or Excel? Use fread’s nrows option.
In the line of code creating mydt10, I’m importing just the first 10 rows of a CSV.
If you just want to see column names without any data at all, you can use nrows = 0.
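The on-screen code is not part of this transcript, so here is a minimal sketch of what the nrows calls might look like. It assumes the data.table package is installed. The file is a made-up 12-row stand-in written to a temporary path, not the real county file, and the fips and deaths columns are assumed for illustration.

```r
library(data.table)

# Hypothetical stand-in for the county file: 12 made-up rows in a temp CSV
csv_path <- tempfile(fileext = ".csv")
fwrite(
  data.table(
    date   = as.character(as.Date("2020-03-01") + 0:11),
    county = "Alameda",
    state  = "California",
    fips   = 6001L,
    cases  = 1:12,
    deaths = 0L
  ),
  csv_path
)

# Import only the first 10 rows to inspect the structure
mydt10 <- fread(csv_path, nrows = 10)
str(mydt10)

# nrows = 0 reads no data rows, only the header
header_only <- fread(csv_path, nrows = 0)
names(header_only)
```

With the real file, you would pass its path in place of csv_path.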
2: Once you know the file structure, you can choose which columns to import. fread()’s select option lets you pick columns you want to keep. And, they’ll appear in your data.table in the same order as you named them. select takes a vector of either column names or column-position numbers. If names, they need to be in quotation marks, like most vectors of character strings, as you see in the first line of code.
As always, numbers don’t need quotation marks.
You can use an R object with a vector of column names inside fread, as you can see in this second group of code. I create a vector, my_cols, with date, county, state, and cases; then I use that vector inside fread().
The opposite of select is drop. You can choose to import all columns except the ones you specify with drop, as in this third group of code.
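The three groups of code mentioned here are not in the transcript. A rough sketch of select and drop follows, assuming data.table is installed. The sample file is a made-up stand-in, and the dropped columns, fips and deaths, are assumptions chosen for illustration.

```r
library(data.table)

# Hypothetical stand-in file with made-up rows
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-03-01,Los Angeles,California,6037,1,0",
  "2020-03-01,Harris,Texas,48201,2,0",
  "2020-03-02,Cook,Illinois,17031,4,0"
), csv_path)

# select by column name (quoted) ...
dt_by_name <- fread(csv_path, select = c("state", "county", "cases"))

# ... or by column position (no quotes needed)
dt_by_pos <- fread(csv_path, select = c(1, 3))

# A vector of column names stored in an R object works too
my_cols <- c("date", "county", "state", "cases")
dt_from_vec <- fread(csv_path, select = my_cols)

# drop is the opposite: import everything except these columns
dt_dropped <- fread(csv_path, drop = c("fips", "deaths"))
```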
3: If you’re familiar with Unix, you can run command-line tools right from inside fread()! For example, if I just wanted California data, I could use grep to only import lines that contain the text “California”. Note that this searches each entire row as a text string, not a specific column, so your data has to be in a format where that makes sense.
In this first line of code, you see the grep command to find the expression California in the US counties file. Unfortunately, grep treats the header as just another line and drops it because it doesn’t match, so you end up with default column names. But fread() lets us specify column names with the col.names option. I can put the names back using names from mydt10, the small data.table I created looking at just the first 10 rows of the file.
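A sketch of the grep approach might look like the following. It assumes a Unix-like system with grep on the path and uses fread()’s cmd argument. The sample file is a made-up stand-in, and header = FALSE is added here only to make the intent explicit.

```r
library(data.table)

# Hypothetical stand-in file with made-up rows
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-03-01,Los Angeles,California,6037,1,0",
  "2020-03-01,Harris,Texas,48201,2,0",
  "2020-03-02,Alameda,California,6001,3,0",
  "2020-03-02,Cook,Illinois,17031,4,0"
), csv_path)

# Small import used only to recover the column names
mydt10 <- fread(csv_path, nrows = 10)

# grep keeps only lines containing "California"; the header line is dropped,
# so the names are restored with col.names
ca_dt <- fread(
  cmd = paste("grep California", shQuote(csv_path)),
  header = FALSE,
  col.names = names(mydt10)
)
ca_dt
```

shQuote() guards against spaces or special characters in the file path.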
Next is an example of using regular expressions, with grep’s -E option, letting us do more complex searches, such as looking for four states at once.
Once again, a reminder, though: This is looking for each of those state names anywhere in the row, not just in the state column. If I check what states are included in my results you’ll see Oklahoma and Missouri in there also, not just the four I wanted. That’s because both Oklahoma and Missouri have counties named Texas. So, this is a way to filter out a lot of data you don’t want from a very large data set; but after this kind of import, you’ll still want to filter specifically by column afterwards to make sure you didn’t get anything unexpected.
That’s the final group of code here. I use the optimized-for-speed %chin% operator, which is like base R’s %in% but faster for character vectors, to select rows where the state column is any of those four states. Now if I check what states are included, it’s only the ones I want.
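Here is a sketch of the two-step approach: a broad grep -E import followed by an exact filter on the state column. The transcript does not name all four states, so the vector below (Texas plus three others) is an assumption, as are the made-up sample rows. It assumes a Unix-like system with grep available.

```r
library(data.table)

# Hypothetical stand-in file with made-up rows, including the two
# counties named Texas mentioned above
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-03-01,Los Angeles,California,6037,1,0",
  "2020-03-01,Harris,Texas,48201,2,0",
  "2020-03-01,Miami-Dade,Florida,12086,3,0",
  "2020-03-01,Kings,New York,36047,4,0",
  "2020-03-01,Texas,Oklahoma,40139,5,0",
  "2020-03-01,Texas,Missouri,29215,6,0",
  "2020-03-01,Cook,Illinois,17031,7,0"
), csv_path)

# Assumed set of four states, for illustration only
wanted_states <- c("California", "Texas", "Florida", "New York")

# Step 1: regular-expression search across whole rows with grep -E
pattern <- paste(wanted_states, collapse = "|")
rough_dt <- fread(
  cmd = paste("grep -E", shQuote(pattern), shQuote(csv_path)),
  header = FALSE,
  col.names = c("date", "county", "state", "fips", "cases", "deaths")
)
unique(rough_dt$state)  # also lists Oklahoma and Missouri

# Step 2: exact filter on the state column with %chin%
final_dt <- rough_dt[state %chin% wanted_states]
unique(final_dt$state)
```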
4: You can set column classes during import, and you can do it for just a few columns rather than every one. For example, the date column in this data is coming in as character strings, even though it’s in year-month-day format. We can set the column named “date” (lower-case d) to the data type Date (capital D) in the import using the colClasses option. Now dates are Dates.
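A sketch of the colClasses call, using a named vector so that only the date column is affected. The sample file is a made-up stand-in. One caveat: newer data.table releases may already detect year-month-day dates on their own, so the default class you see can depend on your installed version.

```r
library(data.table)

# Hypothetical stand-in file with made-up rows
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date,county,state,fips,cases,deaths",
  "2020-03-01,Los Angeles,California,6037,1,0",
  "2020-03-02,Los Angeles,California,6037,2,0"
), csv_path)

# Default import: the class of the date column depends on the data.table version
dt_default <- fread(csv_path)
class(dt_default$date)

# Name = column ("date"), value = class ("Date"); other columns are auto-detected
dt_dates <- fread(csv_path, colClasses = c(date = "Date"))
class(dt_dates$date)
```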
And, number 5: You can import a zipped file without unzipping it first. fread() can import .gz and .bz2 files directly. Here, I’m importing a local gzipped file.
If you need to import a .zip file, you can unzip it with the unzip system command right within fread(), like in the second line of code.
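A sketch of both compressed-file imports follows. It builds its own tiny .gz and .zip files from made-up data so that it is self-contained. The assumptions are that zip and unzip are available as system commands, and that the R.utils package is installed for the direct .gz read, since data.table relies on it for that. The unzip -p flag, which streams the archive contents to standard output, is one possible choice and may not be what the on-screen code used.

```r
library(data.table)

# Hypothetical stand-in data written to a temp CSV
sample_dt <- data.table(
  date   = c("2020-03-01", "2020-03-02", "2020-03-03"),
  county = "Alameda",
  state  = "California",
  cases  = 1:3
)
csv_path <- file.path(tempdir(), "counties_sample.csv")
fwrite(sample_dt, csv_path)

# Make a gzipped copy with base R
gz_path <- paste0(csv_path, ".gz")
con <- gzfile(gz_path, "w")
writeLines(readLines(csv_path), con)
close(con)

# Direct import of a .gz file (skipped here if R.utils is missing)
if (requireNamespace("R.utils", quietly = TRUE)) {
  dt_gz <- fread(gz_path)
}

# Make a .zip copy; -j drops directory paths, -q keeps zip quiet
zip_path <- file.path(tempdir(), "counties_sample.zip")
zip(zip_path, csv_path, flags = "-jq")

# Stream the zipped CSV through the unzip system command
dt_zip <- fread(cmd = paste("unzip -p", shQuote(zip_path)))
dt_zip
```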
That’s it for this episode, thanks for watching! For more R tips, head to the Do More With R page at bit.ly/domorewithR, all lowercase except for the R. You can also find the Do More With R playlist on the YouTube IDG Tech Talk channel, where you can subscribe so you never miss an episode. Hope to see you next time. Stay healthy and safe, everyone!