Like all functions in the data.table R package, fread is fast. Extremely fast. But there is more to fread than speed. It has several helpful features and options when importing external data into R. Here are five of the most useful.
Note: If you’d like to follow along, download the New York Times CSV file of daily Covid-19 cases by U.S. county at https://github.com/nytimes/covid-19-data/raw/master/us-counties.csv.
Use fread’s nrows option
Is your file large? Would you like to examine its structure before importing the whole thing, without having to open it in a text editor or Excel? Use fread’s nrows option to import only a portion of a file for exploration.
The code below imports just the first 10 rows of the CSV.
library(data.table)
mydt10 <- fread("us-counties.csv", nrows = 10)
If you just want to see column names without any data at all, you can use nrows = 0.
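If you don’t have the county file handy, here is a minimal sketch of both nrows variations. It assumes data.table is installed, and the tiny temp file it writes is a made-up stand-in for us-counties.csv, not the real data.

```r
library(data.table)

# Hypothetical three-row stand-in for us-counties.csv, written to a temp file
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(
  date   = c("2020-01-21", "2020-01-22", "2020-01-23"),
  county = c("Snohomish", "Snohomish", "Snohomish"),
  state  = c("Washington", "Washington", "Washington"),
  fips   = c(53061L, 53061L, 53061L),
  cases  = c(1L, 1L, 1L),
  deaths = c(0L, 0L, 0L)
), tmp)

preview     <- fread(tmp, nrows = 2)  # only the first two data rows
header_only <- fread(tmp, nrows = 0)  # column names, but no rows

names(header_only)
nrow(preview)
```

The nrows = 0 result is still a data.table, so names() works on it just as it would on the full import.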
Use fread’s select option
Once you know the file structure, you can choose which columns to import. fread’s select option lets you pick the columns you want to keep.
select takes a vector of either column names or column-position numbers. If names, they need to be in quotation marks, like most vectors of character strings:
mydt <- fread("us-counties.csv",
              select = c("date", "county", "state", "cases"))
As always, numbers don’t need quotation marks:
mydt <- fread("us-counties.csv", select = c(1,2,3,5))
You can use an R object holding a vector of column names within fread, as you can see in this next group of code. I create a vector my_cols with date, county, state, and cases; then I use that vector within fread.
my_cols <- c("date", "county", "state", "cases")
mydt <- fread("us-counties.csv", select = my_cols)
The opposite of select is drop. You can choose to import all columns except the ones you specify with drop, such as:
mydt <- fread("us-counties.csv", drop = c("fips", "deaths"))
drop takes a vector of column names or numerical positions.
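To see that select and drop are two routes to the same result, here is a small sketch under the same assumptions as before: data.table is installed, and the temp file is a hypothetical two-row stand-in for the county data.

```r
library(data.table)

# Hypothetical mini version of the county file, for illustration only
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(
  date   = c("2020-01-21", "2020-01-22"),
  county = c("Snohomish", "Snohomish"),
  state  = c("Washington", "Washington"),
  fips   = c(53061L, 53061L),
  cases  = c(1L, 1L),
  deaths = c(0L, 0L)
), tmp)

kept    <- fread(tmp, select = c("date", "county", "state", "cases"))  # by name
by_pos  <- fread(tmp, select = c(1, 2, 3, 5))                          # by position
dropped <- fread(tmp, drop = c("fips", "deaths"))                      # everything else

names(kept)
all.equal(kept, dropped)
```

Which one to use is mostly a question of which list is shorter to type: the columns you want or the columns you don’t.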
Use fread with grep
If you are familiar with Unix, you can execute command-line tools right from within fread. For example, if I just wanted California data, I could use grep to only import lines that contain the text “California.” Note that this searches each entire row as a text string, not a specific column, so your data has to be in a format where that makes sense.
ca <- fread(cmd = "grep California us-counties.csv")
Unfortunately, grep does not understand the original file’s column names, so you end up with default names.
head(ca)
           V1          V2         V3   V4 V5 V6
1: 2020-01-25      Orange California 6059  1  0
2: 2020-01-26 Los Angeles California 6037  1  0
3: 2020-01-26      Orange California 6059  1  0
4: 2020-01-27 Los Angeles California 6037  1  0
5: 2020-01-27      Orange California 6059  1  0
6: 2020-01-28 Los Angeles California 6037  1  0
However, fread lets us specify column names with the col.names option. I can set the names based on the names from mydt10 that I created above.
ca <- fread(cmd = "grep California us-counties.csv", col.names = names(mydt10))
> head(ca)
         date      county      state fips cases deaths
1: 2020-01-25      Orange California 6059     1      0
2: 2020-01-26 Los Angeles California 6037     1      0
3: 2020-01-26      Orange California 6059     1      0
4: 2020-01-27 Los Angeles California 6037     1      0
5: 2020-01-27      Orange California 6059     1      0
6: 2020-01-28 Los Angeles California 6037     1      0
We can also use regular expressions, with grep’s -E option, letting us do more complex searches, such as looking for four states at once.
states4 <- fread(cmd = "grep -E 'Texas|Arizona|Florida|South Carolina' us-counties.csv",
col.names = names(mydt10))
Once again, a reminder: This is looking for each of those state names anywhere in the row, not just in the state column. If you run the code above and check which states are included in the results with unique(states4$state), you’ll see Oklahoma and Missouri in the state column along with Texas, Arizona, Florida, and South Carolina. That’s because both Oklahoma and Missouri have counties named Texas.
So, grep during file import is a way to filter out a lot of data you don’t want from a very large data set, but it does not guarantee you only get what you want. After this kind of import, you should still filter specifically on column data to make sure you didn’t get anything unexpected.
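Here is one way that two-step approach might look. This is only a sketch: it assumes a Unix-like system with grep on the path, and the four rows in the temp file are invented to reproduce the “county named Texas” problem. I also set header = FALSE explicitly, since grep strips the header row.

```r
library(data.table)

# Invented rows: one Texas county, plus counties named Texas in other states
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(
  date   = c("2020-07-01", "2020-07-01", "2020-07-01", "2020-07-01"),
  county = c("Harris", "Texas", "Texas", "Maricopa"),
  state  = c("Texas", "Oklahoma", "Missouri", "Arizona"),
  cases  = c(10L, 2L, 3L, 7L)
), tmp)

# Grab the real column names first, since grep drops the header line
col_names <- names(fread(tmp, nrows = 0))

# Step 1: coarse filter at import time; matches "Texas" anywhere in the row
raw <- fread(cmd = paste("grep Texas", shQuote(tmp)),
             header = FALSE, col.names = col_names)

# Step 2: exact filter on the state column after import
tx <- raw[state == "Texas"]

unique(raw$state)
nrow(tx)
```

The grep step keeps the import small; the column filter is what actually guarantees the rows are from Texas.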
Use fread’s colClasses option
You can set column classes during import, and you can do it for just a few columns, not every one. For example, the date column in this data is coming in as character strings, even though it’s in year-month-day format. We can set the column named date to the data type Date during import using the colClasses option:
mydt <- fread("us-counties.csv", colClasses = c("date" = "Date"))
Now, dates are Dates.
> str(mydt)
Classes ‘data.table’ and 'data.frame':	322651 obs. of 6 variables:
 $ date  : Date, format: "2020-01-21" "2020-01-22" "2020-01-23" ...
 $ county: chr "Snohomish" "Snohomish" "Snohomish" "Cook" ...
 $ state : chr "Washington" "Washington" "Washington" "Illinois" ...
 $ fips  : int 53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
 $ cases : int 1 1 1 1 1 1 1 1 1 1 ...
 $ deaths: int 0 0 0 0 0 0 0 0 0 0 ...
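A quick way to confirm the class change without the full file is sketched below, again on a made-up two-row temp file and assuming data.table is installed. What the date column looks like without colClasses may depend on your data.table version, so the sketch just prints that class instead of assuming it.

```r
library(data.table)

# Hypothetical two-row file with a year-month-day date column
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(
  date  = c("2020-01-21", "2020-01-22"),
  cases = c(1L, 1L)
), tmp)

default_dt <- fread(tmp)                                   # let fread guess
typed_dt   <- fread(tmp, colClasses = c("date" = "Date"))  # ask for Date

class(default_dt$date)  # version-dependent; shown for comparison only
class(typed_dt$date)
class(typed_dt$cases)   # columns not named in colClasses are still guessed
```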
Use fread on zipped files
You can import a zipped file without unzipping it first. fread can import gz and bz2 files directly (this requires the R.utils package to be installed), such as mydt <- fread("myfile.gz"). If you need to import a zip file, you can unzip it with the unzip system command in fread, using the syntax mydt <- fread(cmd = 'unzip -cq myfile.zip').
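To try the gz route without hunting for a compressed file, the sketch below writes a small one with base R’s gzfile() and reads it back. The file contents are invented, and the read is wrapped in a requireNamespace() check because fread needs R.utils to decompress.

```r
library(data.table)

# Write a small, made-up gzip-compressed CSV using base R only
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
writeLines(c("date,cases", "2020-01-21,1", "2020-01-22,1"), con)
close(con)

if (requireNamespace("R.utils", quietly = TRUE)) {
  mydt <- fread(gz_path)  # decompressed on the fly, no manual unzip step
  print(mydt)
} else {
  message("Install the R.utils package to let fread read .gz and .bz2 files.")
}
```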
For more R tips, head to InfoWorld’s Do More With R page.
Copyright © 2020 IDG Communications, Inc.