Chapter 2 Data wrangling

Data wrangling refers to the task of processing raw data into useful formats. This Chapter introduces basic data wrangling operations in R using data.tables.

2.1 Data.tables

2.1.1 Overview

data.table objects are a modern implementation of tables containing variables stored in columns and observations stored in rows. Base R provides a similar structure called data.frame. However, we will exclusively use data.table in this course because data.frame objects are a lot slower and often a little more complicated to use.

A data.table is a memory efficient and faster implementation of data.frame. It is more efficient because it operates on its columns by reference. In contrast modifying a data.frame leads R to copy the entire data.frame.

Like a data.frame, each column can have a different type. Unlike a data.frame, it doesn’t have row names. It accepts all data.frame functions to ensure compatibility, but it has a shorter and more flexible syntax. This may be not so straightforward in the beginning but pays off and saves time on two fronts:

  • programming (easier to code, read, debug and maintain)
  • computing (fast and memory efficient)

The general basic form of the data.table syntax is:

DT[ i,  j,  by ] #
    |   |   |
    |   |    -------> grouped by what?
    |    -------> what to do with the columns?
     ---> on which rows?
     

The way to read this out loud is: “Take DT, subset rows by i, then compute j grouped by by”.

We will now describe some basic usage examples expanding on this definition. First of all, let us create and inspect some data.tables to get a first impression.

2.1.2 Creating and loading tables

To create a data.table, we just name its columns and populate them. All the columns have to have the same length. If vectors of different lengths are provided upon creation of a data.table, R automatically recycles the values of the shorter vectors. Here is an example:

# install.packages("data.table")
library(data.table)
DT <- data.table(x = rep(c("a","b","c"), each = 3), y = c(1, 3, 6), v = 1:9)
DT # note how column y was recycled
##    x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
## 4: b 1 4
## 5: b 3 5
## 6: b 6 6
## 7: c 1 7
## 8: c 3 8
## 9: c 6 9

If we want to convert any other R object to a data.table, all we have to do is to call the as.data.table() function. This is typically done for data.frame objects.

# This way we can for example convert any built-in dataset
# coming as a data.frame into a data.table:
titanic_dt <- as.data.table(Titanic)
class(titanic_dt)
## [1] "data.table" "data.frame"

Here you can see that the class function informs us that titanic_dt is both a data.table and a data.frame as data.table inherits from data.frame.

Alternatively, we can read files from disk and process them using data.table. The easiest way to do so is to use the function fread(). Here is an example using a subset of the Kaggle flight and airports dataset that is limited to flights going in or to the Los Angeles airport. We refer to the description of the Kaggle flights and airports challenge for more details [https://www.kaggle.com/tylerx/flights-and-airports-data].

We provide the file flightsLAX.csv as part of our datasets (See Datasets). To run the following code, save the comma-separated value file flightsLAX.csv into a local folder of your choice and replace the string "path_to_file" with the actual path to your flightsLAX.csv file. For example "path_to_file" could be substituted with "/Users/samantha/mydataviz_folder/extdata". See Appendix I “Importing data” for more details.

flights <- fread('path_to_file/flightsLAX.csv')

Typing the name of the newly created data.table (flights) in the console displays its first and last rows. We observe that reading the file was successful.

flights
##         YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER
##      1: 2015     1   1           4      AA          2336      N3KUAA
##      2: 2015     1   1           4      AA           258      N3HYAA
##      3: 2015     1   1           4      US          2013      N584UW
##      4: 2015     1   1           4      DL          1434      N547US
##      5: 2015     1   1           4      AA           115      N3CTAA
##     ---                                                             
## 389365: 2015    12  31           4      AA          1538      N866AA
## 389366: 2015    12  31           4      AS           175      N431AS
## 389367: 2015    12  31           4      AS           471      N570AS
## 389368: 2015    12  31           4      AA           219      N3LYAA
## 389369: 2015    12  31           4      B6           688      N657JB
##         ORIGIN_AIRPORT DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE
##      1:            LAX                 PBI              2      263     2330
##      2:            LAX                 MIA             15      258     2342
##      3:            LAX                 CLT             44      228     2125
##      4:            LAX                 MSP             35      188     1535
##      5:            LAX                 MIA            103      255     2342
##     ---                                                                    
## 389365:            LAX                 MIA           2357      250     2342
## 389366:            LAX                 ANC           2350      291     2345
## 389367:            LAX                 SEA           2353      132      954
## 389368:            LAX                 ORD           2358      198     1744
## 389369:            LAX                 BOS           2355      272     2611
##         ARRIVAL_TIME
##      1:          741
##      2:          756
##      3:          753
##      4:          605
##      5:          839
##     ---             
## 389365:          731
## 389366:          400
## 389367:          225
## 389368:          544
## 389369:          753

2.1.3 Inspecting tables

A first step in any analysis should involve inspecting the data we just read in. This often starts by looking at the first and last rows of the table as we did above. The next information we are often interested in is the size of our data set. We can use the following commands to obtain it:

ncol(flights)   # nrow(flights) for number of rows
## [1] 13
dim(flights)    # returns nrow and ncol
## [1] 389369     13

Next, we are often interested in basic statistics on the columns. To obtain this information we can call the summary() function on the table.

summary(flights[,1:6])
##       YEAR          MONTH             DAY        DAY_OF_WEEK   
##  Min.   :2015   Min.   : 1.000   Min.   : 1.0   Min.   :1.000  
##  1st Qu.:2015   1st Qu.: 3.000   1st Qu.: 8.0   1st Qu.:2.000  
##  Median :2015   Median : 6.000   Median :16.0   Median :4.000  
##  Mean   :2015   Mean   : 6.198   Mean   :15.7   Mean   :3.934  
##  3rd Qu.:2015   3rd Qu.: 9.000   3rd Qu.:23.0   3rd Qu.:6.000  
##  Max.   :2015   Max.   :12.000   Max.   :31.0   Max.   :7.000  
##    AIRLINE          FLIGHT_NUMBER 
##  Length:389369      Min.   :   1  
##  Class :character   1st Qu.: 501  
##  Mode  :character   Median :1296  
##                     Mean   :1905  
##                     3rd Qu.:2617  
##                     Max.   :6896

This provides us already a lot of information about our data. We can for example see that all data is from 2015 as all values in the YEAR column are 2015. But for categorical data this is not very insightful, as we can see for the AIRLINE column.

To investigate categorical columns we can have a look at their unique elements using:

flights[, unique(AIRLINE)]
##  [1] "AA" "US" "DL" "UA" "OO" "AS" "B6" "NK" "VX" "WN" "HA" "F9" "MQ"

This command provided us the airline identifiers present in the dataset. Another valuable information for categorical variables is how often each category occurs. This can be obtained using the following commands:

flights[, table(AIRLINE)]
## AIRLINE
##    AA    AS    B6    DL    F9    HA    MQ    NK    OO    UA    US    VX    WN 
## 65483 16144  8216 50343  2770  3112   368  8688 73389 54862  7374 23598 75022

2.2 Row subsetting

As mentioned, the general basic form of the data.table syntax is:

DT[ i,  j,  by ] #
    |   |   |
    |   |    -------> grouped by what?
    |    -------> what to do with the columns?
    ---> on which rows?

Let us first look at the i argument, i.e. row indexing. The parameter i can be any vector of integers, corresponding to the row indices to select, or some logical vectors indicating which rows to select. Here are some typical examples.

2.2.1 Subsetting rows by indices

If we want to see the second element of the table, we can do the following:

flights[2, ]   # Access the 2nd row (also flights[2] or flights[i = 2])
##    YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1: 2015     1   1           4      AA           258      N3HYAA            LAX
##    DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME
## 1:                 MIA             15      258     2342          756

A shorter writing allows leaving out the comma:

flights[2]   # Access the 2nd row (also flights[2] or flights[i = 2])
##    YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1: 2015     1   1           4      AA           258      N3HYAA            LAX
##    DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME
## 1:                 MIA             15      258     2342          756

For accessing multiple consecutive rows we can use the start:stop syntax as for example:

flights[1:3]   
##    YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1: 2015     1   1           4      AA          2336      N3KUAA            LAX
## 2: 2015     1   1           4      AA           258      N3HYAA            LAX
## 3: 2015     1   1           4      US          2013      N584UW            LAX
##    DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME
## 1:                 PBI              2      263     2330          741
## 2:                 MIA             15      258     2342          756
## 3:                 CLT             44      228     2125          753

Accessing multiple rows that are not necessarily consecutive can be done by creating an index vector with c():

flights[c(3, 5)]
##    YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1: 2015     1   1           4      US          2013      N584UW            LAX
## 2: 2015     1   1           4      AA           115      N3CTAA            LAX
##    DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME
## 1:                 CLT             44      228     2125          753
## 2:                 MIA            103      255     2342          839

2.2.2 Subsetting rows by logical conditions

Often, a more useful way to subset rows is using logical conditions, using for i a logical vector. We can create such logical vectors using the following binary operators:

  • ==
  • <
  • >
  • !=
  • %in%

For example, entries of flights operated by “AA” (American Airlines) can be extracted using:

flights[AIRLINE == "AA"]
##        YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER
##     1: 2015     1   1           4      AA          2336      N3KUAA
##     2: 2015     1   1           4      AA           258      N3HYAA
##     3: 2015     1   1           4      AA           115      N3CTAA
##     4: 2015     1   1           4      AA          2410      N3BAAA
##     5: 2015     1   1           4      AA          1515      N3HMAA
##    ---                                                             
## 65479: 2015    12  31           4      AA          1116      N029AA
## 65480: 2015    12  31           4      AA           246      N863AA
## 65481: 2015    12  31           4      AA          1927      N837AW
## 65482: 2015    12  31           4      AA          1538      N866AA
## 65483: 2015    12  31           4      AA           219      N3LYAA
##        ORIGIN_AIRPORT DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE
##     1:            LAX                 PBI              2      263     2330
##     2:            LAX                 MIA             15      258     2342
##     3:            LAX                 MIA            103      255     2342
##     4:            LAX                 DFW            600      150     1235
##     5:            LAX                 ORD            557      202     1744
##    ---                                                                    
## 65479:            LAX                 ATL           2256      207     1947
## 65480:            KOA                 LAX           2335      272     2504
## 65481:            LAX                 IAD           2346      239     2288
## 65482:            LAX                 MIA           2357      250     2342
## 65483:            LAX                 ORD           2358      198     1744
##        ARRIVAL_TIME
##     1:          741
##     2:          756
##     3:          839
##     4:         1052
##     5:         1139
##    ---             
## 65479:          553
## 65480:          635
## 65481:          712
## 65482:          731
## 65483:          544

Alternatively, if we are interested in all flights from any destination to the airports in NYC (“JFK” and “LGA”), we can subset the rows using the following command:

flights[DESTINATION_AIRPORT %in% c("LGA", "JFK")]
##        YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER
##     1: 2015     1   1           4      B6            24      N923JB
##     2: 2015     1   1           4      DL           476      N196DN
##     3: 2015     1   1           4      AA           118      N788AA
##     4: 2015     1   1           4      VX           404      N621VA
##     5: 2015     1   1           4      UA           275      N598UA
##    ---                                                             
## 12011: 2015    12  31           4      AA           180      N796AA
## 12012: 2015    12  31           4      B6           524      N934JB
## 12013: 2015    12  31           4      B6           624      N942JB
## 12014: 2015    12  31           4      DL          1262      N394DL
## 12015: 2015    12  31           4      B6          1124      N943JB
##        ORIGIN_AIRPORT DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE
##     1:            LAX                 JFK            620      279     2475
##     2:            LAX                 JFK            650      274     2475
##     3:            LAX                 JFK            650      268     2475
##     4:            LAX                 JFK            728      268     2475
##     5:            LAX                 JFK            806      277     2475
##    ---                                                                    
## 12011:            LAX                 JFK           1640      259     2475
## 12012:            LAX                 JFK           1645      261     2475
## 12013:            LAX                 JFK           2107      280     2475
## 12014:            LAX                 JFK           2244      256     2475
## 12015:            LAX                 JFK           2349      274     2475
##        ARRIVAL_TIME
##     1:         1413
##     2:         1458
##     3:         1436
##     4:         1512
##     5:         1606
##    ---             
## 12011:           18
## 12012:           18
## 12013:          513
## 12014:          625
## 12015:          748

Additionally, we can concatenate multiple conditions using the logical OR | or the logical AND & operator. For instance, if we want to inspect all flights departing between 6am and 7am operated by American Airlines we can use the following statement:

flights[AIRLINE == "AA" & DEPARTURE_TIME > 600 & DEPARTURE_TIME < 700]
##       YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER
##    1: 2015     1   1           4      AA          1686      N4XXAA
##    2: 2015     1   1           4      AA          1361      N3KYAA
##    3: 2015     1   1           4      AA          2420      N3ERAA
##    4: 2015     1   1           4      AA           338      N3DJAA
##    5: 2015     1   1           4      AA          2424      N3DLAA
##   ---                                                             
## 3820: 2015    12  31           4      AA           169      N787AA
## 3821: 2015    12  31           4      AA          1352      N7CAAA
## 3822: 2015    12  31           4      AA           146      N3MKAA
## 3823: 2015    12  31           4      AA          2453      N869AA
## 3824: 2015    12  31           4      AA           118      N791AA
##       ORIGIN_AIRPORT DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE
##    1:            LAX                 STL            609      183     1592
##    2:            BNA                 LAX            607      255     1797
##    3:            LAX                 DFW            619      149     1235
##    4:            LAX                 SFO            644       55      337
##    5:            LAX                 DFW            641      146     1235
##   ---                                                                    
## 3820:            SFO                 LAX            623       54      337
## 3821:            MIA                 LAX            651      303     2342
## 3822:            LAX                 BOS            650      268     2611
## 3823:            LAX                 DFW            651      142     1235
## 3824:            LAX                 JFK            659      272     2475
##       ARRIVAL_TIME
##    1:         1134
##    2:          847
##    3:         1119
##    4:          803
##    5:         1149
##   ---             
## 3820:          740
## 3821:          913
## 3822:         1446
## 3823:         1134
## 3824:         1505

2.3 Column operations

You may have wondered why R correctly runs code such as flights[AIRLINE == "AA"] although AIRLINE is not a variable of the environment but a column of the data.table flights. Such a call would not execute properly with a data.frame. The reason is that code entered inside the [] brackets of a data.table is interpreted using the data.table environment. Inside this environment, columns are seen as variables already. This makes the syntax very light and readable for row subsetting as we just saw. It becomes particularly powerful for column operations. We now look at this.

Although feasible, it is not advisable to access a column by its number since the ordering or number of columns can easily change. Also, if you have a data set with a large number of columns (e.g. 50), how do you know which one is column 18? Therefore, use the column name instead. Using column names prevents bugs.

2.3.1 Working with columns

Accessing columns by name also makes the code more readable: flights[, TAIL_NUMBER] instead of flights[, 7].

flights[1:10, TAIL_NUMBER]    # Access column x (also DT$x or DT[j=x]). 
##  [1] "N3KUAA" "N3HYAA" "N584UW" "N547US" "N3CTAA" "N76517" "N925SW" "N719SK"
##  [9] "N435SW" "N560SW"

For accessing a specific cell (i.e. specific column and specific row), we can use the following syntax:

flights[4, TAIL_NUMBER]   # Access a specific cell.
## [1] "N547US"

This command for accessing multiple columns would return a vector:

flights[1:2, c(TAIL_NUMBER, ORIGIN_AIRPORT)]  
## [1] "N3KUAA" "N3HYAA" "LAX"    "LAX"

However, when accessing many columns, we probably want to return a data.table instead of a vector. For that, we need to provide R with a list, so we use list(colA, colB) or its simplified version .(colA, colB):

flights[1:2, list(TAIL_NUMBER, ORIGIN_AIRPORT)]
##    TAIL_NUMBER ORIGIN_AIRPORT
## 1:      N3KUAA            LAX
## 2:      N3HYAA            LAX
# Same as before.
flights[1:2, .(TAIL_NUMBER, ORIGIN_AIRPORT)]
##    TAIL_NUMBER ORIGIN_AIRPORT
## 1:      N3KUAA            LAX
## 2:      N3HYAA            LAX

2.3.2 Column operations

We already saw that inside the [] environment, columns are seen as variables, so we can apply functions to them.

# Similar to mean(flights[, AIR_TIME])
flights[, mean(AIR_TIME, na.rm=TRUE)]   
## [1] 162.1379
flights[AIRLINE == "OO", mean(AIR_TIME, na.rm=TRUE)]
## [1] 68.02261

To compute operations in multiple columns, we must provide a list (unless we want the result to be a vector).

# Same as flights[, .(mean(AIR_TIME), median(AIR_TIME))]
flights[, list(mean(AIR_TIME, na.rm=TRUE), median(AIR_TIME, na.rm=TRUE))]
##          V1  V2
## 1: 162.1379 150

To give meaningful names to the computations from before, we can use the following command:

flights[, .(mean_AIR_TIME = mean(AIR_TIME, na.rm=TRUE), median_AIR_TIME = median(AIR_TIME, na.rm=TRUE))] 
##    mean_AIR_TIME median_AIR_TIME
## 1:      162.1379             150

Any operation can be applied to the columns, just as with variables. This code computes the average speed as the ratio of AIR_TIME over DISTANCE for the 5 first entries of the table flights:

flights[1:5,AIR_TIME/DISTANCE]
## [1] 0.1128755 0.1101623 0.1072941 0.1224756 0.1088813

2.3.3 Advanced commands: *apply() over columns

The columns of a data.table are exposed as a list to the environment. Therefore functions applying to a list can be applied to them, including those of the *apply family such as sapply(). For example:

sapply(flights, class)   # Try the same with lapply
##                YEAR               MONTH                 DAY         DAY_OF_WEEK 
##           "integer"           "integer"           "integer"           "integer" 
##             AIRLINE       FLIGHT_NUMBER         TAIL_NUMBER      ORIGIN_AIRPORT 
##         "character"           "integer"         "character"         "character" 
## DESTINATION_AIRPORT      DEPARTURE_TIME            AIR_TIME            DISTANCE 
##         "character"           "integer"           "integer"           "integer" 
##        ARRIVAL_TIME 
##           "integer"

2.4 The ‘by’ option

The by option allows executing the j command by groups. For example, we can useby = to compute the mean flight time per airline:

flights[, .(mean_AIRTIME = mean(AIR_TIME, na.rm=TRUE)), by = AIRLINE]
##     AIRLINE mean_AIRTIME
##  1:      AA    219.48133
##  2:      US    210.39488
##  3:      DL    207.07201
##  4:      UA    211.62008
##  5:      OO     68.02261
##  6:      AS    141.01870
##  7:      B6    309.79568
##  8:      NK    179.55828
##  9:      VX    185.36374
## 10:      WN    105.19976
## 11:      HA    307.95961
## 12:      F9    159.94041
## 13:      MQ    102.15210

This way we can easily spot that one airline conducts on average shorter flights. Moreover, we can compute the mean and standard deviation of the air time of every airline:

flights[, .(mean_AIRTIME = mean(AIR_TIME, na.rm=TRUE), sd_AIR_TIME = sd(AIR_TIME, na.rm=TRUE)), by = AIRLINE]
##     AIRLINE mean_AIRTIME sd_AIR_TIME
##  1:      AA    219.48133   92.889719
##  2:      US    210.39488  105.224833
##  3:      DL    207.07201   88.908566
##  4:      UA    211.62008   94.832456
##  5:      OO     68.02261   41.065036
##  6:      AS    141.01870   51.806424
##  7:      B6    309.79568   28.457740
##  8:      NK    179.55828   78.194706
##  9:      VX    185.36374  113.504572
## 10:      WN    105.19976   69.257334
## 11:      HA    307.95961   23.905491
## 12:      F9    159.94041   61.412379
## 13:      MQ    102.15210    8.531046

Although we could write flights[i = 5, j = AIRLINE], we usually omit the i = and j = from the syntax, and write flights[5, AIRLINE] instead. However, for clarity we usually include the by = in the syntax.

2.5 Counting occurences with .N

The .N is a special in-built variable that counts the number observations within a table. Evaluating .N alone is equal to nrow() of a table.

flights[, .N]
## [1] 389369
nrow(flights)
## [1] 389369

But the .N command becomes a lot more powerful when used with grouping or conditioning. We already saw earlier how we can use it to count the number of occurrences of elements in categorical columns. For instance, we can get the number of flights for each airline:

flights[, .N, by = 'AIRLINE']
##     AIRLINE     N
##  1:      AA 65483
##  2:      US  7374
##  3:      DL 50343
##  4:      UA 54862
##  5:      OO 73389
##  6:      AS 16144
##  7:      B6  8216
##  8:      NK  8688
##  9:      VX 23598
## 10:      WN 75022
## 11:      HA  3112
## 12:      F9  2770
## 13:      MQ   368

Remembering the data.table definition: “Take DT, subset rows using i, then select or calculate j, grouped by by”, we can build even more powerful statements using all three elements. For example, we can, for each airline, get the number of flights arriving to the airport JFK:

flights[DESTINATION_AIRPORT == "JFK", .N, by = 'AIRLINE']
##    AIRLINE    N
## 1:      B6 2488
## 2:      DL 2546
## 3:      AA 3804
## 4:      VX 1652
## 5:      UA 1525

2.6 Extending tables

2.6.1 Creating new columns (the := command)

The := operator updates the data.table we are working with in place, so writing DT <- DT[,... := ...] is redundant. This operator, plus all set functions (e.g. setnames for column name renaming, setcolorder for changing the positions in which the columns are positioned inside the DT, etc), change their input by reference. No copy of the object is made, which makes the operation faster and less memory-consuming.

As an example, we can add a new column called SPEED (in miles per hour) whose value is the DISTANCE divided by AIR_TIME times 60:

flights[, SPEED := DISTANCE / AIR_TIME * 60]
head(flights)
##    YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1: 2015     1   1           4      AA          2336      N3KUAA            LAX
## 2: 2015     1   1           4      AA           258      N3HYAA            LAX
## 3: 2015     1   1           4      US          2013      N584UW            LAX
## 4: 2015     1   1           4      DL          1434      N547US            LAX
## 5: 2015     1   1           4      AA           115      N3CTAA            LAX
## 6: 2015     1   1           4      UA          1545      N76517            LAX
##    DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME    SPEED
## 1:                 PBI              2      263     2330          741 531.5589
## 2:                 MIA             15      258     2342          756 544.6512
## 3:                 CLT             44      228     2125          753 559.2105
## 4:                 MSP             35      188     1535          605 489.8936
## 5:                 MIA            103      255     2342          839 551.0588
## 6:                 IAH            112      156     1379          607 530.3846

Having computed a new column using the := operator, we can use it for further analyses. For instance, we can compute the average speed, air time and distance for each airline:

flights[, .(mean_AIR_TIME = mean(AIR_TIME, na.rm=TRUE), 
            mean_SPEED = mean(SPEED, na.rm=TRUE),
            mean_DISTANCE = mean(DISTANCE, na.rm=TRUE)
            ), by=AIRLINE] 
##     AIRLINE mean_AIR_TIME mean_SPEED mean_DISTANCE
##  1:      AA     219.48133   461.2839     1739.2331
##  2:      US     210.39488   452.1641     1658.2581
##  3:      DL     207.07201   466.0330     1656.2165
##  4:      UA     211.62008   464.2928     1693.5504
##  5:      OO      68.02261   349.5549      437.2337
##  6:      AS     141.01870   439.0120     1040.0340
##  7:      B6     309.79568   484.8242     2486.1489
##  8:      NK     179.55828   450.0221     1402.1591
##  9:      VX     185.36374   433.0870     1432.5384
## 10:      WN     105.19976   409.3803      760.2593
## 11:      HA     307.95961   497.3118     2537.8107
## 12:      F9     159.94041   461.0684     1235.6664
## 13:      MQ     102.15210   435.5580      737.0000

Now we can see that the flights by the carrier “OO” are not just shorter, but also slow. This could for example lead us to the hypothesis, that “OO” is a small regional carrier, which operates slower planes.

Additionally we can use the := operator to remove columns. If we for example observe that tail numbers are not important for our analysis we can remove them with the following statement:

flights[, TAIL_NUMBER := NULL]
head(flights)
##    YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## 1: 2015     1   1           4      AA          2336            LAX
## 2: 2015     1   1           4      AA           258            LAX
## 3: 2015     1   1           4      US          2013            LAX
## 4: 2015     1   1           4      DL          1434            LAX
## 5: 2015     1   1           4      AA           115            LAX
## 6: 2015     1   1           4      UA          1545            LAX
##    DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME    SPEED
## 1:                 PBI              2      263     2330          741 531.5589
## 2:                 MIA             15      258     2342          756 544.6512
## 3:                 CLT             44      228     2125          753 559.2105
## 4:                 MSP             35      188     1535          605 489.8936
## 5:                 MIA            103      255     2342          839 551.0588
## 6:                 IAH            112      156     1379          607 530.3846

Here we observe, that the tail numbers are gone from the data.table.

2.6.2 Advanced: Multiple assignments

With the following syntax we can assign multiple new columns at once. We load the base R dataset iris 2, which we first transform into a data.table.

# load the Iris data table
iris_dt <- as.data.table(iris)
# Add columns with sepal and petal area. Note the syntax of multiple assignment.
iris_dt[, `:=` (Sepal.Area = Sepal.Length * Sepal.Width,
               Petal.Area = Petal.Length * Petal.Width)][1:3]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1:          5.1         3.5          1.4         0.2  setosa      17.85
## 2:          4.9         3.0          1.4         0.2  setosa      14.70
## 3:          4.7         3.2          1.3         0.2  setosa      15.04
##    Petal.Area
## 1:       0.28
## 2:       0.28
## 3:       0.26

You can also delete columns by using the := command.

# Let's assume setosa flowers are orange, versicolor purple and virginica pink.
# Add a column with these colors.
iris_dt[Species == "setosa", color := "orange"]
iris_dt[Species == "versicolor", color := "purple"]
iris_dt[Species == "virginica", color := "pink"]
unique(iris_dt[, .(Species, color)])
##       Species  color
## 1:     setosa orange
## 2: versicolor purple
## 3:  virginica   pink
# We can delete this new column by setting it to NULL
iris_dt[, color := NULL]
colnames(iris_dt)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
## [6] "Sepal.Area"   "Petal.Area"

2.6.3 Copying tables

What do we mean when we say that data.table modifies columns by reference?

It means that no new copy of the object is made in the memory, unless we actually create one using copy().

or_dt <- data.table(a = 1:10, b = 11:20)
# No new object is created, both new_dt and or_dt point to the same memory chunk.
new_dt <- or_dt   
new_dt[, ab := a*b]
colnames(or_dt)   # or_dt was also affected by changes in new_dt
## [1] "a"  "b"  "ab"
or_dt <- data.table(a = 1:10, b = 11:20)
copy_dt <- copy(or_dt)   # By creating a copy, we have 2 objects in memory
copy_dt[, ab := a*b]
colnames(or_dt)    # Changes in the copy don't affect the original
## [1] "a" "b"

2.7 Summary

By now, you should be able to answer the following questions:

  • How to subset by rows or columns? Remember: DT[i, j, by].
  • How to add columns?
  • How to make operations with different columns?