Chapter 2 Data wrangling
Data wrangling refers to the task of processing raw data into useful formats.
This Chapter introduces basic data wrangling operations in R using data.tables
.
2.1 Data.tables
2.1.1 Overview
data.table
objects are a modern implementation of tables containing variables stored in columns and observations stored in rows. Base R provides a similar structure called data.frame
. However, we will exclusively use data.table
in this course because data.frame
objects are a lot slower and often a little more complicated to use.
A data.table
is a memory efficient and faster implementation of data.frame
. It is more efficient because it operates on its columns by reference. In contrast modifying a data.frame
leads R to copy the entire data.frame
.
Like a data.frame
, each column can have a different type. Unlike a data.frame
, it doesn’t have row names. It accepts all data.frame
functions to ensure compatibility, but it has a shorter and more flexible syntax. This may be not so straightforward in the beginning but pays off and saves time on two fronts:
- programming (easier to code, read, debug and maintain)
- computing (fast and memory efficient)
The general basic form of the data.table syntax is:
DT[ i, j, by ] #
| | |
| | -------> grouped by what?
| -------> what to do with the columns?
---> on which rows?
The way to read this out loud is: “Take DT
, subset rows by i
, then compute j
grouped by by
”.
We will now describe some basic usage examples expanding on this definition. First of all, let us create and inspect some data.tables
to get a first impression.
2.1.2 Creating and loading tables
To create a data.table
, we just name its columns and populate them. All the columns have to have the same length. If vectors of different lengths are provided upon creation of a data.table
, R automatically recycles the values of the shorter vectors. Here is an example:
# install.packages("data.table")
library(data.table)
DT <- data.table(x = rep(c("a","b","c"), each = 3), y = c(1, 3, 6), v = 1:9)
DT # note how column y was recycled
## x y v
## 1: a 1 1
## 2: a 3 2
## 3: a 6 3
## 4: b 1 4
## 5: b 3 5
## 6: b 6 6
## 7: c 1 7
## 8: c 3 8
## 9: c 6 9
If we want to convert any other R object to a data.table
, all we have to do is to call the as.data.table()
function. This is typically done for data.frame
objects.
# This way we can for example convert any built-in dataset
# coming as a data.frame into a data.table:
titanic_dt <- as.data.table(Titanic)
class(titanic_dt)
## [1] "data.table" "data.frame"
Here you can see that the class function informs us that titanic_dt
is both a data.table
and a data.frame
as data.table
inherits
from data.frame
.
Alternatively, we can read files from disk and process them using data.table
. The easiest way to do so is to use the function fread()
. Here is an example using a subset of the Kaggle flight and airports dataset that is limited to flights going in or to the Los Angeles airport. We refer to the description of the Kaggle flights and airports challenge for more details [https://www.kaggle.com/tylerx/flights-and-airports-data].
We provide the file flightsLAX.csv
as part of our datasets (See Datasets). To run the following code, save the comma-separated value file flightsLAX.csv
into a local folder of your choice and replace the string "path_to_file"
with the actual path to your flightsLAX.csv
file. For example "path_to_file"
could be substituted with "/Users/samantha/mydataviz_folder/extdata"
. See Appendix I “Importing data” for more details.
Typing the name of the newly created data.table
(flights
) in the console displays its first and last rows. We observe that reading the file was successful.
## YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER
## 1: 2015 1 1 4 AA 2336 N3KUAA
## 2: 2015 1 1 4 AA 258 N3HYAA
## 3: 2015 1 1 4 US 2013 N584UW
## 4: 2015 1 1 4 DL 1434 N547US
## 5: 2015 1 1 4 AA 115 N3CTAA
## ---
## 389365: 2015 12 31 4 AA 1538 N866AA
## 389366: 2015 12 31 4 AS 175 N431AS
## 389367: 2015 12 31 4 AS 471 N570AS
## 389368: 2015 12 31 4 AA 219 N3LYAA
## 389369: 2015 12 31 4 B6 688 N657JB
## ORIGIN_AIRPORT DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE
## 1: LAX PBI 2 263 2330
## 2: LAX MIA 15 258 2342
## 3: LAX CLT 44 228 2125
## 4: LAX MSP 35 188 1535
## 5: LAX MIA 103 255 2342
## ---
## 389365: LAX MIA 2357 250 2342
## 389366: LAX ANC 2350 291 2345
## 389367: LAX SEA 2353 132 954
## 389368: LAX ORD 2358 198 1744
## 389369: LAX BOS 2355 272 2611
## ARRIVAL_TIME
## 1: 741
## 2: 756
## 3: 753
## 4: 605
## 5: 839
## ---
## 389365: 731
## 389366: 400
## 389367: 225
## 389368: 544
## 389369: 753
2.1.3 Inspecting tables
A first step in any analysis should involve inspecting the data we just read in. This often starts by looking at the first and last rows of the table as we did above. The next information we are often interested in is the size of our data set. We can use the following commands to obtain it:
## [1] 13
## [1] 389369 13
Next, we are often interested in basic statistics on the columns. To obtain this information we can call the summary()
function on the table.
## YEAR MONTH DAY DAY_OF_WEEK
## Min. :2015 Min. : 1.000 Min. : 1.0 Min. :1.000
## 1st Qu.:2015 1st Qu.: 3.000 1st Qu.: 8.0 1st Qu.:2.000
## Median :2015 Median : 6.000 Median :16.0 Median :4.000
## Mean :2015 Mean : 6.198 Mean :15.7 Mean :3.934
## 3rd Qu.:2015 3rd Qu.: 9.000 3rd Qu.:23.0 3rd Qu.:6.000
## Max. :2015 Max. :12.000 Max. :31.0 Max. :7.000
## AIRLINE FLIGHT_NUMBER
## Length:389369 Min. : 1
## Class :character 1st Qu.: 501
## Mode :character Median :1296
## Mean :1905
## 3rd Qu.:2617
## Max. :6896
This provides us already a lot of information about our data. We can for example see that all data is from 2015 as all values in the YEAR column are 2015. But for categorical data this is not very insightful, as we can see for the AIRLINE column.
To investigate categorical columns we can have a look at their unique elements using:
## [1] "AA" "US" "DL" "UA" "OO" "AS" "B6" "NK" "VX" "WN" "HA" "F9" "MQ"
This command provided us the airline identifiers present in the dataset. Another valuable information for categorical variables is how often each category occurs. This can be obtained using the following commands:
## AIRLINE
## AA AS B6 DL F9 HA MQ NK OO UA US VX WN
## 65483 16144 8216 50343 2770 3112 368 8688 73389 54862 7374 23598 75022
2.2 Row subsetting
As mentioned, the general basic form of the data.table syntax is:
DT[ i, j, by ] #
| | |
| | -------> grouped by what?
| -------> what to do with the columns?
---> on which rows?
Let us first look at the i
argument, i.e. row indexing. The parameter i
can be any vector of integers, corresponding to the row indices to select, or some logical vectors indicating which rows to select. Here are some typical examples.
2.2.1 Subsetting rows by indices
If we want to see the second element of the table, we can do the following:
## YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1: 2015 1 1 4 AA 258 N3HYAA LAX
## DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME
## 1: MIA 15 258 2342 756
A shorter writing allows leaving out the comma:
## YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1: 2015 1 1 4 AA 258 N3HYAA LAX
## DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME
## 1: MIA 15 258 2342 756
For accessing multiple consecutive rows we can use the start:stop
syntax as for example:
## YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1: 2015 1 1 4 AA 2336 N3KUAA LAX
## 2: 2015 1 1 4 AA 258 N3HYAA LAX
## 3: 2015 1 1 4 US 2013 N584UW LAX
## DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME
## 1: PBI 2 263 2330 741
## 2: MIA 15 258 2342 756
## 3: CLT 44 228 2125 753
Accessing multiple rows that are not necessarily consecutive can be done by creating an index vector with c()
:
## YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1: 2015 1 1 4 US 2013 N584UW LAX
## 2: 2015 1 1 4 AA 115 N3CTAA LAX
## DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME
## 1: CLT 44 228 2125 753
## 2: MIA 103 255 2342 839
2.2.2 Subsetting rows by logical conditions
Often, a more useful way to subset rows is using logical conditions, using for i
a logical vector. We can create such logical vectors using the following binary operators:
==
<
>
!=
%in%
For example, entries of flights operated by “AA” (American Airlines) can be extracted using:
## YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER
## 1: 2015 1 1 4 AA 2336 N3KUAA
## 2: 2015 1 1 4 AA 258 N3HYAA
## 3: 2015 1 1 4 AA 115 N3CTAA
## 4: 2015 1 1 4 AA 2410 N3BAAA
## 5: 2015 1 1 4 AA 1515 N3HMAA
## ---
## 65479: 2015 12 31 4 AA 1116 N029AA
## 65480: 2015 12 31 4 AA 246 N863AA
## 65481: 2015 12 31 4 AA 1927 N837AW
## 65482: 2015 12 31 4 AA 1538 N866AA
## 65483: 2015 12 31 4 AA 219 N3LYAA
## ORIGIN_AIRPORT DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE
## 1: LAX PBI 2 263 2330
## 2: LAX MIA 15 258 2342
## 3: LAX MIA 103 255 2342
## 4: LAX DFW 600 150 1235
## 5: LAX ORD 557 202 1744
## ---
## 65479: LAX ATL 2256 207 1947
## 65480: KOA LAX 2335 272 2504
## 65481: LAX IAD 2346 239 2288
## 65482: LAX MIA 2357 250 2342
## 65483: LAX ORD 2358 198 1744
## ARRIVAL_TIME
## 1: 741
## 2: 756
## 3: 839
## 4: 1052
## 5: 1139
## ---
## 65479: 553
## 65480: 635
## 65481: 712
## 65482: 731
## 65483: 544
Alternatively, if we are interested in all flights from any destination to the airports in NYC (“JFK” and “LGA”), we can subset the rows using the following command:
## YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER
## 1: 2015 1 1 4 B6 24 N923JB
## 2: 2015 1 1 4 DL 476 N196DN
## 3: 2015 1 1 4 AA 118 N788AA
## 4: 2015 1 1 4 VX 404 N621VA
## 5: 2015 1 1 4 UA 275 N598UA
## ---
## 12011: 2015 12 31 4 AA 180 N796AA
## 12012: 2015 12 31 4 B6 524 N934JB
## 12013: 2015 12 31 4 B6 624 N942JB
## 12014: 2015 12 31 4 DL 1262 N394DL
## 12015: 2015 12 31 4 B6 1124 N943JB
## ORIGIN_AIRPORT DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE
## 1: LAX JFK 620 279 2475
## 2: LAX JFK 650 274 2475
## 3: LAX JFK 650 268 2475
## 4: LAX JFK 728 268 2475
## 5: LAX JFK 806 277 2475
## ---
## 12011: LAX JFK 1640 259 2475
## 12012: LAX JFK 1645 261 2475
## 12013: LAX JFK 2107 280 2475
## 12014: LAX JFK 2244 256 2475
## 12015: LAX JFK 2349 274 2475
## ARRIVAL_TIME
## 1: 1413
## 2: 1458
## 3: 1436
## 4: 1512
## 5: 1606
## ---
## 12011: 18
## 12012: 18
## 12013: 513
## 12014: 625
## 12015: 748
Additionally, we can concatenate multiple conditions using the logical OR |
or the logical AND &
operator. For instance, if we want to inspect all flights departing between 6am and 7am operated by American Airlines we can use the following statement:
## YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER
## 1: 2015 1 1 4 AA 1686 N4XXAA
## 2: 2015 1 1 4 AA 1361 N3KYAA
## 3: 2015 1 1 4 AA 2420 N3ERAA
## 4: 2015 1 1 4 AA 338 N3DJAA
## 5: 2015 1 1 4 AA 2424 N3DLAA
## ---
## 3820: 2015 12 31 4 AA 169 N787AA
## 3821: 2015 12 31 4 AA 1352 N7CAAA
## 3822: 2015 12 31 4 AA 146 N3MKAA
## 3823: 2015 12 31 4 AA 2453 N869AA
## 3824: 2015 12 31 4 AA 118 N791AA
## ORIGIN_AIRPORT DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE
## 1: LAX STL 609 183 1592
## 2: BNA LAX 607 255 1797
## 3: LAX DFW 619 149 1235
## 4: LAX SFO 644 55 337
## 5: LAX DFW 641 146 1235
## ---
## 3820: SFO LAX 623 54 337
## 3821: MIA LAX 651 303 2342
## 3822: LAX BOS 650 268 2611
## 3823: LAX DFW 651 142 1235
## 3824: LAX JFK 659 272 2475
## ARRIVAL_TIME
## 1: 1134
## 2: 847
## 3: 1119
## 4: 803
## 5: 1149
## ---
## 3820: 740
## 3821: 913
## 3822: 1446
## 3823: 1134
## 3824: 1505
2.3 Column operations
You may have wondered why R correctly runs code such as flights[AIRLINE == "AA"]
although AIRLINE
is not a variable of the environment but a column of the data.table
flights
. Such a call would not execute properly with a data.frame
. The reason is that code entered inside the []
brackets of a data.table
is interpreted using the data.table
environment. Inside this environment, columns are seen as variables already. This makes the syntax very light and readable for row subsetting as we just saw. It becomes particularly powerful for column operations. We now look at this.
Although feasible, it is not advisable to access a column by its number since the ordering or number of columns can easily change. Also, if you have a data set with a large number of columns (e.g. 50), how do you know which one is column 18? Therefore, use the column name instead. Using column names prevents bugs.
2.3.1 Working with columns
Accessing columns by name also makes the code more readable: flights[, TAIL_NUMBER]
instead of flights[, 7]
.
## [1] "N3KUAA" "N3HYAA" "N584UW" "N547US" "N3CTAA" "N76517" "N925SW" "N719SK"
## [9] "N435SW" "N560SW"
For accessing a specific cell (i.e. specific column and specific row), we can use the following syntax:
## [1] "N547US"
This command for accessing multiple columns would return a vector:
## [1] "N3KUAA" "N3HYAA" "LAX" "LAX"
However, when accessing many columns, we probably want to return a data.table
instead of a vector. For that, we need to provide R with a list, so we use list(colA, colB)
or its simplified version .(colA, colB)
:
## TAIL_NUMBER ORIGIN_AIRPORT
## 1: N3KUAA LAX
## 2: N3HYAA LAX
## TAIL_NUMBER ORIGIN_AIRPORT
## 1: N3KUAA LAX
## 2: N3HYAA LAX
2.3.2 Column operations
We already saw that inside the []
environment, columns are seen as variables, so we can apply functions to them.
## [1] 162.1379
## [1] 68.02261
To compute operations in multiple columns, we must provide a list (unless we want the result to be a vector).
# Same as flights[, .(mean(AIR_TIME), median(AIR_TIME))]
flights[, list(mean(AIR_TIME, na.rm=TRUE), median(AIR_TIME, na.rm=TRUE))]
## V1 V2
## 1: 162.1379 150
To give meaningful names to the computations from before, we can use the following command:
flights[, .(mean_AIR_TIME = mean(AIR_TIME, na.rm=TRUE), median_AIR_TIME = median(AIR_TIME, na.rm=TRUE))]
## mean_AIR_TIME median_AIR_TIME
## 1: 162.1379 150
Any operation can be applied to the columns, just as with variables. This code computes the average speed as the ratio of AIR_TIME
over DISTANCE
for the 5 first entries of the table flights
:
## [1] 0.1128755 0.1101623 0.1072941 0.1224756 0.1088813
2.3.3 Advanced commands: *apply() over columns
The columns of a data.table
are exposed as a list to the environment. Therefore functions applying to a list can be applied to them, including those of the *apply
family such as sapply()
. For example:
## YEAR MONTH DAY DAY_OF_WEEK
## "integer" "integer" "integer" "integer"
## AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## "character" "integer" "character" "character"
## DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE
## "character" "integer" "integer" "integer"
## ARRIVAL_TIME
## "integer"
2.4 The ‘by’ option
The by
option allows executing the j
command by groups. For example, we can useby =
to compute the mean flight time per airline:
## AIRLINE mean_AIRTIME
## 1: AA 219.48133
## 2: US 210.39488
## 3: DL 207.07201
## 4: UA 211.62008
## 5: OO 68.02261
## 6: AS 141.01870
## 7: B6 309.79568
## 8: NK 179.55828
## 9: VX 185.36374
## 10: WN 105.19976
## 11: HA 307.95961
## 12: F9 159.94041
## 13: MQ 102.15210
This way we can easily spot that one airline conducts on average shorter flights. Moreover, we can compute the mean and standard deviation of the air time of every airline:
flights[, .(mean_AIRTIME = mean(AIR_TIME, na.rm=TRUE), sd_AIR_TIME = sd(AIR_TIME, na.rm=TRUE)), by = AIRLINE]
## AIRLINE mean_AIRTIME sd_AIR_TIME
## 1: AA 219.48133 92.889719
## 2: US 210.39488 105.224833
## 3: DL 207.07201 88.908566
## 4: UA 211.62008 94.832456
## 5: OO 68.02261 41.065036
## 6: AS 141.01870 51.806424
## 7: B6 309.79568 28.457740
## 8: NK 179.55828 78.194706
## 9: VX 185.36374 113.504572
## 10: WN 105.19976 69.257334
## 11: HA 307.95961 23.905491
## 12: F9 159.94041 61.412379
## 13: MQ 102.15210 8.531046
Although we could write flights[i = 5, j = AIRLINE]
, we usually omit the i =
and j =
from the syntax, and write flights[5, AIRLINE]
instead. However, for clarity we usually include the by =
in the syntax.
2.5 Counting occurences with .N
The .N
is a special in-built variable that counts the number observations within a table. Evaluating .N
alone is equal to nrow()
of a table.
## [1] 389369
## [1] 389369
But the .N
command becomes a lot more powerful when used with grouping or conditioning. We already saw earlier how we can use it to count the number of occurrences of elements in categorical columns. For instance, we can get the number of flights for each airline:
## AIRLINE N
## 1: AA 65483
## 2: US 7374
## 3: DL 50343
## 4: UA 54862
## 5: OO 73389
## 6: AS 16144
## 7: B6 8216
## 8: NK 8688
## 9: VX 23598
## 10: WN 75022
## 11: HA 3112
## 12: F9 2770
## 13: MQ 368
Remembering the data.table
definition: “Take DT, subset rows using i, then select or calculate j, grouped by by”,
we can build even more powerful statements using all three elements. For example, we can, for each airline, get the number of flights arriving to the airport JFK:
## AIRLINE N
## 1: B6 2488
## 2: DL 2546
## 3: AA 3804
## 4: VX 1652
## 5: UA 1525
2.6 Extending tables
2.6.1 Creating new columns (the := command)
The :=
operator updates the data.table
we are working with in place, so writing DT <- DT[,... := ...]
is redundant. This operator, plus all set
functions (e.g. setnames
for column name renaming, setcolorder
for changing the positions in which the columns are positioned inside the DT
, etc), change their input by reference. No copy of the object is made, which makes the operation faster and less memory-consuming.
As an example, we can add a new column called SPEED
(in miles per hour) whose value is the DISTANCE
divided by AIR_TIME
times 60:
## YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT
## 1: 2015 1 1 4 AA 2336 N3KUAA LAX
## 2: 2015 1 1 4 AA 258 N3HYAA LAX
## 3: 2015 1 1 4 US 2013 N584UW LAX
## 4: 2015 1 1 4 DL 1434 N547US LAX
## 5: 2015 1 1 4 AA 115 N3CTAA LAX
## 6: 2015 1 1 4 UA 1545 N76517 LAX
## DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME SPEED
## 1: PBI 2 263 2330 741 531.5589
## 2: MIA 15 258 2342 756 544.6512
## 3: CLT 44 228 2125 753 559.2105
## 4: MSP 35 188 1535 605 489.8936
## 5: MIA 103 255 2342 839 551.0588
## 6: IAH 112 156 1379 607 530.3846
Having computed a new column using the :=
operator, we can use it for further analyses. For instance, we can compute the average speed, air time and distance for each airline:
flights[, .(mean_AIR_TIME = mean(AIR_TIME, na.rm=TRUE),
mean_SPEED = mean(SPEED, na.rm=TRUE),
mean_DISTANCE = mean(DISTANCE, na.rm=TRUE)
), by=AIRLINE]
## AIRLINE mean_AIR_TIME mean_SPEED mean_DISTANCE
## 1: AA 219.48133 461.2839 1739.2331
## 2: US 210.39488 452.1641 1658.2581
## 3: DL 207.07201 466.0330 1656.2165
## 4: UA 211.62008 464.2928 1693.5504
## 5: OO 68.02261 349.5549 437.2337
## 6: AS 141.01870 439.0120 1040.0340
## 7: B6 309.79568 484.8242 2486.1489
## 8: NK 179.55828 450.0221 1402.1591
## 9: VX 185.36374 433.0870 1432.5384
## 10: WN 105.19976 409.3803 760.2593
## 11: HA 307.95961 497.3118 2537.8107
## 12: F9 159.94041 461.0684 1235.6664
## 13: MQ 102.15210 435.5580 737.0000
Now we can see that the flights by the carrier “OO” are not just shorter, but also slow. This could for example lead us to the hypothesis, that “OO” is a small regional carrier, which operates slower planes.
Additionally we can use the :=
operator to remove columns. If we for example observe that tail numbers are not important for our analysis we can remove them with the following statement:
## YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## 1: 2015 1 1 4 AA 2336 LAX
## 2: 2015 1 1 4 AA 258 LAX
## 3: 2015 1 1 4 US 2013 LAX
## 4: 2015 1 1 4 DL 1434 LAX
## 5: 2015 1 1 4 AA 115 LAX
## 6: 2015 1 1 4 UA 1545 LAX
## DESTINATION_AIRPORT DEPARTURE_TIME AIR_TIME DISTANCE ARRIVAL_TIME SPEED
## 1: PBI 2 263 2330 741 531.5589
## 2: MIA 15 258 2342 756 544.6512
## 3: CLT 44 228 2125 753 559.2105
## 4: MSP 35 188 1535 605 489.8936
## 5: MIA 103 255 2342 839 551.0588
## 6: IAH 112 156 1379 607 530.3846
Here we observe, that the tail numbers are gone from the data.table
.
2.6.2 Advanced: Multiple assignments
With the following syntax we can assign multiple new columns at once. We load the base R dataset iris
2, which we first transform into a data.table
.
# load the Iris data table
iris_dt <- as.data.table(iris)
# Add columns with sepal and petal area. Note the syntax of multiple assignment.
iris_dt[, `:=` (Sepal.Area = Sepal.Length * Sepal.Width,
Petal.Area = Petal.Length * Petal.Width)][1:3]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1: 5.1 3.5 1.4 0.2 setosa 17.85
## 2: 4.9 3.0 1.4 0.2 setosa 14.70
## 3: 4.7 3.2 1.3 0.2 setosa 15.04
## Petal.Area
## 1: 0.28
## 2: 0.28
## 3: 0.26
You can also delete columns by using the :=
command.
# Let's assume setosa flowers are orange, versicolor purple and virginica pink.
# Add a column with these colors.
iris_dt[Species == "setosa", color := "orange"]
iris_dt[Species == "versicolor", color := "purple"]
iris_dt[Species == "virginica", color := "pink"]
unique(iris_dt[, .(Species, color)])
## Species color
## 1: setosa orange
## 2: versicolor purple
## 3: virginica pink
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [6] "Sepal.Area" "Petal.Area"
2.6.3 Copying tables
What do we mean when we say that data.table
modifies columns by reference?
It means that no new copy of the object is made in the memory, unless we actually create one using copy()
.
or_dt <- data.table(a = 1:10, b = 11:20)
# No new object is created, both new_dt and or_dt point to the same memory chunk.
new_dt <- or_dt
new_dt[, ab := a*b]
colnames(or_dt) # or_dt was also affected by changes in new_dt
## [1] "a" "b" "ab"
or_dt <- data.table(a = 1:10, b = 11:20)
copy_dt <- copy(or_dt) # By creating a copy, we have 2 objects in memory
copy_dt[, ab := a*b]
colnames(or_dt) # Changes in the copy don't affect the original
## [1] "a" "b"
2.7 Summary
By now, you should be able to answer the following questions:
- How to subset by rows or columns? Remember: DT[i, j, by].
- How to add columns?
- How to make operations with different columns?
2.8 Data.table resources
The help page for data.table
.
https://cran.r-project.org/web/packages/data.table/
https://s3.amazonaws.com/../assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf