Chapter 3 Tidy data and combining tables

3.1 Introduction

3.1.1 Motivation

Without good practices, much of the time of a data analyst can be wasted in data wrangling rather than visualization or analysis. The concept of tidy data (Wickham 2014) addresses this issue by offering a standard representation of data, that is easy to manipulate, model and visualize. This chapter introduces the notion of tidy data and operations for tidying up messy datasets. Moreover, we describe how to easily concatenate tables with the same format and merge tables with common variables. This will set us ready for data visualization and analytics.

This chapter is partially adopted from “Introduction to Data Science” by Rafael A. Irizarry (https://rafalab.github.io/dsbook/) and uses concepts and material introduced by the developers of the tidyr package.

3.1.2 Datasets used in this chapter

The following code chunks load libraries and tables used throughout this chapter.

  library(data.table) # melt, dcast, ...
  library(tidyr) # separate, unite, ...

DATADIR <- "extdata"

election_results <- fread(
  file.path(DATADIR, "US-pres16results.csv"), 
  na.strings=c("NULL", "NA"), encoding = "UTF-8", sep = ","
  )

election_results <- election_results[
  is.na(county) & st != "US",
  .(cand, st, votes, total_votes)
  ]

setnames(election_results, "cand", "candidate")
setnames(election_results, "st", "state")

table1 <- fread(
  file.path(DATADIR, "table1_alternate.csv"), 
  na.strings=c("NULL", "NA"), encoding = "UTF-8", sep = ","
  ) 

table2 <- fread(
  file.path(DATADIR, "table2_alternate.csv"), 
  na.strings=c("NULL", "NA"), encoding = "UTF-8", sep = ","
  ) 

table3 <- fread(
  file.path(DATADIR, "table3_alternate.csv"), 
  na.strings=c("NULL", "NA"), encoding = "UTF-8", sep = ","
  )

table4 <- fread(
  file.path(DATADIR, "table4_alternate.csv"), 
  na.strings=c("NULL", "NA"), encoding = "UTF-8", sep = ","
  ) 

table5 <- fread(
  file.path(DATADIR, "table5_alternate.csv"), 
  na.strings=c("NULL", "NA"), encoding = "UTF-8", sep = ","
  )

3.2 Tidy and untidy data

3.2.1 Definition of tidy data

Tidy data table layout. Each variable has a column, each observation a row and each value a cell.

We say that a data table is in tidy format if:

Each variable has its own column.
Each observation has its own row.
Each value has its own cell.

The following dataset from the 2016 US presidential vote³ is an example of a tidy dataset:

head(election_results)

##          candidate state   votes total_votes
## 1: Hillary Clinton    CA 5931283     9631972
## 2:    Donald Trump    CA 3184721     9631972
## 3:    Gary Johnson    CA  308392     9631972
## 4:      Jill Stein    CA  166311     9631972
## 5:  Gloria La Riva    CA   41265     9631972
## 6:    Donald Trump    FL 4605515     9386750

Each row represents a state and a candidate with each of the four values related to these states stored in the four variables: candidate, state, votes, and total_votes.

3.2.2 Advantages of tidy data

Organizing data in a tidy fashion reduces the burden to frequently reorganize the data. In particular, the advantages are:

Easier manipulation using data.table commands such as sub-setting by rows and columns, as well as by operations
Vectorized operations become easier to use
Many other tools work better with tidy data, including plotting functions, hypothesis testing functions, and modeling functions such as linear regression. These advantages will become striking in the following chapters.

3.2.3 Common signs of untidy datasets

Often, untidy datasets can be identified by one or more of the following issues (Wickham 2014):

Column headers are values, not variable names
Multiple variables are stored in one column
Variables are stored in both rows and columns
A single observational unit is stored in multiple tables

Wickham (2014) furthermore mentions “Multiple types of observational units stored in the same table” as a sign of untidy data. This point is discussed in Section 3.6.

3.3 Tidying up datasets

In this part of the chapter, we show how to transform untidy datasets into tidy ones. To this end, we will present some of the most often encountered untidy formats and present specific solutions to each of them.

3.3.1 Melting (wide to long)

One of the most used operations to obtain tidy data is to transform a wide table into a long table. This operation, which transforms a wide table into a long table is called melting, by analogy with melting a piece of metal. It is useful in particular when data is untidy because column headers are values, and not variable names.

As an example, consider the table below which reports vote counts for two US states, California and Florida. In this table, the column names CA and FL are values of the variable state. Therefore, we can say that this table is in an untidy format:

table4

##          candidate      CA      FL
## 1: Hillary Clinton 5931283 4485745
## 2:    Donald Trump 3184721 4605515
## 3:    Gary Johnson  308392  206007
## 4:      Jill Stein  166311   64019

Melting the election dataset This can be achieved by using the data.table function melt():

melt(table4,
     id.vars = "candidate",
     measure.vars = c("CA", "FL"),
     variable.name = "state",
     value.name = "votes")

##          candidate state   votes
## 1: Hillary Clinton    CA 5931283
## 2:    Donald Trump    CA 3184721
## 3:    Gary Johnson    CA  308392
## 4:      Jill Stein    CA  166311
## 5: Hillary Clinton    FL 4485745
## 6:    Donald Trump    FL 4605515
## 7:    Gary Johnson    FL  206007
## 8:      Jill Stein    FL   64019

We remark that the previous chunk of code would work as well without specifying either measure.vars OR id.vars. However, specifying neither will not work.

When melting, all values in the columns specified by the measure.vars argument are gathered into one column whose name can be specified using the value.name argument. Additionally, a new column, which can be named using the argument variable.name, is created containing all values which were previously stored in the column names.

Now we have a table in a tidy format where a row represents the number of votes for a candidate in a state. The new table also makes clear that the quantities are numbers of votes thanks to the column name.

3.3.2 Casting (long to wide)

The other way around also happens frequently. It is helpful when multiple variables are stored in one column. In the table below, multiple values, namely the number of votes for a candidate and the total number of votes, are reported in one column. It is not easy to compute the percentage of votes given to a candidate in this format. To tidy up this table we have to separate those values into two columns:

table2

##          candidate state        type   value
## 1: Hillary Clinton    CA       votes 5931283
## 2: Hillary Clinton    CA total_votes 9631972
## 3:    Donald Trump    CA       votes 3184721
## 4:    Donald Trump    CA total_votes 9631972
## 5:    Gary Johnson    CA       votes  308392
## 6:    Gary Johnson    CA total_votes 9631972

Casting the election dataset

This operation, which transforms a long table into a wide table is called casting, following up with the metal forging analogy employed with the term “melting”.

Data table casting can be achieved using the dcast() function whose most frequent usage is:

dcast(data, formula, value.var = guess(data))

Casting requires specifying which column contains the categories by which the new columns should be created. This is provided via the formula argument. Setting formula to be ... ~ type instructs dcast to create new columns in the table containing the categories named in the column “type” and that all other columns will get rearranged accordingly. Furthermore, we force the argument value.var, which refers to which column the values have to be extracted from, to be “value” rather than letting dcast guessing as by default. The call is then:

dcast(table2, ... ~ type,  value.var = "value")

##          candidate state total_votes   votes
## 1:    Donald Trump    CA     9631972 3184721
## 2:    Gary Johnson    CA     9631972  308392
## 3: Hillary Clinton    CA     9631972 5931283

The function dcast has many more arguments. Also the formula can allow for more sophisticated ways to handle the columns. We refer to the “datatable-reshape” vignette (see section Resources) and the help of dcast() for more details.

3.3.3 Separating columns

Sometimes single variables can be spread across multiple columns as in the following table.

## One column contains multiple variables
print(table3)

##          candidate state      proportion
## 1: Hillary Clinton    CA 5931283/9631972
## 2:    Donald Trump    CA 3184721/9631972
## 3:    Gary Johnson    CA  308392/9631972

The number of votes per candidate is displayed in the numerator of the proportion column and the total number in the denominator.

We can solve both problems using the separate() function from the tidyr package. The code below splits up the proportion column into two columns, one containing the votes and the other one containing the total votes. By default, columns are separated by any non-alphanumerical character (such as “,”, “;”, “/”,…).

separate(table3, col = proportion,
         into = c("votes", "total_votes"))

##         candidate state   votes total_votes
## 1 Hillary Clinton    CA 5931283     9631972
## 2    Donald Trump    CA 3184721     9631972
## 3    Gary Johnson    CA  308392     9631972

Separated election dataset The drawing above visualizes the operation performed above.

3.3.4 Uniting columns

In this example the first and last names are separated columns without a real need for it (we will not be interested in computing any statistics over all Hillary’s):

table5

##       name surname state   votes total_votes
## 1: Hillary Clinton    CA 5931283     9631972
## 2:  Donald   Trump    CA 3184721     9631972
## 3:    Gary Johnson    CA  308392     9631972
## 4:    Jill   Stein    CA  166311     9631972
## 5:  Gloria La Riva    CA   41265     9631972

We unite multiple variables into a single variable with the function unite() from the tidyr package:

unite(table5, col = candidate, name, surname, sep = " ")

##         candidate state   votes total_votes
## 1 Hillary Clinton    CA 5931283     9631972
## 2    Donald Trump    CA 3184721     9631972
## 3    Gary Johnson    CA  308392     9631972
## 4      Jill Stein    CA  166311     9631972
## 5  Gloria La Riva    CA   41265     9631972

The sep argument defines the separating character(s) used to unite the different column values into one.

United election dataset

3.3.5 Advanced: Columns containing sets of values

Kaggle, a machine learning platform, conducts a yearly survey among its users. Below are a few columns of the answers from the 2017 survey. In those columns, we observe another type of untidy data. In this survey, multiple choice questions were asked from which multiple answers could be selected. For each individual the selected answers are concatenated into a string.

options(width = 60)
survey <- fread('extdata/kaggle-survey-2017/multipleChoiceResponses.csv')
survey[, .(LanguageRecommendationSelect, LearningPlatformSelect, PastJobTitlesSelect)]

##        LanguageRecommendationSelect
##     1:                           F#
##     2:                       Python
##     3:                            R
##     4:                       Python
##     5:                       Python
##    ---                             
## 16712:                             
## 16713:                       Python
## 16714:                             
## 16715:                             
## 16716:                             
##                                                                                              LearningPlatformSelect
##     1:                                                           College/University,Conferences,Podcasts,Trade book
##     2:                                                                                                       Kaggle
##     3:                                                Arxiv,College/University,Kaggle,Online courses,YouTube Videos
##     4: Blogs,College/University,Conferences,Friends network,Official documentation,Online courses,Personal Projects
##     5:                                                                            Arxiv,Conferences,Kaggle,Textbook
##    ---                                                                                                             
## 16712:                                                                                                             
## 16713:                        Kaggle,Non-Kaggle online communities,Online courses,Stack Overflow Q&A,YouTube Videos
## 16714:                                                                                                             
## 16715:                                                                                                             
## 16716:                                                                                                             
##                                                                                                                                                               PastJobTitlesSelect
##     1:                                                                                                                                   Predictive Modeler,Programmer,Researcher
##     2:                                                                                                                                       Software Developer/Software Engineer
##     3:                                                                                                                                   Data Scientist,Machine Learning Engineer
##     4:                                                                                      Business Analyst,Operations Research Practitioner,Predictive Modeler,Programmer,Other
##     5: Computer Scientist,Data Analyst,Data Miner,Data Scientist,Engineer,Machine Learning Engineer,Predictive Modeler,Programmer,Researcher,Software Developer/Software Engineer
##    ---                                                                                                                                                                           
## 16712:                                                                                                                                   Researcher,I haven't started working yet
## 16713:                                                                                                                            Programmer,Software Developer/Software Engineer
## 16714:                                                                                                                                                                           
## 16715:                                                                                                                                                                           
## 16716:

Below is one solution of how the LearningPlatformSelect column could be transformed into a tidy format. We make use here of the handy pipe operator denoted %>% from the magrittr package (See Appendix B.1).

survey_split <- survey[,tstrsplit(LearningPlatformSelect, ',')]
survey_split[, individual := 1:nrow(survey)]
LearningPlatformMelt <- melt(survey_split, 
                             id.vars = 'individual',
                             na.rm = TRUE)[, variable := NULL]

LearningPlatformMelt[order(individual)] %>% head(n=5)

##    individual              value
## 1:          1 College/University
## 2:          1        Conferences
## 3:          1           Podcasts
## 4:          1         Trade book
## 5:          2             Kaggle

3.4 Concatenating tables

One frequently has to concatenate (i.e. append) tables with a same format. Such tables may already be loaded into a list or shall be read from multiple files.

For instance, assume a service generates a new file of data per day in a given directory. One is interested in analyzing the files of multiple days jointly. This requires to list all files of the directory, to read each file and to concatenate them into one.

Here is an example with daily COVID-19 data. We first get all file names of the directory into a character vector called files:

files <- list.files('path_to_your_directory')

files <- list.files('extdata/cov_concatenate', full.names = TRUE)
head(files)

## [1] "extdata/cov_concatenate/covid_cases_01_03_2020.csv"
## [2] "extdata/cov_concatenate/covid_cases_02_03_2020.csv"
## [3] "extdata/cov_concatenate/covid_cases_03_03_2020.csv"
## [4] "extdata/cov_concatenate/covid_cases_04_03_2020.csv"
## [5] "extdata/cov_concatenate/covid_cases_05_03_2020.csv"
## [6] "extdata/cov_concatenate/covid_cases_06_03_2020.csv"

Next, we load all file contents with fread using lapply, which passes the function fread to every element in the list files and returns a list of data.tables called tables.

# name the list elements by the filenames 
names(files) <- basename(files)

# read all files at once into a list of data.tables
tables <- lapply(files, fread)

Let us now look at the first table:

head(tables[[1]])

##    cases deaths countriesAndTerritories geoId
## 1:    54      0                 Germany    DE
## 2:   240      8                   Italy    IT
##    countryterritoryCode popData2019 continentExp
## 1:                  DEU    83019213       Europe
## 2:                  ITA    60359546       Europe
##    Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
## 1:                                                  0.1156359
## 2:                                                  1.8638311

We notice that the variable date is only encoded in the file path so that we additionally need to introduce a new variable in the new table, which defines, from which list the original table came from. We do this to avoid losing information. In this manner, we can state which case / population numbers came from which country.

To do so, we can use the data.table function rbindlist() which gives us the option to introduce a new column idcol containing the list names:

# bind all tables into one using rbindlist, 
# keeping the list names (the filenames) as an id column. 
dt <- rbindlist(tables, idcol = 'filepath')
head(dt)

##                      filepath cases deaths
## 1: covid_cases_01_03_2020.csv    54      0
## 2: covid_cases_01_03_2020.csv   240      8
## 3: covid_cases_02_03_2020.csv    18      0
## 4: covid_cases_02_03_2020.csv   561      6
## 5: covid_cases_03_03_2020.csv    28      0
## 6: covid_cases_03_03_2020.csv   347     17
##    countriesAndTerritories geoId countryterritoryCode
## 1:                 Germany    DE                  DEU
## 2:                   Italy    IT                  ITA
## 3:                 Germany    DE                  DEU
## 4:                   Italy    IT                  ITA
## 5:                 Germany    DE                  DEU
## 6:                   Italy    IT                  ITA
##    popData2019 continentExp
## 1:    83019213       Europe
## 2:    60359546       Europe
## 3:    83019213       Europe
## 4:    60359546       Europe
## 5:    83019213       Europe
## 6:    60359546       Europe
##    Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
## 1:                                                  0.1156359
## 2:                                                  1.8638311
## 3:                                                  0.1373176
## 4:                                                  2.7932616
## 5:                                                  0.1710447
## 6:                                                  3.3681499

3.5 Merging tables

Merging two data tables into one by common column(s) is frequently needed. This can be achieved using the merge function whose core signature is:

merge(
  x, y,                                  # tables to merge
  by = NULL, by.x = NULL, by.y = NULL,   # by which columns
  all = FALSE, all.x = all, all.y = all  # types of merge
)

The four types of merges (also commonly called joins) are:

Inner (default): consider only rows with matching values in the by columns.
Outer or full (all): return all rows and columns from x and y. If there are no matching values, return NAs.
Left (all.x): consider all rows from x, even if they have no matching row in y.
Right (all.y): consider all rows from y, even if they have no matching row in x.

We now provide examples of each type using the following made up tables:

dt1 <- data.table(p_id = c("G008", "F027", "L051"), 
                  value = rnorm(3)) 
dt1

##    p_id      value
## 1: G008  0.1671914
## 2: F027 -1.1082661
## 3: L051 -0.7501615

dt2 <- data.table(p_id = c("G008", "F027", "U093"), 
                  country = c("Germany", "France", "USA")) 
dt2

##    p_id country
## 1: G008 Germany
## 2: F027  France
## 3: U093     USA

3.5.1 Inner merge

An inner merge returns only rows with matching values in the by columns and discards all other rows:

# Inner merge, default one, all = FALSE
m <- merge(dt1, dt2, by = "p_id", all = FALSE)
m

##    p_id      value country
## 1: F027 -1.1082661  France
## 2: G008  0.1671914 Germany

Note that the row order got changed after the merging. To prevent this and, therefore, to keep the original ordering we can use the argument sort and set it to FALSE:

m <- merge(dt1, dt2, by = "p_id", all = FALSE, sort = FALSE)
m

##    p_id      value country
## 1: G008  0.1671914 Germany
## 2: F027 -1.1082661  France

Note that the column order is preserved after merging.

3.5.2 Outer (full) merge

An outer merge returns all rows and columns from x and y. If there are no matching values in p_id, it yields missing values (NA):

# Outer (full) merge, all = TRUE
merge(dt1, dt2, by = "p_id", all = TRUE)

##    p_id      value country
## 1: F027 -1.1082661  France
## 2: G008  0.1671914 Germany
## 3: L051 -0.7501615    <NA>
## 4: U093         NA     USA

3.5.3 Left merge

Returns all rows from x, even if they have no matching row in y. Rows from x with no matching p_id in y lead to missing values (NA).

# Left merge, all.x = TRUE
merge(dt1, dt2, by = "p_id", all.x = TRUE)

##    p_id      value country
## 1: F027 -1.1082661  France
## 2: G008  0.1671914 Germany
## 3: L051 -0.7501615    <NA>

3.5.4 Right merge

Returns all rows from y, even if they have no matching row in x. Rows from y with no matching p_id in x lead to missing values (NA).

# Right, all.y = TRUE
merge(dt1, dt2, by = "p_id", all.y = TRUE)

##    p_id      value country
## 1: F027 -1.1082661  France
## 2: G008  0.1671914 Germany
## 3: U093         NA     USA

3.5.5 Merging by several columns

Merging can also be done using several columns. Here are two made-up tables to illustrate this use case:

dt1 <- data.table(firstname = c("Alice", "Alice", "Bob"), 
                  lastname = c("Coop", "Smith", "Smith"), x=1:3)
dt1

##    firstname lastname x
## 1:     Alice     Coop 1
## 2:     Alice    Smith 2
## 3:       Bob    Smith 3

dt2 <- data.table(firstname = c("Alice", "Bob", "Bob"), 
                  lastname = c("Coop", "Marley", "Smith"),
                  y=LETTERS[1:3])
dt2

##    firstname lastname y
## 1:     Alice     Coop A
## 2:       Bob   Marley B
## 3:       Bob    Smith C

We merge now dt1 and dt2 by first name and last name:

merge(dt1, dt2, by=c("firstname", "lastname"))

##    firstname lastname x y
## 1:     Alice     Coop 1 A
## 2:       Bob    Smith 3 C

Notice that merging by first name only gives a different result (as expected):

merge(dt1, dt2, by="firstname")

##    firstname lastname.x x lastname.y y
## 1:     Alice       Coop 1       Coop A
## 2:     Alice      Smith 2       Coop A
## 3:       Bob      Smith 3     Marley B
## 4:       Bob      Smith 3      Smith C

Also notice that in this case the merge tables has a column lastname.x and a column lastname.y. This is because the two original data tables have a column named the same way (“lastname”), but this column was not part of the “by” argument. Hence, it is assumed that they do not necessarily correspond to the same variable. Hence, they receive distinct names in the returned table.

3.6 Tidy representations are not unique

While untidy data should be avoided, there can be multiple tidy representations for a particular dataset. We explain this regarding i) alternative forms of a single table and ii) the practical utility of non-normalized representations (i.e. with redundant information).

3.6.1 Alternative tidy forms of a table

There can be alternative tidy representations for a same table. Here is an example based on Fisher’s Iris dataset. This classic dataset contains measurements of 4 different attributes for 150 iris flowers from 3 different species. See https://en.wikipedia.org/wiki/Iris_flower_data_set.

Here is one tidy representation where each row represents one flower:

# Iris dataset, usual representation
iris_dt[1:3,]

##    Flower Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1:    F_1          5.1         3.5          1.4         0.2
## 2:    F_2          4.9         3.0          1.4         0.2
## 3:    F_3          4.7         3.2          1.3         0.2
##    Species
## 1:  setosa
## 2:  setosa
## 3:  setosa

Here is another tidy representation where each row represents one measurement:

# Another tidy representation
iris_melt[1:3,]

##    Flower Species    Attribute value
## 1:    F_1  setosa Sepal.Length   5.1
## 2:    F_2  setosa Sepal.Length   4.9
## 3:    F_3  setosa Sepal.Length   4.7

Both representations are tidy and can be more or less useful depending on the downstream analysis. For instance the first representation, which is wide, is handy to assess the relationship between sepal length and sepal width, say by plotting one against the other one or by computing their correlations. The second representation, which is long, can be useful to compute means by attributes or by attributes and species. In the wide form, computing those group means would require to select columns by names which is tedious and leads to not-easily maintainable code. The decisive criteria between using one or the other tidy representation is the definition on what is considered as an observation in the use case.

3.6.2 On multiple types of observational units in the same table

Another important remark for handling tidy data in practice relates to the last common sign of messy datasets according to Wickham (2014), i.e. “Multiple types of observational units are stored in the same table”. Applying this criteria actually depends on the context.

Consider the following table which combines product and customer data:

##    productCode quantOrdered price customerNumber
## 1:        p018            1   450           c001
## 2:        p030            2   600           c001
## 3:        p018            1   450           c002
##    customerName state
## 1:        Smith    CA
## 2:        Smith    CA
## 3:        Lewis    AZ

This table is tidy. Each row corresponds to an order. The columns are variables. However, it contains repetitive information: the product code, product name and its price on the one hand, the customer number, name and state on the other hand. The information could be stored in separate tables without data repetitions, namely:

a consumer table:

##    customerNumber customerName state
## 1:           c001        Smith    CA
## 2:           c002        Lewis    AZ

a product table:

##    productCode price
## 1:        p018   450
## 2:        p030   600

an order table:

##    productCode customerNumber quantOrdered
## 1:        p018           c001            1
## 2:        p030           c001            2
## 3:        p018           c002            1

The three-table representation, where each table has unique entries is called a normalized representation. Normalized representations ensure that no multiple types of observational units are stored in the same table. It is a good habit to have normalized representations for database back-ends because it facilitates maintenance of the data consistency by reducing redundancy. One should not enter all customer details at each order but do it one central place and link the information with a customer number.

However, on the data analysis side (front-end), we are not interested in maintaining a database (back-end), rather in having the desired data in a ready-to-use format which depends on our needs. To this end, the merge table is very handy and can be the common denominator of multiple analyses like:

# vectorized operations e.g. total price of each order
prod_dt[, totalPrice := quantOrdered * price]

# group by operations, e.g. number of products per states 
prod_dt[, N_prod := .N, by = state]

Hence, the choice of the representation (normalized or not) depends on the context: back-end or front-end.

3.7 Summary

By now, you should be able to:

define what a tidy dataset is
recognize untidy data
perform the operations of melting and casting
perform the operations of uniting and splitting
append tables with the same format by rows
understand and perform the 4 merging operations

3.8 Tidy data resources

Tidy data: H. Wickham, Journal of Statistical Software, 2014, Volume 59, Issue 10 https://www.jstatsoft.org/v59/i10/paper

Melt and cast: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software, Articles 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.

https://www.kaggle.com/stevepalley/2016uspresidentialvotebycounty?select=pres16results.csv ↩︎