Data Wrangling with R: Combining Tables

Use different techniques of combining tables such as mutating join, filtering join and biding.

Thierry Warin https://www.nuance-r.com/principalInvestigator.html (SKEMA Business School (Raleigh, NC))https://www.skema-bs.fr/campus/campus-raleighl
05-27-2019

Table of Contents


By using SKEMA Quantum Studio framework (Warin 2019), this course will teach you how to manipulate data with R.

Set up

To show you how the following functions work, let’s create two dataframes:


# Create a dataframe called "group1"
group1 <- data.frame(name=c("Laura", "Paul", "Ashley", "Sarah", "Mark", "Lori"),
                     age=c(22, 28, 19, 25, 27, 18),
                     gender=c("female","male","female","female","male", "female"))

    name age gender
1  Laura  22 female
2   Paul  28   male
3 Ashley  19 female
4  Sarah  25 female
5   Mark  27   male
6   Lori  18 female

# Create a dataframe called "group2"
group2 <- data.frame(name=c("Laura", "Paul", "Ashley", "Sarah", "Mark", "Sacha"),
                     car=c("Mazda", "Toyota", "Nissan", "Toyota", "Chevrolet", "Honda"))

    name       car
1  Laura     Mazda
2   Paul    Toyota
3 Ashley    Nissan
4  Sarah    Toyota
5   Mark Chevrolet
6  Sacha     Honda

Mutating join

Left join

To join matching rows from group2 to group1:


dplyr::left_join(group1, group2, by = "name")

    name age gender       car
1  Laura  22 female     Mazda
2   Paul  28   male    Toyota
3 Ashley  19 female    Nissan
4  Sarah  25 female    Toyota
5   Mark  27   male Chevrolet
6   Lori  18 female      <NA>

Right join

To join matching rows from group1 to group2:


dplyr::right_join(group1, group2, by = "name")

    name age gender       car
1  Laura  22 female     Mazda
2   Paul  28   male    Toyota
3 Ashley  19 female    Nissan
4  Sarah  25 female    Toyota
5   Mark  27   male Chevrolet
6  Sacha  NA   <NA>     Honda

Inner join

To join data by retaining only rows in both sets:


dplyr::inner_join(group1, group2, by = "name")

    name age gender       car
1  Laura  22 female     Mazda
2   Paul  28   male    Toyota
3 Ashley  19 female    Nissan
4  Sarah  25 female    Toyota
5   Mark  27   male Chevrolet

Full join

To join data by retaining all values, all rows:


dplyr::full_join(group1, group2, by = "name")

    name age gender       car
1  Laura  22 female     Mazda
2   Paul  28   male    Toyota
3 Ashley  19 female    Nissan
4  Sarah  25 female    Toyota
5   Mark  27   male Chevrolet
6   Lori  18 female      <NA>
7  Sacha  NA   <NA>     Honda

Filtering join

Semi join

To join all rows in group1 that have a match in group2.


dplyr::semi_join(group1, group2, by = "name")

    name age gender
1  Laura  22 female
2   Paul  28   male
3 Ashley  19 female
4  Sarah  25 female
5   Mark  27   male

Anti join

To join all rows in group1 that do not have a match in group2.


dplyr::anti_join(group1, group2, by = "name")

  name age gender
1 Lori  18 female

To join all rows in group2 that do not have a match in group1, reverse the names.


dplyr::anti_join(group2, group1, by = "name")

   name   car
1 Sacha Honda

Set operations

To use these functions you need to have same column names in each data frames, unless it won’t work.

Let’s create others dataframes:


# Create a dataframe called "group1"

y <- data.frame(letter=c("A", "B", "C"), 
                          number=c(1, 2, 3))
y

  letter number
1      A      1
2      B      2
3      C      3

# Create a dataframe called "group2"
z <- data.frame(letter=c("B", "C", "D"), 
                          number=c(2, 3, 3))
z

  letter number
1      B      2
2      C      3
3      D      3

Intersect

To get rows that appear in both y and z:


dplyr::intersect(y, z)

  letter number
1      B      2
2      C      3

Union

To get rows that appear in either or both y and z.


dplyr::union(y, z)

  letter number
1      A      1
2      B      2
3      C      3
4      D      3

Setdiff

To get rows that appear in y but not z.


dplyr::setdiff(y, z)

  letter number
1      A      1

Binding

Rows

To append z to y as new rows:


dplyr::bind_rows(y, z)

  letter number
1      A      1
2      B      2
3      C      3
4      B      2
5      C      3
6      D      3

Columns

To append z to y as new columns:


dplyr::bind_cols(y, z)

  letter number letter1 number1
1      A      1       B       2
2      B      2       C       3
3      C      3       D       3

Caution: matches rows by position!


Warin, Thierry. 2019. “SKEMA Quantum Studio: A Technological Framework for Data Science in Higher Education.” https://doi.org/10.6084/m9.figshare.8204195.v2.

Citation

For attribution, please cite this work as

Warin (2019, May 27). Virtual Campus: Data Wrangling with R: Combining Tables. Retrieved from https://virtualcampus.skemagloballab.io/posts/datawranglingwithr3/

BibTeX citation

@misc{warin2019data,
  author = {Warin, Thierry},
  title = {Virtual Campus: Data Wrangling with R: Combining Tables},
  url = {https://virtualcampus.skemagloballab.io/posts/datawranglingwithr3/},
  year = {2019}
}