Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

By using SKEMA Quantum Studio framework (Warin 2019), this course will teach you how to manipulate data with R.

In the course, we will introduced you a new way to refer a function to its package. In fact, we will use the code line `Library(package)`

to mention the needed package and use any function of that package anywhere under that line without having to refer to it everytime. It will be easier as we will use the piping syntax in this session.

This package will be used for this course:

```
library(dplyr)
```

In this course we will use the “piping” syntax to compute summary statistics.

A pipe is represented by this sign: %>%.

The point of the pipe is to help you write code in a way that is easier to read and understand. In fact, “piping” makes your code more readable. Pipes are a powerful tool for clearly expressing a sequence of multiple operations.

To see why the pipe is so useful, let’s take a look at the most basic example that use the operator before you go into the more advanced usages of the operator in the following sections.

`log(x)`

can be rewritten as `x %>% log()`

In short, this means that functions that take one argument, `function(argument)`

, can be rewritten as follows: `argument %>% function()`

.

First, load the data on which summary statistics will be done.

```
library(gsheet)
dataFull <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1uLaXke-KPN28-ESPPoihk8TiXVWp5xuNGHW7w7yqLCc/edit#gid=329712902")
```

rank | company | country | industrial.sector | RD.usd | sales.usd | year |
---|---|---|---|---|---|---|

1 | VOLKSWAGEN | Germany | Automobiles Parts | 14712470286 | NA | 2015 |

2 | DAIMLER | Germany | Automobiles Parts | 6335781792 | 145635513797 | 2015 |

3 | ROBERT BOSCH | Germany | Automobiles Parts | 5653984389 | 54892540624 | 2015 |

4 | SANOFI | France | Pharmaceuticals Biotechnology | 5396067608 | 37868911705 | 2015 |

5 | BMW | Germany | Automobiles Parts | 5120208790 | 90159856973 | 2015 |

6 | SIEMENS | Germany | Electronic Electrical Equipment | 4908268479 | 80649456021 | 2015 |

Here some data on R&D and Sales in USD of companies per country and industrial sector from 2004 to 2015.

To compute summary statistics on the data, the `aggregate`

function is used. The function allow you to split the data into subsets, compute summary statistics for each such as the mean and the sum, and returns the result in a convenient form. We will also use another way to compute summary statistics by piping the following functions: `group_by()`

, `summarize()`

, `mean()`

, `sum()`

and `ungroup()`

.

To calculate the sum of R&D per Country:

```
sum_RD_Country <- aggregate(RD.usd ~ country, data = dataFull, sum)
```

Detailed arguments of the function:

`RD.usd`

: The value on which to average

`country`

: To obtain the average RD per country

`dataFull`

: Data used

`sum`

: To calculate the sum

country | RD.usd |
---|---|

Australia | 31701141962 |

Austria | 15967871461 |

Belgium | 50598161005 |

Bermuda | 25757237590 |

Brazil | 27585939295 |

Bulgaria | 13681470 |

To calculate the mean of R&D per Country:

```
mean_RD_Country <- aggregate(RD.usd ~ country, data = dataFull, mean)
```

Detailed arguments of the function:

`RD.usd`

: The value on which to average

`country`

: To obtain the average RD per country

`dataFull`

: Data used

`mean`

: To calculate the average

country | RD.usd |
---|---|

Australia | 320213555 |

Austria | 44110142 |

Belgium | 109996002 |

Bermuda | 429287293 |

Brazil | 452228513 |

Bulgaria | 6840735 |

To calculate the sum of R&D per Country and Year:

```
sum_RD_CountryYear <- aggregate(RD.usd ~ country + year, data = dataFull, sum)
```

Detailed arguments of the function:

`RD.usd`

: The value on which to average

`country + year`

: To obtain the sum of R&D per country and per year

`dataFull`

: Data used

`sum`

: To calculate the sum

country | year | RD.usd |
---|---|---|

Australia | 2004 | 1326329413 |

Austria | 2004 | 815106978 |

Belgium | 2004 | 3918844424 |

Bermuda | 2004 | 2855248522 |

Brazil | 2004 | 936977342 |

Canada | 2004 | 7392941158 |

To calculate the mean of R&D per Country and Year:

```
mean_RD_CountryYear <- aggregate(RD.usd ~ country + year, data = dataFull, mean)
```

Detailed arguments of the function:

`RD.usd`

: The value on which to average

`country + year`

: To obtain the sum of R&D per country per year

`dataFull`

: Data used

`mean`

: To calculate the average

country | year | RD.usd |
---|---|---|

Australia | 2004 | 331582353 |

Austria | 2004 | 42900367 |

Belgium | 2004 | 122463888 |

Bermuda | 2004 | 475874754 |

Brazil | 2004 | 234244336 |

Canada | 2004 | 568687781 |

To calculate the sum of R&D and Sales per Country, using the cbind() function:

```
sum_RDSales_Country <- aggregate(cbind(RD.usd, sales.usd) ~ country, data = dataFull, sum)
```

Detailed arguments of the function:

`cbind()`

: To put the columns RD.usd and sales.usd next to each other, otherwise the aggregate function will sum the two columns.

`RD.usd, sales.usd`

: Values on which to average.

`country`

: To obtain the average of RD per country.

`dataFull`

: Data used

`sum`

: To calculate the sum

country | RD.usd | sales.usd |
---|---|---|

Australia | 28738126700 | 8.908038e+11 |

Austria | 6877398846 | 6.357453e+11 |

Belgium | 21422944979 | 1.013106e+12 |

Bermuda | 25757237590 | 6.820938e+11 |

Brazil | 25515764646 | 1.906141e+12 |

Canada | 51566211570 | 1.101636e+12 |

To calculate the sum of R&D and Sales per Country, without the cbind() function:

```
sum2_RDSales_Country <- aggregate(RD.usd + sales.usd ~ country, data = dataFull, sum)
```

Detailed arguments of the function:

`RD.usd + sales.usd`

: Values on which to average.

`country`

: To obtain the average of RD per country.

`dataFull`

: Data used

`sum`

: To calculate the sum

country | RD.usd + sales.usd |
---|---|

Australia | 9.195419e+11 |

Austria | 6.426227e+11 |

Belgium | 1.034529e+12 |

Bermuda | 7.078511e+11 |

Brazil | 1.931657e+12 |

Canada | 1.153202e+12 |

You can see the difference with the operation above. The columns RD.usd and sales.usd are put in the same column and sum together.

To calculate the mean of R&D and Sales per Country, using the cbind() function:

```
mean_RDSales_Country <- aggregate(cbind(RD.usd, sales.usd) ~ country, data = dataFull, mean)
```

Detailed arguments of the function:

`cbind()`

: To put the columns RD.usd and sales.usd next to each other, otherwise the aggregate function will do the mean of the two columns

`RD.usd, sales.usd`

: Values on which to average

`country`

: To obtain the average of RD per country

`dataFull`

: Data used

`mean`

: To calculate the average

country | RD.usd | sales.usd |
---|---|---|

Australia | 312370942 | 9682650226 |

Austria | 54152747 | 5005868443 |

Belgium | 132240401 | 6253742481 |

Bermuda | 429287293 | 11368230610 |

Brazil | 439926977 | 32864497385 |

Canada | 280251150 | 5987153638 |

To calculate the mean of R&D and Sales per Country, without the cbind() function:

```
mean2_RDSales_Country <- aggregate(RD.usd + sales.usd ~ country, data = dataFull, mean)
```

Detailed arguments of the function:

`RD.usd + sales.usd`

: Values on which to average

`country`

: To obtain the average of RD per country

`dataFull`

: Data used

`mean`

: To calculate the average

country | RD.usd + sales.usd |
---|---|

Australia | 9995021168 |

Austria | 5060021189 |

Belgium | 6385982882 |

Bermuda | 11797517903 |

Brazil | 33304424361 |

Canada | 6267404788 |

You can see the difference with the operation above. The columns RD.usd and sales.usd are put in the same column and the average is done on both of them together.

To calculate the sum of R&D and Sales per Country and Year:

```
sum_RDSales_CountryYear <- aggregate(cbind(RD.usd, sales.usd) ~ country + year, data = dataFull, sum)
```

Detailed arguments of the function:

`cbind(RD.usd, sales.usd)`

: Values on which to average

`country + year`

: To obtain the average RD per country and per year

`dataFull`

: Data used

`sum`

: To calculate the sum

country | year | RD.usd | sales.usd |
---|---|---|---|

Australia | 2004 | 1326329413 | 59001083381 |

Bermuda | 2004 | 2855248522 | 109528219651 |

Brazil | 2004 | 936977342 | 82666092199 |

Canada | 2004 | 7392941158 | 117503606183 |

China | 2004 | 1187048998 | 191124385020 |

Croatia | 2004 | 272229903 | 2703306251 |

To calculate the mean of R&D and Sales per Country and Year:

```
mean_RDSales_CountryYear <- aggregate(cbind(RD.usd, sales.usd) ~ country + year, data = dataFull, mean)
```

Detailed arguments of the function:

`cbind(RD.usd, sales.usd)`

: Values on which to average

`country + year`

: To obtain the average RD per country and per year

`dataFull`

: Data used

`mean`

: To calculate the average

country | year | RD.usd | sales.usd |
---|---|---|---|

Australia | 2004 | 331582353 | 14750270845 |

Bermuda | 2004 | 475874754 | 18254703275 |

Brazil | 2004 | 234244336 | 20666523050 |

Canada | 2004 | 568687781 | 9038738937 |

China | 2004 | 296762249 | 47781096255 |

Croatia | 2004 | 136114952 | 1351653125 |

To perform this last operation, do a subset of dataFull by keeping the columns 2 to 5 (company, country, industrial.sector and RD.usd).

```
datasubset <- dataFull[2:5]
```

company | country | industrial.sector | RD.usd |
---|---|---|---|

VOLKSWAGEN | Germany | Automobiles Parts | 14712470286 |

DAIMLER | Germany | Automobiles Parts | 6335781792 |

ROBERT BOSCH | Germany | Automobiles Parts | 5653984389 |

SANOFI | France | Pharmaceuticals Biotechnology | 5396067608 |

BMW | Germany | Automobiles Parts | 5120208790 |

SIEMENS | Germany | Electronic Electrical Equipment | 4908268479 |

To calculate the mean of R&D per Country, Company and Industrial Sector:

```
sum_RD_CountryCompanyIndustry <- aggregate(RD.usd ~ ., data = datasubset, sum)
```

Detailed arguments of the function:

`RD.usd`

: The value on which to average

`.`

: To obtain the sum of R&D on every column

`dataFull`

: Data used

`sum`

: To calculate the sum

company | country | industrial.sector | RD.usd |
---|---|---|---|

FACC | Austria | Aerospace Defence | 12567800 |

SABCA | Belgium | Aerospace Defence | 127281394 |

EMBRAER | Brazil | Aerospace Defence | 2684133551 |

BOMBARDIER | Canada | Aerospace Defence | 6920050267 |

CAE | Canada | Aerospace Defence | 309496291 |

AERO VODOCHODY | Czech Republic | Aerospace Defence | 17266478 |

To calculate the mean of R&D per Country, Company and Industrial Sector:

```
mean_RD_CountryCompanyIndustry<- aggregate(RD.usd ~ ., data = datasubset, mean)
```

Detailed arguments of the function:

`RD.usd`

: The value on which to average

`.`

: To obtain the average of R&D on every column

`dataFull`

: Data used

`mean`

: To calculate the average

company | country | industrial.sector | RD.usd |
---|---|---|---|

FACC | Austria | Aerospace Defence | 12567800 |

SABCA | Belgium | Aerospace Defence | 15910174 |

EMBRAER | Brazil | Aerospace Defence | 206471812 |

BOMBARDIER | Canada | Aerospace Defence | 532311559 |

CAE | Canada | Aerospace Defence | 103165430 |

AERO VODOCHODY | Czech Republic | Aerospace Defence | 8633239 |

To calculate the sum of R&D per Country:

```
sum_RD_Country <- dataFull %>%
group_by(country) %>%
summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE)) %>%
ungroup()
```

Detailed arguments of the function:

`dataFull`

: Data used

`group_by`

: Convert into a grouped tbl

`country`

: The column on which to group_by

`summarize`

: To reduce multiple values down to a single value

`sum`

: To calculate the sum

`RD.usd`

: The value on which to sum

`ungroup`

: Remove grouping

country | RDusd_sum |
---|---|

Australia | 31701141962 |

Austria | 15967871461 |

Belgium | 50598161005 |

Bermuda | 25757237590 |

Brazil | 27585939295 |

Bulgaria | 13681470 |

To calculate the mean of R&D per Country:

```
mean_RD_Country <- dataFull %>%
group_by(country) %>%
summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE)) %>%
ungroup()
```

Detailed arguments of the function:

`dataFull`

: Data used

`group_by`

: Convert into a grouped tbl

`country`

: The column on which to group_by

`summarize`

: To reduce multiple values down to a single value

`mean`

: To calculate the average

`RD.usd`

: The value on which to average

`ungroup`

: Remove grouping

country | RDusd_mean |
---|---|

Australia | 320213555 |

Austria | 44110142 |

Belgium | 109996002 |

Bermuda | 429287293 |

Brazil | 452228513 |

Bulgaria | 6840735 |

To calculate the sum of R&D per Country and Year:

```
sum_RD_CountryYear <- dataFull %>%
group_by(country, year) %>%
summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE)) %>%
ungroup()
```

Detailed arguments of the function:

`dataFull`

: Data used

`group_by`

: Convert into a grouped tbl

`country`

and `year`

: Columns on which to group_by

`summarize`

: To reduce multiple values down to a single value

`sum`

: To calculate the sum

`RD.usd`

: The value on which to sum

`ungroup`

: Remove grouping

country | year | RDusd_sum |
---|---|---|

Australia | 2004 | 1326329413 |

Australia | 2005 | 409801916 |

Australia | 2006 | 756237929 |

Australia | 2007 | 1046598742 |

Australia | 2008 | 1960352411 |

Australia | 2009 | 1936156478 |

To calculate the mean of R&D per Country and Year:

```
mean_RD_CountryYear <- dataFull %>%
group_by(country, year) %>%
summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE)) %>%
ungroup()
```

Detailed arguments of the function:

`dataFull`

: Data used

`group_by`

: Convert into a grouped tbl

`country`

and `year`

: Columns on which to group_by

`summarize`

: To reduce multiple values down to a single value

`mean`

: To calculate the average

`RD.usd`

: The value on which to average

`ungroup`

: Remove grouping

country | year | RDusd_mean |
---|---|---|

Australia | 2004 | 331582353 |

Australia | 2005 | 409801916 |

Australia | 2006 | 189059482 |

Australia | 2007 | 149514106 |

Australia | 2008 | 245044051 |

Australia | 2009 | 242019560 |

To calculate the sum of R&D and Sales per Country:

```
sum_RDSales_Country <- dataFull %>%
group_by(country) %>%
summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE),
salesusd_sum = sum(sales.usd, na.rm=TRUE)) %>%
ungroup()
```

Detailed arguments of the function:

`dataFull`

: Data used

`group_by`

: Convert into a grouped tbl

`country`

: The column on which to group_by

`summarize`

: To reduce multiple values down to a single value

`sum`

: To calculate the sum

`RD.usd`

and `sales.usd`

: Values on which to sum

`ungroup`

: Remove grouping

country | RDusd_sum | salesusd_sum |
---|---|---|

Australia | 31701141962 | 8.908038e+11 |

Austria | 15967871461 | 6.357453e+11 |

Belgium | 50598161005 | 1.013106e+12 |

Bermuda | 25757237590 | 6.820938e+11 |

Brazil | 27585939295 | 1.906141e+12 |

Bulgaria | 13681470 | 0.000000e+00 |

To calculate the mean of R&D and Sales per Country:

```
mean_RD_Country <- dataFull %>%
group_by(country) %>%
summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE),
salesusd_mean = mean(sales.usd, na.rm=TRUE)) %>%
ungroup()
```

Detailed arguments of the function:

`dataFull`

: Data used

`group_by`

: Convert into a grouped tbl

`country`

: The column on which to group_by

`summarize`

: To reduce multiple values down to a single value

`mean`

: To calculate the average

`RD.usd`

and `sales.usd`

: Values on which to average

`ungroup`

: Remove grouping

country | RDusd_mean | salesusd_mean |
---|---|---|

Australia | 320213555 | 9682650226 |

Austria | 44110142 | 5005868443 |

Belgium | 109996002 | 6253742481 |

Bermuda | 429287293 | 11368230610 |

Brazil | 452228513 | 32864497385 |

Bulgaria | 6840735 | NaN |

To calculate summary statistics per country:

```
summary_RD_Country <- dataFull %>%
group_by(country) %>%
summarize(RD.usd_mean = mean(RD.usd, na.rm = TRUE),
RD.usd_median = median(RD.usd, na.rm = TRUE),
RD.usd_min = min(RD.usd, na.rm = TRUE),
RD.usd_max = max(RD.usd, na.rm = TRUE),
RD.usd_q0.25 = quantile(RD.usd, 0.25, na.rm = TRUE)) %>%
ungroup()
```

country | RD.usd_mean | RD.usd_median | RD.usd_min | RD.usd_max | RD.usd_q0.25 |
---|---|---|---|---|---|

Australia | 320213555 | 153538583 | 40988873 | 1404025144 | 100332737 |

Austria | 44110142 | 26177086 | 4383595 | 177854217 | 11918427 |

Belgium | 109996002 | 28034132 | 4082317 | 1302354760 | 13447074 |

Bermuda | 429287293 | 286303239 | 43881142 | 1144833914 | 108475235 |

Brazil | 452228513 | 216834284 | 41363693 | 1632838868 | 90044131 |

Bulgaria | 6840735 | 6840735 | 6715507 | 6965964 | 6778121 |

To calculate the sum of R&D and Sales per Country and Year:

```
sum_RDSales_CountryYear <- dataFull %>%
group_by(country, year) %>%
summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE),
salesusd_sum = sum(sales.usd, na.rm=TRUE)) %>%
ungroup()
```

Detailed arguments of the function:

`dataFull`

: Data used

`group_by`

: Convert into a grouped tbl

`country`

and `year`

: Columns on which to group_by

`summarize`

: To reduce multiple values down to a single value

`sum`

: To calculate the sum

`RD.usd`

and `sales.usd`

: Values on which to sum

`ungroup`

: Remove grouping

country | year | RDusd_sum | salesusd_sum |
---|---|---|---|

Australia | 2004 | 1326329413 | 59001083381 |

Australia | 2005 | 409801916 | 18907890659 |

Australia | 2006 | 756237929 | 25341999677 |

Australia | 2007 | 1046598742 | 66264859636 |

Australia | 2008 | 1960352411 | 54140271458 |

Australia | 2009 | 1936156478 | 48945981160 |

To calculate the mean of R&D and Sales per Country and Year:

```
mean_RDSales_CountryYear <- dataFull %>%
group_by(country, year) %>%
summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE),
salesusd_mean = mean(sales.usd, na.rm=TRUE)) %>%
ungroup()
```

Detailed arguments of the function:

`dataFull`

: Data used

`group_by`

: Convert into a grouped tbl

`country`

and `year`

: Columns on which to group_by

`summarize`

: To reduce multiple values down to a single value

`mean`

: To calculate the average

`RD.usd`

and `sales.usd`

: Values on which to average

`ungroup`

: Remove grouping

country | year | RDusd_mean | salesusd_mean |
---|---|---|---|

Australia | 2004 | 331582353 | 14750270845 |

Australia | 2005 | 409801916 | 18907890659 |

Australia | 2006 | 189059482 | 6335499919 |

Australia | 2007 | 149514106 | 9466408519 |

Australia | 2008 | 245044051 | 6767533932 |

Australia | 2009 | 242019560 | 6118247645 |

Warin, Thierry. 2019. “SKEMA Quantum Studio: A Technological Framework for Data Science in Higher Education.” https://doi.org/10.6084/m9.figshare.8204195.v2.

For attribution, please cite this work as

Warin (2019, May 28). Virtual Campus: Data Wrangling with R: Computing Summary Statistics. Retrieved from https://virtualcampus.skemagloballab.io/posts/datawranglingwithr4/

BibTeX citation

@misc{warin2019data, author = {Warin, Thierry}, title = {Virtual Campus: Data Wrangling with R: Computing Summary Statistics}, url = {https://virtualcampus.skemagloballab.io/posts/datawranglingwithr4/}, year = {2019} }