# Data Wrangling with R: Computing Summary Statistics

Splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.

Thierry Warin https://www.nuance-r.com/principalInvestigator.html (SKEMA Business School (Raleigh, NC))https://www.skema-bs.fr/campus/campus-raleighl
05-28-2019 By using SKEMA Quantum Studio framework (Warin 2019), this course will teach you how to manipulate data with R.

## Set up

### Library

In the course, we will introduced you a new way to refer a function to its package. In fact, we will use the code line `Library(package)` to mention the needed package and use any function of that package anywhere under that line without having to refer to it everytime. It will be easier as we will use the piping syntax in this session.

This package will be used for this course:

``````
library(dplyr)``````

### Piping

In this course we will use the “piping” syntax to compute summary statistics.

A pipe is represented by this sign: %>%.

The point of the pipe is to help you write code in a way that is easier to read and understand. In fact, “piping” makes your code more readable. Pipes are a powerful tool for clearly expressing a sequence of multiple operations.

To see why the pipe is so useful, let’s take a look at the most basic example that use the operator before you go into the more advanced usages of the operator in the following sections.

`log(x)` can be rewritten as `x %>% log()`

In short, this means that functions that take one argument, `function(argument)`, can be rewritten as follows: `argument %>% function()`.

### Data

First, load the data on which summary statistics will be done.

``````
library(gsheet)

rank company country industrial.sector RD.usd sales.usd year
1 VOLKSWAGEN Germany Automobiles Parts 14712470286 NA 2015
2 DAIMLER Germany Automobiles Parts 6335781792 145635513797 2015
3 ROBERT BOSCH Germany Automobiles Parts 5653984389 54892540624 2015
4 SANOFI France Pharmaceuticals Biotechnology 5396067608 37868911705 2015
5 BMW Germany Automobiles Parts 5120208790 90159856973 2015
6 SIEMENS Germany Electronic Electrical Equipment 4908268479 80649456021 2015

Here some data on R&D and Sales in USD of companies per country and industrial sector from 2004 to 2015.

### Functions

To compute summary statistics on the data, the `aggregate` function is used. The function allow you to split the data into subsets, compute summary statistics for each such as the mean and the sum, and returns the result in a convenient form. We will also use another way to compute summary statistics by piping the following functions: `group_by()`, `summarize()`, `mean()`, `sum()` and `ungroup()`.

## One ~ one

### Sum

To calculate the sum of R&D per Country:

``````
sum_RD_Country <- aggregate(RD.usd ~ country, data = dataFull, sum)``````

Detailed arguments of the function:
`RD.usd`: The value on which to average
`country`: To obtain the average RD per country
`dataFull`: Data used
`sum`: To calculate the sum

country RD.usd
Australia 31701141962
Austria 15967871461
Belgium 50598161005
Bermuda 25757237590
Brazil 27585939295
Bulgaria 13681470

### Mean

To calculate the mean of R&D per Country:

``````
mean_RD_Country <- aggregate(RD.usd ~ country, data = dataFull, mean)``````

Detailed arguments of the function:
`RD.usd`: The value on which to average
`country`: To obtain the average RD per country
`dataFull`: Data used
`mean`: To calculate the average

country RD.usd
Australia 320213555
Austria 44110142
Belgium 109996002
Bermuda 429287293
Brazil 452228513
Bulgaria 6840735

## One ~ many

### Sum

To calculate the sum of R&D per Country and Year:

``````
sum_RD_CountryYear <- aggregate(RD.usd ~ country + year, data = dataFull, sum)``````

Detailed arguments of the function:
`RD.usd`: The value on which to average
`country + year` : To obtain the sum of R&D per country and per year
`dataFull`: Data used
`sum`: To calculate the sum

country year RD.usd
Australia 2004 1326329413
Austria 2004 815106978
Belgium 2004 3918844424
Bermuda 2004 2855248522
Brazil 2004 936977342

### Mean

To calculate the mean of R&D per Country and Year:

``````
mean_RD_CountryYear <- aggregate(RD.usd ~ country + year, data = dataFull, mean)``````

Detailed arguments of the function:
`RD.usd`: The value on which to average
`country + year` : To obtain the sum of R&D per country per year
`dataFull`: Data used
`mean`: To calculate the average

country year RD.usd
Australia 2004 331582353
Austria 2004 42900367
Belgium 2004 122463888
Bermuda 2004 475874754
Brazil 2004 234244336

## Many ~ one

### Sum

To calculate the sum of R&D and Sales per Country, using the cbind() function:

``````
sum_RDSales_Country <- aggregate(cbind(RD.usd, sales.usd) ~ country, data = dataFull, sum)``````

Detailed arguments of the function:
`cbind()`: To put the columns RD.usd and sales.usd next to each other, otherwise the aggregate function will sum the two columns.
`RD.usd, sales.usd`: Values on which to average.
`country`: To obtain the average of RD per country.
`dataFull`: Data used
`sum`: To calculate the sum

country RD.usd sales.usd
Australia 28738126700 8.908038e+11
Austria 6877398846 6.357453e+11
Belgium 21422944979 1.013106e+12
Bermuda 25757237590 6.820938e+11
Brazil 25515764646 1.906141e+12

To calculate the sum of R&D and Sales per Country, without the cbind() function:

``````
sum2_RDSales_Country <- aggregate(RD.usd + sales.usd ~ country, data = dataFull, sum)``````

Detailed arguments of the function:
`RD.usd + sales.usd`: Values on which to average.
`country`: To obtain the average of RD per country.
`dataFull`: Data used
`sum`: To calculate the sum

country RD.usd + sales.usd
Australia 9.195419e+11
Austria 6.426227e+11
Belgium 1.034529e+12
Bermuda 7.078511e+11
Brazil 1.931657e+12

You can see the difference with the operation above. The columns RD.usd and sales.usd are put in the same column and sum together.

### Mean

To calculate the mean of R&D and Sales per Country, using the cbind() function:

``````
mean_RDSales_Country <- aggregate(cbind(RD.usd, sales.usd) ~ country, data = dataFull, mean)``````

Detailed arguments of the function:
`cbind()`: To put the columns RD.usd and sales.usd next to each other, otherwise the aggregate function will do the mean of the two columns
`RD.usd, sales.usd`: Values on which to average
`country`: To obtain the average of RD per country
`dataFull`: Data used
`mean`: To calculate the average

country RD.usd sales.usd
Australia 312370942 9682650226
Austria 54152747 5005868443
Belgium 132240401 6253742481
Bermuda 429287293 11368230610
Brazil 439926977 32864497385

To calculate the mean of R&D and Sales per Country, without the cbind() function:

``````
mean2_RDSales_Country <- aggregate(RD.usd + sales.usd ~ country, data = dataFull, mean)``````

Detailed arguments of the function:
`RD.usd + sales.usd`: Values on which to average
`country`: To obtain the average of RD per country
`dataFull`: Data used
`mean`: To calculate the average

country RD.usd + sales.usd
Australia 9995021168
Austria 5060021189
Belgium 6385982882
Bermuda 11797517903
Brazil 33304424361

You can see the difference with the operation above. The columns RD.usd and sales.usd are put in the same column and the average is done on both of them together.

## Many ~ many

### Sum

To calculate the sum of R&D and Sales per Country and Year:

``````
sum_RDSales_CountryYear <- aggregate(cbind(RD.usd, sales.usd) ~ country + year, data = dataFull, sum)``````

Detailed arguments of the function:
`cbind(RD.usd, sales.usd)`: Values on which to average
`country + year`: To obtain the average RD per country and per year
`dataFull`: Data used
`sum`: To calculate the sum

country year RD.usd sales.usd
Australia 2004 1326329413 59001083381
Bermuda 2004 2855248522 109528219651
Brazil 2004 936977342 82666092199
China 2004 1187048998 191124385020
Croatia 2004 272229903 2703306251

### Mean

To calculate the mean of R&D and Sales per Country and Year:

``````
mean_RDSales_CountryYear <- aggregate(cbind(RD.usd, sales.usd) ~ country + year, data = dataFull, mean)``````

Detailed arguments of the function:
`cbind(RD.usd, sales.usd)`: Values on which to average
`country + year`: To obtain the average RD per country and per year
`dataFull`: Data used
`mean`: To calculate the average

country year RD.usd sales.usd
Australia 2004 331582353 14750270845
Bermuda 2004 475874754 18254703275
Brazil 2004 234244336 20666523050
China 2004 296762249 47781096255
Croatia 2004 136114952 1351653125

## Dot notation

To perform this last operation, do a subset of dataFull by keeping the columns 2 to 5 (company, country, industrial.sector and RD.usd).

``````
datasubset <- dataFull[2:5]``````
company country industrial.sector RD.usd
VOLKSWAGEN Germany Automobiles Parts 14712470286
DAIMLER Germany Automobiles Parts 6335781792
ROBERT BOSCH Germany Automobiles Parts 5653984389
SANOFI France Pharmaceuticals Biotechnology 5396067608
BMW Germany Automobiles Parts 5120208790
SIEMENS Germany Electronic Electrical Equipment 4908268479

### Sum

To calculate the mean of R&D per Country, Company and Industrial Sector:

``````
sum_RD_CountryCompanyIndustry <- aggregate(RD.usd ~ ., data = datasubset, sum)``````

Detailed arguments of the function:
`RD.usd`: The value on which to average
`.`: To obtain the sum of R&D on every column
`dataFull`: Data used
`sum`: To calculate the sum

company country industrial.sector RD.usd
FACC Austria Aerospace Defence 12567800
SABCA Belgium Aerospace Defence 127281394
EMBRAER Brazil Aerospace Defence 2684133551
AERO VODOCHODY Czech Republic Aerospace Defence 17266478

### Mean

To calculate the mean of R&D per Country, Company and Industrial Sector:

``````
mean_RD_CountryCompanyIndustry<- aggregate(RD.usd ~ ., data = datasubset, mean)``````

Detailed arguments of the function:
`RD.usd`: The value on which to average
`.`: To obtain the average of R&D on every column
`dataFull`: Data used
`mean`: To calculate the average

company country industrial.sector RD.usd
FACC Austria Aerospace Defence 12567800
SABCA Belgium Aerospace Defence 15910174
EMBRAER Brazil Aerospace Defence 206471812
AERO VODOCHODY Czech Republic Aerospace Defence 8633239

## Group one by one

### Sum

To calculate the sum of R&D per Country:

``````
sum_RD_Country <- dataFull %>%
group_by(country) %>%
summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE)) %>%
ungroup()``````

Detailed arguments of the function:
`dataFull`: Data used
`group_by`: Convert into a grouped tbl
`country`: The column on which to group_by
`summarize`: To reduce multiple values down to a single value
`sum`: To calculate the sum
`RD.usd`: The value on which to sum
`ungroup`: Remove grouping

country RDusd_sum
Australia 31701141962
Austria 15967871461
Belgium 50598161005
Bermuda 25757237590
Brazil 27585939295
Bulgaria 13681470

### Mean

To calculate the mean of R&D per Country:

``````
mean_RD_Country <- dataFull %>%
group_by(country) %>%
summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE)) %>%
ungroup()``````

Detailed arguments of the function:
`dataFull`: Data used
`group_by`: Convert into a grouped tbl
`country`: The column on which to group_by
`summarize`: To reduce multiple values down to a single value
`mean`: To calculate the average
`RD.usd`: The value on which to average
`ungroup`: Remove grouping

country RDusd_mean
Australia 320213555
Austria 44110142
Belgium 109996002
Bermuda 429287293
Brazil 452228513
Bulgaria 6840735

## Group one by many

### Sum

To calculate the sum of R&D per Country and Year:

``````
sum_RD_CountryYear <- dataFull %>%
group_by(country, year) %>%
summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE)) %>%
ungroup()``````

Detailed arguments of the function:
`dataFull`: Data used
`group_by`: Convert into a grouped tbl
`country` and `year`: Columns on which to group_by
`summarize`: To reduce multiple values down to a single value
`sum`: To calculate the sum
`RD.usd`: The value on which to sum
`ungroup`: Remove grouping

country year RDusd_sum
Australia 2004 1326329413
Australia 2005 409801916
Australia 2006 756237929
Australia 2007 1046598742
Australia 2008 1960352411
Australia 2009 1936156478

### Mean

To calculate the mean of R&D per Country and Year:

``````
mean_RD_CountryYear <- dataFull %>%
group_by(country, year) %>%
summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE)) %>%
ungroup()``````

Detailed arguments of the function:
`dataFull`: Data used
`group_by`: Convert into a grouped tbl
`country` and `year`: Columns on which to group_by
`summarize`: To reduce multiple values down to a single value
`mean`: To calculate the average
`RD.usd`: The value on which to average
`ungroup`: Remove grouping

country year RDusd_mean
Australia 2004 331582353
Australia 2005 409801916
Australia 2006 189059482
Australia 2007 149514106
Australia 2008 245044051
Australia 2009 242019560

## Group many by one

### Sum

To calculate the sum of R&D and Sales per Country:

``````
sum_RDSales_Country <- dataFull %>%
group_by(country) %>%
summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE),
salesusd_sum = sum(sales.usd, na.rm=TRUE)) %>%
ungroup()``````

Detailed arguments of the function:
`dataFull`: Data used
`group_by`: Convert into a grouped tbl
`country`: The column on which to group_by
`summarize`: To reduce multiple values down to a single value
`sum`: To calculate the sum
`RD.usd` and `sales.usd`: Values on which to sum
`ungroup`: Remove grouping

country RDusd_sum salesusd_sum
Australia 31701141962 8.908038e+11
Austria 15967871461 6.357453e+11
Belgium 50598161005 1.013106e+12
Bermuda 25757237590 6.820938e+11
Brazil 27585939295 1.906141e+12
Bulgaria 13681470 0.000000e+00

### Mean

To calculate the mean of R&D and Sales per Country:

``````
mean_RD_Country <- dataFull %>%
group_by(country) %>%
summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE),
salesusd_mean = mean(sales.usd, na.rm=TRUE)) %>%
ungroup()``````

Detailed arguments of the function:
`dataFull`: Data used
`group_by`: Convert into a grouped tbl
`country`: The column on which to group_by
`summarize`: To reduce multiple values down to a single value
`mean`: To calculate the average
`RD.usd` and `sales.usd`: Values on which to average
`ungroup`: Remove grouping

country RDusd_mean salesusd_mean
Australia 320213555 9682650226
Austria 44110142 5005868443
Belgium 109996002 6253742481
Bermuda 429287293 11368230610
Brazil 452228513 32864497385
Bulgaria 6840735 NaN

### Summary statistics

To calculate summary statistics per country:

``````
summary_RD_Country <- dataFull %>%
group_by(country) %>%
summarize(RD.usd_mean = mean(RD.usd, na.rm = TRUE),
RD.usd_median = median(RD.usd, na.rm = TRUE),
RD.usd_min = min(RD.usd, na.rm = TRUE),
RD.usd_max = max(RD.usd, na.rm = TRUE),
RD.usd_q0.25 = quantile(RD.usd, 0.25, na.rm = TRUE)) %>%
ungroup()``````
country RD.usd_mean RD.usd_median RD.usd_min RD.usd_max RD.usd_q0.25
Australia 320213555 153538583 40988873 1404025144 100332737
Austria 44110142 26177086 4383595 177854217 11918427
Belgium 109996002 28034132 4082317 1302354760 13447074
Bermuda 429287293 286303239 43881142 1144833914 108475235
Brazil 452228513 216834284 41363693 1632838868 90044131
Bulgaria 6840735 6840735 6715507 6965964 6778121

## Group many by many

### Sum

To calculate the sum of R&D and Sales per Country and Year:

``````
sum_RDSales_CountryYear <- dataFull %>%
group_by(country, year) %>%
summarize(RDusd_sum = sum(RD.usd, na.rm=TRUE),
salesusd_sum = sum(sales.usd, na.rm=TRUE)) %>%
ungroup()``````

Detailed arguments of the function:
`dataFull`: Data used
`group_by`: Convert into a grouped tbl
`country` and `year`: Columns on which to group_by
`summarize`: To reduce multiple values down to a single value
`sum`: To calculate the sum
`RD.usd` and `sales.usd`: Values on which to sum
`ungroup`: Remove grouping

country year RDusd_sum salesusd_sum
Australia 2004 1326329413 59001083381
Australia 2005 409801916 18907890659
Australia 2006 756237929 25341999677
Australia 2007 1046598742 66264859636
Australia 2008 1960352411 54140271458
Australia 2009 1936156478 48945981160

### Mean

To calculate the mean of R&D and Sales per Country and Year:

``````
mean_RDSales_CountryYear <- dataFull %>%
group_by(country, year) %>%
summarize(RDusd_mean = mean(RD.usd, na.rm=TRUE),
salesusd_mean = mean(sales.usd, na.rm=TRUE)) %>%
ungroup()``````

Detailed arguments of the function:
`dataFull`: Data used
`group_by`: Convert into a grouped tbl
`country` and `year`: Columns on which to group_by
`summarize`: To reduce multiple values down to a single value
`mean`: To calculate the average
`RD.usd` and `sales.usd`: Values on which to average
`ungroup`: Remove grouping

country year RDusd_mean salesusd_mean
Australia 2004 331582353 14750270845
Australia 2005 409801916 18907890659
Australia 2006 189059482 6335499919
Australia 2007 149514106 9466408519
Australia 2008 245044051 6767533932
Australia 2009 242019560 6118247645

Warin, Thierry. 2019. “SKEMA Quantum Studio: A Technological Framework for Data Science in Higher Education.” https://doi.org/10.6084/m9.figshare.8204195.v2.

### Citation

`Warin (2019, May 28). Virtual Campus: Data Wrangling with R: Computing Summary Statistics. Retrieved from https://virtualcampus.skemagloballab.io/posts/datawranglingwithr4/`
```@misc{warin2019data,