Jan 5, 2019

Data visualization (Introduction to statistics)

First we need to load the R libraries. If you get an error when running the code below, it probably means that you don’t have the packages installed yet, so uncomment and run the install.packages lines first.

#install.packages('ggplot2')
#install.packages('dplyr')
#install.packages('plyr') # only needed for the plyr::ddply call further down
library(ggplot2)
library(dplyr)

Now we define the dataset.

data = data.frame(Type = as.character(c(1, 1, 1, 1, 1, 1, 
                                        1, 1, 1, 2, 2, 2,
                                        2, 2, 2, 2, 2, 2,
                                        3, 3, 3, 3, 3, 3, 3, 3, 3)),
                  factory = c('A', 'A', 'A', 'B', 'B', 'B',
                              'C', 'C', 'C', 'A', 'A', 'A',
                              'B', 'B', 'B', 'C', 'C', 'C',
                              'A', 'A','A', 'B', 'B', 'B', 'C', 'C', 'C'),
                  strength = c(14, 15, 16, 16, 12, 17, 
                               17, 15, 13, 15, 17, 19, 
                               20, 21, 22, 19, 22, 25, 
                               26, 25, 24, 25, 27, 29, 22, 23, 24))
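
Since the design is balanced (three measurements for every combination of Type and factory), the same data frame can be written more compactly with rep(). This is only a sketch of an equivalent definition, followed by two quick checks; the name data_compact is mine and is not used anywhere else in this post.

```r
# Same 27 measurements, with the repeated labels generated by rep()
data_compact <- data.frame(
  Type     = rep(c('1', '2', '3'), each = 9),
  factory  = rep(rep(c('A', 'B', 'C'), each = 3), times = 3),
  strength = c(14, 15, 16, 16, 12, 17, 17, 15, 13,
               15, 17, 19, 20, 21, 22, 19, 22, 25,
               26, 25, 24, 25, 27, 29, 22, 23, 24))

# Quick checks: column types, and the number of rows in every Type x factory cell
str(data_compact)
table(data_compact$Type, data_compact$factory)
```

Every cell of the table should contain 3, which confirms that each Type was measured three times in each factory.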

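Now we make use of a very handy function called summarise, which plyr::ddply applies to every combination of Type and factory. Note that the call is written as plyr::ddply, so the plyr package has to be installed, even though we never load it with library().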

summarizing <- 
  plyr::ddply(data, c("Type", "factory"), summarise, #here I define which variables I want to group by
                              N    = length(strength), #getting the N for each group
                              mean = mean(strength, na.rm = TRUE), #getting the mean
                              sd   = sd(strength, na.rm = TRUE), #getting the standard deviation
                              se   = sd / sqrt(N)) #getting the standard error of the mean

Now let us take a look at the result. It is kinda like the pivot table function from Excel (which I don’t use anymore because R, or pandas in Python, is much cooler lol).

head(summarizing)

##   Type factory N mean       sd        se
## 1    1       A 3   15 1.000000 0.5773503
## 2    1       B 3   15 2.645751 1.5275252
## 3    1       C 3   15 2.000000 1.1547005
## 4    2       A 3   17 2.000000 1.1547005
## 5    2       B 3   21 1.000000 0.5773503
## 6    2       C 3   22 3.000000 1.7320508
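
The same summary can also be built with dplyr alone, using group_by and summarise in a pipe. A minimal sketch, assuming a reasonably recent dplyr; the name summary_dplyr is mine, and the as.data.frame() at the end is only there so the result prints like the table above.

```r
library(dplyr)

# Same 27 measurements as in the dataset above, written compactly with rep()
data <- data.frame(
  Type     = rep(c('1', '2', '3'), each = 9),
  factory  = rep(rep(c('A', 'B', 'C'), each = 3), times = 3),
  strength = c(14, 15, 16, 16, 12, 17, 17, 15, 13,
               15, 17, 19, 20, 21, 22, 19, 22, 25,
               26, 25, 24, 25, 27, 29, 22, 23, 24))

summary_dplyr <- data %>%
  group_by(Type, factory) %>%                      # one group per Type x factory cell
  summarise(N    = n(),                            # observations per group
            mean = mean(strength, na.rm = TRUE),
            sd   = sd(strength, na.rm = TRUE),
            se   = sd / sqrt(N)) %>%               # standard error of the mean
  as.data.frame()                                  # plain data frame instead of a tibble

head(summary_dplyr)
```

Newer versions of dplyr print a message about the grouping of the output here; it is informational and does not affect the numbers.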

Now let us visualize. Here I plot a somewhat out-of-date kind of visualization: a bar chart of the means. Recent papers tend to plot something that gives a good idea of the distribution behind the dataset, so take a look at what violin plots, boxplots, scatterplots, and density plots can do for you.

Well, here we use the ggplot2 library.

ggplot(summarizing, aes(x = factory, y = mean, fill = Type))+
  geom_bar(stat = "identity", position="dodge")+
  ylab('Strength')+
  xlab('Factory')
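
The se column from the summary never makes it into the bar chart. Here is a minimal sketch of how it could be drawn as error bars of plus or minus one standard error with geom_errorbar. The summary is recomputed with dplyr only so that the block runs on its own, and the dodge width of 0.9 and the bar width of 0.2 are my own choices.

```r
library(ggplot2)
library(dplyr)

# Same 27 measurements as in the dataset above, written compactly with rep()
data <- data.frame(
  Type     = rep(c('1', '2', '3'), each = 9),
  factory  = rep(rep(c('A', 'B', 'C'), each = 3), times = 3),
  strength = c(14, 15, 16, 16, 12, 17, 17, 15, 13,
               15, 17, 19, 20, 21, 22, 19, 22, 25,
               26, 25, 24, 25, 27, 29, 22, 23, 24))

# Mean and standard error for each Type x factory cell
summarizing <- data %>%
  group_by(Type, factory) %>%
  summarise(mean = mean(strength), se = sd(strength) / sqrt(n())) %>%
  as.data.frame()

# Use the same dodge for bars and error bars so they line up
dodge <- position_dodge(width = 0.9)

p <- ggplot(summarizing, aes(x = factory, y = mean, fill = Type))+
  geom_bar(stat = "identity", position = dodge)+
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                position = dodge, width = 0.2)+
  ylab('Strength')+
  xlab('Factory')
p
```

With only three measurements per bar, the error bars are rough, but they already show which differences between factories are not worth reading much into.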

I particularly like boxplots, since they give us an idea of the distribution as well as a measure of central tendency (the median).

ggplot(data, aes(x = factory, y = strength, colour = Type, fill = Type))+
  geom_boxplot(colour = 'black')+
  ylab('Strength')+
  xlab('Factory')
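
Each box here is built from only three measurements, so it can help to draw the raw points on top of it. A minimal sketch using geom_point with position_jitterdodge; the jitter and dodge widths are my own choices, and since the jitter is random the points will sit slightly differently on every run.

```r
library(ggplot2)

# Same 27 measurements as in the dataset above, written compactly with rep()
data <- data.frame(
  Type     = rep(c('1', '2', '3'), each = 9),
  factory  = rep(rep(c('A', 'B', 'C'), each = 3), times = 3),
  strength = c(14, 15, 16, 16, 12, 17, 17, 15, 13,
               15, 17, 19, 20, 21, 22, 19, 22, 25,
               26, 25, 24, 25, 27, 29, 22, 23, 24))

p <- ggplot(data, aes(x = factory, y = strength, fill = Type))+
  geom_boxplot(colour = 'black', outlier.shape = NA)+ # hide outlier marks, the raw points are drawn below
  geom_point(shape = 21, # shape 21 is a circle that honours the fill colour
             position = position_jitterdodge(jitter.width = 0.1,
                                             dodge.width = 0.75))+
  ylab('Strength')+
  xlab('Factory')
p
```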

A somewhat weird density plot, which shows the distribution for each factory and glass type.

ggplot(data, aes(x = strength, fill = Type))+
  facet_wrap(~factory)+
  geom_density(alpha = 0.5)+
  ylab('Density')+
  xlab('Strength')

The possibilities are infinite, and the only risk of learning to use ggplot is spending way too much time there, when you should be doing something else lol :D