Let's start by practicing an ANOVA on randomly generated data for 4 groups.

nGroup <- 4 # number of treatment groups
nName <- c("Cntrl","Treat1", "Treat2", "Treat3") # names of groups
nSize <- c(12,17,9,12) # number of observations in each group
nMean <- c(40,41,60,59) # mean of each group
nSD <- c(5,5,5,5) # standard deviation of each group

ID <- 1:(sum(nSize)) # id vector for each row
resVar <- c(rnorm(n=nSize[1],mean=nMean[1],sd=nSD[1]),
            rnorm(n=nSize[2],mean=nMean[2],sd=nSD[2]),
            rnorm(n=nSize[3],mean=nMean[3],sd=nSD[3]),
            rnorm(n=nSize[4],mean=nMean[4],sd=nSD[4]))
Groups <- rep(nName,nSize)
DataFrame <- data.frame(ID,Groups,resVar)
str(DataFrame)
## 'data.frame':    50 obs. of  3 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Groups: chr  "Cntrl" "Cntrl" "Cntrl" "Cntrl" ...
##  $ resVar: num  46.7 39.2 42 33.9 50 ...
head(DataFrame)
##   ID Groups   resVar
## 1  1  Cntrl 46.65354
## 2  2  Cntrl 39.22823
## 3  3  Cntrl 42.03358
## 4  4  Cntrl 33.91393
## 5  5  Cntrl 49.97025
## 6  6  Cntrl 31.92566
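
Because the data come from `rnorm()`, every run gives different numbers. A minimal sketch of how the generator above could be wrapped in a function and made reproducible is shown here. The function name `simulate_groups` and the seed value are assumptions made for illustration, not part of the original script.

```r
# Hypothetical helper: wraps the data generator so it can be rerun
# with different sample sizes or means.
simulate_groups <- function(nName, nSize, nMean, nSD) {
  stopifnot(length(nName) == length(nSize),
            length(nSize) == length(nMean),
            length(nMean) == length(nSD))
  # draw each group's values and concatenate them in group order
  resVar <- unlist(Map(function(n, m, s) rnorm(n, mean = m, sd = s),
                       nSize, nMean, nSD))
  data.frame(ID = seq_len(sum(nSize)),
             Groups = rep(nName, nSize),
             resVar = resVar)
}

set.seed(123) # arbitrary seed, only so that reruns give the same draw
sim <- simulate_groups(nName = c("Cntrl", "Treat1", "Treat2", "Treat3"),
                       nSize = c(12, 17, 9, 12),
                       nMean = c(40, 41, 60, 59),
                       nSD   = c(5, 5, 5, 5))
str(sim)
```

Calling the same function with `nSize = rep(3, 4)` would give the small-sample scenario explored further down.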

As you can see, we have 4 groups, so we will evaluate the differences between groups using an ANOVA.

aov.data <- aov(resVar~Groups,data=DataFrame)
print(aov.data)
## Call:
##    aov(formula = resVar ~ Groups, data = DataFrame)
## 
## Terms:
##                   Groups Residuals
## Sum of Squares  4058.371  1211.128
## Deg. of Freedom        3        46
## 
## Residual standard error: 5.131167
## Estimated effects may be unbalanced
print(summary(aov.data))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Groups       3   4058  1352.8   51.38 9.98e-15 ***
## Residuals   46   1211    26.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to Pr(>F) (the p-value), there is a significant difference among groups.

Now let's graph it!

library(ggplot2) # ggplot() and geom_boxplot() come from this package
my_plot <- ggplot(data=DataFrame,aes(x=Groups,y=resVar))+
           geom_boxplot()
print(my_plot)

Looking at the graph, the groups appear to differ clearly. It is reasonable to hypothesize that Treat2 and Treat3 are the groups that differ from Cntrl and Treat1, but we will need to perform a post-hoc test to address this.
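
One common post-hoc option is Tukey's HSD, available in base R as `TukeyHSD()`. The sketch below is only an illustration: it regenerates data with the same group settings, and the seed and the object names `sim`, `fit` and `tukey` are assumptions. The original analysis does not say which post-hoc test was intended.

```r
# Same group settings as in the simulation above
nName <- c("Cntrl", "Treat1", "Treat2", "Treat3")
nSize <- c(12, 17, 9, 12)
nMean <- c(40, 41, 60, 59)

set.seed(42) # arbitrary seed for a repeatable illustration
sim <- data.frame(
  Groups = factor(rep(nName, nSize)), # TukeyHSD() needs a factor
  resVar = rnorm(sum(nSize), mean = rep(nMean, nSize), sd = 5)
)

fit <- aov(resVar ~ Groups, data = sim)
tukey <- TukeyHSD(fit) # all pairwise comparisons between groups
print(tukey)

# adjusted p-values, one per pair of groups
padj <- tukey$Groups[, "p adj"]
```

Each row of the result is one pair of groups (for example `Treat2-Cntrl`), with the estimated difference, its confidence interval and an adjusted p-value.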

What happens if we decrease the sample size? Every group will now have n=3.

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Groups       3 1113.4   371.1   8.756 0.00659 **
## Residuals    8  339.1    42.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is still a significant difference, mainly because the means are very different. But what if they are not?

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Groups       3 163.57   54.52   8.673 0.00678 **
## Residuals    8  50.29    6.29                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

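In the run shown here the difference is still significant (p = 0.00678). Because the data are random, another run with n=3 and close means may well give a p-value above 0.05. Let's increase the sample size again, keeping the same means.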

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Groups        3  19782    6594   303.3 <2e-16 ***
## Residuals   217   4718      22                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have a significant difference again. When the means are very different, a small sample size is enough to detect the difference. The smaller the difference, the larger the sample size needed to detect it.

Side note: Be careful, this does not mean that just because we increase our sample size we will always find a significant difference.
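
The relationship between mean separation, sample size and the chance of detecting a difference can be sketched with `power.anova.test()` from base R. This is an illustration under assumptions: it treats the design as balanced (same n in every group), and the `close_means` values and the helper name `power_for` are hypothetical, since the means used in the runs above are not shown.

```r
nSD <- 5                          # same SD as in the simulation
far_means   <- c(40, 41, 60, 59)  # means from the first simulation
close_means <- c(40, 41, 42, 43)  # hypothetical "close" means

# Power of a balanced one-way ANOVA for given group means and n per group
power_for <- function(means, n) {
  power.anova.test(groups = length(means), n = n,
                   between.var = var(means),
                   within.var = nSD^2)$power
}

power_far_n3    <- power_for(far_means, 3)
power_close_n3  <- power_for(close_means, 3)
power_close_n50 <- power_for(close_means, 50)

# n per group needed for 80% power with the close means
n_needed <- power.anova.test(groups = length(close_means),
                             between.var = var(close_means),
                             within.var = nSD^2,
                             power = 0.80)$n

print(c(far_n3 = power_far_n3, close_n3 = power_close_n3,
        close_n50 = power_close_n50, n_needed = n_needed))
```

Power is the probability of getting a significant result when a real difference exists, so it is never exactly 1. That is one way to read the side note: a larger sample raises the chance of detecting a true difference, but it does not guarantee significance.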

Now let's try the same with real data, testing a hypothesis with my own data, where we evaluate the importance of seven intracellular tyrosines of DCBLD2 for its phosphorylation driven by PDGFR.

For this we have two groups: DCBLD2 WT and the mutant DCBLD2 Y7F (seven intracellular Y mutated to F).

my_data<-read.table("Homework7.csv", header = TRUE, sep = ",")
head(my_data)
##   ID  Treatment   pyfold
## 1  1  DCBLD2_WT 1.161288
## 2  2  DCBLD2_WT 1.486293
## 3  3 DCBLD2_Y7F 1.993074
## 4  4 DCBLD2_Y7F 2.220885
## 5  5  DCBLD2_WT 7.325243
## 6  6  DCBLD2_WT 5.580103

As you can see, I have 2 groups, so I will perform a t-test to evaluate whether there is a significant difference between them.

T.data <- t.test(pyfold~Treatment,data=my_data)
print(T.data)
## 
##  Welch Two Sample t-test
## 
## data:  pyfold by Treatment
## t = 1.6746, df = 6.0674, p-value = 0.1445
## alternative hypothesis: true difference in means between group DCBLD2_WT and group DCBLD2_Y7F is not equal to 0
## 95 percent confidence interval:
##  -0.9225165  4.9576315
## sample estimates:
##  mean in group DCBLD2_WT mean in group DCBLD2_Y7F 
##                2.9573561                0.9397987
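
To read a result like this programmatically, the object returned by `t.test()` can be inspected directly. The sketch below uses simulated values, not the real `Homework7.csv` data. The group means, SD, sample size, seed and the names `demo`, `res` and `alpha` are all assumptions for illustration.

```r
set.seed(7) # arbitrary seed
# Simulated stand-in data; NOT the real measurements
demo <- data.frame(
  Treatment = rep(c("DCBLD2_WT", "DCBLD2_Y7F"), each = 5),
  pyfold = c(rnorm(5, mean = 3, sd = 2),  # hypothetical WT values
             rnorm(5, mean = 1, sd = 1))  # hypothetical Y7F values
)

res <- t.test(pyfold ~ Treatment, data = demo) # Welch test by default

alpha <- 0.05
is_significant <- res$p.value < alpha
# for a two-sided test, a 95% CI containing 0 corresponds to p >= 0.05
ci_includes_zero <- res$conf.int[1] < 0 && res$conf.int[2] > 0

print(res$p.value)
print(res$conf.int)
print(is_significant)
```

The p-value and the confidence interval tell the same story: if the 95% interval for the difference in means contains 0, the two-sided test is not significant at the 0.05 level.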

The Welch t-test gives p = 0.1445, so at the 0.05 level there is no significant difference between the two groups. Let's graph this.

my_plot <- ggplot(data=my_data,aes(x=Treatment,y=pyfold))+
           geom_boxplot()
print(my_plot)

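The mean phosphorylation signal is lower in the Y7F mutant (0.94 vs 2.96 for WT), but the difference is not statistically significant (p = 0.1445, and the 95% confidence interval includes 0). The data are consistent with a role for these 7 intracellular tyrosines in the DCBLD2 phosphorylation driven by PDGFR, but more replicates are needed to support that conclusion.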