library(ggplot2)
library(MASS)
library(dplyr)
# quick and dirty, a truncated normal distribution to work on the solution set
z<- rnorm(n=3000,mean=0.2)
z <- data.frame(1:3000,z)
names(z) <- list("ID","myVar")
z <- z[z$myVar>0,]
str(z)
## 'data.frame': 1680 obs. of 2 variables:
## $ ID : int 3 5 6 7 8 10 12 13 14 15 ...
## $ myVar: num 1.0635 0.0733 1.0548 2.0817 1.2607 ...
summary(z$myVar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000752 0.372211 0.767449 0.885587 1.266283 3.323885
#Plot the data.
p1 <- ggplot(data=z, aes(x=myVar, y=..density..)) +
geom_histogram(color="grey60",fill="cornsilk",size=0.2)
print(p1)
p1 <- p1 + geom_density(linetype="dotted",size=0.75)
print(p1)
normPars <- fitdistr(z$myVar,"normal")
print(normPars)
## mean sd
## 0.88558671 0.64391922
## (0.01571002) (0.01110866)
str(normPars)
## List of 5
## $ estimate: Named num [1:2] 0.886 0.644
## ..- attr(*, "names")= chr [1:2] "mean" "sd"
## $ sd : Named num [1:2] 0.0157 0.0111
## ..- attr(*, "names")= chr [1:2] "mean" "sd"
## $ vcov : num [1:2, 1:2] 0.000247 0 0 0.000123
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "mean" "sd"
## .. ..$ : chr [1:2] "mean" "sd"
## $ n : int 1680
## $ loglik : num -1644
## - attr(*, "class")= chr "fitdistr"
normPars$estimate["mean"] # note structure of getting a named attribute
## mean
## 0.8855867
meanML <- normPars$estimate["mean"]
sdML <- normPars$estimate["sd"]
xval <- seq(0,max(z$myVar),len=length(z$myVar))
stat <- stat_function(aes(x = xval, y = ..y..), fun = dnorm, colour="red", n = length(z$myVar), args = list(mean = meanML, sd = sdML))
p1 + stat
It is possible to predict that the data does not follow a normal distribution (Red). It is definitely skew to the left, meaning that the density of values is higher for the lower values.
expoPars <- fitdistr(z$myVar,"exponential")
rateML <- expoPars$estimate["rate"]
stat2 <- stat_function(aes(x = xval, y = ..y..), fun = dexp, colour="blue", n = length(z$myVar), args = list(rate=rateML))
print(p1 + stat + stat2)
The exponential predicted curve (Blue) fits more to our data in contrast to normal distribution. Denietly the negative exponential model proposed higher density in the lower values range of the graph.
stat3 <- stat_function(aes(x = xval, y = ..y..), fun = dunif, colour="darkgreen", n = length(z$myVar), args = list(min=min(z$myVar), max=max(z$myVar)))
p1 + stat + stat2 + stat3
The uniform probability reflect (Brown) that each value has the same possibility to apear in our data set. That is denietly not matching our data distribution. with moro density in the lower values range of the graph.
gammaPars <- fitdistr(z$myVar,"gamma")
shapeML <- gammaPars$estimate["shape"]
rateML <- gammaPars$estimate["rate"]
stat4 <- stat_function(aes(x = xval, y = ..y..), fun = dgamma, colour="brown", n = length(z$myVar), args = list(shape=shapeML, rate=rateML))
p1 + stat + stat2 + stat3 + stat4
The gamma distribution is a type of normal distribution (Green), skew to the left side of the graph. Is use to model the continuous variable that should have a positive and skewed distribution, as our data. Therefore is possible to conclude that our randomize data has a gamma positive distribution.
pSpecial <- ggplot(data=z, aes(x=myVar/(max(myVar + 0.1)), y=..density..)) +
geom_histogram(color="grey60",fill="cornsilk",size=0.2) +
xlim(c(0,1)) +
geom_density(size=0.75,linetype="dotted")
betaPars <- fitdistr(x=z$myVar/max(z$myVar + 0.1),start=list(shape1=1,shape2=2),"beta")
shape1ML <- betaPars$estimate["shape1"]
shape2ML <- betaPars$estimate["shape2"]
statSpecial <- stat_function(aes(x = xval, y = ..y..), fun = dbeta, colour="orchid", n = length(z$myVar), args = list(shape1=shape1ML,shape2=shape2ML))
print(pSpecial + statSpecial)
Both, Gamma and Beta (Purple) distributions adapt perfectly to our model . This is because a beta distribution is based on a gamma distribution with modified shape, therefore the model could adapt to the shape of our data distribution.
The data represent multiple organic matter values, measure in grams,of our samples, across our site of study.
z<-read.csv("Li_Soil_Dataset_DP.csv", sep=',', comment.char = '#')
str(z)
## 'data.frame': 60 obs. of 10 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Moisture : chr "Ambient" "Ambient" "Ambient" "Ambient" ...
## $ Plant_species_richness: int 1 1 1 1 1 1 1 1 1 1 ...
## $ moisture : num 0.1049 0.1122 0.1098 0.0679 0.105 ...
## $ organic_matter : num 9.39 8.5 12.09 10.97 14.57 ...
## $ Plant_biomass : num 33.2 36.4 39.2 91.3 110 ...
## $ Bacterial_Chao : num 4299 4428 4622 4346 4628 ...
## $ Bacterial_OUT_richness: int 2890 2953 3090 3058 2954 2944 2952 3046 3188 3184 ...
## $ Fungal_Chao.1 : num 1105 1190 933 1089 822 ...
## $ Fungal_OUT_richness : int 758 680 598 569 542 752 581 589 601 816 ...
z<-na.omit(z)
z$myVar<-z$organic_matter
summary(z)
## ID Moisture Plant_species_richness moisture
## Min. : 1.00 Length:60 Min. :1.0 Min. :0.02117
## 1st Qu.:15.75 Class :character 1st Qu.:1.0 1st Qu.:0.03125
## Median :30.50 Mode :character Median :2.0 Median :0.05804
## Mean :30.50 Mean :3.2 Mean :0.06808
## 3rd Qu.:45.25 3rd Qu.:4.0 3rd Qu.:0.10006
## Max. :60.00 Max. :8.0 Max. :0.13715
## organic_matter Plant_biomass Bacterial_Chao Bacterial_OUT_richness
## Min. : 0.6933 Min. : 7.30 Min. :3706 Min. :2441
## 1st Qu.: 5.8625 1st Qu.: 37.36 1st Qu.:4035 1st Qu.:2762
## Median :10.3308 Median : 69.89 Median :4320 Median :2950
## Mean : 9.5545 Mean : 68.36 Mean :4272 Mean :2904
## 3rd Qu.:12.3793 3rd Qu.: 92.45 3rd Qu.:4504 3rd Qu.:3058
## Max. :17.6395 Max. :189.21 Max. :4867 Max. :3198
## Fungal_Chao.1 Fungal_OUT_richness myVar
## Min. : 560.9 Min. :325.0 Min. : 0.6933
## 1st Qu.: 784.7 1st Qu.:485.8 1st Qu.: 5.8625
## Median : 891.2 Median :548.0 Median :10.3308
## Mean : 908.6 Mean :562.7 Mean : 9.5545
## 3rd Qu.:1024.9 3rd Qu.:629.0 3rd Qu.:12.3793
## Max. :1250.5 Max. :816.0 Max. :17.6395
p1<-ggplot(z, aes(x=myVar, y=after_stat(density))) +
geom_histogram(color="green", fill="cornsilk", size=0.2)
p1
We plot for a histogram to evaluate the distribution of the organic matter found in the soil. Were we will plot for different possible distribution and evaluate the one that fits the best. At the moment the higher density of the data is found close to the mean. Therefore, is possible to predict that our data will fit a normal distribution.
But, lets plot for the probability distribution to answer this question.
## mean sd
## 9.5545250 4.1431737
## (0.5348814) (0.3782183)
## List of 5
## $ estimate: Named num [1:2] 9.55 4.14
## ..- attr(*, "names")= chr [1:2] "mean" "sd"
## $ sd : Named num [1:2] 0.535 0.378
## ..- attr(*, "names")= chr [1:2] "mean" "sd"
## $ vcov : num [1:2, 1:2] 0.286 0 0 0.143
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "mean" "sd"
## .. ..$ : chr [1:2] "mean" "sd"
## $ n : int 60
## $ loglik : num -170
## - attr(*, "class")= chr "fitdistr"
## mean
## 9.554525
Starting for which probabiliy density does not fit our data distribution.
Exponential (Blue): Definetly our data does not have higer density for values of 0 a 1, there fore could not be exponential.
Uniform (green): Uniform represent a model in which every value has the same chance to appear. This is not the distribution or our data, there is definitely higher density in some regions of the graph.
Gamma (Brown): Gamma is a type of normal distribution, so it looks similar to our data. But our data is not skew to the left at it should behave a gamma distribution.
Lastly, the normal distribution (Red), is the one that mostly fits the model. With the higer density of values close to the mean. In comparation, the beta distribution (Purple), since is a type of normal distribution also adapts to our model, showing how the data skew a little to the higher values of the graph.
In conclusion, the model that best fits our data is a normal distribution.