library(ggplot2)
library(MASS)
library(dplyr)

Start with some random data

# quick and dirty, a truncated normal distribution to work on the solution set

z<- rnorm(n=3000,mean=0.2)
z <- data.frame(1:3000,z)
names(z) <- list("ID","myVar")
z <- z[z$myVar>0,]
str(z)

## 'data.frame':    1680 obs. of  2 variables:
##  $ ID   : int  3 5 6 7 8 10 12 13 14 15 ...
##  $ myVar: num  1.0635 0.0733 1.0548 2.0817 1.2607 ...

summary(z$myVar)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000752 0.372211 0.767449 0.885587 1.266283 3.323885

#Plot the data. 

p1 <- ggplot(data=z, aes(x=myVar, y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2) 
print(p1)

p1 <-  p1 +  geom_density(linetype="dotted",size=0.75)
print(p1)

Now lets get maximum likelihood parameters for normal

normPars <- fitdistr(z$myVar,"normal")
print(normPars)

##       mean          sd    
##   0.88558671   0.64391922 
##  (0.01571002) (0.01110866)

str(normPars)

## List of 5
##  $ estimate: Named num [1:2] 0.886 0.644
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ sd      : Named num [1:2] 0.0157 0.0111
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ vcov    : num [1:2, 1:2] 0.000247 0 0 0.000123
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "mean" "sd"
##   .. ..$ : chr [1:2] "mean" "sd"
##  $ n       : int 1680
##  $ loglik  : num -1644
##  - attr(*, "class")= chr "fitdistr"

normPars$estimate["mean"] # note structure of getting a named attribute

##      mean 
## 0.8855867

Plot the normal probability density

meanML <- normPars$estimate["mean"]
sdML <- normPars$estimate["sd"]

xval <- seq(0,max(z$myVar),len=length(z$myVar))

 stat <- stat_function(aes(x = xval, y = ..y..), fun = dnorm, colour="red", n = length(z$myVar), args = list(mean = meanML, sd = sdML))
 p1 + stat

It is possible to predict that the data does not follow a normal distribution (Red). It is definitely skew to the left, meaning that the density of values is higher for the lower values.

Plot exponential probability density

expoPars <- fitdistr(z$myVar,"exponential")
rateML <- expoPars$estimate["rate"]

stat2 <- stat_function(aes(x = xval, y = ..y..), fun = dexp, colour="blue", n = length(z$myVar), args = list(rate=rateML))
 print(p1 + stat + stat2)

The exponential predicted curve (Blue) fits more to our data in contrast to normal distribution. Denietly the negative exponential model proposed higher density in the lower values range of the graph.

Plot uniform probability density

stat3 <- stat_function(aes(x = xval, y = ..y..), fun = dunif, colour="darkgreen", n = length(z$myVar), args = list(min=min(z$myVar), max=max(z$myVar)))
 p1 + stat + stat2 + stat3

The uniform probability reflect (Brown) that each value has the same possibility to apear in our data set. That is denietly not matching our data distribution. with moro density in the lower values range of the graph.

Plot gamma probability density

gammaPars <- fitdistr(z$myVar,"gamma")
shapeML <- gammaPars$estimate["shape"]
rateML <- gammaPars$estimate["rate"]

stat4 <- stat_function(aes(x = xval, y = ..y..), fun = dgamma, colour="brown", n = length(z$myVar), args = list(shape=shapeML, rate=rateML))
p1 + stat + stat2 + stat3 + stat4

The gamma distribution is a type of normal distribution (Green), skew to the left side of the graph. Is use to model the continuous variable that should have a positive and skewed distribution, as our data. Therefore is possible to conclude that our randomize data has a gamma positive distribution.

Plot beta probability density

pSpecial <- ggplot(data=z, aes(x=myVar/(max(myVar + 0.1)), y=..density..)) +
 geom_histogram(color="grey60",fill="cornsilk",size=0.2) + 
 xlim(c(0,1)) +
 geom_density(size=0.75,linetype="dotted")

betaPars <- fitdistr(x=z$myVar/max(z$myVar + 0.1),start=list(shape1=1,shape2=2),"beta")
shape1ML <- betaPars$estimate["shape1"]
shape2ML <- betaPars$estimate["shape2"]

statSpecial <- stat_function(aes(x = xval, y = ..y..), fun = dbeta, colour="orchid", n = length(z$myVar), args = list(shape1=shape1ML,shape2=shape2ML))
print(pSpecial + statSpecial)

Both, Gamma and Beta (Purple) distributions adapt perfectly to our model . This is because a beta distribution is based on a gamma distribution with modified shape, therefore the model could adapt to the shape of our data distribution.

Now lets play with real data

The data represent multiple organic matter values, measure in grams,of our samples, across our site of study.

z<-read.csv("Li_Soil_Dataset_DP.csv", sep=',', comment.char = '#')

str(z)

## 'data.frame':    60 obs. of  10 variables:
##  $ ID                    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Moisture              : chr  "Ambient" "Ambient" "Ambient" "Ambient" ...
##  $ Plant_species_richness: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ moisture              : num  0.1049 0.1122 0.1098 0.0679 0.105 ...
##  $ organic_matter        : num  9.39 8.5 12.09 10.97 14.57 ...
##  $ Plant_biomass         : num  33.2 36.4 39.2 91.3 110 ...
##  $ Bacterial_Chao        : num  4299 4428 4622 4346 4628 ...
##  $ Bacterial_OUT_richness: int  2890 2953 3090 3058 2954 2944 2952 3046 3188 3184 ...
##  $ Fungal_Chao.1         : num  1105 1190 933 1089 822 ...
##  $ Fungal_OUT_richness   : int  758 680 598 569 542 752 581 589 601 816 ...

z<-na.omit(z)
z$myVar<-z$organic_matter
summary(z)

##        ID          Moisture         Plant_species_richness    moisture      
##  Min.   : 1.00   Length:60          Min.   :1.0            Min.   :0.02117  
##  1st Qu.:15.75   Class :character   1st Qu.:1.0            1st Qu.:0.03125  
##  Median :30.50   Mode  :character   Median :2.0            Median :0.05804  
##  Mean   :30.50                      Mean   :3.2            Mean   :0.06808  
##  3rd Qu.:45.25                      3rd Qu.:4.0            3rd Qu.:0.10006  
##  Max.   :60.00                      Max.   :8.0            Max.   :0.13715  
##  organic_matter    Plant_biomass    Bacterial_Chao Bacterial_OUT_richness
##  Min.   : 0.6933   Min.   :  7.30   Min.   :3706   Min.   :2441          
##  1st Qu.: 5.8625   1st Qu.: 37.36   1st Qu.:4035   1st Qu.:2762          
##  Median :10.3308   Median : 69.89   Median :4320   Median :2950          
##  Mean   : 9.5545   Mean   : 68.36   Mean   :4272   Mean   :2904          
##  3rd Qu.:12.3793   3rd Qu.: 92.45   3rd Qu.:4504   3rd Qu.:3058          
##  Max.   :17.6395   Max.   :189.21   Max.   :4867   Max.   :3198          
##  Fungal_Chao.1    Fungal_OUT_richness     myVar        
##  Min.   : 560.9   Min.   :325.0       Min.   : 0.6933  
##  1st Qu.: 784.7   1st Qu.:485.8       1st Qu.: 5.8625  
##  Median : 891.2   Median :548.0       Median :10.3308  
##  Mean   : 908.6   Mean   :562.7       Mean   : 9.5545  
##  3rd Qu.:1024.9   3rd Qu.:629.0       3rd Qu.:12.3793  
##  Max.   :1250.5   Max.   :816.0       Max.   :17.6395

p1<-ggplot(z, aes(x=myVar, y=after_stat(density))) +
  geom_histogram(color="green", fill="cornsilk", size=0.2)
p1

We plot for a histogram to evaluate the distribution of the organic matter found in the soil. Were we will plot for different possible distribution and evaluate the one that fits the best. At the moment the higher density of the data is found close to the mean. Therefore, is possible to predict that our data will fit a normal distribution.

But, lets plot for the probability distribution to answer this question.

##      mean         sd    
##   9.5545250   4.1431737 
##  (0.5348814) (0.3782183)

## List of 5
##  $ estimate: Named num [1:2] 9.55 4.14
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ sd      : Named num [1:2] 0.535 0.378
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ vcov    : num [1:2, 1:2] 0.286 0 0 0.143
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "mean" "sd"
##   .. ..$ : chr [1:2] "mean" "sd"
##  $ n       : int 60
##  $ loglik  : num -170
##  - attr(*, "class")= chr "fitdistr"

##     mean 
## 9.554525

Starting for which probabiliy density does not fit our data distribution.

Exponential (Blue): Definetly our data does not have higer density for values of 0 a 1, there fore could not be exponential.

Uniform (green): Uniform represent a model in which every value has the same chance to appear. This is not the distribution or our data, there is definitely higher density in some regions of the graph.

Gamma (Brown): Gamma is a type of normal distribution, so it looks similar to our data. But our data is not skew to the left at it should behave a gamma distribution.

Lastly, the normal distribution (Red), is the one that mostly fits the model. With the higer density of values close to the mean. In comparation, the beta distribution (Purple), since is a type of normal distribution also adapts to our model, showing how the data skew a little to the higher values of the graph.

In conclusion, the model that best fits our data is a normal distribution.

Homework6. Probability Distributions

Daniel Penados-Richter

2024-02-21