Compare an empirical data set against several candidate distributions to help find the distribution that may fit it best.

Usage

tidy_distribution_comparison(
  .x,
  .distribution_type = "continuous",
  .round_to_place = 3
)

Arguments

.x

The data set being passed to the function

.distribution_type

The kind of data being supplied; must be one of "continuous" or "discrete". The default is "continuous"

.round_to_place

How many decimal places the parameter estimates should be rounded to for distribution construction. The default is 3

Value

An invisible list object. A tibble is printed.

Details

The purpose of this function is to take the data set provided and try to find the distribution that may fit it best. The .distribution_type parameter must be set to either "continuous" or "discrete" so that the function tries the appropriate types of distributions.

The following distributions are used:

Continuous:

  • tidy_beta

  • tidy_cauchy

  • tidy_exponential

  • tidy_gamma

  • tidy_logistic

  • tidy_lognormal

  • tidy_normal

  • tidy_pareto

  • tidy_uniform

  • tidy_weibull

Discrete:

  • tidy_binomial

  • tidy_geometric

  • tidy_hypergeometric

  • tidy_poisson

The function returns a list of tibbles. The tibbles returned are:

  • comparison_tbl

  • deviance_tbl

  • total_deviance_tbl

  • aic_tbl

  • kolmogorov_smirnov_tbl

  • multi_metric_tbl

The comparison_tbl is a long tibble that lists the values of the density function against the given data.

The deviance_tbl gives the simple difference between the empirical density and the estimated density for each estimated distribution. The total_deviance_tbl reduces those differences to a single absolute total (abs_tot_deviance) per distribution.
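The idea of a density difference can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the candidate is a normal distribution with parameters taken from mean() and sd(), the empirical density comes from density() evaluated at as many points as there are observations, and the total is assumed to be the absolute value of the summed differences.

```r
# Empirical data (the same vector used in the Examples section)
x <- mtcars$mpg

# Empirical kernel density, evaluated at length(x) equally spaced points
emp <- density(x, n = length(x))

# Assumed candidate: normal density with moment estimates, on the same grid
est_dy <- dnorm(emp$x, mean = mean(x), sd = sd(x))

# Pointwise difference between the empirical and the estimated density
deviance <- emp$y - est_dy

# One summary number per candidate (assumed form: absolute summed difference)
abs_tot_deviance <- abs(sum(deviance))
abs_tot_deviance
```

A smaller total suggests the candidate density tracks the empirical density more closely on this grid, although positive and negative differences can cancel in a plain sum.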

The aic_tbl provides the AIC of an lm model of the estimated density against the empirical density.
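As a rough base R sketch of this kind of AIC calculation: the normal candidate, the density() grid, and the direction of the regression (empirical density as the response) are all assumptions made for illustration, and the package may construct the model differently.

```r
# Empirical data (the same vector used in the Examples section)
x <- mtcars$mpg

# Empirical kernel density on a grid of length(x) points
emp <- density(x, n = length(x))

# Assumed candidate: normal density with moment estimates, on the same grid
dat <- data.frame(
  emp_dy = emp$y,
  est_dy = dnorm(emp$x, mean = mean(x), sd = sd(x))
)

# Linear model relating the two densities (direction is an assumption)
fit <- lm(emp_dy ~ est_dy, data = dat)

# Akaike information criterion of that model
aic_value <- AIC(fit)
aic_value
```

Because the densities are small numbers, the resulting AIC is often negative, which is in line with the negative aic_value entries shown in the Examples output.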

The kolmogorov_smirnov_tbl currently provides the two-sided result of ks.test for the estimated density against the empirical density.
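A two-sided, two-sample ks.test can be sketched in base R as follows. The choices here are assumptions for illustration only: the candidate sample is drawn with rnorm() using moment estimates, and exact = FALSE requests the asymptotic p-value so that ties in the data do not matter. The Examples output reports a Monte-Carlo method, so the package evidently computes its p-value differently.

```r
# Fix the seed so the simulated candidate sample is reproducible
set.seed(123)

# Empirical data (the same vector used in the Examples section)
x <- mtcars$mpg

# Assumed candidate: random draws from a normal with moment estimates
sim <- rnorm(length(x), mean = mean(x), sd = sd(x))

# Two-sided, two-sample Kolmogorov-Smirnov test (asymptotic p-value)
ks <- ks.test(x, sim, alternative = "two.sided", exact = FALSE)

# Test statistic (largest distance between the two empirical CDFs) and p-value
ks_statistic <- unname(ks$statistic)
ks_pvalue <- ks$p.value
c(ks_statistic = ks_statistic, ks_pvalue = ks_pvalue)
```

A small statistic with a large p-value gives no evidence against the candidate, which is how rows such as Logistic and Gamma read in the Examples output.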

The multi_metric_tbl will summarise all of these metrics into a single tibble.

Author

Steven P. Sanderson II, MPH

Examples

xc <- mtcars$mpg
output_c <- tidy_distribution_comparison(xc, "continuous")
#> For the beta distribution, its mean 'mu' should be 0 < mu < 1. The data will
#> therefore be scaled to enforce this.

xd <- trunc(xc)
output_d <- tidy_distribution_comparison(xd, "discrete")

output_c
#> $comparison_tbl
#> # A tibble: 352 × 8
#>    sim_number     x     y    dx       dy     p     q dist_type
#>    <fct>      <int> <dbl> <dbl>    <dbl> <dbl> <dbl> <fct>    
#>  1 1              1  21    2.97 0.000114 0.625  10.4 Empirical
#>  2 1              2  21    4.21 0.000455 0.625  10.4 Empirical
#>  3 1              3  22.8  5.44 0.00142  0.781  13.3 Empirical
#>  4 1              4  21.4  6.68 0.00355  0.688  14.3 Empirical
#>  5 1              5  18.7  7.92 0.00721  0.469  14.7 Empirical
#>  6 1              6  18.1  9.16 0.0124   0.438  15   Empirical
#>  7 1              7  14.3 10.4  0.0192   0.125  15.2 Empirical
#>  8 1              8  24.4 11.6  0.0281   0.812  15.2 Empirical
#>  9 1              9  22.8 12.9  0.0395   0.781  15.5 Empirical
#> 10 1             10  19.2 14.1  0.0516   0.531  15.8 Empirical
#> # ℹ 342 more rows
#> 
#> $deviance_tbl
#> # A tibble: 352 × 2
#>    name                        value
#>    <chr>                       <dbl>
#>  1 Empirical                  0.451 
#>  2 Beta c(1.107, 1.577, 0)    0.287 
#>  3 Cauchy c(19.2, 7.375)     -0.0169
#>  4 Exponential c(0.05)        0.133 
#>  5 Gamma c(11.47, 1.752)     -0.254 
#>  6 Logistic c(20.091, 3.27)  -0.0146
#>  7 Lognormal c(2.958, 0.293)  0.315 
#>  8 Pareto c(10.4, 1.624)      0.412 
#>  9 Uniform c(8.341, 31.841)   0.143 
#> 10 Weibull c(3.579, 22.288)  -0.201 
#> # ℹ 342 more rows
#> 
#> $total_deviance_tbl
#> # A tibble: 10 × 2
#>    dist_with_params          abs_tot_deviance
#>    <chr>                                <dbl>
#>  1 Beta c(1.107, 1.577, 0)              0.640
#>  2 Lognormal c(2.958, 0.293)            0.734
#>  3 Gaussian c(20.091, 5.932)            1.39 
#>  4 Cauchy c(19.2, 7.375)                1.56 
#>  5 Logistic c(20.091, 3.27)             2.79 
#>  6 Uniform c(8.341, 31.841)             2.99 
#>  7 Weibull c(3.579, 22.288)             3.34 
#>  8 Pareto c(10.4, 1.624)                3.77 
#>  9 Gamma c(11.47, 1.752)                4.06 
#> 10 Exponential c(0.05)                  7.08 
#> 
#> $aic_tbl
#> # A tibble: 10 × 3
#>    dist_type                 aic_value abs_aic
#>    <fct>                         <dbl>   <dbl>
#>  1 Beta c(1.107, 1.577, 0)       -22.4    22.4
#>  2 Pareto c(10.4, 1.624)          85.6    85.6
#>  3 Gamma c(11.47, 1.752)        -155.    155. 
#>  4 Logistic c(20.091, 3.27)     -161.    161. 
#>  5 Gaussian c(20.091, 5.932)    -172.    172. 
#>  6 Weibull c(3.579, 22.288)     -181.    181. 
#>  7 Cauchy c(19.2, 7.375)        -189.    189. 
#>  8 Exponential c(0.05)          -204.    204. 
#>  9 Lognormal c(2.958, 0.293)    -207.    207. 
#> 10 Uniform c(8.341, 31.841)     -208.    208. 
#> 
#> $kolmogorov_smirnov_tbl
#> # A tibble: 10 × 6
#>    dist_type              ks_statistic ks_pvalue ks_method alternative dist_char
#>    <fct>                         <dbl>     <dbl> <chr>     <chr>       <chr>    
#>  1 Beta c(1.107, 1.577, …        0.75   0.000500 Monte-Ca… two-sided   Beta c(1…
#>  2 Cauchy c(19.2, 7.375)         0.469  0.00100  Monte-Ca… two-sided   Cauchy c…
#>  3 Exponential c(0.05)           0.5    0.00100  Monte-Ca… two-sided   Exponent…
#>  4 Gamma c(11.47, 1.752)         0.156  0.839    Monte-Ca… two-sided   Gamma c(…
#>  5 Logistic c(20.091, 3.…        0.125  0.970    Monte-Ca… two-sided   Logistic…
#>  6 Lognormal c(2.958, 0.…        0.219  0.444    Monte-Ca… two-sided   Lognorma…
#>  7 Pareto c(10.4, 1.624)         0.844  0.000500 Monte-Ca… two-sided   Pareto c…
#>  8 Uniform c(8.341, 31.8…        0.25   0.269    Monte-Ca… two-sided   Uniform …
#>  9 Weibull c(3.579, 22.2…        0.188  0.646    Monte-Ca… two-sided   Weibull …
#> 10 Gaussian c(20.091, 5.…        0.188  0.658    Monte-Ca… two-sided   Gaussian…
#> 
#> $multi_metric_tbl
#> # A tibble: 10 × 8
#>    dist_type abs_tot_deviance aic_value abs_aic ks_statistic ks_pvalue ks_method
#>    <fct>                <dbl>     <dbl>   <dbl>        <dbl>     <dbl> <chr>    
#>  1 Beta c(1…            0.640     -22.4    22.4        0.75   0.000500 Monte-Ca…
#>  2 Lognorma…            0.734    -207.    207.         0.219  0.444    Monte-Ca…
#>  3 Gaussian…            1.39     -172.    172.         0.188  0.658    Monte-Ca…
#>  4 Cauchy c…            1.56     -189.    189.         0.469  0.00100  Monte-Ca…
#>  5 Logistic…            2.79     -161.    161.         0.125  0.970    Monte-Ca…
#>  6 Uniform …            2.99     -208.    208.         0.25   0.269    Monte-Ca…
#>  7 Weibull …            3.34     -181.    181.         0.188  0.646    Monte-Ca…
#>  8 Pareto c…            3.77       85.6    85.6        0.844  0.000500 Monte-Ca…
#>  9 Gamma c(…            4.06     -155.    155.         0.156  0.839    Monte-Ca…
#> 10 Exponent…            7.08     -204.    204.         0.5    0.00100  Monte-Ca…
#> # ℹ 1 more variable: alternative <chr>
#> 
#> attr(,".x")
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
#> attr(,".n")
#> [1] 32