--- title: "Post-run diagnostics and analyses" date: "2024-05-14" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{Post-run diagnostics and analyses} output: rmarkdown::html_vignette --- In this tutorial, we assume that you successfully ran an MCMC on a network model, and that it is now time to have a critical look at the output of the run. This tutorial will present how to check that the MCMC run went fine, but it will explain only very briefly how to check that the model is compatible with the observed data. More details about this important step is presented in the next vignette about [posterior predictive checks](tutorial-100-posterior-predictive-checks.html). This vignette is using the MCMC run from the [Quick start tutorial](tutorial-010-quick-start.html). Please go back and run the code of that vignette to generate the MCMC data if you haven't already! ```r # Load the tidyverse package for easier data manipulation library(tidyverse) ``` ### Reminder: what is the result format returned by `run_mcmc()`?

Note: This short section is a quick reminder of what you learnt in the previous tutorial [MCMC output format](tutorial-080-mcmc-output-format.html)).

In the **Quick Start** tutorial, we generated the `run` object by running `run <- run_mcmc(m, iter = 1000)`. The output from `run_mcmc()` is a well-behaved `mcmc.list` object as implemented in the `coda` package: ```r is(run, "mcmc.list") ``` ``` ## [1] TRUE ``` It makes it very easy to use the tools already available for this class of objects, such as those implemented in the packages `coda`, `bayesplot`, `ggmcmc` or `MCMCvis`. In addition to all those existing tools, the `isotracer` package also adds some extra methods to easily calculate derived parameters from an `mcmc.list`. You will learn more about how to do this in the vignette [Derived parameters](tutorial-110-derived-parameters.html).

Tip: `isotracer` uses Stan behind the scenes to run the MCMC. If you prefer to get the raw `stanfit` object instead of the processed `mcmc.list`, you can set `stanfit = TRUE` when you run the model: ```r run <- run_mcmc(model = m, stanfit = TRUE) ``` This might be especially useful if the MCMC sampling is difficult for your model and you need the `stanfit` object to perform some in-depth diagnostics.

## General diagnostics ### Trace plot You should always run **several chains** when performing a Bayesian MCMC. Trace plots allow to get a feeling for: - Chain convergence: did all chains reach the same region of the parameter space? - Mixing: is the mixing of good quality or should we modify the sampler settings? If you are not satisfied with the traces, you need to run a longer run or maybe to tweak the settings of the Stan run. This is the trace plot we obtained from the previous vignette: ```r plot(run) # Note: the figure below only shows a few of the traceplots for vignette concision ```

In this case, the chains have converged towards the same region, and the mixing looks good for all of them. There is no obvious problem visible in those traces. ### Gelman and Rubin's convergence diagnostic It can be useful to have a more formal test of the convergence of the chains. The Gelman and Rubin's convergence diagnostic is implemented in the `coda` package. We can run it with the `coda::gelman.diag()` function. See the coda documentation `?gelman.diag` for more details about this diagnostic. Let's have a look at the diagnostic for our chains: ```r library(coda) run %>% gelman.diag() ``` ``` ## Potential scale reduction factors: ## ## Point est. Upper C.I. ## eta 1.00 1.00 ## lambda_algae 1.01 1.02 ## lambda_daphnia 1.01 1.02 ## lambda_NH4 1.01 1.02 ## upsilon_algae_to_daphnia 1.02 1.03 ## upsilon_daphnia_to_NH4 1.00 1.00 ## upsilon_NH4_to_algae 1.01 1.02 ## zeta 1.00 1.01 ## ## Multivariate psrf ## ## 1.02 ``` The diagnostic values should be very, very close to 1: it looks good in this case! If some values were e.g. $>1.05$, this would already be enough to cause us to wonder about the sampling quality. ### Predicted trajectories In order to check the quality of the model fit, the consistency between the parameter posteriors and the observed data can be checked by plotting the credible envelopes for the estimated trajectories along with the observed data points. This is called a posterior predictive check and is very important to check that the model can actually predict the observed data reasonably well. If the observed data cannot be satisfactorily predicted from the fitted model, then our model is not a good model of the data! To do a posterior predictive check, the first step is to generate predictions for the model based on the MCMC posteriors with `predict()`: ```r # From the Quick Start tutorial: # 'm' is the network model we used when calling 'run <- run_mcmc(m, iter = 1000)' predictions <- predict(m, run, probs = 0.95) ``` We can then visualize the predictions along with the observations with the `plot()` method: ```r plot(predictions) ```

This plot enables to compare both the **size** and the **proportion** observations with the predictions. ## Post-run analyses ### Parameter estimates The quickest way to get parameter estimates is to use the `summary()` function on the posterior: ```r run %>% summary() ``` ``` ## ## Iterations = 501:1000 ## Thinning interval = 1 ## Number of chains = 4 ## Sample size per chain = 500 ## ## 1. Empirical mean and standard deviation for each variable, ## plus standard error of the mean: ## ## Mean SD Naive SE Time-series SE ## eta 0.12751 0.049964 0.0011172 0.0019548 ## lambda_algae 0.10723 0.067599 0.0015116 0.0021817 ## lambda_daphnia 0.03647 0.042346 0.0009469 0.0015642 ## lambda_NH4 0.09269 0.068981 0.0015425 0.0027648 ## upsilon_algae_to_daphnia 0.07785 0.023643 0.0005287 0.0009433 ## upsilon_daphnia_to_NH4 0.04894 0.007169 0.0001603 0.0002337 ## upsilon_NH4_to_algae 0.34347 0.045724 0.0010224 0.0017310 ## zeta 0.43612 0.237678 0.0053146 0.0099599 ## ## 2. Quantiles for each variable: ## ## 2.5% 25% 50% 75% 97.5% ## eta 0.0654264 0.094176 0.11691 0.14817 0.25969 ## lambda_algae 0.0085454 0.057541 0.09705 0.14604 0.26499 ## lambda_daphnia 0.0007375 0.009573 0.02209 0.04620 0.15846 ## lambda_NH4 0.0034180 0.041621 0.07744 0.13014 0.26348 ## upsilon_algae_to_daphnia 0.0451278 0.062348 0.07370 0.08860 0.13730 ## upsilon_daphnia_to_NH4 0.0361709 0.044109 0.04849 0.05325 0.06439 ## upsilon_NH4_to_algae 0.2609275 0.314829 0.34115 0.37062 0.43671 ## zeta 0.1844365 0.286386 0.36440 0.51207 1.09344 ``` If you need to store those values, for example to plot them, you can assign the output of `summary()` to an object: ```r estimates <- run %>% summary() names(estimates) ``` ``` ## [1] "statistics" "quantiles" "start" "end" "thin" "nchain" ``` The means and standard deviations are accessible in `$statistics`: ```r estimates$statistics ``` ``` ## Mean SD Naive SE Time-series SE ## eta 0.12751257 0.049963905 0.0011172269 0.0019547723 ## lambda_algae 0.10723050 0.067599025 0.0015115601 0.0021817165 ## lambda_daphnia 0.03647363 0.042345690 0.0009468784 0.0015642272 ## lambda_NH4 0.09269130 0.068980569 0.0015424524 0.0027648434 ## upsilon_algae_to_daphnia 0.07784906 0.023642981 0.0005286731 0.0009432531 ## upsilon_daphnia_to_NH4 0.04893872 0.007169036 0.0001603045 0.0002336529 ## upsilon_NH4_to_algae 0.34347224 0.045723698 0.0010224130 0.0017310105 ## zeta 0.43611523 0.237677637 0.0053146335 0.0099599011 ``` and the quantiles are in `$quantiles`: ```r estimates$quantiles ``` ``` ## 2.5% 25% 50% 75% 97.5% ## eta 0.0654264365 0.094176064 0.11691344 0.14816691 0.25969422 ## lambda_algae 0.0085454389 0.057540842 0.09704578 0.14603524 0.26498951 ## lambda_daphnia 0.0007375003 0.009573074 0.02208908 0.04620389 0.15846417 ## lambda_NH4 0.0034180067 0.041620929 0.07743919 0.13013928 0.26347716 ## upsilon_algae_to_daphnia 0.0451277873 0.062348037 0.07370339 0.08859513 0.13729757 ## upsilon_daphnia_to_NH4 0.0361709218 0.044109217 0.04849449 0.05324976 0.06439118 ## upsilon_NH4_to_algae 0.2609274862 0.314829360 0.34115256 0.37062017 0.43670569 ## zeta 0.1844365414 0.286385629 0.36440171 0.51207060 1.09343843 ``` ### Parameter correlations The dependencies between your model parameters might be of interest to you. If you would like to analyse the correlations between parameters during the MCMC run, you can use a few ready-made functions to get a quick overview of the correlation structure. The `isotracer` package comes with the minimalist function `mcmc_heatmap()` to draw the strength of parameter correlations: ```r mcmc_heatmap(run) ```

But of course you could use other functions provided by other packages, such as: - `ggmcmc` package + `ggs_crosscorrelation()` + `ggs_pairs()` - `bayesplot` package + `mcmc_pairs()` ### Extracting parameters, trajectories and flows If you are interested in getting detailed tables containing all the samples of the parameter posteriors, you can use the `tidy_mcmc()` function: ```r tidy_mcmc(run) ``` ``` ## # A tibble: 2,000 × 3 ## mcmc.chain mcmc.iteration mcmc.parameters ## ## 1 1 1 ## 2 1 2 ## 3 1 3 ## 4 1 4 ## 5 1 5 ## 6 1 6 ## 7 1 7 ## 8 1 8 ## 9 1 9 ## 10 1 10 ## # ℹ 1,990 more rows ``` By default the parameter values are nested into a list column, but you can also get a flat table with `spread = TRUE`: ```r tidy_mcmc(run, spread = TRUE) ``` ``` ## # A tibble: 2,000 × 10 ## mcmc.chain mcmc.iteration eta lambda_algae lambda_daphnia lambda_NH4 ## ## 1 1 1 0.113 0.0383 0.00535 0.0704 ## 2 1 2 0.242 0.0941 0.0627 0.0302 ## 3 1 3 0.130 0.140 0.0291 0.164 ## 4 1 4 0.0921 0.203 0.0176 0.278 ## 5 1 5 0.0901 0.299 0.0170 0.252 ## 6 1 6 0.103 0.142 0.0596 0.156 ## 7 1 7 0.113 0.160 0.0202 0.0660 ## 8 1 8 0.0978 0.0302 0.0200 0.0110 ## 9 1 9 0.142 0.0859 0.0325 0.0658 ## 10 1 10 0.117 0.0398 0.0456 0.0897 ## # ℹ 1,990 more rows ## # ℹ 4 more variables: upsilon_algae_to_daphnia , upsilon_daphnia_to_NH4 , ## # upsilon_NH4_to_algae , zeta ``` The above table only contains the primary parameters. If you are interested in getting the predicted trajectories for individual MCMC samples, you can use the `tidy_trajectories()` function: ```r # We have to also provide the original network model `m` tt <- tidy_trajectories(m, run, n = 200) tt ``` ``` ## # A tibble: 200 × 4 ## mcmc.chain mcmc.iteration mcmc.parameters trajectories ## ## 1 4 28 ## 2 2 87 ## 3 2 319 ## 4 4 295 ## 5 1 71 ## 6 2 184 ## 7 1 371 ## 8 2 257 ## 9 2 198 ## 10 1 307 ## # ℹ 190 more rows ``` As you can see, the `tt` object is a tidy table which contains the parameter values and the corresponding trajectories calculated for 200 randomly selected MCMC samples. The calculated trajectories are stored in the `trajectories` column and provide the quantities of unmarked and marked tracer (e.g. light and heavy isotope) for each compartment at each time step: ```r tt$trajectories[[1]] ``` ``` ## # A tibble: 1 × 5 ## timepoints unmarked marked sizes proportions ## ## 1 ``` Because each trajectory is itself a table containing a time series for each compartment, the output of `tidy_trajectories()` has several levels of nesting. This makes it a bit cumbersome to manipulate. Note that the output format of this function might change in the future. Here is an example of what can be done using the predicted trajectories: ```r algae <- tt %>% mutate(prop_algae = map(trajectories, function(tr) { tr[["proportions"]][[1]][, "algae"] })) %>% pull(prop_algae) %>% do.call(rbind, .) time <- tt$trajectories[[1]]$timepoints[[1]] plot(0, type = "n", xlim = range(time), ylim = range(algae), las = 1, xlab = "Time", ylab = "Proportion of marked tracer (algae)", main = "Posterior sample of trajectories (for 15N prop. in algae)") invisible(sapply(seq_len(nrow(algae)), function(i) { lines(time, algae[i,], col = adjustcolor("seagreen3", alpha.f = 0.2)) })) ```

Finally, if what you are interested in are not the trajectories per se but the actual flows of nutrient during the experiment, you can use the `tidy_flows()` function to extract flows in a similar way: ```r #' Again, note that we also provide the original network model `m` tf <- tidy_flows(m, run, n = 200) tf ``` ``` ## # A tibble: 200 × 4 ## mcmc.chain mcmc.iteration mcmc.parameters flows ## * ## 1 1 372 ## 2 2 215 ## 3 4 226 ## 4 1 172 ## 5 1 222 ## 6 2 221 ## 7 3 310 ## 8 2 205 ## 9 2 34 ## 10 2 110 ## # ℹ 190 more rows ``` The returned object is very similar to the output of `tidy_trajectories()`, except that the `trajectories` column is replaced by a `flows` column: ```r tf$flows[[1]] ``` ``` ## # A tibble: 6 × 3 ## # Groups: from [3] ## from to average_flow ## ## 1 NH4 algae 0.0747 ## 2 NH4 0.0172 ## 3 algae daphnia 0.0765 ## 4 algae 0.0322 ## 5 daphnia NH4 0.0884 ## 6 daphnia 0.00267 ``` The average flow values are given in flow per unit of time. See `?tidy_flows()` for more details, including the possibility of calculating steady state flows for network systems that admits a steady state equilibrium. The `tidy_trajectories()` and `tidy_flows()` functions are especially useful when you want to do some calculations related to some specific properties of the trajectories or of the nutrient flows over the whole MCMC posterior.