Title: | Miscellaneous Pedometric Tools |
---|---|
Description: | An R implementation of methods employed in the field of pedometrics, soil science discipline dedicated to studying the spatial, temporal, and spatio-temporal variation of soil using statistical and computational methods. The methods found here include the calibration of linear regression models using covariate selection strategies, computation of summary validation statistics for predictions, generation of summary plots, evaluation of the local quality of a geostatistical model of uncertainty, and so on. Other functions simply extend the functionalities of or facilitate the usage of functions from other packages that are commonly used for the analysis of soil data. Formerly available versions of suggested packages no longer available from CRAN can be obtained from the CRAN archive <https://cran.r-project.org/src/contrib/Archive/>. |
Authors: | Alessandro Samuel-Rosa [aut, cre] , Lucia Helena Cunha dos Anjos [ths] , Gustavo Vasques [ths] , Gerard B M Heuvelink [ths] , Juan Carlos Ruiz Cuetos [ctb], Maria Eugenia Polo Garcia [ctb], Pablo Garcia Rodriguez [ctb], Joshua French [ctb], Ken Kleinman [ctb], Dick Brus [ctb] , Frank Harrell Jr [ctb], Ruo Xu [ctb] |
Maintainer: | Alessandro Samuel-Rosa <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.12.1 |
Built: | 2024-11-12 05:11:59 UTC |
Source: | https://github.com/laboratorio-de-pedometria/pedometrics-package |
This package contains many tools and techniques used in the field of pedometrics (see https://en.wikipedia.org/wiki/Pedometric_mapping for a definition of pedometrics). These tools and techniques were developed to fulfill the demands created by the PhD research project (2012-2016) entitled “Contribution to the Construction of Models for Predicting Soil Properties”, developed by Alessandro Samuel-Rosa under the supervision of Dr Lúcia HC Anjos (Universidade Federal Rural do Rio de Janeiro, Brazil), Dr Gustavo M Vasques (Embrapa Solos, Brazil), and Dr Gerard B M Heuvelink (ISRIC - World Soil Information, the Netherlands). The project is/was funded by the CNPq Foundation (Process 140720/2012-0), Ministry of Science and Technology of Brazil, Brasília, DF, 70040-020, Brazil, phone +55 (61) 2022 6002, and the CAPES Foundation (Process ID BEX 11677/13-9), Ministry of Education of Brazil, Brasília, DF, 70040-020, Brazil, phone: +55 (61) 2022 6210.
Several functions simply extend the functionalities of other functions commonly used for the analysis of pedometric data. It should be noted that changes are likely to occur quite often and the use of this package as a dependency for other packages is strongly discouraged.
Author and Maintainer: Alessandro Samuel-Rosa [email protected].
Calculates the adjusted coefficient of determination of a multiple linear regression model.
adjR2(r2, n, p)
adjR2(r2, n, p)
r2 |
Numeric vector with the coefficient of determination to be adjusted. |
n |
Numeric vector providing the number of observations used to fit the multiple linear regression model. |
p |
Numeric vector providing the number of parameters included in the multiple linear regression model. |
A numeric vector with the adjusted coefficient of determination.
Alessandro Samuel-Rosa [email protected]
Coefficient of determination. Wikipedia, The Free Encyclopedia. Available at https://en.wikipedia.org/wiki/Coefficient_of_determination.
x <- adjR2(r2 = 0.95, n = 100, p = 80)
x <- adjR2(r2 = 0.95, n = 100, p = 80)
Take the bounding box of a Spatial* object and create a SpatialPoints* or SpatialPolygons* object from it.
bbox2sp(obj, sp = "SpatialPolygons", keep.crs = TRUE)
bbox2sp(obj, sp = "SpatialPolygons", keep.crs = TRUE)
obj |
Object of class Spatial*. |
sp |
Class of the resulting object with options |
keep.crs |
Logical for assigning the same coordinate reference system to the resulting
Spatial* object. Defaults to |
An object of class SpatialPoints* or SpatialPolygons*.
The sp package, provider of classes and methods for spatial data in R, is required for
bbox2sp()
to work. The development version of the sp package is available on
https://github.com/edzer/sp/ while its old versions are available on the CRAN archive at
https://cran.r-project.org/src/contrib/Archive/sp/.
Some of the solutions used to build this function were found in the source code of the R-package intamapInteractive. As such, the authors of that package, Edzer Pebesma [email protected] and Jon Skoien [email protected], are entitled ‘contributors’ to the R-package pedometrics.
Alessandro Samuel-Rosa [email protected]
Edzer Pebesma, Jon Skoien with contributions from Olivier Baume, A. Chorti, D.T. Hristopulos, S.J. Melles and G. Spiliopoulos (2013). intamapInteractive: procedures for automated interpolation - methods only to be used interactively, not included in intamap package. R package version 1.1-10. https://CRAN.R-project.org/package=intamapInteractive.
if (require(sp)) { data(meuse, package = "sp") sp::coordinates(meuse) <- ~ x + y bb <- bbox2sp(obj = meuse, keep.crs = FALSE) }
if (require(sp)) { data(meuse, package = "sp") sp::coordinates(meuse) <- ~ x + y bb <- bbox2sp(obj = meuse, keep.crs = FALSE) }
Build a series of linear models with stats::lm()
using one or more automated variable
selection methods implemented in the functions stepVIF()
and MASS::stepAIC()
.
buildModelSeries( formula, data, vif = FALSE, vif.threshold = 10, vif.verbose = FALSE, aic = FALSE, aic.direction = "both", aic.trace = FALSE, aic.steps = 5000, ... ) buildMS( formula, data, vif = FALSE, vif.threshold = 10, vif.verbose = FALSE, aic = FALSE, aic.direction = "both", aic.trace = FALSE, aic.steps = 5000, ... )
buildModelSeries( formula, data, vif = FALSE, vif.threshold = 10, vif.verbose = FALSE, aic = FALSE, aic.direction = "both", aic.trace = FALSE, aic.steps = 5000, ... ) buildMS( formula, data, vif = FALSE, vif.threshold = 10, vif.verbose = FALSE, aic = FALSE, aic.direction = "both", aic.trace = FALSE, aic.steps = 5000, ... )
formula |
A list containing one or several model formulas (a symbolic description of the model to be fitted). |
data |
Data frame containing the variables in the model formulas. |
vif |
Logical for performing backward variable selection using the Variance-Inflation Factor
(VIF). Defaults to |
vif.threshold |
Numeric value setting the maximum acceptable VIF value. Defaults to
|
vif.verbose |
Logical for printing iteration results of backward variable selection using
the VIF. Defaults to |
aic |
Logical for performing variable selection using Akaike's Information Criterion (AIC).
Defaults to |
aic.direction |
Character string setting the direction of variable selection when using AIC,
with options |
aic.trace |
Logical for printing iteration results of variable selection using the AIC.
Defaults to |
aic.steps |
Integer value setting the maximum number of steps to be considered for variable
selection using the AIC. Defaults to |
... |
Further arguments passed to |
buildModelSeries()
was devised to deal with a list of linear model formulas. The
main objective is to bring together several functions commonly used when building linear models,
such as automated variable selection. In the current implementation, variable selection can be
done using stepVIF()
or MASS::stepAIC()
or both.
stepVIF()
is a backward variable selection procedure, while MASS::stepAIC()
supports backward, forward, and bidirectional variable selection. For more information about
these functions, please visit their respective help pages.
An important feature of buildModelSeries()
is that it records the initial number
of candidate predictor variables and observations offered to the model, and adds this information
as an attribute to the final selected model. Such feature was included because variable selection
procedures result biased linear models (too optimistic), and the effective number of degrees of
freedom is close to the number of candidate predictor variables initially offered to the model
(Harrell, 2001). With the initial number of candidate predictor variables and observations
offered to the model, one can calculate penalized or adjusted measures of model performance. For
models built using buildModelSeries()
, this can be done using
statsModelSeries()
.
Some important details should be clear when using buildModelSeries()
:
this function was originally devised to deal with a list of formulas, but can also be used with a single formula;
in the current implementation, stepVIF()
runs before MASS::stepAIC()
;
function arguments imported from MASS::stepAIC()
and stepVIF()
were named as
in the original functions, and received a prefix (aic
or vif
) to help the user identifying
which function is affected by a given argument without having to go check the documentation.
A list containing the fitted linear models.
Add option to set the order in which MASS::stepAIC()
and stepVIF()
are run.
The MASS package, provider of support functions and datasets for Venables and Ripley's Modern
Applied Statistics with S, is required for buildModelSeries()
to work. The
development version of the MASS package is available on
https://www.stats.ox.ac.uk/pub/MASS4/ while its old versions are available on the CRAN archive
at https://cran.r-project.org/src/contrib/Archive/MASS/.
Alessandro Samuel-Rosa [email protected]
Harrell, F. E. (2001) Regression modelling strategies: with applications to linear models, logistic regression, and survival analysis. First edition. New York: Springer.
Venables, W. N. and Ripley, B. D. (2002) Modern applied statistics with S. Fourth edition. New York: Springer.
A. Samuel-Rosa, G. B. M. Heuvelink, G. de Mattos Vasques, and L. H. C. dos Anjos, Do more detailed environmental covariates deliver more accurate soil maps?, Geoderma, vol. 243–244, pp. 214–227, May 2015, doi: 10.1016/j.geoderma.2014.12.017.
if (interactive()) { # based on the second example of MASS::stepAIC() library("MASS") cpus1 <- cpus for(v in names(cpus)[2:7]) cpus1[[v]] <- cut(cpus[[v]], unique(stats::quantile(cpus[[v]])), include.lowest = TRUE) cpus0 <- cpus1[, 2:8] # excludes names, authors' predictions cpus.samp <- sample(1:209, 100) cpus.form <- list(formula(log10(perf) ~ syct + mmin + mmax + cach + chmin + chmax + perf), formula(log10(perf) ~ syct + mmin + cach + chmin + chmax), formula(log10(perf) ~ mmax + cach + chmin + chmax + perf)) data <- cpus1[cpus.samp,2:8] cpus.ms <- buildModelSeries(cpus.form, data, vif = TRUE, aic = TRUE) }
if (interactive()) { # based on the second example of MASS::stepAIC() library("MASS") cpus1 <- cpus for(v in names(cpus)[2:7]) cpus1[[v]] <- cut(cpus[[v]], unique(stats::quantile(cpus[[v]])), include.lowest = TRUE) cpus0 <- cpus1[, 2:8] # excludes names, authors' predictions cpus.samp <- sample(1:209, 100) cpus.form <- list(formula(log10(perf) ~ syct + mmin + mmax + cach + chmin + chmax + perf), formula(log10(perf) ~ syct + mmin + cach + chmin + chmax), formula(log10(perf) ~ mmax + cach + chmin + chmax + perf)) data <- cpus1[cpus.samp,2:8] cpus.ms <- buildModelSeries(cpus.form, data, vif = TRUE, aic = TRUE) }
Evaluate the local quality of a geostatistical model of uncertainty (GMU) using summary measures and graphical displays.
checkGMU( observed, simulated, pi = seq(0.01, 0.99, 0.01), symmetric = TRUE, plotit = TRUE )
checkGMU( observed, simulated, pi = seq(0.01, 0.99, 0.01), symmetric = TRUE, plotit = TRUE )
observed |
Vector of observed values at the validation points. See ‘Details’ for more information. |
simulated |
Data frame or matrix with simulated values (columns) for each validation point (rows). See ‘Details’ for more information. |
pi |
Vector defining the width of the series of probability intervals. Defaults to
|
symmetric |
Logical for choosing the type of probability interval. Defaults to
|
plotit |
Logical for plotting the results. Defaults to |
There is no standard way of evaluating the local quality of a GMU. The collection of summary measures and graphical displays presented here is far from being comprehensive. A few definitions are given bellow.
Error statistics measure how well the GMU predicts the measured values at the validation points. Four error statistics are presented:
Measures the bias of the predictions of the GMU, being defined as the mean of the differences between the average of the simulated values and the observed values, i.e. the average of all simulations is taken as the predicted value.
Measures the accuracy of the predictions of the GMU, being defined as the mean of the squared differences between the average of the simulated values and the observed values.
Measures how well the GMU estimate of the prediction error variance (PEV) approximates the observed prediction error variance, where the first is given by the variance of the simulated values, while the second is given by the squared differences between the average of the simulated values, i.e. the squared error (SE). The SRMSE is computed as the average of SE / PEV, where SRMSE > 1 indicates underestimation, while SRMSE < 1 indicates overestimation.
Measures how close the GMU predictions are to the observed values. A scatter plot of the observed values versus the average of the simulated values can be used to check for possible unwanted outliers and non-linearities. The square of the Pearson correlation coefficient measures the fraction of the overall spread of observed values that is explained by the GMU, that is, the amount of variance explained (AVE), also known as coefficient of determination or ratio of scatter.
The coverage probability of an interval is given by the number of times that that interval
contains its parameter over several replications of an experiment. For example, consider the
interquartile range of a Gaussian distributed variable with mean equal to
zero and variance equal to one. The nominal coverage probability of the IQR is 0.5, i.e. two
quarters of the data fall within the IQR. Suppose we generate a Gaussian distributed
random variable with the same mean and variance and count the number of values that fall within
the IQR defined above: about 0.5 of its values will fall within the IQR. If we continue
generating Gaussian distributed random variables with the same mean and variance, on average,
0.5 of the values will fall in that interval.
Coverage probabilities are very useful to evaluate the local quality of a GMU: the closer the observed coverage probabilities of a sequence of probability intervals (PI) are to the nominal coverage probabilities of those PIs, the better the modeling of the local uncertainty.
Two types of PIs can be used here: symmetric, median-centered PIs, and left-bounded PIs. Papritz & Dubois (1999) recommend using left-bounded PIs because they are better at evidencing deviations for both large and small PIs. The authors also point that the coverage probabilities of the symmetric, median-centered PIs can be read from the coverage probability plots produced using left-bounded PIs.
In both cases, the PIs are computed at each validation location using the quantiles of the conditional cumulative distribution function (ccdf) defined by the set of realizations at that validation location. For a sequence of PIs of increasing width, we check which of them contains the observed value at all validation locations. We then average the results over all validation locations to compute the proportion of PIs (with the same width) that contains the observed value: this gives the coverage probability of the PIs.
Deutsch (1997) proposed three summary measures of the coverage probabilities to assess the local goodness of a GMU: accuracy ($A$), precision ($P$), and goodness ($G$). According to Deutsch (1997), a GMU can be considered “good” if it is both accurate and precise. Although easy to compute, these measures seem not to have been explored by many geostatisticians, except for the studies developed by Pierre Goovaerts and his later software implementation (Goovaerts, 2009). Richmond (2001) suggests that they should not be used as the only measures of the local quality of a GMU.
An accurate GMU is that for which the proportion of true values falling within the $p$
PI is equal to or larger than the nominal probability $p$, that is, when
. In the
coverage probability plot, a GMU will be more accurate when all points are on or above the 1:1
line. The range of $A$ goes from 0 (lest accurate) to 1 (most accurate).
The precision, $P$, is defined only for an accurate GMU, and measures how close is to
$p$. The range of $P$ goes from 0 (lest precise) to 1 (most precise). Thus, a GMU will be more
accurate when all points in the PI-width plot are on or above the 1:1 line.
The goodness, $G$, is a measure of the departure of the points from the 1:1 line in the
coverage probability plot. $G$ ranges from 0 (minimum goodness) to 1 (maximum goodness), the
maximum $G$ being achieved when , that is, all points in both coverage probability
and interval width plots are exactly on the 1:1 line.
It is worth noting that the coverage probability and PI-width plots are relevant mainly to GMU created using conditional simulations, that is, simulations that are locally conditioned to the data observed at the validation locations. Conditioning the simulations locally serves the purposes of honoring the available data and reducing the variance of the output realizations. This is why one would like to find the points falling above the 1:1 line in both coverage probability and PI-width plots. For unconditional simulations, that is, simulations that are only globally conditioned to the histogram (and variogram) of the data observed at the validation locations, one would expect to find that, over a large number of simulations, the whole set of possible values (i.e. the global histogram) can be generated at any node of the simulation grid. In other words, it is expected to find all points on the 1:1 line in both coverage probability and PI-width plots. Deviations from the 1:1 line could then be used as evidence of problems in the simulation.
A list
of summary measures and plots of the coverage probability and width of probability
intervals.
Comments by Pierre Goovaerts [email protected] were important to describe how to use the coverage probability and PI-width plots when a GMU is created using unconditional simulations.
Alessandro Samuel-Rosa [email protected]
Deutsch, C. Direct assessment of local accuracy and precision. Baafi, E. Y. & Schofield, N. A. (Eds.) Geostatistics Wollongong '96. Dordrecht: Kinwer Academic Publishers, v. I, p. 115-125, 1997.
Papritz, A. & Dubois, J. R. Mapping heavy metals in soil by (non-)linear kriging: an empirical validation. Gómez-Hernández, J.; Soares, A. & Froidevaux, R. (Eds.) geoENV II – Geostatistics for Environmental Applications. Springer, p. 429-440, 1999.
Goovaerts, P. Geostatistical modelling of uncertainty in soil science. Geoderma. v. 103, p. 3 - 26, 2001.
Goovaerts, P. AUTO-IK: a 2D indicator kriging program for the automated non-parametric modeling of local uncertainty in earth sciences. Computers & Geosciences. v. 35, p. 1255-1270, 2009.
Richmond, A. J. Maximum profitability with minimum risk and effort. Xie, H.; Wang, Y. & Jiang, Y. (Eds.) Proceedings 29th APCOM. Lisse: A. A. Balkema, p. 45-50, 2001.
Ripley, B. D. Stochastic simulation. New York: John Wiley & Sons, p. 237, 1987.
if (interactive()) { set.seed(2001) observed <- round(rnorm(100), 3) simulated <- t( sapply(1:length(observed), function (i) round(rnorm(100), 3))) resa <- checkGMU(observed, simulated, symmetric = T) resb <- checkGMU(observed, simulated, symmetric = F) resa$error; resb$error resa$goodness; resb$goodness }
if (interactive()) { set.seed(2001) observed <- round(rnorm(100), 3) simulated <- t( sapply(1:length(observed), function (i) round(rnorm(100), 3))) resa <- checkGMU(observed, simulated, symmetric = T) resb <- checkGMU(observed, simulated, symmetric = F) resa$error; resb$error resa$goodness; resb$goodness }
Create break points, compute strata proportions, and stratify numerical variable(s) to create categorical variable(s).
cont2cat(x, breaks, integer = FALSE) breakPoints(x, n, type = "area", prop = FALSE) stratify(x, n, type = "area", integer = FALSE)
cont2cat(x, breaks, integer = FALSE) breakPoints(x, n, type = "area", prop = FALSE) stratify(x, n, type = "area", integer = FALSE)
x |
Vector, data frame or matrix with data on the numerical variable(s) to be categorized/stratified. |
breaks |
Vector or list containing the lower and upper limits that should be used to break the numerical variable(s) into categories. See ‘Details’ for more information. |
integer |
Logical value indicating if the categorical variable(s) should be returned as
|
n |
Integer value indicating the number of categories/strata that should be created. |
type |
Character value indicating the type of categories/strata that should be used, with
options |
prop |
Logical value indicating if the strata proportions should be returned? Defaults to
|
Argument breaks
must be a vector if x
is a vector, but a list if x
is a data frame or
matrix. Using a list allows breaking each column of x
into different number of categories.
A vector, data frame, or matrix, depending on the class of x
.
The SpatialTools package, provider of tools for spatial data analysis in R, is required for
breakPoints()
and stratify()
to work. The development version of
the SpatialTools package is available on https://github.com/jfrench/SpatialTools while its
old versions are available on the CRAN archive at
https://cran.r-project.org/src/contrib/Archive/SpatialTools/.
The spsann package, provider of methods for the optimization of sample configurations using
spatial simulated annealing in R, requires breakPoints()
,
cont2cat()
and stratify()
for some of its functions to work. The
development version of the spsann package is available on
https://github.com/Laboratorio-de-Pedometria/spsann-package.
Alessandro Samuel-Rosa [email protected]
B. Minasny and A. B. McBratney. A conditioned Latin hypercube method for sampling in the presence of ancillary information. Computers & Geosciences, vol. 32, no. 9, pp. 1378–1388, Nov. 2006, doi: 10.1016/j.cageo.2005.12.009.
T. Hengl, D. G. Rossiter, and A. Stein. Soil sampling strategies for spatial prediction by correlation with auxiliary maps. Australian Journal of Soil Research, vol. 41, no. 8, pp. 1403–1422, 2003, doi: 10.1071/SR03005.
if (require(SpatialTools)) { ## Compute the break points of marginal strata x <- data.frame(x = round(rnorm(10), 1), y = round(rlnorm(10), 1)) x <- breakPoints(x = x, n = 4, type = "area", prop = TRUE) ## Convert numerical data into categorical data # Matrix x <- y <- c(1:10) x <- cbind(x, y) breaks <- list(c(1, 2, 4, 8, 10), c(1, 5, 10)) y <- cont2cat(x, breaks) # Data frame x <- y <- c(1:10) x <- data.frame(x, y) breaks <- list(c(1, 2, 4, 8, 10), c(1, 5, 10)) y <- cont2cat(x, breaks, integer = TRUE) # Vector x <- c(1:10) breaks <- c(1, 2, 4, 8, 10) y <- cont2cat(x, breaks, integer = TRUE) ## Stratification x <- data.frame(x = round(rlnorm(10), 1), y = round(rnorm(10), 1)) x <- stratify(x = x, n = 4, type = "area", integer = TRUE) x }
if (require(SpatialTools)) { ## Compute the break points of marginal strata x <- data.frame(x = round(rnorm(10), 1), y = round(rlnorm(10), 1)) x <- breakPoints(x = x, n = 4, type = "area", prop = TRUE) ## Convert numerical data into categorical data # Matrix x <- y <- c(1:10) x <- cbind(x, y) breaks <- list(c(1, 2, 4, 8, 10), c(1, 5, 10)) y <- cont2cat(x, breaks) # Data frame x <- y <- c(1:10) x <- data.frame(x, y) breaks <- list(c(1, 2, 4, 8, 10), c(1, 5, 10)) y <- cont2cat(x, breaks, integer = TRUE) # Vector x <- c(1:10) breaks <- c(1, 2, 4, 8, 10) y <- cont2cat(x, breaks, integer = TRUE) ## Stratification x <- data.frame(x = round(rlnorm(10), 1), y = round(rnorm(10), 1)) x <- stratify(x = x, n = 4, type = "area", integer = TRUE) x }
The functions listed here are no longer part of the pedometrics package. If you need to use any of these functions, you can still find them at https://github.com/samuel-rosa/ASRtools or https://cran.r-project.org/src/contrib/Archive/pedometrics/.
coordenadas(...) cdfPlot(...) cdfStats(...) cdfTable(...) gcpDiff(...) trend.terms(...) trend.matrix(...)
coordenadas(...) cdfPlot(...) cdfStats(...) cdfTable(...) gcpDiff(...) trend.terms(...) trend.matrix(...)
... |
Not used. |
No return value, called for side effects.
Alessandro Samuel-Rosa [email protected]
Tony Olsen [email protected]
Tom Kincaid [email protected]
Compute the Cramer's V, a descriptive statistic that measures the association between categorical variables.
cramer(x)
cramer(x)
x |
Data frame or matrix with a set of categorical variables. |
Any integer variable is internally converted to a factor.
A matrix with the Cramer's V between the categorical variables.
The spsann package, provider of methods for the optimization of sample configurations using
spatial simulated annealing in R, requires cramer()
for some of its functions to
work. The development version of the spsann package is available on
https://github.com/Laboratorio-de-Pedometria/spsann-package.
The original code is available at https://sas-and-r.blogspot.com/, Example 8.39: calculating Cramer's V, posted by Ken Kleinman on Friday, June 3, 2011. As such, Ken Kleinman [email protected] is entitled a ‘contributor’ to the R-package pedometrics.
Alessandro Samuel-Rosa [email protected]
Cramér, H. Mathematical methods of statistics. Princeton: Princeton University Press, p. 575, 1946.
Everitt, B. S. The Cambridge dictionary of statistics. Cambridge: Cambridge University Press, p. 432, 2006.
if (interactive()) { data(meuse, package = "sp") str(meuse) test <- cramer(meuse[, c("ffreq", "soil", "lime", "landuse")]) }
if (interactive()) { data(meuse, package = "sp") str(meuse) test <- cramer(meuse[, c("ffreq", "soil", "lime", "landuse")]) }
Calculate the module and azimuth of the difference on x and y coordinates between two sets of ground control points (GCP). It is suited to perform calculations for topographical coordinates only. The origin is set in the y coordinate, and rotation performed clockwise.
gcpVector(dx, dy)
gcpVector(dx, dy)
dx |
Numeric vector containing the difference on the ‘x’ coordinate between two sets of GCP. |
dy |
Numeric vector containing the difference on the ‘y’ coordinate between two sets of GCP. |
A data frame containing the module, its square and azimuth. These three columns are named ‘module’, ‘sq.module’ and ‘azimuth’.
This function was adapted from package's VecStatGraphs2D function LoadData()
.
Juan Carlos Ruiz Cuetos [email protected]
Maria Eugenia Polo Garcia
[email protected]
Pablo Garcia Rodriguez [email protected]
Alessandro
Samuel-Rosa [email protected]
Ruiz-Cuetos J.C., Polo M.E. and Rodriguez P.G. (2012). VecStatGraphs2D: Vector analysis using graphical and analytical methods in 2D. R package version 1.6. https://CRAN.R-project.org/package=VecStatGraphs2D.
x <- gcpVector(dx = rnorm(3, 5, 10), dy = rnorm(3, 5, 10))
x <- gcpVector(dx = rnorm(3, 5, 10), dy = rnorm(3, 5, 10))
Evaluate the data type contained in an object.
isNumint(x) allNumint(x) anyNumint(x) whichNumint(x) allInteger(x) anyInteger(x) whichInteger(x) allFactor(x) anyFactor(x) whichFactor(x) allNumeric(x) anyNumeric(x) whichNumeric(x) uniqueClass(x)
isNumint(x) allNumint(x) anyNumint(x) whichNumint(x) allInteger(x) anyInteger(x) whichInteger(x) allFactor(x) anyFactor(x) whichFactor(x) allNumeric(x) anyNumeric(x) whichNumeric(x) uniqueClass(x)
x |
Object to be tested. |
TRUE
or FALSE
depending on whether x
contains a given data type.
Alessandro Samuel-Rosa [email protected]
base::is.numeric()
, base::is.integer()
, base::is.factor()
.
# Vector of integers x <- 1:10 isNumint(x) # FALSE # Vector of numeric integers x <- as.numeric(x) isNumint(x) # TRUE # Vector of numeric values x <- c(1.1, 1, 1, 1, 2) isNumint(x) # FALSE allNumint(x) # FALSE anyNumint(x) # TRUE whichNumint(x) # Single numeric integer isNumint(1) # TRUE # Single numeric value isNumint(1.1) # FALSE
# Vector of integers x <- 1:10 isNumint(x) # FALSE # Vector of numeric integers x <- as.numeric(x) isNumint(x) # TRUE # Vector of numeric values x <- c(1.1, 1, 1, 1, 2) isNumint(x) # FALSE allNumint(x) # FALSE anyNumint(x) # TRUE whichNumint(x) # Single numeric integer isNumint(1) # TRUE # Single numeric value isNumint(1.1) # FALSE
Plotting correlation matrices.
plotCor(r, r2, col, breaks, col.names, ...)
plotCor(r, r2, col, breaks, col.names, ...)
r |
A square matrix with correlation values. |
r2 |
(optional) A second square matrix with correlation values. |
col |
(optional) Color table to use for |
breaks |
(optional) Break points in sorted order to indicate the intervals for assigning the
colors. See |
col.names |
(optional) Character vector with short (up to 5 characters) column names. |
... |
(optional) Additional parameters passed to plotting functions. |
A correlation plot in an alternative way of showing the strength of the empirical correlations between variables. This is done by using a diverging color palette, where the darker the color, the stronger the absolute correlation value.
plotCor()
can also be used to compare correlations between the same variables at
different points in time or space or for different observations. This is done by passing two
square correlation matrices using arguments r
and r2
. The lower triangle of the resulting
correlation plot will contain correlations from r
, correlations from r2
will be in the upper
triangle, and the diagonal will be empty.
A correlation plot.
The fields package, provider of tools for spatial data in R, is required for
plotCor()
to work. The development version of the fields package is
available on https://github.com/dnychka/fieldsRPackage while its old versions are available on
the CRAN archive at https://cran.r-project.org/src/contrib/Archive/fields/.
Alessandro Samuel-Rosa [email protected]
Neuwirth E (2022). RColorBrewer: ColorBrewer Palettes. R package version 1.1-3, https://CRAN.R-project.org/package=RColorBrewer.
if (all(c(require(sp), require(fields)))) { data(meuse, package = "sp") cols <- c("cadmium", "copper", "lead", "zinc", "elev", "dist", "om") # A single correlation matrix r <- cor(meuse[1:20, cols], use = "complete") r <- round(r, 2) plotCor(r) # Two correlation matrices: r2 goes in the upper triangle r2 <- cor(meuse[21:40, cols], use = "complete") r2 <- round(r2, 2) plotCor(r, r2) }
if (all(c(require(sp), require(fields)))) { data(meuse, package = "sp") cols <- c("cadmium", "copper", "lead", "zinc", "elev", "dist", "om") # A single correlation matrix r <- cor(meuse[1:20, cols], use = "complete") r <- round(r, 2) plotCor(r) # Two correlation matrices: r2 goes in the upper triangle r2 <- cor(meuse[21:40, cols], use = "complete") r2 <- round(r2, 2) plotCor(r, r2) }
Create four plots for exploratory spatial data analysis (ESDA): histogram + density plot, bubble plot, variogram plot, and variogram map.
plotESDA( z, lat, lon, lags = NULL, cutoff = NULL, width = c(cutoff/20), leg.pos = "right" )
plotESDA( z, lat, lon, lags = NULL, cutoff = NULL, width = c(cutoff/20), leg.pos = "right" )
z |
Vector of numeric values of the variable for with ESDA plots should be created. |
lat |
Vector of numeric values containing the y coordinate (latitude) of the point locations
where the |
lon |
Vector of numeric values containing the x coordinate (longitude) of the point
locations where the |
lags |
(optional) Numerical vector; upper boundaries of lag-distance classes. See argument
|
cutoff |
(optional) Integer value defining the spatial separation distance up to which point pairs are included in semi-variance estimates. Defaults to the length of the diagonal of the box spanning the data divided by three. |
width |
Integer value specifying the width of subsequent distance intervals into which data
point pairs are grouped for semi-variance estimates. Defaults to |
leg.pos |
(optional) Character value indication the location of the legend of the bubble
plot. Defaults to |
The user should visit the help pages of gstat::variogram()
, plotHD()
,
sp::bubble()
and sp::spplot()
to obtain more details about the main functions used to built
plotESDA()
.
Four plots: histogram and density plot, bubble plot, empirical variogram, and variogram map.
The sp package, provider of classes and methods for spatial data in R, is required for
plotESDA()
to work. The development version of the sp package is available on
https://github.com/edzer/sp/ while its old versions are available on the CRAN archive at
https://cran.r-project.org/src/contrib/Archive/sp/.
The gstat package, provider of methods for spatial and spatio-temporal geostatistical
modelling, prediction and simulation in R, is required for plotESDA()
to work. The
development version of the sp package is available on https://github.com/r-spatial/gstat
while its old versions are available on the CRAN archive at
https://cran.r-project.org/src/contrib/Archive/gstat/.
Alessandro Samuel-Rosa [email protected]
Cressie, N.A.C. (1993) Statistics for Spatial Data. New York: John Wiley and Sons, p.900, 1993.
Pebesma, E.J. (2004) Multivariable geostatistics in S: the gstat package. Computers and Geosciences, 30:683-691, 2004.
Webster, R., Oliver, M.A. Geostatistics for environmental scientists. Chichester: John Wiley and Sons, p.315, 2007.
if (all(require(sp), require(gstat))) { data(meuse, package = "sp") p <- plotESDA(z = meuse$zinc, lat = meuse$y, lon = meuse$x) }
if (all(require(sp), require(gstat))) { data(meuse, package = "sp") p <- plotESDA(z = meuse$zinc, lat = meuse$y, lon = meuse$x) }
Plot a histogram and a density plot of a single variable using the R-package lattice.
plotHist( x, HD = "over", nint = 20, digits = 2, stats = TRUE, BoxCox = FALSE, col = c("lightgray", "black"), lwd = c(1, 1), lty = "dashed", xlim, ylim, ... ) plotHD( x, HD = "over", nint = 20, digits = 2, stats = TRUE, BoxCox = FALSE, col = c("lightgray", "black"), lwd = c(1, 1), lty = "dashed", xlim, ylim, ... )
plotHist( x, HD = "over", nint = 20, digits = 2, stats = TRUE, BoxCox = FALSE, col = c("lightgray", "black"), lwd = c(1, 1), lty = "dashed", xlim, ylim, ... ) plotHD( x, HD = "over", nint = 20, digits = 2, stats = TRUE, BoxCox = FALSE, col = c("lightgray", "black"), lwd = c(1, 1), lty = "dashed", xlim, ylim, ... )
x |
Vector of numeric values of the variable for which the histogram and density plot should be created. |
HD |
Character value indicating the type of plot to be created. Available options are
|
nint |
Integer specifying the number of histogram bins. Defaults to |
digits |
Integer indicating the number of decimal places to be used
when printing the statistics of the variable |
stats |
Logical to indicate if descriptive statistics of the variable |
BoxCox |
Logical to indicate if the variable |
col |
Vector of two elements, the first indicating the color of the histogram, the second
indicating the color of the density plot. Defaults to |
lwd |
Vector of two elements, the first indicating the line width of the histogram, the
second indicating the line width of the density plot. Defaults to |
lty |
Character value indicating the line type for the density plot. Defaults to
|
xlim |
Vector of two elements defining the limits of the x axis. The function automatically
optimizes |
ylim |
Vector of two elements defining the limits of the y axis. The function automatically
optimizes |
... |
Other arguments that can be passed to lattice functions. There is no guarantee that they will work. |
The user should visit the help pages of lattice::histogram()
, lattice::densityplot()
,
lattice::panel.mathdensity()
, car::powerTransform()
, and car::bcPower()
to obtain more
details about the main functions used to built plotHD()
.
An object of class "trellis"
. The lattice::update.trellis()
method can be used to update
components of the object and the lattice::print.trellis()
print method (usually called by
default) will plot it on an appropriate plotting device.
The car package, provider of functions to accompany Fox and Weisberg's An R Companion to
Applied Regression, is required for plotHist()
to work. The development version of
the car package is available on https://r-forge.r-project.org/projects/car/ while its old
versions are available on the CRAN archive at
https://cran.r-project.org/src/contrib/Archive/car/.
Alessandro Samuel-Rosa [email protected]
Sarkar, Deepayan (2008) Lattice: Multivariate Data Visualization with R, Springer. http://lmdvr.r-forge.r-project.org/
lattice::histogram()
, lattice::densityplot()
, lattice::panel.mathdensity()
if (all(c(require(car), require(lattice), require(latticeExtra)))) { x <- rnorm(100, 10, 2) p1 <- plotHist(x, HD = "stack") p2 <- plotHist(x, HD = "over") }
if (all(c(require(car), require(lattice), require(latticeExtra)))) { x <- rnorm(100, 10, 2) p1 <- plotHist(x, HD = "stack") p2 <- plotHist(x, HD = "over") }
Produce a graphical output to examine the effect of using different model specifications (design)
on the predictive performance of these models (a model series). Devised to access the results of
buildModelSeries()
and statsMS()
, but can be easily adapted to
work with any model structure and performance measure.
plotModelSeries( obj, grid, line, ind, type = c("b", "g"), pch = c(20, 2), size = 0.5, arrange = "desc", color = NULL, xlim = NULL, ylab = NULL, xlab = NULL, at = NULL, ... ) plotMS( obj, grid, line, ind, type = c("b", "g"), pch = c(20, 2), size = 0.5, arrange = "desc", color = NULL, xlim = NULL, ylab = NULL, xlab = NULL, at = NULL, ... )
plotModelSeries( obj, grid, line, ind, type = c("b", "g"), pch = c(20, 2), size = 0.5, arrange = "desc", color = NULL, xlim = NULL, ylab = NULL, xlab = NULL, at = NULL, ... ) plotMS( obj, grid, line, ind, type = c("b", "g"), pch = c(20, 2), size = 0.5, arrange = "desc", color = NULL, xlim = NULL, ylab = NULL, xlab = NULL, at = NULL, ... )
obj |
Object of class
See ‘Details’ for more information. |
grid |
Vector of integer values or character strings indicating the columns of the
|
line |
Character string or integer value indicating which of the performance statistics
(usually calculated by |
ind |
Integer value indicating for which group of models the mean rank is to be calculated. See ‘Details’ for more information. |
type |
Vector of character strings indicating some of the effects to be used when plotting
the performance statistics using |
pch |
Vector with two integer values specifying the symbols to be used to plot points. The
first sets the symbol used to plot the performance statistic, while the second sets the symbol
used to plot the mean rank of the indicator set using argument |
size |
Numeric value specifying the size of the symbols used for plotting the mean rank of
the indicator set using argument |
arrange |
Character string indicating how the model series should be arranged, which can be
in ascending ( |
color |
Vector defining the colors to be used in the grid produced by function
|
xlim |
Numeric vector of length 2, giving the x coordinates range. If |
ylab |
Character vector of length 2, giving the y-axis labels. When |
xlab |
Character vector of unit length, the x-axis label. Defaults |
at |
Numeric vector indicating the location of tick marks along the x axis (in native coordinates). |
... |
Other arguments for plotting, although most of these have no been tested. Argument
|
This section gives more details about arguments obj
, grid
, line
, arrange
, and ind
.
The argument obj
usually constitutes a data.frame
returned by statsMS()
.
However, the user can use any data.frame
object as far as it contains the two basic units of
information needed:
design data passed with argument grid
performance statistic passed with argument line
The argument grid
indicates the design data which is used to produce the grid output in the
top of the model series plot. By design we mean the data that specify the structure of each
model and how they differ from each other. Suppose that eight linear models were fit using three
types of predictor variables (a
, b
, and c
). Each of these predictor variables is available
in two versions that differ by their accuracy, where 0
means a less accurate predictor
variable, while 1
means a more accurate predictor variable. This yields 2^3 = 8 total possible
combinations. The design data would be of the following form:
> design
a b c
1 0 0 0
2 0 0 1
3 0 1 0
4 1 0 0
5 0 1 1
6 1 0 1
7 1 1 0
8 1 1 1
The argument line
corresponds to the performance statistic that is used to arrange the models
in ascending or descending order, and to produce the line output in the bottom of the model
series plot. For example, it can be a series of values of adjusted coefficient of determination,
one for each model:
adj_r2 <- c(0.87, 0.74, 0.81, 0.85, 0.54, 0.86, 0.90, 0.89)
The argument arrange
automatically arranges the model series according to the performance
statistics selected with argument line
. If obj
is a data.frame
returned by
statsMS()
, then the function uses standard arranging approaches. For most
performance statistics, the models are arranged in descending order. The exception is when
"r2"
, "adj_r2"
, or "ADJ_r2"
are used, in which case the models are arranged in ascending
order. This means that the model with lowest value appears in the leftmost side of the model
series plot, while the models with the highest value appears in the rightmost side of the plot.
> arrange(obj, adj_r2)
id a b c adj_r2
1 5 1 0 1 0.54
2 2 0 0 1 0.74
3 3 1 0 0 0.81
4 4 0 1 0 0.85
5 6 0 1 1 0.86
6 1 0 0 0 0.87
7 8 1 1 1 0.89
8 7 1 1 0 0.90
This results suggest that the best performing model is that of id = 7
, while the model of
id = 5
is the poorest one.
The model series plot allows to see how the design influences model performance. This is achieved mainly through the use of different colors in the grid output, where each unique value in the design data is represented by a different color. For the example given above, one could try to see if the models built with the more accurate versions of the predictor variables have a better performance by identifying their relative distribution in the model series plot. The models placed at the rightmost side of the plot are those with the best performance.
The argument ind
provides another tool to help identifying how the design, more specifically
how each variable in the design data, influences model performance. This is done by simply
calculating the mean ranking of the models that were built using the updated version of each
predictor variable. This very same mean ranking is also used to rank the predictor variables and
thus identify which of them is the most important.
After arranging the design
data described above using the adjusted coefficient of
determination, the following mean rank is obtained for each predictor variable:
> rank_center
a b c
1 5.75 6.25 5.25
This result suggests that the best model performance is obtained when using the updated version
of the predictor variable b
. In the model series plot, the predictor variable b
appears in
the top row, while the predictor variable c
appears in the bottom row.
An object of class "trellis"
consisting of a model series plot.
The grDevices package, provider of graphics devices and support for colours and fonts in R,
is required for plotModelSeries()
to work.
The grid package, a rewrite of the graphics layout capabilities in R, is required for
plotModelSeries()
to work.
Use the original functions lattice::xyplot()
and lattice::levelplot()
for higher
customization.
Some of the solutions used to build this function were found in the source code of the R-package mvtsplot. As such, the author of that package, Roger D. Peng [email protected], is entitled ‘contributors’ to the R-package pedometrics.
Alessandro Samuel-Rosa [email protected]
Deepayan Sarkar (2008). Lattice: Multivariate Data Visualization with R. Springer, New York. ISBN 978-0-387-75968-5.
Roger D. Peng (2008). A method for visualizing multivariate time series data. Journal of Statistical Software. v. 25 (Code Snippet), p. 1-17.
Roger D. Peng (2012). mvtsplot: Multivariate Time Series Plot. R package version 1.0-1. https://CRAN.R-project.org/package=mvtsplot.
A. Samuel-Rosa, G. B. M. Heuvelink, G. de Mattos Vasques, and L. H. C. dos Anjos, Do more detailed environmental covariates deliver more accurate soil maps?, Geoderma, vol. 243–244, pp. 214–227, May 2015, doi: 10.1016/j.geoderma.2014.12.017.
lattice::xyplot()
lattice::levelplot()
if (all(require(grDevices), require(grid))) { # This example follows the discussion in section "Details" # Note that the data.frame is created manually id <- c(1:8) design <- data.frame(a = c(0, 0, 1, 0, 1, 0, 1, 1), b = c(0, 0, 0, 1, 0, 1, 1, 1), c = c(0, 1, 0, 0, 1, 1, 0, 1)) adj_r2 <- c(0.87, 0.74, 0.81, 0.85, 0.54, 0.86, 0.90, 0.89) obj <- cbind(id, design, adj_r2) p <- plotModelSeries(obj, grid = c(2:4), line = "adj_r2", ind = 1, color = c("lightyellow", "palegreen"), main = "Model Series Plot") }
if (all(require(grDevices), require(grid))) { # This example follows the discussion in section "Details" # Note that the data.frame is created manually id <- c(1:8) design <- data.frame(a = c(0, 0, 1, 0, 1, 0, 1, 1), b = c(0, 0, 0, 1, 0, 1, 1, 1), c = c(0, 1, 0, 0, 1, 1, 0, 1)) adj_r2 <- c(0.87, 0.74, 0.81, 0.85, 0.54, 0.86, 0.90, 0.89) obj <- cbind(id, design, adj_r2) p <- plotModelSeries(obj, grid = c(2:4), line = "adj_r2", ind = 1, color = c("lightyellow", "palegreen"), main = "Model Series Plot") }
This function returns the minimum value in each row of a numeric matrix.
rowMinCpp(x)
rowMinCpp(x)
x |
Numeric matrix with two or more rows and/or columns. |
This function is implemented in C++ to speed-up the computation time for large matrices.
A numeric vector with the minimum value of each row if the matrix.
Alessandro Samuel-Rosa [email protected]
rowMins()
in https://cran.r-project.org/package=matrixStats.
x <- matrix(rnorm(20), nrow = 5) rowMinCpp(x)
x <- matrix(rnorm(20), nrow = 5) rowMinCpp(x)
Compute the moment coefficient of skewness of a continuous, possibly non-normal variable.
skewness(x)
skewness(x)
x |
Numeric vector, the values of the variable of interest. |
A numerical value: the moment coefficient of skewness of x
.
Alessandro Samuel-Rosa [email protected]
B. S. Everitt, The Cambridge Dictionary of Statistics, 3rd ed. Cambridge: Cambridge University Press, 2006, p. 432.
D. N. Joanes and C. A. Gill, Comparing measures of sample skewness and kurtosis, J Royal Statistical Soc D, vol. 47, no. 1, pp. 183–189, Mar. 1998, doi: 10.1111/1467-9884.00122.
H. Cramér, Mathematical Methods of Statistics. Princeton: Princeton University Press, 1946, p. 575.
x <- rlnorm(10) skw <- skewness(x)
x <- rlnorm(10) skw <- skewness(x)
Compute several statistics measuring the performance of a series of linear models built using
buildModelSeries()
, with an option to rank the models based on one of the returned
performance statistics.
statsModelSeries(model, design.info, arrange.by, digits) statsMS(model, design.info, arrange.by, digits)
statsModelSeries(model, design.info, arrange.by, digits) statsMS(model, design.info, arrange.by, digits)
model |
A list of linear models returned by |
design.info |
Extra information about the linear models in the series. |
arrange.by |
Character string defining if the table with the performance statistics of the
linear models should be arranged, and which column should be used. Available options are
|
digits |
Integer or vector with six integers indicating the number of decimal places to be used to round the performance statistics. If a vector is passed to the function, the number of decimal places should be in the following order:
|
This function was devised to deal with a list of linear models generated by the function
buildModelSeries()
. The main objective is to compare several linear models using
several performance statistics. Such statistics can then be used to rank the linear models and
identify, for example, the best performing model, given the selected performance statistics.
An important feature of buildModelSeries()
is that it uses the information about
the initial number of candidate predictor variables offered to the build the model to calculate
penalized or adjusted measures of model performance. Such information is recorded as an attribute
of the final model selected by buildModelSeries()
. This feature was included in
statsModelSeries()
because data-driven variable selection results biased linear
models (too optimistic), and the effective number of degrees of freedom is close to the number of
candidate predictor variables initially offered to the model (Harrell, 2001).
A data frame with several performance statistics:
Identification of the model.
Number of candidate predictor variables initially offered to the model.
Number of degrees of freedom of the final selected model.
Akaike's Information Criterion (AIC). Obtained using
extractAIC
.
Root-mean squared error, calculated based on the number of candidate predictor variables initially offered to the model.
Normalized Root-mean squared error, calculated as the ratio between the RMSE and the standard deviation of the observed values of the dependent variable.
Multiple coefficient of determination.
Adjusted multiple coefficient of determination.
Adjusted multiple coefficient of determination. Calculations are done based on the number of candidate predictor variables initially offered to the model.
Include other performance statistics such as: PRESS, BIC, Mallow's Cp, max(VIF);
Add option to select which performance statistics should be returned.
Alessandro Samuel-Rosa [email protected]
Harrell, F. E. (2001) Regression modelling strategies: with applications to linear models, logistic regression, and survival analysis. First edition. New York: Springer.
Venables, W. N. and Ripley, B. D. (2002) Modern applied statistics with S. Fourth edition. New York: Springer.
A. Samuel-Rosa, G. B. M. Heuvelink, G. de Mattos Vasques, and L. H. C. dos Anjos, Do more detailed environmental covariates deliver more accurate soil maps?, Geoderma, vol. 243–244, pp. 214–227, May 2015, doi: 10.1016/j.geoderma.2014.12.017.
buildModelSeries()
, plotModelSeries()
if (interactive()) { # based on the second example of function MASS:stepAIC() require(MASS) cpus1 <- cpus for(v in names(cpus)[2:7]) cpus1[[v]] <- cut(cpus[[v]], unique(quantile(cpus[[v]])), include.lowest = TRUE) cpus0 <- cpus1[, 2:8] # excludes names, authors' predictions cpus.samp <- sample(1:209, 100) cpus.form <- list(formula(log10(perf) ~ syct + mmin + mmax + cach + chmin + chmax + perf), formula(log10(perf) ~ syct + mmin + cach + chmin + chmax), formula(log10(perf) ~ mmax + cach + chmin + chmax + perf)) data <- cpus1[cpus.samp,2:8] cpus.ms <- buildModelSeries(cpus.form, data, vif = TRUE, aic = TRUE) cpus.des <- data.frame(a = c(0, 1, 0), b = c(1, 0, 1), c = c(1, 1, 0)) stats <- statsModelSeries(cpus.ms, design.info = cpus.des, arrange.by = "aic") }
if (interactive()) { # based on the second example of function MASS:stepAIC() require(MASS) cpus1 <- cpus for(v in names(cpus)[2:7]) cpus1[[v]] <- cut(cpus[[v]], unique(quantile(cpus[[v]])), include.lowest = TRUE) cpus0 <- cpus1[, 2:8] # excludes names, authors' predictions cpus.samp <- sample(1:209, 100) cpus.form <- list(formula(log10(perf) ~ syct + mmin + mmax + cach + chmin + chmax + perf), formula(log10(perf) ~ syct + mmin + cach + chmin + chmax), formula(log10(perf) ~ mmax + cach + chmin + chmax + perf)) data <- cpus1[cpus.samp,2:8] cpus.ms <- buildModelSeries(cpus.form, data, vif = TRUE, aic = TRUE) cpus.des <- data.frame(a = c(0, 1, 0), b = c(1, 0, 1), c = c(1, 1, 0)) stats <- statsModelSeries(cpus.ms, design.info = cpus.des, arrange.by = "aic") }
This function takes a linear model and selects the subset of predictor variables that meet a user-specific collinearity threshold measured by the (generalized) variance-inflation factor (VIF).
stepVIF(model, threshold = 10, verbose = FALSE)
stepVIF(model, threshold = 10, verbose = FALSE)
model |
Linear model (object of class 'lm') containing collinear predictor variables. |
threshold |
Positive number defining the maximum allowed VIF. Defaults to |
verbose |
Logical indicating if iteration results should be printed. Defaults to
|
stepVIF
starts computing the VIF of all predictor variables in the linear model. If the linear
model contains categorical predictor variables, generalized variance-inflation factors (GVIF)
(Fox and Monette, 1992), are calculated instead using car::vif()
. GVIF is interpretable as the
inflation in size of the confidence ellipse or ellipsoid for the coefficients of the predictor
variable in comparison with what would be obtained for orthogonal, uncorrelated data. Since
categorical predictors have more than one degree of freedom, df, the confidence ellipsoid will
have df dimensions, and GVIF will need to be adjusted so that it can be comparable across
predictor variables. The adjustment is made using the following equation:
The next step consists of evaluating if any of the predictor variables has a (G)VIF larger than
the specified threshold, the function default being threshold = 10
. For, GVIF^(1/(2*df)), the
threshold will be sqrt(threshold)
.
If there is only one predictor variable that does not meet the VIF threshold, it is automatically
removed from the model and no further processing occurs. When there are two or more predictor
variables that do not meet the (G)VIF threshold, stepVIF()
fits a linear model
between each of them and the dependent variable. The predictor variable with the lowest adjusted
coefficient of determination is dropped from the model and new coefficients are calculated,
resulting in a new linear model.
This process lasts until all predictor variables included in the new model meet the (G)VIF threshold.
Nothing is done if all predictor variables have a (G)VIF value lower that the threshold, and
stepVIF()
returns the original linear model.
A linear model (object of class 'lm') with low collinearity.
The car package, provider of functions to accompany Fox and Weisberg's An R Companion to
Applied Regression, is required for plotHist()
to work. The development version of
the car package is available on https://r-forge.r-project.org/projects/car/ while its old
versions are available on the CRAN archive at
https://cran.r-project.org/src/contrib/Archive/car/.
More on the use of GVIF to measure the collinearity in linear models containing categorical predictor variables can be found on StackExchange.
Alessandro Samuel-Rosa [email protected]
Fox, J. and Monette, G. (1992) Generalized collinearity diagnostics. JASA, 87, 178–183.
Fox, J. (2008) Applied Regression Analysis and Generalized Linear Models, Second Edition. Sage.
Fox, J. and Weisberg, S. (2011) An R Companion to Applied Regression, Second Edition. Thousand Oaks: Sage.
Hair, J. F., Black, B., Babin, B. and Anderson, R. E. (2010) Multivariate data analysis. New Jersey: Pearson Prentice Hall.
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
A. Samuel-Rosa, G. B. M. Heuvelink, G. de Mattos Vasques, and L. H. C. dos Anjos, Do more detailed environmental covariates deliver more accurate soil maps?, Geoderma, vol. 243–244, pp. 214–227, May 2015, doi: 10.1016/j.geoderma.2014.12.017.
if (require(car)) { fit <- lm(prestige ~ income + education + type, data = Duncan) fit <- stepVIF(fit, threshold = 10, verbose = TRUE) }
if (require(car)) { fit <- lm(prestige ~ income + education + type, data = Duncan) fit <- stepVIF(fit, threshold = 10, verbose = TRUE) }
Computation of bins for sample (experimental) variograms.
variogramBins( coords, n.lags = 7, type = "exp", cutoff = 0.5, base = 2, zero = 0.001, count = "pairs" ) vgmLags( coords, n.lags = 7, type = "exp", cutoff = 0.5, base = 2, zero = 0.001, count = "pairs" )
variogramBins( coords, n.lags = 7, type = "exp", cutoff = 0.5, base = 2, zero = 0.001, count = "pairs" ) vgmLags( coords, n.lags = 7, type = "exp", cutoff = 0.5, base = 2, zero = 0.001, count = "pairs" )
coords |
Data frame or matrix with the projected x- and y-coordinates. |
n.lags |
Integer value defining the number of variogram bins (distance classes) that should
be computed. Defaults to |
type |
Character value defining the type of variogram bins that should be computed,
with options |
cutoff |
Numeric value defining the fraction of the diagonal of the rectangle that spans the
data (bounding box) that should be used to set the maximum distance up to which variogram bins
should be computed. Defaults to |
base |
Numeric value defining the base of the exponential expression used to create
exponentially growing variogram bins. Used only when |
zero |
Numeric value setting the minimum pair-wise separation distance that should be used
to compute the variogram bins. Defaults to |
count |
Should the number of points ( |
Vector of numeric values with the lower and upper boundaries of the variogram bins. The number of points or point-pairs per variogram bin is returned as an attribute.
The SpatialTools package, provider of tools for spatial data analysis in R, is required for
variogramBins()
to work. The development version of the SpatialTools package
is available on https://github.com/jfrench/SpatialTools while its old versions are available
on the CRAN archive at https://cran.r-project.org/src/contrib/Archive/SpatialTools/.
The spsann package, provider of methods for the optimization of sample configurations using
spatial simulated annealing in R, requires variogramBins()
for some of its
functions to work. The development version of the spsann package is available on
https://github.com/Laboratorio-de-Pedometria/spsann-package.
Alessandro Samuel-Rosa [email protected]
Truong, P. N.; Heuvelink, G. B. M.; Gosling, J. P. Web-based tool for expert elicitation of the variogram. Computers and Geosciences. v. 51, p. 390-399, 2013.
if (all(c(require(SpatialTools), require(sp)))) { data(meuse, package = "sp") lags_points <- variogramBins(coords = meuse[, 1:2], count = "points") lags_pairs <- variogramBins(coords = meuse[, 1:2], count = "pairs") }
if (all(c(require(SpatialTools), require(sp)))) { data(meuse, package = "sp") lags_points <- variogramBins(coords = meuse[, 1:2], count = "points") lags_pairs <- variogramBins(coords = meuse[, 1:2], count = "pairs") }
Guess the parameters of the spatial covariance function of a random, regionalized variable. A guess of such parameters is required to start the fitting functions of many geostatistical packages such as gstat, geoR, and georob.
variogramGuess( z, coords, lags, cutoff = 0.5, method = "a", min.npairs = 30, model = "matern", nu = 0.5, estimator = "qn", plotit = FALSE ) vgmICP( z, coords, lags, cutoff = 0.5, method = "a", min.npairs = 30, model = "matern", nu = 0.5, estimator = "qn", plotit = FALSE )
variogramGuess( z, coords, lags, cutoff = 0.5, method = "a", min.npairs = 30, model = "matern", nu = 0.5, estimator = "qn", plotit = FALSE ) vgmICP( z, coords, lags, cutoff = 0.5, method = "a", min.npairs = 30, model = "matern", nu = 0.5, estimator = "qn", plotit = FALSE )
z |
Numeric vector with the values of the regionalized variable for which the values for the spatial covariance parameters should be guessed. |
coords |
Data frame or matrix with the projected x- and y-coordinates. |
lags |
Numeric scalar defining the width of the variogram bins or a numeric vector with the
lower and upper bounds of the variogram bins. If missing, the variogram bins are computed using
|
cutoff |
Numeric value defining the fraction of the diagonal of the rectangle that spans the
data (bounding box) that should be used to set the maximum distance up to which variogram bins
should be computed. Defaults to |
method |
Character keyword defining the method used for guessing the spatial covariance
parameters of the regionalized variable. Defaults to |
min.npairs |
Positive integer defining the minimum number of point-pairs required so that a
variogram bin is used to guessing the spatial covariance parameters of the of the regionalized
variable. Defaults to |
model |
Character keyword defining the spatial covariance function that will be fitted to
the data of the regionalized variable. Currently, most of the basic spatial covariance function
are accepted. See |
nu |
Numerical value for the additional smoothness parameter |
estimator |
Character keyword defining the estimator for computing the sample (experimental)
variogram of the regionalized variable, with options |
plotit |
Should the guessed spatial covariance parameters be plotted along with the sample
(experimental) variogram of the regionalized variable? Defaults to |
There are five methods two guess the covariance parameters. Two of them, "a"
and "c"
, rely on
a sample variogram with exponentially growing variogram bins, while the other three, "b"
,
"d"
, and "e"
, use equal-width variogram bins (see variogramBins()
). All of
them are heuristic.
Method "a"
was developed in-house and is the most elaborated of them, specially for guessing
the nugget variance.
Method "b"
was proposed by doi:10.1016/0098-3004(95)00095-XJian et al. (1996) and
is implemented in SAS/STAT(R) 9.22.
Method "c"
is implemented in the automap-package and was developed by
doi:10.1016/j.cageo.2008.10.011Hiemstra et al. (2009).
Method "d"
was developed by doi:10.1007/s11004-012-9434-1Desassis & Renard (2012).
Method "e"
was proposed by Larrondo et al. (2003)
http://www.ccgalberta.com/ccgresources/report05/2003-122-varfit.pdf and is implemented in the
VARFIT module of GSLIB http://www.gslib.com/.
A vector of numerical values, the guesses for the spatial covariance parameters of the regionalized variable:
nugget
partial sill
range
The georob package, provider of functions for the robust geostatistical analysis of spatial
data in R, is required for variogramGuess()
to work. The old versions of the
georob package are available on the CRAN archive at
https://cran.r-project.org/src/contrib/Archive/georob/.
Alessandro Samuel-Rosa [email protected]
Desassis, N. & Renard, D. Automatic variogram modelling by iterative least squares: univariate and multivariate cases. Mathematical Geosciences. Springer Science + Business Media, v. 45, p. 453-470, 2012.
Hiemstra, P. H.; Pebesma, E. J.; Twenhöfel, C. J. & Heuvelink, G. B. Real-time automatic interpolation of ambient gamma dose rates from the Dutch radioactivity monitoring network. Computers & Geosciences. Elsevier BV, v. 35, p. 1711-1721, 2009.
Jian, X.; Olea, R. A. & Yu, Y.-S. Semivariogram modelling by weighted least squares. Computers & Geosciences. Elsevier BV, v. 22, p. 387-397, 1996.
Larrondo, P. F.; Neufeld, C. T. & Deutsch, C. V. VARFIT: a program for semi-automatic variogram modelling. Edmonton: Department of Civil and Environmental Engineering, University of Alberta, p. 17, 2003.
if (all(c(require(sp), require(georob)))) { data(meuse, package = "sp") icp <- variogramGuess(z = log(meuse$copper), coords = meuse[, 1:2]) }
if (all(c(require(sp), require(georob)))) { data(meuse, package = "sp") icp <- variogramGuess(z = log(meuse$copper), coords = meuse[, 1:2]) }
Compute the proportion of the variance that is spatially correlated.
vgmSCV(obj, digits = 4) ## S3 method for class 'variomodel' vgmSCV(obj, digits = 4) ## S3 method for class 'variogramModel' vgmSCV(obj, digits = 4) ## S3 method for class 'georob' vgmSCV(obj, digits = 4)
vgmSCV(obj, digits = 4) ## S3 method for class 'variomodel' vgmSCV(obj, digits = 4) ## S3 method for class 'variogramModel' vgmSCV(obj, digits = 4) ## S3 method for class 'georob' vgmSCV(obj, digits = 4)
obj |
Variogram model fitted with available function in geostatistical packages such as gstat, geoR, and georob. |
digits |
Integer indicating the number of decimal places to be used. |
Numeric value indicating the proportion of the variance that is spatially correlated.
Alessandro Samuel-Rosa [email protected]
if (all(c(require(gstat), require(sp)))) { library(gstat) library(sp) data(meuse) sp::coordinates(meuse) <- ~ x + y vgm1 <- gstat::variogram(log(zinc) ~ 1, meuse) fit <- gstat::fit.variogram(vgm1, gstat::vgm(1, "Sph", 300, 1)) res <- vgmSCV(fit) }
if (all(c(require(gstat), require(sp)))) { library(gstat) library(sp) data(meuse) sp::coordinates(meuse) <- ~ x + y vgm1 <- gstat::variogram(log(zinc) ~ 1, meuse) fit <- gstat::fit.variogram(vgm1, gstat::vgm(1, "Sph", 300, 1)) res <- vgmSCV(fit) }