Title: | Genetic Analysis of Populations with Mixed Reproduction |
---|---|
Description: | Population genetic analyses for hierarchical analysis of partially clonal populations built upon the architecture of the 'adegenet' package. Originally described in Kamvar, Tabima, and Grünwald (2014) <doi:10.7717/peerj.281> with version 2.0 described in Kamvar, Brooks, and Grünwald (2015) <doi:10.3389/fgene.2015.00208>. |
Authors: | Zhian N. Kamvar [cre, aut] , Javier F. Tabima [aut] , Sydney E. Everhart [ctb, dtc] , Jonah C. Brooks [aut], Stacy A. Krueger-Hadfield [ctb] , Erik Sotka [ctb], Brian J. Knaus [ctb] , Patrick G. Meirmans [ctb] , Frédéric D. Chevalier [ctb] , David Folarin [aut], Niklaus J. Grünwald [ths] |
Maintainer: | Zhian N. Kamvar <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 2.9.6 |
Built: | 2024-11-12 04:11:35 UTC |
Source: | https://github.com/grunwaldlab/poppr |
Poppr provides tools for population genetic analysis that include genotypic diversity measures, genetic distances with bootstrap support, native organization and handling of population hierarchies, and clone correction.
To cite poppr, please use citation("poppr")
. When referring to
poppr in your manuscript, please use lower case unless it occurs at the
beginning of a sentence.
This package relies on the adegenet package. It is built around the genind and genlight object. Genind objects store genetic information in a table of allele frequencies while genlight objects store SNP data efficiently by packing binary allele calls into single bits. Poppr has extended these object into new objects called genclone and snpclone, respectively. These objects are designed for analysis of clonal organisms as they add the @mlg slot for keeping track of multilocus genotypes and multilocus lineages.
Documentation is available for any function by
typing ?function_name
in the R console. Detailed topic explanations
live in the package vignettes:
Vignette | command |
Data import and manipulation | vignette("poppr_manual", "poppr")
|
Algorithms and Equations | vignette("algo", "poppr")
|
Multilocus Genotype Analysis | vignette("mlg", "poppr")
|
Essential functions for importing and manipulating data are detailed within the Data import and manipulation vignette, details on algorithms used in poppr are within the Algorithms and equations vignette, and details for working with multilocus genotypes are in Multilocus Genotype Analysis.
Examples of analyses are available in a primer written by Niklaus J. Grünwald, Zhian N. Kamvar, and Sydney E. Everhart at https://grunwaldlab.github.io/Population_Genetics_in_R/.
If you have a specific question or issue with poppr, feel free to contribute to the google group at https://groups.google.com/d/forum/poppr. If you find a bug and are a github user, you can submit bug reports at https://github.com/grunwaldlab/poppr/issues. Otherwise, leave a message on the groups. Personal emails are highly discouraged as they do not allow others to learn.
Below are descriptions and links to functions found in poppr. Be aware that all functions in adegenet are also available. The functions are documented as:
function_name()
(data type) - Description
Where ‘data type’ refers to the type of data that can be used:
m | a genclone or genind object |
s | a snpclone or genlight object |
x | a different data type (e.g. a matrix from mlg.table() )
|
getfile()
(x) - Provides a quick GUI to grab files for import
read.genalex()
(x) - Reads GenAlEx formatted csv files to a genind object
genind2genalex()
(m) - Converts genind objects to GenAlEx formatted csv files
genclone2genind()
(m) - Removes the @mlg slot from genclone objects
as.genambig()
(m) - Converts genind data to polysat's genambig data structure.
bootgen2genind()
(x) - see aboot()
for details)
Data structures "genclone" (based off of adegenet's genind) and "snpclone" (based off of adegenet's genlight for large SNP data sets). Both of these data structures are defined by the presence of an extra MLG slot representing multilocus genotype assignments, which can be a numeric vector or a MLG class object.
genclone - Handles microsatellite, presence/absence, and small SNP data sets
snpclone - Designed to handle larger binary SNP data sets.
MLG - An internal class holding a data frame of multilocus genotype assignments that acts like a vector, allowing the user to easily switch between different MLG definitions.
bootgen - An internal class used explicitly for aboot()
that
inherits the gen-class virtual object. It is
designed to allow for sampling loci with replacement.
bruvomat - An internal class designed to handle bootstrapping for Bruvo's distance where blocks of integer loci can be shuffled.
as.genclone()
(m) - Converts genind objects to genclone objects
missingno()
(m) - Handles missing data
clonecorrect()
(m | s) - Clone-censors at a specified population hierarchy
informloci()
(m) - Detects and removes phylogenetically uninformative loci
popsub()
(m | s) - Subsets genind objects by population
shufflepop()
(m) - Shuffles genotypes at each locus using four different shuffling algorithms
recode_polyploids()
(m | x) - Recodes polyploid data sets with missing alleles imported as "0"
make_haplotypes()
(m | s) - Splits data into pseudo-haplotypes. This is mainly used in AMOVA.
test_replen()
(m) - Tests for inconsistent repeat lengths in microsatellite data. For use in bruvo.dist()
functions.
fix_replen()
(m) - Fixes inconsistent repeat lengths. For use in bruvo.dist()
functions.
bruvo.dist()
(m) - Bruvo's distance (see also: fix_replen()
)
diss.dist()
(m) - Absolute genetic distance (see prevosti.dist()
)
nei.dist()
(m | x) - Nei's 1978 genetic distance
rogers.dist()
(m | x) - Rogers' euclidean distance
reynolds.dist()
(m | x) - Reynolds' coancestry distance
edwards.dist()
(m | x) - Edwards' angular distance
prevosti.dist()
(m | x) - Prevosti's absolute genetic distance
bitwise.dist()
(s) - Calculates fast pairwise distances for genlight objects.
aboot()
(m | s | x) - Creates a bootstrapped dendrogram for any distance measure
bruvo.boot()
(m) - Produces dendrograms with bootstrap support based on Bruvo's distance
diversity_boot()
(x) - Generates boostrap distributions of diversity statistics for multilocus genotypes
diversity_ci()
(m | s | x) - Generates confidence intervals for multilocus genotype diversity.
resample.ia()
(m) - Calculates the index of association over subsets of data.
mlg()
(m | s) - Calculates the number of multilocus genotypes
mll()
(m | s) - Displays the current multilocus lineages (genotypes) defined.
mlg.crosspop()
(m | s) - Finds all multilocus genotypes that cross populations
mlg.table()
(m | s) - Returns a table of populations by multilocus genotypes
mlg.vector()
(m | s) - Returns a vector of a numeric multilocus genotype assignment for each individual
mlg.id()
(m | s) - Finds all individuals associated with a single multilocus genotype
mlg.filter()
(m | s) - Collapses MLGs by genetic distance
filter_stats()
(m | s) - Calculates mlg.filter for all algorithms and plots
cutoff_predictor()
(x) - Predicts cutoff threshold from mlg.filter.
mll.custom()
(m | s) - Allows for the custom definition of multilocus lineages
mll.levels()
(m | s) - Allows the user to change levels of custom MLLs.
mll.reset()
(m | s) - Reset multilocus lineages.
diversity_stats()
(x) - Creates a table of diversity indices for multilocus genotypes.
Analysis of multilocus linkage disequilibrium.
ia()
(m) - Calculates the index of association
pair.ia()
(m) - Calculates the index of association for all loci pairs.
win.ia()
(s) - Index of association windows for genlight objects.
samp.ia()
(s) - Index of association on random subsets of loci for genlight objects.
poppr.amova()
(m | s) - Analysis of Molecular Variance (as implemented in ade4)
poppr()
(m | x) - Returns a diversity table by population
poppr.all()
(m | x) - Returns a diversity table by population for all compatible files specified
private_alleles()
(m) - Tabulates the occurrences of alleles that only occur in one population.
locus_table()
(m) - Creates a table of summary statistics per locus.
rrmlg()
(m | x) - Round-robin multilocus genotype estimates.
rraf()
(m) - Round-robin allele frequency estimates.
pgen()
(m) - Probability of genotypes.
psex()
(m) - Probability of observing a genotype more than once.
rare_allele_correction (m) - rules for correcting rare alleles for round-robin estimates.
incomp()
(m) - Check data for incomparable samples.
imsn()
(m | s) - Interactive construction and visualization of minimum spanning networks
plot_poppr_msn()
(m | s | x) - Plots minimum spanning networks produced in poppr with scale bar and legend
greycurve()
(x) - Helper to determine the appropriate parameters for adjusting the grey level for msn functions
bruvo.msn()
(m) - Produces minimum spanning networks based off Bruvo's distance colored by population
poppr.msn()
(m | s | x) - Produces a minimum spanning network for any pairwise distance matrix related to the data
info_table()
(m) - Creates a heatmap representing missing data or observed ploidy
genotype_curve()
(m | x) - Creates a series of boxplots to demonstrate how many markers are needed to represent the diversity of your data.
Aeut()
- (AFLP) Oomycete root rot pathogen Aphanomyces euteiches (Grünwald and Hoheisel, 2006)
monpop()
- (SSR) Peach brown rot pathogen Monilinia fructicola (Everhart and Scherm, 2015)
partial_clone()
- (SSR) partially-clonal data simulated via simuPOP (Peng and Amos, 2008)
Pinf()
- (SSR) Potato late blight pathogen Phytophthora infestans (Goss et. al., 2014)
Pram()
- (SSR) Sudden Oak Death pathogen Phytophthora ramorum (Kamvar et. al., 2015; Goss et. al., 2009)
Zhian N. Kamvar, Jonah C. Brooks, Sydney E. Everhart, Javier F. Tabima, Stacy Krueger-Hadfield, Erik Sotka, Niklaus J. Grünwald
Maintainer: Zhian N. Kamvar
——— Papers announcing poppr ———
Kamvar ZN, Tabima JF, Grünwald NJ. (2014) Poppr: an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ 2:e281 doi:10.7717/peerj.281
Kamvar ZN, Brooks JC and Grünwald NJ (2015) Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. Front. Genet. 6:208. doi:10.3389/fgene.2015.00208
——— Papers referencing data sets ———
Grünwald, NJ and Hoheisel, G.A. 2006. Hierarchical Analysis of Diversity, Selfing, and Genetic Differentiation in Populations of the Oomycete Aphanomyces euteiches. Phytopathology 96:1134-1141 doi: doi:10.1094/PHYTO-96-1134
SE Everhart, H Scherm, (2015) Fine-scale genetic structure of Monilinia fructicola during brown rot epidemics within individual peach tree canopies. Phytopathology 105:542-549 doi: doi:10.1094/PHYTO-03-14-0088-R
Bo Peng and Christopher Amos (2008) Forward-time simulations of nonrandom mating populations using simuPOP. bioinformatics, 24 (11): 1408-1409.
Goss, Erica M., Javier F. Tabima, David EL Cooke, Silvia Restrepo, William E. Fry, Gregory A. Forbes, Valerie J. Fieland, Martha Cardenas, and Niklaus J. Grünwald. (2014) "The Irish potato famine pathogen Phytophthora infestans originated in central Mexico rather than the Andes." Proceedings of the National Academy of Sciences 111:8791-8796. doi: doi:10.1073/pnas.1401884111
Kamvar, Z. N., Larsen, M. M., Kanaskie, A. M., Hansen, E. M., & Grünwald, N. J. (2015). Spatial and temporal analysis of populations of the sudden oak death pathogen in Oregon forests. Phytopathology 105:982-989. doi: doi:10.1094/PHYTO-12-14-0350-FI
Goss, E. M., Larsen, M., Chastagner, G. A., Givens, D. R., and Grünwald, N. J. 2009. Population genetic analysis infers migration pathways of Phytophthora ramorum in US nurseries. PLoS Pathog. 5:e1000583. doi: doi:10.1371/journal.ppat.1000583
Calculate a dendrogram with bootstrap support using any distance applicable to genind or genclone objects.
aboot( x, strata = NULL, tree = "upgma", distance = "nei.dist", sample = 100, cutoff = 0, showtree = TRUE, missing = "mean", mcutoff = 0, quiet = FALSE, root = NULL, ... )
aboot( x, strata = NULL, tree = "upgma", distance = "nei.dist", sample = 100, cutoff = 0, showtree = TRUE, missing = "mean", mcutoff = 0, quiet = FALSE, root = NULL, ... )
x |
a genind-class, genpop-class, genclone-class, genlight, snpclone or matrix object. |
strata |
a formula specifying the strata to be used to convert x to a
genclone object if x is a genind object. Defaults to |
tree |
a text string or function that can calculate a tree from a distance matrix. Defaults to "upgma". Note that you must load the package with the function for it to work. |
distance |
a character or function defining the distance to be applied
to x. Defaults to |
sample |
An integer representing the number of bootstrap replicates Default is 100. |
cutoff |
An integer from 0 to 100 setting the cutoff value to return the bootstrap values on the nodes. Default is 0. |
showtree |
If |
missing |
any method to be used by |
mcutoff |
a value between 0 (default) and 1 defining the percentage of
tolerable missing data if the |
quiet |
if |
root |
is the tree rooted? This is a parameter passed off to
|
... |
any parameters to be passed off to the distance method. |
This function automates the process of bootstrapping genetic data to
create a dendrogram with bootstrap support on the nodes. It will randomly
sample with replacement the loci of a gen
(genind/genpop) object or the columns of a
numeric matrix, assuming that all loci/columns are independent. The
process of randomly sampling gen
objects with replacement is carried out
through the use of an internal class called
bootgen. This is necessary due to the fact that columns in
the genind matrix are defined as alleles and are thus interrelated. This
function will specifically bootstrap loci so that results are biologically
relevant. With this function, the user can also define a custom distance to
be performed on the genind or genclone object. If you have a data frame-like
object where all of the columns are independent or pairs of columns are
independent, then it may be simpler to use ape::boot.phylo()
to calculate
your bootstrap support values.
There is an argument called strata
. This argument is useful for when
you want to bootstrap by populations from a adegenet::genind()
object. When you specify strata, the genind object will be converted to
adegenet::genpop()
with the specified strata.
an object of class ape::phylo()
.
prevosti.dist()
and diss.dist()
are exactly the
same, but diss.dist()
scales better for large numbers of
individuals (n > 125) at the cost of required memory.
Missing data is not allowed by many of the distances. Thus, one of
the first steps of this function is to treat missing data by setting it to
the average allele frequency in the data set. If you are using a distance
that can handle missing data (Prevosti's distance), you can set
missing = "ignore"
to allow the distance function to handle any
missing data. See missingno()
for details on missing
data.
While calculation of Bruvo's distance
is possible with this function, it is optimized in the function
bruvo.boot()
.
Kamvar ZN, Brooks JC and Grünwald NJ (2015) Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. Frontiers in Genetics 6:208. doi:10.3389/fgene.2015.00208
nei.dist()
edwards.dist()
rogers.dist()
reynolds.dist()
prevosti.dist()
diss.dist()
bruvo.boot()
ape::boot.phylo()
adegenet::dist.genpop()
dist()
bootgen2genind()
bootgen
data(nancycats) nan9 <- popsub(nancycats, 9) set.seed(9999) # Generate a tree using nei's distance neinan <- aboot(nan9, dist = nei.dist) set.seed(9999) # Generate a tree using custom distance bindist <- function(x) dist(tab(x), method = "binary") binnan <- aboot(nan9, dist = bindist) ## Not run: # Distances from other packages. # # Sometimes, distance functions from other packages will have the constraint # that the incoming data MUST be genind. Internally, aboot uses the # bootgen class ( class?bootgen ) to shuffle loci, and will throw an error # The function bootgen2genind helps fix that. Here's an example of a function # that expects a genind class from above bindist <- function(x){ stopifnot(is.genind(x)) dist(tab(x), method = "binary") } # # Fails: # aboot(nan9, dist = bindist) ## Error: is.genind(x) is not TRUE # # Add bootgen2genind to get it working! # Works: aboot(nan9, dist = function(x) bootgen2genind(x) %>% bindist) # AFLP data data(Aeut) # Nei's distance anei <- aboot(Aeut, dist = nei.dist, sample = 1000, cutoff = 50) # Rogers' distance arog <- aboot(Aeut, dist = rogers.dist, sample = 1000, cutoff = 50) # This can also be run on genpop objects strata(Aeut) <- other(Aeut)$population_hierarchy[-1] Aeut.gc <- as.genclone(Aeut) setPop(Aeut.gc) <- ~Pop/Subpop Aeut.pop <- genind2genpop(Aeut.gc) set.seed(5000) aboot(Aeut.pop, sample = 1000) # compare to Grunwald et al. 2006 # You can also use the strata argument to convert to genpop inside the function. set.seed(5000) aboot(Aeut.gc, strata = ~Pop/Subpop, sample = 1000) # And genlight objects # From glSim: ## 1,000 non structured SNPs, 100 structured SNPs x <- glSim(100, 1e3, n.snp.struc=100, ploid=2) aboot(x, distance = bitwise.dist) # Utilizing other tree methods library("ape") aboot(Aeut.pop, tree = fastme.bal, sample = 1000) # Utilizing options in other tree methods myFastME <- function(x) fastme.bal(x, nni = TRUE, spr = FALSE, tbr = TRUE) aboot(Aeut.pop, tree = myFastME, sample = 1000) ## End(Not run)
data(nancycats) nan9 <- popsub(nancycats, 9) set.seed(9999) # Generate a tree using nei's distance neinan <- aboot(nan9, dist = nei.dist) set.seed(9999) # Generate a tree using custom distance bindist <- function(x) dist(tab(x), method = "binary") binnan <- aboot(nan9, dist = bindist) ## Not run: # Distances from other packages. # # Sometimes, distance functions from other packages will have the constraint # that the incoming data MUST be genind. Internally, aboot uses the # bootgen class ( class?bootgen ) to shuffle loci, and will throw an error # The function bootgen2genind helps fix that. Here's an example of a function # that expects a genind class from above bindist <- function(x){ stopifnot(is.genind(x)) dist(tab(x), method = "binary") } # # Fails: # aboot(nan9, dist = bindist) ## Error: is.genind(x) is not TRUE # # Add bootgen2genind to get it working! # Works: aboot(nan9, dist = function(x) bootgen2genind(x) %>% bindist) # AFLP data data(Aeut) # Nei's distance anei <- aboot(Aeut, dist = nei.dist, sample = 1000, cutoff = 50) # Rogers' distance arog <- aboot(Aeut, dist = rogers.dist, sample = 1000, cutoff = 50) # This can also be run on genpop objects strata(Aeut) <- other(Aeut)$population_hierarchy[-1] Aeut.gc <- as.genclone(Aeut) setPop(Aeut.gc) <- ~Pop/Subpop Aeut.pop <- genind2genpop(Aeut.gc) set.seed(5000) aboot(Aeut.pop, sample = 1000) # compare to Grunwald et al. 2006 # You can also use the strata argument to convert to genpop inside the function. set.seed(5000) aboot(Aeut.gc, strata = ~Pop/Subpop, sample = 1000) # And genlight objects # From glSim: ## 1,000 non structured SNPs, 100 structured SNPs x <- glSim(100, 1e3, n.snp.struc=100, ploid=2) aboot(x, distance = bitwise.dist) # Utilizing other tree methods library("ape") aboot(Aeut.pop, tree = fastme.bal, sample = 1000) # Utilizing options in other tree methods myFastME <- function(x) fastme.bal(x, nni = TRUE, spr = FALSE, tbr = TRUE) aboot(Aeut.pop, tree = myFastME, sample = 1000) ## End(Not run)
The Aeut dataset consists of 187 isolates of the Oomycete root rot pathogen, *Aphanomyces euteiches* collected from two different fields in NW Oregon and W Washington, USA.
data(Aeut)
data(Aeut)
a [genind()] object with two populations containing a data frame in the 'other' slot called 'population_hierarchy'. This data frame gives indices of the populations and subpopulations for the data set.
Grunwald, NJ and Hoheisel, G.A. 2006. Hierarchical Analysis of Diversity, Selfing, and Genetic Differentiation in Populations of the Oomycete *Aphanomyces euteiches*. Phytopathology 96:1134-1141 doi: doi:10.1094/PHYTO-96-1134
Wrapper for snpclone initializer.
as.snpclone(x, ..., parallel = FALSE, n.cores = NULL, mlg, mlgclass = TRUE)
as.snpclone(x, ..., parallel = FALSE, n.cores = NULL, mlg, mlgclass = TRUE)
x |
|
... |
arguments to be passed on to the genlight constructor. These are not used if x is not missing. |
parallel |
should the parallel package be used to construct the object? |
n.cores |
how many cores should be utilized? See documentation for
|
mlg |
a vector of multilocus genotypes or an object of class MLG for the new snpclone object. |
mlgclass |
if |
Zhian N. Kamvar
(x <- as.snpclone(glSim(100, 1e3, ploid=2))) ## Not run: # Without parallel processing system.time(x <- as.snpclone(glSim(1000, 1e5, ploid=2))) # With parallel processing... doesn't really save you much time. system.time(x <- as.snpclone(glSim(1000, 1e5, ploid=2, parallel = TRUE), parallel = TRUE)) ## End(Not run)
(x <- as.snpclone(glSim(100, 1e3, ploid=2))) ## Not run: # Without parallel processing system.time(x <- as.snpclone(glSim(1000, 1e5, ploid=2))) # With parallel processing... doesn't really save you much time. system.time(x <- as.snpclone(glSim(1000, 1e5, ploid=2, parallel = TRUE), parallel = TRUE)) ## End(Not run)
This function calculates both dissimilarity and Euclidean distances for genlight or snpclone objects.
bitwise.dist( x, percent = TRUE, mat = FALSE, missing_match = TRUE, scale_missing = FALSE, euclidean = FALSE, differences_only = FALSE, threads = 0L )
bitwise.dist( x, percent = TRUE, mat = FALSE, missing_match = TRUE, scale_missing = FALSE, euclidean = FALSE, differences_only = FALSE, threads = 0L )
x |
|
percent |
|
mat |
|
missing_match |
|
scale_missing |
A logical. If |
euclidean |
|
differences_only |
|
threads |
The maximum number of parallel threads to be used within this function. A value of 0 (default) will attempt to use as many threads as there are available cores/CPUs. In most cases this is ideal. A value of 1 will force the function to run serially, which may increase stability on some systems. Other values may be specified, but should be used with caution. |
The default distance calculated here is quite simple and goes by many names depending on its application. The most familiar name might be the Hamming distance, or the number of differences between two strings.
As of poppr version 2.8.0, this function now also calculates Euclidean
distance and is considerably faster and more memory-efficient than the
standard dist()
function.
A dist object containing pairwise distances between samples.
This function is optimized for genlight and snpclone objects. This does not mean that it is a catch-all optimization for SNP data. Three assumptions must be met for this function to work:
SNPs are bi-allelic
Samples are haploid or diploid
All samples have the same ploidy
If the user supplies a genind or
genclone object, prevosti.dist()
will be used for
calculation.
Zhian N. Kamvar, Jonah C. Brooks
diss.dist()
, snpclone,
genlight, win.ia()
, samp.ia()
set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 5e2, n.snp.struc = 5e2, ploidy = 2) x # Assess fraction of different alleles system.time(xd <- bitwise.dist(x, threads = 1L)) xd # Calculate Euclidean distance system.time(xdt <- bitwise.dist(x, euclidean = TRUE, scale_missing = TRUE, threads = 1L)) xdt ## Not run: # This function is more efficient in both memory and speed than [dist()] for # calculating Euclidean distance on genlight objects. For example, we can # observe a clear speed increase when we attempt a calculation on 100k SNPs # with 10% missing data: set.seed(999) mat <- matrix(sample(c(0:2, NA), 100000 * 50, replace = TRUE, prob = c(0.3, 0.3, 0.3, 0.1)), nrow = 50) glite <- new("genlight", mat, ploidy = 2) # Default Euclidean distance system.time(dist(glite)) # Bitwise dist system.time(bitwise.dist(glite, euclidean = TRUE, scale_missing = TRUE)) ## End(Not run)
set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 5e2, n.snp.struc = 5e2, ploidy = 2) x # Assess fraction of different alleles system.time(xd <- bitwise.dist(x, threads = 1L)) xd # Calculate Euclidean distance system.time(xdt <- bitwise.dist(x, euclidean = TRUE, scale_missing = TRUE, threads = 1L)) xdt ## Not run: # This function is more efficient in both memory and speed than [dist()] for # calculating Euclidean distance on genlight objects. For example, we can # observe a clear speed increase when we attempt a calculation on 100k SNPs # with 10% missing data: set.seed(999) mat <- matrix(sample(c(0:2, NA), 100000 * 50, replace = TRUE, prob = c(0.3, 0.3, 0.3, 0.1)), nrow = 50) glite <- new("genlight", mat, ploidy = 2) # Default Euclidean distance system.time(dist(glite)) # Bitwise dist system.time(bitwise.dist(glite, euclidean = TRUE, scale_missing = TRUE)) ## End(Not run)
This function will perform the index of association on a bootstrapped data set multiple times to create a distribution, showing the variation of the index due to repeat observations.
boot.ia(gid, how = "partial", reps = 999, quiet = FALSE, ...)
boot.ia(gid, how = "partial", reps = 999, quiet = FALSE, ...)
gid |
a genind or genclone object |
how |
method of bootstrap. The default |
reps |
an integer specifying the number of replicates to perform. Defaults to 999. |
quiet |
a logical. If |
... |
options passed on to |
a data frame with the index of association and standardized index of association in columns. Number of rows represents the number of reps.
This function is experimental. Please do not use this unless you know what you are doing.
data(Pinf) boot.ia(Pinf, reps = 99)
data(Pinf) boot.ia(Pinf, reps = 99)
as.genclone will create a genclone object from a genind object OR anything that can be passed to the genind initializer.
bootgen2genind(bg) as.genclone(x, ..., mlg, mlgclass = TRUE) genclone2genind(x) as.genambig(x)
bootgen2genind(bg) as.genclone(x, ..., mlg, mlgclass = TRUE) genclone2genind(x) as.genambig(x)
bg |
a bootgen object |
x |
|
... |
arguments passed on to the |
mlg |
an optional vector of multilocus genotypes as integers |
mlgclass |
should the mlg slot be of class MLG? |
genclone2genind will remove the mlg slot from the genclone object, creating a genind object.
as.genambig will convert a genind or genclone object to a polysat genambig class.
Zhian N. Kamvar
splitStrata
, genclone
,
read.genalex
aboot
data(Aeut) Aeut # Conversion to genclone -------------------------------------------------- Aeut.gc <- as.genclone(Aeut) Aeut.gc # Conversion to genind ---------------------------------------------------- Aeut.gi <- genclone2genind(Aeut.gc) Aeut.gi # Conversion to polysat's "genambig" class -------------------------------- if (require("polysat")) { data(Pinf) Pinf.gb <- as.genambig(Pinf) summary(Pinf.gb) } data(nancycats) # Conversion to bootgen for random sampling of loci ----------------------- nan.bg <- new("bootgen", nancycats[pop = 9]) nan.bg # Conversion back to genind ----------------------------------------------- nan.gid <- bootgen2genind(nan.bg) nan.gid
data(Aeut) Aeut # Conversion to genclone -------------------------------------------------- Aeut.gc <- as.genclone(Aeut) Aeut.gc # Conversion to genind ---------------------------------------------------- Aeut.gi <- genclone2genind(Aeut.gc) Aeut.gi # Conversion to polysat's "genambig" class -------------------------------- if (require("polysat")) { data(Pinf) Pinf.gb <- as.genambig(Pinf) summary(Pinf.gb) } data(nancycats) # Conversion to bootgen for random sampling of loci ----------------------- nan.bg <- new("bootgen", nancycats[pop = 9]) nan.bg # Conversion back to genind ----------------------------------------------- nan.gid <- bootgen2genind(nan.bg) nan.gid
Create a tree using Bruvo's Distance with non-parametric bootstrapping.
bruvo.boot( pop, replen = 1, add = TRUE, loss = TRUE, sample = 100, tree = "upgma", showtree = TRUE, cutoff = NULL, quiet = FALSE, root = NULL, ... )
bruvo.boot( pop, replen = 1, add = TRUE, loss = TRUE, sample = 100, tree = "upgma", showtree = TRUE, cutoff = NULL, quiet = FALSE, root = NULL, ... )
pop |
|
replen |
a |
add |
if |
loss |
if |
sample |
an |
tree |
any function that can generate a tree from a distance matrix.
Default is |
showtree |
|
cutoff |
|
quiet |
|
root |
|
... |
any argument to be passed on to |
This function will calculate a tree based off of Bruvo's distance
and then utilize boot.phylo
to randomly sample loci with
replacement, recalculate the tree, and tally up the bootstrap support
(measured in percent success). While this function can take any tree
function, it has native support for two algorithms: nj
and upgma
. If you want to use any other functions,
you must load the package before you use them (see examples).
a tree of class phylo with nodelables
Please refer to the documentation for bruvo.dist for details on
the algorithm. If the user does not provide a vector of appropriate length
for replen
, it will be estimated by taking the minimum difference
among represented alleles at each locus. IT IS NOT RECOMMENDED TO RELY ON
THIS ESTIMATION.
Zhian N. Kamvar, Javier F. Tabima
Ruzica Bruvo, Nicolaas K. Michiels, Thomas G. D'Souza, and Hinrich Schulenburg. A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level. Molecular Ecology, 13(7):2101-2106, 2004.
bruvo.dist
, nancycats
,
upgma
, nj
, boot.phylo
,
nodelabels
, tab
,
missingno
.
# Please note that the data presented is assuming that the nancycat dataset # contains all dinucleotide repeats, it most likely is not an accurate # representation of the data. # Load the nancycats dataset and construct the repeat vector. data(nancycats) ssr <- rep(2, 9) # Analyze the 1st population in nancycats bruvo.boot(popsub(nancycats, 1), replen = ssr) ## Not run: # Always load the library before you specify the function. library("ape") # Estimate the tree based off of the BIONJ algorithm. bruvo.boot(popsub(nancycats, 9), replen = ssr, tree = bionj) # Utilizing balanced FastME bruvo.boot(popsub(nancycats, 9), replen = ssr, tree = fastme.bal) # To change parameters for the tree, wrap it in a function. # For example, let's build the tree without utilizing subtree-prune-regraft myFastME <- function(x) fastme.bal(x, nni = TRUE, spr = FALSE, tbr = TRUE) bruvo.boot(popsub(nancycats, 9), replen = ssr, tree = myFastME) ## End(Not run)
# Please note that the data presented is assuming that the nancycat dataset # contains all dinucleotide repeats, it most likely is not an accurate # representation of the data. # Load the nancycats dataset and construct the repeat vector. data(nancycats) ssr <- rep(2, 9) # Analyze the 1st population in nancycats bruvo.boot(popsub(nancycats, 1), replen = ssr) ## Not run: # Always load the library before you specify the function. library("ape") # Estimate the tree based off of the BIONJ algorithm. bruvo.boot(popsub(nancycats, 9), replen = ssr, tree = bionj) # Utilizing balanced FastME bruvo.boot(popsub(nancycats, 9), replen = ssr, tree = fastme.bal) # To change parameters for the tree, wrap it in a function. # For example, let's build the tree without utilizing subtree-prune-regraft myFastME <- function(x) fastme.bal(x, nni = TRUE, spr = FALSE, tbr = TRUE) bruvo.boot(popsub(nancycats, 9), replen = ssr, tree = myFastME) ## End(Not run)
Calculate the average Bruvo's distance over all loci in a population.
bruvo.dist(pop, replen = 1, add = TRUE, loss = TRUE, by_locus = FALSE) bruvo.between( query, ref, replen = 1, add = TRUE, loss = TRUE, by_locus = FALSE )
bruvo.dist(pop, replen = 1, add = TRUE, loss = TRUE, by_locus = FALSE) bruvo.between( query, ref, replen = 1, add = TRUE, loss = TRUE, by_locus = FALSE )
pop |
|
replen |
a |
add |
if |
loss |
if |
by_locus |
indicator to get the results per locus. The default setting
is |
query |
|
ref |
Bruvo's distance between two alleles is calculated as
, where x
is the number of repeat units between the two alleles (see the Algorithms
and Equations vignette for more details). These distances are calculated
over all combinations of alleles at a locus and then the minimum average
distance between allele combinations is taken as the distance for that
locus. All loci are then averaged over to obtain the distance between two
samples. Missing data is ignored (in the same fashion as
mean(c(1:9, NA), na.rm = TRUE)
) if all alleles are missing. See the
next section for other cases.
Ploidy is irrelevant with respect to calculation of Bruvo's distance. However, since it makes a comparison between all alleles at a locus, it only makes sense that the two loci need to have the same ploidy level. Unfortunately for polyploids, it's often difficult to fully separate distinct alleles at each locus, so you end up with genotypes that appear to have a lower ploidy level than the organism.
To help deal with these situations, Bruvo has suggested three methods for dealing with these differences in ploidy levels:
Infinite Model - The simplest way to deal with it is to count all missing alleles as infinitely large so that the distance between it and anything else is 1. Aside from this being computationally simple, it will tend to inflate distances between individuals.
Genome Addition Model - If it is suspected that the organism has gone through a recent genome expansion, the missing alleles will be replace with all possible combinations of the observed alleles in the shorter genotype. For example, if there is a genotype of [69, 70, 0, 0] where 0 is a missing allele, the possible combinations are: [69, 70, 69, 69], [69, 70, 69, 70], [69, 70, 70, 69], and [69, 70, 70, 70]. The resulting distances are then averaged over the number of comparisons.
Genome Loss Model - This is similar to the genome addition model, except that it assumes that there was a recent genome reduction event and uses the observed values in the full genotype to fill the missing values in the short genotype. As with the Genome Addition Model, the resulting distances are averaged over the number of comparisons.
Combination Model - Combine and average the genome addition and loss models.
As mentioned above, the infinite model is biased, but it is not nearly as computationally intensive as either of the other models. The reason for this is that both of the addition and loss models requires replacement of alleles and recalculation of Bruvo's distance. The number of replacements required is equal to n^k where where n is the number of potential replacements and k is the number of alleles to be replaced. To reduce the number of calculations and assumptions otherwise, Bruvo's distance will be calculated using the largest observed ploidy in pairwise comparisons. This means that when comparing [69,70,71,0] and [59,60,0,0], they will be treated as triploids.
an object of class dist
or a list of these objects if
by_locus = TRUE
bruvo.between()
: Bruvo's distance between a query and a reference
Only diferences between query individuals and reference individuals will be reported
All other values are NaN
Do not use missingno with this function.
In earlier versions of poppr, the authors had assumed that, because the calculation of Bruvo's distance does not rely on orderd sets of alleles, the imputation methods in the genome addition and genome loss models would also assume unordered alleles for creating the hypothetical genotypes. This means that the results from this imputation did not consider all possible combinations of alleles, resulting in either an over- or under- estimation of Bruvo's distance between two samples with two or more missing alleles. This version of poppr considers all possible combinations when calculating Bruvo's distance for incomplete genotype with a negligable gain in computation time.
If you want to see the effect of this change on your data, you can use the
global poppr option old.bruvo.model
. Currently, this option is
FALSE
and you can set it by using
options(old.bruvo.model = TRUE)
, but make sure to reset it to
FALSE
afterwards.
The replen
argument is crucial for proper analysis of Bruvo's
distance since the calculation relies on the knowledge of the number of
steps between alleles. To calculate Bruvo's distance, your raw allele calls
are first divided by the repeat lengths and then rounded. This can create a
problem with repeat lengths of even size due to the IEC 60559 standard that
says rounding at 0.5 is to the nearest even number, meaning that it is
possible for two alleles that are one step apart may appear to be exactly
the same. This can be fixed by subtracting a tiny number from the repeat
length with the function fix_replen
. Please consider using
this before running Bruvo's distance.
The add
and loss
arguments
modify the model choice accordingly:
Infinite
Model: add = FALSE, loss = FALSE
Genome Addition
Model: add = TRUE, loss = FALSE
Genome Loss Model:
add = FALSE, loss = TRUE
Combination Model
(DEFAULT): add = TRUE, loss = TRUE
Details of each model
choice are described in the Details section, above. Additionally,
genotypes containing all missing values at a locus will return a value of
NA
and not contribute to the average across loci.
If the user does not provide a vector of
appropriate length for replen
, it will be estimated by taking the
minimum difference among represented alleles at each locus. IT IS NOT
RECOMMENDED TO RELY ON THIS ESTIMATION.
Zhian N. Kamvar
David Folarin
Ruzica Bruvo, Nicolaas K. Michiels, Thomas G. D'Souza, and Hinrich Schulenburg. A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level. Molecular Ecology, 13(7):2101-2106, 2004.
fix_replen
, test_replen
,
bruvo.boot
, bruvo.msn
# Please note that the data presented is assuming that the nancycat dataset # contains all dinucleotide repeats, it most likely is not an accurate # representation of the data. # Load the nancycats dataset and construct the repeat vector. data(nancycats) names(alleles(nancycats)) <- locNames(nancycats) # small bug in this data set # Assume the alleles are all dinucleotide repeats. ssr <- rep(2, nLoc(nancycats)) test_replen(nancycats, ssr) # Are the repeat lengths consistent? (ssr <- fix_replen(nancycats, ssr)) # Nope. We need to fix them. # Analyze the first population in nancycats bruvo.dist(popsub(nancycats, 1), replen = ssr) ## Not run: # get the per locus estimates: bruvo.dist(popsub(nancycats, 1), replen = ssr, by_locus = TRUE) # View each population as a heatmap. sapply(popNames(nancycats), function(x) heatmap(as.matrix(bruvo.dist(popsub(nancycats, x), replen = ssr)), symm=TRUE)) ## End(Not run)
# Please note that the data presented is assuming that the nancycat dataset # contains all dinucleotide repeats, it most likely is not an accurate # representation of the data. # Load the nancycats dataset and construct the repeat vector. data(nancycats) names(alleles(nancycats)) <- locNames(nancycats) # small bug in this data set # Assume the alleles are all dinucleotide repeats. ssr <- rep(2, nLoc(nancycats)) test_replen(nancycats, ssr) # Are the repeat lengths consistent? (ssr <- fix_replen(nancycats, ssr)) # Nope. We need to fix them. # Analyze the first population in nancycats bruvo.dist(popsub(nancycats, 1), replen = ssr) ## Not run: # get the per locus estimates: bruvo.dist(popsub(nancycats, 1), replen = ssr, by_locus = TRUE) # View each population as a heatmap. sapply(popNames(nancycats), function(x) heatmap(as.matrix(bruvo.dist(popsub(nancycats, x), replen = ssr)), symm=TRUE)) ## End(Not run)
Create minimum spanning network of selected populations using Bruvo's distance.
bruvo.msn( gid, replen = 1, add = TRUE, loss = TRUE, mlg.compute = "original", palette = topo.colors, sublist = "All", exclude = NULL, blacklist = NULL, vertex.label = "MLG", gscale = TRUE, glim = c(0, 0.8), gadj = 3, gweight = 1, wscale = TRUE, showplot = TRUE, include.ties = FALSE, threshold = NULL, clustering.algorithm = NULL, ... )
bruvo.msn( gid, replen = 1, add = TRUE, loss = TRUE, mlg.compute = "original", palette = topo.colors, sublist = "All", exclude = NULL, blacklist = NULL, vertex.label = "MLG", gscale = TRUE, glim = c(0, 0.8), gadj = 3, gweight = 1, wscale = TRUE, showplot = TRUE, include.ties = FALSE, threshold = NULL, clustering.algorithm = NULL, ... )
gid |
|
replen |
a |
add |
if |
loss |
if |
mlg.compute |
if the multilocus genotypes are set to "custom" (see
|
palette |
a |
sublist |
a |
exclude |
a |
blacklist |
DEPRECATED, use exclude. |
vertex.label |
a |
gscale |
"grey scale". If this is |
glim |
"grey limit". Two numbers between zero and one. They determine
the upper and lower limits for the |
gadj |
"grey adjust". a positive |
gweight |
"grey weight". an |
wscale |
"width scale". If this is |
showplot |
logical. If |
include.ties |
logical. If |
threshold |
numeric. By default, this is |
clustering.algorithm |
string. By default, this is |
... |
any other arguments that could go into plot.igraph |
The minimum spanning network generated by this function is generated
via igraph's minimum.spanning.tree
. The resultant
graph produced can be plotted using igraph functions, or the entire object
can be plotted using the function plot_poppr_msn
, which will
give the user a scale bar and the option to layout your data.
The area of the nodes are representative of the number of samples. Because igraph scales nodes by radius, the node sizes in the graph are represented as the square root of the number of samples.
Each node on the graph represents a different multilocus genotype.
The edges on the graph represent genetic distances that connect the
multilocus genotypes. In genclone objects, it is possible to set the
multilocus genotypes to a custom definition. This creates a problem for
clone correction, however, as it is very possible to define custom lineages
that are not monophyletic. When clone correction is performed on these
definitions, information is lost from the graph. To circumvent this, The
clone correction will be done via the computed multilocus genotypes, either
"original" or "contracted". This is specified in the mlg.compute
argument, above.
If your incoming data set is of the class genclone
,
and it contains contracted multilocus genotypes, this function will retain
that information for creating the minimum spanning network. You can use the
arguments threshold
and clustering.algorithm
to change the
threshold or clustering algorithm used in the network. For example, if you
have a data set that has a threshold of 0.1 and you wish to have a minimum
spanning network without a threshold, you can simply add
threshold = 0.0
, and no clustering will happen.
The threshold
and clustering.algorithm
arguments can also be
used to filter un-contracted data sets.
graph |
a minimum spanning network with nodes corresponding to MLGs within the data set. Colors of the nodes represent population membership. Width and color of the edges represent distance. |
populations |
a vector of the population names corresponding to the vertex colors |
colors |
a vector of the hexadecimal representations of the colors used in the vertex colors |
Please see the documentation for
bruvo.dist
for details on the algorithm.
The edges of these graphs may cross each other if the graph becomes too large.
The nodes in the graph represent multilocus genotypes. The colors of the nodes are representative of population membership. It is not uncommon to see different populations containing the same multilocus genotype.
Zhian N. Kamvar, Javier F. Tabima
Ruzica Bruvo, Nicolaas K. Michiels, Thomas G. D'Souza, and Hinrich Schulenburg. A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level. Molecular Ecology, 13(7):2101-2106, 2004.
bruvo.dist
, nancycats
,
plot_poppr_msn
, mst
bruvo.boot
, greycurve
poppr.msn
# Load the data set. data(nancycats) # View populations 8 and 9 with default colors. bruvo.msn(nancycats, replen = rep(2, 9), sublist=8:9, vertex.label="inds", vertex.label.cex=0.7, vertex.label.dist=0.4) ## Not run: # View heat colors. bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette=heat.colors, vertex.label.cex=0.7, vertex.label.dist=0.4) # View custom colors. Here, we use black and orange. bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette = colorRampPalette(c("orange", "black")), vertex.label.cex=0.7, vertex.label.dist=0.4) # View with darker shades of grey (setting the upper limit to 1/2 black 1/2 white). bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette = colorRampPalette(c("orange", "black")), vertex.label.cex=0.7, vertex.label.dist=0.4, glim=c(0, 0.5)) # View with no grey scaling. bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette = colorRampPalette(c("orange", "black")), vertex.label.cex=0.7, vertex.label.dist=0.4, gscale=FALSE) # View with no line widths. bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette = colorRampPalette(c("orange", "black")), vertex.label.cex=0.7, vertex.label.dist=0.4, wscale=FALSE) # View with no scaling at all. bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette = colorRampPalette(c("orange", "black")), vertex.label.cex=0.7, vertex.label.dist=0.4, gscale=FALSE) # View the whole population, but without labels. bruvo.msn(nancycats, replen=rep(2, 9), vertex.label=NA) ## End(Not run)
# Load the data set. data(nancycats) # View populations 8 and 9 with default colors. bruvo.msn(nancycats, replen = rep(2, 9), sublist=8:9, vertex.label="inds", vertex.label.cex=0.7, vertex.label.dist=0.4) ## Not run: # View heat colors. bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette=heat.colors, vertex.label.cex=0.7, vertex.label.dist=0.4) # View custom colors. Here, we use black and orange. bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette = colorRampPalette(c("orange", "black")), vertex.label.cex=0.7, vertex.label.dist=0.4) # View with darker shades of grey (setting the upper limit to 1/2 black 1/2 white). bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette = colorRampPalette(c("orange", "black")), vertex.label.cex=0.7, vertex.label.dist=0.4, glim=c(0, 0.5)) # View with no grey scaling. bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette = colorRampPalette(c("orange", "black")), vertex.label.cex=0.7, vertex.label.dist=0.4, gscale=FALSE) # View with no line widths. bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette = colorRampPalette(c("orange", "black")), vertex.label.cex=0.7, vertex.label.dist=0.4, wscale=FALSE) # View with no scaling at all. bruvo.msn(nancycats, replen=rep(2, 9), sublist=8:9, vertex.label="inds", palette = colorRampPalette(c("orange", "black")), vertex.label.cex=0.7, vertex.label.dist=0.4, gscale=FALSE) # View the whole population, but without labels. bruvo.msn(nancycats, replen=rep(2, 9), vertex.label=NA) ## End(Not run)
This function removes any duplicated multilocus genotypes from any specified population strata.
clonecorrect(pop, strata = 1, combine = FALSE, keep = 1)
clonecorrect(pop, strata = 1, combine = FALSE, keep = 1)
pop |
|
strata |
a hierarchical formula or numeric vector. This will define the columns of the data frame in the strata slot to use. |
combine |
|
keep |
|
This function will clone correct based on the stratification
provided. To clone correct indiscriminately of population structure, set
strata = NA
.
a clone corrected genclone
,
snpclone
, or genind
object.
Zhian N. Kamvar
# LOAD A. euteiches data set data(Aeut) # Redefine it as a genclone object Aeut <- as.genclone(Aeut) strata(Aeut) <- other(Aeut)$population_hierarchy[-1] # Check the number of multilocus genotypes mlg(Aeut) popNames(Aeut) # Clone correct at the population level. Aeut.pop <- clonecorrect(Aeut, strata = ~Pop) mlg(Aeut.pop) popNames(Aeut.pop) ## Not run: # Clone correct at the subpopulation level with respect to population and # combine. Aeut.subpop <- clonecorrect(Aeut, strata = ~Pop/Subpop, combine=TRUE) mlg(Aeut.subpop) popNames(Aeut.subpop) # Do the same, but set to the population level. Aeut.subpop2 <- clonecorrect(Aeut, strata = ~Pop/Subpop, keep=1) mlg(Aeut.subpop2) popNames(Aeut.subpop2) # LOAD H3N2 dataset data(H3N2) strata(H3N2) <- other(H3N2)$x # Extract only the individuals located in China country <- clonecorrect(H3N2, strata = ~country) # How many isolates did we have from China before clone correction? sum(strata(H3N2, ~country) == "China") # 155 # How many unique isolates from China after clone correction? sum(strata(country, ~country) == "China") # 79 # Something a little more complicated. (This could take a few minutes on # slower computers) # setting the hierarchy to be Country > Year > Month c.y.m <- clonecorrect(H3N2, strata = ~year/month/country) # How many isolates in the original data set? nInd(H3N2) # 1903 # How many after we clone corrected for country, year, and month? nInd(c.y.m) # 1190 ## End(Not run)
# LOAD A. euteiches data set data(Aeut) # Redefine it as a genclone object Aeut <- as.genclone(Aeut) strata(Aeut) <- other(Aeut)$population_hierarchy[-1] # Check the number of multilocus genotypes mlg(Aeut) popNames(Aeut) # Clone correct at the population level. Aeut.pop <- clonecorrect(Aeut, strata = ~Pop) mlg(Aeut.pop) popNames(Aeut.pop) ## Not run: # Clone correct at the subpopulation level with respect to population and # combine. Aeut.subpop <- clonecorrect(Aeut, strata = ~Pop/Subpop, combine=TRUE) mlg(Aeut.subpop) popNames(Aeut.subpop) # Do the same, but set to the population level. Aeut.subpop2 <- clonecorrect(Aeut, strata = ~Pop/Subpop, keep=1) mlg(Aeut.subpop2) popNames(Aeut.subpop2) # LOAD H3N2 dataset data(H3N2) strata(H3N2) <- other(H3N2)$x # Extract only the individuals located in China country <- clonecorrect(H3N2, strata = ~country) # How many isolates did we have from China before clone correction? sum(strata(H3N2, ~country) == "China") # 155 # How many unique isolates from China after clone correction? sum(strata(country, ~country) == "China") # 79 # Something a little more complicated. (This could take a few minutes on # slower computers) # setting the hierarchy to be Country > Year > Month c.y.m <- clonecorrect(H3N2, strata = ~year/month/country) # How many isolates in the original data set? nInd(H3N2) # 1903 # How many after we clone corrected for country, year, and month? nInd(c.y.m) # 1190 ## End(Not run)
Given a series of thresholds for a data set that collapse it into one giant cluster, this will search the top fraction of threshold differences to find the largest difference. The average between the thresholds spanning that difference is the cutoff threshold defining the clonal lineage threshold.
cutoff_predictor(thresholds, fraction = 0.5)
cutoff_predictor(thresholds, fraction = 0.5)
thresholds |
a vector of numerics coming from mlg.filter where the threshold has been set to the maximum threshold theoretically possible. |
fraction |
the fraction of the data to seek the threshold. |
a numeric value representing the threshold at which multilocus lineages should be defined.
This function originally appeared in doi:10.5281/zenodo.17424. This is a bit of a blunt instrument.
Zhian N. Kamvar
ZN Kamvar, JC Brooks, and NJ Grünwald. 2015. Supplementary Material for Frontiers Plant Genetics and Genomics 'Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality'. DOI: doi:10.5281/zenodo.17424
Kamvar ZN, Brooks JC and Grünwald NJ (2015) Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. Front. Genet. 6:208. doi: doi:10.3389/fgene.2015.00208
data(Pinf) pinfreps <- fix_replen(Pinf, c(2, 2, 6, 2, 2, 2, 2, 2, 3, 3, 2)) pthresh <- filter_stats(Pinf, distance = bruvo.dist, replen = pinfreps, plot = TRUE, stats = "THRESHOLD", threads = 1L) # prediction for farthest neighbor cutoff_predictor(pthresh$farthest) # prediction for all algorithms sapply(pthresh, cutoff_predictor)
data(Pinf) pinfreps <- fix_replen(Pinf, c(2, 2, 6, 2, 2, 2, 2, 2, 3, 3, 2)) pthresh <- filter_stats(Pinf, distance = bruvo.dist, replen = pinfreps, plot = TRUE, stats = "THRESHOLD", threads = 1L) # prediction for farthest neighbor cutoff_predictor(pthresh$farthest) # prediction for all algorithms sapply(pthresh, cutoff_predictor)
diss.dist uses the same discrete dissimilarity matrix utilized by the index
of association (see ia
for details). By default, it returns a
distance reflecting the number of allelic differences between two
individuals. When percent = TRUE
, it returns a ratio of the number of
observed differences by the number of possible differences. Eg. two
individuals who share half of the same alleles will have a distance of 0.5.
This function can analyze distances for any marker system.
diss.dist(x, percent = FALSE, mat = FALSE)
diss.dist(x, percent = FALSE, mat = FALSE)
x |
a |
percent |
|
mat |
|
The distance calculated here is quite simple and goes by many names, depending on its application. The most familiar name might be the Hamming distance, or the number of differences between two strings.
Pairwise distances between individuals present in the genind object.
When percent = TRUE
, this is exactly the same as
provesti.dist
, except that it performs better for large
numbers of individuals (n > 125) at the cost of available memory.
Zhian N. Kamvar
prevosti.dist
,
bitwise.dist
(for SNP data)
# A simple example. Let's analyze the mean distance among populations of A. # euteiches. data(Aeut) mean(diss.dist(popsub(Aeut, 1))) ## Not run: mean(diss.dist(popsub(Aeut, 2))) mean(diss.dist(Aeut)) ## End(Not run)
# A simple example. Let's analyze the mean distance among populations of A. # euteiches. data(Aeut) mean(diss.dist(popsub(Aeut, 1))) ## Not run: mean(diss.dist(popsub(Aeut, 2))) mean(diss.dist(Aeut)) ## End(Not run)
This function is intended to perform bootstrap statistics on a matrix of multilocus genotype counts in different populations. Results from this function should be interpreted carefully as the default statistics are known to have a downward bias. See the details for more information.
diversity_boot( tab, n, n.boot = 1L, n.rare = NULL, H = TRUE, G = TRUE, lambda = TRUE, E5 = TRUE, ... )
diversity_boot( tab, n, n.boot = 1L, n.rare = NULL, H = TRUE, G = TRUE, lambda = TRUE, E5 = TRUE, ... )
tab |
a table produced from the poppr function
|
n |
an integer > 0 specifying the number of bootstrap replicates to
perform (corresponds to |
n.boot |
an integer specifying the number of samples to be drawn in each
bootstrap replicate. If |
n.rare |
a sample size at which all resamplings should be performed.
This should be no larger than the smallest sample size. Defaults to
|
H |
logical whether or not to calculate Shannon's index |
G |
logical whether or not to calculate Stoddart and Taylor's index (aka inverse Simpson's index). |
lambda |
logical whether or not to calculate Simpson's index |
E5 |
logical whether or not to calculate Evenness |
... |
other parameters passed on to |
Bootstrapping is performed in three ways:
if n.rare
is a number greater than zero, then bootstrapping
is performed by randomly sampling without replacement n.rare
samples from the data.
\item if `n.boot` is greater than 1, bootstrapping is performed by sampling n.boot samples from a multinomial distribution weighted by the proportion of each MLG in the data. \item if `n.boot` is less than 2, bootstrapping is performed by sampling N samples from a multinomial distribution weighted by the proportion of each MLG in the data.
When sampling with replacement, the diversity statistics here present a
downward bias partially due to the small number of samples in the data.
The result is that the mean of the bootstrapped samples will often be
much lower than the observed value. Alternatively, you can increase the
sample size of the bootstrap by increasing the size of n.boot
. Both
of these methods should be taken with caution in interpretation. There
are several R packages freely available that will calculate and perform
bootstrap estimates of Shannon and Simpson diversity metrics (eg.
entropart, entropy, simboot, and
EntropyEstimation. These packages also offer unbiased estimators of
Shannon and Simpson diversity. Please take care when attempting to
interpret the results of this function.
a list of objects of class "boot".
Zhian N. Kamvar
diversity_stats()
for basic statistic calculation,
diversity_ci()
for confidence intervals and plotting, and
poppr()
. For bootstrap sampling:
stats::rmultinom()
boot::boot()
library(poppr) data(Pinf) tab <- mlg.table(Pinf, plot = FALSE) diversity_boot(tab, 10L) ## Not run: # This can be done in a parallel fashion (OSX uses "multicore", Windows uses "snow") system.time(diversity_boot(tab, 10000L, parallel = "multicore", ncpus = 4L)) system.time(diversity_boot(tab, 10000L)) ## End(Not run)
library(poppr) data(Pinf) tab <- mlg.table(Pinf, plot = FALSE) diversity_boot(tab, 10L) ## Not run: # This can be done in a parallel fashion (OSX uses "multicore", Windows uses "snow") system.time(diversity_boot(tab, 10000L, parallel = "multicore", ncpus = 4L)) system.time(diversity_boot(tab, 10000L)) ## End(Not run)
This function is for calculating bootstrap statistics and their confidence intervals. It is important to note that the calculation of confidence intervals is not perfect (See Details). Please be cautious when interpreting the results.
diversity_ci( tab, n = 1000, n.boot = 1L, ci = 95, total = TRUE, rarefy = FALSE, n.rare = 10, plot = TRUE, raw = TRUE, center = TRUE, ... )
diversity_ci( tab, n = 1000, n.boot = 1L, ci = 95, total = TRUE, rarefy = FALSE, n.rare = 10, plot = TRUE, raw = TRUE, center = TRUE, ... )
tab |
a |
n |
an integer defining the number of bootstrap replicates (defaults to 1000). |
n.boot |
an integer specifying the number of samples to be drawn in each
bootstrap replicate. If |
ci |
the percent for confidence interval. |
total |
argument to be passed on to |
rarefy |
if |
n.rare |
an integer specifying the smallest size at which to resample
data. This is only used if |
plot |
If |
raw |
if |
center |
if |
... |
parameters to be passed on to |
For details on the bootstrapping procedures, see
diversity_boot()
. Default bootstrapping is performed by
sampling N samples from a multinomial distribution weighted by the
relative multilocus genotype abundance per population where N is
equal to the number of samples in the data set. If n.boot > 2,
then n.boot samples are taken at each bootstrap replicate. When
rarefy = TRUE
, then samples are taken at the smallest population
size without replacement. This will provide confidence intervals for all
but the smallest population.
Confidence intervals are derived from the function
boot::norm.ci()
. This function will attempt to correct for bias
between the observed value and the bootstrapped estimate. When center = TRUE
(default), the confidence interval is calculated from the
bootstrapped distribution and centered around the bias-corrected estimate
as prescribed in Marcon (2012). This method can lead to undesirable
properties, such as the confidence interval lying outside of the maximum
possible value. For rarefaction, the confidence interval is simply
determined by calculating the percentiles from the bootstrapped
distribution. If you want to calculate your own confidence intervals, you
can use the results of the permutations stored in the $boot
element
of the output.
Rarefaction in the sense of this function is simply sampling a subset of the data at size n.rare. The estimates derived from this method have straightforward interpretations and allow you to compare diversity across populations since you are controlling for sample size.
Results are plotted as boxplots with point estimates. If there is no rarefaction applied, confidence intervals are displayed around the point estimates. The boxplots represent the actual values from the bootstrapping and will often appear below the estimates and confidence intervals.
obs a matrix with observed statistics in columns, populations in rows
est a matrix with estimated statistics in columns, populations in rows
CI an array of 3 dimensions giving the lower and upper bound, the index measured, and the population.
boot a list containing the output of
boot::boot()
for each population.
a data frame with the statistic observations, estimates, and confidence intervals in columns, and populations in rows. Note that the confidence intervals are converted to characters and rounded to three decimal places.
Almost all of the statistics supplied here have a maximum when all genotypes are equally represented. This means that bootstrapping the samples will always be downwardly biased. In many cases, the confidence intervals from the bootstrapped distribution will fall outside of the observed statistic. The reported confidence intervals here are reported by assuming the variance of the bootstrapped distribution is the same as the variance around the observed statistic. As different statistics have different properties, there will not always be one clear method for calculating confidence intervals. A suggestion for correction in Shannon's index is to center the CI around the observed statistic (Marcon, 2012), but there are theoretical limitations to this. For details, see https://stats.stackexchange.com/q/156235/49413.
While it is possible to use custom functions with this, there are three important things to remember when using these functions:
1. The function must return a single value. 2. The function must allow for both matrix and vector inputs 3. The function name cannot match or partially match any arguments from [boot::boot()]
Anonymous functions are okay
(e.g. function(x) vegan::rarefy(t(as.matrix(x)), 10)
).
Zhian N. Kamvar
Marcon, E., Herault, B., Baraloto, C. and Lang, G. (2012). The Decomposition of Shannon’s Entropy and a Confidence Interval for Beta Diversity. Oikos 121(4): 516-522.
diversity_boot()
diversity_stats()
poppr()
boot::boot()
boot::norm.ci()
boot::boot.ci()
library(poppr) data(Pinf) diversity_ci(Pinf, n = 100L) ## Not run: # With pretty results diversity_ci(Pinf, n = 100L, raw = FALSE) # This can be done in a parallel fasion (OSX uses "multicore", Windows uses "snow") system.time(diversity_ci(Pinf, 10000L, parallel = "multicore", ncpus = 4L)) system.time(diversity_ci(Pinf, 10000L)) # We often get many requests for a clonal fraction statistic. As this is # simply the number of observed MLGs over the number of samples, we # recommended that people calculate it themselves. With this function, you # can add it in: CF <- function(x){ x <- drop(as.matrix(x)) if (length(dim(x)) > 1){ res <- rowSums(x > 0)/rowSums(x) } else { res <- sum(x > 0)/sum(x) } return(res) } # Show pretty results diversity_ci(Pinf, 1000L, CF = CF, center = TRUE, raw = FALSE) diversity_ci(Pinf, 1000L, CF = CF, rarefy = TRUE, raw = FALSE) ## End(Not run)
library(poppr) data(Pinf) diversity_ci(Pinf, n = 100L) ## Not run: # With pretty results diversity_ci(Pinf, n = 100L, raw = FALSE) # This can be done in a parallel fasion (OSX uses "multicore", Windows uses "snow") system.time(diversity_ci(Pinf, 10000L, parallel = "multicore", ncpus = 4L)) system.time(diversity_ci(Pinf, 10000L)) # We often get many requests for a clonal fraction statistic. As this is # simply the number of observed MLGs over the number of samples, we # recommended that people calculate it themselves. With this function, you # can add it in: CF <- function(x){ x <- drop(as.matrix(x)) if (length(dim(x)) > 1){ res <- rowSums(x > 0)/rowSums(x) } else { res <- sum(x > 0)/sum(x) } return(res) } # Show pretty results diversity_ci(Pinf, 1000L, CF = CF, center = TRUE, raw = FALSE) diversity_ci(Pinf, 1000L, CF = CF, rarefy = TRUE, raw = FALSE) ## End(Not run)
Calculate diversity statistics on a matrix containing counts of multilocus genotypes per population.
diversity_stats(z, H = TRUE, G = TRUE, lambda = TRUE, E5 = TRUE, ...)
diversity_stats(z, H = TRUE, G = TRUE, lambda = TRUE, E5 = TRUE, ...)
z |
a table of integers representing counts of MLGs (columns) per population (rows) |
H |
logical whether or not to calculate Shannon's index |
G |
logical whether or not to calculate Stoddart and Taylor's index (aka inverse Simpson's index). |
lambda |
logical whether or not to calculate Simpson's index |
E5 |
logical whether or not to calculate Evenness |
... |
any functions that can be calculated on a vector or matrix of genotype counts. |
This function will calculate any diversity statistic for counts of
multilocus genotypes per population. This does not count allelic diversity.
The calculations of H, G, and lambda are all performed by
vegan::diversity()
. E5 is calculated as
.
a numeric matrix giving statistics (columns) for each population (rows).
Zhian N. Kamvar
diversity_boot()
diversity_ci()
poppr()
library(poppr) data(Pinf) tab <- mlg.table(Pinf, plot = FALSE) diversity_stats(tab) ## Not run: # Example using the poweRlaw package to calculate the negative slope of the # Pareto distribution. library("poweRlaw") power_law_beta <- function(x){ xpow <- displ(x[x > 0]) # Generate the distribution xpow$setPars(estimate_pars(xpow)) # Estimate the parameters xdat <- plot(xpow, draw = FALSE) # Extract the data xlm <- lm(log(y) ~ log(x), data = xdat) # Run log-log linear model for slope return(-coef(xlm)[2]) } Beta <- function(x){ x <- drop(as.matrix(x)) if (length(dim(x)) > 1){ res <- apply(x, 1, power_law_beta) } else { res <- power_law_beta(x) } return(res) } diversity_stats(tab, B = Beta) ## End(Not run)
library(poppr) data(Pinf) tab <- mlg.table(Pinf, plot = FALSE) diversity_stats(tab) ## Not run: # Example using the poweRlaw package to calculate the negative slope of the # Pareto distribution. library("poweRlaw") power_law_beta <- function(x){ xpow <- displ(x[x > 0]) # Generate the distribution xpow$setPars(estimate_pars(xpow)) # Estimate the parameters xdat <- plot(xpow, draw = FALSE) # Extract the data xlm <- lm(log(y) ~ log(x), data = xdat) # Run log-log linear model for slope return(-coef(xlm)[2]) } Beta <- function(x){ x <- drop(as.matrix(x)) if (length(dim(x)) > 1){ res <- apply(x, 1, power_law_beta) } else { res <- power_law_beta(x) } return(res) } diversity_stats(tab, B = Beta) ## End(Not run)
This function is a wrapper to mlg.filter. It will calculate all of the stats for mlg.filter utilizing all of the algorithms.
filter_stats( x, distance = bitwise.dist, threshold = 1e+06 + .Machine$double.eps^0.5, stats = "All", missing = "ignore", plot = FALSE, cols = NULL, nclone = NULL, hist = "Scott", threads = 1L, ... )
filter_stats( x, distance = bitwise.dist, threshold = 1e+06 + .Machine$double.eps^0.5, stats = "All", missing = "ignore", plot = FALSE, cols = NULL, nclone = NULL, hist = "Scott", threads = 1L, ... )
x |
|
distance |
a distance function or matrix |
threshold |
a threshold to be passed to |
stats |
what statistics should be calculated. |
missing |
how to treat missing data with mlg.filter |
plot |
If the threshold is a maximum threshold, should the statistics be plotted (Figure 2) |
cols |
the colors to use for each algorithm (defaults to set1 of RColorBrewer). |
nclone |
the number of multilocus genotypes you expect for the data. This will draw horizontal line on the graph at the value nclone and then vertical lines showing the cutoff thresholds for each algorithm. |
hist |
if you want a histogram to be plotted behind the statistics,
select a method here. Available methods are "sturges", "fd", or "scott"
(default) as documented in |
threads |
(unused) Previously the number of threads to be used. As of poppr version 2.4.1, this is by default set to 1. |
... |
extra parameters passed on to the distance function. |
a list of results from mlg.filter from the three
algorithms. (returns invisibly if plot = TRUE
)
This function originally appeared in doi:10.5281/zenodo.17424
Zhian N. Kamvar, Jonah C. Brooks
ZN Kamvar, JC Brooks, and NJ Grünwald. 2015. Supplementary Material for Frontiers Plant Genetics and Genomics 'Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality'. DOI: doi:10.5281/zenodo.17424
Kamvar ZN, Brooks JC and Grünwald NJ (2015) Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. Front. Genet. 6:208. doi: doi:10.3389/fgene.2015.00208
mlg.filter
cutoff_predictor
bitwise.dist
diss.dist
# Basic usage example: Bruvo's Distance -------------------------------- data(Pinf) pinfreps <- fix_replen(Pinf, c(2, 2, 6, 2, 2, 2, 2, 2, 3, 3, 2)) bres <- filter_stats(Pinf, distance = bruvo.dist, replen = pinfreps, plot = TRUE, threads = 1L) print(bres) # shows all of the statistics # Use these results with cutoff_filter() print(thresh <- cutoff_predictor(bres$farthest$THRESHOLDS)) mlg.filter(Pinf, distance = bruvo.dist, replen = pinfreps) <- thresh Pinf # Different distances will give different results ----------------------- nres <- filter_stats(Pinf, distance = nei.dist, plot = TRUE, threads = 1L, missing = "mean") print(thresh <- cutoff_predictor(nres$farthest$THRESHOLDS)) mlg.filter(Pinf, distance = nei.dist, missing = "mean") <- thresh Pinf
# Basic usage example: Bruvo's Distance -------------------------------- data(Pinf) pinfreps <- fix_replen(Pinf, c(2, 2, 6, 2, 2, 2, 2, 2, 3, 3, 2)) bres <- filter_stats(Pinf, distance = bruvo.dist, replen = pinfreps, plot = TRUE, threads = 1L) print(bres) # shows all of the statistics # Use these results with cutoff_filter() print(thresh <- cutoff_predictor(bres$farthest$THRESHOLDS)) mlg.filter(Pinf, distance = bruvo.dist, replen = pinfreps) <- thresh Pinf # Different distances will give different results ----------------------- nres <- filter_stats(Pinf, distance = nei.dist, plot = TRUE, threads = 1L, missing = "mean") print(thresh <- cutoff_predictor(nres$farthest$THRESHOLDS)) mlg.filter(Pinf, distance = nei.dist, missing = "mean") <- thresh Pinf
Attempts to fix inconsistent repeat lengths found by test_replen
fix_replen(gid, replen, e = 1e-05, fix_some = TRUE)
fix_replen(gid, replen, e = 1e-05, fix_some = TRUE)
gid |
|
replen |
a numeric vector of repeat motif lengths. |
e |
a number to be subtracted or added to inconsistent repeat lengths to allow for proper rounding. |
fix_some |
if |
This function is modified from the version used in
doi:10.5281/zenodo.13007.
Before being fed into the
algorithm to calculate Bruvo's distance, the amplicon length is divided by
the repeat unit length. Because of the amplified primer sequence attached
to sequence repeat, this division does not always result in an integer and
so the resulting numbers are rounded. The rounding also protects against
slight mis-calls of alleles. Because we know that
is equivalent to
, we know that the primer sequence will not alter the relationships
between the alleles. Unfortunately for nucleotide repeats that have powers
of 2, rounding in R is based off of the IEC 60559 standard (see
round
), that means that any number ending in 5 is rounded to
the nearest even digit. This function will attempt to alleviate this
problem by adding a very small amount to the repeat length so that division
will not result in a 0.5. If this fails, the same amount will be
subtracted. If neither of these work, a warning will be issued and it is up
to the user to determine if the fault is in the allele calls or the repeat
lengths.
a numeric vector of corrected repeat motif lengths.
Zhian N. Kamvar
Zhian N. Kamvar, Meg M. Larsen, Alan M. Kanaskie, Everett M. Hansen, & Niklaus J. Grünwald. Sudden_Oak_Death_in_Oregon_Forests: Spatial and temporal population dynamics of the sudden oak death epidemic in Oregon Forests. ZENODO, doi:10.5281/zenodo.13007, 2014.
Kamvar, Z. N., Larsen, M. M., Kanaskie, A. M., Hansen, E. M., & Grünwald, N. J. (2015). Spatial and temporal analysis of populations of the sudden oak death pathogen in Oregon forests. Phytopathology 105:982-989. doi: doi:10.1094/PHYTO-12-14-0350-FI
Ruzica Bruvo, Nicolaas K. Michiels, Thomas G. D'Souza, and Hinrich Schulenburg. A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level. Molecular Ecology, 13(7):2101-2106, 2004.
test_replen
bruvo.dist
bruvo.msn
bruvo.boot
data(Pram) (Pram_replen <- setNames(c(3, 2, 4, 4, 4), locNames(Pram))) fix_replen(Pram, Pram_replen) # Let's start with an example of a tetranucleotide repeat motif and imagine # that there are twenty alleles all 1 step apart: (x <- 1:20L * 4L) # These are the true lengths of the different alleles. Now, let's add the # primer sequence to them. (PxP <- x + 21 + 21) # Now we make sure that x / 4 is equal to 1:20, which we know each have # 1 difference. x/4 # Now, we divide the sequence with the primers by 4 and see what happens. (PxPc <- PxP/4) (PxPcr <- round(PxPc)) diff(PxPcr) # we expect all 1s # Let's try that again by subtracting a tiny amount from 4 (PxPc <- PxP/(4 - 1e-5)) (PxPcr <- round(PxPc)) diff(PxPcr)
data(Pram) (Pram_replen <- setNames(c(3, 2, 4, 4, 4), locNames(Pram))) fix_replen(Pram, Pram_replen) # Let's start with an example of a tetranucleotide repeat motif and imagine # that there are twenty alleles all 1 step apart: (x <- 1:20L * 4L) # These are the true lengths of the different alleles. Now, let's add the # primer sequence to them. (PxP <- x + 21 + 21) # Now we make sure that x / 4 is equal to 1:20, which we know each have # 1 difference. x/4 # Now, we divide the sequence with the primers by 4 and see what happens. (PxPc <- PxP/4) (PxPcr <- round(PxPc)) diff(PxPcr) # we expect all 1s # Let's try that again by subtracting a tiny amount from 4 (PxPc <- PxP/(4 - 1e-5)) (PxPcr <- round(PxPc)) diff(PxPcr)
GENclone is an S4 class that extends the
genind
object.
SNPclone is an S4 class
that extends the genlight
object.
They will have
all of the same attributes as their parent classes, but they will contain
one extra slot to store extra information about multilocus genotypes.
The genclone and snpclone classes will allow for more optimized methods of clone correction.
Previously for genind and genlight objects,
multilocus genotypes were not retained after a data set was subset by
population. The new mlg
slot allows us to assign the
multilocus genotypes and retain that information no matter how we subset
the data set. This new slot can either contain numeric values for
multilocus genotypes OR it can contain a special internal
MLG
class that allows for custom multilocus genotype
definitions and filtering.
mlg
a vector representing multilocus genotypes for the data set OR an
object of class MLG
.
The genclone
class extends class
"genind"
, directly.
The snpclone
class
extends class "genlight"
, directly.
When calculating multilocus genotypes for genclone objects, a rank
function is used, but calculation of multilocus genotypes for snpclone
objects is distance-based (via bitwise.dist
and
mlg.filter
). This means that genclone objects are sensitive
to missing data, whereas snpclone objects are insensitive.
Zhian N. Kamvar
as.genclone
as.snpclone
genind
genlight
strata
setPop
MLG
mll
mlg.filter
## Not run: # genclone objects can be created from genind objects # data(partial_clone) partial_clone (pc <- as.genclone(partial_clone)) # snpclone objects can be created from genlight objects # set.seed(999) (gl <- glSim(100, 0, n.snp.struc = 1e3, ploidy = 2, parallel = FALSE)) (sc <- as.snpclone(rbind(gl, gl, parallel = FALSE), parallel = FALSE)) # # Use mlg.filter to create a distance threshold to define multilocus genotypes. mlg.filter(sc, threads = 1L) <- 0.25 sc # 82 mlgs ## End(Not run)
## Not run: # genclone objects can be created from genind objects # data(partial_clone) partial_clone (pc <- as.genclone(partial_clone)) # snpclone objects can be created from genlight objects # set.seed(999) (gl <- glSim(100, 0, n.snp.struc = 1e3, ploidy = 2, parallel = FALSE)) (sc <- as.snpclone(rbind(gl, gl, parallel = FALSE), parallel = FALSE)) # # Use mlg.filter to create a distance threshold to define multilocus genotypes. mlg.filter(sc, threads = 1L) <- 0.25 sc # 82 mlgs ## End(Not run)
genind2genalex will export a genclone or genind object to a \*.csv file formatted for use in genalex.
genind2genalex( gid, filename = "", overwrite = FALSE, quiet = FALSE, pop = NULL, allstrata = TRUE, geo = FALSE, geodf = "xy", sep = ",", sequence = FALSE )
genind2genalex( gid, filename = "", overwrite = FALSE, quiet = FALSE, pop = NULL, allstrata = TRUE, geo = FALSE, geodf = "xy", sep = ",", sequence = FALSE )
gid |
|
filename |
a string indicating the name and/or path of the file you wish
to create. If this is left unchanged, the results will be saved in a
temporary file and a warning will be displayed for six seconds before the
file is written. This process should give you time to cancel the process
and choose a file path. Otherwise, the name of the file is returned, so you
can copy that to a file of your choice with |
overwrite |
|
quiet |
|
pop |
a character vector OR formula specifying the population factor.
This can be used to specify a specific subset of strata or custom
population factor for the output. Note that the |
allstrata |
if this is |
geo |
|
geodf |
|
sep |
a character specifying what character to use to separate columns. Defaults to ",". |
sequence |
when |
The the file path or connection where the data were written.
If your data set lacks a population structure, it will be coded in the new file as a single population labeled "Pop". Likewise, if you don't have any labels for your individuals, they will be labeled as "ind1" through "indN", with N being the size of your population.
Zhian N. Kamvar
read.genalex()
, clonecorrect()
, genclone, genind
## Not run: data(nancycats) genind2genalex(nancycats, "~/Documents/nancycats.csv", geo=TRUE) ## End(Not run)
## Not run: data(nancycats) genind2genalex(nancycats, "~/Documents/nancycats.csv", geo=TRUE) ## End(Not run)
Genotype accumulation curves are useful for determining the minimum number of loci necessary to discriminate between individuals in a population. This function will randomly sample loci without replacement and count the number of multilocus genotypes observed.
genotype_curve( gen, sample = 100, maxloci = 0L, quiet = FALSE, thresh = 1, plot = TRUE, drop = TRUE, dropna = TRUE )
genotype_curve( gen, sample = 100, maxloci = 0L, quiet = FALSE, thresh = 1, plot = TRUE, drop = TRUE, dropna = TRUE )
gen |
|
sample |
an |
maxloci |
the maximum number of loci to sample. By default,
|
quiet |
if |
thresh |
a number from 0 to 1. This will draw a line at that fraction of multilocus genotypes, rounded. Defaults to 1, which will draw a line at the maximum number of observable genotypes. |
plot |
if |
drop |
if |
dropna |
if |
Internally, this function works by converting the data into a
loci
object, which represents genotypes as a data
frame of factors. Random samples are taken of 1 to n-1 columns of the
matrix and the number of unique rows are counted to determine the number of
multilocus genotypes in that random sample. This function does not take
into account any definitions of MLGs via mlg.filter
or
mll.custom
.
(invisibly by deafuls) a matrix of integers showing the results of each randomization. Columns represent the number of loci sampled and rows represent an independent sample.
Zhian N. Kamvar
data(nancycats) nan_geno <- genotype_curve(nancycats) ## Not run: # Marker Type Comparison -------------------------------------------------- # With AFLP data, it is often necessary to include more markers for resolution data(Aeut) Ageno <- genotype_curve(Aeut) # Many microsatellite data sets have hypervariable markers data(microbov) mgeno <- geotype_curve(microbov) # Adding a trendline ------------------------------------------------------ # Trendlines: you can add a smoothed trendline with geom_smooth() library("ggplot2") p <- last_plot() p + geom_smooth() # Producing Figures for Publication --------------------------------------- # This data set has been pre filtered data(monpop) mongeno <- genotype_curve(monpop) # Here, we add a curve and a title for publication p <- last_plot() mytitle <- expression(paste("Genotype Accumulation Curve for ", italic("M. fructicola"))) p + geom_smooth() + theme_bw() + theme(text = element_text(size = 12, family = "serif")) + theme(title = element_text(size = 14)) + ggtitle(mytitle) ## End(Not run)
data(nancycats) nan_geno <- genotype_curve(nancycats) ## Not run: # Marker Type Comparison -------------------------------------------------- # With AFLP data, it is often necessary to include more markers for resolution data(Aeut) Ageno <- genotype_curve(Aeut) # Many microsatellite data sets have hypervariable markers data(microbov) mgeno <- geotype_curve(microbov) # Adding a trendline ------------------------------------------------------ # Trendlines: you can add a smoothed trendline with geom_smooth() library("ggplot2") p <- last_plot() p + geom_smooth() # Producing Figures for Publication --------------------------------------- # This data set has been pre filtered data(monpop) mongeno <- genotype_curve(monpop) # Here, we add a curve and a title for publication p <- last_plot() mytitle <- expression(paste("Genotype Accumulation Curve for ", italic("M. fructicola"))) p + geom_smooth() + theme_bw() + theme(text = element_text(size = 12, family = "serif")) + theme(title = element_text(size = 14)) + ggtitle(mytitle) ## End(Not run)
getfile is a convenience function that serves as a wrapper for the functions
file.choose()
, file.path()
, and list.files()
.
If the user is working in a GUI environment, a window will pop up, allowing
the user to choose a specified file regardless of path.
getfile(multi = FALSE, pattern = NULL, combine = TRUE)
getfile(multi = FALSE, pattern = NULL, combine = TRUE)
multi |
this is an indicator to allow the user to store the names of
multiple files found in the directory. This is useful in conjunction with
|
pattern |
a |
combine |
|
path |
a character string of the absolute path to the chosen file or files |
files |
a character vector containing the chosen file name or names. |
Zhian N. Kamvar
## Not run: x <- getfile() poppr(x$files) y <- getfile(multi=TRUE, pattern="^.+?dat$") #useful for reading in multiple FSTAT formatted files. yfiles <- poppr.all(y$files) ## End(Not run)
## Not run: x <- getfile() poppr(x$files) y <- getfile(multi=TRUE, pattern="^.+?dat$") #useful for reading in multiple FSTAT formatted files. yfiles <- poppr.all(y$files) ## End(Not run)
This function has one purpose. It is for deciding the appropriate scaling for a grey palette to be used for edge weights of a minimum spanning network.
greycurve( data = seq(0, 1, length = 1000), glim = c(0, 0.8), gadj = 3, gweight = 1, scalebar = FALSE )
greycurve( data = seq(0, 1, length = 1000), glim = c(0, 0.8), gadj = 3, gweight = 1, scalebar = FALSE )
data |
a sequence of numbers to be converted to greyscale. |
glim |
"grey limit". Two numbers between zero and one. They determine
the upper and lower limits for the |
gadj |
"grey adjust". a positive |
gweight |
"grey weight". an |
scalebar |
When this is set to |
A plot displaying a grey gradient from 0.001 to 1 with minimum and maximum values displayed as yellow lines, and an equation for the correction displayed in red.
Zhian N. Kamvar
# Normal grey curve with an adjustment of 3, an upper limit of 0.8, and # weighted towards smaller values. greycurve() ## Not run: # 1:1 relationship grey curve. greycurve(gadj=1, glim=1:0) # Grey curve weighted towards larger values. greycurve(gweight=2) # Same as the first, but the limit is 1. greycurve(glim=1:0) # Setting the lower limit to 0.1 and weighting towards larger values. greycurve(glim=c(0.1,0.8), gweight=2) ## End(Not run)
# Normal grey curve with an adjustment of 3, an upper limit of 0.8, and # weighted towards smaller values. greycurve() ## Not run: # 1:1 relationship grey curve. greycurve(gadj=1, glim=1:0) # Grey curve weighted towards larger values. greycurve(gweight=2) # Same as the first, but the limit is 1. greycurve(glim=1:0) # Setting the lower limit to 0.1 and weighting towards larger values. greycurve(glim=c(0.1,0.8), gweight=2) ## End(Not run)
Calculate the Index of Association and Standardized Index of Association.
ia( gid, sample = 0, method = 1, quiet = FALSE, missing = "ignore", plot = TRUE, hist = TRUE, index = "rbarD", valuereturn = FALSE ) pair.ia( gid, sample = 0L, quiet = FALSE, plot = TRUE, low = "blue", high = "red", limits = NULL, index = "rbarD", method = 1L ) resample.ia(gid, n = NULL, reps = 999, quiet = FALSE, use_psex = FALSE, ...) jack.ia(gid, n = NULL, reps = 999, quiet = FALSE)
ia( gid, sample = 0, method = 1, quiet = FALSE, missing = "ignore", plot = TRUE, hist = TRUE, index = "rbarD", valuereturn = FALSE ) pair.ia( gid, sample = 0L, quiet = FALSE, plot = TRUE, low = "blue", high = "red", limits = NULL, index = "rbarD", method = 1L ) resample.ia(gid, n = NULL, reps = 999, quiet = FALSE, use_psex = FALSE, ...) jack.ia(gid, n = NULL, reps = 999, quiet = FALSE)
gid |
a |
sample |
an integer indicating the number of permutations desired (eg 999). |
method |
an integer from 1 to 4 indicating the sampling method desired.
see |
quiet |
Should the function print anything to the screen while it is
performing calculations?
|
missing |
a character string. see |
plot |
When |
hist |
|
index |
|
valuereturn |
|
low |
(for pair.ia) a color to use for low values when 'plot = TRUE' |
high |
(for pair.ia) a color to use for low values when 'plot = TRUE' |
limits |
(for pair.ia) the limits to be used for the color scale. Defaults to 'NULL'. If you want to use a custom range, supply two numbers between -1 and 1, (e.g. 'limits = c(-0.15, 1)') |
n |
an integer specifying the number of samples to be drawn. Defaults to
|
reps |
an integer specifying the number of replicates to perform. Defaults to 999. |
use_psex |
a logical. If |
... |
arguments passed on to |
ia()
calculates the index of association over all loci in the data set.
pair.ia()
calculates the index of association in a pairwise manner
among all loci.
resample.ia()
calculates the index of association on a reduced data set
multiple times to create a distribution, showing the variation of values
observed at a given sample size (previously jack.ia()
).
The index of association was originally developed by A.H.D. Brown analyzing population structure of wild barley (Brown, 1980). It has been widely used as a tool to detect clonal reproduction within populations . Populations whose members are undergoing sexual reproduction, whether it be selfing or out-crossing, will produce gametes via meiosis, and thus have a chance to shuffle alleles in the next generation. Populations whose members are undergoing clonal reproduction, however, generally do so via mitosis. This means that the most likely mechanism for a change in genotype is via mutation. The rate of mutation varies from species to species, but it is rarely sufficiently high to approximate a random shuffling of alleles. The index of association is a calculation based on the ratio of the variance of the raw number of differences between individuals and the sum of those variances over each locus . You can also think of it as the observed variance over the expected variance. If they are the same, then the index is zero after subtracting one (from Maynard-Smith, 1993):
Since the distance is more or less a binary distance, any sort of marker can be used for this analysis. In the calculation, phase is not considered, and any difference increases the distance between two individuals. Remember that each column represents a different allele and that each entry in the table represents the fraction of the genotype made up by that allele at that locus. Notice also that the sum of the rows all equal one. Poppr uses this to calculate distances by simply taking the sum of the absolute values of the differences between rows.
The calculation for the distance between two individuals at a single locus with a allelic states and a ploidy of k is as follows (except for Presence/Absence data):
To find the total number of differences between two individuals over all loci, you just take d over m loci, a value we'll call D:
These values are calculated over all possible combinations of individuals
in the data set, after which you end up
with
values of d and
values of D. Calculating the
observed variances is fairly straightforward (modified from Agapow and
Burt, 2001):
Calculating the expected variance is the sum of each of the variances of the individual loci. The calculation at a single locus, j is the same as the previous equation, substituting values of D for d:
The expected variance is then the sum of all the variances over all m loci:
Agapow and Burt showed that increases steadily with the number
of loci, so they came up with an approximation that is widely used,
. For the derivation, see the manual for
multilocus.
pair.ia()
A matrix with two columns and choose(nLoc(gid), 2) rows representing the values for Ia and rbarD per locus pair.
A named number vector of length 2 giving the Index of Association, "Ia"; and the Standardized Index of Association, "rbarD"
A a named numeric vector of length 4 with the following values:
Ia - numeric. The index of association.
p.Ia - A number indicating the p-value resulting from a one-sided permutation test based on the number of samples indicated in the original call.
rbarD - numeric. The standardized index of association.
p.rD - A factor indicating the p-value resulting from a one-sided permutation test based on the number of samples indicated in the original call.
valureturn = TRUE
A list with the following elements:
index The above vector
samples A data frame with s by 2 column data frame where s is the number of samples defined. The columns are for the values of Ia and rbarD, respectively.
a data frame with the index of association and standardized index of association in columns. Number of rows represents the number of reps.
jack.ia()
is deprecated as the name was misleading. Please use
resample.ia()
Zhian N. Kamvar
Paul-Michael Agapow and Austin Burt. Indices of multilocus linkage disequilibrium. Molecular Ecology Notes, 1(1-2):101-102, 2001
A.H.D. Brown, M.W. Feldman, and E. Nevo. Multilocus structure of natural populations of Hordeum spontaneum. Genetics, 96(2):523-536, 1980.
J M Smith, N H Smith, M O'Rourke, and B G Spratt. How clonal are bacteria? Proceedings of the National Academy of Sciences, 90(10):4384-4388, 1993.
poppr()
, missingno()
,
import2genind()
, read.genalex()
,
clonecorrect()
, win.ia()
, samp.ia()
data(nancycats) ia(nancycats) # Pairwise over all loci: data(partial_clone) res <- pair.ia(partial_clone) plot(res, low = "black", high = "green", index = "Ia") # Resampling data(Pinf) resample.ia(Pinf, reps = 99) ## Not run: # Pairwise IA with p-values (this will take about a minute) res <- pair.ia(partial_clone, sample = 999) head(res) # Plot the results of resampling rbarD. library("ggplot2") Pinf.resamp <- resample.ia(Pinf, reps = 999) ggplot(Pinf.resamp[2], aes(x = rbarD)) + geom_histogram() + geom_vline(xintercept = ia(Pinf)[2]) + geom_vline(xintercept = ia(clonecorrect(Pinf))[2], linetype = 2) + xlab(expression(bar(r)[d])) # Get the indices back and plot the distributions. nansamp <- ia(nancycats, sample = 999, valuereturn = TRUE) plot(nansamp, index = "Ia") plot(nansamp, index = "rbarD") # You can also adjust the parameters for how large to display the text # so that it's easier to export it for publication/presentations. library("ggplot2") plot(nansamp, labsize = 5, linesize = 2) + theme_bw() + # adding a theme theme(text = element_text(size = rel(5))) + # changing text size theme(plot.title = element_text(size = rel(4))) + # changing title size ggtitle("Index of Association of nancycats") # adding a new title # Get the index for each population. lapply(seppop(nancycats), ia) # With sampling lapply(seppop(nancycats), ia, sample = 999) # Plot pairwise ia for all populations in a grid with cowplot # Set up the library and data library("cowplot") data(monpop) splitStrata(monpop) <- ~Tree/Year/Symptom setPop(monpop) <- ~Tree # Need to set up a list in which to store the plots. plotlist <- vector(mode = "list", length = nPop(monpop)) names(plotlist) <- popNames(monpop) # Loop throgh the populations, calculate pairwise ia, plot, and then # capture the plot in the list for (i in popNames(monpop)){ x <- pair.ia(monpop[pop = i], limits = c(-0.15, 1)) # subset, calculate, and plot plotlist[[i]] <- ggplot2::last_plot() # save the last plot } # Use the plot_grid function to plot. plot_grid(plotlist = plotlist, labels = paste("Tree", popNames(monpop))) ## End(Not run)
data(nancycats) ia(nancycats) # Pairwise over all loci: data(partial_clone) res <- pair.ia(partial_clone) plot(res, low = "black", high = "green", index = "Ia") # Resampling data(Pinf) resample.ia(Pinf, reps = 99) ## Not run: # Pairwise IA with p-values (this will take about a minute) res <- pair.ia(partial_clone, sample = 999) head(res) # Plot the results of resampling rbarD. library("ggplot2") Pinf.resamp <- resample.ia(Pinf, reps = 999) ggplot(Pinf.resamp[2], aes(x = rbarD)) + geom_histogram() + geom_vline(xintercept = ia(Pinf)[2]) + geom_vline(xintercept = ia(clonecorrect(Pinf))[2], linetype = 2) + xlab(expression(bar(r)[d])) # Get the indices back and plot the distributions. nansamp <- ia(nancycats, sample = 999, valuereturn = TRUE) plot(nansamp, index = "Ia") plot(nansamp, index = "rbarD") # You can also adjust the parameters for how large to display the text # so that it's easier to export it for publication/presentations. library("ggplot2") plot(nansamp, labsize = 5, linesize = 2) + theme_bw() + # adding a theme theme(text = element_text(size = rel(5))) + # changing text size theme(plot.title = element_text(size = rel(4))) + # changing title size ggtitle("Index of Association of nancycats") # adding a new title # Get the index for each population. lapply(seppop(nancycats), ia) # With sampling lapply(seppop(nancycats), ia, sample = 999) # Plot pairwise ia for all populations in a grid with cowplot # Set up the library and data library("cowplot") data(monpop) splitStrata(monpop) <- ~Tree/Year/Symptom setPop(monpop) <- ~Tree # Need to set up a list in which to store the plots. plotlist <- vector(mode = "list", length = nPop(monpop)) names(plotlist) <- popNames(monpop) # Loop throgh the populations, calculate pairwise ia, plot, and then # capture the plot in the list for (i in popNames(monpop)){ x <- pair.ia(monpop[pop = i], limits = c(-0.15, 1)) # subset, calculate, and plot plotlist[[i]] <- ggplot2::last_plot() # save the last plot } # Use the plot_grid function to plot. plot_grid(plotlist = plotlist, labels = paste("Tree", popNames(monpop))) ## End(Not run)
This function will launch an interactive interface that allows you to create, plot, manipulate, and save minimum spanning networks. It runs using the shiny R package.
imsn()
imsn()
Creating and plotting MSNs requires three steps:
Create a distance matrix from your data
Create a minimum spanning network with your data and the matrix
Visualize the minimum spanning network
The function plot_poppr_msn
is currently the most flexible way
of visualizing your minimum spanning network, but with 20 parameters, it can
become pretty intimidating trying to find the right display for your MSN.
With this function, all three steps are combined into one interactive interface that will allow you to intuitively modify your minimum spanning network and even save the results to a pdf or png file.
NULL, invisibly
In the left hand panel, there are three buttons to execute the functions. These allow you to run the data set after you manipulate all of the parameters.
GO! - This button will start the application with the specified parameters
reData - Use this button when you have changed any parameters under the section Data Parameters. This involves recalculating the distance matrix and msn.
reGraph - Use this button when you have changed any parameters under the section Graphical Parameters. This involves superficial changes to the display of the minimum spanning network.
The right hand panel contains different tabs related to your data set of choice.
Plot - The minimum spanning network itself
Data - A display of your data set
Command - The commands used to create the plot. You can copy and paste this to an R file for reproducibility.
Save Plot - This provides a tool for you to save the plot to a PDF or PNG image.
Session Information - displays the result of
sessionInfo
for reproducibility.
Zhian N. Kamvar
plot_poppr_msn
diss.dist
bruvo.dist
bruvo.msn
poppr.msn
nei.dist
popsub
missingno
## Not run: # Set up some data library("poppr") library("magrittr") data(monpop) splitStrata(monpop) <- ~Tree/Year/Symptom summary(monpop) monpop_ssr <- c(CHMFc4 = 7, CHMFc5 = 2, CHMFc12 = 4, SEA = 4, SED = 4, SEE = 2, SEG = 6, SEI = 3, SEL = 4, SEN = 2, SEP = 4, SEQ = 2, SER = 4) t26 <- monpop %>% setPop(~Tree) %>% popsub("26") %>% setPop(~Year/Symptom) t26 if (interactive()) { imsn() # select Bruvo's distance and enter "monpop_ssr" into the Repeat Length field. # It is also possible to run this from github if you are connected to the internet. # This allows you to access any bug fixes that may have been updated before a formal # release on CRAN shiny::runGitHub("grunwaldlab/poppr", subdir = "inst/shiny/msn_explorer") # You can also use your own distance matrices, but there's a small catch. # in order to do so, you must write a function that will subset the matrix # to whatever populations are in your data. Here's an example with the above mondist <- bruvo.dist(monpop, replen = monpop_ssr) myDist <- function(x, d = mondist){ dm <- as.matrix(d) # Convert the dist object to a square matrix xi <- indNames(x) # Grab the sample names that exist return(as.dist(dm[xi, xi])) # return only the elements that have the names # in the data set } # After executing imsn, choose: # Distance: custom # myDist imsn() } ## End(Not run)
## Not run: # Set up some data library("poppr") library("magrittr") data(monpop) splitStrata(monpop) <- ~Tree/Year/Symptom summary(monpop) monpop_ssr <- c(CHMFc4 = 7, CHMFc5 = 2, CHMFc12 = 4, SEA = 4, SED = 4, SEE = 2, SEG = 6, SEI = 3, SEL = 4, SEN = 2, SEP = 4, SEQ = 2, SER = 4) t26 <- monpop %>% setPop(~Tree) %>% popsub("26") %>% setPop(~Year/Symptom) t26 if (interactive()) { imsn() # select Bruvo's distance and enter "monpop_ssr" into the Repeat Length field. # It is also possible to run this from github if you are connected to the internet. # This allows you to access any bug fixes that may have been updated before a formal # release on CRAN shiny::runGitHub("grunwaldlab/poppr", subdir = "inst/shiny/msn_explorer") # You can also use your own distance matrices, but there's a small catch. # in order to do so, you must write a function that will subset the matrix # to whatever populations are in your data. Here's an example with the above mondist <- bruvo.dist(monpop, replen = monpop_ssr) myDist <- function(x, d = mondist){ dm <- as.matrix(d) # Convert the dist object to a square matrix xi <- indNames(x) # Grab the sample names that exist return(as.dist(dm[xi, xi])) # return only the elements that have the names # in the data set } # After executing imsn, choose: # Distance: custom # myDist imsn() } ## End(Not run)
If two samples share no loci typed in common, they are incomparable and will produce missing data in a distance matrix, which could lead to problems with further analyses. This function finds these samples and returns a matrix of how many other samples these are incomparable with.
incomp(gid)
incomp(gid)
gid |
a genind or genclone object |
a square matrix with samples that are incomparable
data(nancycats) # These two populations have no samples that are incomparable incomp(nancycats[pop = c(1, 17)]) # If you reduce the number of loci, we find that there are # incomparable samples. incomp(nancycats[pop = c(1, 17), loc = c(1, 4)])
data(nancycats) # These two populations have no samples that are incomparable incomp(nancycats[pop = c(1, 17)]) # If you reduce the number of loci, we find that there are # incomparable samples. incomp(nancycats[pop = c(1, 17), loc = c(1, 4)])
Create a table summarizing missing data or ploidy information of a genind or genclone object
info_table( gen, type = c("missing", "ploidy"), percent = TRUE, plot = FALSE, df = FALSE, returnplot = FALSE, low = "blue", high = "red", plotlab = TRUE, scaled = TRUE )
info_table( gen, type = c("missing", "ploidy"), percent = TRUE, plot = FALSE, df = FALSE, returnplot = FALSE, low = "blue", high = "red", plotlab = TRUE, scaled = TRUE )
gen |
|
type |
|
percent |
|
plot |
|
df |
|
returnplot |
|
low |
|
high |
|
plotlab |
|
scaled |
|
Missing data is accounted for on a per-population level.
Ploidy is accounted for on a per-individual level.
This data is potentially useful for identifying areas of systematic missing data. There are a few caveats to be aware of.
Regarding counts of missing data: Each count represents the number of individuals with missing data at each locus. The last column, "mean" can be thought of as the average number of individuals with missing data per locus.
Regarding percentage missing data: This percentage is relative to the population and locus, not to the entire data set. The last column, "mean" represents the average percent of the population with missing data per locus.
This option is useful for data that has been imported with mixed ploidies. It will summarize the relative levels of ploidy per individual per locus. This is simply based off of observed alleles and does not provide any further estimates.
a matrix, data frame (df = TRUE
), or a list (returnplot
= TRUE
) representing missing data per population (type = 'missing'
)
or ploidy per individual (type = 'ploidy'
) in a genind
or genclone object.
Zhian N. Kamvar
data(nancycats) nancy.miss <- info_table(nancycats, plot = TRUE, type = "missing") data(Pinf) Pinf.ploid <- info_table(Pinf, plot = TRUE, type = "ploidy")
data(nancycats) nancy.miss <- info_table(nancycats, plot = TRUE, type = "missing") data(Pinf) Pinf.ploid <- info_table(Pinf, plot = TRUE, type = "ploidy")
This function will facilitate in removing phylogenetically uninformative loci
from a genclone
or genind
object.
The user has the ability to define what uninformative means by setting a
cutoff value for either percentage of differentiating genotypes or minor
allele frequency.
informloci(pop, cutoff = 2/nInd(pop), MAF = 0.01, quiet = FALSE)
informloci(pop, cutoff = 2/nInd(pop), MAF = 0.01, quiet = FALSE)
pop |
|
cutoff |
|
MAF |
|
quiet |
|
This function will remove uninformative loci using a traditional MAF
cutoff (using isPoly
from adegenet) as well
as analyzing the number of observed genotypes in a locus. This is important
for clonal organisms that can have fixed heterozygous sites not detected by
MAF methods.
A genind
object with user-defined informative loci.
This will have a few side effects that affect certain analyses. First, the number of multilocus genotypes might be reduced due to the reduced number of markers (if you are only using a genind object). Second, if you plan on using this data for analysis of the index of association, be sure to use the standardized version (rbarD) that corrects for the number of observed loci.
Zhian N. Kamvar
# We will use a dummy data set to demonstrate how this detects uninformative # loci using both MAF and a cutoff. genos <- c("A/A", "A/B", "A/C", "B/B", "B/C", "C/C") v <- sample(genos, 100, replace = TRUE) w <- c(rep(genos[2], 99), genos[3]) # found by cutoff x <- c(rep(genos[1], 98), genos[3], genos[2]) # found by MAF y <- c(rep(genos[1], 99), genos[2]) # found by both z <- sample(genos, 100, replace = TRUE) dat <- df2genind(data.frame(v = v, w = w, x = x, y = y, z = z), sep = "/") informloci(dat) ## Not run: # Ignore MAF informloci(dat, MAF = 0) # Ignore cutoff informloci(dat, cutoff = 0) # Real data data(H3N2) informloci(H3N2) ## End(Not run)
# We will use a dummy data set to demonstrate how this detects uninformative # loci using both MAF and a cutoff. genos <- c("A/A", "A/B", "A/C", "B/B", "B/C", "C/C") v <- sample(genos, 100, replace = TRUE) w <- c(rep(genos[2], 99), genos[3]) # found by cutoff x <- c(rep(genos[1], 98), genos[3], genos[2]) # found by MAF y <- c(rep(genos[1], 99), genos[2]) # found by both z <- sample(genos, 100, replace = TRUE) dat <- df2genind(data.frame(v = v, w = w, x = x, y = y, z = z), sep = "/") informloci(dat) ## Not run: # Ignore MAF informloci(dat, MAF = 0) # Ignore cutoff informloci(dat, cutoff = 0) # Real data data(H3N2) informloci(H3N2) ## End(Not run)
Check for validity of a genclone or snpclone object
is.snpclone(x) is.clone(x) is.genclone(x)
is.snpclone(x) is.clone(x) is.genclone(x)
x |
a genclone or snpclone object |
a genclone object will always be a valid genind object and a snpclone object will always be a valid genlight object.
Zhian N. Kamvar
(sc <- as.snpclone(glSim(100, 1e3, ploid=2, parallel = FALSE), parallel = FALSE, n.cores = 1L)) is.snpclone(sc) is.clone(sc) data(nancycats) nanclone <- as.genclone(nancycats) is.genclone(nanclone)
(sc <- as.snpclone(glSim(100, 1e3, ploid=2, parallel = FALSE), parallel = FALSE, n.cores = 1L)) is.snpclone(sc) is.clone(sc) data(nancycats) nanclone <- as.genclone(nancycats) is.genclone(nanclone)
Create a table of summary statistics per locus.
locus_table( x, index = "simpson", lev = "allele", population = "ALL", information = TRUE )
locus_table( x, index = "simpson", lev = "allele", population = "ALL", information = TRUE )
x |
a adegenet::genind or genclone object. |
index |
Which diversity index to use. Choices are
|
lev |
At what level do you want to analyze diversity? Choices are
|
population |
Select the populations to be analyzed. This is the
parameter |
information |
When |
a table with 4 columns indicating the Number of alleles/genotypes observed, Diversity index chosen, Nei's 1978 gene diversity (expected heterozygosity), and Evenness.
The calculation of Hexp
is where p is the allele
frequencies at a given locus and n is the number of observed alleles (Nei,
1978) in each locus and then returning the average. Caution should be
exercised in interpreting the results of Hexp with polyploid organisms with
ambiguous ploidy. The lack of allelic dosage information will cause rare
alleles to be over-represented and artificially inflate the index. This is
especially true with small sample sizes.
If lev = "genotype"
, then all statistics reflect genotypic diversity
within each locus. This includes the calculation for Hexp
, which turns
into the unbiased Simpson's index.
Zhian N. Kamvar
Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R. Minchin, R. B. O'Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens, and Helene Wagner. vegan: Community Ecology Package, 2012. R package version 2.0-5.
Niklaus J. Gr\"unwald, Stephen B. Goodwin, Michael G. Milgroom, and William E. Fry. Analysis of genotypic diversity data for populations of microorganisms. Phytopathology, 93(6):738-46, 2003
J.A. Ludwig and J.F. Reynolds. Statistical Ecology. A Primer on Methods and Computing. New York USA: John Wiley and Sons, 1988.
E.C. Pielou. Ecological Diversity. Wiley, 1975.
J.A. Stoddart and J.F. Taylor. Genotypic diversity: estimation and prediction in samples. Genetics, 118(4):705-11, 1988.
Masatoshi Nei. Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89(3):583-590, 1978.
Claude Elwood Shannon. A mathematical theory of communication. Bell Systems Technical Journal, 27:379-423,623-656, 1948
data(nancycats) locus_table(nancycats[pop = 5]) ## Not run: # Analyze locus statistics for the North American population of P. infestans. # Note that due to the unknown dosage of alleles, many of these statistics # will be artificially inflated for polyploids. data(Pinf) locus_table(Pinf, population = "North America") ## End(Not run)
data(nancycats) locus_table(nancycats[pop = 5]) ## Not run: # Analyze locus statistics for the North American population of P. infestans. # Note that due to the unknown dosage of alleles, many of these statistics # will be artificially inflated for polyploids. data(Pinf) locus_table(Pinf, population = "North America") ## End(Not run)
Split samples from a genind object into pseudo-haplotypes
make_haplotypes(gid)
make_haplotypes(gid)
gid |
Certain analyses, such as amova work best if within-sample variance (error) can be estimated. Practically, this is performed by splitting the genotypes across all loci to create multiple haplotypes. This way, the within-sample distance can be calculated and incorporated into the model. Please note that the haplotypes generated are based on the order of the unphased alleles in the genind object and do not represent true haplotypes.
Haploid data will be returned un-touched.
a haploid genind object with an extra strata column called "Individual".
The other slot will not be copied over to the new genind object.
poppr.amova()
pegas::amova()
as.genambig()
# Diploid data is doubled ------------------------------------------------- data(nancycats) nan9 <- nancycats[pop = 9] nan9hap <- make_haplotypes(nan9) nan9 # 9 individuals from population 9 nan9hap # 18 haplotypes strata(nan9hap) # strata gains a new column: Individual indNames(nan9hap) # individuals are renamed sequentially # Mix ploidy data can be split, but should be treated with caution -------- # # For example, the Pinf data set contains 86 tetraploid individuals, # but there appear to only be diploids and triploid genotypes. When # we convert to haplotypes, those with all missing data are dropped. data(Pinf) Pinf pmiss <- info_table(Pinf, type = "ploidy", plot = TRUE) # No samples appear to be triploid across all loci. This will cause # several haplotypes to have a lot of missing data. p_haps <- make_haplotypes(Pinf) p_haps head(genind2df(p_haps), n = 20)
# Diploid data is doubled ------------------------------------------------- data(nancycats) nan9 <- nancycats[pop = 9] nan9hap <- make_haplotypes(nan9) nan9 # 9 individuals from population 9 nan9hap # 18 haplotypes strata(nan9hap) # strata gains a new column: Individual indNames(nan9hap) # individuals are renamed sequentially # Mix ploidy data can be split, but should be treated with caution -------- # # For example, the Pinf data set contains 86 tetraploid individuals, # but there appear to only be diploids and triploid genotypes. When # we convert to haplotypes, those with all missing data are dropped. data(Pinf) Pinf pmiss <- info_table(Pinf, type = "ploidy", plot = TRUE) # No samples appear to be triploid across all loci. This will cause # several haplotypes to have a lot of missing data. p_haps <- make_haplotypes(Pinf) p_haps head(genind2df(p_haps), n = 20)
missingno gives the user four options to deal with missing data: remove loci, remove samples, replace with zeroes, or replace with average allele counts.
missingno(pop, type = "loci", cutoff = 0.05, quiet = FALSE, freq = FALSE)
missingno(pop, type = "loci", cutoff = 0.05, quiet = FALSE, freq = FALSE)
pop |
|
type |
a |
cutoff |
|
quiet |
if |
freq |
defaults to |
These methods provide a way to deal with systematic missing data and
to give a wrapper for adegenet
's tab
function.
ALL OF THESE ARE TO BE USED WITH CAUTION.
Using this function with polyploid data (where missing data is coded as "0") may give spurious results.
"ignore"
- does not remove or replace missing data.
"loci"
- removes all loci containing missing data in the entire
data set.
"genotype"
- removes any genotypes/isolates/individuals with
missing data.
"mean"
- replaces all NA's with the mean of the alleles for the
entire data set.
"zero"
or "0"
- replaces all NA's with "0". Introduces
more diversity.
"wild missingno appeared!"
Zhian N. Kamvar
tab
, poppr
, poppr.amova
,
nei.dist
, aboot
data(nancycats) nancy.locina <- missingno(nancycats, type = "loci") ## Found 617 missing values. ## 2 loci contained missing values greater than 5%. ## Removing 2 loci : fca8 fca45 nancy.genona <- missingno(nancycats, type = "geno") ## Found 617 missing values. ## 38 genotypes contained missing values greater than 5%. ## Removing 38 genotypes : N215 N216 N188 N189 N190 N191 N192 N302 N304 N310 ## N195 N197 N198 N199 N200 N201 N206 N182 N184 N186 N298 N299 N300 N301 N303 ## N282 N283 N288 N291 N292 N293 N294 N295 N296 N297 N281 N289 N290 # Replacing all NA with "0" (see tab in the adegenet package). nancy.0 <- missingno(nancycats, type = "0") ## Replaced 617 missing values # Replacing all NA with the mean of each column (see tab in the # adegenet package). nancy.mean <- missingno(nancycats, type = "mean") ## Replaced 617 missing values
data(nancycats) nancy.locina <- missingno(nancycats, type = "loci") ## Found 617 missing values. ## 2 loci contained missing values greater than 5%. ## Removing 2 loci : fca8 fca45 nancy.genona <- missingno(nancycats, type = "geno") ## Found 617 missing values. ## 38 genotypes contained missing values greater than 5%. ## Removing 38 genotypes : N215 N216 N188 N189 N190 N191 N192 N302 N304 N310 ## N195 N197 N198 N199 N200 N201 N206 N182 N184 N186 N298 N299 N300 N301 N303 ## N282 N283 N288 N291 N292 N293 N294 N295 N296 N297 N281 N289 N290 # Replacing all NA with "0" (see tab in the adegenet package). nancy.0 <- missingno(nancycats, type = "0") ## Replaced 617 missing values # Replacing all NA with the mean of each column (see tab in the # adegenet package). nancy.mean <- missingno(nancycats, type = "mean") ## Replaced 617 missing values
Create counts, vectors, and matrices of multilocus genotypes.
mlg(gid, quiet = FALSE) mlg.table( gid, strata = NULL, sublist = "ALL", exclude = NULL, blacklist = NULL, mlgsub = NULL, bar = TRUE, plot = TRUE, total = FALSE, color = FALSE, background = FALSE, quiet = FALSE ) mlg.vector(gid, reset = FALSE) mlg.crosspop( gid, strata = NULL, sublist = "ALL", exclude = NULL, blacklist = NULL, mlgsub = NULL, indexreturn = FALSE, df = FALSE, quiet = FALSE ) mlg.id(gid)
mlg(gid, quiet = FALSE) mlg.table( gid, strata = NULL, sublist = "ALL", exclude = NULL, blacklist = NULL, mlgsub = NULL, bar = TRUE, plot = TRUE, total = FALSE, color = FALSE, background = FALSE, quiet = FALSE ) mlg.vector(gid, reset = FALSE) mlg.crosspop( gid, strata = NULL, sublist = "ALL", exclude = NULL, blacklist = NULL, mlgsub = NULL, indexreturn = FALSE, df = FALSE, quiet = FALSE ) mlg.id(gid)
gid |
a adegenet::genind, genclone, adegenet::genlight, or snpclone object. |
quiet |
|
strata |
a formula specifying the strata at which computation is to be performed. |
sublist |
a |
exclude |
a |
blacklist |
DEPRECATED, use exclude. |
mlgsub |
a |
bar |
deprecated. Same as |
plot |
|
total |
|
color |
an option to display a single barchart for mlg.table, colored by population (note, this becomes facetted if 'background = TRUE'). |
background |
an option to display the the total number of MLGs across populations per facet in the background of the plot. |
reset |
logical. For genclone objects, the MLGs are defined by the input data, but they do not change if more or less information is added (i.e. loci are dropped). Setting 'reset = TRUE' will recalculate MLGs. Default is 'FALSE', returning the MLGs defined in the @mlg slot. |
indexreturn |
|
df |
|
Multilocus genotypes are the unique combination of alleles across
all loci. For details of how these are calculated see vignette("mlg", package = "poppr")
. In short, for genind and genclone objects, they are
calculated by using a rank function on strings of alleles, which is
sensitive to missing data. For genlight and snpclone objects, they are
calculated with distance methods via bitwise.dist and
mlg.filter, which means that these are insensitive to missing
data. Three different types of MLGs can be defined in poppr:
original the default definition of multilocus genotypes as detailed above
contracted these are multilocus genotypes collapsed into multilocus lineages (mll) with genetic distance via mlg.filter
custom user-defined multilocus genotypes. These are useful for information such as mycelial compatibility groups
All of the functions documented here will work on any of the MLG types defined in poppr
an integer describing the number of multilocus genotypes observed.
a matrix with columns indicating unique multilocus genotypes and rows indicating populations. This table can be used with the funciton diversity_stats to calculate the Shannon-Weaver index (H), Stoddart and Taylor's index (aka inverse Simpson's index; G), Simpson's index (lambda), and evenness (E5).
a numeric vector naming the multilocus genotype of each individual in the dataset.
default a list
where each element contains a named integer
vector representing the number of individuals represented from each
population in that MLG
indexreturn = TRUE
a vector
of integers defining the multilocus
genotypes that have individuals crossing populations
df = TRUE
A long form data frame with the columns: MLG, Population,
Count. Useful for graphing with ggplot2
a list of multilocus genotypes with the associated individual names per MLG.
The resulting matrix of 'mlg.table' can be used for analysis with the vegan package.
mlg.vector will recalculate the mlg vector for [adegenet::genind] objects and will return the contents of the mlg slot in [genclone][genclone-class] objects. This means that MLGs will be different for subsetted [adegenet::genind] objects.
Zhian N. Kamvar
vegan::diversity()
diversity_stats
popsub
mll
mlg.filter
mll.custom
# Load the data set data(Aeut) # Investigate the number of multilocus genotypes. amlg <- mlg(Aeut) amlg # 119 # show the multilocus genotype vector avec <- mlg.vector(Aeut) avec # Get a table atab <- mlg.table(Aeut, color = TRUE) atab # See where multilocus genotypes cross populations acrs <- mlg.crosspop(Aeut) # MLG.59: (2 inds) Athena Mt. Vernon # See which individuals belong to each MLG aid <- mlg.id(Aeut) aid["59"] # individuals 159 and 57 ## Not run: # For the mlg.table, you can also choose to display the number of MLGs across # populations in the background mlg.table(Aeut, background = TRUE) mlg.table(Aeut, background = TRUE, color = TRUE) # A simple example. 10 individuals, 5 genotypes. mat1 <- matrix(ncol=5, 25:1) mat1 <- rbind(mat1, mat1) mat <- matrix(nrow=10, ncol=5, paste(mat1,mat1,sep="/")) mat.gid <- df2genind(mat, sep="/") mlg(mat.gid) mlg.vector(mat.gid) mlg.table(mat.gid) # Now for a more complicated example. # Data set of 1903 samples of the H3N2 flu virus genotyped at 125 SNP loci. data(H3N2) mlg(H3N2, quiet = FALSE) H.vec <- mlg.vector(H3N2) # Changing the population vector to indicate the years of each epidemic. pop(H3N2) <- other(H3N2)$x$country H.tab <- mlg.table(H3N2, plot = FALSE, total = TRUE) # Show which genotypes exist accross populations in the entire dataset. res <- mlg.crosspop(H3N2, quiet = FALSE) # Let's say we want to visualize the multilocus genotype distribution for the # USA and Russia mlg.table(H3N2, sublist = c("USA", "Russia"), bar=TRUE) # An exercise in subsetting the output of mlg.table and mlg.vector. # First, get the indices of each MLG duplicated across populations. inds <- mlg.crosspop(H3N2, quiet = FALSE, indexreturn = TRUE) # Since the columns of the table from mlg.table are equal to the number of # MLGs, we can subset with just the columns. H.sub <- H.tab[, inds] # We can also do the same by using the mlgsub flag. H.sub <- mlg.table(H3N2, mlgsub = inds) # We can subset the original data set using the output of mlg.vector to # analyze only the MLGs that are duplicated across populations. new.H <- H3N2[H.vec %in% inds, ] ## End(Not run)
# Load the data set data(Aeut) # Investigate the number of multilocus genotypes. amlg <- mlg(Aeut) amlg # 119 # show the multilocus genotype vector avec <- mlg.vector(Aeut) avec # Get a table atab <- mlg.table(Aeut, color = TRUE) atab # See where multilocus genotypes cross populations acrs <- mlg.crosspop(Aeut) # MLG.59: (2 inds) Athena Mt. Vernon # See which individuals belong to each MLG aid <- mlg.id(Aeut) aid["59"] # individuals 159 and 57 ## Not run: # For the mlg.table, you can also choose to display the number of MLGs across # populations in the background mlg.table(Aeut, background = TRUE) mlg.table(Aeut, background = TRUE, color = TRUE) # A simple example. 10 individuals, 5 genotypes. mat1 <- matrix(ncol=5, 25:1) mat1 <- rbind(mat1, mat1) mat <- matrix(nrow=10, ncol=5, paste(mat1,mat1,sep="/")) mat.gid <- df2genind(mat, sep="/") mlg(mat.gid) mlg.vector(mat.gid) mlg.table(mat.gid) # Now for a more complicated example. # Data set of 1903 samples of the H3N2 flu virus genotyped at 125 SNP loci. data(H3N2) mlg(H3N2, quiet = FALSE) H.vec <- mlg.vector(H3N2) # Changing the population vector to indicate the years of each epidemic. pop(H3N2) <- other(H3N2)$x$country H.tab <- mlg.table(H3N2, plot = FALSE, total = TRUE) # Show which genotypes exist accross populations in the entire dataset. res <- mlg.crosspop(H3N2, quiet = FALSE) # Let's say we want to visualize the multilocus genotype distribution for the # USA and Russia mlg.table(H3N2, sublist = c("USA", "Russia"), bar=TRUE) # An exercise in subsetting the output of mlg.table and mlg.vector. # First, get the indices of each MLG duplicated across populations. inds <- mlg.crosspop(H3N2, quiet = FALSE, indexreturn = TRUE) # Since the columns of the table from mlg.table are equal to the number of # MLGs, we can subset with just the columns. H.sub <- H.tab[, inds] # We can also do the same by using the mlgsub flag. H.sub <- mlg.table(H3N2, mlgsub = inds) # We can subset the original data set using the output of mlg.vector to # analyze only the MLGs that are duplicated across populations. new.H <- H3N2[H.vec %in% inds, ] ## End(Not run)
Multilocus genotypes are initially defined by naive string matching, but this definition does not take into account missing data or genotyping error, casting these as unique genotypes. Defining multilocus genotypes by genetic distance allows you to incorporate genotypes that have missing data o genotyping error into their parent clusters.
mlg.filter( pop, threshold = 0, missing = "asis", memory = FALSE, algorithm = "farthest_neighbor", distance = "diss.dist", threads = 1L, stats = "MLGs", ... ) mlg.filter( pop, missing = "asis", memory = FALSE, algorithm = "farthest_neighbor", distance = "diss.dist", threads = 1L, ... ) <- value
mlg.filter( pop, threshold = 0, missing = "asis", memory = FALSE, algorithm = "farthest_neighbor", distance = "diss.dist", threads = 1L, stats = "MLGs", ... ) mlg.filter( pop, missing = "asis", memory = FALSE, algorithm = "farthest_neighbor", distance = "diss.dist", threads = 1L, ... ) <- value
pop |
|
threshold |
a number indicating the minimum distance two MLGs must be separated by to be considered different. Defaults to 0, which will reflect the original (naive) MLG definition. |
missing |
any method to be used by |
memory |
whether this function should remember the last distance matrix it generated. TRUE will attempt to reuse the last distance matrix if the other parameters are the same. (default) FALSE will ignore any stored matrices and not store any it generates. |
algorithm |
determines the type of clustering to be done.
|
distance |
a character or function defining the distance to be applied
to pop. Defaults to |
threads |
(unused) Previously, this was the maximum number of parallel threads to be used within this function. Default is 1 indicating that this function will run serially. Any other number will result in a warning. |
stats |
a character vector specifying which statistics should be returned (details below). Choices are "MLG", "THRESHOLDS", "DISTANCES", "SIZES", or "ALL". If choosing "ALL" or more than one, a named list will be returned. |
... |
any parameters to be passed off to the distance method. |
value |
the threshold at which genotypes should be collapsed. |
This function will take in any distance matrix or function and collapse multilocus genotypes below a given threshold. If you use this function as the assignment method (mlg.filter(myData, distance = myDist) <- 0.5), the distance function or matrix will be remembered by the object. This means that if you define your own distance matrix or function, you must keep it in memory to further utilize mlg.filter.
Default, a vector of collapsed multilocus genotypes. Otherwise, any combination of the following:
a numeric vector defining the multilocus genotype cluster of each individual in the dataset. Each genotype cluster is separated from every other genotype cluster by at least the defined threshold value, as calculated by the selected algorithm.
A numeric vector representing the thresholds beyond which clusters of multilocus genotypes were collapsed.
A square matrix representing the distances between each cluster.
The sizes of the multilocus genotype clusters in order.
mlg.vector
makes use of mlg.vector
grouping prior to
applying the given threshold. Genotype numbers returned by
mlg.vector
represent the lowest numbered genotype (as returned by
mlg.vector
) in in each new multilocus genotype. Therefore
mlg.filter
and mlg.vector
return the same vector when
threshold is set to 0 or less.
filter_stats
,
cutoff_predictor
,
mll
,
genclone
,
snpclone
,
diss.dist
,
bruvo.dist
data(partial_clone) pc <- as.genclone(partial_clone, threads = 1L) # convert to genclone object # Basic Use --------------------------------------------------------------- # Show MLGs at threshold 0.05 mlg.filter(pc, threshold = 0.05, distance = "nei.dist", threads = 1L) pc # 26 mlgs # Set MLGs at threshold 0.05 mlg.filter(pc, distance = "nei.dist", threads = 1L) <- 0.05 pc # 25 mlgs ## Not run: # The distance definition is persistant mlg.filter(pc) <- 0.1 pc # 24 mlgs # But you can still change the definition mlg.filter(pc, distance = "diss.dist", percent = TRUE) <- 0.1 pc # Choosing a threshold ---------------------------------------------------- # Thresholds for collapsing multilocus genotypes should not be arbitrary. It # is important to consider what threshold is suitable. One method of choosing # a threshold is to find a gap in the distance distribution that represents # clonal groups. You can look at this by analyzing the distribution of all # possible thresholds with the function "cutoff_predictor". # For this example, we'll use Bruvo's distance to predict the cutoff for # P. infestans. data(Pinf) Pinf # Repeat lengths are necessary for Bruvo's distance (pinfreps <- fix_replen(Pinf, c(2, 2, 6, 2, 2, 2, 2, 2, 3, 3, 2))) # Now we can collect information of the thresholds. We can set threshold = 1 # because we know that this will capture the maximum possible distance: (thresholds <- mlg.filter(Pinf, distance = bruvo.dist, stats = "THRESHOLDS", replen = pinfreps, threshold = 1)) # We can use these thresholds to find an appropriate cutoff (pcut <- cutoff_predictor(thresholds)) mlg.filter(Pinf, distance = bruvo.dist, replen = pinfreps) <- pcut Pinf # This can also be visualized with the "filter_stats" function. # Special case: threshold = 0 --------------------------------------------- # It's important to remember that a threshold of 0 is equal to the original # MLG definition. This example will show a data set that contains genotypes # with missing data that share all alleles with other genotypes except for # the missing one. data(monpop) monpop # 264 mlg mlg.filter(monpop) <- 0 nmll(monpop) # 264 mlg # In order to merge these genotypes with missing data, we should set the # threshold to be slightly higher than 0. We will use the smallest fraction # the computer can store. mlg.filter(monpop) <- .Machine$double.eps ^ 0.5 nmll(monpop) # 236 mlg # Custom distance --------------------------------------------------------- # Custom genetic distances can be used either in functions from other # packages or user-defined functions data(Pinf) Pinf mlg.filter(Pinf, distance = function(x) dist(tab(x))) <- 3 Pinf mlg.filter(Pinf) <- 4 Pinf # genlight / snpclone objects --------------------------------------------- set.seed(999) gc <- as.snpclone(glSim(100, 0, n.snp.struc = 1e3, ploidy = 2)) gc # 100 mlgs mlg.filter(gc) <- 0.25 gc # 82 mlgs ## End(Not run)
data(partial_clone) pc <- as.genclone(partial_clone, threads = 1L) # convert to genclone object # Basic Use --------------------------------------------------------------- # Show MLGs at threshold 0.05 mlg.filter(pc, threshold = 0.05, distance = "nei.dist", threads = 1L) pc # 26 mlgs # Set MLGs at threshold 0.05 mlg.filter(pc, distance = "nei.dist", threads = 1L) <- 0.05 pc # 25 mlgs ## Not run: # The distance definition is persistant mlg.filter(pc) <- 0.1 pc # 24 mlgs # But you can still change the definition mlg.filter(pc, distance = "diss.dist", percent = TRUE) <- 0.1 pc # Choosing a threshold ---------------------------------------------------- # Thresholds for collapsing multilocus genotypes should not be arbitrary. It # is important to consider what threshold is suitable. One method of choosing # a threshold is to find a gap in the distance distribution that represents # clonal groups. You can look at this by analyzing the distribution of all # possible thresholds with the function "cutoff_predictor". # For this example, we'll use Bruvo's distance to predict the cutoff for # P. infestans. data(Pinf) Pinf # Repeat lengths are necessary for Bruvo's distance (pinfreps <- fix_replen(Pinf, c(2, 2, 6, 2, 2, 2, 2, 2, 3, 3, 2))) # Now we can collect information of the thresholds. We can set threshold = 1 # because we know that this will capture the maximum possible distance: (thresholds <- mlg.filter(Pinf, distance = bruvo.dist, stats = "THRESHOLDS", replen = pinfreps, threshold = 1)) # We can use these thresholds to find an appropriate cutoff (pcut <- cutoff_predictor(thresholds)) mlg.filter(Pinf, distance = bruvo.dist, replen = pinfreps) <- pcut Pinf # This can also be visualized with the "filter_stats" function. # Special case: threshold = 0 --------------------------------------------- # It's important to remember that a threshold of 0 is equal to the original # MLG definition. This example will show a data set that contains genotypes # with missing data that share all alleles with other genotypes except for # the missing one. data(monpop) monpop # 264 mlg mlg.filter(monpop) <- 0 nmll(monpop) # 264 mlg # In order to merge these genotypes with missing data, we should set the # threshold to be slightly higher than 0. We will use the smallest fraction # the computer can store. mlg.filter(monpop) <- .Machine$double.eps ^ 0.5 nmll(monpop) # 236 mlg # Custom distance --------------------------------------------------------- # Custom genetic distances can be used either in functions from other # packages or user-defined functions data(Pinf) Pinf mlg.filter(Pinf, distance = function(x) dist(tab(x))) <- 3 Pinf mlg.filter(Pinf) <- 4 Pinf # genlight / snpclone objects --------------------------------------------- set.seed(999) gc <- as.snpclone(glSim(100, 0, n.snp.struc = 1e3, ploidy = 2)) gc # 100 mlgs mlg.filter(gc) <- 0.25 gc # 82 mlgs ## End(Not run)
The following methods allow the user to access and manipulate multilocus lineages in genclone or snpclone objects.
mll(x, type = NULL) nmll(x, type = NULL) mll(x) <- value
mll(x, type = NULL) nmll(x, type = NULL) mll(x) <- value
x |
|
type |
a character specifying "original", "contracted", or "custom" defining they type of mlgs to return. Defaults to what is set in the object. |
value |
a character specifying which mlg type is visible in the object. See details. |
genclone and snpclone objects have a slot for an internal class of object called MLG. This class allows the storage of flexible mll definitions:
"original" - naive mlgs defined by string comparison. This is default.
"contracted" - mlgs defined by a genetic distance threshold.
"custom" - user-defined MLGs
an object of the same type as x.
Zhian N. Kamvar
data(partial_clone) pc <- as.genclone(partial_clone) mll(pc) mll(pc) <- "custom" mll(pc) mll.levels(pc) <- LETTERS mll(pc)
data(partial_clone) pc <- as.genclone(partial_clone) mll(pc) mll(pc) <- "custom" mll(pc) mll.levels(pc) <- LETTERS mll(pc)
This function will allow you to define custom multilocus lineages for your data set.
mll.custom(x, set = TRUE, value) mll.custom(x, set = TRUE) <- value mll.levels(x, set = TRUE, value) mll.levels(x, set = TRUE) <- value
mll.custom(x, set = TRUE, value) mll.custom(x, set = TRUE) <- value mll.levels(x, set = TRUE, value) mll.levels(x, set = TRUE) <- value
x |
|
set |
logical. If |
value |
a vector that defines the multilocus lineages for your data. This can be a vector of ANYTHING that can be turned into a factor. |
an object of the same type as x
Zhian N. Kamvar
data(partial_clone) pc <- as.genclone(partial_clone) mll.custom(pc) <- LETTERS[mll(pc)] mll(pc) # Let's say we had a mistake and the A mlg was actually B. mll.levels(pc)[mll.levels(pc) == "A"] <- "B" mll(pc) # Set the MLL back to the original definition. mll(pc) <- "original" mll(pc)
data(partial_clone) pc <- as.genclone(partial_clone) mll.custom(pc) <- LETTERS[mll(pc)] mll(pc) # Let's say we had a mistake and the A mlg was actually B. mll.levels(pc)[mll.levels(pc) == "A"] <- "B" mll(pc) # Set the MLL back to the original definition. mll(pc) <- "original" mll(pc)
This function will allow you to reset multilocus lineages for your data set.
mll.reset(x, value)
mll.reset(x, value)
x |
|
value |
a character vector that specifies which levels you wish to be reset. |
an object of the same type as x
This method has no assignment method. If "original" is not contained in "value", it is assumed that the "original" definition will be used to reset the MLGs.
Zhian N. Kamvar
# This data set was a subset of a larger data set, so the multilocus # genotypes are not all sequential data(Pinf) mll(Pinf) <- "original" mll(Pinf) # If we use mll.reset, then it will become sequential Pinf.new <- mll.reset(Pinf, TRUE) # reset all mll(Pinf.new) ## Not run: # It is possible to reset only specific mll definitions. For example, let's # say that we wanted to filter our multilocus genotypes by nei's distance mlg.filter(Pinf, dist = nei.dist, missing = "mean") <- 0.02 # And we wanted to set those as custom genotypes, mll.custom(Pinf) <- mll(Pinf, "contracted") mll.levels(Pinf) <- .genlab("MLG", nmll(Pinf, "custom")) # We could reset just the original and the filtered if we wanted to and keep # the custom as it were. Pinf.new <- mll.reset(Pinf, c("original", "contracted")) mll(Pinf.new, "original") mll(Pinf.new, "contracted") mll(Pinf.new, "custom") # If "original" is not one of the values, then that is used as a baseline. Pinf.orig <- mll.reset(Pinf, "contracted") mll(Pinf.orig, "contracted") mll(Pinf.new, "contracted") ## End(Not run)
# This data set was a subset of a larger data set, so the multilocus # genotypes are not all sequential data(Pinf) mll(Pinf) <- "original" mll(Pinf) # If we use mll.reset, then it will become sequential Pinf.new <- mll.reset(Pinf, TRUE) # reset all mll(Pinf.new) ## Not run: # It is possible to reset only specific mll definitions. For example, let's # say that we wanted to filter our multilocus genotypes by nei's distance mlg.filter(Pinf, dist = nei.dist, missing = "mean") <- 0.02 # And we wanted to set those as custom genotypes, mll.custom(Pinf) <- mll(Pinf, "contracted") mll.levels(Pinf) <- .genlab("MLG", nmll(Pinf, "custom")) # We could reset just the original and the filtered if we wanted to and keep # the custom as it were. Pinf.new <- mll.reset(Pinf, c("original", "contracted")) mll(Pinf.new, "original") mll(Pinf.new, "contracted") mll(Pinf.new, "custom") # If "original" is not one of the values, then that is used as a baseline. Pinf.orig <- mll.reset(Pinf, "contracted") mll(Pinf.orig, "contracted") mll(Pinf.new, "contracted") ## End(Not run)
This is microsatellite data for a population of the haploid plant pathogen *Monilinia fructicola* that causes disease within peach tree canopies (Everhart & Scherm, 2014). Entire populations within trees were sampled across 3 years (2009, 2010, and 2011) in a total of four trees, where one tree was sampled in all three years, for a total of 6 within-tree populations. Within each year, samples in the spring were taken from affected blossoms (termed "BB" for blossom blight) and in late summer from affected fruits (termed "FR" for fruit rot). There are a total of 694 isolates with 65 to 173 isolates within each canopy population that were characterized using a set of 13 microsatellite markers.
data(monpop)
data(monpop)
a [genclone-class] object with 3 hierarchical levels coded into one population factor. These are named "Tree", "Year", and "Symptom"
SE Everhart, H Scherm, (2015) Fine-scale genetic structure of *Monilinia fructicola* during brown rot epidemics within individual peach tree canopies. Phytopathology 105:542-549 doi: doi:10.1094/PHYTO-03-14-0088-R
data(monpop) splitStrata(monpop) <- ~Tree/Year/Symptom setPop(monpop) <- ~Symptom/Year monpop
data(monpop) splitStrata(monpop) <- ~Tree/Year/Symptom setPop(monpop) <- ~Symptom/Year monpop
These functions are modified from the function dist.genpop to be applicable for distances between individuals.
nei.dist(x, warning = TRUE) edwards.dist(x) rogers.dist(x) reynolds.dist(x) provesti.dist(x) prevosti.dist
nei.dist(x, warning = TRUE) edwards.dist(x) rogers.dist(x) reynolds.dist(x) provesti.dist(x) prevosti.dist
x |
|
warning |
If |
An object of class function
of length 1.
It is important to be careful with the interpretation of these distances as they were originally intended for calculation of between-population distance. As Nei's distance is the negative log of 0:1, this means that it is very possible to obtain distances of infinity. When this happens, infinite values are corrected to be 10 * max(D) where D is the distance matrix without infinite values.
an object of class dist with the same number of observations as the number of individuals in your data.
Prevosti's distance is identical to diss.dist
, except
that diss.dist
is optimized for a larger number of
individuals (n > 125) at the cost of required memory. Both
prevosti.dist
and provesti.dist
are the same function,
provesti.dist
is a spelling error and exists for backwards
compatibility.
These distances were adapted from the adegenet function
dist.genpop
to work with genind
objects.
Zhian N. Kamvar (poppr adaptation) Thibaut Jombart (adegenet adaptation) Daniel Chessel (ade4)
Nei, M. (1972) Genetic distances between populations. American Naturalist, 106, 283-292.
Nei M. (1978) Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 23, 341-369.
Avise, J. C. (1994) Molecular markers, natural history and evolution. Chapman & Hall, London.
Edwards, A.W.F. (1971) Distance between populations on the basis of gene frequencies. Biometrics, 27, 873-881.
Cavalli-Sforza L.L. and Edwards A.W.F. (1967) Phylogenetic analysis: models and estimation procedures. Evolution, 32, 550-570.
Hartl, D.L. and Clark, A.G. (1989) Principles of population genetics. Sinauer Associates, Sunderland, Massachussetts (p. 303).
Reynolds, J. B., B. S. Weir, and C. C. Cockerham. (1983) Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics, 105, 767-779.
Rogers, J.S. (1972) Measures of genetic similarity and genetic distances. Studies in Genetics, Univ. Texas Publ., 7213, 145-153.
Avise, J. C. (1994) Molecular markers, natural history and evolution. Chapman & Hall, London.
Prevosti A. (1974) La distancia genetica entre poblaciones. Miscellanea Alcobe, 68, 109-118.
Prevosti A., Ocana J. and Alonso G. (1975) Distances between populations of Drosophila subobscura, based on chromosome arrangements frequencies. Theoretical and Applied Genetics, 45, 231-241.
For more information on dissimilarity indexes:
Gower J. and Legendre P. (1986) Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3, 5-48
Legendre P. and Legendre L. (1998) Numerical Ecology, Elsevier Science B.V. 20, pp274-288.
data(nancycats) (nan9 <- popsub(nancycats, 9)) (neinan <- nei.dist(nan9)) (ednan <- edwards.dist(nan9)) (rodnan <- rogers.dist(nan9)) (reynan <- reynolds.dist(nan9)) (pronan <- prevosti.dist(nan9))
data(nancycats) (nan9 <- popsub(nancycats, 9)) (neinan <- nei.dist(nan9)) (ednan <- edwards.dist(nan9)) (rodnan <- rogers.dist(nan9)) (reynan <- reynolds.dist(nan9)) (pronan <- prevosti.dist(nan9))
Convert an old genclone object to a new genclone object
old2new_genclone(object, donor = new(class(object)))
old2new_genclone(object, donor = new(class(object)))
object |
a genclone object from poppr v. 1.1 |
donor |
a new genclone object from poppr v. 2.0 |
Zhian N. Kamvar
These data were simulated using SimuPOP version 1.0.8 with 99.9% clonal reproduction over 10,000 generations. Populations were assigned post-hoc and are simply present for the purposes of demonstrating a minimum spanning network with Bruvo's distance.
data(partial_clone)
data(partial_clone)
a [genind()] object with 50 individuals, 10 loci, and four populations.
Bo Peng and Christopher Amos (2008) Forward-time simulations of nonrandom mating populations using simuPOP. *bioinformatics*, 24 (11): 1408-1409.
Calculate the probability of genotypes based on the product of allele frequencies over all loci.
pgen(gid, pop = NULL, by_pop = TRUE, log = TRUE, freq = NULL, ...)
pgen(gid, pop = NULL, by_pop = TRUE, log = TRUE, freq = NULL, ...)
gid |
a genind or genclone object. |
pop |
either a formula to set the population factor from the
|
by_pop |
When this is |
log |
a |
freq |
a vector or matrix of allele frequencies. This defaults to
|
... |
options from correcting rare alleles. The default is to correct allele frequencies to 1/n |
Pgen is the probability of a given genotype occuring in a population assuming HWE. Thus, the value for diploids is
where are the allele frequencies and h is the count of the
number of heterozygous sites in the sample (Arnaud-Haond et al. 2007; Parks
and Werth, 1993). The allele frequencies, by default, are calculated using
a round-robin approach where allele frequencies at a particular locus are
calculated on the clone-censored genotypes without that locus.
To avoid issues with numerical precision of small numbers, this function calculates pgen per locus by adding up log-transformed values of allele frequencies. These can easily be transformed to return the true value (see examples).
A vector containing Pgen values per locus for each genotype in the object.
For haploids, Pgen at a particular locus is the allele frequency. This
function cannot handle polyploids. Additionally, when the argument
pop
is not NULL
, by_pop
is automatically TRUE
.
Zhian N. Kamvar, Jonah Brooks, Stacy A. Krueger-Hadfield, Erik Sotka
Arnaud-Haond, S., Duarte, C. M., Alberto, F., & Serrão, E. A. 2007. Standardizing methods to address clonality in population studies. Molecular Ecology, 16(24), 5115-5139.
Parks, J. C., & Werth, C. R. 1993. A study of spatial features of clones in a population of bracken fern, Pteridium aquilinum (Dennstaedtiaceae). American Journal of Botany, 537-544.
psex
, rraf
, rrmlg
,
rare_allele_correction
data(Pram) head(pgen(Pram, log = FALSE)) ## Not run: # You can also supply the observed allele frequencies pramfreq <- Pram %>% genind2genpop() %>% tab(freq = TRUE) head(pgen(Pram, log = FALSE, freq = pramfreq)) # You can get the Pgen values over all loci by summing over the logged results: pgen(Pram, log = TRUE) %>% # calculate pgen matrix rowSums(na.rm = TRUE) %>% # take the sum of each row exp() # take the exponent of the results # You can also take the product of the non-logged results: apply(pgen(Pram, log = FALSE), 1, prod, na.rm = TRUE) ## Rare Allele Correction --------------------------------------------------- ## # If you don't supply a table of frequencies, they are calculated with rraf # with correction = TRUE. This is normally benign when analyzing large # populations, but it can have a great effect on small populations. To help # control this, you can supply arguments described in # help("rare_allele_correction"). # Default is to correct by 1/n per population. Since the calculation is # performed on a smaller sample size due to round robin clone correction, it # would be more appropriate to correct by 1/rrmlg at each locus. This is # acheived by setting d = "rrmlg". Since this is a diploid, we would want to # account for the number of chromosomes, and so we set mul = 1/2 head(pgen(Pram, log = FALSE, d = "rrmlg", mul = 1/2)) # compare with the output above # If you wanted to treat all alleles as equally rare, then you would set a # specific value (let's say the rare alleles are 1/100): head(pgen(Pram, log = FALSE, e = 1/100)) ## End(Not run)
data(Pram) head(pgen(Pram, log = FALSE)) ## Not run: # You can also supply the observed allele frequencies pramfreq <- Pram %>% genind2genpop() %>% tab(freq = TRUE) head(pgen(Pram, log = FALSE, freq = pramfreq)) # You can get the Pgen values over all loci by summing over the logged results: pgen(Pram, log = TRUE) %>% # calculate pgen matrix rowSums(na.rm = TRUE) %>% # take the sum of each row exp() # take the exponent of the results # You can also take the product of the non-logged results: apply(pgen(Pram, log = FALSE), 1, prod, na.rm = TRUE) ## Rare Allele Correction --------------------------------------------------- ## # If you don't supply a table of frequencies, they are calculated with rraf # with correction = TRUE. This is normally benign when analyzing large # populations, but it can have a great effect on small populations. To help # control this, you can supply arguments described in # help("rare_allele_correction"). # Default is to correct by 1/n per population. Since the calculation is # performed on a smaller sample size due to round robin clone correction, it # would be more appropriate to correct by 1/rrmlg at each locus. This is # acheived by setting d = "rrmlg". Since this is a diploid, we would want to # account for the number of chromosomes, and so we set mul = 1/2 head(pgen(Pram, log = FALSE, d = "rrmlg", mul = 1/2)) # compare with the output above # If you wanted to treat all alleles as equally rare, then you would set a # specific value (let's say the rare alleles are 1/100): head(pgen(Pram, log = FALSE, e = 1/100)) ## End(Not run)
The Pinf data set contains 86 isolates genotyped over 11 microsatellite loci collected from Mexico, Peru, Columbia, and Ecuador. This is a subset of the data used for the reference below.
data(Pinf)
data(Pinf)
a [genclone-class] object with 2 hierarchical levels called "Continent" and "Country" that contain 2 and 4 populations, respectively.
Goss, Erica M., Javier F. Tabima, David EL Cooke, Silvia Restrepo, William E. Fry, Gregory A. Forbes, Valerie J. Fieland, Martha Cardenas, and Niklaus J. Grünwald. "The Irish potato famine pathogen *Phytophthora infestans* originated in central Mexico rather than the Andes." Proceedings of the National Academy of Sciences 111:8791-8796. doi: doi:10.1073/pnas.1401884111
This function allows you to take the output of poppr.msn and bruvo.msn and customize the plot by labeling groups of individuals, size of nodes, and adjusting the palette and scale bar.
plot_poppr_msn( x, poppr_msn, gscale = TRUE, gadj = 3, mlg.compute = "original", glim = c(0, 0.8), gweight = 1, wscale = TRUE, nodescale = 10, nodebase = NULL, nodelab = 2, inds = "ALL", mlg = FALSE, quantiles = TRUE, cutoff = NULL, palette = NULL, layfun = layout.auto, beforecut = FALSE, pop.leg = TRUE, size.leg = TRUE, scale.leg = TRUE, ... )
plot_poppr_msn( x, poppr_msn, gscale = TRUE, gadj = 3, mlg.compute = "original", glim = c(0, 0.8), gweight = 1, wscale = TRUE, nodescale = 10, nodebase = NULL, nodelab = 2, inds = "ALL", mlg = FALSE, quantiles = TRUE, cutoff = NULL, palette = NULL, layfun = layout.auto, beforecut = FALSE, pop.leg = TRUE, size.leg = TRUE, scale.leg = TRUE, ... )
x |
a |
poppr_msn |
a |
gscale |
"grey scale". If this is |
gadj |
"grey adjust". a positive |
mlg.compute |
if the multilocus genotypes are set to "custom" (see
|
glim |
"grey limit". Two numbers between zero and one. They determine
the upper and lower limits for the |
gweight |
"grey weight". an |
wscale |
"width scale". If this is |
nodescale |
a |
nodebase |
deprecated a |
nodelab |
an |
inds |
a |
mlg |
|
quantiles |
|
cutoff |
a number indicating the longest distance to display in your graph. This is performed by removing edges with weights greater than this number. |
palette |
a function or character corresponding to a specific palette you want to use to delimit your populations. The default is whatever palette was used to produce the original graph. |
layfun |
a function specifying the layout of nodes in your graph. It
defaults to |
beforecut |
if |
pop.leg |
if |
size.leg |
if |
scale.leg |
if |
... |
any other parameters to be passed on to
|
The previous incarnation of msn plotting in poppr simply plotted the minimum spanning network with the legend of populations, but did not provide a scale bar and it did not provide the user a simple way of manipulating the layout or labels. This function allows the user to manipulate many facets of graph creation, making the creation of minimum spanning networks ever so slightly more user friendly.
This function must have both the source data and the output msn to work. The source data must contain the same population structure as the graph. Every other parameter has a default setting.
inds
By default, the graph will label each node (circle) with
all of the samples (individuals) that are contained within that node. As
each node represents a single multilocus genotype (MLG) or individuals (n
>= 1), this argument is designed to allow you to selectively label the
nodes based on query of sample name or MLG number. If the option mlg
= TRUE
, the multilocus genotype assignment will be used to label the node.
If you do not want to label the nodes by individual or multilocus genotype,
simply set this to a name that doesn't exist in your data.
nodescale
The nodes (circles) on the graph represent different
multilocus genotypes. The area of the nodes represent the number of
individuals. Setting nodescale will scale the area of the nodes.
nodelab
If a node is not labeled by individual, this will
label the size of the nodes greater than or equal to this value. If you
don't want to label the size of the nodes, simply set this to a very high
number.
cutoff
This is useful for when you want to investigate groups
of multilocus genotypes separated by a specific distance or if you have two
distinct populations and you want to physically separate them in your
network.
beforecut
This is an indicator useful if you want to maintain
the same position of the nodes before and after removing edges with the
cutoff
argument. This works best if you set a seed before you run
the function.
Each node on the graph represents a different multilocus genotype.
The edges on the graph represent genetic distances that connect the
multilocus genotypes. In genclone objects, it is possible to set the
multilocus genotypes to a custom definition. This creates a problem for
clone correction, however, as it is very possible to define custom lineages
that are not monophyletic. When clone correction is performed on these
definitions, information is lost from the graph. To circumvent this, The
clone correction will be done via the computed multilocus genotypes, either
"original" or "contracted". This is specified in the mlg.compute
argument, above.
To avoid drawing the legend over the graph, legends
are separated by different plotting areas. This means that if legends are
included, you cannot plot multiple MSNs in a single plot. The scale bar (to
be added in manually) can be obtained from greycurve
and the
legend can be plotted with legend
.
the modified msn list, invisibly.
Zhian N. Kamvar
layout.auto
plot.igraph
poppr.msn
bruvo.msn
greycurve
delete_edges
palette
# Using a data set of the Aphanomyces eutieches root rot pathogen. data(Aeut) adist <- diss.dist(Aeut, percent = TRUE) amsn <- poppr.msn(Aeut, adist, showplot = FALSE) # Default library("igraph") # To get all the layouts. set.seed(500) plot_poppr_msn(Aeut, amsn, gadj = 15) ## Not run: # Different layouts (from igraph) can be used by supplying the function name. set.seed(500) plot_poppr_msn(Aeut, amsn, gadj = 15, layfun = layout_with_kk) # Removing link between populations (cutoff = 0.2) and labelling no individuals set.seed(500) plot_poppr_msn(Aeut, amsn, inds = "none", gadj = 15, beforecut = TRUE, cutoff = 0.2) # Labelling individual #57 because it is an MLG that crosses populations # Showing clusters of MLGS with at most 5% variation # Notice that the Mt. Vernon population appears to be more clonal set.seed(50) plot_poppr_msn(Aeut, amsn, gadj = 15, cutoff = 0.05, inds = "057") data(partial_clone) pcmsn <- bruvo.msn(partial_clone, replen = rep(1, 10)) # You can plot using a color palette or a vector of named colors # Here's a way to define the colors beforehand pc_colors <- nPop(partial_clone) %>% RColorBrewer::brewer.pal("Set2") %>% setNames(popNames(partial_clone)) pc_colors # Labelling the samples contained in multilocus genotype 9 set.seed(999) plot_poppr_msn(partial_clone, pcmsn, palette = pc_colors, inds = 9) # Doing the same thing, but using one of the sample names as input. set.seed(999) plot_poppr_msn(partial_clone, pcmsn, palette = pc_colors, inds = "sim 20") # Note that this is case sensitive. Nothing is labeled. set.seed(999) plot_poppr_msn(partial_clone, pcmsn, palette = pc_colors, inds = "Sim 20") # Something pretty data(microbov) mdist <- diss.dist(microbov, percent = TRUE) micmsn <- poppr.msn(microbov, mdist, showplot = FALSE) plot_poppr_msn(microbov, micmsn, palette = "terrain.colors", inds = "n", quantiles = FALSE) plot_poppr_msn(microbov, micmsn, palette = "terrain.colors", inds = "n", cutoff = 0.3, quantiles = FALSE) ### Utilizing vectors for palettes data(Pram) Pram_sub <- popsub(Pram, exclude = c("Nursery_CA", "Nursery_OR")) # Creating the network for the forest min_span_net_sub <- bruvo.msn(Pram_sub, replen = other(Pram)$REPLEN, add = TRUE, loss = TRUE, showplot = FALSE, include.ties = TRUE) # Creating the network with nurseries min_span_net <- bruvo.msn(Pram, replen = other(Pram)$REPLEN, add = TRUE, loss = TRUE, showplot = FALSE, include.ties = TRUE) # Only forest genotypes set.seed(70) plot_poppr_msn(Pram, min_span_net_sub, inds = "ALL", mlg = TRUE, gadj = 9, nodescale = 5, palette = other(Pram)$comparePal, cutoff = NULL, quantiles = FALSE, beforecut = TRUE) # With Nurseries set.seed(70) plot_poppr_msn(Pram, min_span_net, inds = "ALL", mlg = TRUE, gadj = 9, nodescale = 5, palette = other(Pram)$comparePal, cutoff = NULL, quantiles = FALSE, beforecut = TRUE) ## End(Not run)
# Using a data set of the Aphanomyces eutieches root rot pathogen. data(Aeut) adist <- diss.dist(Aeut, percent = TRUE) amsn <- poppr.msn(Aeut, adist, showplot = FALSE) # Default library("igraph") # To get all the layouts. set.seed(500) plot_poppr_msn(Aeut, amsn, gadj = 15) ## Not run: # Different layouts (from igraph) can be used by supplying the function name. set.seed(500) plot_poppr_msn(Aeut, amsn, gadj = 15, layfun = layout_with_kk) # Removing link between populations (cutoff = 0.2) and labelling no individuals set.seed(500) plot_poppr_msn(Aeut, amsn, inds = "none", gadj = 15, beforecut = TRUE, cutoff = 0.2) # Labelling individual #57 because it is an MLG that crosses populations # Showing clusters of MLGS with at most 5% variation # Notice that the Mt. Vernon population appears to be more clonal set.seed(50) plot_poppr_msn(Aeut, amsn, gadj = 15, cutoff = 0.05, inds = "057") data(partial_clone) pcmsn <- bruvo.msn(partial_clone, replen = rep(1, 10)) # You can plot using a color palette or a vector of named colors # Here's a way to define the colors beforehand pc_colors <- nPop(partial_clone) %>% RColorBrewer::brewer.pal("Set2") %>% setNames(popNames(partial_clone)) pc_colors # Labelling the samples contained in multilocus genotype 9 set.seed(999) plot_poppr_msn(partial_clone, pcmsn, palette = pc_colors, inds = 9) # Doing the same thing, but using one of the sample names as input. set.seed(999) plot_poppr_msn(partial_clone, pcmsn, palette = pc_colors, inds = "sim 20") # Note that this is case sensitive. Nothing is labeled. set.seed(999) plot_poppr_msn(partial_clone, pcmsn, palette = pc_colors, inds = "Sim 20") # Something pretty data(microbov) mdist <- diss.dist(microbov, percent = TRUE) micmsn <- poppr.msn(microbov, mdist, showplot = FALSE) plot_poppr_msn(microbov, micmsn, palette = "terrain.colors", inds = "n", quantiles = FALSE) plot_poppr_msn(microbov, micmsn, palette = "terrain.colors", inds = "n", cutoff = 0.3, quantiles = FALSE) ### Utilizing vectors for palettes data(Pram) Pram_sub <- popsub(Pram, exclude = c("Nursery_CA", "Nursery_OR")) # Creating the network for the forest min_span_net_sub <- bruvo.msn(Pram_sub, replen = other(Pram)$REPLEN, add = TRUE, loss = TRUE, showplot = FALSE, include.ties = TRUE) # Creating the network with nurseries min_span_net <- bruvo.msn(Pram, replen = other(Pram)$REPLEN, add = TRUE, loss = TRUE, showplot = FALSE, include.ties = TRUE) # Only forest genotypes set.seed(70) plot_poppr_msn(Pram, min_span_net_sub, inds = "ALL", mlg = TRUE, gadj = 9, nodescale = 5, palette = other(Pram)$comparePal, cutoff = NULL, quantiles = FALSE, beforecut = TRUE) # With Nurseries set.seed(70) plot_poppr_msn(Pram, min_span_net, inds = "ALL", mlg = TRUE, gadj = 9, nodescale = 5, palette = other(Pram)$comparePal, cutoff = NULL, quantiles = FALSE, beforecut = TRUE) ## End(Not run)
For the poppr package description, please see package?poppr
This function allows the user to quickly view indices of heterozygosity,
evenness, and linkage to aid in the decision of a path to further analyze a
specified dataset. It natively takes adegenet::genind and
genclone objects, but can convert any raw data formats
that adegenet can take (fstat, structure, genetix, and genpop) as well as
genalex files exported into a csv format (see read.genalex()
for details).
poppr( dat, total = TRUE, sublist = "ALL", exclude = NULL, blacklist = NULL, sample = 0, method = 1, missing = "ignore", cutoff = 0.05, quiet = FALSE, clonecorrect = FALSE, strata = 1, keep = 1, plot = TRUE, hist = TRUE, index = "rbarD", minsamp = 10, legend = FALSE, ... )
poppr( dat, total = TRUE, sublist = "ALL", exclude = NULL, blacklist = NULL, sample = 0, method = 1, missing = "ignore", cutoff = 0.05, quiet = FALSE, clonecorrect = FALSE, strata = 1, keep = 1, plot = TRUE, hist = TRUE, index = "rbarD", minsamp = 10, legend = FALSE, ... )
dat |
a adegenet::genind object OR a genclone object OR any fstat, structure, genetix, genpop, or genalex formatted file. |
total |
When |
sublist |
a list of character strings or integers to indicate specific
population names (accessed via |
exclude |
a |
blacklist |
DEPRECATED, use exclude. |
sample |
an integer indicating the number of permutations desired to
obtain p-values. Sampling will shuffle genotypes at each locus to simulate
a panmictic population using the observed genotypes. Calculating the
p-value includes the observed statistics, so set your sample number to one
off for a round p-value (eg. |
method |
an integer from 1 to 4 indicating the method of sampling
desired. see |
missing |
how should missing data be treated? |
cutoff |
|
quiet |
|
clonecorrect |
default |
strata |
a |
keep |
an |
plot |
|
hist |
|
index |
|
minsamp |
an |
legend |
|
... |
arguments to be passed on to |
This table is intended to be a first look into the dynamics of mutlilocus
genotype diversity. Many of the statistics (except for the the index of
association) are simply based on counts of multilocus genotypes and do not
take into account the actual allelic states. Descriptions of the
statistics can be found in the Algorithms and Equations vignette:
vignette("algo", package = "poppr")
.
The sampling procedure is explicitly for testing the index of association.
None of the other diversity statistics (H, G, lambda, E.5) are tested with
this sampling due to the differing data types. To obtain confidence
intervals for these statistics, please see diversity_ci()
.
Rarefaction analysis is performed on the number of multilocus genotypes
because it is relatively easy to estimate (Grünwald et al., 2003). To
obtain rarefied estimates of diversity, it is possible to use
diversity_ci()
with the argument rarefy = TRUE
This function outputs a ggplot2 graphic of histograms. These can be
manipulated to be visualized in another manner by retrieving the plot with
the last_plot()
command from ggplot2. A useful manipulation would
be to arrange the graphs into a single column so that the values of the
statistic line up: p <- last_plot(); p + facet_wrap(~population, ncol = 1, scales = "free_y")
The name for the groupings is
"population" and the name for the x axis is "value".
A data frame with populations in rows and the following columns:
Pop: A vector indicating the population factor
N: An integer vector indicating the number of individuals/isolates in the specified population.
MLG: An integer vector indicating the number of multilocus genotypes
found in the specified population, (see: mlg()
)
eMLG: The expected number of MLG at the lowest common sample size (set
by the parameter minsamp
).
SE: The standard error for the rarefaction analysis
H: Shannon-Weiner Diversity index
G: Stoddard and Taylor's Index
lambda: Simpson's index
E.5: Evenness
Hexp: Nei's gene diversity (expected heterozygosity)
Ia: A numeric vector giving the value of the Index of Association for
each population factor, (see ia()
).
p.Ia: A numeric vector indicating the p-value for Ia from the number
of reshufflings indicated in sample
. Lowest value is 1/n where n is the
number of observed values.
rbarD: A numeric vector giving the value of the Standardized Index of
Association for each population factor, (see ia()
).
p.rD: A numeric vector indicating the p-value for rbarD from the
number of reshuffles indicated in sample
. Lowest value is 1/n where n is
the number of observed values.
File: A vector indicating the name of the original data file.
The calculation of Hexp
has changed from poppr 1.x. It was
previously calculated based on the diversity of multilocus genotypes,
resulting in a value of 1 for sexual populations. This was obviously not
Nei's 1978 expected heterozygosity. We have thus changed the statistic to
be the true value of Hexp by calculating where p is the allele
frequencies at a given locus and n is the number of observed alleles (Nei,
1978) in each locus and then returning the average. Caution should be
exercised in interpreting the results of Hexp with polyploid organisms with
ambiguous ploidy. The lack of allelic dosage information will cause rare
alleles to be over-represented and artificially inflate the index. This is
especially true with small sample sizes.
Zhian N. Kamvar
Paul-Michael Agapow and Austin Burt. Indices of multilocus linkage disequilibrium. Molecular Ecology Notes, 1(1-2):101-102, 2001
A.H.D. Brown, M.W. Feldman, and E. Nevo. Multilocus structure of natural populations of Hordeum spontaneum. Genetics, 96(2):523-536, 1980.
Niklaus J. Gr\"unwald, Stephen B. Goodwin, Michael G. Milgroom, and William E. Fry. Analysis of genotypic diversity data for populations of microorganisms. Phytopathology, 93(6):738-46, 2003
Bernhard Haubold and Richard R. Hudson. Lian 3.0: detecting linkage disequilibrium in multilocus data. Bioinformatics, 16(9):847-849, 2000.
Kenneth L.Jr. Heck, Gerald van Belle, and Daniel Simberloff. Explicit calculation of the rarefaction diversity measurement and the determination of sufficient sample size. Ecology, 56(6):pp. 1459-1461, 1975
Masatoshi Nei. Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89(3):583-590, 1978.
S H Hurlbert. The nonconcept of species diversity: a critique and alternative parameters. Ecology, 52(4):577-586, 1971.
J.A. Ludwig and J.F. Reynolds. Statistical Ecology. A Primer on Methods and Computing. New York USA: John Wiley and Sons, 1988.
Simpson, E. H. Measurement of diversity. Nature 163: 688, 1949 doi:10.1038/163688a0
Good, I. J. (1953). On the Population Frequency of Species and the Estimation of Population Parameters. Biometrika 40(3/4): 237-264.
Lande, R. (1996). Statistics and partitioning of species diversity, and similarity among multiple communities. Oikos 76: 5-13.
Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R. Minchin, R. B. O'Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens, and Helene Wagner. vegan: Community Ecology Package, 2012. R package version 2.0-5.
E.C. Pielou. Ecological Diversity. Wiley, 1975.
Claude Elwood Shannon. A mathematical theory of communication. Bell Systems Technical Journal, 27:379-423,623-656, 1948
J M Smith, N H Smith, M O'Rourke, and B G Spratt. How clonal are bacteria? Proceedings of the National Academy of Sciences, 90(10):4384-4388, 1993.
J.A. Stoddart and J.F. Taylor. Genotypic diversity: estimation and prediction in samples. Genetics, 118(4):705-11, 1988.
clonecorrect()
,
poppr.all()
,
ia()
,
missingno()
,
mlg()
,
diversity_stats()
,
diversity_ci()
data(nancycats) poppr(nancycats) ## Not run: # Sampling poppr(nancycats, sample = 999, total = FALSE, plot = TRUE) # Customizing the plot library("ggplot2") p <- last_plot() p + facet_wrap(~population, scales = "free_y", ncol = 1) # Turning off diversity statistics (see get_stats) poppr(nancycats, total=FALSE, H = FALSE, G = FALSE, lambda = FALSE, E5 = FALSE) # The previous version of poppr contained a definition of Hexp, which # was calculated as (N/(N - 1))*lambda. It basically looks like an unbiased # Simpson's index. This statistic was originally included in poppr because it # was originally included in the program multilocus. It was finally figured # to be an unbiased Simpson's diversity metric (Lande, 1996; Good, 1953). data(Aeut) uSimp <- function(x){ lambda <- vegan::diversity(x, "simpson") x <- drop(as.matrix(x)) if (length(dim(x)) > 1){ N <- rowSums(x) } else { N <- sum(x) } return((N/(N-1))*lambda) } poppr(Aeut, uSimp = uSimp) # Demonstration with viral data # Note: this is a larger data set that could take a couple of minutes to run # on slower computers. data(H3N2) strata(H3N2) <- data.frame(other(H3N2)$x) setPop(H3N2) <- ~country poppr(H3N2, total = FALSE, sublist=c("Austria", "China", "USA"), clonecorrect = TRUE, strata = ~country/year) ## End(Not run)
data(nancycats) poppr(nancycats) ## Not run: # Sampling poppr(nancycats, sample = 999, total = FALSE, plot = TRUE) # Customizing the plot library("ggplot2") p <- last_plot() p + facet_wrap(~population, scales = "free_y", ncol = 1) # Turning off diversity statistics (see get_stats) poppr(nancycats, total=FALSE, H = FALSE, G = FALSE, lambda = FALSE, E5 = FALSE) # The previous version of poppr contained a definition of Hexp, which # was calculated as (N/(N - 1))*lambda. It basically looks like an unbiased # Simpson's index. This statistic was originally included in poppr because it # was originally included in the program multilocus. It was finally figured # to be an unbiased Simpson's diversity metric (Lande, 1996; Good, 1953). data(Aeut) uSimp <- function(x){ lambda <- vegan::diversity(x, "simpson") x <- drop(as.matrix(x)) if (length(dim(x)) > 1){ N <- rowSums(x) } else { N <- sum(x) } return((N/(N-1))*lambda) } poppr(Aeut, uSimp = uSimp) # Demonstration with viral data # Note: this is a larger data set that could take a couple of minutes to run # on slower computers. data(H3N2) strata(H3N2) <- data.frame(other(H3N2)$x) setPop(H3N2) <- ~country poppr(H3N2, total = FALSE, sublist=c("Austria", "China", "USA"), clonecorrect = TRUE, strata = ~country/year) ## End(Not run)
Determines whether openMP is support on this system.
poppr_has_parallel()
poppr_has_parallel()
FALSE if openMP is not supported, TRUE if it is
Zhian N. Kamvar, Jonah C. Brooks
poppr_has_parallel()
poppr_has_parallel()
poppr.all is a wrapper function that will loop through a list of files from the working directory, execute [poppr()], and concatenate the output into one data frame.
poppr.all(filelist, ...)
poppr.all(filelist, ...)
filelist |
a list of files in the current working directory |
... |
arguments passed on to poppr |
see [poppr()]
Zhian N. Kamvar
[poppr()], [getfile()]
## Not run: # Obtain a list of fstat files from a directory. x <- getfile(multi=TRUE, pattern="^.+?dat$") # run the analysis on each file. poppr.all(file.path(x$path, x$files)) ## End(Not run)
## Not run: # Obtain a list of fstat files from a directory. x <- getfile(multi=TRUE, pattern="^.+?dat$") # run the analysis on each file. poppr.all(file.path(x$path, x$files)) ## End(Not run)
This function simplifies the process necessary for performing AMOVA in R. It
gives user the choice of utilizing either the ade4 or the pegas
implementation of AMOVA. See ade4::amova()
(ade4) and pegas::amova()
(pegas) for details on the specific implementation.
poppr.amova( x, hier = NULL, clonecorrect = FALSE, within = TRUE, dist = NULL, squared = TRUE, freq = TRUE, correction = "quasieuclid", sep = "_", filter = FALSE, threshold = 0, algorithm = "farthest_neighbor", threads = 1L, missing = "loci", cutoff = 0.05, quiet = FALSE, method = c("ade4", "pegas"), nperm = 0 )
poppr.amova( x, hier = NULL, clonecorrect = FALSE, within = TRUE, dist = NULL, squared = TRUE, freq = TRUE, correction = "quasieuclid", sep = "_", filter = FALSE, threshold = 0, algorithm = "farthest_neighbor", threads = 1L, missing = "loci", cutoff = 0.05, quiet = FALSE, method = c("ade4", "pegas"), nperm = 0 )
x |
|
hier |
a hierarchical formula that defines your population
hierarchy. (e.g.: |
clonecorrect |
|
within |
|
dist |
an optional distance matrix calculated on your data. If this is
set to |
squared |
if a distance matrix is supplied, this indicates whether or not it represents squared distances. |
freq |
|
correction |
a |
sep |
Deprecated. As of poppr version 2, this argument serves no purpose. |
filter |
|
threshold |
a number indicating the minimum distance two MLGs must be separated by to be considered different. Defaults to 0, which will reflect the original (naive) MLG definition. |
algorithm |
determines the type of clustering to be done.
|
threads |
|
missing |
specify method of correcting for missing data utilizing
options given in the function |
cutoff |
specify the level at which missing data should be
removed/modified. See |
quiet |
|
method |
Which method for calculating AMOVA should be used? Choices refer to package implementations: "ade4" (default) or "pegas". See details for differences. |
nperm |
the number of permutations passed to the pegas implementation of amova. |
The poppr implementation of AMOVA is a very detailed wrapper for the
ade4 implementation. The output is an ade4::amova()
class list that
contains the results in the first four elements. The inputs are contained
in the last three elements. The inputs required for the ade4 implementation
are:
a distance matrix on all unique genotypes (haplotypes)
a data frame defining the hierarchy of the distance matrix
a genotype (haplotype) frequency table.
All of this data can be constructed from a genind or genlight object, but can be daunting for a novice R user. This function automates the entire process. Since there are many variables regarding genetic data, some points need to be highlighted:
The hierarchy is defined by different
population strata that separate your data hierarchically. These strata are
defined in the strata slot of genind,
genlight, genclone, and
snpclone objects. They are useful for defining the
population factor for your data. See the function strata()
for details on
how to properly define these strata.
Heterozygosities within
genotypes are sources of variation from within individuals and can be
quantified in AMOVA. When within = TRUE
, poppr will split genotypes into
haplotypes with the function make_haplotypes()
and use those to calculate
within-individual variance. No estimation of phase is made. This acts much
like the default settings for AMOVA in the Arlequin software package.
Within individual variance will not be calculated for haploid individuals
or dominant markers as the haplotypes cannot be split further. Setting
within = FALSE
uses the euclidean distance of the allele frequencies
within each individual. Note: within = TRUE
is incompatible with
filter = TRUE
. In this case, within
will be set to FALSE
With the ade4 implementation of
AMOVA (utilized by poppr), distances must be Euclidean (due to the
nature of the calculations). Unfortunately, many genetic distance measures
are not always euclidean and must be corrected for before being analyzed.
Poppr automates this with three methods implemented in ade4,
quasieuclid()
, lingoes()
, and cailliez()
. The correction of these
distances should not adversely affect the outcome of the analysis.
Filtering multilocus genotypes is performed by
mlg.filter()
. This can necessarily only be done AMOVA tests that do not
account for within-individual variance. The distance matrix used to
calculate the amova is derived from using mlg.filter()
with the option
stats = "distance"
, which reports the distance between multilocus
genotype clusters. One useful way to utilize this feature is to correct for
genotypes that have equivalent distance due to missing data. (See example
below.)
Both ade4 and pegas have
implementations of AMOVA, both of which are appropriately called "amova".
The ade4 version is faster, but there have been questions raised as to the
validity of the code utilized. The pegas version is slower, but careful
measures have been implemented as to the accuracy of the method. It must be
noted that there appears to be a bug regarding permuting analyses where
within individual variance is accounted for (within = TRUE
) in the pegas
implementation. If you want to perform permutation analyses on the pegas
implementation, you must set within = FALSE
. In addition, while clone
correction is implemented for both methods, filtering is only implemented
for the ade4 version.
As of poppr version 2.7.0, this function is able to calculate phi statistics for within-individual variance for polyploid data with full dosage information. When a data set does not contain full dosage information for all samples, then the resulting pseudo-haplotypes will contain missing data, which would result in an incorrect estimate of variance.
Instead, the AMOVA will be performed on the distance matrix derived from
allele counts or allele frequencies, depending on the freq
option. This
has been shown to be robust to estimates with mixed ploidy (Ronfort et al.
1998; Meirmans and Liu 2018). If you wish to brute-force your way to
estimating AMOVA using missing values, you can split your haplotypes with
the make_haplotypes()
function.
One strategy for addressing ambiguous dosage in your polyploid data set
would be to convert your data to polysat's genambig
class with the
as.genambig()
, estimate allele frequencies with polysat::deSilvaFreq()
,
and use these frequencies to randomly sample alleles to fill in the
ambiguous alleles.
a list of class amova
from the ade4 or pegas package. See
ade4::amova()
or pegas::amova()
for details.
Excoffier, L., Smouse, P.E. and Quattro, J.M. (1992) Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics, 131, 479-491.
Ronfort, J., Jenczewski, E., Bataillon, T., and Rousset, F. (1998). Analysis of population structure in autotetraploid species. Genetics, 150, 921–930.
Meirmans, P., Liu, S. (2018) Analysis of Molecular Variance (AMOVA) for Autopolyploids Submitted.
ade4::amova()
, pegas::amova()
, clonecorrect()
, diss.dist()
,
missingno()
, ade4::is.euclid()
, strata()
, make_haplotypes()
,
as.genambig()
data(Aeut) strata(Aeut) <- other(Aeut)$population_hierarchy[-1] agc <- as.genclone(Aeut) agc amova.result <- poppr.amova(agc, ~Pop/Subpop) amova.result amova.test <- randtest(amova.result) # Test for significance plot(amova.test) amova.test ## Not run: # You can get the same results with the pegas implementation amova.pegas <- poppr.amova(agc, ~Pop/Subpop, method = "pegas") amova.pegas amova.pegas$varcomp/sum(amova.pegas$varcomp) # Clone correction is possible amova.cc.result <- poppr.amova(agc, ~Pop/Subpop, clonecorrect = TRUE) amova.cc.result amova.cc.test <- randtest(amova.cc.result) plot(amova.cc.test) amova.cc.test # Example with filtering data(monpop) splitStrata(monpop) <- ~Tree/Year/Symptom poppr.amova(monpop, ~Symptom/Year) # gets a warning of zero distances poppr.amova(monpop, ~Symptom/Year, filter = TRUE, threshold = 0.1) # no warning ## End(Not run)
data(Aeut) strata(Aeut) <- other(Aeut)$population_hierarchy[-1] agc <- as.genclone(Aeut) agc amova.result <- poppr.amova(agc, ~Pop/Subpop) amova.result amova.test <- randtest(amova.result) # Test for significance plot(amova.test) amova.test ## Not run: # You can get the same results with the pegas implementation amova.pegas <- poppr.amova(agc, ~Pop/Subpop, method = "pegas") amova.pegas amova.pegas$varcomp/sum(amova.pegas$varcomp) # Clone correction is possible amova.cc.result <- poppr.amova(agc, ~Pop/Subpop, clonecorrect = TRUE) amova.cc.result amova.cc.test <- randtest(amova.cc.result) plot(amova.cc.test) amova.cc.test # Example with filtering data(monpop) splitStrata(monpop) <- ~Tree/Year/Symptom poppr.amova(monpop, ~Symptom/Year) # gets a warning of zero distances poppr.amova(monpop, ~Symptom/Year, filter = TRUE, threshold = 0.1) # no warning ## End(Not run)
Create a minimum spanning network of selected populations using a distance matrix.
poppr.msn( gid, distmat, palette = topo.colors, mlg.compute = "original", sublist = "All", exclude = NULL, blacklist = NULL, vertex.label = "MLG", gscale = TRUE, glim = c(0, 0.8), gadj = 3, gweight = 1, wscale = TRUE, showplot = TRUE, include.ties = FALSE, threshold = NULL, clustering.algorithm = NULL, ... )
poppr.msn( gid, distmat, palette = topo.colors, mlg.compute = "original", sublist = "All", exclude = NULL, blacklist = NULL, vertex.label = "MLG", gscale = TRUE, glim = c(0, 0.8), gadj = 3, gweight = 1, wscale = TRUE, showplot = TRUE, include.ties = FALSE, threshold = NULL, clustering.algorithm = NULL, ... )
gid |
|
distmat |
a distance matrix that has been derived from your data set. |
palette |
a |
mlg.compute |
if the multilocus genotypes are set to "custom" (see
|
sublist |
a |
exclude |
a |
blacklist |
DEPRECATED, use exclude. |
vertex.label |
a |
gscale |
"grey scale". If this is |
glim |
"grey limit". Two numbers between zero and one. They determine
the upper and lower limits for the |
gadj |
"grey adjust". a positive |
gweight |
"grey weight". an |
wscale |
"width scale". If this is |
showplot |
logical. If |
include.ties |
logical. If |
threshold |
numeric. By default, this is |
clustering.algorithm |
string. By default, this is |
... |
any other arguments that could go into plot.igraph |
The minimum spanning network generated by this function is generated
via igraph's minimum.spanning.tree
. The resultant
graph produced can be plotted using igraph functions, or the entire object
can be plotted using the function plot_poppr_msn
, which will
give the user a scale bar and the option to layout your data.
The area of the nodes are representative of the number of samples. Because igraph scales nodes by radius, the node sizes in the graph are represented as the square root of the number of samples.
Each node on the graph represents a different multilocus genotype.
The edges on the graph represent genetic distances that connect the
multilocus genotypes. In genclone objects, it is possible to set the
multilocus genotypes to a custom definition. This creates a problem for
clone correction, however, as it is very possible to define custom lineages
that are not monophyletic. When clone correction is performed on these
definitions, information is lost from the graph. To circumvent this, The
clone correction will be done via the computed multilocus genotypes, either
"original" or "contracted". This is specified in the mlg.compute
argument, above.
If your incoming data set is of the class genclone
,
and it contains contracted multilocus genotypes, this function will retain
that information for creating the minimum spanning network. You can use the
arguments threshold
and clustering.algorithm
to change the
threshold or clustering algorithm used in the network. For example, if you
have a data set that has a threshold of 0.1 and you wish to have a minimum
spanning network without a threshold, you can simply add
threshold = 0.0
, and no clustering will happen.
The threshold
and clustering.algorithm
arguments can also be
used to filter un-contracted data sets.
All filtering will use the distance matrix supplied in the argument
distmat
.
graph |
a minimum spanning network with nodes corresponding to MLGs within the data set. Colors of the nodes represent population membership. Width and color of the edges represent distance. |
populations |
a vector of the population names corresponding to the vertex colors |
colors |
a vector of the hexadecimal representations of the colors used in the vertex colors |
The edges of these graphs may cross each other if the graph becomes too large.
Javier F. Tabima, Zhian N. Kamvar, Jonah C. Brooks
plot_poppr_msn
nancycats
,
upgma
, nj
, nodelabels
,
tab
, missingno
, bruvo.msn
,
greycurve
# Load the data set and calculate the distance matrix for all individuals. data(Aeut) A.dist <- diss.dist(Aeut) # Graph it. A.msn <- poppr.msn(Aeut, A.dist, gadj = 15, vertex.label = NA) # Find the sizes of the nodes (number of individuals per MLL): igraph::vertex_attr(A.msn$graph, "size")^2 ## Not run: # Set subpopulation structure. Aeut.sub <- as.genclone(Aeut) setPop(Aeut.sub) <- ~Pop/Subpop # Plot respective to the subpopulation structure As.msn <- poppr.msn(Aeut.sub, A.dist, gadj=15, vertex.label=NA) # Show only the structure of the Athena population. As.msn <- poppr.msn(Aeut.sub, A.dist, gadj=15, vertex.label=NA, sublist=1:10) # Let's look at the structure of the microbov data set library("igraph") data(microbov) micro.dist <- diss.dist(microbov, percent = TRUE) micro.msn <- poppr.msn(microbov, micro.dist, vertex.label=NA) # Let's plot it and show where individuals have < 15% of their genotypes # different. edge_weight <- E(micro.msn$graph)$weight edge_labels <- ifelse(edge_weight < 0.15, round(edge_weight, 3), NA) plot.igraph(micro.msn$graph, edge.label = edge_labels, vertex.size = 2, edge.label.color = "red") ## End(Not run)
# Load the data set and calculate the distance matrix for all individuals. data(Aeut) A.dist <- diss.dist(Aeut) # Graph it. A.msn <- poppr.msn(Aeut, A.dist, gadj = 15, vertex.label = NA) # Find the sizes of the nodes (number of individuals per MLL): igraph::vertex_attr(A.msn$graph, "size")^2 ## Not run: # Set subpopulation structure. Aeut.sub <- as.genclone(Aeut) setPop(Aeut.sub) <- ~Pop/Subpop # Plot respective to the subpopulation structure As.msn <- poppr.msn(Aeut.sub, A.dist, gadj=15, vertex.label=NA) # Show only the structure of the Athena population. As.msn <- poppr.msn(Aeut.sub, A.dist, gadj=15, vertex.label=NA, sublist=1:10) # Let's look at the structure of the microbov data set library("igraph") data(microbov) micro.dist <- diss.dist(microbov, percent = TRUE) micro.msn <- poppr.msn(microbov, micro.dist, vertex.label=NA) # Let's plot it and show where individuals have < 15% of their genotypes # different. edge_weight <- E(micro.msn$graph)$weight edge_labels <- ifelse(edge_weight < 0.15, round(edge_weight, 3), NA) plot.igraph(micro.msn$graph, edge.label = edge_labels, vertex.size = 2, edge.label.color = "red") ## End(Not run)
Create a new dataset with specified populations or exclude specified populations from the dataset.
popsub( gid, sublist = "ALL", exclude = NULL, blacklist = NULL, mat = NULL, drop = TRUE )
popsub( gid, sublist = "ALL", exclude = NULL, blacklist = NULL, mat = NULL, drop = TRUE )
gid |
|
sublist |
a |
exclude |
a |
blacklist |
DEPRECATED, use exclude. |
mat |
a |
drop |
|
A genind
object or a matrix.
Zhian N. Kamvar
# Load the dataset microbov. data(microbov) # List the population names. popNames(microbov) # Analyze only the populations with exactly 50 individuals mic.50 <- popsub(microbov, sublist=c(1:6, 11:15), exclude=c(3,4,13,14)) ## Not run: # Analyze the first 10 populations, except for "Bazadais" mic.10 <- popsub(microbov, sublist=1:10, exclude="Bazadais") # Take out the two smallest populations micbig <- popsub(microbov, exclude=c("NDama", "Montbeliard")) # Analyze the two largest populations miclrg <- popsub(microbov, sublist=c("BlondeAquitaine", "Charolais")) ## End(Not run)
# Load the dataset microbov. data(microbov) # List the population names. popNames(microbov) # Analyze only the populations with exactly 50 individuals mic.50 <- popsub(microbov, sublist=c(1:6, 11:15), exclude=c(3,4,13,14)) ## Not run: # Analyze the first 10 populations, except for "Bazadais" mic.10 <- popsub(microbov, sublist=1:10, exclude="Bazadais") # Take out the two smallest populations micbig <- popsub(microbov, exclude=c("NDama", "Montbeliard")) # Analyze the two largest populations miclrg <- popsub(microbov, sublist=c("BlondeAquitaine", "Charolais")) ## End(Not run)
This is the data set from doi:10.5281/zenodo.13007. It has been converted to the genclone object as of poppr version 2.0. It contains 729 samples of the Sudden Oak Death pathogen *Phytophthora ramorum* genotyped over five microsatellite loci (Kamvar et. al., 2015). 513 samples were collected from forests in Curry County, OR from 2001 to mid-2014 (labeled by watershed region). The other 216 samples represents genotypes collected from Nurseries in OR and CA from Goss et. al. (2009).
data(Pram)
data(Pram)
a [genclone-class] object with 3 hierarchical levels called "SOURCE", "YEAR", and, "STATE". The other slot contains a named vector of repeat lengths called "REPLEN", a matrix of xy coordinates for the forest samples called "xy", and a palette to color the ~SOURCE/STATE stratification called "comparePal".
Kamvar, Z. N., Larsen, M. M., Kanaskie, A. M., Hansen, E. M., & Grünwald, N. J. (2015). Spatial and temporal analysis of populations of the sudden oak death pathogen in Oregon forests. Phytopathology 105:982-989. doi: doi:10.1094/PHYTO-12-14-0350-FI
Zhian N. Kamvar, Meg M. Larsen, Alan M. Kanaskie, Everett M. Hansen, & Niklaus J. Grünwald. 2014. Sudden_Oak_Death_in_Oregon_Forests: Spatial and temporal population dynamics of the sudden oak death epidemic in Oregon Forests. ZENODO, doi: doi:10.5281/zenodo.13007
Goss, E. M., Larsen, M., Chastagner, G. A., Givens, D. R., and Grünwald, N. J. 2009. Population genetic analysis infers migration pathways of *Phytophthora ramorum* in US nurseries. PLoS Pathog. 5:e1000583. doi: doi:10.1371/journal.ppat.1000583
data(Pram) # Repeat lengths (previously processed via fix_replen) other(Pram)$REPLEN # Color palette for source by state. Useful for minimum spanning networks other(Pram)$comparePal
data(Pram) # Repeat lengths (previously processed via fix_replen) other(Pram)$REPLEN # Color palette for source by state. Useful for minimum spanning networks other(Pram)$comparePal
Tabulate alleles the occur in only one population.
private_alleles( gid, form = alleles ~ ., report = "table", level = "population", count.alleles = TRUE, drop = FALSE )
private_alleles( gid, form = alleles ~ ., report = "table", level = "population", count.alleles = TRUE, drop = FALSE )
gid |
a adegenet::genind or genclone object. |
form |
a |
report |
one of |
level |
one of |
count.alleles |
|
drop |
|
the argument form
allows for control over the strata at which
private alleles should be computed. It takes a form where the left hand
side of the formula can be either "allele", "locus", or "loci". The right
hand of the equation, by default is ".". If you change it, it must
correspond to strata located in the adegenet::strata()
slot.
Note, that the right hand side is disabled for genpop objects.
a matrix, data.frame, or vector defining the populations or individuals containing private alleles. If vector is chosen, alleles are not defined.
Zhian N. Kamvar
data(Pinf) # Load P. infestans data. private_alleles(Pinf) ## Not run: # Analyze private alleles based on the country of interest: private_alleles(Pinf, alleles ~ Country) # Number of observed alleles per locus private_alleles(Pinf, locus ~ Country, count.alleles = TRUE) # Get raw number of private alleles per locus. (pal <- private_alleles(Pinf, locus ~ Country, count.alleles = FALSE)) # Get percentages. sweep(pal, 2, nAll(Pinf)[colnames(pal)], FUN = "/") # An example of how these data can be displayed. library("ggplot2") Pinfpriv <- private_alleles(Pinf, report = "data.frame") ggplot(Pinfpriv) + geom_tile(aes(x = population, y = allele, fill = count)) ## End(Not run)
data(Pinf) # Load P. infestans data. private_alleles(Pinf) ## Not run: # Analyze private alleles based on the country of interest: private_alleles(Pinf, alleles ~ Country) # Number of observed alleles per locus private_alleles(Pinf, locus ~ Country, count.alleles = TRUE) # Get raw number of private alleles per locus. (pal <- private_alleles(Pinf, locus ~ Country, count.alleles = FALSE)) # Get percentages. sweep(pal, 2, nAll(Pinf)[colnames(pal)], FUN = "/") # An example of how these data can be displayed. library("ggplot2") Pinfpriv <- private_alleles(Pinf, report = "data.frame") ggplot(Pinfpriv) + geom_tile(aes(x = population, y = allele, fill = count)) ## End(Not run)
Probability of encountering a genotype more than once by chance
psex( gid, pop = NULL, by_pop = TRUE, freq = NULL, G = NULL, method = c("single", "multiple"), ... )
psex( gid, pop = NULL, by_pop = TRUE, freq = NULL, G = NULL, method = c("single", "multiple"), ... )
gid |
a genind or genclone object. |
pop |
either a formula to set the population factor from the
|
by_pop |
When this is |
freq |
a vector or matrix of allele frequencies. This defaults to
|
G |
an integer vector specifying the number of observed genets. If NULL,
this will be the number of original multilocus genotypes for
|
method |
which method of calculating psex should be used? Using
|
... |
options from correcting rare alleles. The default is to correct allele frequencies to 1/n |
Psex is the probability of encountering a given genotype more than once by chance. The basic equation from Parks and Werth (1993) is
where G is the number of multilocus genotypes and pgen is the
probability of a given genotype (see
pgen
for its calculation). For a given value of alpha (e.g.
alpha = 0.05), genotypes with psex < alpha can be thought of as a single
genet whereas genotypes with psex > alpha do not have strong evidence that
members belong to the same genet (Parks and Werth, 1993).
When method = "multiple"
, the method from Arnaud-Haond et al. (1997)
is used where the sum of the binomial density is taken.
where N is the number of sampling units i is the ith - 1 encounter of a given genotype, and pgen is the value of pgen for that genotype. This procedure is performed for all samples in the data. For example, if you have a genotype whose pgen value was 0.0001, with 5 observations out of 100 samples, the value of psex is computed like so:
dbinom(0:4, 100, 0.0001)
It is possible to modify G
for single or multiple encounters. With
method = "single"
, G
takes place of the exponent, whereas
with method = "multiple"
, G
replaces N
(see above).
If you supply a named vector for G
with the population names and
by_pop = TRUE
, then the value of G
will be different for each
population.
For example, in the case of method = "multiple"
, let's say you have
two populations that share a genotype between them. The size of population
A and B are 25 and 75, respectively, The values of pgen for that genotype
in population A and B are 0.005 and 0.0001, respectively, and the number of
samples with the genotype in popualtions A and B are 4 and 6, respectively.
In this case psex for this genotype would be calculated for each population
separately if we don't specify G
:
psexA = dbinom(0:3, 25, 0.005) psexB = dbinom(0:5, 75, 0.0001)
If we specify G = 100
, then it changes to:
psexA = dbinom(0:3, 100, 0.005) psexB = dbinom(0:5, 100, 0.0001)
We could also specify G to be the number of genotypes observed in the population (let's say A = 10, B = 20)
psexA = dbinom(0:3, 10, 0.005) psexB = dbinom(0:5, 20, 0.0001)
Unless freq
is supplied, the function will automatically calculate
the round-robin allele frequencies with rraf
and G
with nmll
.
a vector of Psex for each sample.
The values of Psex represent the value for each multilocus genotype.
Additionally, when the argument pop
is not NULL
,
by_pop
is automatically TRUE
.
Zhian N. Kamvar, Jonah Brooks, Stacy A. Krueger-Hadfield, Erik Sotka
Arnaud-Haond, S., Duarte, C. M., Alberto, F., & Serrão, E. A. 2007. Standardizing methods to address clonality in population studies. Molecular Ecology, 16(24), 5115-5139.
Parks, J. C., & Werth, C. R. 1993. A study of spatial features of clones in a population of bracken fern, Pteridium aquilinum (Dennstaedtiaceae). American Journal of Botany, 537-544.
pgen
, rraf
, rrmlg
,
rare_allele_correction
data(Pram) # With multiple encounters Pram_psex <- psex(Pram, by_pop = FALSE, method = "multiple") plot(Pram_psex, log = "y", col = ifelse(Pram_psex > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of multiple encounters") ## Not run: # For a single encounter (default) Pram_psex <- psex(Pram, by_pop = FALSE) plot(Pram_psex, log = "y", col = ifelse(Pram_psex > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of second encounter") # This can be also done assuming populations structure Pram_psex <- psex(Pram, by_pop = TRUE, method = "multiple") plot(Pram_psex, log = "y", col = ifelse(Pram_psex > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of multiple encounters\nwith pop structure") # The above, but correcting zero-value alleles by 1/(2*rrmlg) with no # population structure assumed # Type ?rare_allele_correction for details. Pram_psex2 <- psex(Pram, by_pop = FALSE, d = "rrmlg", mul = 1/2, method = "multiple") plot(Pram_psex2, log = "y", col = ifelse(Pram_psex2 > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of multiple encounters\nwith pop structure (1/(2*rrmlg))") # We can also set G to the total population size (G <- nInd(Pram)) Pram_psex <- psex(Pram, by_pop = TRUE, method = "multiple", G = G) plot(Pram_psex, log = "y", col = ifelse(Pram_psex > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of multiple encounters\nwith pop structure G = 729") # Or we can set G to the number of unique MLGs (G <- rowSums(mlg.table(Pram, plot = FALSE) > 0)) Pram_psex <- psex(Pram, by_pop = TRUE, method = "multiple", G = G) plot(Pram_psex, log = "y", col = ifelse(Pram_psex > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of multiple encounters\nwith pop structure G = nmll") ## An example of supplying previously calculated frequencies and G # From Parks and Werth, 1993, using the first three genotypes. # The row names indicate the number of samples found with that genotype x <- " Hk Lap Mdh2 Pgm1 Pgm2 X6Pgd2 54 12 12 12 23 22 11 36 22 22 11 22 33 11 10 23 22 11 33 13 13" # Since we aren't representing the whole data set here, we are defining the # allele frequencies before the analysis. afreq <- c(Hk.1 = 0.167, Hk.2 = 0.795, Hk.3 = 0.038, Lap.1 = 0.190, Lap.2 = 0.798, Lap.3 = 0.012, Mdh2.0 = 0.011, Mdh2.1 = 0.967, Mdh2.2 = 0.022, Pgm1.2 = 0.279, Pgm1.3 = 0.529, Pgm1.4 = 0.162, Pgm1.5 = 0.029, Pgm2.1 = 0.128, Pgm2.2 = 0.385, Pgm2.3 = 0.487, X6Pgd2.1 = 0.526, X6Pgd2.2 = 0.051, X6Pgd2.3 = 0.423) xtab <- read.table(text = x, header = TRUE, row.names = 1) # Here we are expanding the number of samples to their observed values. # Since we have already defined the allele frequencies, this step is actually # not necessary. all_samples <- rep(rownames(xtab), as.integer(rownames(xtab))) xgid <- df2genind(xtab[all_samples, ], ncode = 1) freqs <- afreq[colnames(tab(xgid))] # only used alleles in the sample pSex <- psex(xgid, by_pop = FALSE, freq = freqs, G = 45) # Note, pgen returns log values for each locus, here we take the sum across # all loci and take the exponent to give us the value of pgen for each sample pGen <- exp(rowSums(pgen(xgid, by_pop = FALSE, freq = freqs))) res <- matrix(c(unique(pGen), unique(pSex)), ncol = 2) colnames(res) <- c("Pgen", "Psex") res <- cbind(xtab, nRamet = rownames(xtab), round(res, 5)) rownames(res) <- 1:3 res # Compare to the first three rows of Table 2 in Parks & Werth, 1993 ## End(Not run)
data(Pram) # With multiple encounters Pram_psex <- psex(Pram, by_pop = FALSE, method = "multiple") plot(Pram_psex, log = "y", col = ifelse(Pram_psex > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of multiple encounters") ## Not run: # For a single encounter (default) Pram_psex <- psex(Pram, by_pop = FALSE) plot(Pram_psex, log = "y", col = ifelse(Pram_psex > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of second encounter") # This can be also done assuming populations structure Pram_psex <- psex(Pram, by_pop = TRUE, method = "multiple") plot(Pram_psex, log = "y", col = ifelse(Pram_psex > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of multiple encounters\nwith pop structure") # The above, but correcting zero-value alleles by 1/(2*rrmlg) with no # population structure assumed # Type ?rare_allele_correction for details. Pram_psex2 <- psex(Pram, by_pop = FALSE, d = "rrmlg", mul = 1/2, method = "multiple") plot(Pram_psex2, log = "y", col = ifelse(Pram_psex2 > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of multiple encounters\nwith pop structure (1/(2*rrmlg))") # We can also set G to the total population size (G <- nInd(Pram)) Pram_psex <- psex(Pram, by_pop = TRUE, method = "multiple", G = G) plot(Pram_psex, log = "y", col = ifelse(Pram_psex > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of multiple encounters\nwith pop structure G = 729") # Or we can set G to the number of unique MLGs (G <- rowSums(mlg.table(Pram, plot = FALSE) > 0)) Pram_psex <- psex(Pram, by_pop = TRUE, method = "multiple", G = G) plot(Pram_psex, log = "y", col = ifelse(Pram_psex > 0.05, "red", "blue")) abline(h = 0.05, lty = 2) title("Probability of multiple encounters\nwith pop structure G = nmll") ## An example of supplying previously calculated frequencies and G # From Parks and Werth, 1993, using the first three genotypes. # The row names indicate the number of samples found with that genotype x <- " Hk Lap Mdh2 Pgm1 Pgm2 X6Pgd2 54 12 12 12 23 22 11 36 22 22 11 22 33 11 10 23 22 11 33 13 13" # Since we aren't representing the whole data set here, we are defining the # allele frequencies before the analysis. afreq <- c(Hk.1 = 0.167, Hk.2 = 0.795, Hk.3 = 0.038, Lap.1 = 0.190, Lap.2 = 0.798, Lap.3 = 0.012, Mdh2.0 = 0.011, Mdh2.1 = 0.967, Mdh2.2 = 0.022, Pgm1.2 = 0.279, Pgm1.3 = 0.529, Pgm1.4 = 0.162, Pgm1.5 = 0.029, Pgm2.1 = 0.128, Pgm2.2 = 0.385, Pgm2.3 = 0.487, X6Pgd2.1 = 0.526, X6Pgd2.2 = 0.051, X6Pgd2.3 = 0.423) xtab <- read.table(text = x, header = TRUE, row.names = 1) # Here we are expanding the number of samples to their observed values. # Since we have already defined the allele frequencies, this step is actually # not necessary. all_samples <- rep(rownames(xtab), as.integer(rownames(xtab))) xgid <- df2genind(xtab[all_samples, ], ncode = 1) freqs <- afreq[colnames(tab(xgid))] # only used alleles in the sample pSex <- psex(xgid, by_pop = FALSE, freq = freqs, G = 45) # Note, pgen returns log values for each locus, here we take the sum across # all loci and take the exponent to give us the value of pgen for each sample pGen <- exp(rowSums(pgen(xgid, by_pop = FALSE, freq = freqs))) res <- matrix(c(unique(pGen), unique(pSex)), ncol = 2) colnames(res) <- c("Pgen", "Psex") res <- cbind(xtab, nRamet = rownames(xtab), round(res, 5)) rownames(res) <- 1:3 res # Compare to the first three rows of Table 2 in Parks & Werth, 1993 ## End(Not run)
The following is a set of arguments for use in rraf
,
pgen
, and psex
to correct rare allele frequencies
that were lost in estimating round-robin allele frequencies.
e |
a numeric epsilon value to use for all missing allele frequencies. |
d |
the unit by which to take the reciprocal. |
mul |
a multiplier for div. Default is |
sum_to_one |
when |
By default (d = "sample", e = NULL, sum_to_one = FALSE, mul =
1
), this will add 1/(n samples) to all zero-value alleles. The basic formula
is 1/(d * m) unless e is specified. If sum_to_one =
TRUE
, then the frequencies will be scaled as x/sum(x) AFTER correction,
indicating that the allele frequencies will be reduced. See the examples for
details. The general pattern of correction is that the value of the MAF will
be rrmlg > mlg > sample
When calculating allele frequencies from a round-robin
approach, rare alleles are often lost resulting in zero-valued allele
frequencies (Arnaud-Haond et al. 2007, Parks and Werth 1993). This can be
problematic when calculating values for pgen
and
psex
because frequencies of zero will result in undefined
values for samples that contain those rare alleles. The solution to this
problem is to give an estimate for the frequency of those rare alleles, but
the question of HOW to do that arises. These arguments provide a way to
define how rare alleles are to be estimated/corrected.
These arguments are for use in the functions rraf
,
pgen
, and psex
. They will replace the dots (...)
that appear at the end of the function call. For example, if you want to set
the minor allele frequencies to a specific value (let's say 0.001),
regardless of locus, you can insert e = 0.001
along with any other
arguments (note, position is not specific):
pgen(my_data, e = 0.001, log = FALSE) psex(my_data, method = "multiple", e = 0.001)
Zhian N. Kamvar
Arnaud-Haond, S., Duarte, C. M., Alberto, F., & Serrão, E. A. 2007. Standardizing methods to address clonality in population studies. Molecular Ecology, 16(24), 5115-5139.
Parks, J. C., & Werth, C. R. 1993. A study of spatial features of clones in a population of bracken fern, Pteridium aquilinum (Dennstaedtiaceae). American Journal of Botany, 537-544.
## Not run: data(Pram) #------------------------------------- # If you set correction = FALSE, you'll notice the zero-valued alleles rraf(Pram, correction = FALSE) # By default, however, the data will be corrected by 1/n rraf(Pram) # Of course, this is a diploid organism, we might want to set 1/2n rraf(Pram, mul = 1/2) # To set MAF = 1/2mlg rraf(Pram, d = "mlg", mul = 1/2) # Another way to think about this is, since these allele frequencies were # derived at each locus with different sample sizes, it's only appropriate to # correct based on those sample sizes. rraf(Pram, d = "rrmlg", mul = 1/2) # If we were going to use these frequencies for simulations, we might want to # ensure that they all sum to one. rraf(Pram, d = "mlg", mul = 1/2, sum_to_one = TRUE) #------------------------------------- # When we calculate these frequencies based on population, they are heavily # influenced by the number of observed mlgs. rraf(Pram, by_pop = TRUE, d = "rrmlg", mul = 1/2) # This can be fixed by specifying a specific value rraf(Pram, by_pop = TRUE, e = 0.01) ## End(Not run)
## Not run: data(Pram) #------------------------------------- # If you set correction = FALSE, you'll notice the zero-valued alleles rraf(Pram, correction = FALSE) # By default, however, the data will be corrected by 1/n rraf(Pram) # Of course, this is a diploid organism, we might want to set 1/2n rraf(Pram, mul = 1/2) # To set MAF = 1/2mlg rraf(Pram, d = "mlg", mul = 1/2) # Another way to think about this is, since these allele frequencies were # derived at each locus with different sample sizes, it's only appropriate to # correct based on those sample sizes. rraf(Pram, d = "rrmlg", mul = 1/2) # If we were going to use these frequencies for simulations, we might want to # ensure that they all sum to one. rraf(Pram, d = "mlg", mul = 1/2, sum_to_one = TRUE) #------------------------------------- # When we calculate these frequencies based on population, they are heavily # influenced by the number of observed mlgs. rraf(Pram, by_pop = TRUE, d = "rrmlg", mul = 1/2) # This can be fixed by specifying a specific value rraf(Pram, by_pop = TRUE, e = 0.01) ## End(Not run)
read.genalex will read in a genalex-formatted file that has been exported in a comma separated format and will parse most types of genalex data. The output is a genclone-class or genind-class object.
read.genalex( genalex, ploidy = 2, geo = FALSE, region = FALSE, genclone = TRUE, sep = ",", recode = FALSE )
read.genalex( genalex, ploidy = 2, geo = FALSE, region = FALSE, genclone = TRUE, sep = ",", recode = FALSE )
genalex |
a \*.csv file exported from genalex |
ploidy |
an integer to indicate the ploidy of the dataset |
geo |
indicates the presence of geographic data in the file. This data
will be included in a data frame labeled |
region |
indicates the presence of regional data in the file. |
genclone |
when |
sep |
A character specifying the column separator of the data. Defaults to ",". |
recode |
For polyploid data: Do you want to recode your data to
have varying ploidy? Default is |
The resulting genclone-class or genind-class
object will have a single strata defined in the strata slot. This will be
called "Pop" and will reflect the population factor defined in the genalex
input. If region = TRUE
, a second column will be inserted and labeled
"Region". If you have more than two strata within your data set, you should
run the command adegenet::splitStrata()
on your data set to define the
unique stratifications.
The genind object has an all-or-none approach to missing data. If a sample has missing data at a particular locus, then the entire locus is considered missing. This works for diploids and haploids where allelic dosage is unambiguous. For polyploids this poses a problem as much of the data set would be transformed into missing data. With this function, I have created a workaround.
When importing polyploid data sets, missing data is scored as "0" and kept within the genind object as an extra allele. This will break most analyses relying on allele frequencies*. All of the functions in poppr will work properly with these data sets as multilocus genotype analysis is agnostic of ploidy and we have written both Bruvo's distance and the index of association in such a way as to be able to handle polyploids presented in this manner.
\* To restore functionality of analyses relying on allele frequencies, use
the recode_polyploids()
function.
This function cannot handle raw allele frequency data.
In the case that there are duplicated names within the file, this function
will assume separate individuals and rename each one to a sequence of
integers from 1 to the number of individuals. A vector of the original
names will be saved in the other
slot under original_names
.
Zhian N. Kamvar
genind2genalex()
, clonecorrect()
, genclone,
genind, recode_polyploids()
## Not run: Aeut <- read.genalex(system.file("files/rootrot.csv", package="poppr")) genalex2 <- read.genalex("genalex2.csv", geo=TRUE) # A genalex file with geographic coordinate data. genalex3 <- read.genalex("genalex3.csv", region=TRUE) # A genalex file with regional information. genalex4 <- read.genalex("genalex4.csv", region=TRUE, geo=TRUE) # A genalex file with both regional and geographic information. ## End(Not run)
## Not run: Aeut <- read.genalex(system.file("files/rootrot.csv", package="poppr")) genalex2 <- read.genalex("genalex2.csv", geo=TRUE) # A genalex file with geographic coordinate data. genalex3 <- read.genalex("genalex3.csv", region=TRUE) # A genalex file with regional information. genalex4 <- read.genalex("genalex4.csv", region=TRUE, geo=TRUE) # A genalex file with both regional and geographic information. ## End(Not run)
As the genind object requires ploidy to be consistent across loci, a
workaround to importing polyploid data was to code missing alleles as "0"
(for microsatellite data sets). The advantage of this is that users would be
able to calculate Bruvo's distance, the index of association, and genotypic
diversity statistics. The tradeoff was the fact that this broke all other
analyses as they relied on allele frequencies and the missing alleles are
treated as extra alleles. This function removes those alleles and returns a
genclone
or genind
object where
allele frequencies are coded based on the number of alleles observed at a
single locus per individual. See the examples for more details.
recode_polyploids(poly, newploidy = FALSE, addzero = FALSE)
recode_polyploids(poly, newploidy = FALSE, addzero = FALSE)
poly |
a |
newploidy |
for genind or genclone objects: if |
addzero |
add zeroes onto genind or genclone objects with uneven ploidy?
if |
The genind object has two caveats that make it difficult to work with polyploid data sets:
ploidy must be constant throughout the data set
missing data is treated as "all-or-none"
In an ideal world, polyploid genotypes would be just as unambiguous as diploid or haploid genotypes. Unfortunately, the world we live in is far from ideal and a genotype of AB in a tetraploid organism could be AAAB, AABB, or ABBB. In order to get polyploid data in to adegenet or poppr, we must code all loci to have the same number of allelic states as the ploidy or largest observed heterozygote (if ploidy is unknown). The way to do this is to insert zeroes to pad the alleles. So, to import two genotypes of:
NA | 20 | 23 | 24 |
20 | 24 | 26 | 43 |
they should be coded as:
0 | 20 | 23 | 24 |
20 | 24 | 26 | 43 |
This zero is treated as an extra allele and is represented in the genind object as so:
0 | 20 | 23 | 24 | 26 | 43 |
1 | 1 | 1 | 1 | 0 | 0 |
0 | 1 | 0 | 1 | 1 | 1 |
This function remedies this problem by removing the zero column. The above table would become:
20 | 23 | 24 | 26 | 43 |
1 | 1 | 1 | 0 | 0 |
1 | 0 | 1 | 1 | 1 |
With this, the user is able to calculate frequency based statistics on the data set.
a genclone
, genind
, or
genpop
object.
This is an approximation, and a bad one at that. Poppr was not originally intended for polyploids, but with the inclusion of Bruvo's distance, it only made sense to attempt something beyond single use.
Zhian N. Kamvar
data(Pinf) iPinf <- recode_polyploids(Pinf) # Note that the difference between the number of alleles. nAll(Pinf) nAll(iPinf) ## Not run: library("ape") # Removing missing data. setPop(Pinf) <- ~Country # Calculating Rogers' distance. rog <- rogers.dist(genind2genpop(Pinf)) irog <- rogers.dist(recode_polyploids(genind2genpop(Pinf))) # We will now plot neighbor joining trees. Note the decreased distance in the # original data. plot(nj(rog), type = "unrooted") add.scale.bar(lcol = "red", length = 0.02) plot(nj(irog), type = "unrooted") add.scale.bar(lcol = "red", length = 0.02) ## End(Not run)
data(Pinf) iPinf <- recode_polyploids(Pinf) # Note that the difference between the number of alleles. nAll(Pinf) nAll(iPinf) ## Not run: library("ape") # Removing missing data. setPop(Pinf) <- ~Country # Calculating Rogers' distance. rog <- rogers.dist(genind2genpop(Pinf)) irog <- rogers.dist(recode_polyploids(genind2genpop(Pinf))) # We will now plot neighbor joining trees. Note the decreased distance in the # original data. plot(nj(rog), type = "unrooted") add.scale.bar(lcol = "red", length = 0.02) plot(nj(irog), type = "unrooted") add.scale.bar(lcol = "red", length = 0.02) ## End(Not run)
This function utilizes rrmlg
to calculate multilocus genotypes
and then subsets each locus by the resulting MLGs to calculate the
round-robin allele frequencies used for pgen and psex.
rraf(gid, pop = NULL, res = "list", by_pop = FALSE, correction = TRUE, ...)
rraf(gid, pop = NULL, res = "list", by_pop = FALSE, correction = TRUE, ...)
gid |
a genind or genclone object |
pop |
either a formula to set the population factor from the
|
res |
either "list" (default), "vector", or "data.frame". |
by_pop |
When this is |
correction |
a logical indicating whether or not zero-valued allele
frequencies should be corrected using the methods outlined in
correcting rare alleles.
(Default: |
... |
options from correcting rare alleles. The default is to correct allele frequencies to 1/n |
Calculating allele frequencies for clonal populations is a difficult task. Frequencies calculated on non-clone-corrected data suffer from bias due to non-independent samples. On the other hand, frequencies calculated on clone-corrected data artificially increases the significance of rare alleles. The method of round-robin allele frequencies as presented in Parks and Werth (1993) provides a method of calculating allele frequencies in a way that minimizes both of these effects.
Allele frequencies at a given locus are
calculated based on samples that are clone corrected without that
locus. When this happens, rare alleles have a high likelihood of dropping
out, giving them a frequency of "0". For some analyses, this is a perfectly
fine outcome, but for analyses such as pgen
and
psex
, this could result in undefined values. Setting
correction = TRUE
will allow you to control how these zero-valued
allele frequencies are corrected. For details, please see the documentation
on correcting rare alleles and examples.
a vector or list of allele frequencies
When by_pop = TRUE
, the output will be a matrix of allele
frequencies. Additionally, when the argument pop
is not NULL
,
by_pop
is automatically TRUE
.
Zhian N. Kamvar, Jonah Brooks, Stacy A. Krueger-Hadfield, Erik Sotka
Arnaud-Haond, S., Duarte, C. M., Alberto, F., & Serrão, E. A. 2007. Standardizing methods to address clonality in population studies. Molecular Ecology, 16(24), 5115-5139.
Parks, J. C., & Werth, C. R. 1993. A study of spatial features of clones in a population of bracken fern, Pteridium aquilinum (Dennstaedtiaceae). American Journal of Botany, 537-544.
rrmlg
, pgen
, psex
,
rare_allele_correction
data(Pram) # Round robin allele frequencies, correcting zero-valued frequencies to 1/nInd(Pram) rraf(Pram) ## Not run: ## Round robin allele frequencies will be different than observed # Compare to without round robin: PrLoc <- seploc(Pram, res = "mat") # get locus by matrix lapply(PrLoc, colMeans, na.rm = TRUE) # Without round robin, clone corrected: Pcc <- clonecorrect(Pram, strata = NA) # indiscriminantly clone correct PccLoc <- seploc(Pcc, res = "mat") lapply(PccLoc, colMeans, na.rm = TRUE) ## Different methods of obtaining round robin allele frequencies # Get vector output. rraf(Pram, res = "vector") # Getting the output as a data frame allows us to use ggplot2 to visualize (Prdf <- rraf(Pram, res = "data.frame")) library("ggplot2") ggplot(Prdf, aes(y = allele, x = frequency)) + geom_point() + facet_grid(locus ~ ., scale = "free_y", space = "free") ## Round Robin allele frequencies by population (matrix only) # By default, allele frequencies will be corrected by 1/n per population (Prbp <- rraf(Pram, by_pop = TRUE)) # This might be problematic because populations like PistolRSF_OR has a # population size of four. # By using the 'e' argument to rare_allele_correction, this can be set to a # more reasonable value. (Prbp <- rraf(Pram, by_pop = TRUE, e = 1/nInd(Pram))) ## End(Not run)
data(Pram) # Round robin allele frequencies, correcting zero-valued frequencies to 1/nInd(Pram) rraf(Pram) ## Not run: ## Round robin allele frequencies will be different than observed # Compare to without round robin: PrLoc <- seploc(Pram, res = "mat") # get locus by matrix lapply(PrLoc, colMeans, na.rm = TRUE) # Without round robin, clone corrected: Pcc <- clonecorrect(Pram, strata = NA) # indiscriminantly clone correct PccLoc <- seploc(Pcc, res = "mat") lapply(PccLoc, colMeans, na.rm = TRUE) ## Different methods of obtaining round robin allele frequencies # Get vector output. rraf(Pram, res = "vector") # Getting the output as a data frame allows us to use ggplot2 to visualize (Prdf <- rraf(Pram, res = "data.frame")) library("ggplot2") ggplot(Prdf, aes(y = allele, x = frequency)) + geom_point() + facet_grid(locus ~ ., scale = "free_y", space = "free") ## Round Robin allele frequencies by population (matrix only) # By default, allele frequencies will be corrected by 1/n per population (Prbp <- rraf(Pram, by_pop = TRUE)) # This might be problematic because populations like PistolRSF_OR has a # population size of four. # By using the 'e' argument to rare_allele_correction, this can be set to a # more reasonable value. (Prbp <- rraf(Pram, by_pop = TRUE, e = 1/nInd(Pram))) ## End(Not run)
This function will mask each locus one by one and then calculate multilocus genotypes from the remaining loci in a round-robin fashion. This is used for calculating the round robin allele frequencies for pgen and psex.
rrmlg(gid)
rrmlg(gid)
gid |
a genind, genclone, or loci object. |
a matrix of multilocus genotype assignments by masked locus. There will be n rows and m columns where n = number of samples and m = number of loci.
Zhian N. Kamvar, Jonah Brooks, Stacy A. Krueger-Hadfield, Erik Sotka
Arnaud-Haond, S., Duarte, C. M., Alberto, F., & Serrão, E. A. 2007. Standardizing methods to address clonality in population studies. Molecular Ecology, 16(24), 5115-5139.
Parks, J. C., & Werth, C. R. 1993. A study of spatial features of clones in a population of bracken fern, Pteridium aquilinum (Dennstaedtiaceae). American Journal of Botany, 537-544.
# Find out the round-robin multilocus genotype assignments for P. ramorum data(Pram) pmlg_rr <- rrmlg(Pram) head(pmlg_rr) ## Not run: # You can find out how many unique genotypes are found without each locus: colSums(!apply(pmlg_rr, 2, duplicated)) ## End(Not run)
# Find out the round-robin multilocus genotype assignments for P. ramorum data(Pram) pmlg_rr <- rrmlg(Pram) head(pmlg_rr) ## Not run: # You can find out how many unique genotypes are found without each locus: colSums(!apply(pmlg_rr, 2, duplicated)) ## End(Not run)
Genlight objects can contain millions of loci. Since it does not make much sense to calculate the index of association over that many loci, this function will randomly sample sites to calculate the index of association.
samp.ia(x, n.snp = 100L, reps = 100L, threads = 1L, quiet = FALSE)
samp.ia(x, n.snp = 100L, reps = 100L, threads = 1L, quiet = FALSE)
x |
a [genlight][genlight-class] or [snpclone][snpclone-class] object. |
n.snp |
the number of snps to be used to calculate standardized index of association. |
reps |
the number of times to perform the calculation. |
threads |
The maximum number of parallel threads to be used within this function. A value of 0 (default) will attempt to use as many threads as there are available cores/CPUs. In most cases this is ideal. A value of 1 will force the function to run serially, which may increase stability on some systems. Other values may be specified, but should be used with caution. |
quiet |
if 'FALSE', a progress bar will be printed to the screen. |
The index of association is a summary of linkage disequilibrium among many loci. More information on the index of association can be found associated with the funciton [ia()]. A value near or at zero indicator of linkage equilibrium, whereas values significantly greater than zero indicate linkage disequilibrium. However, if the observed variance in distance among individuals is less than the expected, mildly negative values may be observed (as the range of this index is negative one to one). This function will call the function [bitwise.ia()] for 'reps' times to calculate the index of association over 'n.snp' loci. The standardized index of association ('rbarD') will be calculated 'reps' times. These esitmates of linkage disequilibrium from random genomic fractions can then be summarized (e.g., using a histogram) as an estimate of genome-wide linkage disequilibrium.
This function currently only works for objects of class genlight or snpclone that are of a single ploidy level and that ploidy is either haploid or diploid.
Index of association representing the samples in this genlight object.
this will calculate the standardized index of association from Agapow 2001. See [ia()] for details.
Zhian N. Kamvar, Jonah C. Brooks
[genlight][genlight-class], [snpclone][snpclone-class], [win.ia()], [ia()], [bitwise.dist()] [bitwise.ia()]
# with structured snps assuming 1e4 positions set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 5e2, n.snp.struc = 5e2, ploidy = 2, parallel = FALSE) position(x) <- sort(sample(1e4, 1e3)) res <- samp.ia(x) hist(res, breaks = "fd") # with unstructured snps assuming 1e4 positions set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 1e3, ploidy = 2) position(x) <- sort(sample(1e4, 1e3)) res <- samp.ia(x) hist(res, breaks = "fd")
# with structured snps assuming 1e4 positions set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 5e2, n.snp.struc = 5e2, ploidy = 2, parallel = FALSE) position(x) <- sort(sample(1e4, 1e3)) res <- samp.ia(x) hist(res, breaks = "fd") # with unstructured snps assuming 1e4 positions set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 1e3, ploidy = 2) position(x) <- sort(sample(1e4, 1e3)) res <- samp.ia(x) hist(res, breaks = "fd")
genclone
or
genind
object independently over each locus.Shuffle individuals in a genclone
or
genind
object independently over each locus.
shufflepop(pop, method = 1)
shufflepop(pop, method = 1)
pop |
|
method |
an integer between 1 and 4. See details below. |
This function will shuffle each locus in the data set independently of one another, rendering them essentially unlinked. The following methods are available to shuffle your data:
Permute Alleles This will redistribute all alleles in the sample throughout the locus. Missing data is fixed in place. This maintains allelic structure, but heterozygosity is variable.
Parametric Bootstrap This will redistribute available alleles within the locus based on their allelic frequencies. This means that both the allelic state and heterozygosity will vary. The resulting data set will not have missing data.
Non-Parametric Bootstrap This will shuffle the allelic state for each individual. Missing data is fixed in place.
Multilocus Style Permutation This will shuffle the genotypes at each locus, maintaining the heterozygosity and allelic structure.
a genclone
or genind
object
shuffled by a specified method
Zhian N. Kamvar
Paul-Michael Agapow and Austin Burt. 2001. Indices of multilocus linkage disequilibrium. Molecular Ecology Notes, 1(1-2):101-102
# load the microbov dataset data(microbov) # Let's look at a single population for now. Howsabout Zebu Zebu <- popsub(microbov, "Zebu") summary(Zebu) # Take note of the Number of alleles per population and the Observed # heterozygosity as we go through each method. # Permute Alleles: maintain allelic state; heterozygosity varies. summary(shufflepop(Zebu, method=1)) ## Not run: # Parametric Bootstrap: do not maintain allelic state or heterozygosity summary(shufflepop(Zebu, method=2)) # Non-Parametric Bootstrap: do not maintain allelic state or heterozygosity. summary(shufflepop(Zebu, method=3)) # Multilocus Style: maintain allelic state and heterozygosity. summary(shufflepop(Zebu, method=4)) ## End(Not run)
# load the microbov dataset data(microbov) # Let's look at a single population for now. Howsabout Zebu Zebu <- popsub(microbov, "Zebu") summary(Zebu) # Take note of the Number of alleles per population and the Observed # heterozygosity as we go through each method. # Permute Alleles: maintain allelic state; heterozygosity varies. summary(shufflepop(Zebu, method=1)) ## Not run: # Parametric Bootstrap: do not maintain allelic state or heterozygosity summary(shufflepop(Zebu, method=2)) # Non-Parametric Bootstrap: do not maintain allelic state or heterozygosity. summary(shufflepop(Zebu, method=3)) # Multilocus Style: maintain allelic state and heterozygosity. summary(shufflepop(Zebu, method=4)) ## End(Not run)
This function will test for consistency in the sense that all alleles are able to be represented as discrete units after division and rounding.
test_replen(gid, replen)
test_replen(gid, replen)
gid |
|
replen |
a numeric vector of repeat motif lengths. |
This function is modified from the version used in doi:10.5281/zenodo.13007.
a logical vector indicating whether or not the repeat motif length is consistent.
Zhian N. Kamvar
Zhian N. Kamvar, Meg M. Larsen, Alan M. Kanaskie, Everett M. Hansen, & Niklaus J. Grünwald. Sudden_Oak_Death_in_Oregon_Forests: Spatial and temporal population dynamics of the sudden oak death epidemic in Oregon Forests. ZENODO, doi:10.5281/zenodo.13007, 2014.
Kamvar, Z. N., Larsen, M. M., Kanaskie, A. M., Hansen, E. M., & Grünwald, N. J. (2015). Spatial and temporal analysis of populations of the sudden oak death pathogen in Oregon forests. Phytopathology 105:982-989. doi: doi:10.1094/PHYTO-12-14-0350-FI
Ruzica Bruvo, Nicolaas K. Michiels, Thomas G. D'Souza, and Hinrich Schulenburg. A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level. Molecular Ecology, 13(7):2101-2106, 2004.
fix_replen
bruvo.dist
bruvo.msn
bruvo.boot
data(Pram) (Pram_replen <- setNames(c(3, 2, 4, 4, 4), locNames(Pram))) test_replen(Pram, Pram_replen)
data(Pram) (Pram_replen <- setNames(c(3, 2, 4, 4, 4), locNames(Pram))) test_replen(Pram, Pram_replen)
UPGMA clustering. Just a wrapper function around hclust
.
upgma(d)
upgma(d)
d |
A distance matrix. |
A phylogenetic tree of class phylo
.
Klaus Schliep [email protected]
library(ape) data(woodmouse) dm <- dist.dna(woodmouse) tree <- upgma(dm) plot(tree)
library(ape) data(woodmouse) dm <- dist.dna(woodmouse) tree <- upgma(dm) plot(tree)
Genlight objects can contain millions of loci. Since it does not make much sense to calculate the index of association over that many loci, this function will scan windows across the loci positions and calculate the index of association.
win.ia( x, window = 100L, min.snps = 3L, threads = 1L, quiet = FALSE, name_window = TRUE, chromosome_buffer = TRUE )
win.ia( x, window = 100L, min.snps = 3L, threads = 1L, quiet = FALSE, name_window = TRUE, chromosome_buffer = TRUE )
x |
|
window |
an integer specifying the size of the window. |
min.snps |
an integer specifying the minimum number of snps allowed per
window. If a window does not meet this criteria, the value will return as
|
threads |
The maximum number of parallel threads to be used within this
function. Defaults to 1 thread, in which the function will run serially. A
value of 0 will attempt to use as many threads as there are available
cores/CPUs. In most cases this is ideal for speed. Note: this option is
passed to |
quiet |
if |
name_window |
if |
chromosome_buffer |
DEPRECATED if |
A value of the standardized index of association for all windows in each chromosome.
this will calculate the standardized index of association from Agapow
and Burt, 2001. See ia()
for details.
Zhian N. Kamvar, Jonah C. Brooks
genlight, snpclone, ia()
, samp.ia()
, bitwise.dist()
# with structured snps assuming 1e4 positions set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 5e2, n.snp.struc = 5e2, ploidy = 2) position(x) <- sort(sample(1e4, 1e3)) res <- win.ia(x, window = 300L) # Calculate for windows of size 300 plot(res, type = "l") ## Not run: # unstructured snps set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 1e3, ploidy = 2) position(x) <- sort(sample(1e4, 1e3)) res <- win.ia(x, window = 300L) # Calculate for windows of size 300 plot(res, type = "l") # Accounting for chromosome coordinates set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 5e2, n.snp.struc = 5e2, ploidy = 2) position(x) <- as.vector(vapply(1:10, function(x) sort(sample(1e3, 100)), integer(100))) chromosome(x) <- rep(1:10, each = 100) res <- win.ia(x, window = 100L) plot(res, type = "l") # Converting chromosomal coordinates to tidy data library("dplyr") library("tidyr") res_tidy <- res %>% tibble(rd = ., chromosome = names(.)) %>% # create two column data frame separate(chromosome, into = c("chromosome", "position")) %>% # get the position info mutate(position = as.integer(position)) %>% # force position as integers mutate(chromosome = factor(chromosome, unique(chromosome))) # force order chromosomes res_tidy # Plotting with ggplot2 library("ggplot2") ggplot(res_tidy, aes(x = position, y = rd, color = chromosome)) + geom_line() + facet_wrap(~chromosome, nrow = 1) + ylab(expression(bar(r)[d])) + xlab("terminal position of sliding window") + labs(caption = "window size: 100bp") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + theme(legend.position = "top") ## End(Not run)
# with structured snps assuming 1e4 positions set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 5e2, n.snp.struc = 5e2, ploidy = 2) position(x) <- sort(sample(1e4, 1e3)) res <- win.ia(x, window = 300L) # Calculate for windows of size 300 plot(res, type = "l") ## Not run: # unstructured snps set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 1e3, ploidy = 2) position(x) <- sort(sample(1e4, 1e3)) res <- win.ia(x, window = 300L) # Calculate for windows of size 300 plot(res, type = "l") # Accounting for chromosome coordinates set.seed(999) x <- glSim(n.ind = 10, n.snp.nonstruc = 5e2, n.snp.struc = 5e2, ploidy = 2) position(x) <- as.vector(vapply(1:10, function(x) sort(sample(1e3, 100)), integer(100))) chromosome(x) <- rep(1:10, each = 100) res <- win.ia(x, window = 100L) plot(res, type = "l") # Converting chromosomal coordinates to tidy data library("dplyr") library("tidyr") res_tidy <- res %>% tibble(rd = ., chromosome = names(.)) %>% # create two column data frame separate(chromosome, into = c("chromosome", "position")) %>% # get the position info mutate(position = as.integer(position)) %>% # force position as integers mutate(chromosome = factor(chromosome, unique(chromosome))) # force order chromosomes res_tidy # Plotting with ggplot2 library("ggplot2") ggplot(res_tidy, aes(x = position, y = rd, color = chromosome)) + geom_line() + facet_wrap(~chromosome, nrow = 1) + ylab(expression(bar(r)[d])) + xlab("terminal position of sliding window") + labs(caption = "window size: 100bp") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + theme(legend.position = "top") ## End(Not run)