What is it?
allelematch is a package of functions for the statistical programming language R. It performs matching and clustering of microsatellite or other multilocus genotype data. Typical applications of the package include finding unique individuals and identifying potential genotyping errors. The package is targeted at those working with large datasets and databases containing multiple samples of individuals, a situation that is common in non-invasive wildlife sampling applications. It has also proven useful for applications with clonal organisms. Matching and clustering both explicitly incorporate missing data, and can tolerate genotyping error.
Why use it?
We developed allelematch for working with large non-invasively sampled genetic data sets. Non-invasive sampling is usually done from fecal or hair samples, and typically the individual from which a genetic sample is obtained is not known. For some study systems, such data sets will consist of many repeat samples (or recaptures) of the same individual. In addition, samples from these source materials are often poor in quality, resulting in markers failing to amplify (i.e., missing data) or in amplification errors. The matching and clustering analyses in allelematch help to identify candidate individuals and to ensure data quality in such data sets.
What can it do?
Identify unique genotypes: Finding unique multilocus genotypes (candidate individuals) appears on the surface to be trivial: an exercise in sorting genotypes into identical groups. This has been the approach taken by other existing software, where the implicit assumptions are that there is high confidence in each genotype (often because multiple replicates of each sample have been used before a final genotype is declared), and that there is very little missing data. However, when we have missing data, and when we want to compensate for a potentially small number of errors in genotyping, the process of matching quickly becomes complicated. Available software packages are sensitive to small differences in genotypes when reporting matches.
We take a novel approach to this matching problem by combining three types of analyses: (1) Identify the dissimilarity between pairs of genotypes using our matching score metric; (2) Cluster this dissimilarity matrix using a standard hierarchical agglomerative clustering approach; and (3) Use a dynamic tree cutting approach that has been recently developed to identify patterns in gene coexpression data. This final step identifies groups of matching genotypes on the cluster dendrogram. Those genotypes that are not members of a cluster emerge as singletons.
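The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used by allelematch: the function names, the missing_match and cut_height parameters, the use of None for missing alleles, and the example genotypes are all assumptions made for this sketch. In particular, a simple fixed-height cut with average linkage stands in for the dynamic tree cutting step, which is considerably more sophisticated.

```python
from itertools import combinations


def dissimilarity(a, b, missing_match=0.5):
    """Step 1: one minus the fraction of allele columns that match.

    A column with a missing allele (None) counts as a partial match
    of size `missing_match` (an assumed parameter).
    """
    score = 0.0
    for x, y in zip(a, b):
        if x is None or y is None:
            score += missing_match
        elif x == y:
            score += 1.0
    return 1.0 - score / len(a)


def cluster_genotypes(genotypes, cut_height=0.3):
    """Steps 2 and 3: average-linkage agglomerative clustering that stops
    merging once the closest pair of clusters is farther apart than
    `cut_height` (a fixed cut, NOT dynamic tree cutting)."""
    n = len(genotypes)
    # Pairwise dissimilarities, keyed by (i, j) with i < j.
    d = {
        (i, j): dissimilarity(genotypes[i], genotypes[j])
        for i, j in combinations(range(n), 2)
    }

    def linkage(c1, c2):
        # Average dissimilarity between members of the two clusters.
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(d[p] for p in pairs) / len(pairs)

    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > cut_height:
            break  # remaining clusters are too dissimilar to merge
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical samples: three loci, two allele columns per locus.
samples = [
    [152, 156, 201, 201, 98, 102],    # sample 0
    [152, 156, 201, 201, None, 102],  # sample 0 again, one allele missing
    [152, 156, 201, 205, 98, 102],    # sample 0 again, one allele miscalled
    [148, 160, 197, 209, 94, 110],    # a different individual
]
groups = cluster_genotypes(samples)
print(groups)
```

With these made-up data, the first three samples fall into one group despite the missing and miscalled alleles, and the fourth sample is left on its own as a singleton.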
Pairwise matching of genotypes: Match multilocus genotypes in a focal set with genotypes in a second comparison set. This uses a matching score, which is a percentage similarity (based on the Hamming distance) between two genotypes, modified to take into account the possibility that a missing datum can match (given a specified proportion or probability of a match). This tool can be used to identify genotyping errors. Importantly, this matching score pays no attention to genetics, but rather declares matches on the superficial similarity of two rows of a matrix.
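A minimal sketch of such a matching score, written in Python for illustration, might look like the following. It is not the allelematch implementation: the names matching_score and pairwise_matches, the missing_match and threshold parameters and their default values, and the use of None for a missing allele are all assumptions made for this example.

```python
def matching_score(a, b, missing_match=0.5):
    """Fraction of allele columns on which two genotypes agree.

    A column where either allele is missing (None) contributes
    `missing_match` instead of 0 or 1.
    """
    if len(a) != len(b) or not a:
        raise ValueError("genotypes must be non-empty and of equal length")
    total = 0.0
    for x, y in zip(a, b):
        if x is None or y is None:
            total += missing_match
        elif x == y:
            total += 1.0
    return total / len(a)


def pairwise_matches(focal, comparison, threshold=0.8):
    """For each focal genotype, list the comparison genotypes whose
    matching score is at least `threshold`, as (index, score) pairs."""
    result = {}
    for name, genotype in focal.items():
        scored = [
            (idx, matching_score(genotype, other))
            for idx, other in enumerate(comparison)
        ]
        result[name] = [(idx, s) for idx, s in scored if s >= threshold]
    return result


# Hypothetical data: three loci, two allele columns per locus.
focal = {"hair_017": [152, 156, 201, 201, None, 102]}
comparison = [
    [152, 156, 201, 201, 98, 102],   # same genotype, fully typed
    [152, 156, 201, 205, 98, 102],   # differs at one allele
    [148, 160, 197, 209, 94, 110],   # unrelated genotype
]
matches = pairwise_matches(focal, comparison)
print(matches)
```

Note that the score treats each genotype purely as a row of values: it does not distinguish the two alleles within a locus or use allele frequencies, which mirrors the point above that matches are declared on superficial similarity rather than on genetics.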
How reliable are unique genotype identifications?
When identifying unique genotypes, allelematch performs well under a variety of common dataset conditions. Under some less frequent conditions, the software performs poorly. The correct identification of unique individuals depends largely on the following: (1) the average allelic diversity of the data set; (2) the missing data load of the data set; (3) the genotyping error rate; and, to a much lesser extent, (4) the frequency of sampling of a genotype. An extensive simulation experiment has been conducted using the software, and is published in Molecular Ecology Resources.