Assessment of factors affecting imputation accuracy
SNP density and sample size were considered as factors that could affect imputation accuracy. For each combination of dataset and imputation method, accuracy was averaged across the dataset versions NA10, NA30, NA50, NA70 and NA90, and this average is referred to as the imputation accuracy.
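The averaging step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the per-version accuracies are available as (dataset, method, version, accuracy) records, and the function name `mean_accuracy`, the labels `setA` and `methodX`, and the numeric values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Dataset version labels as named in the text.
VERSIONS = ("NA10", "NA30", "NA50", "NA70", "NA90")


def mean_accuracy(records):
    """Average accuracy over all versions for each (dataset, method) pair.

    records: iterable of (dataset, method, version, accuracy) tuples.
    Returns a dict mapping (dataset, method) to the mean accuracy.
    """
    grouped = defaultdict(dict)
    for dataset, method, version, accuracy in records:
        if version not in VERSIONS:
            raise ValueError(f"unexpected dataset version: {version}")
        grouped[(dataset, method)][version] = accuracy

    result = {}
    for key, by_version in grouped.items():
        # Refuse to average over an incomplete set of versions.
        missing = [v for v in VERSIONS if v not in by_version]
        if missing:
            raise ValueError(f"{key} lacks versions: {missing}")
        result[key] = mean(by_version[v] for v in VERSIONS)
    return result


# Made-up values for illustration only; these are not results from the study.
example = [
    ("setA", "methodX", version, accuracy)
    for version, accuracy in zip(VERSIONS, (0.5, 0.5, 0.75, 0.75, 1.0))
]
averaged = mean_accuracy(example)
```

Requiring all five versions before averaging is a design choice of this sketch; it keeps the averaged values comparable across dataset-method combinations.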
Indirect association as a result of linkage disequilibrium (LD) is a key factor in genetic association studies. Because of LD, a disease-susceptibility single-nucleotide polymorphism (SNP) need not be genotyped, as long as it is tagged by a SNP or set of SNPs that are genotyped. This concept has been further exploited by the introduction of methods to impute missing genotypes at untyped markers, based on known genotypes at typed markers and information about LD within the region from a reference panel [1, 2, 3, 4]. Such imputation methods can also be applied in the context of combining data across studies with different sets of correlated SNPs genotyped in different studies.
Two recent studies compared the imputation accuracy of several methods [5, 6]; however, these studies did not assess the performance of association tests based on the imputed genotypes. In this paper, we compare the performance of several imputation methods in two settings: combining two datasets that have been genotyped at different sets of markers, and analyzing markers that are completely missing (i.e., 'untyped'). Four commonly used software packages were evaluated: IMPUTE [2], MACH [4], PLINK [7], and fastPHASE [8]. Imputation error rates and the performance of association tests using the imputed data were compared. The Genetic Analysis Workshop (GAW) 16 Problem 1 dataset provided by the North American Rheumatoid Arthritis Consortium (NARAC) was used.
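As a concrete illustration of an imputation error rate, the sketch below uses one common definition: the proportion of genotypes whose imputed best-guess call differs from the true genotype. This definition is an assumption, since the passage does not specify how error rates were computed. The 0/1/2 allele-count coding, the use of `None` for an unknown true genotype, and the function name `imputation_error_rate` are likewise hypothetical.

```python
def imputation_error_rate(true_genotypes, imputed_genotypes):
    """Proportion of imputed calls that differ from the true genotype.

    Genotypes are coded as allele counts (0, 1 or 2); a true genotype of
    None is treated as unknown and excluded from the comparison.
    This discordance definition is an assumption of the sketch.
    """
    if len(true_genotypes) != len(imputed_genotypes):
        raise ValueError("genotype lists must have the same length")

    # Keep only positions where the true genotype is known.
    compared = [
        (truth, imputed)
        for truth, imputed in zip(true_genotypes, imputed_genotypes)
        if truth is not None
    ]
    if not compared:
        raise ValueError("no genotypes available for comparison")

    errors = sum(1 for truth, imputed in compared if truth != imputed)
    return errors / len(compared)


# Made-up genotypes for illustration only; not data from the study.
true_calls = [0, 1, 2, 1, None, 0]
imputed_calls = [0, 1, 1, 1, 2, 0]
error_rate = imputation_error_rate(true_calls, imputed_calls)
```

In this toy example, five positions have a known true genotype and one of them is imputed incorrectly.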