Chennai Mathematical Institute


3.30 pm, Seminar Hall
Are we analyzing microbiome data correctly? CoMicZ: A novel statistical framework to characterize microbial composition

Siddhartha Mandal
Department of Genes and Environment, Norwegian Institute of Public Health, Oslo, Norway.


Complex ensembles of microbial phylotypes (collectively 'microbiota') are critical components of many ecological systems. Microbiome data have several special features, and understanding the associations between various factors and microbial taxa composition requires a suitable probability model and a statistical framework. Current approaches either discount the underlying compositional structure in microbiome data or account for it inappropriately using ad hoc normalization procedures, both of which may result in incorrect interpretation of the data. In this talk, using synthetic data, we demonstrate that some of the existing methods, such as ZIG [Paulson et al., Nature Methods, 2013], which do not suitably account for the underlying structure, have unacceptably high false discovery rates (FDR) (as high as 40% in some cases) accompanied by very low power (as low as 5%). We also introduce a novel statistical framework (CoMicZ) that accounts for the structure and constraints in the data. The resulting methodology can be used for a broad range of analyses, such as comparisons of microflora between two or more populations, estimation of taxa compositions with associated trends, covariate-adjusted analyses, and prediction analysis (e.g. predicting disease status). Our simulation studies reveal that CoMicZ controls FDR at the desired nominal level (5%) with a substantial gain in power in comparison to ZIG (for example, 73% power for CoMicZ vs. only 21% for ZIG). Lastly, CoMicZ can also be used for model diagnostics such as detecting outliers and influential observations in the data. The proposed methodology is illustrated using publicly available 16S rRNA human gut microbiome data and 16S rRNA soil microbiome data.