Is dementia research ready for big data approaches?
BMC Medicine volume 13, Article number: 145 (2015)
The “big data” paradigm has gained considerable attention recently, in particular in those areas of biomedicine where we face clear unmet medical needs. Coined as a new paradigm for complex problem solving, big data approaches seem to open promising perspectives, in particular for a better understanding of complex diseases such as Alzheimer’s disease and other dementias. In this commentary, we provide a brief overview of big data principles and the potential they may bring to dementia research, and - most importantly - we perform a reality check to answer the question of whether dementia research is ready for big data approaches.
According to Wikipedia, “Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set”. The wide use of the term has contributed to its fuzziness, and the usual IT marketing Newspeak has done its part, too. However, the original definition by the Gartner group names three aspects that constitute “big data”; we will use them to discuss big data in dementia research. Big data is characterized by:
Volume of data. The “data flood” that we experience in biomedicine, for example, through the enhanced capabilities of genome and transcriptome sequencing technologies, but also in the areas of neuroimaging and clinical data.
Velocity of data. Usually this “V” is associated with real-time data analysis (for example, from sensor networks or online trading), but at its core the concept also covers the heterogeneity of data with respect to time scales.
Variety of data. This is the most interesting aspect of big data when it comes to biomedicine. Linking heterogeneous data is a key step in big data analytics, and shared semantics for metadata annotation play an important role in data integration.
A fourth aspect has recently been added to the big data concept; it covers the issue of heterogeneous data quality and the need for curation:
Veracity of data. This concept deals with the need for quality assessment, pre-processing, and curation of data. This is of particular importance in biomedicine, where the quality of data and annotations varies widely.
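To make the veracity point concrete, the following minimal sketch shows what a basic curation pass might look like: records whose metadata are incomplete are flagged before they enter an integrative analysis. The field names and records are invented for illustration; real curation pipelines involve far richer quality criteria.

```python
# Sketch of a "veracity" curation pass: flag records with incomplete
# metadata before integration. Field names here are hypothetical.

REQUIRED_METADATA = {"sample_id", "diagnosis", "platform", "tissue"}

def curate(records):
    """Split records into curated and rejected sets by metadata completeness."""
    curated, rejected = [], []
    for rec in records:
        # A field counts as present only if it exists and is not None
        present = {k for k, v in rec.items() if v is not None}
        missing = REQUIRED_METADATA - present
        if missing:
            rejected.append((rec, sorted(missing)))
        else:
            curated.append(rec)
    return curated, rejected

records = [
    {"sample_id": "S1", "diagnosis": "AD", "platform": "RNAseq", "tissue": "hippocampus"},
    {"sample_id": "S2", "diagnosis": None, "platform": "RNAseq", "tissue": "cortex"},
]
curated, rejected = curate(records)
print(len(curated), len(rejected))  # 1 1
```

The point of such a pass is not to discard data, but to make quality explicit so that downstream analyses can weight or exclude records deliberately.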
Given the complexity of neurodegenerative diseases, it is not surprising that we observe significant heterogeneity of data in that scientific area. Heterogeneity of data concerns their mode (omics data, imaging data, clinical data → variety) as well as their quality (as determined by statistical criteria and with respect to available metadata → veracity). Big data analytics makes use of a wide range of data integration, modeling, and mining strategies in order to understand and predict the behavior of complex systems. Expectations regarding the complex problem-solving capabilities that come with big data approaches are high in the dementia research community. As a first big data challenge in Alzheimer’s research, the DREAM challenge was recently launched.
Promising advances, but a reality check is needed
Big data approaches differ substantially from established biostatistics approaches. Whereas biostatisticians try to keep the number of variables low and put great effort into controlling the experiment, big data analytics accepts as a fundamental premise the heterogeneity of data with respect to both quality and type. Big data approaches try to understand complex systems and to overcome gaps in the data through a virtuosic combination of data, knowledge, and imputation of missing values. Interoperability of data and knowledge is key to big data approaches and, therefore, shared semantics (for example, controlled vocabularies and ontologies) play a crucial role not only in metadata annotation, but also in data integration and information extraction procedures. In dementia research, we deal with a wide range of data coming from different levels: omics technologies produce large amounts of quantitative data (such as gene expression data) and qualitative data (for example, single nucleotide polymorphism, SNP, data); neuroimaging generates huge amounts of imaging data that require complex image analysis workflows to extract features usable in integrative modeling and mining approaches. Dementia research is therefore inherently multi-modal (having different modes of data acquisition) and multi-scalar, ranging from the molecular scale (omics) to the organism scale (neuroimaging; clinical and cognitive data) and the population scale (epidemiological data).
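The imputation of missing values mentioned above can be illustrated with a deliberately simple sketch: per-feature mean imputation over a small multi-modal feature table. The feature names are invented, and real big data pipelines use far more sophisticated, knowledge-driven imputation methods; this only shows the basic idea of filling gaps so that integrative analyses can proceed.

```python
# Illustrative sketch only: per-feature mean imputation over a small
# multi-modal table. Feature names are hypothetical.

def mean_impute(rows):
    """Replace None entries with the mean of observed values for that feature."""
    features = rows[0].keys()
    means = {}
    for f in features:
        observed = [r[f] for r in rows if r[f] is not None]
        means[f] = sum(observed) / len(observed)
    return [
        {f: (r[f] if r[f] is not None else means[f]) for f in features}
        for r in rows
    ]

rows = [
    {"gene_expr": 2.0, "hippocampal_volume": 3.1},
    {"gene_expr": None, "hippocampal_volume": 2.9},
    {"gene_expr": 4.0, "hippocampal_volume": None},
]
filled = mean_impute(rows)
print(filled[1]["gene_expr"])  # 3.0
```

In practice, imputation for heterogeneous biomedical data would draw on prior knowledge (for example, correlated biomarkers) rather than simple column means, but the interface - gaps in, completed table out - is the same.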
A reality check of the current situation in dementia research with respect to the adoption of big data principles leads us to the DREAM challenge, the first data analytics challenge in the area of research on Alzheimer’s disease. The tasks in this challenge were highly relevant for current Alzheimer’s research, but not overly complex: they aimed at identifying those features that were informative for the prediction of “cognitive scores 24 months after initial assessment” (subchallenge 1); to “predict the set of cognitively normal individuals whose biomarkers are suggestive of amyloid perturbation” (subchallenge 2); and to “classify individuals into diagnostic groups using MR images” (subchallenge 3).
The data sets made available for the DREAM challenge did not pose the challenge of harmonizing and curating them across different scales or modes of measurement. In that respect, the DREAM challenge on Alzheimer’s disease was not truly a big data challenge; it was rather a statistical quantitative data mining exercise without a substantial biological context or knowledge component to integrate. However, attempts to collect and centrally provision research data in the dementia arena (neuGrid4you; AETIONOMY; the Alzheimer’s & Dementia knowledge resource) force us to pay much more attention to the most important V’s: V(ariety) and V(eracity). Over the last three years, our team has spent considerable effort on the quality assessment and curation of all publicly available omics data in the area of neurodegenerative diseases. In order to represent the relevant knowledge in a computable form, we and other groups have generated models of disease [7, 8] that represent a good part of the knowledge about Alzheimer’s and Parkinson’s diseases and make this knowledge amenable to computer-based reasoning approaches. Together with curated omics data and additional efforts on clinical data, these models form the basis for the most comprehensive knowledge base on neurodegenerative diseases, the AETIONOMY knowledge base. This knowledge base will support future big data approaches in dementia research through harmonized annotations across heterogeneous data sets and tight integration of disease models, literature-based knowledge, and primary experimental data. Other resources currently being built include the AMP AD partnership; a dedicated collection of posts on big data in Alzheimer’s research can be found on the Alzheimer’s Today website.
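The harmonized annotations referred to above rest on shared semantics: raw metadata labels from different cohorts must be mapped onto one controlled vocabulary before data sets can be linked. The miniature vocabulary and synonym table below are hypothetical, standing in for the much larger ontologies used in practice.

```python
# Sketch of semantic harmonization: map free-text diagnosis labels from
# different cohorts onto a shared controlled vocabulary. The vocabulary
# here is an invented miniature, not a real ontology.

VOCAB = {
    "alzheimer's disease": "AD",
    "alzheimer disease": "AD",
    "ad": "AD",
    "mild cognitive impairment": "MCI",
    "mci": "MCI",
}

def harmonize(label):
    """Map a raw diagnosis label to its controlled-vocabulary code, or None."""
    return VOCAB.get(label.strip().lower())

print(harmonize("Alzheimer Disease"))  # AD
print(harmonize(" MCI "))              # MCI
```

Labels that fall outside the vocabulary return None and would be queued for manual curation - which is exactly where the V(eracity) effort described in the text is spent.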
Future trends and emerging technologies
A spectrum of currently emerging technologies will add to the V(olume) of data relevant for dementia research: next-generation sequencing (NGS), and in particular RNAseq technologies, will generate high-quality gene expression data; epigenetic studies in the dementia context will add an entire level of new information. Patient-specific iPS cells will be analyzed simultaneously for gene expression, epigenetic mechanisms, and proteomics (including pathway regulation). More data will also be generated at the clinical level: large observational studies (such as the Rhineland Study conducted by the German Center for Neurodegenerative Diseases, DZNE) will produce huge amounts of clinical and imaging data. The interpretation of data generated at various levels, including data from cell and tissue cultures or animal models, will require new approaches for integrative data analysis. Big data principles will therefore become even more relevant for integrative modeling and mining and for any sort of “systems approach” in dementia research. The computable Alzheimer’s disease model shown in Fig. 1 (a model representing cause-and-effect relationships under disease conditions) captures and represents knowledge about disease processes in Alzheimer’s disease and is ideally suited to support the enhanced interpretation of big data in Alzheimer’s research.
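The idea of a computable cause-and-effect model can be sketched in a few lines: signed edges (“increases”/“decreases”) let software check whether observed omics changes are consistent with a hypothesized upstream perturbation, in the spirit of reasoning over such models. The tiny graph below is invented for illustration and is not a real Alzheimer’s disease model.

```python
# Toy cause-and-effect model: signed edges allow one-step consistency
# checks between a hypothesized perturbation and observed data.
# Node names and edges are invented for illustration only.

EDGES = {  # upstream node -> {downstream node: sign of effect}
    "APP_processing": {"amyloid_beta": +1},
    "amyloid_beta": {"synaptic_function": -1, "inflammation": +1},
}

def predict(perturbed, direction):
    """Propagate a +1/-1 perturbation one step through the signed graph."""
    return {down: sign * direction
            for down, sign in EDGES.get(perturbed, {}).items()}

# Observed changes (e.g., from omics data), coded as up (+1) / down (-1)
observed = {"synaptic_function": -1, "inflammation": +1}

predicted = predict("amyloid_beta", +1)
matches = sum(1 for node, d in predicted.items() if observed.get(node) == d)
print(matches, "of", len(predicted), "predictions match")  # 2 of 2 predictions match
```

Real computable disease models encode hundreds of such relationships and support multi-step reasoning, but the principle - comparing model-derived predictions against measured data - is the same.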
We can expect that, in the near future, data production at all levels - from omics to the clinical and population levels - will increase in dementia research at the same rate that we observe in other indication areas. The need for interoperability of data (and knowledge!) will increase simultaneously, and substantial effort will be required to cope not only with the rapid growth of data volume, but also with the notorious lack of interoperability of data, information, and knowledge. We will see more ambitious mining scenarios in future big data challenges, and there is no doubt: integrative modeling and mining approaches will come, and they will have a strong impact on dementia research.
Wikipedia entry on big data. http://en.wikipedia.org/wiki/Big_data. Accessed 4 May 2015.
Gartner Group IT glossary. http://www.gartner.com/it-glossary/big-data. Accessed 4 May 2015.
The Alzheimer's Disease Big Data DREAM Challenge 1 website. https://www.synapse.org/#!Synapse:syn2290704. Accessed 4 May 2015.
N4U - neuGRID for you: expansion of neuGRID services and outreach to new user communities. https://neugrid4you.eu/. Accessed 4 May 2015.
Developing a “mechanism-based taxonomy” of Alzheimer's and Parkinson's Disease. http://www.aetionomy.eu/. Accessed 4 May 2015.
Alzheimer's Today: The continuing symposium on Alzheimer's and dementia research, care, policy and education. http://adjresourcenter.alzdem.com/. Accessed 4 May 2015.
Kodamullil AT, Younesi E, Naz M, Bagewadi S, Hofmann-Apitius M. Computable cause-and-effect models of healthy and Alzheimer’s disease states and their mechanistic differential analysis. Alzheimers Dement. 2015. doi:10.1016/j.jalz.2015.02.006.
Fujita KA, Ostaszewski M, Matsuoka Y, Ghosh S, Glaab E, Trefois C, et al. Integrating pathways of Parkinson’s disease in a molecular interaction map. Mol Neurobiol. 2014;49:88–102.
Catlett NL, Bargnesi AJ, Ungerer S, Seagaran T, Ladd W, Elliston KO, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics. 2013;14:340.
Access to the AETIONOMY knowledge base with curated and re-annotated data (to access the AETIONOMY knowledge base you will have to get registered first). http://aetionomy.scai.fhg.de/.
Alzheimer's Today: The continuing symposium on Alzheimer's and dementia research, care, policy and education - Big Data featured posts. http://adjresourcenter.alzdem.com/BigData. Accessed 4 May 2015.
Population health sciences in the German Center for Neurodegenerative Diseases (DZNE). Population Studies. http://www.dzne.de/en/research/research-areas/population-studies.html. Accessed 4 May 2015.
The author declares that he has no competing interests.
Martin Hofmann-Apitius is Professor of Applied Life Science Informatics, University of Bonn, Germany and Head of the Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany.
About this article
Cite this article
Hofmann-Apitius, M. Is dementia research ready for big data approaches?. BMC Med 13, 145 (2015). https://doi.org/10.1186/s12916-015-0367-7
- Big data
- Data interoperability
- Semantic harmonization
- Disease modeling
- Data mining
- Disease mechanisms