Big Data Technology In the U.S. Government by Michael Erbschloe


Big Data to Knowledge at NIH

As biomedical tools and technologies rapidly improve, researchers are producing and analyzing an ever-expanding amount of complex biological data called “big data.” The Big Data to Knowledge (BD2K) program is a trans-NIH initiative that was launched in 2013 to support the research and development of innovative and transformative approaches and tools to maximize and accelerate the integration of big data and data science into biomedical research. The BD2K program also supported initial efforts toward making data sets “FAIR”: Findable, Accessible, Interoperable, and Reusable.

 

Big Data to Knowledge Phase I & II

In its first phase (FY2014-FY2017), BD2K invested $200 million in grant awards to address some major data science challenges and to stimulate data-driven discovery. It focused on facilitating broad use of biomedical big data, developing and disseminating analysis methods and software, enhancing training relevant for large-scale data analysis, and establishing centers of excellence for biomedical big data. These awards will continue through award end dates, and lessons from this initial investment will help inform the second phase of the program (FY2018-FY2021).

BD2K has now entered a second phase that will focus on making the products of research developed in Phase I usable, discoverable, and disseminated to their intended end-users. In addition, the program will continue to pursue approaches to making biomedical big data Findable, Accessible, Interoperable, and Reusable or “FAIR.” It will support the NIH Data Commons Pilot Phase, a trans-NIH initiative to test the feasibility of, and develop best practices for, making NIH-funded data sets and computational tools available through communal, collaborative platforms on public clouds.

 

BD2K Centers

BD2K funded 13 Centers of Excellence: large-scale projects that develop new approaches, methods, software tools, and related resources, and that provide training to advance Big Data science in the context of their biomedical area of focus. The Centers are located across the United States. Together with the other BD2K grantees, they function as a consortium and collaborate with one another to further every aspect of biomedical data science research.

 

Big Data for Discovery Science (BDDS)

Researchers at the BDDS focus on proteomics, genomics, and images of cells and brains collected from patients and subjects across the globe. They enable detection of patterns, trends, and relationships among these data for the efficient large-scale analysis of biomedical data.

 

BD2K-LINCS Data Coordination and Integration Center (BD2K-LINCS DCIC)

The BD2K-LINCS DCIC conducts data science research focused on perturbation-response data obtained from experiments with human cells and tissues, and provides access to and analysis of this data by the broader biomedical research community.

 

Center for Big Data in Translational Genomics (BDTG)

The BDTG creates data models and analysis tools for massive datasets of genomic information to uncover the contribution of gene variants to disease, with an initial focus on cancer.

 

Center for Causal Modeling and Discovery of Biomedical Knowledge from Big Data (CCD)

The CCD develops computational methods, known as causal discovery algorithms, that can be used to discover causal relationships from a combination of observational data, experimental data, and prior knowledge.

 

Center for Expanded Data Annotation and Retrieval (CEDAR)

Center for Expanded Data Annotation and Retrieval (CEDAR) is building new web-based technology to make it easier for biomedical scientists to author detailed metadata that describe their experiments completely, adhere to appropriate community-based standards, and incorporate controlled terms that facilitate interoperability with other online data sets.

 

Center for Mobility Data Integration to Insight (The Mobilize Center)

The Mobilize Center is analyzing movement data collected through a smartphone app from over 6 million individuals, revealing new insights about physical activity levels around the world and the factors predictive of these activity levels.

 

BD2K Centers Resources

The BD2K Centers have developed a wide array of tools and resources. A list of these resources is maintained by the BD2K Centers Coordination Center (BD2KCCC). The BD2KCCC helps to promote collaboration among the Centers and across the BD2K program, and coordinates BD2K Centers Consortium activities.

Resources from each BD2K Center were highlighted in the Summer 2017 special issue of Biomedical Computation Review.

Tools and resources are also available on the individual Centers’ resource pages:

 

BDDS

BD2K-LINCS DCIC

CCD

CEDAR

The Mobilize Center

CPCP

MD2K

ENIGMA

HeartBD2K

KnowEnG

PIC-SURE

 

 

The Big Data to Knowledge Cloud Credits Model

What is the Cloud Credits Model?

On-demand, cloud-based storage and computing resources are becoming more commonplace in biomedical research. However, direct acquisition of these services by investigators can be complicated and inefficient. As one step to help overcome these barriers, the NIH BD2K program has developed a Cloud Credits Model. This pilot project provides pre-paid credits that allow NIH-funded investigators to access modern, cloud-based technology to support their research objectives. Through the credits, NIH can test a business model for making data storage and analysis in the cloud more efficient.

The Cloud Credits Model is a component of the first phase of the Big Data to Knowledge (BD2K) program. This project supports BD2K’s aim to facilitate access to and use of biomedical big data. NIH expects that the lessons learned from the Cloud Credits Model will apply to second-phase BD2K initiatives, including the efforts of the NIH Data Commons Pilot Phase, which will make use of cloud technologies to advance biomedical research.

 

The Cloud Credits Model involves:

Using existing, centralized agreements with cloud vendors that conform to NIH requirements for capacity, security, accessibility, and other key criteria

Creating a marketplace of cloud resource providers and cloud resource consumers (investigators), to test the application of private sector operating approaches to a government-sponsored effort

Assessing the model for effectiveness and efficiency with respect to ease of use and cost

 

Participating investigators can select services from multiple cloud service providers that conform to NIH specifications. Cloud credits are requested through and issued by the Centers for Medicare & Medicaid Services (CMS) Alliance to Modernize Healthcare (CAMH) Federally Funded Research and Development Center (FFRDC), operated by The MITRE Corporation. The Cloud Credits Model is not connected to the NIH grants mechanism and does not impact investigators’ grants.

 

What does NIH hope to gain from the Cloud Credits Model?

NIH hopes to learn whether this model could be used in the future to acquire services and to compute on, store, and share biomedical big data at scale in the cloud. The primary goals are to create efficiencies for investigators, facilitate cost savings for NIH, and promote and share data generated by the community.

To date, the Cloud Credits Model has issued an “alpha” round of credits to 8 investigators to test the process of issuing credits, gaining access to cloud resources, and tracking usage. Over $100,000 in credits have been consumed, and reporting systems and dashboards have been created that inform participants as well as the NIH of spending by account and category.

Additionally, in preparation for issuance of additional credits to investigators, the Cloud Credits Model project team has built capacity for providing basic technical assistance across conformant cloud resources. Lastly, tools for capturing final outcomes, including a survey and post-implementation report, have been developed.

 

In summary, the Cloud Credits Model offers the following:

An “on-ramp” to the cloud for participating investigators

Training opportunities for the community and for NIH

Processing efficiencies (accounts, invoices, and other administration efforts)

Incentives for investigators to share the data, tools, and workflows that they create using credits

Insight into resource use to inform decisions about cost-sustainability

Experience working with cloud resource providers, including the ability to extend business relationships for future initiatives

 

Resource Indexing

To harness the full potential of Big Data, scientists must be able to readily find, cite, and access existing data and other digital objects, such as software. No existing infrastructure or incentive enables this. Meeting these basic goals would maximize data use, enable sharing, limit duplication of effort, and allow areas of sparse research coverage to be more readily identified. To advance the infrastructure and policies needed to meet these goals, awards in this area address the challenges of resource discovery, citation, and access.

The Data Discovery Index concept furthers BD2K’s goal of improving the sharing of biomedical data. It will enable researchers to make better use of what already exists. It will also allow them to produce datasets that complement existing data for greater analytic potential.

In 2014, BD2K awarded a Data Discovery Index Coordination Consortium grant to the bioCADDIE project. BD2K also made a series of Data Discovery Index Supplement awards. These grants permit existing NIH-funded projects to join the consortium activities.

 

Data Discovery Index Coordination Consortium (DDICC) Award

Biological and HealthCare Data Discovery and Indexing Ecosystem (bioCADDIE)

bioCADDIE seeks to develop a prototype Data Discovery Index (DDI) that will enable finding, accessing, and citing biomedical big data. bioCADDIE also has a community engagement mandate: to work with the broader biomedical community to better identify data and other digital objects, so that researchers can find shared data in ways that allow them to extract maximal knowledge.

www.biocaddie.org

 

Data Discovery Index (DDI) Supplement Awards

The Cardiovascular Research Grid

Johns Hopkins University

PI: Raimond Lester Winslow

Grant Number: 3R24HL085343-08S1

The Cardiovascular Research Grid (CVRG) Project is a national resource providing the capability to store, manage, and analyze data on the structure and function of the cardiovascular system in health and disease. The CVRG will develop new tools that will enhance the ability of researchers to explore and analyze their data to understand the cause and treatment of heart disease.

 

Computational tools for the analysis of high-throughput immunoglobulin sequencing

Yale University

PI: Steven H. Kleinstein

Grant Number: 1R01AI104739-01A1

This project will develop and validate computational methods to analyze large-scale immunology sequencing data sets. These methods will provide insights into the mechanisms underlying autoimmune disease, as well as biomarkers for susceptibility to infection or vaccination response.

 

Discovering and Applying Knowledge in Clinical Databases

Columbia University Health Sciences

PI: George M. Hripcsak

Grant Number: 3R01LM006910-15S1

This project uses data mining and knowledge engineering to study the electronic health record in order to better understand how health care processes cause systematic bias and other problems in the data, which complicate its incorporation into scientific studies. By avoiding or correcting those problems, we hope to improve reuse of the data for purposes such as clinical research and quality improvement.

 

fMRI-based Biomarkers for Multiple Components of Pain

University of Colorado

PI: Tor Dessart Wager

Grant Number: 3R01DA035484-02S1

Current treatments for pain are only modestly effective, in large part because pain is created through a complex set of brain processes and can be measured only by patients' self-reports, which presents a serious barrier to effective research and treatment. This project capitalizes on recent breakthroughs in measuring human brain activity and using it to objectively assess the brain processes that underlie pain experience, which could transform the way pain is measured and new treatments are developed.

 

Generation of a centralized and integrated resource for exposure data

North Carolina State University – Raleigh

PI: Carolyn J. Mattingly

Grant Number: 3R01ES019604-04S1

Most human diseases involve interactions between genetic and environmental factors; however, the basis of these complex interactions is not well understood. This project will enhance the capacity for prediction, analysis and interpretation of environment-disease networks by developing novel analysis and visualization tools that include exposure data. These tools will leverage the public Comparative Toxicogenomics Database (CTD), which aims to promote understanding about environment-disease relationships.

 

A Hub for the Nuclear Receptor Signaling Atlas

Baylor College of Medicine

PIs: Bert W. O’Malley, Ronald Evans, and Neil McKenna

Grant Number: 3U24DK097748-03S1

Nuclear receptors (NRs) and their coregulators are important therapeutic targets in many different disease states, including cancer, obesity, diabetes, inflammation, neurological disorders and senescent diseases. This project will produce an NR research community resource hub for information and data analysis tools and will provide community research grants to generate datasets to populate the hub. These initiatives will have tangible benefits for the progress of research in the field towards developing novel NR- and coregulator-based therapeutics.

 

Natural language processing for clinical and translational research

The Mayo Clinic – Rochester

PIs: Hongfang Liu, Serguei Pakhomov, and Hua Xu

Grant Number: 3R01GM102282-02S1

Rapid growth in the clinical implementation of large electronic medical records (EMRs) has led to an unprecedented expansion of datasets for clinical and translational research. This project will develop a novel natural language processing framework to enable the use of information embedded in clinical narratives for research.

 

Using Biomedical Knowledge to Identify Plausible Signals for Pharmacovigilance

Drexel University

PI: Andrew Robert Cohen

Grant Number: 5R01NS076709-04

This project will develop and evaluate methods to automatically identify biologically plausible adverse drug events found within clinical patient records, using knowledge extracted from the biomedical literature. If successful, these methods will provide the means for earlier detection of harmful drug effects, limiting consequent morbidity and mortality.

 

 

Funded Research

NIH Data Commons Pilot Phase (OT3) RM17-026

PI Name | Institution Name | Title
AHALT, STANLEY CARLTON | UNIV OF NORTH CAROLINA CHAPEL HILL | A Collaboration for the NIH Data Commons
BROWN, C TITUS | UNIVERSITY OF CALIFORNIA AT DAVIS | Tools and Workflows for Mining Genomic Data on Many Clouds
CROSAS, MERCE | HARVARD UNIVERSITY | Towards a FAIR Digital Ecosystem in the Cloud
DAVIS-DUSENBERY, BRANDI NICOLE | SEVEN BRIDGES GENOMICS, INC. | FAIR Data to Drive CURES
FOSTER, IAN | UNIVERSITY OF CHICAGO | A Commons Platform for Promoting Continuous FAIRness
KOHANE, ISAAC S | HARVARD MEDICAL SCHOOL | Patient-Centric Information Commons under FAIR Principles (PIC-FAIR)
MA'AYAN, AVI | ICAHN SCHOOL OF MEDICINE AT MOUNT SINAI | Development and Implementation Plan for Community Supported FAIR Guidelines and Metrics
OHNO-MACHADO, LUCILA (contact); FARCAS, CLAUDIU C; JIANG, XIAOQIAN; SANSONE, SUSANNA-ASSUNTA; XU, HUA | UNIVERSITY OF CALIFORNIA SAN DIEGO | CALIFORNIA: Cloud-agnostic Architecture to Locate Indexed FAIR Objects and safely Reuse them in New Integrated Analyses
PATEN, BENEDICT (contact); GROSSMAN, ROBERT L.; PHILIPPAKIS, ANTHONY | UNIVERSITY OF CALIFORNIA SANTA CRUZ | The Commons Alliance: A Partnership to Catalyze the Creation of an NIH Data Commons
WHITE, OWEN R | UNIVERSITY OF MARYLAND BALTIMORE | University of Maryland NIH Data Commons Facilitation Center

 

Administrative Supplements to Existing NIH Grants and Cooperative Agreements (Admin Supp) PA16-287

PI Name | Institution Name | Title
ABECASIS, GONCALO | UNIVERSITY OF MICHIGAN | Studies of Rare Genetic Variation in the Isolated Population of Sardinia
ARDLIE, KRISTIN (contact); GETZ, GAD | BROAD INSTITUTE, INC. | A portal and integrative collaborative analysis platform for GTEx
CHERRY, JOE MICHAEL | STANFORD UNIVERSITY | Genomic Resource for the Yeast Saccharomyces
PSATY, BRUCE M (contact); RICE, KENNETH M.; RICH, STEPHEN S. | UNIVERSITY OF WASHINGTON | Rare variants and NHLBI traits in deeply phenotyped cohorts

 

NIH Data Commons Pilot Support Services
Contract Number: HHSM500201200008I-HHSN276201700165U

 

Contractor Name | Title
MITRE Corporation | NIH Data Commons Pilot Support Services

 

Big Data to Knowledge (BD2K) Community-Based Data and Metadata Standards Efforts (R24) RFA-ES-16-010

PI Name | Institution Name | Title
DEUTSCH, ERIC | INSTITUTE FOR SYSTEMS BIOLOGY | Advancing data and metadata standards for proteomics mass spectra
PETERS, BJOERN (contact); MUNGALL, CHRISTOPHER J | LA JOLLA INST FOR ALLERGY & IMMUNOLOGY | Services to support the OBO foundry standards
SIM, IDA | UNIVERSITY OF CALIFORNIA, SAN FRANCISCO | Open mHealth: Community-Based Data and Metadata Standards for Mobile Health

 

BD2K Support for Meetings of Data Science Related Organizations (U13) RFA-CA-16-020

PI Name | Institution Name | Title
HAENDEL, MELISSA A (contact); ROBINSON, PETER NICHOLAS | OREGON HEALTH & SCIENCE UNIVERSITY | Forums for Integrative Phenomics

 

BD2K Research Education Curriculum Development: Data Science Overview for Biomedical Scientists (R25) RFA-ES-16-011

PI Name | Institution Name | Title
SARKAR, INDRA NEIL (contact); BROCK, JEFFREY; GATSONIS, CONSTANTINE A; ISTRAIL, SORIN C.; SANDSTEDE, BJORN | BROWN UNIVERSITY | Training and Teaching for Transforming Big Data to Knowledge

 

BD2K Enhancing Diversity in Biomedical Data Science (R25) RFA-MD-16-002

PI Name | Institution Name | Title
BAI, YONGSHENG (contact); COOMBES, KEVIN ROBERT; HUANG, KUN | INDIANA STATE UNIVERSITY | BD4ISU: Big Data for Indiana State University
GIANNOPOULOU, EVGENIA (contact); PATHAK, JYOTISHMAN | NEW YORK CITY COLLEGE OF TECHNOLOGY | City Tech-WCM Big Data Training Program in Biomedical Informatics
MARQUEZ-MAGANA, LETICIA MARIA (contact); AKOM, ANTWI AARON | SAN FRANCISCO STATE UNIVERSITY | HEART & SOUL: Enabling full representation in biomedical Big Data science