Biomedical fields like public health, chemistry, and toxicology are heavily reliant on health informatics data. While we’re in a golden age of health data, its access, reuse, distribution, and discovery have become complex challenges.

BioBricks.ai is part of the Big Data Platform for toxicology created by the EU-funded Ontox Project, and part of NSF SBIR 2012214.

1 - A Single Standard To Access All Data

BioBricks.ai streamlines data access through a command-line tool that lets users install standardized databases, termed “bricks”. After installing and configuring biobricks, users can start pulling in bricks:

> pip install biobricks
> biobricks configure
> biobricks install clinvar
Get started with pip install biobricks, or read the docs at docs.biobricks.ai

All BioBricks assets follow a consistent format. To retrieve data, simply:

  1. Check that the brick is available in the BioBricks GitHub repository.
  2. Install the required brick with biobricks install <brickname>

Data retrieval shouldn’t be a maze of new APIs or a hunt for the right link. With BioBricks.ai, developers can simply install assets and start using them with common packages like arrow and pandas (Python) or the tidyverse (R).
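For example, loading an installed brick from Python takes only a few lines. This is a minimal sketch, assuming the biobricks package exposes an assets() helper that returns the paths of a brick’s parquet files; the asset name used here is hypothetical.

import biobricks as bb
import pandas as pd

# assets() is assumed to return an object whose attributes are the paths to the
# brick's parquet files; "variant_summary_parquet" is a hypothetical asset name.
clinvar = bb.assets("clinvar")
df = pd.read_parquet(clinvar.variant_summary_parquet)
print(df.shape)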

Imagine the simplicity if a 23andMe variant report could be combined with ClinVar, a registry of clinically significant genetic variants, after just one command: biobricks install clinvar.

In this demo we find enriched pathogenic variants in a 23andMe report by comparing it to ClinVar, start to finish in less than 20 minutes. youtu.be/i3gQuhMylfY
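The core of that comparison fits in a few lines of pandas. The sketch below is illustrative only; the file name, asset name, and column names are assumptions, not the exact schemas used in the demo.

import biobricks as bb
import pandas as pd

# A 23andMe raw export is a tab-separated file of rsid, chromosome, position,
# and genotype (the file name and column names here are illustrative).
report = pd.read_csv("genome_report.txt", sep="\t", comment="#",
                     names=["rsid", "chromosome", "position", "genotype"])

# Load the brick installed with `biobricks install clinvar`; the asset and
# column names below are hypothetical.
clinvar = pd.read_parquet(bb.assets("clinvar").variant_summary_parquet)
pathogenic = clinvar[clinvar["clinical_significance"].str.contains("Pathogenic", na=False)]

# Join on rsid to flag variants in the report that ClinVar calls pathogenic.
hits = report.merge(pathogenic, on="rsid", how="inner")
print(hits[["rsid", "genotype", "clinical_significance"]])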

Our goal is to reduce the cost of exploration and make a wealth of valuable databases more accessible and usable.

2 - Data Reuse And Distribution With Data Dependencies

Data acquisition is pivotal in health informatics. Yet, maintaining and updating data assets is equally crucial. Current methods often lead to scripts that break or produce irreproducible results due to version inconsistencies.

BioBricks.ai introduces “data dependencies”: users can develop bricks that reference other bricks, ensuring data integrity and version consistency. For instance, the Chemharmony project harnesses this by integrating various databases, as seen at biobricks-ai/chemharmony.

BioBricks treats databases like dependencies so that you can build bricks that depend on other bricks youtu.be/qrCP3UVJ7tE
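To make the idea concrete, a downstream brick’s build script can load its dependencies through biobricks and write a derived asset of its own. The sketch below is illustrative, not the actual chemharmony pipeline; the asset, column, and output names are hypothetical.

import biobricks as bb
import pandas as pd

# Load an asset from a brick this brick depends on (hypothetical asset name).
clinvar = pd.read_parquet(bb.assets("clinvar").variant_summary_parquet)

# Build a derived table and write it out as this brick's own parquet asset,
# which other bricks can in turn depend on (hypothetical column and path).
variants_per_gene = (clinvar.groupby("gene_symbol")
                            .size()
                            .reset_index(name="n_variants"))
variants_per_gene.to_parquet("brick/variants_per_gene.parquet")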

Just as modern programming leans on package managers like pip and npm for code collaboration, BioBricks.ai seeks to be the missing link for data assets.

3 - Data Discovery In The Age Of The Large Language Model

The 2020 anaconda.com State of Data Science survey revealed that 45% of data scientists’ time is spent on data preparation. But how much time is spent merely searching for the data in the first place?

Despite the existence of vast public health databases, many remain underexplored. Large Language Models (LLMs), like ChatGPT, are transforming this landscape: they can generate semantic embeddings for databases, enabling more nuanced searches and interpretations.
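As a toy illustration of that idea, not our discovery engine itself, an off-the-shelf embedding model is enough to rank brick descriptions against a natural-language question; the descriptions below are made up for the example.

from sentence_transformers import SentenceTransformer, util

# Short, human-written descriptions of a few bricks (illustrative text only).
bricks = {
    "clinvar": "Clinically significant genetic variants and their interpretations.",
    "tox21": "High-throughput screening assays measuring chemical toxicity.",
    "chemharmony": "Harmonized chemical property and activity data from many sources.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = model.encode(list(bricks.values()), convert_to_tensor=True)
query = model.encode("which database links chemicals to toxic effects?", convert_to_tensor=True)

# Rank bricks by semantic similarity between the question and each description.
scores = util.cos_sim(query, corpus)[0]
for name, score in sorted(zip(bricks, scores.tolist()), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")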

We are working towards an LLM-powered data discovery engine; a demo is available at https://youtu.be/8EWdSB5vw2s.

Talk to data w/ ChatGPT and BioBricks youtu.be/8EWdSB5vw2s

Why Build BioBricks

Today’s methods of accessing health informatics data are fragmented and complex. This not only hinders developers but also weakens the very foundation of the biomedical ecosystem.

BioBricks.ai seeks to revolutionize this by proposing a unified, open-source platform. Instead of countless developers creating redundant and inconsistent data pipelines, BioBricks.ai offers a centralized repository. This approach not only ensures the development of top-tier, standardized data pipelines but also promotes efficient distribution and collaboration across the community.