Biomedical fields like public health, chemistry, and toxicology are heavily reliant on health informatics data. While we’re in a golden age of health data, its access, reuse, distribution, and discovery have become complex challenges.
1 - A Single Standard To Access All Data
BioBricks.ai streamlines data access through a command line tool that lets users install standardized databases, termed “bricks”. After installing and configuring biobricks, users can start installing bricks:
> pip install biobricks
> biobricks configure
> biobricks install clinvar
All BioBricks assets follow a consistent format. To retrieve data, simply:
- Check that the brick is available in the BioBricks GitHub repository.
- Install the required brick with
biobricks install <brickname>
Data retrieval shouldn’t be a maze of new APIs or a hunt for the right link. With BioBricks.ai, developers can install assets and start using them right away with common packages like arrow and pandas (Python) and tidyverse (R).
Imagine the simplicity if a 23andMe variant report could be combined with clinvar, a registry of clinically significant genetic variants, with just a command:
biobricks install clinvar
Our goal is to reduce the cost of exploration and make a wealth of valuable databases more accessible and usable.
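Once both tables are available locally, the combination itself is a join on a shared variant identifier. The following is a minimal sketch of that step, not BioBricks code: the rows, the `rsid` and `significance` column names, and the `annotate` helper are all made up for illustration, and the real 23andMe and ClinVar schemas may differ.

```python
# Illustrative sketch only: the rows and column names below are placeholders,
# not real 23andMe or ClinVar records.

# Stand-in for a personal variant report: one row per genotyped variant.
report = [
    {"rsid": "rs_example_1", "genotype": "AG"},
    {"rsid": "rs_example_2", "genotype": "CC"},
    {"rsid": "rs_example_3", "genotype": "TT"},
]

# Stand-in for a table loaded from an installed clinvar brick.
clinvar = [
    {"rsid": "rs_example_1", "significance": "example-label-A"},
    {"rsid": "rs_example_3", "significance": "example-label-B"},
]


def annotate(report_rows, clinvar_rows, key="rsid"):
    """Inner-join report rows with ClinVar-like rows on a shared identifier.

    Assumes at most one reference row per identifier; if there are several,
    the last one wins.
    """
    # Index the reference table by key so each lookup is a dict access.
    by_key = {row[key]: row for row in clinvar_rows}
    joined = []
    for row in report_rows:
        match = by_key.get(row[key])
        if match is not None:
            # Merge the two records; report fields win on name clashes.
            joined.append({**match, **row})
    return joined


annotated = annotate(report, clinvar)
for row in annotated:
    print(row["rsid"], row["genotype"], row["significance"])
```

With pandas, the same step would typically be a single `DataFrame.merge(..., on="rsid", how="inner")` call on the two loaded tables.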
2 - Data Reuse And Distribution With Data Dependencies
Data acquisition is pivotal in health informatics. Yet, maintaining and updating data assets is equally crucial. Current methods often lead to scripts that break or produce irreproducible results due to version inconsistencies.
BioBricks.ai introduces “data dependencies”. Users can develop bricks that reference other bricks, which keeps downstream data tied to known versions of its sources. For instance, the Chemharmony project uses this to integrate various databases, as seen on biobricks-ai/chemharmony.
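The idea of data dependencies can be illustrated the same way a package manager resolves code dependencies: record which bricks each brick reads from, pin a version for each, and install sources before the bricks built on them. The sketch below is a generic illustration using Python's standard `graphlib` module; the brick names, versions, and the `install_order` helper are hypothetical, and this is not a description of how BioBricks.ai resolves dependencies internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each brick lists the bricks it reads from.
# Names are placeholders, not real BioBricks metadata.
dependencies = {
    "harmonized_brick": {"source_brick_a", "source_brick_b"},
    "source_brick_b": {"source_brick_c"},
    "source_brick_a": set(),
    "source_brick_c": set(),
}

# Hypothetical pinned versions, so a rebuild always sees the same inputs.
pinned = {
    "harmonized_brick": "1.2.0",
    "source_brick_a": "0.9.1",
    "source_brick_b": "2.0.0",
    "source_brick_c": "1.0.3",
}


def install_order(deps):
    """Return bricks ordered so every dependency precedes its dependents."""
    # TopologicalSorter raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(deps).static_order())


order = install_order(dependencies)
for name in order:
    print(f"install {name}@{pinned[name]}")
```

The order among independent bricks may vary between runs, but a brick is never installed before the bricks it depends on.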
Just as modern programming leans on package managers like pip and npm for code collaboration, BioBricks.ai seeks to be the missing link for data assets.
3 - Data Discovery In The Age Of The Large Language Model
Anaconda's 2020 State of Data Science survey reported that 45% of data scientists' time is spent on data preparation. But how much time is spent merely searching for the data in the first place?
Despite the existence of vast public health databases, many remain underexplored. Large Language Models (LLMs), like ChatGPT, are transforming this landscape. They can generate semantic embeddings for databases, enabling more nuanced searches and interpretations.
We are working towards an LLM-powered data discovery engine. A demo is available at https://youtu.be/8EWdSB5vw2s.
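Embedding-based search boils down to turning each database description and the user's query into vectors, then ranking databases by vector similarity. The sketch below shows only that ranking step and rests on loud assumptions: a simple word-count vector stands in for a real LLM embedding (so it matches shared words, not meaning), and every catalog entry except the clinvar description is a hypothetical placeholder.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: a word-count vector.

    A real system would call an LLM embedding model here instead.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Brick name -> short description. Only the clinvar text comes from this
# article; the other two entries are hypothetical placeholders.
catalog = {
    "clinvar": "registry of clinically significant genetic variants",
    "example_tox_brick": "chemical toxicity assay results",
    "example_health_brick": "public health survey statistics",
}


def search(query, catalog, top_k=2):
    """Rank catalog entries by similarity to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(desc)), name)
              for name, desc in catalog.items()]
    scored.sort(reverse=True)
    # Drop entries with no overlap at all.
    return [name for score, name in scored[:top_k] if score > 0]


results = search("genetic variants linked to disease", catalog)
print(results)
```

Swapping `embed` for a real embedding model would leave the ranking logic unchanged while letting queries match descriptions that share meaning rather than exact words.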
Why Build BioBricks
Today’s methods of accessing health informatics data are fragmented and complex. This not only hinders developers but also weakens the very foundation of the biomedical ecosystem.
BioBricks.ai seeks to revolutionize this by proposing a unified, open-source platform. Instead of countless developers creating redundant and inconsistent data pipelines, BioBricks.ai offers a centralized repository. This approach not only ensures the development of top-tier, standardized data pipelines but also promotes efficient distribution and collaboration across the community.