Today BioBricks.ai supports 50+ public health databases. Many of those resources are strongly related to toxicology. During my PhD I had to work with many of these databases, and I found it difficult to learn which ones to use and how to access them, let alone analyse them. BioBricks is an answer to that problem. To demonstrate, let’s pull in 8 databases.

biobricks install toxcast # 0.7 GB
biobricks install tox21 # 1.3 GB
biobricks install chembl # 2.6 GB
biobricks install ice # 0.1 GB
biobricks install chemharmony # 18 GB
biobricks install pubmed # 93 GB
biobricks install toxrefdb # 0.5 GB
biobricks install bindingdb # 1.7 GB

Installing these 8 biobricks is quick and easy. Be warned: these are large assets that require a lot of disk space.

That is about 118 gigabytes of compressed public health data, available as fast as you can download it. Admittedly, most of that (93 GB) belongs to pubmed. This shows that most public health assets are not that large. Rather than relying on APIs that require maintenance and may have uptime issues, it is often better to just download the whole asset.
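
As a quick check on the disk space involved, the sizes listed next to the install commands can be added up in R. This is only a sketch: the numbers are copied by hand from the comments above, not queried from BioBricks.

```r
# Sizes in GB, copied from the comments on the install commands above
sizes_gb <- c(toxcast = 0.7, tox21 = 1.3, chembl = 2.6, ice = 0.1,
              chemharmony = 18, pubmed = 93, toxrefdb = 0.5, bindingdb = 1.7)

total_gb <- sum(sizes_gb)                        # all eight bricks together
pubmed_share <- sizes_gb[["pubmed"]] / total_gb  # fraction taken up by pubmed
rest_gb <- total_gb - sizes_gb[["pubmed"]]       # everything except pubmed

cat(sprintf("total: %.1f GB, pubmed: %.0f%%, everything else: %.1f GB\n",
            total_gb, 100 * pubmed_share, rest_gb))
```

Without pubmed, the remaining seven bricks come to roughly 25 GB, which fits comfortably on a laptop.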

Let’s take a look at some of that data.

Exploring In Vitro Data With ToxCast

Exploring each of these sources online would be time-consuming. Even programmatic tools like the NCBI E-utilities for accessing PubMed are very slow. Now that these data sources are installed, we can quickly explore them and search for our in vitro data.

ToxCast is a good place to start looking for in vitro data.

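library(biobricks) # bbassets() comes from the biobricks R package
library(dplyr)

# Load the invitrodb table; hitc == 0 is labelled negative, anything else positive
toxcast <- bbassets('toxcast')$invitrodb_parquet |>
    arrow::open_dataset() |> collect() |>
    mutate(Response = ifelse(hitc == 0, "negative", "positive")) |>
    select(Chemical = chnm, Assay = aenm, Response)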

# A tibble: 3,982,544 × 3
#   Chemical                   Assay                    Response
# 1 2-Methylbenzamide          TOX21_TSHR_HTRF_wt_ratio negative
# 2 1H,1H,5H-Perfluoropentanol BSK_IMphg_SRB.Mphg_up    negative
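
The code above collects all of the nearly 4 million rows into memory before transforming them. When only part of a table is needed, arrow can apply the filter and column selection before any rows reach R. The sketch below illustrates this on a tiny made-up table written to a temporary directory. The column names (chnm, aenm, hitc) are copied from the code above; the rows, the assay names and the path are invented for illustration.

```r
library(arrow)
library(dplyr)

# Toy stand-in for the invitrodb table; the rows are made up
toy <- tibble(
  chnm = c("Chemical A", "Chemical B", "Chemical C"),
  aenm = c("ASSAY_1", "ASSAY_1", "ASSAY_2"),
  hitc = c(0, 1, 1)
)
path <- file.path(tempdir(), "toy_invitrodb")
write_dataset(toy, path, format = "parquet")

# filter() and select() are translated by arrow and evaluated on the
# dataset, so only the matching rows and columns are read by collect()
hits <- open_dataset(path) |>
  filter(aenm == "ASSAY_1") |>
  select(Chemical = chnm, Assay = aenm, hitc) |>
  collect() |>
  mutate(Response = ifelse(hitc == 0, "negative", "positive"))

hits
```

The same pattern should carry over to a real brick by replacing `path` with the path returned by bbassets(), though that has not been tried here.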

To figure out what those assays are, you would normally need to go looking for another data resource, but I can tell you they are in vitro assays. One place we might look is the ICE database, which annotates some assays:

Learning About Our Assays With ICE

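library(purrr) # map()

# Read every table in the ICE brick into memory as a named list
icens <- bbassets("ice") |> map(~arrow::open_dataset(.x) |> collect())
assay <- icens$cHTSMT_ALL_invitrodb34_AssayAnnotation_parquet
# Keep annotated assays that also appear in toxcast and have a gene target
assay <- assay |> inner_join(toxcast, by = "Assay") |>
    filter(!is.na(Gene)) |>
    select(Assay, Source = `Assay Source`, Species, Tissue, Gene) |>
    distinct()
assay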
# A tibble: 1,459 × 5
#   Assay              Source           Species Tissue Gene  
# 1 ACEA_ER_80hr       ACEA Biosciences human   breast ESR1  
# 2 APR_HepG2_Micro... Apredica         human   liver  TUBA1A
# 3 APR_HepG2_Micro... Apredica         human   liver  TUBA1A
# 4 APR_HepG2_Mitot... Apredica         human   liver  H3F3A

We just joined data from one brick to another brick! Pretty cool. We couldn’t have done this without knowing where to look. Future posts will show how large language models can help us find all the data and do all the right joins. The ICE database is way bigger than this and contains a lot of interesting information, but we’ll stop here for now.
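
When joining bricks like this, it helps to check how much falls out of the join. The sketch below uses two small made-up tables (the assay and gene names are placeholders, not real ICE or ToxCast values) to show an inner join alongside an anti join that lists the assays with no annotation at all.

```r
library(dplyr)

# Made-up stand-ins for the toxcast results and the ICE annotations
results <- tibble(
  Assay    = c("ASSAY_1", "ASSAY_2", "ASSAY_3"),
  Response = c("negative", "positive", "positive")
)
annotations <- tibble(
  Assay = c("ASSAY_1", "ASSAY_2"),
  Gene  = c("GENE_A", NA)
)

# Rows present in both tables, keeping only assays with a gene target
annotated <- inner_join(results, annotations, by = "Assay") |>
  filter(!is.na(Gene))

# Rows of results that have no matching annotation row
unannotated <- anti_join(results, annotations, by = "Assay")

annotated
unannotated
```

Here ASSAY_1 survives the join, ASSAY_2 is dropped because its gene is missing, and ASSAY_3 shows up in `unannotated` because it has no annotation row.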

Now that we have gene names, we might want to start linking to more assets. The first place to look is the UniProt database, which provides standard identifiers that are used in many resources, such as ChEMBL and BindingDB. Note that uniprot was not in the install list above, so it needs to be installed first.

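uniprotns <- bbassets('uniprot')
uniprot <- arrow::open_dataset(uniprotns$uniprot_sprot_parquet) |> collect()
df <- uniprot |> select(accession, name, gene.name.primary)
# Link assays to UniProt entries by primary gene name
df <- df |> inner_join(assay, by = c("gene.name.primary" = "Gene")) |>
    # Match the printout below: its Gene column holds the UniProt entry name
    select(Assay, Source, Species, Tissue, Gene = name, uniprot.accession = accession) |>
    # Shuffle the rows and show the first six
    distinct() |> sample_frac(1.0) |> head()
df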
# A tibble: 6 × 6
#   Assay                     Source     Species Tissue Gene     uniprot.accession
# 1 ERF_ENZ_hPRKAA1_dn        Eurofins   human   NA     AAPK1_P… Q09136           
# 2 NVS_ENZ_hAMPKa1_Activator Novascreen human   NA     AAPK1_P… Q09136           
# 3 NVS_ENZ_hAMPKa1           Novascreen human   NA     AAPK1_P… Q09136           
# 4 NVS_ENZ_hAMPKa1           Novascreen human   NA     AAPK1_H… Q13131           
# 5 NVS_ENZ_hAMPKa1_Activator Novascreen human   NA     AAPK1_H… Q13131           
# 6 ERF_ENZ_hPRKAA1_dn        Eurofins   human   NA     AAPK1_H… Q13131
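
The printout above lists entries starting with both AAPK1_P… and AAPK1_H… for the same human assays, which suggests that joining on gene name alone pulls in entries from more than one species. UniProt entry names end in a species mnemonic (for example _HUMAN), so one possible way to restrict the match is to filter on that suffix. The sketch below uses made-up accessions, entry names and gene names purely for illustration; only the column names are taken from the code above.

```r
library(dplyr)

# Made-up UniProt-like rows: one gene with a human and a mouse entry
proteins <- tibble(
  accession         = c("X00001", "X00002"),
  name              = c("GENE1_HUMAN", "GENE1_MOUSE"),
  gene.name.primary = c("GENE1", "GENE1")
)
# Made-up assay annotation for a human assay
assays <- tibble(Assay = "ASSAY_1", Species = "human", Gene = "GENE1")

# Joining on gene name alone matches both species
all_matches <- inner_join(proteins, assays, by = c("gene.name.primary" = "Gene"))

# Keeping only entry names with the _HUMAN suffix leaves the human entry
human_matches <- all_matches |> filter(endsWith(name, "_HUMAN"))

human_matches
```

Whether this filter is appropriate depends on the assay: it assumes the Species column and the entry-name suffix should agree, which holds for this toy data but has not been checked against the real bricks.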

We didn’t show toxrefdb because there’s already a short post on it here.

ECETOC

Use biodegradation data
NAMs work
scientists from industry, academia and regulatory bodies - we get quite a lot of traction to engage with scientists

TRA Task Force

Targeted Risk Assessment (TRA) tool: measures exposure - Chesar tools, chemical safety assessment tool for ECHA

Integrated Approach for NAMs

objective: identify opportunities for NAMs

Staged assessment for low tonnage chemicals

objective: hazard identification with confidence

Smart in vivo studies

make recommendations on how current OECD guideline studies could be amended.

Omics data interpretation framework for regulatory application

objective: Task force is mapping ongoing activities to ensure appropriate niche is defined.
Increasing desire to use omics data for next generation risk assessment

Does ECETOC have any American citizen members? Could I have a copy of those slides? How do we identify the best opportunities for collaborative development?

TODO: Share the office hours in January next Wednesday