A Data Ecosystem For Toxicology
Today BioBricks.ai supports 50+ public health databases, many of them closely related to toxicology. During my PhD I had to work with many of these databases, and I found it difficult to learn which ones to use and how to access them, let alone analyse them. BioBricks is an answer to that problem. To demonstrate, let’s pull in eight databases.
biobricks install toxcast # 0.7 GB
biobricks install tox21 # 1.3 GB
biobricks install chembl # 2.6 GB
biobricks install ice # 0.1 GB
biobricks install chemharmony # 18 GB
biobricks install pubmed # 93 GB
biobricks install toxrefdb # 0.5 GB
biobricks install bindingdb # 1.7 GB
That is roughly 118 gigabytes of compressed public health data, available as fast as you can download it. Admittedly, most of that (93 GB) belongs to pubmed. This shows that most public health assets are not that large. Rather than relying on APIs that require maintenance and may have uptime issues, it is often better to just download the whole asset.
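As a quick check on the total, the sizes from the install comments can be summed directly. This is only arithmetic on the eight numbers listed above; the variable names are illustrative.

```r
# Compressed sizes in GB, copied from the comments on the install commands.
brick_gb <- c(
  toxcast = 0.7, tox21 = 1.3, chembl = 2.6, ice = 0.1,
  chemharmony = 18, pubmed = 93, toxrefdb = 0.5, bindingdb = 1.7
)

total_gb     <- sum(brick_gb)
pubmed_share <- brick_gb[["pubmed"]] / total_gb

cat(sprintf("total: %.1f GB, pubmed share: %.0f%%\n",
            total_gb, 100 * pubmed_share))
# prints: total: 117.9 GB, pubmed share: 79%
```

Without pubmed, the remaining seven bricks come to about 25 GB.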
Let’s take a look at some of that data.
Exploring In Vitro Data With Toxcast
Exploring each of these sources online would be time-consuming. Even programmatic tools like the NCBI eutils are very slow for accessing pubmed. Now that these data sources are installed, we can quickly explore them and search for our in vitro data.
Toxcast is a good place to start looking for in vitro data.
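library(biobricks); library(dplyr)  # assumes bbassets() comes from the biobricks R package
toxcast <- bbassets('toxcast')$invitrodb_parquet |>
  arrow::open_dataset() |> collect() |>
  # hitc is the hit call: 0 becomes "negative", any other value "positive"
  mutate(Response = ifelse(hitc == 0, "negative", "positive")) |>
  select(Chemical = chnm, Assay = aenm, Response)
toxcast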
# A tibble: 3,982,544 × 3
# Chemical Assay Response
# 1 2-Methylbenzamide TOX21_TSHR_HTRF_wt_ratio negative
# 2 1H,1H,5H-Perfluoropentanol BSK_IMphg_SRB.Mphg_up negative
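With the data in a three-column table, a natural next step is to summarise hit calls per chemical. The sketch below runs on a small invented tibble that only borrows the column names of the `toxcast` table above; the rows and the `hit_rates` name are made up for illustration.

```r
library(dplyr)

# Invented toy rows standing in for the real toxcast table.
toxcast_toy <- tibble(
  Chemical = c("chem A", "chem A", "chem B", "chem B", "chem B"),
  Assay    = c("assay_1", "assay_2", "assay_1", "assay_2", "assay_3"),
  Response = c("positive", "negative", "positive", "positive", "negative")
)

# Count assays and positive calls per chemical, then rank by hit rate.
hit_rates <- toxcast_toy |>
  group_by(Chemical) |>
  summarise(
    n_assays   = n(),
    n_positive = sum(Response == "positive"),
    hit_rate   = n_positive / n_assays
  ) |>
  arrange(desc(hit_rate))

hit_rates
```

The same pipeline should work unchanged on the real table by swapping `toxcast_toy` for `toxcast`.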
Learning About Our Assays With ICE
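library(purrr)  # for map()
# Load every table in the ICE brick into a named list of tibbles
icens <- bbassets("ice") |> map(~arrow::open_dataset(.x) |> collect())
assay <- icens$cHTSMT_ALL_invitrodb34_AssayAnnotation_parquet
# Keep annotations for assays that appear in toxcast and have a gene target
assay <- assay |> inner_join(toxcast, by = "Assay") |>
  filter(!is.na(Gene)) |>
  select(Assay, Source = `Assay Source`, Species, Tissue, Gene) |>
  distinct()
assay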
# A tibble: 1,459 × 5
# Assay Source Species Tissue Gene
# 1 ACEA_ER_80hr ACEA Biosciences human breast ESR1
# 2 APR_HepG2_Micro... Apredica human liver TUBA1A
# 3 APR_HepG2_Micro... Apredica human liver TUBA1A
# 4 APR_HepG2_Mitot... Apredica human liver H3F3A
We just joined data from one brick to another brick! Pretty cool. We couldn’t have done this without knowing where to look. Future posts will show how large language models can help us find all the data and do all the right joins. The ICE database is way bigger than this and contains a lot of interesting information, but we’ll stop here for now.
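When joining across bricks it helps to check how much actually matched. The sketch below uses invented toy tables (only the `Assay` and `Gene` column names come from the code above) to compare three dplyr joins: `inner_join()` followed by `distinct()`, `semi_join()` as a lighter way to reach the same rows, and `anti_join()` to list assays that found no annotation.

```r
library(dplyr)

# Invented toy rows; only the column names mirror the tables above.
results_toy <- tibble(
  Assay    = c("assay_1", "assay_1", "assay_2", "assay_3"),
  Chemical = c("chem A", "chem B", "chem A", "chem A")
)
annotation_toy <- tibble(
  Assay = c("assay_1", "assay_2", "assay_4"),
  Gene  = c("GENE1", NA, "GENE4")
)

# inner_join() repeats each annotation row once per matching result row,
# so distinct() is needed afterwards to get one row per assay.
via_inner <- annotation_toy |>
  inner_join(results_toy, by = "Assay") |>
  select(Assay, Gene) |>
  distinct()

# semi_join() keeps annotation rows with at least one match,
# without duplicating them or adding columns.
via_semi <- annotation_toy |> semi_join(results_toy, by = "Assay")

# anti_join() lists result assays that have no annotation at all.
unannotated <- results_toy |>
  anti_join(annotation_toy, by = "Assay") |>
  distinct(Assay)
```

On a results table with millions of rows, the `semi_join()` form avoids building the large intermediate table that `inner_join()` creates before `distinct()` collapses it.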
Now that we have gene names, we might want to start linking to more assets. The first place to look is the uniprot database, which provides standard identifiers used in many resources like chembl and bindingdb. The uniprot brick was not among the installs above, so it has to be installed first.
uniprotns <- bbassets('uniprot')
uniprot <- arrow::open_dataset(uniprotns$uniprot_sprot_parquet) |> collect()
df <- uniprot |> select(accession, name, gene.name.primary)
# Match UniProt entries to assays by gene symbol, then shuffle and show six rows
df <- df |> inner_join(assay, by = c("gene.name.primary" = "Gene")) |>
  distinct() |> sample_frac(1.0) |> head()
df
# A tibble: 6 × 6
# Assay Source Species Tissue Gene uniprot.accession
# 1 ERF_ENZ_hPRKAA1_dn Eurofins human NA AAPK1_P… Q09136
# 2 NVS_ENZ_hAMPKa1_Activator Novascreen human NA AAPK1_P… Q09136
# 3 NVS_ENZ_hAMPKa1 Novascreen human NA AAPK1_P… Q09136
# 4 NVS_ENZ_hAMPKa1 Novascreen human NA AAPK1_H… Q13131
# 5 NVS_ENZ_hAMPKa1_Activator Novascreen human NA AAPK1_H… Q13131
# 6 ERF_ENZ_hPRKAA1_dn Eurofins human NA AAPK1_H… Q13131
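In the sample above each assay appears with two different accessions, which is what a join on gene symbol alone tends to produce when several organisms share a symbol. One possible refinement, assuming the `name` column holds UniProt entry names (which end in a species mnemonic such as `_HUMAN`), is to filter by species before joining. The rows and accessions below are invented toy data; only the column names follow the code above.

```r
library(dplyr)

# Invented toy rows. Entry names follow the UniProt "PROTEIN_SPECIES" pattern;
# the accessions are placeholders, not real UniProt records.
uniprot_toy <- tibble(
  accession         = c("X00001", "X00002", "X00003"),
  name              = c("PROT1_HUMAN", "PROT1_PIG", "PROT2_HUMAN"),
  gene.name.primary = c("GENE1", "GENE1", "GENE2")
)
assay_toy <- tibble(
  Assay   = c("assay_1", "assay_2"),
  Species = c("human", "human"),
  Gene    = c("GENE1", "GENE2")
)

# Joining on gene symbol alone matches every species that shares the symbol.
all_species <- uniprot_toy |>
  inner_join(assay_toy, by = c("gene.name.primary" = "Gene"))

# Restricting to human entries first gives one accession per assay here.
human_only <- uniprot_toy |>
  filter(endsWith(name, "_HUMAN")) |>
  inner_join(assay_toy, by = c("gene.name.primary" = "Gene"))
```

Here `all_species` has three rows and `human_only` has two. For assays run in other species, the filter would need to follow the assay's `Species` column instead of a fixed suffix.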
We didn’t show toxrefdb, because there’s already a short post on it here.
ECETOC
- Use biodegradation data
- NAMs work
- scientists from industry, academia and regulatory bodies - we get quite a lot of traction to engage with scientists
TRA Task Force
- targeted risk assessment (TRA) tool measures exposure - Chesar: chemical safety assessment tool for ECHA
Integrated Approach for NAMs
- objective: identify opportunities for NAMs
Staged assessment for low tonnage chemicals
- objective: hazard identification with confidence
Smart in vivo studies
- make recommendations on how current OECD guideline studies could be amended.
Omics data interpretation framework for regulatory application
- objective: Task force is mapping ongoing activities to ensure appropriate niche is defined.
- Increasing desire to use omics data for next generation risk assessment
Does ECETOC have any American citizen members? Could I have a copy of those slides? How do we identify the best opportunities for collaborative development?
TODO: Share the office hours in January next Wednesday