Coding notebook at Working with CPDB in python - Notebook

CPDB is a database about chemicals and cancer. In this post, we’ll use it to look up chemicals that cause cancer and learn some basic toxicology. To start, install biobricks and then install cpdb:

pipx install biobricks
biobricks configure # follow the prompts
biobricks install cpdb

CPDB, the Carcinogenic Potency Database, contains a bunch of tables about chemicals and their carcinogenic potency. We can use PySpark to explore it.

import biobricks as bb, pyspark
cpdb = bb.assets('cpdb')
spark = pyspark.sql.SparkSession.builder.getOrCreate()
for table in cpdb.__dict__.keys():
    print(table)
# table names include:
#   ncintp_parquet, ncntdose_parquet, tumor_parquet,
#   species_parquet, route_parquet, chemname_parquet
# several more ...

In this analysis we look at 6 tables:

Installing cpdb is as simple as `biobricks install cpdb`.

The ncintp_parquet table has a lot of columns, but we can focus on chemcode, species, route and td50. A TD50 describes the dose of a chemical at which 50% of a population will develop cancer. Those doses are measured in mg/kg/day. A TD50 of 1 mg/kg/day means half of a population of 1 kg animals given 1 mg of a chemical every day will develop cancer.
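
The unit arithmetic can be sketched in a few lines of Python. The helper name `daily_dose_mg` is ours for illustration and is not part of CPDB or biobricks; it simply multiplies a dose rate in mg/kg/day by a body weight in kg. The 0.25 kg body weight is a made-up example value.

```python
def daily_dose_mg(dose_mg_per_kg_day: float, body_weight_kg: float) -> float:
    """Convert a dose rate in mg/kg/day into an absolute daily dose in mg."""
    return dose_mg_per_kg_day * body_weight_kg


# A TD50 of 1 mg/kg/day for a 1 kg animal corresponds to 1 mg per day.
print(daily_dose_mg(1.0, 1.0))

# The same TD50 for a 0.25 kg animal (hypothetical weight) is 0.25 mg per day.
print(daily_dose_mg(1.0, 0.25))
```

Because the dose is normalized by body weight, the same TD50 implies a smaller absolute daily dose for a lighter animal.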

So let’s write some code to give us an idea of TD50 values for chemicals that cause cancer. We’ll focus on chemicals with a TD50 less than 1000 mg/kg/day. In general, we can consider these compounds carcinogenic.

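# Assumption: each cpdb asset attribute is a path to a parquet table
ncintp = spark.read.parquet(cpdb.ncintp_parquet)
species = spark.read.parquet(cpdb.species_parquet)
route = spark.read.parquet(cpdb.route_parquet)
# attach the species and route lookup tables to each record
df = ncintp\
    .join(species, 'species')\
    .join(route, 'route')

With this dataframe we can start investigating distributions of TD50 values. We start by looking at all chemicals with TD50 less than 1000 mg/kg/day in CPDB. For this analysis we used the minimum recorded TD50 value for each chemical.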

TD50s for all chemicals with TD50 less than 1000 mg/kg/day in CPDB
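
To make the "minimum recorded TD50 per chemical, then keep those under 1000 mg/kg/day" step concrete, here is a minimal plain-Python sketch. The records and the helper names `min_td50_by_chemical` and `below_threshold` are hypothetical and only illustrate the logic; they are not CPDB data or part of any library.

```python
# Hypothetical (chemcode, td50 in mg/kg/day) records, NOT real CPDB values.
records = [
    ("A", 12.0),
    ("A", 3.5),
    ("B", 2500.0),
    ("B", 1800.0),
    ("C", 0.02),
]


def min_td50_by_chemical(rows):
    """Return {chemcode: smallest td50 seen for that chemcode}."""
    best = {}
    for chemcode, td50 in rows:
        if chemcode not in best or td50 < best[chemcode]:
            best[chemcode] = td50
    return best


def below_threshold(minima, threshold=1000.0):
    """Keep only chemicals whose minimum TD50 is under the threshold."""
    return {code: td50 for code, td50 in minima.items() if td50 < threshold}


min_td50 = min_td50_by_chemical(records)
potent = below_threshold(min_td50)
print(potent)
```

Taking the minimum first means a chemical is kept if any of its recorded TD50 values falls under the cutoff.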

We can build these figures for different species:

TD50s for rats and mice in CPDB

Or for different routes:

TD50s for different administration routes in CPDB
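
A rough text-only view of such per-group distributions can be sketched by counting TD50 values in power-of-ten bins for each group. This assumes the values span several orders of magnitude; the rows below are hypothetical, and `decade_counts` is an illustrative helper, not a CPDB or biobricks function.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (group, td50 in mg/kg/day) pairs, NOT real CPDB values.
# "group" stands in for a species or route label.
rows = [
    ("rat", 0.5),
    ("rat", 42.0),
    ("rat", 87.0),
    ("mouse", 3.0),
    ("mouse", 650.0),
]


def decade_counts(pairs):
    """Count TD50 values per group in order-of-magnitude bins.

    Bin index k covers 10**k <= td50 < 10**(k + 1).
    """
    counts = defaultdict(Counter)
    for group, td50 in pairs:
        counts[group][math.floor(math.log10(td50))] += 1
    return counts


for group, bins in decade_counts(rows).items():
    print(group, dict(sorted(bins.items())))
```

The same function works for either grouping: pass species labels to compare rats and mice, or route labels to compare administration routes.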

Which of these chemicals has the lowest recorded TD50? Let’s take a look:

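from pyspark.sql import functions as F
# Assumption: each cpdb asset attribute is a path to a parquet table
chemname = spark.read.parquet(cpdb.chemname_parquet)
intp = spark.read.parquet(cpdb.ncintp_parquet)
# minimum recorded TD50 per chemical, then the ten lowest with their names
min_td50 = intp.groupBy('chemcode').agg(F.min('td50').alias('td50'))
min_td50.join(chemname, 'chemcode').sort(F.col('td50').asc()).select('name', 'td50').show(10, truncate=False)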

# +------------------------------------+-------+
# |name                                |td50   |
# +------------------------------------+-------+
# |HCDD MIXTURE                        |5.96E-4|
# |OZONE                               |0.0156 |
# |RIDDELLIINE                         |0.0267 |
# |THIO-TEPA                           |0.0332 |
# |OCHRATOXIN A                        |0.0579 |
# |LASIOCARPINE                        |0.102  |
# +------------------------------------+-------+

So there you go: by one measure, the most toxic compound in CPDB is 2,3,7,8-TETRACHLORODIBENZO-p-DIOXIN, or TCDD, with a TD50 of 1.21E-5 mg/kg/day. TCDD is a well-studied compound and is known to be highly toxic.