Analyzing the Clintox Dataset with MoleculeNet and Python

The Clintox dataset, part of the MoleculeNet suite, labels each drug compound with two binary properties: whether it is FDA approved (FDA_APPROVED) and whether it showed toxicity in clinical trials (CT_TOX). This post demonstrates how to analyze the Clintox dataset in Python using BioBricks and PySpark.

First, install BioBricks:

```bash
pip install biobricks
biobricks configure  # follow the prompts
```

Visit the BioBricks documentation for more information.

Next, install the MoleculeNet collection, which includes the Clintox dataset:

```bash
biobricks install MoleculeNet
```

This command makes the Clintox dataset accessible for analysis.
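
Before starting Spark, it can help to confirm that the data is where the script expects it. The following is a minimal sketch, not part of BioBricks: `require_dataset` is a hypothetical helper, and `brick/clintox.parquet` is simply the relative path used in the analysis code in this post, which may differ on your machine depending on how BioBricks was configured.

```python
from pathlib import Path


def require_dataset(path):
    """Return `path` as a Path object, or raise if nothing exists there.

    Hypothetical helper for illustration; it is not a BioBricks function.
    """
    p = Path(path)
    # A Parquet dataset may be a single file or a directory of part files,
    # so only existence is checked here, not the file type.
    if not p.exists():
        raise FileNotFoundError(
            f"{p} not found; check where BioBricks placed the Clintox data"
        )
    return p


if __name__ == "__main__":
    try:
        # Assumed location: the same relative path the analysis code reads.
        print("found", require_dataset("brick/clintox.parquet"))
    except FileNotFoundError as err:
        print(err)
```

Failing early with a clear message tends to be easier to debug than the longer stack trace Spark produces when the path is missing.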

Now, proceed with the data analysis using PySpark:

```python
from pyspark.sql import SparkSession, functions as F

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("ClintoxAnalysis").getOrCreate()

# Load the Clintox table and preview the first rows
clintox_data = spark.read.parquet('brick/clintox.parquet')
clintox_data.show(5)

# Count compounds for each (FDA_APPROVED, CT_TOX) combination and
# express each count as a share of the whole dataset
total_count = clintox_data.count()
fda_clinical_counts = clintox_data.groupBy('FDA_APPROVED', 'CT_TOX').count() \
    .withColumn("Proportion", F.col("count") / total_count)
fda_clinical_counts.show()
#+------------+------+-----+--------------------+
#|FDA_APPROVED|CT_TOX|count|          Proportion|
#+------------+------+-----+--------------------+
#|           1|     0| 1372|  0.9245283018867925|
#|           1|     1|   18|0.012129380053908356|
#|           0|     1|   94| 0.06334231805929919|
#+------------+------+-----+--------------------+
# FDA Approved but Clinically Toxic: 18
```
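
The proportions in the table above are shares of the whole dataset. A complementary view is the toxic share within each approval group. The sketch below works that out in plain Python from the three counts printed above, so it runs without Spark; `toxicity_rate` is a helper name introduced here for illustration only.

```python
# Counts copied from the groupBy output: (FDA_APPROVED, CT_TOX) -> count
counts = {
    (1, 0): 1372,
    (1, 1): 18,
    (0, 1): 94,
}


def toxicity_rate(counts, fda_approved):
    """Share of clinically toxic compounds within one FDA_APPROVED group."""
    group_total = sum(n for (fda, _), n in counts.items() if fda == fda_approved)
    toxic = sum(
        n for (fda, tox), n in counts.items() if fda == fda_approved and tox == 1
    )
    # Guard against an empty group rather than dividing by zero
    return toxic / group_total if group_total else float("nan")


total = sum(counts.values())
approved_rate = toxicity_rate(counts, 1)
unapproved_rate = toxicity_rate(counts, 0)

print(f"total compounds: {total}")                              # 1484
print(f"toxic share among FDA approved: {approved_rate:.4f}")   # 0.0129
print(f"toxic share among not approved: {unapproved_rate:.4f}") # 1.0000
```

In this output, every compound that is not FDA approved is labeled clinically toxic, and there is no row for compounds that are neither approved nor toxic. The two labels are therefore strongly coupled, which is worth keeping in mind before treating them as independent prediction targets.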

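With BioBricks handling installation and PySpark handling the aggregation, a few lines of code are enough to see how FDA approval status and clinical trial toxicity are distributed across the 1,484 compounds in the Clintox dataset, which makes it a practical starting point for drug safety assessment.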