Analyzing the Clintox Dataset with MoleculeNet and Python
The Clintox dataset, part of the MoleculeNet suite, labels each drug compound on two axes: whether it is FDA-approved and whether it showed toxicity in clinical trials. This post demonstrates how to analyze the Clintox dataset in Python using BioBricks and PySpark.
First, install BioBricks:
bash
pip install biobricks
biobricks configure # follow the prompts
Visit the BioBricks documentation for more information.
Next, install the MoleculeNet collection, which includes the Clintox dataset:
bash
biobricks install MoleculeNet
This command makes the Clintox dataset accessible for analysis.
Now, proceed with the data analysis using PySpark:
python
from pyspark.sql import SparkSession, functions as F
# Start (or reuse) a Spark session for this analysis
spark = SparkSession.builder.appName("ClintoxAnalysis").getOrCreate()
# Load the Clintox table and preview the first few rows
clintox_data = spark.read.parquet('brick/clintox.parquet')
clintox_data.show(5)
# Count compounds for each (FDA_APPROVED, CT_TOX) combination and
# express each count as a proportion of the whole dataset
total_count = clintox_data.count()
fda_clinical_counts = (
    clintox_data.groupBy('FDA_APPROVED', 'CT_TOX').count()
    .withColumn("Proportion", F.col("count") / total_count)
)
fda_clinical_counts.show()
#+------------+------+-----+--------------------+
#|FDA_APPROVED|CT_TOX|count|          Proportion|
#+------------+------+-----+--------------------+
#|           1|     0| 1372|  0.9245283018867925|
#|           1|     1|   18|0.012129380053908356|
#|           0|     1|   94| 0.06334231805929919|
#+------------+------+-----+--------------------+
# FDA Approved but Clinically Toxic: 18
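The proportions in the output above are all relative to the full dataset. As a rough sketch of how the same counts could be read conditionally, the plain-Python snippet below recomputes the totals and two conditional rates from the three counts printed by `show()`. The counts are copied by hand from that output, and the variable names are illustrative only; nothing here is part of BioBricks or PySpark.

```python
# Counts copied by hand from the groupBy output above:
# (FDA_APPROVED, CT_TOX) -> number of compounds
counts = {(1, 0): 1372, (1, 1): 18, (0, 1): 94}

# Total number of compounds; should match clintox_data.count()
total = sum(counts.values())

# Marginal totals for each label
fda_approved = sum(n for (fda, _), n in counts.items() if fda == 1)
clinically_toxic = sum(n for (_, tox), n in counts.items() if tox == 1)

# Conditional rates for the "approved but toxic" group
toxic_given_approved = counts[(1, 1)] / fda_approved
approved_given_toxic = counts[(1, 1)] / clinically_toxic

print(f"Total compounds:            {total}")             # 1484
print(f"FDA-approved compounds:     {fda_approved}")      # 1390
print(f"Clinically toxic compounds: {clinically_toxic}")  # 112
print(f"Toxic among approved:       {toxic_given_approved:.2%}")
print(f"Approved among toxic:       {approved_given_toxic:.2%}")
```

Note that the output contains no row with `FDA_APPROVED = 0` and `CT_TOX = 0`, so in this table every compound that is not FDA-approved is labeled clinically toxic.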
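With BioBricks handling installation and PySpark handling the aggregation, a few lines of code are enough to see how FDA approval and clinical-trial toxicity overlap in Clintox, including the 18 compounds that are approved yet flagged as toxic. That makes the dataset a practical starting point for drug safety assessment.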