pahelix.datasets¶
bace_dataset¶
Processing of bace dataset.
It contains quantitative IC50 and qualitative (binary label) binding results for a set of inhibitors of human beta-secretase 1 (BACE-1). The data are experimental values collected from the scientific literature, covering 152 compounds together with their 2D structures and properties.
You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.
- pahelix.datasets.bace_dataset.get_default_bace_task_names()[source]¶
Get the default bace task names.
- pahelix.datasets.bace_dataset.load_bace_dataset(data_path, task_names=None)[source]¶
Load the bace dataset, process the classification labels and the input information.
Description:
The data file contains a csv table, in which columns below are used:
mol: The SMILES representation of the molecular structure;
pIC50: The negative log of the IC50 binding affinity;
class: The binary labels for inhibitor.
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_bace_dataset('./bace')
print(len(dataset))
References:
[1] Subramanian, Govindan, et al. “Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches.” Journal of chemical information and modeling 56.10 (2016): 1936-1949.
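As a rough illustration of the loading step described above, the sketch below picks the smiles column and a label column out of csv rows. The sample rows, the column spelling `Class`, and the dict layout are assumptions for illustration only, not the library's exact internals:

```python
import csv
import io

# Stand-in for the real bace csv file (column names assumed from the
# description above: mol, pIC50, Class).
SAMPLE_CSV = """mol,pIC50,Class
O1CC1,7.5,1
CCO,4.2,0
"""

def parse_bace_rows(text, task_names=('Class',)):
    """Sketch of the loader's core step: take the smiles column and the
    requested label columns from each csv row."""
    reader = csv.DictReader(io.StringIO(text))
    data_list = []
    for row in reader:
        entry = {'smiles': row['mol'],
                 'label': [float(row[name]) for name in task_names]}
        data_list.append(entry)
    return data_list

data = parse_bace_rows(SAMPLE_CSV)
print(len(data))          # 2
print(data[0]['smiles'])  # O1CC1
```

The resulting list of dicts mirrors the shape an InMemoryDataset instance manages.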
bbbp_dataset¶
Processing of Blood-Brain Barrier Penetration dataset
The Blood-brain barrier penetration (BBBP) dataset is extracted from a study on the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones and neurotransmitters. Thus, penetration of the barrier is a long-standing issue in the development of drugs targeting the central nervous system. This dataset includes binary labels for over 2000 compounds on their permeability properties.
You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.
- pahelix.datasets.bbbp_dataset.get_default_bbbp_task_names()[source]¶
Get the default bbbp task names and return the binary labels.
- pahelix.datasets.bbbp_dataset.load_bbbp_dataset(data_path, task_names=None)[source]¶
Load the bbbp dataset, process the classification labels and the input information.
Description:
The data file contains a csv table, in which columns below are used:
Num: Number of the compound
name: Name of the compound
smiles: SMILES representation of the molecular structure
p_np: Binary labels for penetration/non-penetration
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_bbbp_dataset('./bbbp')
print(len(dataset))
References:
[1] Martins, Ines Filipa, et al. “A Bayesian approach to in silico blood-brain barrier penetration modeling.” Journal of chemical information and modeling 52.6 (2012): 1686-1697.
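One detail the binary-label datasets above share is how raw label values get mapped for training. A common convention in multi-task molecular loaders (an assumption here, not confirmed by this page) maps 1 to 1 and 0 to -1 so that 0 can be reserved for missing measurements; a hypothetical sketch:

```python
# Hypothetical p_np label handling: 1 -> 1, 0 -> -1, with 0 reserved
# for missing entries. This convention is an assumption for
# illustration, not a documented pahelix guarantee.
def remap_binary_label(raw):
    if raw is None or raw == '':
        return 0          # missing measurement
    return 1 if int(raw) == 1 else -1

print([remap_binary_label(v) for v in ('1', '0', '')])  # [1, -1, 0]
```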
chembl_filtered_dataset¶
Processing of chembl filtered dataset.
The ChEMBL dataset contains 456K molecules with 1310 kinds of diverse and extensive biochemical assays. The database is unique because of its focus on all aspects of drug discovery and its size, containing information on more than 1.8 million compounds and over 15 million records of their effects on biological systems.
- pahelix.datasets.chembl_filtered_dataset.get_chembl_filtered_task_num()[source]¶
Get the number of tasks in the chembl filtered dataset.
- pahelix.datasets.chembl_filtered_dataset.load_chembl_filtered_dataset(data_path)[source]¶
Load the chembl_filtered dataset, process the classification labels and the input information.
Introduction:
Note that, in order to load this dataset, you should have the other datasets (bace, bbbp, clintox, esol, freesolv, hiv, lipophilicity, muv, sider, tox21, toxcast) downloaded. Since the chembl dataset may overlap with the datasets listed above, the overlapping test smiles are filtered out for a fair evaluation.
Description:
The data file contains a csv table with ID, SMILES/CTAB, InChI and InChIKey compound information, in which the column below is used:
smiles: SMILES representation of the molecular structure
- Parameters:
data_path (str) – the path to the cached npz file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_chembl_filtered_dataset('./chembl_filtered')
print(len(dataset))
References:
[1] Gaulton, A., et al. (2011). “ChEMBL: a large-scale bioactivity database for drug discovery”. Nucleic Acids Research. 40 (Database issue): D1100-7.
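The overlap filtering described above can be sketched as a simple set difference: drop any chembl molecule whose smiles also appears in a downstream test set, so the pre-training corpus never leaks evaluation molecules. The function name and sample smiles below are illustrative assumptions:

```python
# Sketch of the overlap filtering: keep only chembl molecules whose
# smiles never appear in the held-out test sets of the downstream
# datasets (bace, bbbp, ...).
def filter_overlaps(chembl_smiles, downstream_test_smiles):
    banned = set(downstream_test_smiles)
    return [s for s in chembl_smiles if s not in banned]

chembl = ['CCO', 'c1ccccc1', 'CC(=O)O']
held_out = ['CCO']  # e.g. a smiles reserved for a downstream test split
print(filter_overlaps(chembl, held_out))  # ['c1ccccc1', 'CC(=O)O']
```

Canonicalizing smiles before comparison (e.g. with RDKit) would make the match more robust, since the same molecule can have several smiles spellings.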
clintox_dataset¶
Processing of clintox dataset
The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures: (1) clinical trial toxicity (or absence of toxicity) and (2) FDA approval status. The list of FDA-approved drugs is compiled from the SWEETLEAD database, and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.
You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.
- pahelix.datasets.clintox_dataset.get_default_clintox_task_names()[source]¶
Get the default clintox task names and return the class labels.
- pahelix.datasets.clintox_dataset.load_clintox_dataset(data_path, task_names=None)[source]¶
Load the clintox dataset, process the classification labels and the input information.
Description:
The data file contains a csv table, in which columns below are used:
smiles: SMILES representation of the molecular structure
FDA_APPROVED: FDA approval status
CT_TOX: Clinical trial results
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_clintox_dataset('./clintox')
print(len(dataset))
References:
[1] Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. “A data-driven approach to predicting successes and failures of clinical trials.” Cell chemical biology 23.10 (2016): 1294-1301.
[2] Artemov, Artem V., et al. “Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes.” bioRxiv (2016): 095653.
[3] Novick, Paul A., et al. “SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery.” PloS one 8.11 (2013): e79568.
[4] Aggregate Analysis of ClinicalTrials.gov (AACT) Database. https://www.ctti-clinicaltrials.org/aact-database
davis_dataset¶
Processing of davis dataset
ddi_dataset¶
Processing of ddi dataset. The DDI dataset includes 23,052 drug-drug synergy pairs from 39 cell lines. You can download the dataset from http://www.bioinf.jku.at/software/DeepSynergy/labels.csv and load it into pahelix reader creators.
- pahelix.datasets.ddi_dataset.get_default_ddi_task_names()[source]¶
Get the default ddi task names and return the class label.
- pahelix.datasets.ddi_dataset.load_ddi_dataset(data_path, task_names=None, cellline=None)[source]¶
Load the ddi dataset, process the input information.
Description:
The data file contains a csv table, in which columns below are used:
drug_a_name: drug name
drug_b_name: drug name
cell_line: cell line which the drug pairs were tested on
synergy: continuous values representing the synergy effect; we use 30 as the threshold to binarize the data into binary labels, with 1 as positive and 0 as negative
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
cellline – the exact cell line model you want to test on.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_ddi_dataset('./ddi/raw')
print(len(dataset))
References:
[1] Drug-Drug Synergy Data. https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btx806/4747884
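The synergy binarization described above is a one-line thresholding step; a minimal sketch (whether the comparison at exactly 30 counts as positive is an assumption here, the text only names 30 as the threshold):

```python
# Sketch of the synergy binarization: continuous synergy scores are
# thresholded at 30, with 1 for positive and 0 for negative pairs.
# Strict ">" at the boundary is an assumption for illustration.
SYNERGY_THRESHOLD = 30.0

def binarize_synergy(scores, threshold=SYNERGY_THRESHOLD):
    return [1 if s > threshold else 0 for s in scores]

print(binarize_synergy([45.2, 12.0, 30.0, 88.1]))  # [1, 0, 0, 1]
```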
dti_dataset¶
Processing of DTI dataset. The DTI dataset was extracted from DrugCombDB. You can download the dataset from http://drugcombdb.denglab.org/download/drug_protein_links.rar and load it into pahelix reader creators.
- pahelix.datasets.dti_dataset.load_dti_dataset(data_path, task_names=None, featurizer=None)[source]¶
Load the dti dataset, process the input information and the featurizer.
Description:
The data file contains a tsv table, in which columns below are used:
chemical: drug name
protein: targeted protein name
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the tsv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_dti_dataset('./dti/raw')
print(len(dataset))
esol_dataset¶
Processing of esol dataset.
ESOL (also called the delaney dataset) is a standard regression data set. It contains the structure and water solubility data of 1128 compounds. It is a good choice for validating machine learning models that estimate solubility directly from the molecular structure encoded in a SMILES string.
You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.
- pahelix.datasets.esol_dataset.get_default_esol_task_names()[source]¶
Get the default esol task names and return the measured values.
- pahelix.datasets.esol_dataset.get_esol_stat(data_path, task_names)[source]¶
Return the mean and std of the labels.
- pahelix.datasets.esol_dataset.load_esol_dataset(data_path, task_names=None)[source]¶
Load the esol dataset, process the labels and the input information.
Description:
The data file contains a csv table, in which columns below are used:
smiles: SMILES representation of the molecular structure
Compound ID: Name of the compound
measured log solubility in mols per litre: Log-scale water solubility of the compound, used as label
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_esol_dataset('./esol')
print(len(dataset))
References:
[1] Delaney, John S. “ESOL: estimating aqueous solubility directly from molecular structure.” Journal of chemical information and computer sciences 44.3 (2004): 1000-1005.
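For regression sets like esol, `get_esol_stat` returns label statistics that are typically used to normalize targets before training and de-normalize predictions afterwards. A stand-alone sketch (using the population standard deviation is an assumption here; the page does not say which variant the library computes):

```python
import statistics

# Sketch of a label-statistics helper: mean and standard deviation of
# the regression labels, for target normalization.
def label_stat(labels):
    return {'mean': statistics.mean(labels),
            'std': statistics.pstdev(labels)}   # population std, assumed

stat = label_stat([-0.77, -3.3, -2.06, -1.33])
print(round(stat['mean'], 3))  # -1.865
```

A model then trains on `(y - mean) / std` and multiplies back at prediction time.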
freesolv_dataset¶
Processing of freesolv dataset.
The Free Solvation Database provides rich information: it contains both calculated and experimental values of the hydration free energy of small molecules in water. The calculated values are derived from alchemical free energy calculations in molecular dynamics simulations, while the experimental values come from the benchmark collection.
You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.
- pahelix.datasets.freesolv_dataset.get_default_freesolv_task_names()[source]¶
Get the default freesolv task names and return the measured expt values.
- pahelix.datasets.freesolv_dataset.get_freesolv_stat(data_path, task_names)[source]¶
Return the mean and std of the labels.
- pahelix.datasets.freesolv_dataset.load_freesolv_dataset(data_path, task_names=None)[source]¶
Load the freesolv dataset, process the input information and the featurizer.
Description:
The data file contains a csv table, in which columns below are used:
smiles: SMILES representation of the molecular structure
Compound ID: Name of the compound
expt: Measured hydration free energy of the compound, used as label.
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_freesolv_dataset('./freesolv')
print(len(dataset))
References:
[1] Mobley, David L., and J. Peter Guthrie. “FreeSolv: a database of experimental and calculated hydration free energies, with input files.” Journal of computer-aided molecular design 28.7 (2014): 711-720.
hiv_dataset¶
Processing of hiv dataset.
The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA) and confirmed moderately active (CM). We further combine the latter two labels, making it a classification task between inactive (CI) and active (CA and CM).
You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.
- pahelix.datasets.hiv_dataset.get_default_hiv_task_names()[source]¶
Get the default hiv task names and return the class label.
- pahelix.datasets.hiv_dataset.load_hiv_dataset(data_path, task_names=None)[source]¶
Load the hiv dataset, process the input information.
Description:
The data file contains a csv table, in which columns below are used:
smiles: SMILES representation of the molecular structure
activity: Three-class labels for screening results: CI/CM/CA.
HIV_active: Binary labels for screening results: 1 (CA/CM) and 0 (CI)
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_hiv_dataset('./hiv')
print(len(dataset))
References:
[1] AIDS Antiviral Screen Data. https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data
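The label merge described above (CA and CM collapsed into "active") is the whole of the HIV_active column; a minimal sketch:

```python
# Sketch of the screening-category merge: CA and CM count as active
# (1), CI as inactive (0), matching the HIV_active column described
# above.
def hiv_active(activity):
    return 1 if activity in ('CA', 'CM') else 0

print([hiv_active(a) for a in ('CI', 'CM', 'CA')])  # [0, 1, 1]
```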
inmemory_dataset¶
In-memory dataset.
- class pahelix.datasets.inmemory_dataset.InMemoryDataset(data_list=None, npz_data_path=None, npz_data_files=None)[source]¶
- Description:
The InMemoryDataset manages data_list, a list of data items in which each item is a dict of numpy ndarrays, and every dict has the same keys. It works like a list: you can call dataset[i] to get the i-th element of the data_list and call len(dataset) to get the length of the data_list. The data_list can be cached in npz files by calling dataset.save_data(data_path); after that, call InMemoryDataset(npz_data_path=data_path) to reload.
- data_list¶
a list of dict of numpy ndarray.
- Type:
list
Example
data_list = [{'a': np.zeros([4, 5])}, {'a': np.zeros([7, 5])}]
dataset = InMemoryDataset(data_list=data_list)
print(len(dataset))
dataset.save_data('./cached_npz')  # save data_list to ./cached_npz
dataset2 = InMemoryDataset(npz_data_path='./cached_npz')  # reload the saved data_list
print(len(dataset2))
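The save/reload cycle can be mimicked without pahelix. The stand-alone numpy sketch below round-trips a list of dicts of ndarrays through an npz file; the file name and the `a_0, a_1, ...` key layout are assumptions for illustration, since InMemoryDataset manages its own npz format internally:

```python
import os
import tempfile
import numpy as np

# A list of dicts of ndarrays, the shape of data InMemoryDataset manages.
data_list = [{'a': np.zeros([4, 5])}, {'a': np.zeros([7, 5])}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'part0.npz')
    # Flatten the list into keyed arrays: a_0, a_1, ... (layout assumed).
    np.savez(path, **{'a_%d' % i: d['a'] for i, d in enumerate(data_list)})

    # Reload and rebuild the list of dicts from the keyed arrays.
    loaded = np.load(path)
    reloaded = [{'a': loaded['a_%d' % i]} for i in range(len(data_list))]

print(len(reloaded))           # 2
print(reloaded[1]['a'].shape)  # (7, 5)
```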
- get_data_loader(batch_size, num_workers=4, shuffle=False, collate_fn=None)[source]¶
It returns a batch iterator which yields a batch of data. Firstly, a sub-list of data of size batch_size is drawn from the data_list; then the function collate_fn is applied to the sub-list to create a batch, which is yielded back. This process is accelerated by multiprocessing.
- Parameters:
batch_size (int) – the batch size of the batch data of each yield.
num_workers (int) – the number of workers used to generate batch data. Required by multiprocessing.
shuffle (bool) – whether to shuffle the order of the data_list.
collate_fn (function) – used to convert the sub-list of data_list to the aggregated batch data.
- Yields:
the batch data processed by collate_fn.
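Conceptually, the slicing-plus-collate step above can be sketched in pure Python (the real method additionally shuffles and parallelizes with worker processes, which this sketch omits):

```python
# Pure-Python sketch of the batching logic: slice data_list into
# batch_size chunks and run collate_fn on each chunk before yielding.
def simple_data_loader(data_list, batch_size, collate_fn=None):
    for start in range(0, len(data_list), batch_size):
        sub_list = data_list[start:start + batch_size]
        yield collate_fn(sub_list) if collate_fn else sub_list

data_list = [{'x': i} for i in range(5)]
batches = list(simple_data_loader(
    data_list, batch_size=2,
    collate_fn=lambda sub: [d['x'] for d in sub]))
print(batches)  # [[0, 1], [2, 3], [4]]
```

Note the final batch is smaller when len(data_list) is not a multiple of batch_size.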
kiba_dataset¶
Processing of kiba dataset
lipophilicity_dataset¶
Processing of lipophilicity dataset.
Lipophilicity is a dataset curated from the ChEMBL database containing experimental results on the octanol/water distribution coefficient (logD at pH 7.4). Since lipophilicity plays an important role in membrane permeability and solubility, related work deserves attention.
You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.
- pahelix.datasets.lipophilicity_dataset.get_default_lipophilicity_task_names()[source]¶
Get the default lipophilicity task names and return the measured values.
- pahelix.datasets.lipophilicity_dataset.get_lipophilicity_stat(data_path, task_names)[source]¶
Return the mean and std of the labels.
- pahelix.datasets.lipophilicity_dataset.load_lipophilicity_dataset(data_path, task_names=None)[source]¶
Load the lipophilicity dataset, process the input information.
Description:
The data file contains a csv table, in which columns below are used:
smiles: SMILES representation of the molecular structure
exp: Measured octanol/water distribution coefficient (logD) of the compound, used as label
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_lipophilicity_dataset('./lipophilicity')
print(len(dataset))
References:
[1] Hersey, A. ChEMBL Deposited Data Set - AZ dataset; 2015. https://doi.org/10.6019/chembl3301361
muv_dataset¶
Processing of muv dataset.
The Maximum Unbiased Validation (MUV) group is a benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis. The MUV dataset contains 17 challenging tasks for around 90,000 compounds and is specifically designed for validation of virtual screening techniques.
You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.
- pahelix.datasets.muv_dataset.get_default_muv_task_names()[source]¶
Get the default muv task names and return the measured results for bioassays.
- pahelix.datasets.muv_dataset.load_muv_dataset(data_path, task_names=None)[source]¶
Load the muv dataset, process the input information.
Description:
The data file contains a csv table, in which columns below are used:
smiles: SMILES representation of the molecular structure.
mol_id: PubChem CID of the compound.
MUV-XXX: Measured results (Active/Inactive) for bioassays.
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_muv_dataset('./muv')
print(len(dataset))
References:
[1] Rohrer, Sebastian G., and Knut Baumann. “Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data.” Journal of chemical information and modeling 49.2 (2009): 169-184.
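Multi-task tables like the MUV-XXX columns typically have gaps where a compound was never assayed. One common way to handle this (an assumption here; whether pahelix uses a mask or a sentinel value is not stated on this page) is a parallel validity mask so the loss can skip unmeasured entries:

```python
import math

# Hypothetical sketch: build a (labels, mask) pair where NaN entries
# become 0.0 with mask 0.0, so the training loss can ignore them.
def label_and_mask(raw_labels):
    labels = [0.0 if math.isnan(v) else v for v in raw_labels]
    mask = [0.0 if math.isnan(v) else 1.0 for v in raw_labels]
    return labels, mask

labels, mask = label_and_mask([1.0, float('nan'), 0.0])
print(labels, mask)  # [1.0, 0.0, 0.0] [1.0, 0.0, 1.0]
```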
ppi_dataset¶
Processing of PPI dataset. The PPI dataset was extracted from DrugCombDB. You can download the dataset from http://drugcombdb.denglab.org/download/protein_protein_links.rar and load it into pahelix reader creators.
- pahelix.datasets.ppi_dataset.load_ppi_dataset(data_path, task_names=None, featurizer=None)[source]¶
Load the ppi dataset, process the input information and the featurizer.
Description:
The data file is a txt file, in which the columns below are used:
protein1: protein1 name
protein2: protein2 name
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the txt file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_ppi_dataset('./ppi/raw')
print(len(dataset))
sider_dataset¶
Processing of sider dataset.
The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). The version of the SIDER dataset in DeepChem groups drug side effects into 27 system organ classes, following MedDRA classifications, measured for 1427 approved drugs.
You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.
- pahelix.datasets.sider_dataset.get_default_sider_task_names()[source]¶
Get the default sider task names and return the side effect results for the drug.
- pahelix.datasets.sider_dataset.load_sider_dataset(data_path, task_names=None)[source]¶
Load the sider dataset, process the input information.
Description:
The data file contains a csv table, in which columns below are used:
smiles: SMILES representation of the molecular structure.
Hepatobiliary disorders ~ Injury, poisoning and procedural complications: Recorded side effects for the drug
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_sider_dataset('./sider')
print(len(dataset))
References:
[1] Kuhn, Michael, et al. “The SIDER database of drugs and side effects.” Nucleic acids research 44.D1 (2015): D1075-D1079.
[2] Altae-Tran, Han, et al. “Low data drug discovery with one-shot learning.” ACS central science 3.4 (2017): 283-293.
[3] Medical Dictionary for Regulatory Activities. http://www.meddra.org/
[4] Please refer to http://sideeffects.embl.de/se/?page=98 for details on ADRs.
tox21_dataset¶
Processing of tox21 dataset.
The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways.
You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.
- pahelix.datasets.tox21_dataset.get_default_tox21_task_names()[source]¶
Get the default tox21 task names and return the bioassay results.
- pahelix.datasets.tox21_dataset.load_tox21_dataset(data_path, task_names=None)[source]¶
Load the tox21 dataset, process the input information.
Description:
The data file contains a csv table, in which columns below are used:
smiles: SMILES representation of the molecular structure.
NR-XXX: Nuclear receptor signaling bioassay results.
SR-XXX: Stress response bioassay results.
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_tox21_dataset('./tox21')
print(len(dataset))
References:
[1] Tox21 Challenge. https://tripod.nih.gov/tox21/challenge/
[2] Please refer to the links at https://tripod.nih.gov/tox21/challenge/data.jsp for details.
toxcast_dataset¶
Processing of toxcast dataset.
ToxCast is an extended data collection from the same initiative as Tox21, providing toxicology data for a large library of compounds based on in vitro high-throughput screening. The processed collection includes qualitative results of over 600 experiments on 8k compounds.
You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.
- pahelix.datasets.toxcast_dataset.get_default_toxcast_task_names(data_path)[source]¶
Get the default toxcast task names from the data file and return the list of the input information.
- pahelix.datasets.toxcast_dataset.load_toxcast_dataset(data_path, task_names=None)[source]¶
Load the toxcast dataset, process the input information.
Description:
The data file contains a csv table, in which columns below are used:
smiles: SMILES representation of the molecular structure.
ACEA_T47D_80hr_Negative ~ Tanguay_ZF_120hpf_YSE_up: Bioassay results
- Parameters:
data_path (str) – the path to the cached npz file.
task_names (list) – a list of header names to specify the columns to fetch from the csv file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_toxcast_dataset('./toxcast')
print(len(dataset))
References:
[1] Richard, Ann M., et al. “ToxCast chemical landscape: paving the road to 21st century toxicology.” Chemical research in toxicology 29.8 (2016): 1225-1251.
[2] Please refer to the section “high-throughput assay information” at https://www.epa.gov/chemical-research/toxicity-forecaster-toxcasttm-data for details.
zinc_dataset¶
Processing of ZINC dataset.
The ZINC database is a curated collection of commercially available chemical compounds prepared especially for virtual screening. ZINC15 is designed to bring together biology and chemoinformatics with a tool that is easy to use for nonexperts, while remaining fully programmable for chemoinformaticians and computational biologists.
- pahelix.datasets.zinc_dataset.load_zinc_dataset(data_path)[source]¶
Load the ZINC dataset, process the input information.
Description:
The data file contains a csv table, in which columns below are used:
smiles: SMILES representation of the molecular structure.
zinc_id: the id of the compound
- Parameters:
data_path (str) – the path to the cached npz file.
- Returns:
an InMemoryDataset instance.
Example
dataset = load_zinc_dataset('./zinc')
print(len(dataset))
References:
[1] Teague Sterling and John J. Irwin. ZINC 15 – ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11):2324–2337, 2015. doi: 10.1021/acs.jcim.5b00559. PMID: 26479676.
Helpful Link¶
Please refer to our GitHub repo to see the whole module.