pahelix.datasets

bace_dataset

Processing of bace dataset.

It contains quantitative IC50 and qualitative (binary label) binding results for a set of inhibitors of human beta-secretase 1 (BACE-1). The data are experimental values collected from the scientific literature, covering 1513 compounds together with their 2D structures and properties.

You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.

pahelix.datasets.bace_dataset.get_default_bace_task_names()[source]

Get the default bace task names.

pahelix.datasets.bace_dataset.load_bace_dataset(data_path, task_names=None)[source]

Load the bace dataset, process the classification labels and the input information.

Description:

The data file contains a csv table, in which columns below are used:

mol: The smile representation of the molecular structure;

pIC50: The negative log of the IC50 binding affinity;

class: The binary labels for inhibitor.

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_bace_dataset('./bace')
print(len(dataset))
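As an illustration of what a loader like load_bace_dataset does with these columns, here is a self-contained sketch that parses a miniature csv with the mol, pIC50 and class headers described above. The two example rows are invented, and the real loader returns an InMemoryDataset rather than a plain list.

```python
import csv
import io

# Miniature stand-in for the bace csv; the rows are invented, but the
# headers match the "mol", "pIC50" and "class" columns described above.
CSV_TEXT = """mol,pIC50,class
O1CC1,7.5,1
CCO,4.2,0
"""

def parse_bace(text, task_names=("class",)):
    """Collect the smiles string and the requested label columns per row."""
    data_list = []
    for row in csv.DictReader(io.StringIO(text)):
        entry = {"smiles": row["mol"]}
        for name in task_names:
            entry[name] = float(row[name])
        data_list.append(entry)
    return data_list

data_list = parse_bace(CSV_TEXT)
print(len(data_list))  # 2
```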

References:

[1] Subramanian, Govindan, et al. “Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches.” Journal of chemical information and modeling 56.10 (2016): 1936-1949.

bbbp_dataset

Processing of Blood-Brain Barrier Penetration dataset

The Blood-brain barrier penetration (BBBP) dataset is extracted from a study on the modeling and prediction of the barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood-brain barrier blocks most drugs, hormones and neurotransmitters. Thus penetration of the barrier forms a long-standing issue in development of drugs targeting central nervous system. This dataset includes binary labels for over 2000 compounds on their permeability properties.

You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.

pahelix.datasets.bbbp_dataset.get_default_bbbp_task_names()[source]

Get the default bbbp task names and return the binary labels.

pahelix.datasets.bbbp_dataset.load_bbbp_dataset(data_path, task_names=None)[source]

Load the bbbp dataset, process the classification labels and the input information.

Description:

The data file contains a csv table, in which columns below are used:

Num: number

name: Name of the compound

smiles: SMILES representation of the molecular structure

p_np: Binary labels for penetration/non-penetration

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_bbbp_dataset('./bbbp')
print(len(dataset))

References:

[1] Martins, Ines Filipa, et al. “A Bayesian approach to in silico blood-brain barrier penetration modeling.” Journal of chemical information and modeling 52.6 (2012): 1686-1697.

chembl_filtered_dataset

Processing of chembl filtered dataset.

The ChEMBL dataset contains 456K molecules with 1310 kinds of diverse and extensive biochemical assays. The database is unique because of its focus on all aspects of drug discovery and its size, containing information on more than 1.8 million compounds and over 15 million records of their effects on biological systems.

pahelix.datasets.chembl_filtered_dataset.get_chembl_filtered_task_num()[source]

Get the number of tasks in the chembl_filtered dataset.

pahelix.datasets.chembl_filtered_dataset.load_chembl_filtered_dataset(data_path)[source]

Load the chembl_filtered dataset, process the classification labels and the input information.

Introduction:

Note that in order to load this dataset, you should have the other datasets (bace, bbbp, clintox, esol, freesolv, hiv, lipophilicity, muv, sider, tox21, toxcast) downloaded first. Since the chembl dataset may overlap with the datasets listed above, the overlapping smiles in their test sets are filtered out for a fair evaluation.
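The filtering described above can be pictured with a short sketch. The smiles strings and the downstream test set here are invented, and the real implementation may match on processed smiles or scaffolds rather than raw strings:

```python
# Smiles that appear in the test splits of the downstream datasets
# (bace, bbbp, ...); these particular strings are made up.
downstream_test_smiles = {"CCO", "c1ccccc1"}

# A toy stand-in for the chembl molecule list.
chembl_smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]

# Keep only molecules that do not overlap with any downstream test set.
filtered = [s for s in chembl_smiles if s not in downstream_test_smiles]
print(filtered)  # ['CCN', 'CC(=O)O']
```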

Description:

The data file contains a csv table, in which columns below are used:

It contains the ID, SMILES/CTAB, InChI and InChIKey compound information.

smiles: SMILES representation of the molecular structure

Parameters:

data_path (str) – the path to the cached npz file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_chembl_filtered_dataset('./chembl_filtered')
print(len(dataset))

References:

[1] Gaulton, A; et al. (2011). “ChEMBL: a large-scale bioactivity database for drug discovery”. Nucleic Acids Research. 40 (Database issue): D1100-7.

clintox_dataset

Processing of clintox dataset

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures: (1) clinical trial toxicity (or absence of toxicity) and (2) FDA approval status. The list of FDA-approved drugs is compiled from the SWEETLEAD database, and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.

You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.

pahelix.datasets.clintox_dataset.get_default_clintox_task_names()[source]

Get the default clintox task names.

pahelix.datasets.clintox_dataset.load_clintox_dataset(data_path, task_names=None)[source]

Load the clintox dataset, process the classification labels and the input information.

Description:

The data file contains a csv table, in which columns below are used:

smiles: SMILES representation of the molecular structure

FDA_APPROVED: FDA approval status

CT_TOX: Clinical trial results

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_clintox_dataset('./clintox')
print(len(dataset))

References:

[1] Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. “A data-driven approach to predicting successes and failures of clinical trials.” Cell chemical biology 23.10 (2016): 1294-1301.

[2] Artemov, Artem V., et al. “Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes.” bioRxiv (2016): 095653.

[3] Novick, Paul A., et al. “SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery.” PloS one 8.11 (2013): e79568.

[4] Aggregate Analysis of ClinicalTrials.gov (AACT) Database. https://www.ctti-clinicaltrials.org/aact-database

davis_dataset

Processing of davis dataset

pahelix.datasets.davis_dataset.load_davis_dataset(data_path, featurizer)[source]

tbd

ddi_dataset

Processing of ddi dataset. The DDI dataset includes 23,052 Drug-Drug Synergy pairs from 39 cell lines. You can download the dataset from http://www.bioinf.jku.at/software/DeepSynergy/labels.csv and load it into pahelix reader creators.

pahelix.datasets.ddi_dataset.get_default_ddi_task_names()[source]

Get the default ddi task names and return the class label.

pahelix.datasets.ddi_dataset.load_ddi_dataset(data_path, task_names=None, cellline=None)[source]

Load the ddi dataset and process the input information.

Description:

The data file contains a csv table, in which columns below are used:

drug_a_name: drug name

drug_b_name: drug name

cell_line: cell line which the drug pairs were tested on

synergy: continuous values representing the synergy effect; a threshold of 30 is used to binarize the data into binary labels, with 1 as positive and 0 as negative

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

  • cellline (str) – the exact cell line model you want to test on.

Returns:

an InMemoryDataset instance.

Example

dataset = load_ddi_dataset('./ddi/raw')
print(len(dataset))
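The thresholding described for the synergy column can be sketched as follows. Whether a score exactly at 30 counts as positive is an assumption here; the description only states that 30 is the cutoff.

```python
SYNERGY_THRESHOLD = 30  # threshold quoted in the description above

def binarize_synergy(score, threshold=SYNERGY_THRESHOLD):
    """Map a continuous synergy score to a binary label.

    Treating a score exactly at the threshold as positive is an
    assumption made for this sketch.
    """
    return 1 if score >= threshold else 0

print([binarize_synergy(s) for s in [45.3, 29.9, 30.0, -8.1]])  # [1, 0, 1, 0]
```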

References:

[1] Drug-Drug Synergy Data. https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btx806/4747884

dti_dataset

Processing of DTI dataset. The DTI dataset was extracted from DrugCombDB. You can download the dataset from http://drugcombdb.denglab.org/download/drug_protein_links.rar and load it into pahelix reader creators.

pahelix.datasets.dti_dataset.get_default_dti_task_names()[source]

Get the default dti task names.

pahelix.datasets.dti_dataset.load_dti_dataset(data_path, task_names=None, featurizer=None)[source]

Load the dti dataset, process the input information and apply the featurizer.

Description:

The data file contains a tsv table, in which columns below are used:

chemical: drug name

protein: targeted protein name

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the tsv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_dti_dataset('./dti/raw')
print(len(dataset))

esol_dataset

Processing of esol dataset.

ESOL is a standard regression dataset, also called the Delaney dataset. It contains the structure and water solubility data of 1128 compounds. It is a good choice for validating machine learning models that estimate solubility directly from the molecular structure encoded as a SMILES string.

You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.

pahelix.datasets.esol_dataset.get_default_esol_task_names()[source]

Get the default esol task names and return the measured values.

pahelix.datasets.esol_dataset.get_esol_stat(data_path, task_names)[source]

Return the mean and standard deviation of the labels.
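A typical use of these statistics is to standardize the regression labels before training; the label values below are invented stand-ins for the measured log solubility column:

```python
import numpy as np

# Invented stand-ins for the "measured log solubility" labels.
labels = np.array([-0.77, -3.30, -2.06, -7.87])

mean, std = labels.mean(), labels.std()
normalized = (labels - mean) / std  # zero mean, unit variance

# Predictions made in normalized space are mapped back with:
# prediction * std + mean
```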

pahelix.datasets.esol_dataset.load_esol_dataset(data_path, task_names=None)[source]

Load the esol dataset, process the labels and the input information.

Description:

The data file contains a csv table, in which columns below are used:

smiles: SMILES representation of the molecular structure

Compound ID: Name of the compound

measured log solubility in mols per litre: Log-scale water solubility of the compound, used as label

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_esol_dataset('./esol')
print(len(dataset))

References:

[1] Delaney, John S. “ESOL: estimating aqueous solubility directly from molecular structure.” Journal of chemical information and computer sciences 44.3 (2004): 1000-1005.

freesolv_dataset

Processing of freesolv dataset.

The Free Solvation Dataset provides rich information: it contains both calculated and experimental values for the hydration free energy of small molecules in water. The calculated values are derived from alchemical free energy calculations based on molecular dynamics simulations, while the experimental values are included in the benchmark collection.

You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.

pahelix.datasets.freesolv_dataset.get_default_freesolv_task_names()[source]

Get the default freesolv task names and return the measured experimental values.

pahelix.datasets.freesolv_dataset.get_freesolv_stat(data_path, task_names)[source]

Return the mean and standard deviation of the labels.

pahelix.datasets.freesolv_dataset.load_freesolv_dataset(data_path, task_names=None)[source]

Load the freesolv dataset and process the input information.

Description:

The data file contains a csv table, in which columns below are used:

smiles: SMILES representation of the molecular structure

Compound ID: Name of the compound

measured log solubility in mols per litre: Log-scale water solubility of the compound, used as label.

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_freesolv_dataset('./freesolv')
print(len(dataset))

References:

[1] Mobley, David L., and J. Peter Guthrie. “FreeSolv: a database of experimental and calculated hydration free energies, with input files.” Journal of computer-aided molecular design 28.7 (2014): 711-720.

[2] https://github.com/MobleyLab/FreeSolv

hiv_dataset

Processing of hiv dataset.

The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA) and confirmed moderately active (CM). We further combine the latter two labels, making it a classification task between inactive (CI) and active (CA and CM).

You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.

pahelix.datasets.hiv_dataset.get_default_hiv_task_names()[source]

Get the default hiv task names and return the class label.

pahelix.datasets.hiv_dataset.load_hiv_dataset(data_path, task_names=None)[source]

Load the hiv dataset and process the input information.

Description:

The data file contains a csv table, in which columns below are used:

smiles: SMILES representation of the molecular structure

activity: Three-class labels for screening results: CI/CM/CA.

HIV_active: Binary labels for screening results: 1 (CA/CM) and 0 (CI)

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_hiv_dataset('./hiv')
print(len(dataset))
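The merging of CA and CM into a single active class, as described above for the HIV_active column, can be sketched as:

```python
# CI is inactive; CA and CM are merged into the active class,
# matching the description of the HIV_active column above.
ACTIVITY_TO_LABEL = {"CI": 0, "CA": 1, "CM": 1}

screening_results = ["CI", "CA", "CM", "CI"]  # toy data
labels = [ACTIVITY_TO_LABEL[r] for r in screening_results]
print(labels)  # [0, 1, 1, 0]
```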

References:

[1] AIDS Antiviral Screen Data. https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data

inmemory_dataset

In-memory dataset.

class pahelix.datasets.inmemory_dataset.InMemoryDataset(data_list=None, npz_data_path=None, npz_data_files=None)[source]
Description:

InMemoryDataset manages data_list, a list of samples in which each sample is a dict of numpy ndarrays. All dicts share the same keys.

It works like a list: you can call dataset[i] to get the i-th element of the data_list and call len(dataset) to get the length of the data_list.

The data_list can be cached in npz files by calling dataset.save_data(data_path); after that, call InMemoryDataset(npz_data_path=data_path) to reload it.

data_list

a list of dict of numpy ndarray.

Type:

list

Example

data_list = [{'a': np.zeros([4, 5])}, {'a': np.zeros([7, 5])}]
dataset = InMemoryDataset(data_list=data_list)
print(len(dataset))
dataset.save_data('./cached_npz')   # save data_list to ./cached_npz

dataset2 = InMemoryDataset(npz_data_path='./cached_npz')    # will load the saved `data_list`
print(len(dataset2))
get_data_loader(batch_size, num_workers=4, shuffle=False, collate_fn=None)[source]

It returns a batch iterator that yields batches of data. First, a sub-list of size batch_size is drawn from the data_list; then collate_fn is applied to the sub-list to create a batch, which is yielded back. This process is accelerated by multiprocessing.

Parameters:
  • batch_size (int) – the batch_size of the batch data of each yield.

  • num_workers (int) – the number of worker processes used to generate batch data.

  • shuffle (bool) – whether to shuffle the order of the data_list.

  • collate_fn (function) – used to convert the sub-list of data_list to the aggregated batch data.

Yields:

the batch data processed by collate_fn.
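A minimal collate_fn for dicts of equally shaped arrays might look like the sketch below. Real collate functions usually also pad variable-length samples (as in the [4, 5] vs [7, 5] shapes in the class example above), which is omitted here:

```python
import numpy as np

def collate_fn(sub_list):
    """Stack each key of the per-sample dicts into one batch array.

    Assumes every sample shares the same keys and array shapes;
    padding for variable-length samples is omitted.
    """
    return {key: np.stack([sample[key] for sample in sub_list])
            for key in sub_list[0]}

batch = collate_fn([{'a': np.zeros([4, 5])}, {'a': np.ones([4, 5])}])
print(batch['a'].shape)  # (2, 4, 5)
```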

save_data(data_path)[source]

Save the data_list to disk at data_path in npz format. After that, call InMemoryDataset(npz_data_path=data_path) to reload the data_list.

Parameters:

data_path (str) – the path to the cached npz file.
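One plausible way to round-trip a list of dicts through a single npz file (not necessarily the layout save_data actually uses) is to concatenate each key and record per-sample lengths:

```python
import os
import tempfile

import numpy as np

data_list = [{'a': np.zeros([4, 5])}, {'a': np.zeros([7, 5])}]
path = os.path.join(tempfile.mkdtemp(), 'cached.npz')

# Concatenate along axis 0 and remember each sample's length so the
# list of dicts can be rebuilt after loading.
np.savez(path,
         a=np.concatenate([d['a'] for d in data_list]),
         a_lengths=np.array([len(d['a']) for d in data_list]))

loaded = np.load(path)
offsets = np.cumsum(loaded['a_lengths'])[:-1]
restored = [{'a': part} for part in np.split(loaded['a'], offsets)]
print([d['a'].shape for d in restored])  # [(4, 5), (7, 5)]
```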

transform(transform_fn, num_workers=4, drop_none=False)[source]

Apply transform_fn to the data_list in place, using multiple worker processes.
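A sequential stand-in for transform, showing the drop_none behaviour; the real method distributes transform_fn across worker processes and mutates the dataset in place:

```python
def transform(data_list, transform_fn, drop_none=False):
    """Apply transform_fn to every element; optionally drop None results.

    Sequential sketch only: the real method runs transform_fn in
    multiple worker processes.
    """
    out = [transform_fn(d) for d in data_list]
    if drop_none:
        out = [d for d in out if d is not None]
    return out

# Keep odd numbers (scaled by 10), drop the rest.
result = transform([1, 2, 3, 4], lambda x: x * 10 if x % 2 else None,
                   drop_none=True)
print(result)  # [10, 30]
```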

kiba_dataset

Processing of kiba dataset

pahelix.datasets.kiba_dataset.load_kiba_dataset(data_path, featurizer)[source]

tbd

lipophilicity_dataset

Processing of lipophilicity dataset.

Lipophilicity is a dataset curated from the ChEMBL database containing experimental results on the octanol/water distribution coefficient (logD at pH 7.4). Since lipophilicity plays an important role in membrane permeability and solubility, related work deserves attention.

You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.

pahelix.datasets.lipophilicity_dataset.get_default_lipophilicity_task_names()[source]

Get the default lipophilicity task names and return the measured experimental values.

pahelix.datasets.lipophilicity_dataset.get_lipophilicity_stat(data_path, task_names)[source]

Return the mean and standard deviation of the labels.

pahelix.datasets.lipophilicity_dataset.load_lipophilicity_dataset(data_path, task_names=None)[source]

Load the lipophilicity dataset and process the input information.

Description:

The data file contains a csv table, in which columns below are used:

smiles: SMILES representation of the molecular structure

exp: Measured octanol/water distribution coefficient (logD) of the compound, used as label

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_lipophilicity_dataset('./lipophilicity')
print(len(dataset))

References:

[1] Hersey, A. ChEMBL Deposited Data Set - AZ dataset; 2015. https://doi.org/10.6019/chembl3301361

muv_dataset

Processing of muv dataset.

The Maximum Unbiased Validation (MUV) group is a benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis. The MUV dataset contains 17 challenging tasks for around 90,000 compounds and is specifically designed for validation of virtual screening techniques.

You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.

pahelix.datasets.muv_dataset.get_default_muv_task_names()[source]

Get the default muv task names and return the measured results for the bioassays.

pahelix.datasets.muv_dataset.load_muv_dataset(data_path, task_names=None)[source]

Load the muv dataset and process the input information.

Description:

The data file contains a csv table, in which columns below are used:

smiles: SMILES representation of the molecular structure.

mol_id: PubChem CID of the compound.

MUV-XXX: Measured results (Active/Inactive) for bioassays.

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_muv_dataset('./muv')
print(len(dataset))

References:

[1] Rohrer, Sebastian G., and Knut Baumann. “Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data.” Journal of chemical information and modeling 49.2 (2009): 169-184.

ppi_dataset

Processing of PPI dataset. The PPI dataset was extracted from DrugCombDB. You can download the dataset from http://drugcombdb.denglab.org/download/protein_protein_links.rar and load it into pahelix reader creators.

pahelix.datasets.ppi_dataset.get_default_ppi_task_names()[source]

Get the default ppi task names.

pahelix.datasets.ppi_dataset.load_ppi_dataset(data_path, task_names=None, featurizer=None)[source]

Load the ppi dataset, process the input information and apply the featurizer.

Description:

The data file is a txt table, in which columns below are used:

protein1: name of the first protein

protein2: name of the second protein

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the txt file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_ppi_dataset('./ppi/raw')
print(len(dataset))

sider_dataset

Processing of sider dataset.

The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). The version of the SIDER dataset in DeepChem groups drug side effects into 27 system organ classes following MedDRA classifications, measured for 1427 approved drugs.

You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.

pahelix.datasets.sider_dataset.get_default_sider_task_names()[source]

Get the default sider task names and return the side effect results for the drugs.

pahelix.datasets.sider_dataset.load_sider_dataset(data_path, task_names=None)[source]

Load the sider dataset and process the input information.

Description:

The data file contains a csv table, in which columns below are used:

smiles: SMILES representation of the molecular structure.

Hepatobiliary disorders ~ Injury, poisoning and procedural complications: recorded side effects for the drug, one column per system organ class

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_sider_dataset('./sider')
print(len(dataset))

References:

[1] Kuhn, Michael, et al. “The SIDER database of drugs and side effects.” Nucleic acids research 44.D1 (2015): D1075-D1079.

[2] Altae-Tran, Han, et al. “Low data drug discovery with one-shot learning.” ACS central science 3.4 (2017): 283-293.

[3] Medical Dictionary for Regulatory Activities. http://www.meddra.org/

[4] Please refer to http://sideeffects.embl.de/se/?page=98 for details on ADRs.

tox21_dataset

Processing of tox21 dataset.

The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways.

You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.

pahelix.datasets.tox21_dataset.get_default_tox21_task_names()[source]

Get the default tox21 task names and return the bioassay results.

pahelix.datasets.tox21_dataset.load_tox21_dataset(data_path, task_names=None)[source]

Load the tox21 dataset and process the input information.

Description:

The data file contains a csv table, in which columns below are used:

smiles: SMILES representation of the molecular structure.

NR-XXX: Nuclear receptor signaling bioassay results.

SR-XXX: Stress response bioassay results.

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_tox21_dataset('./tox21')
print(len(dataset))
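Tox21 labels are sparse: not every compound was assayed against every target. A common convention (an assumption here, not stated in the column description above) is to keep untested entries as NaN and mask them out when computing losses or metrics:

```python
import numpy as np

# Toy label matrix: rows are compounds, columns are NR-/SR- tasks.
# NaN marks a compound that was not tested in that assay.
labels = np.array([[1.0, np.nan, 0.0],
                   [np.nan, 1.0, 1.0]])

mask = ~np.isnan(labels)            # True where a label exists
per_task_counts = mask.sum(axis=0)  # measured labels per task
print(per_task_counts.tolist())     # [1, 1, 2]
```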

References:

[1] Tox21 Challenge. https://tripod.nih.gov/tox21/challenge/

[2] Please refer to the links at https://tripod.nih.gov/tox21/challenge/data.jsp for details.

toxcast_dataset

Processing of toxcast dataset.

ToxCast is an extended data collection from the same initiative as Tox21, providing toxicology data for a large library of compounds based on in vitro high-throughput screening. The processed collection includes qualitative results of over 600 experiments on 8k compounds.

You can download the dataset from http://moleculenet.ai/datasets-1 and load it into pahelix reader creators.

pahelix.datasets.toxcast_dataset.get_default_toxcast_task_names(data_path)[source]

Get the default toxcast task names, read from the data file, and return the task list.

pahelix.datasets.toxcast_dataset.load_toxcast_dataset(data_path, task_names=None)[source]

Load the toxcast dataset and process the input information.

Description:

The data file contains a csv table, in which columns below are used:

smiles: SMILES representation of the molecular structure.

ACEA_T47D_80hr_Negative ~ Tanguay_ZF_120hpf_YSE_up: bioassay results for the corresponding assays

Parameters:
  • data_path (str) – the path to the cached npz file.

  • task_names (list) – a list of header names to specify the columns to fetch from the csv file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_toxcast_dataset('./toxcast')
print(len(dataset))

References:

[1] Richard, Ann M., et al. “ToxCast chemical landscape: paving the road to 21st century toxicology.” Chemical research in toxicology 29.8 (2016): 1225-1251.

[2] Please refer to the section “high-throughput assay information” at https://www.epa.gov/chemical-research/toxicity-forecaster-toxcasttm-data for details.

zinc_dataset

Processing of ZINC dataset.

The ZINC database is a curated collection of commercially available chemical compounds prepared especially for virtual screening. ZINC15 is designed to bring together biology and chemoinformatics with a tool that is easy to use for nonexperts, while remaining fully programmable for chemoinformaticians and computational biologists.

pahelix.datasets.zinc_dataset.load_zinc_dataset(data_path)[source]

Load the ZINC dataset and process the input information.

Description:

The data file contains a csv table, in which columns below are used:

smiles: SMILES representation of the molecular structure.

zinc_id: the ID of the compound

Parameters:

data_path (str) – the path to the cached npz file.

Returns:

an InMemoryDataset instance.

Example

dataset = load_zinc_dataset('./zinc')
print(len(dataset))

References:

[1] Sterling, Teague, and John J. Irwin. “ZINC 15 – ligand discovery for everyone.” Journal of chemical information and modeling 55.11 (2015): 2324-2337. doi: 10.1021/acs.jcim.5b00559. PMID: 26479676.