4. pahelix.utils

4.1. compound_tools

class pahelix.utils.compound_tools.CompoundConstants[source]

Constants of atom and bond properties.

pahelix.utils.compound_tools.atom_numeric_feat(n, allowable, to_one_hot=True)[source]

Restrict the numeric feature n to the allowable range, optionally converting it to a one-hot encoding.
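The exact clamping rule is not shown here; the following is a minimal sketch of the behaviour the docstring suggests, where out-of-range values are mapped to the last slot (an assumption):

```python
def atom_numeric_feat(n, allowable, to_one_hot=True):
    # Values outside the allowable list fall into the last slot
    # (an assumed convention, common for "misc/other" buckets).
    index = allowable.index(n) if n in allowable else len(allowable) - 1
    if not to_one_hot:
        return index
    one_hot = [0] * len(allowable)
    one_hot[index] = 1
    return one_hot
```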

pahelix.utils.compound_tools.check_smiles_validity(smiles)[source]

Check whether the SMILES string can be converted to an RDKit mol object.

pahelix.utils.compound_tools.create_standardized_mol_id(smiles)[source]
Parameters

smiles – SMILES string.

Returns

the InChI string of the standardized molecule.

pahelix.utils.compound_tools.get_gasteiger_partial_charges(mol, n_iter=12)[source]

Calculates the list of Gasteiger partial charges for each atom in the mol object.

Parameters
  • mol – rdkit mol object

  • n_iter (int) – number of iterations. Default 12

Returns

list of computed partial charges for each atom.

pahelix.utils.compound_tools.get_largest_mol(mol_list)[source]

Given a list of RDKit mol objects, return the mol object with the largest number of atoms. If several mols share the largest atom count, the first one is returned.

Parameters

mol_list (list) – a list of rdkit mol object.

Returns

the largest mol.
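The selection logic can be sketched in one line; the stub class below stands in for an RDKit mol (only GetNumAtoms() is assumed):

```python
class _StubMol:
    """Stand-in for an RDKit mol exposing only GetNumAtoms()."""
    def __init__(self, n):
        self.n = n

    def GetNumAtoms(self):
        return self.n


def get_largest_mol(mol_list):
    # Python's max() returns the first maximal element, which matches
    # the tie-breaking rule stated in the docstring above.
    return max(mol_list, key=lambda mol: mol.GetNumAtoms())


mols = [_StubMol(3), _StubMol(7), _StubMol(7)]
largest = get_largest_mol(mols)
```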

pahelix.utils.compound_tools.mol_to_graph_data(mol, add_self_loop=True)[source]
Converts an RDKit mol object to graph data, a dict of numpy ndarrays.
NB: uses simplified atom and bond features, representing them as indices.
Parameters
  • mol – rdkit mol object.

  • add_self_loop – whether to add self loop or not.

Returns

a dict of numpy ndarrays for the graph data. It consists of atom attributes, edge attributes and the edge index.

pahelix.utils.compound_tools.smiles_to_graph_data(smiles, add_self_loop=True)[source]

Convert a SMILES string to graph data.

pahelix.utils.compound_tools.split_rdkit_mol_obj(mol)[source]

Split an RDKit mol object containing one or more species into a list of mol objects, one per species; a single-species mol yields a single-element list.

Parameters

mol – rdkit mol object.

4.2. data_utils

Tools for data.
pahelix.utils.data_utils.get_part_files(data_path, trainer_id, trainer_num)[source]

Split the files in data_path so that each trainer can train from different examples.
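The partitioning rule is not specified here; a plausible sketch is a round-robin assignment over the (sorted) file names, operating on an already-listed set of names rather than the data_path itself — both details are assumptions:

```python
def get_part_files(file_names, trainer_id, trainer_num):
    # Give trainer `trainer_id` every trainer_num-th file, so each
    # trainer sees a disjoint subset of the examples.
    return [name for i, name in enumerate(sorted(file_names))
            if i % trainer_num == trainer_id]
```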

pahelix.utils.data_utils.load_npz_to_data_list(npz_file)[source]

Reload the data list saved by save_data_list_to_npz.

Parameters

npz_file (str) – the npz file location.

Returns

a list of data where each data is a dict of numpy ndarray.

pahelix.utils.data_utils.save_data_list_to_npz(data_list, npz_file)[source]

Save a list of data to the npz file. Each data is a dict of numpy ndarray.

Parameters
  • data_list (list) – a list of data.

  • npz_file (str) – the npz file location.

4.3. language_model_tools

Tools for language models.
pahelix.utils.language_model_tools.apply_bert_mask(token_ids, tokenizer)[source]

Apply BERT mask to the token_ids.

Parameters

token_ids – The list of token ids.

Returns

masked_token_ids: the list of masked token ids. labels: the labels for training BERT.

Return type

(masked_token_ids, labels)
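The masking rule itself is not documented here. The sketch below implements the standard BERT recipe (15% of positions selected; of those, 80% replaced with the mask token, 10% with a random token, 10% left unchanged), which apply_bert_mask is assumed to follow; the mask_id, vocab_ids and rng parameters are illustrative, not the real signature:

```python
import random


def apply_bert_mask(token_ids, mask_id, vocab_ids, mask_prob=0.15, rng=None):
    rng = rng or random.Random(0)
    masked = list(token_ids)
    labels = [-1] * len(token_ids)  # -1 marks positions not predicted
    for i, tid in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tid  # the model must recover the original id
            roll = rng.random()
            if roll < 0.8:
                masked[i] = mask_id          # 80%: mask token
            elif roll < 0.9:
                masked[i] = rng.choice(vocab_ids)  # 10%: random token
            # remaining 10%: keep the original token
    return masked, labels
```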

4.4. paddle_utils

Paddle utils.
pahelix.utils.paddle_utils.get_distributed_optimizer(optimizer)[source]

Get the default collective distributed optimizer under fleet.

pahelix.utils.paddle_utils.load_partial_params(exe, init_model, main_program)[source]

Load parameters partially: each parameter of main_program is loaded only if a corresponding file is found in the init_model folder.

Parameters
  • exe – Paddle executor.

  • init_model (str) – the model folder to load from.

  • main_program – Paddle program.

4.5. protein_tools

Tools for protein features.

class pahelix.utils.protein_tools.ProteinTokenizer[source]

Protein Tokenizer.

convert_token_to_id(token)[source]

Converts a token to an id.

Parameters

token – Token.

Returns

The id of the input token.

Return type

id

convert_tokens_to_ids(tokens)[source]

Convert multiple tokens to ids.

Parameters

tokens – The list of tokens.

Returns

The id list of the input tokens.

Return type

ids

gen_token_ids(sequence)[source]

Generate the list of token ids according to the input sequence.

Parameters

sequence – Sequence to be tokenized.

Returns

token_ids: The list of token ids.

tokenize(sequence)[source]

Split the sequence into token list.

Parameters

sequence – The sequence to be tokenized.

Returns

The token list.

Return type

tokens
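A minimal sketch of how such a tokenizer might behave. The vocabulary, the special tokens and their ids are assumptions for illustration, not PaddleHelix's actual values:

```python
class ProteinTokenizer:
    """Maps amino-acid letters to ids, with assumed special tokens."""

    def __init__(self):
        specials = ['<pad>', '<mask>', '<cls>', '<sep>', '<unk>']
        amino_acids = list('ACDEFGHIKLMNPQRSTVWY')  # 20 standard residues
        self.vocab = {tok: i for i, tok in enumerate(specials + amino_acids)}
        self.unk_id = self.vocab['<unk>']

    def tokenize(self, sequence):
        # One token per residue letter.
        return list(sequence.upper())

    def convert_token_to_id(self, token):
        return self.vocab.get(token, self.unk_id)

    def convert_tokens_to_ids(self, tokens):
        return [self.convert_token_to_id(t) for t in tokens]

    def gen_token_ids(self, sequence):
        # Wrap the sequence in <cls> ... <sep>, BERT-style (an assumption).
        return ([self.vocab['<cls>']]
                + self.convert_tokens_to_ids(self.tokenize(sequence))
                + [self.vocab['<sep>']])
```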

4.6. splitters

Splitters
class pahelix.utils.splitters.RandomSplitter[source]

Random splitter.

split(dataset, frac_train=None, frac_valid=None, frac_test=None, seed=None)[source]
Parameters
  • dataset (InMemoryDataset) – the dataset to split.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.

  • seed (int|None) – the random seed.
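The fraction-based splitting logic can be sketched as a free function (not the actual RandomSplitter implementation; rounding of the cut points is an assumption):

```python
import random


def random_split(dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1,
                 seed=None):
    # Shuffle indices, then cut at the train/valid boundaries;
    # the test split takes whatever remains.
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    n_train = int(frac_train * len(dataset))
    n_valid = int(frac_valid * len(dataset))
    train = [dataset[i] for i in indices[:n_train]]
    valid = [dataset[i] for i in indices[n_train:n_train + n_valid]]
    test = [dataset[i] for i in indices[n_train + n_valid:]]
    return train, valid, test
```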

class pahelix.utils.splitters.IndexSplitter[source]

Split datasets that have already been ordered. The first frac_train proportion is used for the train set, the next frac_valid for the valid set and the final frac_test for the test set.

split(dataset, frac_train=None, frac_valid=None, frac_test=None)[source]
Parameters
  • dataset (InMemoryDataset) – the dataset to split.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.

class pahelix.utils.splitters.ScaffoldSplitter[source]

Adapted from https://github.com/deepchem/deepchem/blob/master/deepchem/splits/splitters.py

Split dataset by Bemis-Murcko scaffolds

split(dataset, frac_train=None, frac_valid=None, frac_test=None)[source]
Parameters
  • dataset (InMemoryDataset) – the dataset to split. Make sure each element in the dataset has key “smiles” which will be used to calculate the scaffold.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.
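Once a scaffold key has been computed per molecule (PaddleHelix derives it from each element's "smiles" field), the grouping logic can be sketched as below. Filling the largest scaffold groups into the train set first is an assumption about the implementation; the key point is that all molecules sharing a scaffold land in the same split:

```python
def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """scaffolds: one precomputed scaffold key per example.

    Returns index lists (train, valid, test); whole scaffold groups
    are assigned to a single split so scaffolds never leak across splits.
    """
    groups = {}
    for idx, scaffold in enumerate(scaffolds):
        groups.setdefault(scaffold, []).append(idx)
    # Larger scaffold groups first, so rare scaffolds end up in valid/test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```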

class pahelix.utils.splitters.RandomScaffoldSplitter[source]

Adapted from https://github.com/pfnet-research/chainer-chemistry/blob/master/chainer_chemistry/dataset/splitters/scaffold_splitter.py

Split dataset by Bemis-Murcko scaffolds

split(dataset, frac_train=None, frac_valid=None, frac_test=None, seed=None)[source]
Parameters
  • dataset (InMemoryDataset) – the dataset to split. Make sure each element in the dataset has key “smiles” which will be used to calculate the scaffold.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.

  • seed (int|None) – the random seed.

pahelix.utils.splitters.generate_scaffold(smiles, include_chirality=False)[source]

Obtain Bemis-Murcko scaffold from smiles

Parameters
  • smiles – SMILES string.

  • include_chirality (bool) – whether to include chirality in the scaffold. Default: False.

Returns

the scaffold of the given smiles.