4. pahelix.utils

4.1. compound_tools

class pahelix.utils.compound_tools.CompoundConstants[source]

Constants of atom and bond properties.

pahelix.utils.compound_tools.atom_numeric_feat(n, allowable, to_one_hot=True)[source]

Restrict the numeric feature n to the allowable range, optionally converting it to a one-hot encoding.
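The exact clamping rule is not shown here; the following is a minimal sketch of the behaviour the docstring suggests, where out-of-range values are mapped to the last slot (an assumption):

```python
def atom_numeric_feat(n, allowable, to_one_hot=True):
    # Values outside the allowable list fall into the last slot
    # (an assumed convention, common for "misc/other" buckets).
    index = allowable.index(n) if n in allowable else len(allowable) - 1
    if not to_one_hot:
        return index
    one_hot = [0] * len(allowable)
    one_hot[index] = 1
    return one_hot
```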

pahelix.utils.compound_tools.check_smiles_validity(smiles)[source]

Check whether the SMILES string can be converted to an RDKit mol object.

pahelix.utils.compound_tools.create_standardized_mol_id(smiles)[source]
Parameters

smiles – SMILES string.

Returns

the InChI string of the standardized molecule.

pahelix.utils.compound_tools.get_gasteiger_partial_charges(mol, n_iter=12)[source]

Calculates the list of Gasteiger partial charges for each atom in the mol object.

Parameters
  • mol – rdkit mol object

  • n_iter (int) – number of iterations. Default 12

Returns

list of computed partial charges for each atom.

pahelix.utils.compound_tools.get_largest_mol(mol_list)[source]

Given a list of RDKit mol objects, return the mol object with the largest number of atoms. If several mols share the largest atom count, the first one is returned.

Parameters

mol_list (list) – a list of rdkit mol object.

Returns

the largest mol.
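The selection logic can be sketched in one line; the stub class below stands in for an RDKit mol (only GetNumAtoms() is assumed):

```python
class _StubMol:
    """Stand-in for an RDKit mol exposing only GetNumAtoms()."""
    def __init__(self, n):
        self.n = n

    def GetNumAtoms(self):
        return self.n


def get_largest_mol(mol_list):
    # Python's max() returns the first maximal element, which matches
    # the tie-breaking rule stated in the docstring above.
    return max(mol_list, key=lambda mol: mol.GetNumAtoms())


mols = [_StubMol(3), _StubMol(7), _StubMol(7)]
largest = get_largest_mol(mols)
```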

pahelix.utils.compound_tools.mol_to_graph_data(mol, add_self_loop=True)[source]
Converts an RDKit mol object to graph data, a dict of numpy ndarrays.
NB: uses simplified atom and bond features, representing them as indices.
Parameters
  • mol – rdkit mol object.

  • add_self_loop – whether to add self loop or not.

Returns

a dict of numpy ndarrays for the graph data. It consists of atom attributes, edge attributes and the edge index.

pahelix.utils.compound_tools.smiles_to_graph_data(smiles, add_self_loop=True)[source]

Convert a SMILES string to graph data.

pahelix.utils.compound_tools.split_rdkit_mol_obj(mol)[source]

Split an RDKit mol object containing one or more species into a list of mol objects, one per species; a single-species mol yields a single-element list.

Parameters

mol – rdkit mol object.

4.2. data_utils

Tools for data.
pahelix.utils.data_utils.get_part_files(data_path, trainer_id, trainer_num)[source]

Split the files in data_path so that each trainer can train from different examples.
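The partitioning rule is not specified here; a plausible sketch is a round-robin assignment over the (sorted) file names, operating on an already-listed set of names rather than the data_path itself — both details are assumptions:

```python
def get_part_files(file_names, trainer_id, trainer_num):
    # Give trainer `trainer_id` every trainer_num-th file, so each
    # trainer sees a disjoint subset of the examples.
    return [name for i, name in enumerate(sorted(file_names))
            if i % trainer_num == trainer_id]
```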

pahelix.utils.data_utils.load_npz_to_data_list(npz_file)[source]

Reload the data list saved by save_data_list_to_npz.

Parameters

npz_file (str) – the npz file location.

Returns

a list of data where each data is a dict of numpy ndarray.

pahelix.utils.data_utils.save_data_list_to_npz(data_list, npz_file)[source]

Save a list of data to the npz file. Each data is a dict of numpy ndarray.

Parameters
  • data_list (list) – a list of data.

  • npz_file (str) – the npz file location.

4.3. language_model_tools

Tools for language models.
pahelix.utils.language_model_tools.apply_bert_mask(token_ids, tokenizer)[source]

Apply BERT mask to the token_ids.

Parameters

token_ids – The list of token ids.

Returns

masked_token_ids: the list of masked token ids. labels: the labels for training BERT.

Return type

(masked_token_ids, labels)
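The masking rule itself is not documented here. The sketch below implements the standard BERT recipe (15% of positions selected; of those, 80% replaced with the mask token, 10% with a random token, 10% left unchanged), which apply_bert_mask is assumed to follow; the mask_id, vocab_ids and rng parameters are illustrative, not the real signature:

```python
import random


def apply_bert_mask(token_ids, mask_id, vocab_ids, mask_prob=0.15, rng=None):
    rng = rng or random.Random(0)
    masked = list(token_ids)
    labels = [-1] * len(token_ids)  # -1 marks positions not predicted
    for i, tid in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tid  # the model must recover the original id
            roll = rng.random()
            if roll < 0.8:
                masked[i] = mask_id          # 80%: mask token
            elif roll < 0.9:
                masked[i] = rng.choice(vocab_ids)  # 10%: random token
            # remaining 10%: keep the original token
    return masked, labels
```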

4.4. paddle_utils

Paddle utils.
pahelix.utils.paddle_utils.get_distributed_optimizer(optimizer)[source]

Get the default collective distributed optimizer under fleet.

pahelix.utils.paddle_utils.load_partial_params(exe, init_model, main_program)[source]

Load parameters partially: each parameter of main_program is loaded only if a corresponding file is found in the init_model folder.

Parameters
  • exe – Paddle executor.

  • init_model (str) – the model folder to load from.

  • main_program – Paddle program.

4.5. protein_tools

Tools for protein features.

class pahelix.utils.protein_tools.ProteinTokenizer[source]

Protein Tokenizer.

convert_token_to_id(token)[source]

Converts a token to an id.

Parameters

token – Token.

Returns

The id of the input token.

Return type

id

convert_tokens_to_ids(tokens)[source]

Convert multiple tokens to ids.

Parameters

tokens – The list of tokens.

Returns

The id list of the input tokens.

Return type

ids

gen_token_ids(sequence)[source]

Generate the list of token ids according to the input sequence.

Parameters

sequence – Sequence to be tokenized.

Returns

token_ids: The list of token ids.

tokenize(sequence)[source]

Split the sequence into token list.

Parameters

sequence – The sequence to be tokenized.

Returns

The token list.

Return type

tokens
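A minimal sketch of how such a tokenizer might behave. The vocabulary, the special tokens and their ids are assumptions for illustration, not PaddleHelix's actual values:

```python
class ProteinTokenizer:
    """Maps amino-acid letters to ids, with assumed special tokens."""

    def __init__(self):
        specials = ['<pad>', '<mask>', '<cls>', '<sep>', '<unk>']
        amino_acids = list('ACDEFGHIKLMNPQRSTVWY')  # 20 standard residues
        self.vocab = {tok: i for i, tok in enumerate(specials + amino_acids)}
        self.unk_id = self.vocab['<unk>']

    def tokenize(self, sequence):
        # One token per residue letter.
        return list(sequence.upper())

    def convert_token_to_id(self, token):
        return self.vocab.get(token, self.unk_id)

    def convert_tokens_to_ids(self, tokens):
        return [self.convert_token_to_id(t) for t in tokens]

    def gen_token_ids(self, sequence):
        # Wrap the sequence in <cls> ... <sep>, BERT-style (an assumption).
        return ([self.vocab['<cls>']]
                + self.convert_tokens_to_ids(self.tokenize(sequence))
                + [self.vocab['<sep>']])
```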

4.6. splitters

Splitters
class pahelix.utils.splitters.RandomSplitter[source]

Random splitter.

split(dataset, frac_train=None, frac_valid=None, frac_test=None, seed=None)[source]
Parameters
  • dataset (InMemoryDataset) – the dataset to split.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.

  • seed (int|None) – the random seed.
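The fraction-based splitting logic can be sketched as a free function (not the actual RandomSplitter implementation; rounding of the cut points is an assumption):

```python
import random


def random_split(dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1,
                 seed=None):
    # Shuffle indices, then cut at the train/valid boundaries;
    # the test split takes whatever remains.
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    n_train = int(frac_train * len(dataset))
    n_valid = int(frac_valid * len(dataset))
    train = [dataset[i] for i in indices[:n_train]]
    valid = [dataset[i] for i in indices[n_train:n_train + n_valid]]
    test = [dataset[i] for i in indices[n_train + n_valid:]]
    return train, valid, test
```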

class pahelix.utils.splitters.IndexSplitter[source]

Split datasets that have already been ordered. The first frac_train proportion is used for the train set, the next frac_valid for the valid set and the final frac_test for the test set.

split(dataset, frac_train=None, frac_valid=None, frac_test=None)[source]
Parameters
  • dataset (InMemoryDataset) – the dataset to split.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.

class pahelix.utils.splitters.ScaffoldSplitter[source]

Adapted from https://github.com/deepchem/deepchem/blob/master/deepchem/splits/splitters.py

Split dataset by Bemis-Murcko scaffolds

split(dataset, frac_train=None, frac_valid=None, frac_test=None)[source]
Parameters
  • dataset (InMemoryDataset) – the dataset to split. Make sure each element in the dataset has key “smiles” which will be used to calculate the scaffold.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.
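Once a scaffold key has been computed per molecule (PaddleHelix derives it from each element's "smiles" field), the grouping logic can be sketched as below. Filling the largest scaffold groups into the train set first is an assumption about the implementation; the key point is that all molecules sharing a scaffold land in the same split:

```python
def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """scaffolds: one precomputed scaffold key per example.

    Returns index lists (train, valid, test); whole scaffold groups
    are assigned to a single split so scaffolds never leak across splits.
    """
    groups = {}
    for idx, scaffold in enumerate(scaffolds):
        groups.setdefault(scaffold, []).append(idx)
    # Larger scaffold groups first, so rare scaffolds end up in valid/test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```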

class pahelix.utils.splitters.RandomScaffoldSplitter[source]

Adapted from https://github.com/pfnet-research/chainer-chemistry/blob/master/chainer_chemistry/dataset/splitters/scaffold_splitter.py

Split dataset by Bemis-Murcko scaffolds

split(dataset, frac_train=None, frac_valid=None, frac_test=None, seed=None)[source]
Parameters
  • dataset (InMemoryDataset) – the dataset to split. Make sure each element in the dataset has key “smiles” which will be used to calculate the scaffold.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.

  • seed (int|None) – the random seed.

pahelix.utils.splitters.generate_scaffold(smiles, include_chirality=False)[source]

Obtain Bemis-Murcko scaffold from smiles

Parameters
  • smiles – SMILES string.

  • include_chirality (bool) – whether to include chirality in the scaffold. Default: False.

Returns

the scaffold of the given smiles.