4. pahelix.utils

4.1. basic_utils

Basic utils
pahelix.utils.basic_utils.load_json_config(path)[source]

Load a JSON configuration file from path.

pahelix.utils.basic_utils.mp_pool_map(list_input, func, num_workers)[source]

Apply func to every element of list_input using num_workers workers. The result is equivalent to:

list_output = [func(input) for input in list_input]
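The description above gives the sequential equivalent of mp_pool_map. A minimal sketch of an order-preserving parallel map is shown below. It uses a thread pool from the Python standard library purely for illustration and is not PaHelix's implementation; the names parallel_map and square are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(list_input, func, num_workers):
    """Hypothetical stand-in: apply func to every item, keeping input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(func, list_input))


def square(x):
    return x * x


list_input = [1, 2, 3, 4]
list_output = parallel_map(list_input, square, num_workers=2)
# Same result as the sequential list comprehension.
assert list_output == [square(x) for x in list_input]
```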

4.2. compound_tools

class pahelix.utils.compound_tools.Compound3DKit[source]

the 3Dkit of Compound

static get_2d_atom_poses(mol)[source]

get 2d atom poses

static get_MMFF_atom_poses(mol, numConfs=None, return_energy=False)[source]

the atoms of mol will be changed in some cases.

static get_atom_poses(mol, conf)[source]

tbd

static get_bond_lengths(edges, atom_poses)[source]

get bond lengths
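The reference does not spell out the input layout. As a rough sketch, assuming edges is a sequence of (source, target) atom index pairs and atom_poses holds one coordinate tuple per atom, a bond length is the Euclidean distance between the two endpoints. The function name bond_lengths and the coordinates are hypothetical.

```python
import math


def bond_lengths(edges, atom_poses):
    """Hypothetical sketch: Euclidean length of each (source, target) edge."""
    return [math.dist(atom_poses[src], atom_poses[dst]) for src, dst in edges]


# Three atoms forming a right angle; coordinates are made up for illustration.
atom_poses = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
edges = [(0, 1), (1, 2), (0, 2)]
lengths = bond_lengths(edges, atom_poses)
```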

static get_superedge_angles(edges, atom_poses, dir_type='HT')[source]

get superedge angles

class pahelix.utils.compound_tools.CompoundKit[source]
static atom_to_feat_vector(atom)[source]

tbd

static check_partial_charge(atom)[source]

tbd

static get_atom_feature_id(atom, name)[source]

get atom features id

static get_atom_feature_size(name)[source]

get atom features size

static get_atom_names(mol)[source]

Get the list of atom names. TODO: to be removed in the future.

static get_atom_value(atom, name)[source]

get atom values

static get_bond_feature_id(bond, name)[source]

get bond features id

static get_bond_feature_size(name)[source]

get bond features size

static get_bond_value(bond, name)[source]

get bond values

static get_daylight_functional_group_counts(mol)[source]

get daylight functional group counts

static get_maccs_fingerprint(mol)[source]

get maccs fingerprint

static get_morgan2048_fingerprint(mol, radius=2)[source]

get morgan2048 fingerprint

static get_morgan_fingerprint(mol, radius=2)[source]

get morgan fingerprint

static get_ring_size(mol)[source]

Return a list of shape (N, 6).

pahelix.utils.compound_tools.check_smiles_validity(smiles)[source]

Check whether the SMILES string can be converted to an rdkit mol object.

pahelix.utils.compound_tools.create_standardized_mol_id(smiles)[source]
Parameters:

smiles – SMILES sequence.

Returns:

InChI.

pahelix.utils.compound_tools.get_atom_feature_dims(list_acquired_feature_names)[source]

tbd

pahelix.utils.compound_tools.get_bond_feature_dims(list_acquired_feature_names)[source]

tbd

pahelix.utils.compound_tools.get_gasteiger_partial_charges(mol, n_iter=12)[source]

Calculates the list of Gasteiger partial charges for each atom in the mol object.

Parameters:
  • mol – rdkit mol object.

  • n_iter (int) – number of iterations. Default 12.

Returns:

list of computed partial charges for each atom.

pahelix.utils.compound_tools.get_largest_mol(mol_list)[source]

Given a list of rdkit mol objects, returns the mol object containing the largest number of atoms. If several mol objects share the largest number of atoms, the first one is picked.

Parameters:

mol_list (list) – a list of rdkit mol objects.

Returns:

the largest mol.
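The tie-breaking rule can be illustrated without RDKit. The sketch below uses a hypothetical FakeMol class that only mimics GetNumAtoms(); largest_mol is likewise a hypothetical name and not the library function.

```python
class FakeMol:
    """Hypothetical stand-in for an RDKit mol; only exposes an atom count."""

    def __init__(self, name, num_atoms):
        self.name = name
        self.num_atoms = num_atoms

    def GetNumAtoms(self):
        return self.num_atoms


def largest_mol(mol_list):
    # max() returns the first maximal element, matching the tie-breaking rule.
    return max(mol_list, key=lambda m: m.GetNumAtoms())


mols = [FakeMol("a", 2), FakeMol("b", 5), FakeMol("c", 5)]
picked = largest_mol(mols)
```

Here "b" and "c" both have five atoms, and the earlier one in the list is returned.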

pahelix.utils.compound_tools.mol_to_geognn_graph_data(mol, atom_poses, dir_type)[source]

Parameters:
  • mol – rdkit molecule.

  • dir_type – direction type for the bond_angle graph.

pahelix.utils.compound_tools.mol_to_geognn_graph_data_MMFF3d(mol)[source]

tbd

pahelix.utils.compound_tools.mol_to_geognn_graph_data_raw3d(mol)[source]

tbd

pahelix.utils.compound_tools.mol_to_graph_data(mol)[source]
Parameters:
  • atom_features – Atom features.

  • edge_features – Edge features.

  • morgan_fingerprint – Morgan fingerprint.

  • functional_groups – Functional groups.

pahelix.utils.compound_tools.new_mol_to_graph_data(mol)[source]

mol_to_graph_data

Parameters:
  • atom_features – Atom features.

  • edge_features – Edge features.

  • morgan_fingerprint – Morgan fingerprint.

  • functional_groups – Functional groups.

pahelix.utils.compound_tools.new_smiles_to_graph_data(smiles, **kwargs)[source]

Convert smiles to graph data.

pahelix.utils.compound_tools.rdchem_enum_to_list(values)[source]

Convert an index-keyed mapping of RDKit enum values to a list. Example input:

values = {0: rdkit.Chem.rdchem.ChiralType.CHI_UNSPECIFIED, 1: rdkit.Chem.rdchem.ChiralType.CHI_TETRAHEDRAL_CW, 2: rdkit.Chem.rdchem.ChiralType.CHI_TETRAHEDRAL_CCW, 3: rdkit.Chem.rdchem.ChiralType.CHI_OTHER}

pahelix.utils.compound_tools.safe_index(alist, elem)[source]

Return the index of elem in alist. If elem is not present, return the last index.
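A minimal sketch of this behavior is shown below. The name safe_index_sketch and the example list are hypothetical; reserving the last slot for unknown values is an assumption about how such a helper is typically used.

```python
def safe_index_sketch(alist, elem):
    """Return the index of elem, or the last index if elem is absent."""
    try:
        return alist.index(elem)
    except ValueError:
        return len(alist) - 1


# Assumed convention: the last entry acts as a catch-all bucket.
atomic_nums = [6, 7, 8, "misc"]
known = safe_index_sketch(atomic_nums, 7)
unknown = safe_index_sketch(atomic_nums, 16)
```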

pahelix.utils.compound_tools.split_rdkit_mol_obj(mol)[source]

Split an rdkit mol object containing one or more species into a list of mol objects. If the input contains a single species, the list contains a single object.

Parameters:

mol – rdkit mol object.

4.3. data_utils

Tools for data.
pahelix.utils.data_utils.get_part_files(data_path, trainer_id, trainer_num)[source]

Split the files in data_path so that each trainer can train from different examples.
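One simple way to give each trainer a disjoint share of the files is a round-robin assignment by index. The sketch below assumes that scheme and works on a plain list of file names; the function name part_files and the file names are hypothetical, and the library may order or shuffle files differently.

```python
def part_files(filenames, trainer_id, trainer_num):
    """Hypothetical sketch: trainer i takes every trainer_num-th file."""
    return [name for i, name in enumerate(sorted(filenames))
            if i % trainer_num == trainer_id]


filenames = ["part-0.npz", "part-1.npz", "part-2.npz", "part-3.npz", "part-4.npz"]
# One shard per trainer; together the shards cover every file exactly once.
shards = [part_files(filenames, t, 2) for t in range(2)]
```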

pahelix.utils.data_utils.load_npz_to_data_list(npz_file)[source]

Reload the data list saved by save_data_list_to_npz.

Parameters:

npz_file (str) – the npz file location.

Returns:

a list of data where each data is a dict of numpy ndarray.

pahelix.utils.data_utils.save_data_list_to_npz(data_list, npz_file)[source]

Save a list of data to the npz file. Each data is a dict of numpy ndarray.

Parameters:
  • data_list (list) – a list of data.

  • npz_file (str) – the npz file location.

4.4. language_model_tools

Tools for language models.
pahelix.utils.language_model_tools.apply_bert_mask(inputs, pad_mask, tokenizer)[source]

Apply BERT mask to the token_ids.

Parameters:

inputs – The list of token ids.

Returns:
  • masked_token_ids – The list of masked token ids.

  • labels – The labels for training BERT.
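The reference does not state the masking probabilities. The sketch below follows the widely used BERT recipe (select about 15% of positions; of those, 80% become the mask token, 10% a random token, 10% stay unchanged) and is an assumption, not a description of the library's code. Padding handling (pad_mask) is left out, and bert_mask_sketch, mask_id, vocab_size and ignore_label are hypothetical names.

```python
import random


def bert_mask_sketch(token_ids, mask_id, vocab_size, mask_prob=0.15,
                     ignore_label=-1, rng=None):
    """Standard BERT-style masking; all defaults here are assumptions."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, no prediction target.
            masked.append(tok)
            labels.append(ignore_label)
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked.append(mask_id)                    # 80%: mask token
        elif roll < 0.9:
            masked.append(rng.randrange(vocab_size))  # 10%: random token
        else:
            masked.append(tok)                        # 10%: unchanged
    return masked, labels


token_ids = [5, 6, 7, 8, 9, 10, 11, 12]
masked_token_ids, labels = bert_mask_sketch(
    token_ids, mask_id=1, vocab_size=30, rng=random.Random(0))
```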

4.5. protein_tools

class pahelix.utils.protein_tools.ProteinTokenizer[source]

Protein Tokenizer.

convert_token_to_id(token)[source]

Converts a token to an id.

Parameters:

token – Token.

Returns:

id – The id of the input token.

convert_tokens_to_ids(tokens)[source]

Convert multiple tokens to ids.

Parameters:

tokens – The list of tokens.

Returns:

ids – The id list of the input tokens.

gen_token_ids(sequence)[source]

Generate the list of token ids according to the input sequence.

Parameters:

sequence – Sequence to be tokenized.

Returns:

token_ids – The list of token ids.

tokenize(sequence)[source]

Split the sequence into a list of tokens.

Parameters:

sequence – The sequence to be tokenized.

Returns:

tokens – The list of tokens.
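To show how the four methods relate, here is a toy character-level tokenizer with the same method names. The class ToyProteinTokenizer, its vocabulary, and the unknown-token id are all hypothetical; the real ProteinTokenizer may use a different vocabulary and may add special tokens.

```python
class ToyProteinTokenizer:
    """Hypothetical character-level tokenizer; the vocabulary is made up."""

    def __init__(self):
        amino_acids = "ACDEFGHIKLMNPQRSTVWY"
        # Id 0 is reserved for unknown residues in this sketch.
        self.vocab = {"<unk>": 0}
        for aa in amino_acids:
            self.vocab[aa] = len(self.vocab)

    def tokenize(self, sequence):
        # One token per residue letter.
        return list(sequence)

    def convert_token_to_id(self, token):
        return self.vocab.get(token, self.vocab["<unk>"])

    def convert_tokens_to_ids(self, tokens):
        return [self.convert_token_to_id(t) for t in tokens]

    def gen_token_ids(self, sequence):
        # Tokenize first, then map every token to its id.
        return self.convert_tokens_to_ids(self.tokenize(sequence))


tokenizer = ToyProteinTokenizer()
token_ids = tokenizer.gen_token_ids("ACDX")
```

In this sketch gen_token_ids is simply the composition of tokenize and convert_tokens_to_ids.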

4.6. splitters

Splitters
class pahelix.utils.splitters.RandomSplitter[source]

Random splitter.

split(dataset, frac_train=None, frac_valid=None, frac_test=None, seed=None)[source]
Parameters:
  • dataset (InMemoryDataset) – the dataset to split.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.

  • seed (int|None) – the random seed.
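The parameters above suggest a shuffle-then-cut scheme. A minimal sketch under that assumption is shown below; it returns index lists rather than datasets, and random_split is a hypothetical name, not the library method.

```python
import random


def random_split(n, frac_train, frac_valid, frac_test, seed=None):
    """Hypothetical sketch: shuffle indices, then cut by fraction."""
    assert abs(frac_train + frac_valid + frac_test - 1.0) < 1e-9
    indices = list(range(n))
    # A dedicated Random instance keeps the split reproducible per seed.
    random.Random(seed).shuffle(indices)
    train_cut = int(frac_train * n)
    valid_cut = int((frac_train + frac_valid) * n)
    return indices[:train_cut], indices[train_cut:valid_cut], indices[valid_cut:]


train_idx, valid_idx, test_idx = random_split(8, 0.5, 0.25, 0.25, seed=0)
```

Calling the function again with the same seed yields the same three index lists.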

class pahelix.utils.splitters.IndexSplitter[source]

Split datasets that have already been ordered. The first frac_train proportion is used for the train set, the next frac_valid for the valid set and the final frac_test for the test set.

split(dataset, frac_train=None, frac_valid=None, frac_test=None)[source]
Parameters:
  • dataset (InMemoryDataset) – the dataset to split.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.

class pahelix.utils.splitters.ScaffoldSplitter[source]

Adapted from https://github.com/deepchem/deepchem/blob/master/deepchem/splits/splitters.py

Split dataset by Bemis-Murcko scaffolds

split(dataset, frac_train=None, frac_valid=None, frac_test=None)[source]
Parameters:
  • dataset (InMemoryDataset) – the dataset to split. Make sure each element in the dataset has key “smiles” which will be used to calculate the scaffold.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.
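A scaffold split keeps all molecules with the same scaffold in the same subset. The sketch below assumes the greedy scheme commonly used for this purpose: group indices by scaffold, process the largest groups first, and fill train, then valid, then test. It takes precomputed scaffold strings so that it runs without RDKit; scaffold_split and the example strings are hypothetical, and the library's exact cutoff handling may differ.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train, frac_valid):
    """Hypothetical sketch: assign whole scaffold groups, largest first."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)
    # Largest groups first; sorted() is stable, so ties keep first-seen order.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train_idx, valid_idx, test_idx = [], [], []
    for group in ordered:
        if len(train_idx) + len(group) <= train_cutoff:
            train_idx.extend(group)
        elif len(train_idx) + len(valid_idx) + len(group) <= valid_cutoff:
            valid_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, valid_idx, test_idx


# Precomputed scaffold SMILES (illustrative values), one per molecule.
scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccccc1",
             "C1CCCCC1", "C1CCCCC1", "c1ccncc1", "C1CCNCC1"]
train_idx, valid_idx, test_idx = scaffold_split(scaffolds, 0.5, 0.25)
```

Because whole groups are assigned at once, the realized subset sizes only approximate the requested fractions.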

class pahelix.utils.splitters.RandomScaffoldSplitter[source]

Adapted from https://github.com/pfnet-research/chainer-chemistry/blob/master/chainer_chemistry/dataset/splitters/scaffold_splitter.py

Split dataset by Bemis-Murcko scaffolds

split(dataset, frac_train=None, frac_valid=None, frac_test=None, seed=None)[source]
Parameters:
  • dataset (InMemoryDataset) – the dataset to split. Make sure each element in the dataset has key “smiles” which will be used to calculate the scaffold.

  • frac_train (float) – the fraction of data to be used for the train split.

  • frac_valid (float) – the fraction of data to be used for the valid split.

  • frac_test (float) – the fraction of data to be used for the test split.

  • seed (int|None) – the random seed.

pahelix.utils.splitters.generate_scaffold(smiles, include_chirality=False)[source]

Obtain Bemis-Murcko scaffold from smiles

Parameters:
  • smiles – smiles sequence

  • include_chirality – whether to include chirality in the scaffold. Default: False.

Returns:

the scaffold of the given smiles.