4. pahelix.utils ¶

4.1. basic_utils ¶

Basic utils

pahelix.utils.basic_utils.load_json_config(path)[source]¶: tbd

pahelix.utils.basic_utils.mp_pool_map(list_input, func, num_workers)[source]¶: Map func over list_input using num_workers workers. The result is equivalent to list_output = [func(input) for input in list_input].
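
The description of mp_pool_map states that the output matches an ordinary list comprehension. The sketch below illustrates that contract only and is not the library's implementation. The name pool_map_sketch is hypothetical, and a thread pool stands in for worker processes (an assumption) so that the snippet runs unchanged in any environment.

```python
from multiprocessing.pool import ThreadPool


def pool_map_sketch(list_input, func, num_workers):
    """Apply func to every element of list_input with num_workers workers.

    Hypothetical stand-in for mp_pool_map: a thread pool is used here
    instead of worker processes (an assumption made for portability).
    """
    with ThreadPool(num_workers) as pool:
        # pool.map preserves input order, matching the list comprehension.
        return pool.map(func, list_input)


inputs = [1, 2, 3, 4]
parallel = pool_map_sketch(inputs, lambda x: x * x, num_workers=2)
serial = [x * x for x in inputs]
```

Both parallel and serial hold the same values in the same order, which is the property the one-line description promises.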

4.2. compound_tools ¶

Tools for compound features.

Adapted from https://github.com/snap-stanford/pretrain-gnns/blob/master/chem/loader.py

class pahelix.utils.compound_tools.Compound3DKit[source]¶

The 3D toolkit for compounds.

static get_2d_atom_poses(mol)[source]¶: Get the 2D atom poses.

static get_MMFF_atom_poses(mol, numConfs=None, return_energy=False)[source]¶: Note: the atoms of mol may be changed in some cases.

static get_atom_poses(mol, conf)[source]¶: tbd

static get_bond_lengths(edges, atom_poses)[source]¶: Get the bond lengths.
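
A bond length is the Euclidean distance between the positions of the two atoms that an edge connects. The sketch below shows that computation in plain Python. The function name and the assumed input layout (index pairs for edges, coordinate tuples for atom_poses) are illustrative and may differ from what get_bond_lengths actually accepts.

```python
import math


def bond_lengths_sketch(edges, atom_poses):
    """Return the Euclidean length of each edge.

    edges: iterable of (source_atom_index, target_atom_index) pairs.
    atom_poses: sequence of (x, y, z) coordinates, one per atom.
    Both input layouts are assumptions made for this illustration.
    """
    return [math.dist(atom_poses[src], atom_poses[dst]) for src, dst in edges]


poses = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
edges = [(0, 1), (1, 2), (0, 2)]
lengths = bond_lengths_sketch(edges, poses)
```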

static get_superedge_angles(edges, atom_poses, dir_type='HT')[source]¶: Get the superedge angles.

class pahelix.utils.compound_tools.CompoundKit[source]¶

static atom_to_feat_vector(atom)[source]¶: tbd

static check_partial_charge(atom)[source]¶: tbd

static get_atom_feature_id(atom, name)[source]¶: Get the atom feature id.

static get_atom_feature_size(name)[source]¶: Get the atom feature size.

static get_atom_names(mol)[source]¶: Get the list of atom names. TODO: to be removed in the future.

static get_atom_value(atom, name)[source]¶: Get the atom value.

static get_bond_feature_id(bond, name)[source]¶: Get the bond feature id.

static get_bond_feature_size(name)[source]¶: Get the bond feature size.

static get_bond_value(bond, name)[source]¶: Get the bond value.

static get_daylight_functional_group_counts(mol)[source]¶: Get the Daylight functional group counts.

static get_maccs_fingerprint(mol)[source]¶: Get the MACCS fingerprint.

static get_morgan2048_fingerprint(mol, radius=2)[source]¶: Get the Morgan2048 fingerprint.

static get_morgan_fingerprint(mol, radius=2)[source]¶: Get the Morgan fingerprint.

static get_ring_size(mol)[source]¶: Return an (N, 6) list.

pahelix.utils.compound_tools.check_smiles_validity(smiles)[source]¶: Check whether the SMILES string can be converted to an rdkit mol object.

pahelix.utils.compound_tools.create_standardized_mol_id(smiles)[source]¶

Parameters: smiles – the SMILES sequence.
Returns: the InChI.

pahelix.utils.compound_tools.get_atom_feature_dims(list_acquired_feature_names)[source]¶: tbd

pahelix.utils.compound_tools.get_bond_feature_dims(list_acquired_feature_names)[source]¶: tbd

pahelix.utils.compound_tools.get_gasteiger_partial_charges(mol, n_iter=12)[source]¶

Calculates the list of Gasteiger partial charges for each atom in the mol object.

Parameters:

mol – rdkit mol object.
n_iter (int) – number of iterations. Default 12.

Returns:

list of computed partial charges for each atom.

pahelix.utils.compound_tools.get_largest_mol(mol_list)[source]¶

Given a list of rdkit mol objects, returns the mol object containing the largest number of atoms. If several mol objects share the largest number of atoms, the first one is picked.

Parameters: mol_list (list) – a list of rdkit mol objects.
Returns: the largest mol.
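
The tie-breaking rule for get_largest_mol (the first of several equally large molecules wins) can be illustrated without RDKit. In the sketch below, lists of atom symbols stand in for mol objects, and both the function name and the num_atoms callable are hypothetical.

```python
def largest_by_atom_count(mol_list, num_atoms):
    """Return the first element of mol_list with the most atoms.

    num_atoms is a hypothetical callable returning an element's atom
    count; with RDKit mol objects it could wrap mol.GetNumAtoms().
    """
    # max() returns the first maximal element, which gives the
    # "pick the first one" tie-breaking described above.
    return max(mol_list, key=num_atoms)


# Stand-in "molecules": lists of atom symbols instead of RDKit mol objects.
fragments = [["Na"], ["C", "C", "O"], ["N", "C", "C"]]
largest = largest_by_atom_count(fragments, num_atoms=len)
```

Here the second and third fragments both have three atoms, and the second one is returned because it appears first.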

pahelix.utils.compound_tools.mol_to_geognn_graph_data(mol, atom_poses, dir_type)[source]¶: mol: rdkit molecule. dir_type: direction type for the bond_angle graph.

pahelix.utils.compound_tools.mol_to_geognn_graph_data_MMFF3d(mol)[source]¶: tbd

pahelix.utils.compound_tools.mol_to_geognn_graph_data_raw3d(mol)[source]¶: tbd

pahelix.utils.compound_tools.mol_to_graph_data(mol)[source]¶

Parameters:

atom_features – Atom features.
edge_features – Edge features.
morgan_fingerprint – Morgan fingerprint.
functional_groups – Functional groups.

pahelix.utils.compound_tools.new_mol_to_graph_data(mol)[source]¶

mol_to_graph_data

Parameters:

atom_features – Atom features.
edge_features – Edge features.
morgan_fingerprint – Morgan fingerprint.
functional_groups – Functional groups.

pahelix.utils.compound_tools.new_smiles_to_graph_data(smiles, **kwargs)[source]¶: Convert a SMILES string to graph data.

pahelix.utils.compound_tools.rdchem_enum_to_list(values)[source]¶: values = {0: rdkit.Chem.rdchem.ChiralType.CHI_UNSPECIFIED, 1: rdkit.Chem.rdchem.ChiralType.CHI_TETRAHEDRAL_CW, 2: rdkit.Chem.rdchem.ChiralType.CHI_TETRAHEDRAL_CCW, 3: rdkit.Chem.rdchem.ChiralType.CHI_OTHER}
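
The description of rdchem_enum_to_list shows a mapping keyed by consecutive integers starting at 0. Judging by the function name, the values are presumably returned as a list in key order. The sketch below illustrates that conversion under this assumption, using plain strings in place of the rdkit enum members. The function name is hypothetical.

```python
def enum_dict_to_list_sketch(values):
    """Turn {0: a, 1: b, ...} into [a, b, ...] ordered by integer key."""
    return [values[i] for i in range(len(values))]


# Plain strings stand in for the rdkit ChiralType members.
chiral_types = {
    0: "CHI_UNSPECIFIED",
    1: "CHI_TETRAHEDRAL_CW",
    2: "CHI_TETRAHEDRAL_CCW",
    3: "CHI_OTHER",
}
as_list = enum_dict_to_list_sketch(chiral_types)
```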

pahelix.utils.compound_tools.safe_index(alist, elem)[source]¶: Return the index of elem in alist. If elem is not present, return the last index.
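
The fallback to the last index in safe_index is useful when the final entry of a feature vocabulary is a catch-all such as "misc". A minimal sketch of the described behaviour follows. The function name and the example vocabulary are illustrative and not taken from the library.

```python
def safe_index_sketch(alist, elem):
    """Return the index of elem in alist, or the last index if absent."""
    try:
        return alist.index(elem)
    except ValueError:
        return len(alist) - 1


# Hypothetical vocabulary whose last entry collects unknown values.
hybridizations = ["SP", "SP2", "SP3", "misc"]
known = safe_index_sketch(hybridizations, "SP2")
unknown = safe_index_sketch(hybridizations, "SP3D2")
```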

pahelix.utils.compound_tools.split_rdkit_mol_obj(mol)[source]¶

Split rdkit mol object containing multiple species or one species into a list of mol objects or a list containing a single object respectively.

Parameters: mol – rdkit mol object.

4.3. data_utils ¶

Tools for data.

pahelix.utils.data_utils.get_part_files(data_path, trainer_id, trainer_num)[source]¶: Split the files in data_path so that each trainer can train from different examples.
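
One common way to give each trainer a different share of the files is round-robin assignment by index. The sketch below shows that idea on a list of file names. It is an assumption about how such a split can work, not a statement of what get_part_files does internally, and the function name is hypothetical.

```python
def part_files_sketch(filenames, trainer_id, trainer_num):
    """Assign every trainer_num-th file to the trainer with index trainer_id.

    Round-robin assignment is an assumption chosen for illustration; the
    library may order or shuffle the files differently.
    """
    return [
        name
        for index, name in enumerate(sorted(filenames))
        if index % trainer_num == trainer_id
    ]


files = ["part-0.npz", "part-1.npz", "part-2.npz", "part-3.npz", "part-4.npz"]
shards = [part_files_sketch(files, tid, trainer_num=2) for tid in range(2)]
```

With two trainers, the shards are disjoint and together cover every file.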

pahelix.utils.data_utils.load_npz_to_data_list(npz_file)[source]¶

Reload the data list saved by save_data_list_to_npz.

Parameters: npz_file (str) – the npz file location.
Returns: a list of data where each data is a dict of numpy ndarray.

pahelix.utils.data_utils.save_data_list_to_npz(data_list, npz_file)[source]¶

Save a list of data to the npz file. Each data is a dict of numpy ndarray.

Parameters:

data_list (list) – a list of data.
npz_file (str) – the npz file location.

4.4. language_model_tools ¶

Tools for language models.

pahelix.utils.language_model_tools.apply_bert_mask(inputs, pad_mask, tokenizer)[source]¶

Apply the BERT mask to the token_ids.

Parameters: token_ids – The list of token ids.
Returns: masked_token_ids – The list of masked token ids. labels – The labels for training BERT.
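
For orientation, the sketch below shows the masking recipe commonly used for BERT pre-training: a fraction of positions is selected for prediction, and most of the selected tokens are replaced by a mask id. Every name, default value, and the use of -1 as the ignore label are assumptions. The selection rate and replacement ratios used by apply_bert_mask may differ.

```python
import random


def bert_mask_sketch(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """Return (masked_token_ids, labels) using the common BERT recipe.

    Every argument name and default here is an assumption. labels holds
    the original id at selected positions and -1 elsewhere.
    """
    rng = random.Random(seed)
    masked_token_ids = list(token_ids)
    labels = [-1] * len(token_ids)
    for pos, token_id in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[pos] = token_id
        roll = rng.random()
        if roll < 0.8:
            masked_token_ids[pos] = mask_id  # 80%: replace with the mask id
        elif roll < 0.9:
            masked_token_ids[pos] = rng.randrange(vocab_size)  # 10%: random id
        # remaining 10%: keep the original token
    return masked_token_ids, labels


token_ids = [5, 9, 12, 7, 3, 18, 4, 11] * 4
masked, labels = bert_mask_sketch(token_ids, mask_id=1, vocab_size=30)
```

Positions whose label is -1 are left untouched, so a loss computed only over the remaining positions trains the model to recover the original tokens.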

4.5. protein_tools ¶

class pahelix.utils.protein_tools.ProteinTokenizer[source]¶

Protein Tokenizer.

convert_token_to_id(token)[source]¶

Converts a token to an id.

Parameters: token – Token.
Returns: id – The id of the input token.

convert_tokens_to_ids(tokens)[source]¶

Convert multiple tokens to ids.

Parameters: tokens – The list of tokens.
Returns: ids – The id list of the input tokens.

gen_token_ids(sequence)[source]¶

Generate the list of token ids according to the input sequence.

Parameters: sequence – Sequence to be tokenized.
Returns: token_ids – The list of token ids.

tokenize(sequence)[source]¶

Split the sequence into a token list.

Parameters: sequence – The sequence to be tokenized.
Returns: tokens – The token list.
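
The four ProteinTokenizer methods compose naturally: gen_token_ids is tokenize followed by convert_tokens_to_ids, which in turn applies convert_token_to_id to each token. The sketch below mirrors that structure with a hypothetical character-level vocabulary. It is not the library's tokenizer, and it omits any special start or end tokens the real one may add.

```python
class TinyProteinTokenizer:
    """Character-level tokenizer sketch; the vocabulary is hypothetical."""

    def __init__(self):
        special = ["<pad>", "<unk>"]
        amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
        self.vocab = {tok: idx for idx, tok in enumerate(special + amino_acids)}

    def tokenize(self, sequence):
        # One token per residue letter.
        return list(sequence)

    def convert_token_to_id(self, token):
        # Unknown tokens fall back to the <unk> id (an assumption).
        return self.vocab.get(token, self.vocab["<unk>"])

    def convert_tokens_to_ids(self, tokens):
        return [self.convert_token_to_id(tok) for tok in tokens]

    def gen_token_ids(self, sequence):
        return self.convert_tokens_to_ids(self.tokenize(sequence))


tokenizer = TinyProteinTokenizer()
tokens = tokenizer.tokenize("MKV")
token_ids = tokenizer.gen_token_ids("MKV")
```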

4.6. splitters ¶

Splitters

class pahelix.utils.splitters.RandomSplitter[source]¶

Random splitter.

split(dataset, frac_train=None, frac_valid=None, frac_test=None, seed=None)[source]¶

Parameters:

dataset (InMemoryDataset) – the dataset to split.
frac_train (float) – the fraction of data to be used for the train split.
frac_valid (float) – the fraction of data to be used for the valid split.
frac_test (float) – the fraction of data to be used for the test split.
seed (int|None) – the random seed.
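
A random split can be pictured as shuffling the dataset indices with the given seed and cutting the shuffled list at the train and valid boundaries. The sketch below works on index lists rather than an InMemoryDataset. The function name is hypothetical, and rounding at the boundaries may differ from the library's behaviour.

```python
import math
import random


def random_split_sketch(num_items, frac_train, frac_valid, frac_test, seed=None):
    """Return shuffled (train, valid, test) index lists."""
    if not math.isclose(frac_train + frac_valid + frac_test, 1.0):
        raise ValueError("fractions must sum to 1")
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    train_end = int(frac_train * num_items)
    valid_end = int((frac_train + frac_valid) * num_items)
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


train, valid, test = random_split_sketch(8, 0.5, 0.25, 0.25, seed=0)
```

Passing the same seed again reproduces the same three index lists, which is what makes a random split repeatable across runs.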

class pahelix.utils.splitters.IndexSplitter[source]¶

Split datasets that have already been ordered. The first frac_train proportion is used for the train set, the next frac_valid for the valid set, and the final frac_test for the test set.

split(dataset, frac_train=None, frac_valid=None, frac_test=None)[source]¶

Parameters:

dataset (InMemoryDataset) – the dataset to split.
frac_train (float) – the fraction of data to be used for the train split.
frac_valid (float) – the fraction of data to be used for the valid split.
frac_test (float) – the fraction of data to be used for the test split.
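
Because the data is already ordered, an index split reduces to slicing at two cut points. The sketch below shows this on a plain list. The function name is hypothetical, and the way fractional boundaries are rounded is an assumption.

```python
def index_split_sketch(items, frac_train, frac_valid, frac_test):
    """Slice items into consecutive train, valid and test parts.

    frac_test is implied by the other two fractions; it is kept only so
    the signature mirrors the documented split() arguments.
    """
    n = len(items)
    train_end = int(frac_train * n)
    valid_end = int((frac_train + frac_valid) * n)
    return items[:train_end], items[train_end:valid_end], items[valid_end:]


ordered_items = list(range(8))
train, valid, test = index_split_sketch(ordered_items, 0.5, 0.25, 0.25)
```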

class pahelix.utils.splitters.ScaffoldSplitter[source]¶

Adapted from https://github.com/deepchem/deepchem/blob/master/deepchem/splits/splitters.py

Split dataset by Bemis-Murcko scaffolds

split(dataset, frac_train=None, frac_valid=None, frac_test=None)[source]¶

Parameters:

dataset (InMemoryDataset) – the dataset to split. Make sure each element in the dataset has key “smiles” which will be used to calculate the scaffold.
frac_train (float) – the fraction of data to be used for the train split.
frac_valid (float) – the fraction of data to be used for the valid split.
frac_test (float) – the fraction of data to be used for the test split.
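
Scaffold splitting keeps all molecules that share a scaffold in the same subset. A greedy strategy in the style of the DeepChem splitter this class is adapted from can be sketched as follows: group indices by scaffold, sort the groups from largest to smallest, and fill train, then valid, then test. The sketch takes precomputed scaffold strings as input. The function name and the exact cutoff handling are assumptions.

```python
from collections import defaultdict


def scaffold_split_sketch(scaffolds, frac_train, frac_valid):
    """Split indices so that items sharing a scaffold stay in one subset.

    scaffolds: one precomputed scaffold string per dataset element
    (computing them from SMILES is outside this sketch).
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)
    # Largest scaffold groups first, so rare scaffolds end up in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


scaffolds = ["c1ccccc1"] * 4 + ["C1CCCCC1"] * 2 + ["c1ccncc1", "c1ccoc1"]
train, valid, test = scaffold_split_sketch(scaffolds, 0.5, 0.25)
```

In this example the four benzene-scaffold items fill the train set, the two cyclohexane items go to valid, and the two singleton scaffolds land in test, so no scaffold is shared between subsets.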

class pahelix.utils.splitters.RandomScaffoldSplitter[source]¶

Adapted from https://github.com/pfnet-research/chainer-chemistry/blob/master/chainer_chemistry/dataset/splitters/scaffold_splitter.py

Split dataset by Bemis-Murcko scaffolds

split(dataset, frac_train=None, frac_valid=None, frac_test=None, seed=None)[source]¶

Parameters:

dataset (InMemoryDataset) – the dataset to split. Make sure each element in the dataset has key “smiles” which will be used to calculate the scaffold.
frac_train (float) – the fraction of data to be used for the train split.
frac_valid (float) – the fraction of data to be used for the valid split.
frac_test (float) – the fraction of data to be used for the test split.
seed (int|None) – the random seed.

pahelix.utils.splitters.generate_scaffold(smiles, include_chirality=False)[source]¶

Obtain Bemis-Murcko scaffold from smiles

Parameters:

smiles – smiles sequence
include_chirality – whether to include chirality in the scaffold. Default: False.

Returns:

the scaffold of the given smiles.

4.7. Helpful Link ¶

Please refer to our GitHub repo to see the whole module.