MolForge


File list:
Figures/ (0, 2023-05-29)
Figures/Attention.png (371763, 2023-05-29)
Figures/IG_attribution.png (385583, 2023-05-29)
Figures/pairwise_selfies_results.png (237933, 2023-05-29)
Figures/pairwise_smiles_results.png (259069, 2023-05-29)
Interpretation.ipynb (1173294, 2023-05-29)
LICENSE.md (20177, 2023-05-29)
Main_Results.ipynb (709706, 2023-05-29)
MolForge/ (0, 2023-05-29)
MolForge/__init__.py (145, 2023-05-29)
MolForge/decoder.py (6175, 2023-05-29)
MolForge/evaluate.py (12986, 2023-05-29)
MolForge/fingerprints.py (4396, 2023-05-29)
MolForge/interpretability.py (7561, 2023-05-29)
MolForge/parameters.py (2752, 2023-05-29)
MolForge/predict.py (4833, 2023-05-29)
MolForge/tokenizer.py (3355, 2023-05-29)
MolForge/train.py (11515, 2023-05-29)
MolForge/transformer.py (8353, 2023-05-29)
MolForge/utils.py (11941, 2023-05-29)
MolForge/web_api.py (3612, 2023-05-29)
data/ (0, 2023-05-29)
data/fingerprints/ (0, 2023-05-29)
data/fingerprints/AEs.selfies.test (5933550, 2023-05-29)
data/fingerprints/AEs.smiles.test (4117217, 2023-05-29)
data/fingerprints/Avalon.selfies.test (9734481, 2023-05-29)
data/fingerprints/Avalon.smiles.test (7919477, 2023-05-29)
data/fingerprints/ECFP0.selfies.test (3292133, 2023-05-29)
data/fingerprints/ECFP0.smiles.test (1479284, 2023-05-29)
data/fingerprints/ECFP2.selfies.test (4111707, 2023-05-29)
data/fingerprints/ECFP2.smiles.test (2306521, 2023-05-29)
data/fingerprints/ECFP4.selfies.test (4943243, 2023-05-29)
data/fingerprints/ECFP4.smiles.test (3123811, 2023-05-29)
data/fingerprints/FCFP2.selfies.test (3644813, 2023-05-29)
data/fingerprints/FCFP2.smiles.test (1828603, 2023-05-29)
data/fingerprints/FCFP4.selfies.test (4392710, 2023-05-29)
data/fingerprints/FCFP4.smiles.test (2574082, 2023-05-29)
data/fingerprints/HashAP.selfies.test (6910009, 2023-05-29)
... ...

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC_BY--NC_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![DOI](https://zenodo.org/badge/451459811.svg)](https://zenodo.org/badge/latestdoi/451459811)
[![J. Cheminformatics DOI](https://img.shields.io/badge/J._Cheminformatics-10.1186%2Fs13321--023--00693--0-blue)](https://doi.org/10.1186/s13321-023-00693-0)

## Reconstruction of lossless molecular representations from fingerprints

>Ucak UV, Ashyrmamatov I, Lee J (2023) Reconstruction of lossless molecular representations from fingerprints. J Cheminformatics 15:26. https://doi.org/10.1186/s13321-023-00693-0

The simplified molecular-input line-entry system (SMILES) is the most prevalent molecular representation used in AI-based chemical applications. However, there are innate limitations associated with the internal structure of SMILES representations. In this context, this study exploits the resolution and robustness of unique molecular representations, i.e., SMILES and SELFIES (SELF-referencIng Embedded strings), reconstructed from a set of structural fingerprints, which are proposed and used herein as vital representational tools for chemical and natural language processing (NLP) applications. This is achieved by restoring the connectivity information lost during fingerprint transformation with high accuracy. Notably, the results reveal that seemingly irreversible molecule-to-fingerprint conversion is feasible. More specifically, four structural fingerprints, extended connectivity, topological torsion, atom pairs, and atomic environments can be used as inputs and outputs of chemical NLP applications. Therefore, this comprehensive study addresses the major limitation of structural fingerprints that precludes their use in NLP models. Our findings will facilitate the development of text- or fingerprint-based chemoinformatic models for generative and translational tasks.

### Code usage

#### Requirements

The source code is tested on Linux operating systems. After cloning the repository, we recommend creating a new conda environment and installing the package locally. Install the required packages listed in `environment.yml` before use.

```shell
conda env create --name MolForge_env --file=environment.yml
conda activate MolForge_env
python -m pip install .
```

#### Prediction & Demo

First, download and extract the checkpoint files ([top-performing](https://drive.google.com/uc?id=1zl6HBdwYsnA4JcnOi1o6OmcrRDB5iySK) or [all the other models](https://drive.google.com/uc?id=1jCtbc9lMacCyiZ3iZFEtFgOfOQYtWEuD)) and place them in the `./saved_models/` directory. Then run the command below to perform inference with a trained model.

```shell
python predict.py --fp --model_type --input --checkpoint
```

where:

- `--fp`: Name of the fingerprint.
- `--model_type`: Molecular representation, e.g. `'smiles'` or `'selfies'`.
- `--input`: Bit numbers of the fingerprint (`--fp`), given as a space-separated string as in the example below.
- `--checkpoint`: Checkpoint file for the given model. If `None`, the downloaded checkpoints in `./saved_models/` are used.
- `--decode`: Decoding algorithm, either `'greedy'` or `'beam'` (default: `greedy`).

Example prediction:

```shell
python predict.py --fp='ECFP4' --model_type='smiles' --input='1 80 94 114 237 241 255 294 392 411 425 695 743 747 786 875 1057 1171 1238 1365 1380 1452 1544 1750 1773 1853 1873 1970'
```

and its sample output:

```shell
Here we go..
fp : ECFP4
model_type : smiles
input : 1 80 94 114 237 241 255 294 392 411 425 695 743 747 786 875 1057 1171 1238 1365 1380 1452 1544 1750 1773 1853 1873 1970
input_file : None
checkpoint : saved_models/ECFP4_smiles_checkpoint.pth
decode : greedy
src_vocab_size : 2052
trg_vocab_size : 109
src_seq_len : 104
trg_seq_len : 130
root_dir : /home/tmp/MolForge
fp_datadir : /home/tmp/MolForge/data/fingerprints/ECFP4
src_sp_prefix : /home/tmp/MolForge/data/sp/ECFP4_vocab_sp
trg_sp_prefix : /home/tmp/MolForge/data/sp/smiles_vocab_sp
rank : cuda
device : cuda
The size of src vocab is 2052 and that of trg vocab is 109.
Loading checkpoint...
ECFP4 smiles
Preprocessing input sentence...
Encoding input sentence...
Greedy decoding selected.
Input: 1 80 94 114 237 241 255 294 392 411 425 695 743 747 786 875 1057 1171 1238 1365 1380 1452 1544 1750 1773 1853 1873 1970
Result: C C O C 1 = C ( C = C ( C = C 1 ) C ( C ( C ) ( C ) C ) N ) O C C
Inference finished! || Total inference time: 0mins 0secs
```
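
The `--input` string and the tokenized `Result:` line can be prepared and post-processed with a few lines of Python. The following is a minimal sketch, not part of the MolForge package: the helper names `bits_to_input`, `build_command`, and `join_tokens` are hypothetical, and it assumes that the result tokens can simply be concatenated to obtain the final string, as the sample output above suggests. The command is only assembled, not executed.

```python
import shlex


def bits_to_input(on_bits):
    """Format bit numbers as the space-separated string passed to --input."""
    return " ".join(str(b) for b in sorted(set(on_bits)))


def build_command(fp, model_type, on_bits, decode="greedy"):
    """Assemble the predict.py argument list using the flags documented above."""
    return [
        "python", "predict.py",
        f"--fp={fp}",
        f"--model_type={model_type}",
        f"--input={bits_to_input(on_bits)}",
        f"--decode={decode}",
    ]


def join_tokens(result):
    """Collapse whitespace-separated output tokens into a single string.

    Assumption: tokens are separated by whitespace only, as in the sample output.
    """
    return "".join(result.split())


# Bit numbers taken from the example prediction above.
on_bits = [1, 80, 94, 114, 237, 241, 255, 294, 392, 411, 425, 695, 743, 747,
           786, 875, 1057, 1171, 1238, 1365, 1380, 1452, 1544, 1750, 1773,
           1853, 1873, 1970]

cmd = build_command("ECFP4", "smiles", on_bits)
print(shlex.join(cmd))  # shell-quoted command line, ready to copy or run

smiles = join_tokens(
    "C C O C 1 = C ( C = C ( C = C 1 ) C ( C ( C ) ( C ) C ) N ) O C C"
)
print(smiles)
```

Passing `cmd` to `subprocess.run` from the directory that contains `predict.py` would launch the prediction; this step is left out here because it needs the downloaded checkpoints.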

## Result

Each cell shows the Tanimoto exactness (%) of the SMILES reconstructed from the fingerprint in that row, evaluated with the fingerprint encoding in that column. Consistent values (colors) along a row reflect robustness, while jumps reflect the effect of selection bias. ECFP2* and ECFP4* denote the explicit bit versions.

|        | MACCS | Avalon | RDK4 | RDK4_L | HashAP | TT | HashTT | ECFP0 | ECFP2 | ECFP4 | FCFP2 | FCFP4 | AEs | ECFP2* | ECFP4* |
|:-------|------:|-------:|-----:|-------:|-------:|---:|-------:|------:|------:|------:|------:|------:|----:|-------:|-------:|
| MACCS  | 77.4 | 33.3 | 38 | 39.8 | 32.2 | 33.2 | 33.2 | 52.2 | 34.7 | 32.5 | 48.6 | 33.5 | 34.7 | 37 | 33.3 |
| Avalon | 72.6 | 67.9 | 72.2 | 73.5 | 63.4 | 64.7 | 64.7 | 69.5 | 65.6 | 63.6 | 68.9 | 64.7 | 65.6 | 68.5 | 64.6 |
| RDK4   | 66.9 | 60 | 90.9 | 91.5 | 59.8 | 61.1 | 61.1 | 62.5 | 60.2 | 58.3 | 62.3 | 59.6 | 60.2 | 64.3 | 59.6 |
| RDK4_L | 52.6 | 46.7 | 64.7 | 88.8 | 46.7 | 47.7 | 47.7 | 49.1 | 46.9 | 45.5 | 48.8 | 46.5 | 46.9 | 49.3 | 46.2 |
| HashAP | 86.5 | 83.8 | 89.6 | 90.2 | 85.2 | 85.5 | 85.5 | 84.3 | 83.1 | 82.5 | 84 | 82.8 | 83.1 | 86.1 | 84.1 |
| TT     | 88.4 | 83.5 | 92.3 | 92.5 | 84.1 | 87.3 | 87.3 | 85.8 | 85.2 | 82.3 | 85.7 | 83.8 | 85.2 | 91.4 | 84.2 |
| HashTT | 86.2 | 81.4 | 90.2 | 90.5 | 82.1 | 85.3 | 85.5 | 83.9 | 83.3 | 80.4 | 83.8 | 81.8 | 83.3 | 89.2 | 82.2 |
| ECFP0  | 3.3 | 1.3 | 2.1 | 2.7 | 1.2 | 1.3 | 1.3 | 4 | 1.4 | 1.2 | 2.9 | 1.3 | 1.4 | 1.8 | 1.4 |
| ECFP2  | 86 | 75.8 | 83.1 | 83.1 | 73.6 | 76 | 76 | 84.7 | 82.7 | 74.4 | 84.5 | 76.5 | 82.7 | 96.2 | 76 |
| ECFP4  | 95.1 | 92.6 | 95.7 | 95.7 | 90.8 | 92.4 | 92.4 | 93.5 | 93.1 | 92.1 | 93.3 | 92.4 | 93.1 | 96.6 | 94.8 |
| FCFP2  | 25.6 | 16.3 | 20.1 | 21.6 | 15.5 | 16 | 16 | 28.6 | 16.9 | 15.7 | 38.7 | 20.4 | 16.9 | 19.6 | 16.1 |
| FCFP4  | 71.5 | 67.5 | 73.7 | 73.8 | 65.5 | 67.3 | 67.3 | 69.2 | 68.5 | 66.3 | 87.6 | 86.7 | 68.5 | 74.4 | 68.1 |
| AEs    | 86.7 | 76.2 | 83.5 | 83.6 | 74 | 76.3 | 76.3 | 85.3 | 83.5 | 74.7 | 85.2 | 76.8 | 83.5 | 97 | 76.5 |

For more results, see the `Main_Results.ipynb` notebook.
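
To make the metric concrete, the sketch below computes a Tanimoto similarity on sets of on-bit numbers and an exactness percentage. It is an illustration only and not the repository's `evaluate.py`: it assumes that "Tanimoto exactness" means the percentage of predictions whose Tanimoto similarity to the reference fingerprint equals 1, and the function names and toy data are hypothetical. Computing the fingerprints of predicted molecules requires a cheminformatics toolkit and is not shown.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as on-bit numbers."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(union)


def tanimoto_exactness(predicted, reference):
    """Percentage of pairs with Tanimoto similarity exactly 1 (assumed definition)."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        return 0.0
    exact = sum(1 for p, r in zip(predicted, reference) if tanimoto(p, r) == 1.0)
    return 100.0 * exact / len(predicted)


# Toy data: three reference fingerprints and the fingerprints of three predictions.
reference = [[1, 80, 94], [1, 80, 94], [5, 7, 9]]
predicted = [[1, 80, 94], [1, 80], [5, 7, 9]]

partial = tanimoto(predicted[1], reference[1])  # 2 shared bits out of 3 in the union
score = tanimoto_exactness(predicted, reference)
print(f"partial match similarity: {partial:.3f}")
print(f"Tanimoto exactness: {score:.1f}%")
```

Under this reading, a row of the table corresponds to one set of predictions scored against the references under each column's fingerprint in turn.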

## Cite

```
@article{10.1186/s13321-023-00693-0,
  year = {2023},
  title = {{Reconstruction of lossless molecular representations from fingerprints}},
  author = {Ucak, Umit V. and Ashyrmamatov, Islambek and Lee, Juyong},
  journal = {Journal of Cheminformatics},
  issn = {1758-2946},
  doi = {10.1186/s13321-023-00693-0},
  pmid = {36823647},
  pmcid = {PMC9948316},
  pages = {26},
  number = {1},
  volume = {15}
}
```

### License

[![CC BY-SA 4.0][cc-by-sa-shield]][cc-by-sa]

This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License][cc-by-sa].

[![CC BY-SA 4.0][cc-by-sa-image]][cc-by-sa]

[cc-by-sa]: http://creativecommons.org/licenses/by-sa/4.0/
[cc-by-sa-image]: https://licensebuttons.net/l/by-sa/4.0/88x31.png
[cc-by-sa-shield]: https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg
