MELODI-Presto

Category: Big data
Development tool: Jupyter Notebook
File size: 3752KB
Downloads: 0
Upload date: 2023-01-23 20:37:54
Uploader: sh-1993
Description: A fast and programmatic MELODI

File list:
Dockerfile (341, 2023-01-24)
LICENSE (35149, 2023-01-24)
Usage.md (394, 2023-01-24)
create (0, 2023-01-24)
create\Creation.md (3798, 2023-01-24)
create\create_semmed_freqs.py (3911, 2023-01-24)
create\index_semmeddb_citations.py (4439, 2023-01-24)
create\index_semmeddb_freqs.py (3793, 2023-01-24)
create\index_semmeddb_predicate.py (7186, 2023-01-24)
create\index_semmeddb_sentences.py (6293, 2023-01-24)
create\mysql_to_csv.py (3599, 2023-01-24)
django_project (0, 2023-01-24)
django_project\django_project (0, 2023-01-24)
django_project\django_project\__init__.py (0, 2023-01-24)
django_project\django_project\apps.py (93, 2023-01-24)
django_project\django_project\serializers.py (629, 2023-01-24)
django_project\django_project\settings.py (4823, 2023-01-24)
django_project\django_project\static (0, 2023-01-24)
django_project\django_project\static\django_project (0, 2023-01-24)
django_project\django_project\static\django_project\Jupyter_logo.svg.png (76311, 2023-01-24)
django_project\django_project\static\django_project\MELODI_Lite_Logo.png (16777, 2023-01-24)
django_project\django_project\static\django_project\MELODI_Presto_Logo.png (20097, 2023-01-24)
django_project\django_project\static\django_project\MRC_IEU.png (205379, 2023-01-24)
django_project\django_project\static\django_project\colab_favicon_256px.png (5500, 2023-01-24)
django_project\django_project\static\django_project\favicon.ico (6987, 2023-01-24)
django_project\django_project\static\django_project\github_logo.png (42820, 2023-01-24)
django_project\django_project\static\django_project\index.css (368, 2023-01-24)
django_project\django_project\static\django_project\logo.png (7264, 2023-01-24)
django_project\django_project\static\django_project\melodi_logo.png (11550, 2023-01-24)
django_project\django_project\static\django_project\paper-figure.png (143682, 2023-01-24)
django_project\django_project\static\django_project\uob.jpg (30534, 2023-01-24)
django_project\django_project\templates (0, 2023-01-24)
django_project\django_project\templates\django_project (0, 2023-01-24)
django_project\django_project\templates\django_project\about.html (1299, 2023-01-24)
django_project\django_project\templates\django_project\app.html (689, 2023-01-24)
django_project\django_project\templates\django_project\base.html (6052, 2023-01-24)
django_project\django_project\templates\django_project\enrich.html (5404, 2023-01-24)
... ...

### Publication

[MELODI Presto: A fast and agile tool to explore semantic triples derived from biomedical literature](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa726/5893950)

### API and App

[https://melodi-presto.mrcieu.ac.uk/](https://melodi-presto.mrcieu.ac.uk/)

[![DOI](https://zenodo.org/badge/259267887.svg)](https://zenodo.org/badge/latestdoi/259267887)

### Usage

Details on how to use the method can be found [here](Usage.md)

### Creation

Details on how the method was created can be found [here](create/Creation.md)

### About

Previously we created MELODI, a method and tool to derive overlapping enriched literature elements connecting two biomedical terms, e.g. an exposure and a disease, [(Elsworth et al., 2018)](https://doi.org/10.1093/ije/dyx251). The main data involved were derived from SemMedDB [(Kilicoglu et al., 2012)](https://academic.oup.com/bioinformatics/article/28/23/3158/195282), in particular a set of annotated ‘subject-predicate-object’ triples created from the titles and abstracts of almost 30 million biomedical articles. All data were housed in a Neo4j graph, and each query term of interest created connections between the query and the associated literature nodes. This approach suited this type of analysis, and storing the data in a graph made sense. However, creating new links and performing large queries (those that search large parts of the graph and return large amounts of data) were not efficient. In addition, the graph contained all data from the PREDICATION table from SemMedDB, which contains many predicates and types that were not informative. It was also becoming apparent that limiting searches to two query terms was not ideal. For example, cases where a set of genes had been identified with potential links to a disease could not be queried efficiently and the results were impossible to disentangle.
There was also a developing need to run many queries, and doing this via the web application was not practical, so a programmatic method was required. To address all these issues, we created MELODI Presto, a quicker and more agile method to identify overlapping elements between any number of exposures and outcomes. The modifications made to the data, architecture and method are listed below:

##### Filter by term type

SemMedDB triples were filtered to include only those matching particular ‘term types’. These types are defined by the UMLS semantic type abbreviations (https://mmtx.nlm.nih.gov/MMTx/semanticTypes.shtml). We decided to focus on terms that would be most relevant to mechanistic inference. Table 1 lists the terms that were selected.

Table 1. UMLS semantic types included in MELODI Presto

```
curl -X GET "localhost:9200/semmeddb-v40/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs" : {
    "sub_type" : {
      "terms" : { "field" : "SUBJECT_SEMTYPE", "size" : 10000 }
    }
  }
}
'
```

|Type acronym | Type full name | Subject Count | Object Count |
|---|---|---|---|
|aapp |Amino Acid, Peptide, or Protein |2,796,833 |1,506,909|
|gngm |Gene or Genome |1,172,***3 |1,957,313|
|orch |Organic Chemical |1,106,038 |556,152|
|dsyn |Disease or Syndrome |877,924 |2,144,961|
|horm |Hormone |235,704 |104,903|
|hops |Hazardous or Poisonous Substance |167,979 |99,867|
|inch |Inorganic Chemical |134,810 |160,096|
|enzy |Enzyme |35,497 |46,044|
|chem |Chemical |15,318 |13,156|

##### Filter by predicate type

To improve the usability of the data, some of the more ambiguous predicates were removed. Table 2 lists the predicates that were excluded from the data set.

```
zless semmedVER40_R_PREDICATION.tsv.gz | cut -f 4 | sort | uniq -c | sort -nr
```

Table 2. Excluded SemMedDB predicates and their frequency counts

|Predicate |Count|
|---|---|
| PROCESS_OF | 19,628,9*** |
| LOCATION_OF | 16,***7,580 |
| PART_OF | 9,920,521 |
| ISA | 5,886,751 |
| USES | 4,487,945 |
| compared_with | 1,056,***2 |
| ADMINISTERED_TO | 1,535,833 |
| METHOD_OF | 581,303 |

Combined, these two criteria reduce the number of PREDICATE triples from 97,972,561 to 6,533,824.

```
curl -XGET 'localhost:9200/semmeddb-v40/_count?pretty'
{
  "count" : 6533824,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  }
}
```

##### Use Elasticsearch instead of Neo4j

As this was now a simpler lookup problem, and not something requiring complex relationships, Elasticsearch was selected as the architecture within which to search the data. Previous experience (https://ieup4.blogs.bristol.ac.uk/2019/04/16/exploring-elasticsearch-architectures-with-oracle-cloud/) had also identified this method as suitably quick for this type of analysis.

##### Enrichment

For enrichment analysis, counts of all triples were obtained using Elasticsearch aggregation calls and added to a separate index. The enrichment method follows the same principle as MELODI, using a standard 2x2 Fisher’s exact test. For example, if a query ‘Sleep duration’ returned a set of triples "Sleep Apnea, Obstructive:PREDISPOSES:Hypertensive disease" then we can count the number of these triples (a), the total number of triples matched to the query (b), the total number of these triples in the database (c), and the total number of triples in the database (d).

```
import scipy.stats as stats

# a: matching triples for the query, b: all triples for the query
# c: matching triples in the database, d: all triples in the database
a, b, c, d = [10, 3505, 147, 6611441]
oddsratio, pvalue = stats.fisher_exact([[a, b - a], [c, d - c]])
oddsratio, pvalue
# (128.68323065993206, 3.002903135377263e-18)
```

##### Performance

A first pass creates local copies of the enrichment data, as seen above. For this reason, if a variable has not been run already, it may take a few moments. However, if an existing variable is queried, the function runs in seconds.
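
The compute-once, reuse-afterwards behaviour described above can be illustrated with a minimal Python sketch. It is not the MELODI Presto implementation: the function `cached_enrichment`, the one-JSON-file-per-query naming scheme and the stand-in `slow_compute` step are all hypothetical and only show why a repeated query returns quickly.

```python
import json
import tempfile
from pathlib import Path


def cached_enrichment(query, compute, cache_dir):
    """Return results for `query`, running `compute` only on first use."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one JSON file per normalised query string.
    key = "_".join(query.lower().split())
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        # Fast path: this query has been run before, so reuse the local copy.
        return json.loads(cache_file.read_text())
    # Slow path: run the expensive step once and keep a local copy.
    result = compute(query)
    cache_file.write_text(json.dumps(result))
    return result


calls = []


def slow_compute(query):
    # Stand-in for the real enrichment step (assumption, not the project's code).
    calls.append(query)
    return {"query": query, "triples": []}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_enrichment("chronic kidney disease", slow_compute, tmp)
    # Differences in case and spacing map to the same cache entry.
    second = cached_enrichment("Chronic  Kidney Disease", slow_compute, tmp)
```

In this sketch the second call never reaches `slow_compute`, which mirrors the difference between a first pass and a repeat query.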
For example, timing an enrichment request through the API:

```
q="chronic kidney disease"
time curl -o "ckd.melodi-presto.json" -X POST "https://melodi-presto.mrcieu.ac.uk/api/enrich/" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"query\": \"$q\" }"
4.013 total
```

### Limitations

[SemMedDB](https://ii.nlm.nih.gov/SemRep_SemMedDB_SKR/SemMedDB/SemMedDB_download.shtml) is an excellent resource but it is not a perfect representation of the literature. Precision ranges from 73-96% (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3509487/), meaning that many of the semantic triples are incorrect or missing. By refining the SemMedDB data we have attempted to reduce some of the noise, but will have inadvertently removed some useful content too. Balancing signal against noise, as well as performance, is difficult.

### Notes

The call to PubMed is limited to the most recent 1 million articles.

```
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?
{'db': 'pubmed', 'term': '', 'retmax': '1000000', 'rettype': 'uilist'}
```
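
As a rough illustration of the PubMed call above, the following Python sketch assembles the same esearch URL from the listed parameters. The helper name `build_esearch_url` and the example search term are assumptions (the `term` value is empty in the note above); the endpoint and the other parameter values are copied from it. No request is sent.

```python
from urllib.parse import urlencode

# Endpoint as given in the note above.
ESEARCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=1000000):
    """Build an esearch URL capped at `retmax` returned article IDs."""
    params = {
        "db": "pubmed",
        "term": term,             # search term; the example value below is hypothetical
        "retmax": str(retmax),    # the 1 million article limit mentioned above
        "rettype": "uilist",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("chronic kidney disease")
print(url)
```

Keeping `retmax` as a parameter makes the 1 million cap explicit at the call site rather than buried in a dictionary literal.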
