Datasets from "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents"
Dataset Description |
# Overview These datasets were created in conjunction with the dissertation "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents," by Adam Kehoe. The datasets consist of the following: * twin_not_abstract_matched_complete.tsv: a tab-delimited file consisting of pairs of MEDLINE articles with identical titles, authors and years of publication. This file contains the PMIDs of the duplicate publications, as well as their medical subject headings (MeSH) and three measures of their indexing consistency.
## Duplicate MEDLINE Publications Both the twin_not_abstract_matched_complete.tsv and twin_abstract_matched_complete.tsv have the same structure. They have the following columns: 1. pmid_one: the PubMed unique identifier of the first paper
The mesh_training_data.csv file contains the training data for the model discussed in the dissertation. It has the following columns: 1. pmid: the PubMed unique identifier of the paper
## Cosine Similarity The mesh_scores.tsv file contains a pairwise list of all MeSH terms including their cosine similarity based on the word embedding described in the dissertation. Because the MeSH hierarchy is also used in many of the evaluation measures, the relationship of the term pair is also included. It has the following columns: 1. mesh_one: a string of the first MeSH heading.
The mesh_model.bin file contains a binary word2vec C format file containing the MeSH term embeddings. It was generated using version 3.7.2 of the Python gensim library (https://radimrehurek.com/gensim/). For an example of how to load the model file, see https://radimrehurek.com/gensim/models/word2vec.html#usage-examples, specifically the directions for loading the "word2vec C format." |
Subject |
Social Sciences |
Keywords |
MEDLINE;MeSH;Medical Subject Headings;Indexing |
License |
CC BY |
Corresponding Creator |
Adam K. Kehoe |
Downloaded |
1246 times |
| Version | DOI | Comment | Publication Date |
|---|---|---|---|
| 1 | 10.13012/B2IDB-8020612_V1 | 2019-07-08 |
Contact the Research Data Service for help interpreting this log.
| Dataset | update: {"version_comment"=>[nil, ""], "subject"=>[nil, "Social Sciences"]} | 2019-07-25T21:53:15Z |
| Dataset | update: {"publication_state"=>["metadata embargo", "released"]} | 2019-07-08T06:00:09Z |