Illinois Data Bank Dataset Search Results
published:
2025-09-08
Si, Luyang; Salami, Malik Oyewale; Schneider, Jodi
(2025)
This project evaluates the quality of retraction indexing metadata in Crossref. We investigated 208 DOIs that were indexed as retracted in Crossref in our April 2023 union list (Schneider et al., 2023), but were no longer indexed as retracted in the July 2024 union list (Salami et al., 2024), despite still being covered in the Crossref database. Therefore, we manually checked the current retraction status of these 208 DOIs on their publishers’ websites to ascertain their actual status.
keywords:
Crossref; Data Quality; Retraction indexing; Retracted papers; Retraction notices; Retraction status; RISRS
published:
2023-07-27
Feng, Ling; Takiya, Daniela; Krishnankutty, Sindhu; Dietrich, Christopher; Zhang, Yalin
(2023)
The text file contains the original aligned DNA nucleotide sequence data used in the phylogenetic analyses of Feng et al. (in review), comprising the 3 protein-coding genes (histone H3, cytochrome oxidase I and 2) and 2 ribosomal genes (28S D8 and 16S). The text file is marked up according to the standard NEXUS format commonly used by various phylogenetic analysis software packages. The file will be parsed automatically by a variety of programs that recognize NEXUS as a standard bioinformatics file format. The first six lines of the file identify the file as NEXUS, indicate that the file contains data for 257 taxa (species) and 2995 characters (nucleotide positions), indicate that the characters are DNA sequence, that gaps inserted into the DNA sequence alignment are indicated by a dash, and that missing data are indicated by a question mark. The remainder of the file contains the aligned nucleotide sequence data for the five genes. Data partitions, representing the individual genes and different codon positions of the protein-coding genes, are indicated by the lines beginning "charset" near the end of the file. Two supplementary tables in the provided PDF file provide additional information on the species in the dataset, including the GenBank accession numbers for the sequence data (Table S1) and the DNA substitution models used for each of the data partitions used for analyses in the phylogenetic analysis program IQ-Tree (version 1.6.8) (Table S3), as described in the Methods section of the paper. The supplemental tables will also be linked to the article upon publication at the journal website.
keywords:
Insect; leafhopper; dispersal; vicariance; evolution
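For the NEXUS alignment described in the leafhopper record above, a minimal Python sketch for reading the taxon and character counts and the "charset" partition lines might look like the following. The function name and the miniature sample text are assumptions made for illustration; the real file may lay out these lines differently, and dedicated phylogenetics packages handle many more cases.

```python
import re

def summarize_nexus(text):
    """Return (ntax, nchar, charsets) found in NEXUS-formatted text."""
    # The dimensions line normally carries the taxon and character counts.
    dims = re.search(r"ntax\s*=\s*(\d+)\s+nchar\s*=\s*(\d+)", text, re.IGNORECASE)
    ntax, nchar = (int(dims.group(1)), int(dims.group(2))) if dims else (None, None)

    # Each "charset NAME = RANGE;" line defines one data partition.
    charsets = {}
    pattern = r"^\s*charset\s+(\S+)\s*=\s*([^;]+);"
    for match in re.finditer(pattern, text, re.IGNORECASE | re.MULTILINE):
        charsets[match.group(1)] = match.group(2).strip()
    return ntax, nchar, charsets

# Hypothetical miniature file; partition names and ranges are invented.
sample = """#NEXUS
begin data;
dimensions ntax=257 nchar=2995;
format datatype=dna gap=- missing=?;
matrix
;
end;
begin sets;
charset geneA = 1-328;
charset geneB = 329-1000;
end;
"""

ntax, nchar, charsets = summarize_nexus(sample)
```

Here only the header values (257 taxa, 2995 characters, dash for gaps, question mark for missing data) come from the record description; everything else in the sample is a placeholder.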
published:
2025-11-20
Raj, Tirath; Singh, Vijay
(2025)
In a novel approach, metabolically engineered sugarcane “Oilcane” has been investigated for fractionation of lipid and cellulose-rich pulp, using certain natural deep eutectic solvents (NADES). The exploration of eco-friendly solvents is at the forefront of harnessing the biofuel potential of modern bioenergy crops. For this, six combinations of NADES were prepared using choline chloride (ChCl) as the hydrogen bond acceptor (HBA) and lactic acid (LA), oxalic acid (OA) and glycerol (Gly) as hydrogen bond donors (HBD), in molar ratios of 1:1 and 1:2, and were further explored for pretreatment of oilcane bagasse. The impact of NADES ratio, biomass loading (10–50%), residence time (1–2 h), and temperature (90–140 °C) was evaluated for delignification, lipid content, and sugar release after enzymatic hydrolysis. The findings demonstrated that under the optimal condition of ChCl:LA (1:2 molar ratio), 140 °C with 2 h retention time, the lipid content in the pre-treated substrate increased 2.5-fold (∼8% w/w) and > 80% glucose yield was achieved after 72 h of hydrolysis of pre-treated bagasse. High solid loading (∼50%) during pretreatment resulted in a similar glucose yield. Furthermore, recycling studies demonstrated that nearly 95 to 98% of the NADES could be recycled after each pretreatment for up to five consecutive cycles without any significant loss in chemical structure, as confirmed by 1H NMR and FT IR. FT IR and XRD analyses of native and pre-treated biomass were performed to visualize the morphological changes during NADES pretreatment and their impact on sugar yield. The findings of the study may be used to establish a NADES-based biorefinery for the valorization of lipids and carbohydrates for fuels and chemicals production.
keywords:
Conversion;Hydrolysate;Lipidomics
published:
2025-10-27
Cheng, Ming-Hsun; Singh, Shuchi; Carr Clennon, Aidan N.; Dien, Bruce; Singh, Vijay
(2025)
Xylan accounts for up to 40% of the structural carbohydrates in lignocellulosic feedstocks. Along with xylan, acetic acid in sources of hemicellulose can be recovered and marketed as a commodity chemical. Through vibrant bioprocessing innovations, converting xylose and acetic acid into high-value bioproducts via microbial cultures improves the feasibility of lignocellulosic biorefineries. Enzymatic hydrolysis using xylanase supplemented with acetylxylan esterase (AXE) was applied to prepare xylose-acetic acid enriched hydrolysates from bioenergy sorghum, oilcane, or energycane using sequential hydrothermal-mechanical pretreatment. Various biomass solids contents (15 to 25%, w/v) and xylanase loadings (140 to 280 FXU/g biomass) were tested to maximize xylose and acetic acid titers. The xylose and acetic acid yields were significantly improved by supplementing with AXE. The optimal yields of xylose and acetic acid were 92.29% and 62.26% obtained from hydrolyzing energycane and oilcane at 25% and 15% w/v biomass solids using 280 FXU xylanase/g biomass and AXE, respectively.
keywords:
Conversion;Biomass Analytics;Feedstock Bioprocessing;Hydrolysate
published:
2022-08-31
Chen, Wenxiang; Zhan, Xun; Yuan, Renliang; Pidaparthy, Saran; Yong, Adrian Xiao Bin; An, Hyosung; Tang, Zhichu; Yin, Kaijun; Patra, Arghya; Jeong, Heonjae; Zhang, Cheng; Ta, Kim; Riedel, Zachary; Stephens, Ryan; Shoemaker, Daniel; Yang, Hong; Gewirth, Andrew; Braun, Paul; Ertekin, Elif; Zuo, Jian-Min; Chen, Qian
(2022)
These datasets are for the four-dimensional scanning transmission electron microscopy (4D-STEM) and electron energy loss spectroscopy (EELS) experiments for cathode nanoparticles at different cutoff voltages and in different electrolytes. The raw 4D-STEM experiment datasets were collected by TEM image & analysis software (FEI) and were saved as SER files. The raw 4D-STEM datasets of SER files can be opened and viewed in MATLAB using our analysis software package of imToolBox available at https://github.com/flysteven/imToolBox. The raw EELS datasets were collected by DigitalMicrograph software and were saved as DM4 files. The raw EELS datasets can be opened and viewed in DigitalMicrograph software or using our analysis codes available at https://github.com/chenlabUIUC/OrientedPhaseDomain. All the datasets are from the work "Formation and impact of nanoscopic oriented phase domains in electrochemical crystalline electrodes" (2022).
The 4D-STEM experiment data include four example datasets for cathode nanoparticles collected at different cutoff voltages and in different electrolytes as described below. Each dataset contains a stack of diffraction patterns collected at different probe positions scanned across the cathode nanoparticle.
1. Pristine cathode particle: "Pristine particle 4D-STEM.ser"
2. Cathode particle at the cutoff voltage of 0.09V during discharge at C/10 in the aqueous electrolyte: "Intermediate cutoff0_09V discharge (aqueous) 4D-STEM.ser"
3. Fully discharged cathode particle at C/10 in the aqueous electrolyte: "Fully discharged particle 4D-STEM.ser"
4. Fully discharged cathode particle at C/10 in the dry organic electrolyte: "Fully discharge particle (dry organic electrolyte).ser"
The EELS experiment data includes three example datasets for cathode nanoparticles collected at different cutoff voltages during discharge in the aqueous electrolyte (in "EELS datasets.zip") as described below. Each EELS dataset contains the zero-loss and core-loss EELS spectra collected at different probe positions scanned across the cathode nanoparticle.
1. Pristine cathode particle: "Pristine particle EELS.zip"
2. Cathode particle at the cutoff voltage of 0.09V during discharge at C/10 in the aqueous electrolyte: "intermediate discharge (aqueous) EELS.zip"
3. Fully discharged cathode particle at C/10 in the aqueous electrolyte: "fully discharge (aqueous) EELS.zip"
The details of the software package and codes that can be used to analyze the 4D-STEM datasets and EELS datasets are available at: https://github.com/chenlabUIUC/OrientedPhaseDomain. Once our paper is formally published, we will update the relationship of these datasets with our paper.
keywords:
4D-STEM; microstructure; phase transformation; strain; cathode; nanoparticle; energy storage
published:
2025-09-18
Cao, Mingfeng; Fatma, Zia; Song, Xiaofei; Hsieh, Ping-Hung; Tran, Vinh G.; Lyon, William L.; Sayadi, Maryam; Shao, Zengyi; Yoshikuni, Yasuo; Zhao, Huimin
(2025)
The nonconventional yeast Issatchenkia orientalis can grow under highly acidic conditions and has been explored for production of various organic acids. However, its broader application is hampered by the lack of efficient genetic tools to enable sophisticated metabolic manipulations. We recently constructed an episomal plasmid based on the autonomously replicating sequence (ARS) from Saccharomyces cerevisiae (ScARS) in I. orientalis and developed a CRISPR/Cas9 system for multiplexed gene deletions. Here we report three additional genetic tools including: (1) identification of a 0.8 kb centromere-like (CEN-L) sequence from the I. orientalis genome by using bioinformatics and functional screening; (2) discovery and characterization of a set of constitutive promoters and terminators under different culture conditions by using RNA-Seq analysis and a fluorescent reporter; and (3) development of a rapid and efficient in vivo DNA assembly method in I. orientalis, which exhibited ~100% fidelity when assembling a 7 kb-plasmid from seven DNA fragments ranging from 0.7 kb to 1.7 kb. As proof of concept, we used these genetic tools to rapidly construct a functional xylose utilization pathway in I. orientalis.
keywords:
Conversion;Genome Engineering;Genomics;Transcriptomics
published:
2025-10-01
Wang, Yajie; Huang, Xiaoqiang; Hui, Jingshu; Vo, Lam Tung; Zhao, Huimin
(2025)
There is a growing interest in developing cooperative chemoenzymatic reactions to harness the reactivity of chemical catalysts and the selectivity of enzymes for the synthesis of nonracemic chiral compounds. However, existing chemoenzymatic systems with more than one chemical reaction and one enzymatic reaction working cooperatively are rare. Moreover, the application of oxidoreductases in cooperative chemoenzymatic reactions is limited by the necessity of using expensive and unstable redox equivalents such as nicotinamide cofactors. Here, we report a light-driven cooperative chemoenzymatic system comprised of a photoinduced electron transfer reaction (PET) and a photosensitized energy transfer reaction (PEnT) with an enzymatic reduction in one-pot to synthesize chiral building blocks of bioactive compounds. As a proof of concept, ene-reductase was directly regenerated by PET in the absence of external cofactors. Meanwhile, enzymatic reduction worked cooperatively with photocatalyst-catalyzed energy transfer that continuously replenished the reactive isomer from the less reactive one. The whole system stereoconvergently reduced E/Z mixtures of alkenes to the enantiopure products. Additionally, enantioselective enzymatic reduction worked competitively with photocatalyst-catalyzed racemic background reaction and side reactions to channel the overall electron flow to the single enantiopure product. Such a light-driven cooperative chemoenzymatic system holds great potential for asymmetric synthesis using inexpensive petroleum or biomass-derived alkenes.
keywords:
Conversion;Catalysis
published:
2025-09-29
Wang, Sheng; Guan, Kaiyu; Wang, Zhihui; Ainsworth, Elizabeth; Zheng, Ting; Townsend, Philip; Li, Kaiyuan; Moller, Christopher; Wu, Genghong; Jiang, Chongya
(2025)
The photosynthetic capacity or the CO2-saturated photosynthetic rate (Vmax), chlorophyll, and nitrogen are closely linked leaf traits that determine C4 crop photosynthesis and yield. Accurate, timely, rapid, and non-destructive approaches to predict leaf photosynthetic traits from hyperspectral reflectance are urgently needed for high-throughput crop monitoring to ensure food and bioenergy security. Therefore, this study thoroughly evaluated the state-of-the-art physically based radiative transfer models (RTMs), data-driven partial least squares regression (PLSR), and generalized PLSR (gPLSR) models to estimate leaf traits from leaf-clip hyperspectral reflectance, which was collected from maize (Zea mays L.) bioenergy plots with diverse genotypes, growth stages, treatments with nitrogen fertilizers, and ozone stresses in three growing seasons. The results show that leaf RTMs considering bidirectional effects can give accurate estimates of chlorophyll content (Pearson correlation r=0.95), while gPLSR enabled retrieval of leaf nitrogen concentration (r=0.85). Using PLSR with field measurements for training, the cross-validation indicates that Vmax can be well predicted from spectra (r=0.81). The integration of chlorophyll content (strongly related to visible spectra) and nitrogen concentration (linked to shortwave infrared signals) can provide better predictions of Vmax (r=0.71) than only using either chlorophyll or nitrogen individually. This study highlights that leaf chlorophyll content and nitrogen concentration have key and unique contributions to Vmax prediction.
keywords:
Feedstock Production;Sustainability;Biomass Analytics;Modeling
published:
2021-02-26
Bauder, Javan M; Allen, Maximilian L.
(2021)
These data were used in the survival and cause-specific mortality analyses of translocated nuisance American black bear in Wisconsin published in Animal Conservation (Bauder, J.M., N.M. Roberts, D. Ruid, B. Kohn, and M.L. Allen. Accepted. Lower survival of nuisance American black bears (Ursus americanus) is not due to translocation. Animal Conservation). Included are CSV files including each bear's capture history and associated covariates and meta-data for each CSV file. Also included is an example R script of how to conduct the analyses (this R script is also included as supporting information with the published paper).
keywords:
black bear; survival; translocation; nuisance wildlife management
published:
2021-03-10
Trivellone, Valeria; Wei, Wei; Filippin, Luisa; Dietrich, Christopher H
(2021)
The PhytoplasmasRef_Trivellone_etal.fas fasta file contains the original final sequence alignment used in the phylogenetic analyses of Trivellone et al. (Ecology and Evolution, in review). The 27 sequences (21 phytoplasma reference strains and 6 phytoplasmas strains from the present study) were aligned using the Muscle algorithm as implemented in MEGA 7.0 with default settings. The final dataset contains 952 positions of the F2n/R2 fragment of the 16S rRNA gene.
The data analyses are further described in the cited original paper.
keywords:
Hemiptera; Cicadellidae; Mollicutes; Phytoplasma; biorepository
published:
2025-09-18
Jagtap, Sujit; Bedekar, Ashwini; Liu, Jing-Jing; Jin, Yong-Su; Rao, Christopher V.
(2025)
Sugar alcohols are commonly used as low-calorie sweeteners and can serve as potential building blocks for bio-based chemicals. Previous work has shown that the oleaginous yeast Rhodosporidium toruloides IFO0880 can natively produce arabitol from xylose at relatively high titers, suggesting that it may be a useful host for sugar alcohol production. In this work, we explored whether R. toruloides can produce additional sugar alcohols. Rhodosporidium toruloides is able to produce galactitol from galactose. During growth in nitrogen-rich medium, R. toruloides produced 3.2 ± 0.6 g/L, and 8.4 ± 0.8 g/L galactitol from 20 to 40 g/L galactose, respectively. In addition, R. toruloides was able to produce galactitol from galactose at reduced titers during growth in nitrogen-poor medium, which also induces lipid production. These results suggest that R. toruloides can potentially be used for the co-production of lipids and galactitol from galactose. We further characterized the mechanism for galactitol production, including identifying and biochemically characterizing the critical aldose reductase. Intracellular metabolite analysis was also performed to further understand galactose metabolism. Rhodosporidium toruloides has traditionally been used for the production of lipids and lipid-based chemicals. Our work demonstrates that R. toruloides can also produce galactitol, which can be used to produce polymers with applications in medicine and as a precursor for anti-cancer drugs. Collectively, our results further establish that R. toruloides can produce multiple value-added chemicals from a wide range of sugars.
keywords:
Conversion;Genomics;Metabolomics
published:
2021-03-08
Mickalide, Harry (Avery); Kuehn, Seppe
(2021)
These are abundance dynamics data and simulations for the paper "Higher-order interaction between species inhibits bacterial invasion of a phototroph-predator microbial community".
In this V2, the data were converted in Python in addition to MATLAB, and more information on how to work with the data was added to the Readme.
keywords:
Microbial community; Higher order interaction; Invasion; Algae; Bacteria; Ciliate
published:
2023-05-02
Larsen, Ryan; Stanke, Kayla L. ; Rund, Laurie; Leyshon, Brian; Louie, Allison; Steelman, Andrew
(2023)
This dataset includes structural MRI head scans of 32 piglets, at 28 days of age, scanned at the University of Illinois. The dataset also includes manually drawn brain masks of each of the piglets. The dataset also includes brain masks that were generated automatically using Region-Based Convolutional Neural Networks (Mask R-CNN), trained on the manually drawn brain masks.
keywords:
Brain extraction; Machine learning; MRI; Piglet; neural networks
published:
2021-10-10
This data set describes temperature, dissolved oxygen, and secchi depth in 1-m interval profiles in the deepest point in 10 Illinois reservoirs between the years 1995 and 2016.
keywords:
Water temperature; dissolved oxygen; secchi depth; climate change
published:
2022-09-01
Di Giovanni, Alexander; Ward, Michael
(2022)
These data and code are associated with a study on differences in the rate of hatching failure of eggs across 14 free-living grassland and shrubland birds. We used a device to measure the embryonic heart rate of eggs and found there was variation across species related to factors such as nest type and nest safety. This work is to be published in Ornithology.
keywords:
embryonic death; grassland birds; egg mortality; heart rate
published:
2021-02-10
Stickley, Samuel; Fraterrigo, Jennifer
(2021)
This dataset consists of microclimatic temperature and vegetation structure maps at a 3-meter spatial resolution across the Great Smoky Mountains National Park. Included are raster models for sub-canopy, near-surface, minimum and maximum temperature averaged across the study period, season, and month during the growing season months of March through November from 2006-2010. Also available are the topographic and vegetation inputs developed for the microclimate models, including LiDAR-derived vegetation height, LiDAR-derived vegetation structure within four height strata, solar insolation, distance-to-stream, and topographic convergence index (TCI).
keywords:
microclimate buffering; forest vegetation structure; temperature; Appalachian Mountains; climate downscaling; understory; LiDAR
published:
2021-03-17
Imker, Heidi J; Luong, Hoa; Mischo, William H; Schlembach, Mary C; Wiley, Chris
(2021)
This dataset was developed as part of a study that assessed data reuse. Through bibliometric analysis, corresponding authors of highly cited papers published in 2015 at the University of Illinois at Urbana-Champaign in nine STEM disciplines were identified and then surveyed to determine if data were generated for their article and their knowledge of reuse by other researchers. Second, the corresponding authors who cited those 2015 articles were identified and surveyed to ascertain whether they reused data from the original article and how that data was obtained. The project goal was to better understand data reuse in practice and to explore if research data from an initial publication was reused in subsequent publications.
keywords:
data reuse; data sharing; data management; data services; Scopus API
published:
2021-10-11
Peng, Jianhao; Ochoa, Idoia
(2021)
This dataset contains the ClonalKinetic dataset that was used in SimiC and its intermediate results for comparison. A detailed description can be found in the text files 'clonalKinetics_Example_data_description.txt' and 'ClonalKinetics_filtered.DF_data_description.txt'. The required input data for SimiC contain:
1. ClonalKinetics_filtered.clustAssign.txt => cluster assignment for each cell.
2. ClonalKinetics_filtered.DF.pickle => filtered scRNAseq matrix.
3. ClonalKinetics_filtered.TFs.pickle => list of driver genes.
The results after running SimiC contain:
1. ClonalKinetics_filtered_L10.01_L20.01_Ws.pickle => inferred GRNs for each cluster
2. ClonalKinetics_filtered_L10.01_L20.01_AUCs.pickle => regulon activity scores for each cell and each driver gene.
NOTE: The “ClonalKinetics_filtered.rds” file mentioned in “ClonalKinetics_filtered.DF_data_description.txt” is an intermediate file; the authors have put all the processed data in the pickle/txt files as described in the filtered data text.
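As a rough illustration of how the pickle files listed for the SimiC record above could be opened, here is a minimal Python sketch. The helper name and the stand-in file are assumptions; the object types stored in the real files are not stated in the record (the scRNAseq matrix, for example, may need pandas installed to unpickle).

```python
import pickle
import tempfile
from pathlib import Path

def load_pickle(path):
    """Load one pickled object from disk. Only unpickle files you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)

# Stand-in demonstration: write a small list to a temporary file and read it back.
# With the real dataset, one would pass a path such as
# "ClonalKinetics_filtered.TFs.pickle" instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_TFs.pickle"
    with demo_path.open("wb") as handle:
        pickle.dump(["GENE_A", "GENE_B"], handle)  # invented gene names
    driver_genes = load_pickle(demo_path)
```

The same helper would apply to the Ws and AUCs result files, subject to the same caveat about unknown object types.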
keywords:
GRNs;SimiC;RDS;ClonalKinetic
published:
2021-08-12
Ferguson, John; Fernandes, Samuel; Monier, Brandon; Miller, Nathan; Allen, Dylan; Dmitrieva, Anna; Schmuker, Peter; Lozano, Roberto; Valluru, Ravi; Buckler, Edward; Gore, Michael; Brown, Patrick; Spalding, Edgar; Leakey, Andrew
(2021)
This dataset contains the images of a photoperiod sensitive sorghum accession population used for a GWAS/TWAS study of leaf traits related to water use efficiency in 2016 and 2017.
Note: new in this second version, JPG images output from the nms files were added.
Accessions_2016.zip and Accessions_2017.zip: contain raw images produced by Optical Topometer (nms files) for all sorghum accessions. Images can be opened with Nanofocus μsurf analysis extended software (Oberhausen, Germany).
Accessions_2016_jpg.zip and Accessions_2017_jpg.zip: contain jpg images output from the nms files and used in the machine learning phenotyping.
keywords:
stomata; segmentation; water use efficiency
published:
2021-08-15
Felix, Hanau; Hannes, Rost; Ochoa, Idoia
(2021)
This data set contains mass spectrometry data used for the publication "mspack: efficient lossless and lossy mass spectrometry data compression".
keywords:
mass-spectrometry data; compression; proteomics
published:
2025-09-01
Chronic wasting disease (CWD) surveillance data from Illinois and Wisconsin, USA between the fiscal years 2003 and 2022 (calendar years 2002 and 2021). Data is reported at the township level as defined by the US Public Survey System. CWD cases, animals tested for CWD, and the apparent prevalence calculated from these values are given by township and fiscal year. Data has been anonymized by replacing original township names with identification numbers to maintain the privacy of landowners. Variables include Tests, Cases, and nonlinear transformations of Tests and Cases (inverse, square root, and log transformations).
keywords:
chronic wasting disease; cwd; white-tailed deer; deer; cervid; prion; apparent prevalence; prevalence; surveillance
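For the CWD surveillance record above, the relationship between Tests, Cases, and apparent prevalence, and the three named transformations, can be sketched in Python as follows. This is only an illustration under stated assumptions: the record does not say which log base was used or how zero counts were handled, so the natural log and the None placeholders here are choices made for the example.

```python
import math

def apparent_prevalence(cases, tests):
    """Cases divided by animals tested; None when nothing was tested."""
    return cases / tests if tests > 0 else None

def transforms(value):
    """Inverse, square root, and natural log of a non-negative count."""
    return {
        "inverse": 1 / value if value != 0 else None,  # undefined at zero
        "sqrt": math.sqrt(value),
        "log": math.log(value) if value > 0 else None,  # undefined at zero
    }

# Invented township-year values, not taken from the dataset.
prevalence = apparent_prevalence(cases=3, tests=120)
tests_transformed = transforms(4)
```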
published:
2025-12-15
Xiao, Tianxia; Khan, Artem; Shen, Yihui; Chen, Li; Rabinowitz, Joshua
(2025)
Ethanol and lactate are typical waste products of glucose fermentation. In mammals, glucose is catabolized by glycolysis into circulating lactate, which is broadly used throughout the body as a carbohydrate fuel. Individual cells can both uptake and excrete lactate, uncoupling glycolysis from glucose oxidation. Here we show that similar uncoupling occurs in budding yeast batch cultures of Saccharomyces cerevisiae and Issatchenkia orientalis. Even in fermenting S. cerevisiae that is net releasing ethanol, media 13C-ethanol rapidly enters and is oxidized to acetaldehyde and acetyl-CoA. This is evident in exogenous ethanol being a major source of both cytosolic and mitochondrial acetyl units. 2H-tracing reveals that ethanol is also a major source of both NADH and NADPH high-energy electrons, and this role is augmented under oxidative stress conditions. Thus, uncoupling of glycolysis from the oxidation of glucose-derived carbon via rapidly reversible reactions is a conserved feature of eukaryotic metabolism.
keywords:
Conversion;Metabolomics
published:
2025-10-01
Dai, Tao; Ellebracht, Nathan; Hunter Sellars, Elwin; Aui, Alvina; Hanna, Goldstein; Li, Wenqin; Hellwinckel, Chad; Price, Lydia; Wong, Andrew; Nico, Peter; Basso, Bruno; Robertson, G Philip; Pett-Ridge, Jennifer; Langholtz, Matthew; Baker, Sarah; Pang, Simon; Scown, Corinne
(2025)
Gigatonne-scale atmospheric carbon dioxide removal (CDR), alongside deep emission cuts, is critical to stabilizing the climate. However, some of the most scalable CDR technologies are also the most land intensive. Here, we examine whether adequate land resources exist in the contiguous United States to meet CDR targets when prioritizing grid emissions reduction, food production, and the protection of sensitive ecosystems. We focus on biomass carbon removal and storage (BiCRS) and direct air capture and storage (DACS) and show that suitable lands exceed the expected needs: 37.6 million hectares of land are available for BiCRS, resulting in 0.26 GtCO2 of CDR/year, and 34 million hectares are suitable for wind- and solar-powered DACS, resulting in 4.8 GtCO2 of CDR/year if facilities are co-located with geologic CO2 storage. We identify biomass and energy supply hotspots to meet CDR targets while ensuring land protection and minimizing land competition.
keywords:
carbon; geospatial
published:
2021-04-22
Torvik, Vetle; Smalheiser, Neil
(2021)
Author-ity 2018 dataset
Prepared by Vetle Torvik Apr. 22, 2021
The dataset is based on a snapshot of PubMed taken in December 2018 (NLM's baseline 2018 plus updates throughout 2018). It contains a total of 29.1 million Article records and 114.2 million author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. The resulting clusters are provided in two different formats, the first in a file with only IDs and PMIDs, and the second in a file with cluster summaries:
####################
File 1: au2id2018.tsv
####################
Each line corresponds to an author name instance (PMID and Author name position) with an Author ID. It has the following tab-delimited fields:
1. Author ID
2. PMID
3. Author name position
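The three-field layout above can be read with Python's standard csv module. The sketch below groups author name instances by Author ID; it assumes the file has no header row (the record does not say), and the sample rows are a made-up pairing of the example IDs that appear in the description.

```python
import csv
import io
from collections import defaultdict

def group_instances_by_author(lines):
    """Map each Author ID to its list of (PMID, position) name instances."""
    clusters = defaultdict(list)
    for author_id, pmid, position in csv.reader(lines, delimiter="\t"):
        clusters[author_id].append((int(pmid), int(position)))
    return dict(clusters)

# Made-up rows in the documented format: Author ID, PMID, author name position.
sample = io.StringIO(
    "3797874_1\t3797874\t1\n"
    "3797874_1\t10786286\t3\n"
)
clusters = group_instances_by_author(sample)
```

For the full file, an open file handle on au2id2018.tsv could be passed in place of the in-memory sample; with over a hundred million rows, memory use would need to be considered.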
#########################
File 2: authority2018.tsv
#########################
Each line corresponds to a predicted author-individual represented by a cluster of author name instances and a summary of all the corresponding papers and author name variants. Each cluster has a unique Author ID (the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields:
1. Author ID (or cluster ID) e.g., 3797874_1 represents a cluster where 3797874_1 is the earliest author name instance.
2. cluster size (number of author name instances on papers)
3. name variants separated by '|' with counts in parentheses. Each variant has the format lastname_firstname middleinitial, suffix
4. last name variants separated by '|'
5. first name variants separated by '|'
6. middle initial variants separated by '|' ('-' if none)
7. suffix variants separated by '|' ('-' if none)
8. email addresses separated by '|' ('-' if none)
9. ORCIDs separated by '|' ('-' if none). From 2019 ORCID Public Data File https://orcid.org/ and from PubMed XML
10. range of years (e.g., 1997-2009)
11. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parentheses; separated by '|'; ('-' if none)
12. Top 20 most frequent MeSH (after stoplisting) with counts in parentheses; separated by '|'; ('-' if none)
13. Journal names with counts in parentheses (separated by '|')
14. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parentheses; separated by '|'; ('-' if none)
15. Co-author names (lowercased lastname and first/middle initials) with counts in parentheses; separated by '|'; ('-' if none)
16. Author name instances (PMID_auno separated by '|')
17. Grant IDs (after normalization; '-' if none given; separated by '|')
18. Total number of times cited. (Citations are based on references harvested from open sources such as PMC).
19. h-index
20. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parentheses); separated by '|'
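Several of the fields above share one pattern: items separated by '|', each followed by a count in parentheses, with '-' standing for an empty field. A minimal Python sketch for unpacking such a field follows. The exact item layout, `value(count)` with no space before the parenthesis, is an assumption, and the sample string is invented.

```python
import re

# An item is assumed to look like "some value(12)".
_COUNTED = re.compile(r"^(?P<value>.*)\((?P<count>\d+)\)$")

def parse_counted_field(field):
    """Turn 'a(2)|b(1)' into {'a': 2, 'b': 1}; '-' means no values."""
    if field.strip() == "-":
        return {}
    counts = {}
    for item in field.split("|"):
        item = item.strip()
        match = _COUNTED.match(item)
        if match:
            counts[match.group("value").strip()] = int(match.group("count"))
        else:
            counts[item] = None  # item without a parsable count
    return counts

# Invented name-variant field in the documented style.
variants = parse_counted_field("smith_john a(12)|smith_j(3)")
```

Fields without counts, such as the last name variants or email addresses, can simply be split on '|'.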
keywords:
author name disambiguation; PubMed
published:
2021-05-07
The dataset is based on a snapshot of PubMed taken in December 2018 (NLM's baseline 2018 plus updates throughout 2018), and for ORCIDs, primarily, the 2019 ORCID Public Data File https://orcid.org/.
Matching an ORCID to an individual author name on a PMID is a non-trivial process. Anyone can create an ORCID and claim to have contributed to any published work. Many records claim too many articles and most claim too few. Even though ORCID records are (most?) often populated by author name searches in popular bibliographic databases, there is no confirmation that the person's name is listed on the article. This dataset is the product of mapping ORCIDs to individual author names on PMIDs, even when the ORCID name does not match any author name on the PMID, and when there are multiple (good) candidate author names. The algorithm avoids assigning the ORCID to an article when there are no good candidates and when there are multiple equally good matches. For some ORCIDs that clearly claim too much, it triggers a very strict matching procedure (for ORCIDs that claim too much but the majority appear correct, e.g., 0000-0002-2788-5457), and sometimes deletes ORCIDs altogether when all (or nearly all) of its claimed PMIDs appear incorrect. When an individual clearly has multiple ORCIDs it deletes the least complete of them (e.g., 0000-0002-1651-2428 vs 0000-0001-6258-4628). It should be noted that the ORCIDs that claim too much are not necessarily due to nefarious or trolling intentions, even though a few appear so. Certainly many are due to laziness, such as claiming everything with a particular last name. Some cases appear to be due to test engineers (e.g., 0000-0001-7243-8157; 0000-0002-1595-6203), or librarians assisting faculty (e.g., ; 0000-0003-3289-5681), or group/laboratory IDs (0000-0003-4234-1746), or having contributed to an article in capacities other than authorship such as an Investigator, an Editor, or part of a Collective (e.g., 0000-0003-2125-4256 as part of the FlyBase Consortium on PMID 22127867), or as a "Reply To" in which case the identity of the article and authors might be conflated.
The NLM has, in the past, limited the total number of authors indexed too. The dataset certainly has errors but I have taken great care to fix some glaring ones (individuals who claim too much), while still capturing authors who have published under multiple names and not explicitly listed them in their ORCID profile. The final dataset provides a "matchscore" that could be used for further clean-up.
Four files:
person.tsv: 7,194,692 rows, including header
1. orcid
2. lastname
3. firstname
4. creditname
5. othernames
6. otherids
7. emails
employment.tsv: 2,884,981 rows, including header
1. orcid
2. putcode
3. role
4. start-date
5. end-date
6. id
7. source
8. dept
9. name
10. city
11. region
12. country
13. affiliation
education.tsv: 3,202,253 rows, including header
1. orcid
2. putcode
3. role
4. start-date
5. end-date
6. id
7. source
8. dept
9. name
10. city
11. region
12. country
13. affiliation
pubmed2orcid.tsv: 13,133,065 rows, including header
1. PMID
2. au_order (author name position on the article)
3. orcid
4. matchscore (see below)
5. source: orcid (2019 ORCID Public Data File https://orcid.org/), pubmed (NLM's distributed XML files), or patci (an earlier version of ORCID with citations processed through the Patci tool)
12,037,375 from orcid; 1,065,892 from PubMed XML; 29,797 from Patci
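The per-source counts can be checked against the stated row total for pubmed2orcid.tsv with a few lines of Python. This sketch reads the PubMed XML count as 1,065,892 and assumes the header accounts for exactly one row.

```python
# Row total stated for pubmed2orcid.tsv, header included.
rows_including_header = 13_133_065

# Per-source counts as listed for the "source" field.
by_source = {"orcid": 12_037_375, "pubmed": 1_065_892, "patci": 29_797}

data_rows = sum(by_source.values())
header_rows = rows_including_header - data_rows
```

Under those assumptions the three sources sum to 13,133,064 data rows, which leaves exactly one row for the header.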
matchscore:
000: lastname, firstname and middle init match (e.g., Eric T MacKenzie vs
00: lastname, firstname match (e.g., Keith Ward)
0: lastname, firstname reversed match (e.g., Conde Santiago vs Santiago Conde)
1: lastname, first and middle init match (e.g., L. F. Panchenko)
11: lastname and partial firstname match (e.g., Mike Boland vs Michael Boland or Mel Ziman vs Melanie Ziman)
12: lastname and first init match
15: 3 part lastname and firstname match (David Grahame Hardie vs D Grahame Hardie)
2: lastname match and multipart firstname initial match (e.g., Maria Dolores Suarez Ortega vs M. D. Suarez)
22: partial lastname match and firstname match (e.g., Erika Friedmann vs Erika Friedman)
23: e.g., Antonio Garcia Garcia vs A G Garcia
25: Allan Downie vs J A Downie
26: Oliver Racz vs Oliver Bacz
27: Rita Ostrovskaya vs R U Ostrovskaia
29: Andrew Staehelin vs L A Staehlin
3: M Tronko vs N D Tron'ko
4: Sharon Dent (Also known as Sharon Y.R. Dent; Sharon Y Roth; Sharon Yoder) vs Sharon Yoder
45: Okulov Aleksei vs A B Okulov
48: Maria Del Rosario Garcia De Vicuna Pinedo vs R Garcia-Vicuna
49: Anatoliy Ivashchenko vs A Ivashenko
5: lastname match only (weak match but sometimes captures alternative first name for better subsequent matches); e.g., Bill Hieb vs W F Hieb
6: first name match only (weak match but sometimes captures alternative first name for better subsequent matches); e.g., Maria Borawska vs Maria Koscielak
7: last or first name match on "other names"; e.g., Hromokovska Tetiana (Also known as Gromokovskaia, T. S., Громоковська Тетяна) vs T Gromokovskaia
77: Siva Subramanian vs Kolinjavadi N. Sivasubramanian
88: no name in orcid but match caught by uniqueness of name across paper (at least 90% and 2 more than next most common name)
prefix:
C = ambiguity reduced (possibly eliminated) using city match (e.g., H Yang on PMID 24972200)
I = ambiguity eliminated by excluding investigators (i.e., one author and one or more investigators with that name)
T = ambiguity eliminated using PubMed pos (T for tie-breaker)
W = ambiguity resolved by authority2018
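When filtering on matchscore, the numeric code is best kept as text, because "000", "00", and "0" denote different match types. The Python sketch below separates an optional letter prefix from the code. It assumes the prefix letters are written directly in front of the digits (for example "C12"), which the description implies but does not spell out.

```python
import re

# Prefix meanings as listed above, abbreviated.
PREFIXES = {
    "C": "ambiguity reduced using city match",
    "I": "ambiguity eliminated by excluding investigators",
    "T": "ambiguity eliminated using PubMed position (tie-breaker)",
    "W": "ambiguity resolved by authority2018",
}

def split_matchscore(raw):
    """Split e.g. 'C12' into ('C', '12'); the code stays a string."""
    match = re.fullmatch(r"([CITW]*)(\d+)", raw.strip())
    if match is None:
        raise ValueError(f"unrecognized matchscore: {raw!r}")
    return match.group(1), match.group(2)

prefix, code = split_matchscore("C12")   # hypothetical value
plain = split_matchscore("000")          # no prefix, code kept as "000"
```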