Illinois Data Bank Dataset Search Results
published:
2025-08-21
Viral vectors provide an increasingly versatile platform for transformation-free reagent delivery to plants. RNA viral vectors can be used to induce gene silencing, overexpress proteins, or introduce gene editing reagents; however, they are often constrained by carrying capacity or restricted tropism in germline cells. Site-specific recombinases that catalyze precise genetic rearrangements are powerful tools for genome engineering that vary in size and, potentially, efficacy in plants. In this work, we show that viral vectors based on tobacco rattle virus (TRV) deliver and stably express four recombinases ranging in size from ∼0.6 to ∼1.5 kb and achieve simultaneous marker removal and reporter activation through targeted excision in transgenic Nicotiana benthamiana lines. TRV vectors with Cre, FLP, CinH, and Integrase13 efficiently mediated recombination in infected somatic tissue and led to heritable modifications at high frequency. An excision-activated Ruby reporter enabled simple and high-resolution tracing of infected cell lineages without the need for molecular genotyping. Together, our experiments broaden the scope of viral recombinase delivery and offer insights into infection dynamics that may be useful in developing future viral vectors.
keywords:
gene editing; genome engineering; plant transformation
published:
2021-05-13
Chen, Bowen; Gramig, Benjamin; Yun, Seong
(2021)
Data files and R code to replicate the econometric analysis in the journal article: B Chen, BM Gramig and SD Yun. “Conservation Tillage Mitigates Drought Induced Soybean Yield Losses in the US Corn Belt.” Q Open. https://doi.org/10.1093/qopen/qoab007
keywords:
R, Conservation Tillage, Drought, Yield, Corn, Soybeans, Resilience, Climate Change
published:
2024-03-06
OKeefe, Joy; Bennett, Andrew
(2024)
These data are the result of analyses of the metagenome of North American bats, including 18S and 16S barcode genes designed to target microorganisms of the gut. These files are Phyloseq import files created by the DADA2 program. Each barcode gene is uploaded separately as the four files required to build a phyloseq object. For each barcode gene, the files include amplicon sequence variant (ASV) sequences, sequence tables (seqtab) which connect individual samples to the ASVs, tax tables (taxtab) which identify the taxa present as determined by a Bayesian RDP classifier, and rooted phylogenetic trees for the ASVs. Additionally, we have included a "sample_data" file, which is necessary for sorting samples across all four sequence analysis data sets by study and species. Some sample information which could identify the location of endangered species has been restricted. Multiple studies are represented in the data, which can be accessed using standard methods in the Phyloseq program (e.g., for a study of bats, parasites, and gut microbiome dysregulation by Bennett, Suski, and OKeefe 2024 [in prep March 2024], study-specific data can be accessed using the Study variable "DYSBIOMICS"). File names include reference to the primer set used to generate them (18S primer sets: G3, G4, G6; 16S primer set: 341F3_806R5).
keywords:
metagenomics
published:
2025-07-23
Dalling, James William
(2025)
Supplementary data and code associated with the Biogeosciences paper published by Cecilia Prada et al. "Soil and Biomass Carbon Storage is Much Higher in Central American than Andean Montane Forests". There are 16 files associated with this paper:
(1) AGB.csv providing the site, plot, treeID, mnemn, family, agb, and AGcarbon for each tree in the dataset. Column headings are described in the file AGB_metadata.csv
(2) AGB_metadata.csv Metadata (column descriptions) for AGB.csv
(3) CWD_D.csv Complete information on the downed coarse woody debris (CWD) measured in each plot
(4) CWD_D_metadata.csv Metadata (column descriptions) for CWD_D.csv
(5) CWD_S.csv Complete information on the standing coarse woody debris measured in each plot
(6) CWD_S_metadata.csv Metadata (column descriptions) for CWD_S.csv
(7) SoilC.csv Estimated soil carbon storage (Mg C) at each sampling location in each plot
(8) SoilC_metadata.csv Metadata (column descriptions) for SoilC.csv
(9) Table.csv Data source, soil carbon value (Mg C) and elevation from published data sources
(10) Table_metadata.csv Metadata (column descriptions) for Table.csv
(11) TableS1.csv Data source, above ground carbon value (Mg C) and elevation from published data sources
(12) TableS1_metadata.csv Metadata (column descriptions) for TableS1.csv
(13) RScript.R Annotated code for data analysis and figures
(14) Full_dataset.csv Full set of environmental data and carbon data by plot
(15) Full_dataset_metadata.csv Metadata (column descriptions) for Full_dataset.csv
(16) Species list and species codes.csv Full family, genus and species names for the species codes (column mnemn in AGB.csv)
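As a rough illustration of how the per-tree records in AGB.csv might be summarized, the sketch below sums the AGcarbon column for each site and plot. The column names come from the file description above; the sample rows are made up, and the function name `carbon_by_plot` is a hypothetical helper, not part of the dataset's RScript.R.

```python
import csv
import io
from collections import defaultdict


def carbon_by_plot(rows):
    """Sum the AGcarbon column for each (site, plot) pair."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["site"], row["plot"])] += float(row["AGcarbon"])
    return dict(totals)


# Made-up rows standing in for AGB.csv; real values and units are
# documented in AGB_metadata.csv.
sample = io.StringIO(
    "site,plot,treeID,mnemn,family,agb,AGcarbon\n"
    "A,1,t1,sp1,Fam1,10.0,4.5\n"
    "A,1,t2,sp2,Fam2,20.0,9.0\n"
    "B,1,t3,sp1,Fam1,5.0,2.5\n"
)
totals = carbon_by_plot(csv.DictReader(sample))
```

To run this on the real file, replace the in-memory sample with `open("AGB.csv", newline="")`.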
keywords:
tropical forest; carbon storage
published:
2020-05-13
Althaus, Scott; Bajjalieh, Joseph; Jungblut, Marc; Shalmon, Dan; Ghosh, Subhankar; Joshi, Pradnyesh
(2020)
Terrorism is among the most pressing challenges to democratic governance around the world. The Responsible Terrorism Coverage (or ResTeCo) project aims to address a fundamental dilemma facing 21st century societies: how to give citizens the information they need without giving terrorists the kind of attention they want. ResTeCo hopes to inform best practices by using extreme-scale text analytic methods to extract information from more than 70 years of terrorism-related media coverage from around the world and across 5 languages. Our goal is to expand the available data on media responses to terrorism and enable the development of empirically-validated models for socially responsible, effective news organizations.
This particular dataset contains information extracted from terrorism-related stories in the New York Times published between 1945 and 2018. It includes variables that measure the relative share of terrorism-related topics, the valence and intensity of emotional language, as well as the people, places, and organizations mentioned.
This dataset contains 3 files:
1. "ResTeCo Project NYT Dataset Variable Descriptions.pdf"
- A detailed codebook containing a summary of the Responsible Terrorism Coverage (ResTeCo) Project New York Times (NYT) Dataset and descriptions of all variables.
2. "resteco-nyt.csv"
- This file contains the data extracted from terrorism-related media coverage in the New York Times between 1945 and 2018. It includes variables that measure the relative share of topics, sentiment, and emotion present in this coverage. There are also variables that contain metadata and list the people, places, and organizations mentioned in these articles. There are 53 variables and 438,373 observations. The variable "id" uniquely identifies each observation. Each observation represents a single news article.
- Please note that care should be taken when using "resteco-nyt.csv". The file may not be suitable for a spreadsheet program such as Excel because some of the values are very large. Excel cannot handle these values, which may cause the data to appear corrupted within the software. Users are encouraged to use a statistical package such as Stata, R, or Python so that the structure and quality of the data are preserved.
3. "README.md"
- This file contains useful information for the user about the dataset. It is a text file written in Markdown.
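The caution about very large values can be illustrated with a minimal Python sketch. It reads a CSV with the standard library, keeps each field as text, and converts to Python's arbitrary-precision `int` only when needed. The "id" column is named in the description above; the `big_value` column and the sample row are hypothetical. The comparison with `float` mimics the double-precision rounding that spreadsheet programs typically apply.

```python
import csv
import io


def read_rows(handle):
    """Read CSV rows as dictionaries of strings; nothing is coerced to float."""
    return list(csv.DictReader(handle))


# Hypothetical stand-in for resteco-nyt.csv with one very large value.
sample = io.StringIO("id,big_value\n1,12345678901234567890\n")
rows = read_rows(sample)

exact = int(rows[0]["big_value"])         # arbitrary-precision integer
lossy = int(float(rows[0]["big_value"]))  # rounded to a 64-bit float first
print(exact == lossy)                     # the float round trip changes the value
```

Reading fields as text first and converting deliberately is one way to keep identifiers and large counts intact.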
Citation Guidelines
1) To cite this codebook please use the following citation:
Althaus, Scott, Joseph Bajjalieh, Marc Jungblut, Dan Shalmon, Subhankar Ghosh, and Pradnyesh Joshi. 2020. Responsible Terrorism Coverage (ResTeCo) Project New York Times (NYT) Dataset Variable Descriptions. Responsible Terrorism Coverage (ResTeCo) Project New York Times Dataset. Cline Center for Advanced Social Research. May 13. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-4638196_V1
2) To cite the data please use the following citation:
Althaus, Scott, Joseph Bajjalieh, Marc Jungblut, Dan Shalmon, Subhankar Ghosh, and Pradnyesh Joshi. 2020. Responsible Terrorism Coverage (ResTeCo) Project New York Times Dataset. Cline Center for Advanced Social Research. May 13. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-4638196_V1
keywords:
Terrorism, Text Analytics, News Coverage, Topic Modeling, Sentiment Analysis
published:
2025-11-24
Li, Maolin; Harrison, Wesley; Zhang, Zhengyi; Yuan, Yujie; Zhao, Huimin
(2025)
Strategies for achieving asymmetric catalysis with azaarenes have traditionally fallen short of accomplishing remote stereocontrol, which would greatly enhance accessibility to distinct azaarenes with remote chiral centres. The primary obstacle to achieving superior enantioselectivity for remote stereocontrol has been the inherent rigidity of the azaarene ring structure. Here we introduce an ene-reductase system capable of modulating the enantioselectivity of remote carbon-centred radicals on azaarenes through a mechanism of chiral hydrogen atom transfer. This photoenzymatic process effectively directs prochiral radical centres located more than six chemical bonds, or over 6 Å, from the nitrogen atom in azaarenes, thereby enabling the production of a broad array of azaarenes possessing a remote γ-stereocentre. Results from our integrated computational and experimental investigations underscore that the hydrogen bonding and steric effects of key amino acid residues are important for achieving such high stereoselectivities.
keywords:
Conversion; Catalysis
published:
2017-05-31
Merrill, Loren; Naylor, Madeleine; Dalimonte, Merria; McLaughlin, Shaun; Stewart, Tara; Grindstaff, Jennifer
(2017)
Dataset includes maternal antigen treatment and early-life antigen treatment for male zebra finches. Also includes data on beak coloration, measures of song complexity for each male, and female responses to treated males.
Male beak color and song metadata:
* MATID= Maternal Identity
* MATTRT=Maternal antigen treatment prior to egg laying (KLH=keyhole limpet hemocyanin, LPS= lipopolysaccharide, PBS=phosphate buffered saline)
* YGTRT= Young antigen treatment post-hatch (KLH=keyhole limpet hemocyanin, LPS= lipopolysaccharide, PBS=phosphate buffered saline)
* NESTBANDNUM= Nestling band number
* Haptoglobin=haptoglobin levels at day 28 (mg/ml)
* Mean TE= Mean number of total elements in that male's song
* TE (z)= Z-transformed total elements
* Mean UE=Mean number of unique elements in the song
* UE (z)= z-transformed unique elements
* mean phrases= Mean number of song phrases
* Phrases (z)= z-transformed song phrases
* Mean D= Mean song duration in seconds
* D (z)=z-transformed song duration
* B2 standard=beak brightness standardized so that lower values reflect less bright beaks
* B2 (z)=z-transformed brightness
* S1R standard= beak saturation at high wavelengths standardized so that lower values reflect less red beaks
* S1R (z)=z-transformed S1R
* S1U standard= beak saturation at low wavelengths standardized so that lower values reflect less red beaks
* S1U (z)=z-transformed S1U
* H4B standard= beak hue standardized so that lower values reflect less red beaks
* H4B (z)=z-transformed H4B
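The "(z)" columns above are z-transformed versions of the corresponding raw measures. A minimal sketch of a z-transform is shown below; it assumes the sample standard deviation (n - 1 denominator), which the dataset description does not specify, and the input values are illustrative only.

```python
from statistics import mean, stdev


def z_transform(values):
    """Return (value - mean) / standard deviation for each value."""
    m = mean(values)
    s = stdev(values)  # assumption: sample SD; the source does not say which SD was used
    return [(v - m) / s for v in values]


# Illustrative values, not taken from the dataset.
z = z_transform([2.0, 4.0, 6.0])
```

After the transform, the values have mean 0 and unit standard deviation, which makes measures such as song duration and beak brightness comparable on one scale.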
Female choice metadata:
* Control Bird=PBS denotes that all control males received phosphate buffered saline
* Treatment Bird= Treatment the male received (keyhole limpet hemocyanin (KLH) or lipopolysaccharide (LPS))
* Beak Wipes Control=# of beak wipes the female performed when on the control male side
* Beak Wipes Treatment=# of beak wipes the female performed when on the "treatment male" side
* Hops Control=# of hops female performed when on the control male side
* Hops Treatment=# of hops female performed when on the treatment male side
* Time Spent Near Control=amount of time (sec) female spent on the control male side
* Time Spent Near Treatment=amount of time (sec) the female spent on the treatment male side
keywords:
early-life; stress; immune response; phenotypic correlation; sexual signal; zebra finch; birdsongs; acoustic signals; beak coloration; mate selection
published:
2020-08-19
Jetti, Yaswanth Sai; Dunn, Alison C.
(2020)
This data set is a matrix of values. The element in the row "i" and the column "j" denotes the influence of hexagonal pyramidal distribution at node "i" on the node "j". The size of the matrix is 16641x16641. This matrix corresponds to a 129x129 grid. Influence coefficient matrix on a smaller grid can be obtained by appropriately choosing the elements from the bigger matrix.
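One way to read "appropriately choosing the elements" is sketched below: pick the flat indices of a smaller block of nodes and keep only those rows and columns. The sketch assumes 0-based, row-major node numbering on the 129x129 grid and takes the top-left block of nodes; neither the numbering convention nor the choice of block is stated in the description, so both are assumptions.

```python
def subgrid_indices(full_side, small_side):
    """Flat indices of the top-left small_side x small_side block of nodes,
    assuming row-major numbering on a full_side x full_side grid."""
    if small_side > full_side:
        raise ValueError("the smaller grid cannot exceed the full grid")
    return [r * full_side + c
            for r in range(small_side)
            for c in range(small_side)]


def submatrix(matrix, idx):
    """Keep only the rows and columns listed in idx."""
    return [[matrix[i][j] for j in idx] for i in idx]


# Indices for a 65x65 grid inside the 129x129 grid (4225 of 16641 nodes).
idx = subgrid_indices(129, 65)

# Toy check on a 3x3 grid (9x9 matrix) reduced to a 2x2 grid (4x4 matrix).
toy = [[10 * i + j for j in range(9)] for i in range(9)]
small = submatrix(toy, subgrid_indices(3, 2))
```

With the full matrix loaded as a NumPy array `K`, the same selection could be written as `K[np.ix_(idx, idx)]`.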
keywords:
Influence coefficients
published:
2025-05-28
This dataset captures ‘Hype’ and ‘Diversity’, including article-level (pmid) and author-level (auid) data within biomedical abstracts sourced from PubMed. The selection comprises ‘journal articles’ written in English and published between 1991 and 2014, totaling 421,580 (merged_df).
The classification of hype relies on the presence of specific candidate ‘hype words’ and their abstract location. Therefore, each article (PMID) might have multiple instances in the dataset due to the presence of multiple hype words in different abstract sentences. Diversity is classified for ethnicity, gender, academic age, and topical expertise for authors based on the Rao-Stirling diversity index.
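For orientation, the Rao-Stirling index is commonly written as the sum of p_i * p_j * d_ij over pairs of distinct categories, where p_i is the share of category i and d_ij is a distance between categories. The sketch below uses the ordered-pair form (some authors sum over unordered pairs, which halves the value). The proportions and distances are illustrative; how the dataset authors defined categories and distances is not stated here.

```python
def rao_stirling(proportions, distance):
    """Sum of p_i * p_j * d_ij over all ordered pairs with i != j."""
    n = len(proportions)
    return sum(
        proportions[i] * proportions[j] * distance[i][j]
        for i in range(n)
        for j in range(n)
        if i != j
    )


# Two equally represented categories at distance 1 (illustrative values).
index = rao_stirling([0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])
```

A team drawn from a single category scores 0, and the score grows as categories become more balanced and more distant from one another.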
File1: merged_auids.csv (Important columns defined)
• AUID: a unique ID for each author
• Genni: gender prediction
• Ethnea: ethnicity prediction
File2: merged_df.csv (Important columns defined)
- pmid: unique paper
- auid: all unique auids (author-name unique identification)
- year: Year of paper publication
- no_authors: Author count
- journal: Journal name
- years: first year of publication for every author
- Country-temporal: Country of affiliation for every author
- h_index: Journal h-index
- TimeNovelty: Paper Time novelty
- nih_funded: Binary variable indicating funding for any author
- prior_cites_mean: Mean of all authors’ prior citation rate
- insti_impact: All unique institutions’ citation rate
- mesh_vals: Top MeSH values for every author of that paper
- hype_word: Candidate hype word, such as ‘novel’
- hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location
- hype_percentile: Abstract relative position of hype word
- relative_citation_ratio: RCR
keywords:
Hype; Diversity; PubMed; Abstracts; Scientometrics; Biomedicine
published:
2020-10-01
Acevedo-Siaca, Liana; Long, Stephen
(2020)
Raw gas exchange data for photosynthetic induction in 6 rice accession flag leaves. Photosynthetic induction and point measurements were made at ambient [CO2]. Two accessions (AUS 278 and IR64) were selected to screen in greater detail in which photosynthetic induction was measured at six [CO2].
published:
2021-05-01
Cheng, Ti-Chung; Li, Tiffany Wenting; Karahalios, Karrie; Sundaram, Hari
(2021)
This is the first version of the dataset.
This dataset contains anonymized data collected during the experiments described in the publication "“I can show what I really like.”: Eliciting Preferences via Quadratic Voting", scheduled to appear in April 2021.
Once the publication link is public, we will provide an update here.
These data were collected through our open-source online systems, which are available at https://github.com/a2975667/QV-app (experiment 1) and https://github.com/a2975667/QV-buyback (experiment 2).
There are two folders in this dataset. The first folder (exp1_data) contains data collected during experiment 1; the second folder (exp2_data) contains data collected during experiment 2.
keywords:
Quadratic Voting; Likert scale; Empirical studies; Collective decision-making
published:
2025-10-01
Schetter, August; Lin, Cheng-Hsien; Zumpf, Colleen; Jang, Chunhwa; Hoffmann Jr., Leo; Rooney, William; Lee, DoKyoung
(2025)
Recently introduced photoperiod-sensitive (PS) biomass sorghum (Sorghum bicolor L. Moench) needs to be investigated for yield potential under different cultivation environments with reasonable nitrogen (N) inputs. The objectives of this study were to (1) evaluate the biomass yield and feedstock quality of four sorghum hybrids with different levels of PS, ranging from very PS (VPS) to moderate PS (MPS) hybrids, and (2) determine the optimal N inputs (0 to 168 kg N ha−1) under four environments: combinations of temperate (Urbana, IL) and subtropical (College Station, TX) regions during 2018 and 2019. Compared to TX, the PS sorghums in central IL showed higher yield potential and steady feedstock production with an extended day length and less precipitation variability, especially for the VPS hybrids. The mean dry matter (DM) yields of VPS hybrids were 20.5 Mg DM ha−1 and 17.7 Mg DM ha−1 in IL and TX, respectively. The highest N use efficiency occurred at a low N rate of 56 kg N ha−1, with a yield gain of approximately 33 kg DM ha−1 per 1.0 kg N ha−1 input. Approximately 70% of the PS sorghum biomass can be utilized for biofuel production, consisting of 58-65% cell-wall components and 4-11% soluble sugar. This study demonstrated that rainfed temperate areas (e.g., IL) have great potential for the sustainable cultivation of PS energy sorghum due to its observed high yield potential, stable production, and low N requirements.
keywords:
Sustainability; Biomass Analytics; Field Data
published:
2019-09-17
Mishra, Shubhanshu
(2019)
Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets.
Classification tasks include sentiment prediction, abusive content, sarcasm, and veridicality.
Sequence tagging tasks include POS, NER, Chunking, and SuperSenseTagging.
Models were trained using: https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification_tagging.py
See https://github.com/socialmediaie/SocialMediaIE and https://socialmediaie.github.io for details.
If you are using this data, please also cite the related article:
Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
keywords:
twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning; classification; sequence tagging
published:
2020-07-15
Legried, Brandon; Molloy, Erin K.; Warnow, Tandy; Roch, Sebastien
(2020)
This repository includes scripts and datasets for the paper, "Polynomial-Time Statistical Estimation of Species Trees under Gene Duplication and Loss."
keywords:
Species tree estimation; gene duplication and loss; identifiability; statistical consistency; quartets; ASTRAL
published:
2023-07-01
Tonks, Adam; Hwang, Jeongwoo
(2023)
This is the data used in the paper "Assessment of spatiotemporal flood risk due to compound precipitation extremes across the contiguous United States".
Code from the GitHub repository https://github.com/adtonks/precip_extremes can be used with the data here to reproduce the paper's results. v1.0.0 of the code is also archived at https://doi.org/10.5281/zenodo.8104252
This dataset is derived from NOAA-CIRES-DOE 20th Century Reanalysis V3. The NOAA-CIRES-DOE Twentieth Century Reanalysis Project version 3 used resources of the National Energy Research Scientific Computing Center managed by Lawrence Berkeley National Laboratory which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 and used resources of NOAA's Remotely Deployed High Performance Computing Systems.
keywords:
spatiotemporal; CONUS; United States; precipitation; extremes; flooding
published:
2025-04-17
Mollenhauer, Michael; Pfaff, Wolfgang
(2025)
This dataset includes the code used to analyze data from swapping photons between superconducting qubits in separate modules through a superconducting coaxial cable bus. The dataset includes Python code to model and plot the data, CAD designs of the modules that hold the superconducting qubits, and high-frequency simulation software files to model the electric fields of the superconducting circuits.
keywords:
superconducting qubits; quantum information; modular architecture
published:
2025-05-27
Rani, Sonia; Cao, Xi; Baptista, Alejandro E.; Hoffmann, Axel; Pfaff, Wolfgang
(2025)
This dataset contains all raw and processed data used to generate the figures in the main text and supplementary material of the paper "High dynamic-range quantum sensing of magnons and their dynamics using a superconducting qubit." The data can be used to reproduce the plots and validate the analysis. Accompanying Jupyter notebooks provide step-by-step analysis pipelines for figure generation. The dataset also includes drawings for the mechanical samples used to perform the experiment. In addition, the dataset provides ANSYS HFSS electromagnetic simulation files used to design and analyze the resonator structures and estimate field distributions.
keywords:
superconducting qubit; magnon sensing; hybrid quantum systems; spin-photon coupling; magnon decay; cavity QED
published:
2019-10-15
Choi, Sang Hyun; Rao, Vikyath; Gernat, Tim; Hamilton, Adam; Robinson, Gene; Goldenfeld, Nigel
(2019)
Filtered trophallaxis interactions for two honeybee colonies, each containing 800 worker bees and one queen. Each colony consists of bees that were administered a juvenile hormone analog, a vehicle treatment, or a sham treatment to determine the effect of colony perturbation on the duration of trophallaxis interactions. Columns one and two display the unique identifiers for each bee involved in a particular trophallaxis exchange, and columns three and four display the Unix timestamps (in milliseconds) of the beginning and end of the interaction, respectively.
Note: the queen interactions were omitted from the uploaded dataset for reasons that are described in the submitted manuscript. Those bees that performed poorly are also omitted from the final dataset.
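Given the four-column layout described above, interaction durations follow directly from the two timestamps. The sketch below assumes a comma-separated file with no header row, which the description does not confirm; the sample rows and the helper name `interaction_durations` are made up for illustration.

```python
import csv
import io


def interaction_durations(handle):
    """Yield (bee_a, bee_b, duration_in_seconds) for each interaction row."""
    out = []
    for bee_a, bee_b, start_ms, end_ms in csv.reader(handle):
        # Timestamps are Unix times in milliseconds; convert the difference to seconds.
        out.append((bee_a, bee_b, (int(end_ms) - int(start_ms)) / 1000.0))
    return out


# Made-up rows: bee id, bee id, start (ms), end (ms).
sample = io.StringIO(
    "12,47,1500000000000,1500000004500\n"
    "47,301,1500000010000,1500000012000\n"
)
durations = interaction_durations(sample)
```

If the real files use a different delimiter or include a header, the `csv.reader` call would need to be adjusted accordingly.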
keywords:
honey bee; trophallaxis; social network
published:
2020-03-14
Rhoads, Bruce; Lindroth, Evan
(2020)
Data on bank elevations determined from lidar data for the Upper Sangamon River, Illinois; the Mission River, Texas; and the White River, Indiana.
keywords:
bank elevations, rivers, meandering, lowland
published:
2020-09-25
This repository contains the datasets and corresponding results for the paper "MAGUS: Multiple Sequence Alignment using Graph Clustering".
The Datasets.zip archive contains the ROSE, BAliBASE, Gutell, and RNASim datasets used in our experiments.
The Results.zip archive contains the outputs of running our methods against these datasets.
Datasets used:
ROSE: 10 simulated nucleotide model conditions from the SATe paper, each with 20 replicates, and with 1000 sequences per replicate.
The ROSE datasets were originally taken from https://sites.google.com/eng.ucsd.edu/datasets/alignment/sate-i
RNASim: This is a collection of simulated nucleotide datasets that were generated under a model of evolution that reflects selection due to RNA structural constraints. We sampled 20 subsets of 1000 sequences each, as well as 10 subsets of 10000 each, by randomly sampling from the original million-sequence RNASim dataset.
Gutell: 16S.M, 16S.3, 16S.T, 16S.B.ALL: Four biological nucleotide datasets from the Comparative Ribosomal Website (CRW) with cleaned reference alignments from SATe. Since PASTA is restricted to datasets without sequence length heterogeneity, these were modified to remove sequences that deviate by more than 20% from the median length. The scrubbed datasets range from 740 to 24,246 sequences. The pre-screened 16S datasets were taken from https://sites.google.com/eng.ucsd.edu/datasets/alignment/16s23s
BAliBASE: We use eight BAliBASE amino acid datasets used in the PASTA paper. As above, we remove outlier sequences, which leaves us with sizes ranging from 195 to 732 sequences. The pre-screened BAliBASE datasets were taken from https://sites.google.com/eng.ucsd.edu/datasets/alignment/pastaupp
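The length-based screening described for the Gutell and BAliBASE datasets can be sketched as follows. This is an illustrative reading of "deviate by more than 20% from the median length", not the scripts used to prepare the pre-screened datasets; it assumes unaligned sequences held in a dictionary, and the function name and toy sequences are hypothetical.

```python
from statistics import median


def filter_by_median_length(seqs, tolerance=0.20):
    """Keep sequences whose length is within tolerance * median of the median length."""
    med = median(len(s) for s in seqs.values())
    return {
        name: s
        for name, s in seqs.items()
        if abs(len(s) - med) <= tolerance * med
    }


# Toy sequences with lengths 100, 100, 105, 60, and 130; the median is 100.
toy = {
    "a": "A" * 100,
    "b": "C" * 100,
    "c": "G" * 105,
    "short": "T" * 60,
    "long": "A" * 130,
}
kept = filter_by_median_length(toy)
```

Here "short" and "long" fall outside the 80 to 120 range and are dropped, while the other three sequences are retained.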
published:
2024-04-05
Sinaiko, Guy; Cao, Yanghui; Dietrich, Christopher H.
(2024)
The following files include specimen information, DNA sequence data, and additional information on the analyses used to reconstruct the phylogeny of the leafhopper genus Neoaliturus as described in the Methods section of the original paper:
1. Taxon_sampling.csv: contains data on the individual specimens from which DNA was extracted, including sample code, taxon name, collection data (locality, date and name of collector) and museum unique identifier.
2. Alignments.zip: a ZIP archive containing 432 separate FASTA files representing the aligned nucleotide sequences of individual gene loci used in the analysis.
3. Concatenated_Matrix.fa: a FASTA file containing the concatenated individual gene alignments used for the maximum likelihood analysis in IQ-TREE.
4. Genes_and_Loci.rtf: identifies the individual genes and loci used in the analysis. The partition name is the same as the name of the individual alignment file in the zipped Alignments folder.
5. Partitions_best_scheme.nex: a text file in the standard NEXUS format that indicates the names of the individual data partitions and their locations in the concatenated matrix, and also indicates the substitution model for each partition.
6. (New in this version 2) Scripts & Description.zip: includes 8 custom shell or Perl scripts used to assemble the DNA sequence data by performing reciprocal BLAST searches between the reference sequences and assemblies for each sample, extracting the best sequences based on the BLAST searches, screening the hits for each locus to keep only the best result, and generating the nucleotide sequence dataset for the predicted orthologues (see the file description.txt for details).
7. (New in this version 2) Full_genetic_distances_matrix.csv: shows the genetic distances between pairs of samples in the dataset (proportion of nucleotides that differ between samples).
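The genetic distance described for the distance matrix (proportion of nucleotides that differ between samples) can be illustrated with a small sketch. It assumes two aligned sequences of equal length and skips positions where either sequence has a gap or an ambiguous base; how the original analysis treated such positions is not stated, so that handling is an assumption.

```python
def p_distance(seq_a, seq_b, skip="-N?"):
    """Proportion of compared aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip or b in skip:
            continue  # assumption: gaps and ambiguous bases are not compared
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


# One difference across eight compared positions.
d = p_distance("ACGTACGT", "ACGTACGA")
```

Applying this to every pair of rows in the concatenated alignment would yield a symmetric matrix of pairwise distances.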
keywords:
leafhopper; phylogeny; anchored-hybrid-enrichment; DNA sequence; insect
published:
2025-03-14
Mishra, Apratim; Diesner, Jana; Torvik, Vetle I.
(2025)
Hype - PubMed dataset
Prepared by Apratim Mishra
This dataset captures ‘Hype’ within biomedical abstracts sourced from PubMed. The selection chosen is ‘journal articles’ written in English, published between 1975 and 2019, totaling ~5.2 million. The classification relies on the presence of specific candidate ‘hype words’ and their abstract location. Therefore, each article (PMID) might have multiple instances in the dataset due to the presence of multiple hype words in different abstract sentences.
The candidate hype words are 35 in count: 'major', 'novel', 'central', 'critical', 'essential', 'strongly', 'unique', 'promising', 'markedly', 'excellent', 'crucial', 'robust', 'importantly', 'prominent', 'dramatically', 'favorable', 'vital', 'surprisingly', 'remarkably', 'remarkable', 'definitive', 'pivotal', 'innovative', 'supportive', 'encouraging', 'unprecedented', 'enormous', 'exceptional', 'outstanding', 'noteworthy', 'creative', 'assuring', 'reassuring', 'spectacular', and 'hopeful'.
This is version 3 of the dataset. Added new file - WSD_hype.tsv
File 1: hype_dataset_final.tsv
Primary dataset. It has the following columns:
1. PMID: represents unique article ID in PubMed
2. Year: Year of publication
3. Hype_word: Candidate hype word, such as ‘novel.’
4. Sentence: Sentence in abstract containing the hype word.
5. Hype_percentile: Abstract relative position of hype word.
6. Hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location.
7. Introduction: The ‘I’ component of the hype word based on IMRaD
8. Methods: The ‘M’ component of the hype word based on IMRaD
9. Results: The ‘R’ component of the hype word based on IMRaD
10. Discussion: The ‘D’ component of the hype word based on IMRaD
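Because one article can appear on several rows (one per hype word and sentence), it is often convenient to group rows by PMID. The sketch below reads tab-separated text with the column names listed above; it assumes the file has a header row, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def hype_rows_by_pmid(handle):
    """Group rows of a tab-separated hype table by their PMID column."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        grouped[row["PMID"]].append(row)
    return dict(grouped)


# Invented rows standing in for hype_dataset_final.tsv (subset of columns).
sample = io.StringIO(
    "PMID\tYear\tHype_word\tSentence\tHype_percentile\tHype_value\n"
    "111\t2001\tnovel\tWe report a novel finding.\t0.10\t0.7\n"
    "111\t2001\tcrucial\tThis step is crucial.\t0.90\t0.6\n"
    "222\t2010\trobust\tThe effect was robust.\t0.50\t0.4\n"
)
grouped = hype_rows_by_pmid(sample)
words_for_111 = [row["Hype_word"] for row in grouped["111"]]
```

Counting the keys of `grouped` gives the number of distinct articles, while the total row count reflects hype-word instances.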
File 2: hype_removed_phrases_final.tsv
Secondary dataset with same columns as File 1.
Hype in the primary dataset is based on excluding certain phrases that are rarely hype. The phrases that were removed are included in File 2 and modeled separately. Removed phrases:
1. Major: histocompatibility, component, protein, metabolite, complex, surgery
2. Novel: assay, mutation, antagonist, inhibitor, algorithm, technique, series, method, hybrid
3. Central: catheters, system, design, composite, catheter, pressure, thickness, compartment
4. Critical: compartment, micelle, temperature, incident, solution, ischemia, concentration, thinking, nurses, skills, analysis, review, appraisal, evaluation, values
5. Essential: medium, features, properties, opportunities, oil
6. Unique: model, amino
7. Robust: regression
8. Vital: capacity, signs, organs, status, structures, staining, rates, cells, information
9. Outstanding: questions, issues, question, questions, challenge, problems, problem, remains
10. Remarkable: properties
11. Definite: radiotherapy, surgery
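A rough sketch of how such removed phrases might be detected is given below. It assumes a "phrase" means the hype word immediately followed by one of the listed words, which the description does not spell out, and it uses only three of the entries above. The function name and matching rule are hypothetical rather than the authors' procedure.

```python
import re

# Subset of the removed phrases listed above (hype word -> following words).
REMOVED = {
    "robust": {"regression"},
    "unique": {"model", "amino"},
    "remarkable": {"properties"},
}


def is_removed_phrase(sentence, hype_word, removed=REMOVED):
    """True if hype_word is directly followed by one of its listed words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    target = hype_word.lower()
    followers = removed.get(target, set())
    return any(
        tok == target and nxt in followers
        for tok, nxt in zip(tokens, tokens[1:])
    )


technical = is_removed_phrase("We fitted a robust regression model.", "robust")
evaluative = is_removed_phrase("The effect was robust across cohorts.", "robust")
```

Under this reading, the first sentence would go to the secondary file and the second would stay in the primary dataset.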
File 3: WSD_hype.tsv
Includes hype-based disambiguation for candidate words targeted for WSD (Word sense disambiguation)
keywords:
Hype; PubMed; Abstracts; Biomedicine
published:
2025-12-14
Fraterrigo, Jennifer; Chen, Weile
(2025)
This dataset contains information about absorptive roots from 170 plots along a latitudinal and temperature gradient in northern Alaska, including tussock sedges and deciduous alder, birch, and willow shrubs. This dataset accompanies the paper "Impacts of Arctic Shrubs on Root Traits and Belowground Nutrient Cycles Across a Northern Alaskan Climate Gradient," which was published in Frontiers in Plant Science.
Note: in the "patch coordinates" tab, the same coordinates and elevation ("Long", "Lat", and "Elev (m)") apply to all patches that share a number. For example, "Patch" W1, B1, and G1 share the same "Long", "Lat", and "Elev (m)" values as "Patch" A1.
keywords:
absorptive root traits; shrub expansion; Arctic; Alaskan tundra
published:
2020-04-20
Supplemental data sets for the manuscript entitled "Contribution of fungal and invertebrate communities to mass loss and wood depolymerization in tropical terrestrial and aquatic habitats".
keywords:
Coiba Island; wood decomposition; cellulose; hemicellulose; lignin breakdown; aquatic fungi
published:
2020-01-31
Bradshaw, Therin M.; Blake-Bradshaw, Abigail G.; Fournier, Auriel M.V.; Lancaster, Joseph D.; O'Connell, John; Jacques, Christopher N.; Eicholtz, Michael W.; Hagy, Heath M.
(2020)
Data inputs and scripts for the analysis detailed in Bradshaw et al., published in PLOS ONE (2020).
keywords:
Marsh birds; wetlands