Illinois Data Bank Dataset Search Results
Results
published:
2018-04-23
Mishra, Shubhanshu; Fegley, Brent D; Diesner, Jana; Torvik, Vetle I.
(2018)
Self-citation analysis data based on PubMed Central subset (2002-2005)
----------------------------------------------------------------------
Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018
## Introduction
This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE.
It contains files for running the self citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015.
The dataset is distributed in the form of the following tab separated text files:
* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt file - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
* README.txt file
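Because the column names are shipped in a separate header file, they have to be joined with the data rows at read time. The following is a minimal sketch of that step, not the authors' loading code. It assumes the header file holds a single tab-separated line of column names and that the data files carry no header row of their own; neither detail is stated here, so check COLUMNS_DESC.txt first. The function names `read_header` and `iter_records` are illustrative.

```python
import csv
import io


def read_header(header_file):
    """Return column names from a file object holding one tab-separated line."""
    return next(csv.reader(header_file, delimiter="\t"))


def iter_records(data_file, columns):
    """Yield one dict per data row, keyed by the supplied column names."""
    reader = csv.reader(data_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(row)}")
        yield dict(zip(columns, row))


if __name__ == "__main__":
    # Tiny in-memory stand-ins; these column names are made up for the demo.
    header = io.StringIO("col_a\tcol_b\n")
    data = io.StringIO("1\tx\n2\ty\n")
    cols = read_header(header)
    for record in iter_records(data, cols):
        print(record)
```

With the real files, which are around 1 GB each, passing an open file handle and consuming the generator row by row avoids loading everything into memory.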
## Dataset creation
Our experiments relied on data from multiple sources, including proprietary data from [Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations](<a href="https://clarivate.com/products/web-of-science/databases/">https://clarivate.com/products/web-of-science/databases/</a>). Authors interested in reproducing our experiments should request this data directly from Clarivate Analytics. However, we do provide a similar but open dataset based on citations from PubMed Central, which can be used to obtain results similar to those reported in our analysis. Furthermore, we have freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of those datasets, please make sure you cite both the dataset and the paper introducing the dataset.
* MEDLINE 2015 baseline: <a href="https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html">https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html</a>
* Citation data from PubMed Central (original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4222651_V1">Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1</a>
- Paper citation: <a href="https://doi.org/10.1145/1552303.1552304">Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304</a>
- Paper citation: <a href="https://doi.org/10.1002/asi.20105">Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105</a>
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-9087546_V1">Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1</a>
- Paper citation: <a href="https://doi.org/10.1145/2467696.2467720">Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720</a>
- Paper citation: <a href="http://hdl.handle.net/2142/88927">Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927</a>
* MapAffil for identifying article country of affiliation:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4354331_V1">Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1</a>
- Paper citation: <a href="http://doi.org/10.1045/november2015-torvik">Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik</a>
* IMPLICIT journal similarity:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4742014_V1">Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1</a>
* Novelty dataset for identifying article-level novelty:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-5060298_V1">Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1</a>
- Paper citation: <a href="https://doi.org/10.1045/september2016-mishra"> Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra</a>
- Code: <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a>
* Expertise dataset for identifying author expertise on articles:
* Source code provided at: <a href="https://github.com/napsternxg/PubMed_SelfCitationAnalysis">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>
**Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.**
Check <a href="https://www.nlm.nih.gov/databases/download/pubmed_medline.html">here</a> for information on obtaining PubMed/MEDLINE and for NLM's data Terms and Conditions.
Additional data related updates can be found at <a href="http://abel.ischool.illinois.edu">Torvik Research Group</a>
## Acknowledgments
This work was made possible in part with funding to VIT from <a href="https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490">NIH grant P01AG039347</a> and <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742">NSF grant 1348742</a>. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
## License
Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.
Permissions beyond the scope of this license may be available at <a href="https://github.com/napsternxg/PubMed_SelfCitationAnalysis">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>.
keywords:
Self citation; PubMed Central; Data Analysis; Citation Data;
published:
2018-04-19
Torvik, Vetle I.; Smalheiser, Neil R.
(2018)
Author-ity 2009 baseline dataset. Prepared by Vetle Torvik 2009-12-03
The dataset comes in the form of 18 compressed (.gz) linux text files named authority2009.part00.gz - authority2009.part17.gz. The total size should be ~17.4GB uncompressed.
• How was the dataset created?
The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in July 2009. The snapshot contains a total of 19,011,985 Article records and 61,658,514 author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. Details are described in
<i>Torvik, V., & Smalheiser, N. (2009). Author name disambiguation in MEDLINE. ACM Transactions On Knowledge Discovery From Data, 3(3), doi:10.1145/1552303.1552304</i>
<i>Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation. Journal Of The American Society For Information Science & Technology, 56(2), 140-158. doi:10.1002/asi.20105</i>
Note that for Author-ity 2009, some new predictive features (e.g., grants, citation matches, temporal, affiliation phrases) and a post-processing merging procedure were applied (to capture name variants not captured during blocking, e.g., matches for subsets of compound last names, and nicknames with a different first initial, like Bill and William). This has not yet been written up for publication.
• How accurate is the 2009 dataset (compared to 2006 and 2008)?
The recall reported for 2006 of 98.8% has been much improved in 2009 (because common last name variants are now captured). Compared to 2006, both 2008 and 2009 overall seem to exhibit a higher rate of splitting errors but a lower rate of lumping errors. This reflects an overall decrease in prior probabilities, possibly because of a) a new prior estimation procedure that avoids wild estimates (by dampening the magnitude of iterative changes); b) the inclusion in 2008 and 2009 of items in PubMed-not-Medline (including in-process items); and c) the dramatic (exponential) increase in frequencies of some names (J. Lee went from ~16,000 occurrences in 2006 to 26,000 in 2009). However, splitting is reduced in 2009 for some special cases, such as NIH-funded investigators who list their grant number on their papers. Compared to 2008, splitting errors were reduced overall in 2009 while maintaining the same level of lumping errors.
• What is the format of the dataset?
The cluster summaries for 2009 are much more extensive than in the 2008 dataset. Each line corresponds to a predicted author-individual represented by a cluster of author name instances and a summary of all the corresponding papers and author name variants (and if there are > 10 papers in the cluster, an identical summary of the 10 most recent papers). Each cluster has a unique Author ID (which is uniquely identified by the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields:
1. blocks separated by '||'; each block may consist of multiple lastname-first initial variants separated by '|'
2. prior probabilities of the respective blocks separated by '|'
3. Cluster number relative to the block ordered by cluster size (some are listed as 'CLUSTER X' when they were derived from multiple blocks)
4. Author ID (or cluster ID) e.g., bass_c_9731334_2 represents a cluster where 9731334_2 is the earliest author name instance. Although not needed for uniqueness, the id also has the most frequent lastname_firstinitial (lowercased).
5. cluster size (number of author name instances on papers)
6. name variants separated by '|' with counts in parenthesis. Each variant of the format lastname_firstname middleinitial, suffix
7. last name variants separated by '|'
8. first name variants separated by '|'
9. middle initial variants separated by '|' ('-' if none)
10. suffix variants separated by '|' ('-' if none)
11. email addresses separated by '|' ('-' if none)
12. range of years (e.g., 1997-2009)
13. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parenthesis; separated by '|'; ('-' if none)
14. Top 20 most frequent MeSH (after stoplisting; "-") with counts in parenthesis; separated by '|'; ('-' if none)
15. Journals with counts in parenthesis (separated by "|"),
16. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parenthesis; separated by '|'; ('-' if none)
17. Co-author names (lowercased lastname and first/middle initials) with counts in parenthesis; separated by '|'; ('-' if none)
18. Co-author IDs with counts in parenthesis; separated by '|'; ('-' if none)
19. Author name instances (PMID_auno separated by '|')
20. Grant IDs (after normalization; "-" if none given; separated by "|"),
21. Total number of times cited. (Citations are based on references extracted from PMC).
22. h-index
23. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parenthesis); separated by "|"
24. Cited: PMIDs that the author cited (with counts in parenthesis) separated by "|"
25. Cited-by: PMIDs that cited the author (with counts in parenthesis) separated by "|"
26-47. same summary as for 4-25 except that the 10 most recent papers were used (based on year; so if paper 10, 11, 12... have the same year, one is selected arbitrarily)
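The field layout above can be read with a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the authors' tooling: it assumes the text is UTF-8 decodable (undecodable bytes are replaced), that counted items look like `name(3)` with the count as the final parenthesised group, and that every line carries at least the first 19 fields. The helper names are illustrative.

```python
import gzip

# File names follow the pattern given above: part00 through part17.
PART_NAMES = [f"authority2009.part{i:02d}.gz" for i in range(18)]


def parse_instance(instance):
    """Split an author name instance such as '10786286_3' into (PMID, position)."""
    pmid, position = instance.split("_")
    return int(pmid), int(position)


def parse_counted(field):
    """Parse 'a(2)|b(1)' into [('a', 2), ('b', 1)]; '-' means no values.

    Assumes each count is the final parenthesised group of its item.
    """
    if field == "-":
        return []
    items = []
    for item in field.split("|"):
        name, _, count = item.rpartition("(")
        items.append((name, int(count.rstrip(")"))))
    return items


def parse_cluster(line):
    """Return a dict of selected fields from one tab-delimited cluster line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "author_id": fields[3],          # field 4
        "cluster_size": int(fields[4]),  # field 5
        # field 19: author name instances separated by '|'
        "instances": [parse_instance(x) for x in fields[18].split("|")],
    }


def iter_clusters(path):
    """Stream clusters from one compressed part without loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield parse_cluster(line)
```

Looping over `PART_NAMES` and calling `iter_clusters` on each path processes the full dataset one line at a time, which matters given the ~17.4GB uncompressed size.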
keywords:
Bibliographic databases; Name disambiguation; MEDLINE; Library information networks
published:
2017-06-16
Haselhorst, Derek S; Tcheng, David K. ; Moreno, J. Enrique ; Punyasena, Surangi W.
(2017)
Table S1. Pollen types identified in the BCI and PNSL pollen rain data sets. Pollen types were identified to species when possible and assigned a life form based on descriptions provided in Croat, T.B. (1978). Taxa from BCI and PNSL were assigned a 1 if present in forest census data or a 0 if absent. The relative representation of each taxon has been provided for each extended record and by dry and wet season representation respectively. CA loadings are provided for axes 1 and 2 (Fig. 1).
keywords:
pollen; identifications; abundance; data; BCI; PNSL; Panama
published:
2018-04-23
Mishra, Shubhanshu; Torvik, Vetle I.
(2018)
Conceptual novelty analysis data based on PubMed Medical Subject Headings
----------------------------------------------------------------------
Created by Shubhanshu Mishra, and Vetle I. Torvik on April 16th, 2018
## Introduction
This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : the magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra.
It contains final data generated as part of our experiments based on MEDLINE 2015 baseline and MeSH tree from 2015.
The dataset is distributed in the form of the following tab separated text files:
* PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follows:
- PMID: PubMed ID
- Year: year of publication
- TimeNovelty: time novelty score of the paper based on individual concepts (see paper)
- VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper)
- PairTimeNovelty: time novelty score of the paper based on pair of concepts (see paper)
- PairVolumeNovelty: volume novelty score of the paper based on pair of concepts (see paper)
* mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follows:
- MeshTerm: Name of the MeSH term
- Year: year
- AbsVal: Total publications with that MeSH term in the given year
- TimeNovelty: age (in years since first publication) of MeSH term in the given year
- VolumeNovelty: age (in number of papers since first publication) of MeSH term in the given year
* meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH pair for all years
- Mesh1: Name of the first MeSH term (alphabetically sorted)
- Mesh2: Name of the second MeSH term (alphabetically sorted)
- Year: year
- AbsVal: Total publications with that MeSH pair in the given year
- TimeNovelty: age (in years since first publication) of MeSH pair in the given year
- VolumeNovelty: age (in number of papers since first publication) of MeSH pair in the given year
* README.txt file
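With over 22 million rows, PubMed2015_NoveltyData.tsv is best processed as a stream. The sketch below computes the mean TimeNovelty per publication year in one pass. It assumes the file's first row is a header using exactly the column names listed above and that every score parses as a number; neither is confirmed here, and the function names are illustrative.

```python
import csv
import io


def iter_novelty(handle):
    """Yield (PMID, year, TimeNovelty) from a tab-separated file with a header row."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield row["PMID"], int(row["Year"]), float(row["TimeNovelty"])


def mean_time_novelty_by_year(handle):
    """Average TimeNovelty per publication year in a single streaming pass."""
    totals = {}
    for _pmid, year, score in iter_novelty(handle):
        total, count = totals.get(year, (0.0, 0))
        totals[year] = (total + score, count + 1)
    return {year: total / count for year, (total, count) in totals.items()}
```

The same pattern applies to mesh_scores.tsv by swapping the column names; for meshpair_scores.txt.gz, the handle could come from `gzip.open(path, "rt")` instead of `open`.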
## Dataset creation
This dataset was constructed using multiple datasets described in the following locations:
* MEDLINE 2015 baseline: <a href="https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html">https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html</a>
* MeSH tree 2015: <a href="ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/">ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/</a>
* Source code provided at: <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a>
Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.
Check <a href="https://www.nlm.nih.gov/databases/download/pubmed_medline.html">here</a> for information on obtaining PubMed/MEDLINE and for NLM's data Terms and Conditions.
Additional data related updates can be found at: <a href="http://abel.ischool.illinois.edu">Torvik Research Group</a>
## Acknowledgments
This work was made possible in part with funding to VIT from <a href="https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490">NIH grant P01AG039347 </a> and <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742">NSF grant 1348742 </a>. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
## License
Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.
Permissions beyond the scope of this license may be available at <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a>
keywords:
Conceptual novelty; bibliometrics; PubMed; MEDLINE; MeSH; Medical Subject Headings; Analysis;
published:
2022-01-01
Cao, Yanghui; Dietrich, Christopher H.
(2022)
The file “Fla.fasta”, comprising 10526 positions, is the concatenated amino acid alignment of 51 orthologues of 182 bacterial strains. It was used for the maximum likelihood and maximum parsimony analyses of Flavobacteriales. Bacterial species names and strains were used as the sequence names; host names of insect endosymbionts are shown in brackets. The file “16S.fasta” is the alignment of 233 bacterial 16S rRNA sequences. It contains 1455 positions and was used for the maximum likelihood analysis of flavobacterial insect endosymbionts. The names of endosymbiont strains were replaced by the names of their hosts. In addition to the species names, National Center for Biotechnology Information (NCBI) accession numbers are also indicated in the sequence names (e.g., sequence “Cicadellidae_Deltocephalinae_Macrostelini_Macrosteles_striifrons_AB795320” is the 16S rRNA of Macrosteles striifrons (Cicadellidae: Deltocephalinae: Macrostelini) with the NCBI accession number AB795320). The file “Sulcia_pep.fasta” is the concatenated amino acid alignment of 131 orthologues of “Candidatus Sulcia muelleri” (Sulcia). It contains 41970 positions and represents 101 Sulcia strains and 3 Blattabacterium strains. This file was used for the maximum likelihood analysis of Sulcia. The file “Sulcia_nucleotide.fasta” is the concatenated nucleotide alignment corresponding to the sequences in “Sulcia_pep.fasta” but also comprises the alignment of 16S rRNA. It has 127339 positions and was used for the maximum likelihood and maximum parsimony analyses of Sulcia. Individual gene alignments (16S rRNA and 131 orthologues of Sulcia and Blattabacterium), which were used to construct gene trees for multispecies coalescent analysis, are deposited in the compressed file “individual_gene_alignments.zip”. The names of Sulcia strains were replaced by the names of their hosts in “Sulcia_pep.fasta”, “Sulcia_nucleotide.fasta” and the files in “individual_gene_alignments.zip”.
In all the alignment files, gaps are indicated by “-”.
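The sequence and position counts quoted above can be checked after download with a small FASTA reader. This is a minimal sketch using only the standard library. It assumes plain FASTA with `>` header lines and unique sequence names (a repeated name would overwrite the earlier record); the function names are illustrative.

```python
import io


def read_fasta(handle):
    """Return {sequence name: sequence} from a FASTA-formatted file object."""
    sequences = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    return {key: "".join(parts) for key, parts in sequences.items()}


def alignment_shape(sequences):
    """Return (number of sequences, alignment length); fail if lengths differ."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences have unequal lengths: {sorted(lengths)}")
    return len(sequences), lengths.pop()


def gap_fraction(sequence):
    """Fraction of positions that are gaps, which these files mark with '-'."""
    return sequence.count("-") / len(sequence)
```

If the description holds, `alignment_shape` should report 182 sequences of 10526 positions for “Fla.fasta” and 233 sequences of 1455 positions for “16S.fasta”.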
keywords:
endosymbiont, “Candidatus Sulcia muelleri”, Auchenorrhyncha, coevolution
published:
2024-08-24
Jones, Todd; Llamas, Alfredo; Phillips, Jennifer
(2024)
Dataset associated with Jones et al. GCB-23-1273.R1 submission: Phenotypic signatures of urbanization? Resident, but not migratory, songbird eye size varies with urban-associated light pollution levels. Excel CSV file with all of the data used in analyses and file with descriptions of each column.
keywords:
body size; demographics; eye size; phenotypic divergence; songbirds; sensory pollution; urbanization
published:
2023-12-18
Edmonds, Devin; Adamovicz, Laura; Allender, Matthew; Colton, Andrea; Randy, Nyboer; Michael, Dreslik
(2023)
We conducted long-term capture-mark-recapture surveys on two isolated ornate box turtle (Terrapene ornata) populations in northern Illinois, USA. This dataset provides the capture history strings and additional demographic information used for estimating population vital rates with robust design capture-mark-recapture models. The vital rates were then used in a stage-based population projection matrix model for each population.
keywords:
demography; capture-mark-recapture; vital rates; conservation; wildlife ecology
published:
2011-09-20
Swenson, M. Shel; Suri, Rahul; Linder, C. Randal; Warnow, Tandy; Nguyen, Nam-phuong; Mirarab, Siavash; Neves, Diogo Telmo; Sobral, João Luís; Pingali, Keshav; Nelesen, Serita; Liu, Kevin; Wang, Li-San
(2011)
This page provides the data for SuperFine, DACTAL, and BeeTLe publications.
- Swenson, M. Shel, et al. "SuperFine: fast and accurate supertree estimation." Systematic biology 61.2 (2012): 214.
- Nguyen, Nam, Siavash Mirarab, and Tandy Warnow. "MRL and SuperFine+ MRL: new supertree methods." Algorithms for Molecular Biology 7 (2012): 1-13.
- Neves, Diogo Telmo, et al. "Parallelizing superfine." Proceedings of the 27th Annual ACM Symposium on Applied Computing. 2012.
- Nelesen, Serita, et al. "DACTAL: divide-and-conquer trees (almost) without alignments." Bioinformatics 28.12 (2012): i274-i282.
- Liu, Kevin, and Tandy Warnow. "Treelength optimization for phylogeny estimation." PLoS One 7.3 (2012): e33104.
published:
2017-12-14
Hepler, Katherine C.
(2017)
keywords:
uranium harvesting from seawater; Geospatial analysis; adsorbent performance; NPRE 412
published:
2017-11-14
Miller, Martin; Chung, Soon-Jo; Hutchinson, Seth
(2017)
If you use this dataset, please cite the IJRR data paper (bibtex is below).
We present a dataset collected from a canoe along the Sangamon River in Illinois. The canoe was equipped with a stereo camera, an IMU, and a GPS device, which provide visual data suitable for stereo or monocular applications, inertial measurements, and position data for ground truth. We recorded a canoe trip up and down the river for 44 minutes covering 2.7 km round trip. The dataset adds to those previously recorded in unstructured environments and is unique in that it is recorded on a river, which provides its own set of challenges and constraints that are described in this paper. The data is divided into subsets, which can be downloaded individually.
Video previews are available on Youtube:
https://www.youtube.com/channel/UCOU9e7xxqmL_s4QX6jsGZSw
The information below can also be found in the README files provided in the 527 dataset and each of its subsets. The purpose of this document is to assist researchers in using this dataset.
Images
======
Raw
---
The raw images are stored in the cam0 and cam1 directories in bmp format. They are bayered images that need to be debayered and undistorted before they are used. The camera parameters for these images can be found in camchain-imucam.yaml. Note that the camera intrinsics describe a 1600x1200 resolution image, so the focal length and center pixel coordinates must be scaled by 0.5 before they are used. The distortion coefficients remain the same even for the scaled images. The camera to IMU transformation matrix is also in this file. cam0/ refers to the left camera, and cam1/ refers to the right camera.
Rectified
---------
Stereo rectified, undistorted, row-aligned, debayered images are stored in the rectified/ directory in the same way as the raw images except that they are in png format. The params.yaml file contains the projection and rotation matrices necessary to use these images. Unlike the parameters for the raw images, these parameters do not need to be scaled.
params.yml
----------
The stereo rectification parameters. R0,R1,P0,P1, and Q correspond to the outputs of the OpenCV stereoRectify function except that 1s and 2s are replaced by 0s and 1s, respectively.
R0: The rectifying rotation matrix of the left camera.
R1: The rectifying rotation matrix of the right camera.
P0: The projection matrix of the left camera.
P1: The projection matrix of the right camera.
Q: Disparity to depth mapping matrix
T_cam_imu: Transformation matrix for a point in the IMU frame to the left camera frame.
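As an illustration of how Q is typically used, the sketch below maps a rectified pixel and its disparity to a 3D point, following the OpenCV convention [X, Y, Z, W]^T = Q [x, y, d, 1]^T with the point given by (X/W, Y/W, Z/W). It is written in plain Python and makes no claim about how the dataset authors computed depth. The Q values in the demo are made up, and the output units depend on the units of the calibration baseline.

```python
def reproject_pixel(Q, x, y, disparity):
    """Map pixel (x, y) with a disparity to a 3D point using the 4x4 matrix Q.

    Follows the OpenCV convention [X, Y, Z, W]^T = Q [x, y, d, 1]^T,
    with the 3D point given by (X/W, Y/W, Z/W).
    """
    vec = (x, y, disparity, 1.0)
    X, Y, Z, W = (sum(q * v for q, v in zip(row, vec)) for row in Q)
    if W == 0:
        raise ValueError("zero disparity maps to a point at infinity")
    return X / W, Y / W, Z / W


if __name__ == "__main__":
    # Hypothetical Q for illustration only; load the real one from the params file.
    Q_demo = [
        [1.0, 0.0, 0.0, -400.0],
        [0.0, 1.0, 0.0, -300.0],
        [0.0, 0.0, 0.0, 400.0],
        [0.0, 0.0, 10.0, 0.0],
    ]
    print(reproject_pixel(Q_demo, 500, 300, 20))
```

For whole disparity maps, OpenCV's `reprojectImageTo3D` applies the same mapping to every pixel.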
camchain-imucam.yaml
--------------------
The camera intrinsic and extrinsic parameters and the camera to IMU transformation usable with the raw images.
T_cam_imu: Transformation matrix for a point in the IMU frame to the camera frame.
distortion_coeffs: lens distortion coefficients using the radial tangential model.
intrinsics: focal length x, focal length y, principal point x, principal point y
resolution: resolution of calibration. Scale the intrinsics for use with the raw 800x600 images. The distortion coefficients do not change when the image is scaled.
T_cn_cnm1: Transformation matrix from the right camera to the left camera.
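The scaling step for the raw images can be sketched as follows. The code assumes the intrinsics are ordered [fx, fy, cx, cy] as listed above and applies the plain 0.5 factor this README prescribes; the numeric values in the demo are invented, and the function names are illustrative.

```python
def scale_intrinsics(intrinsics, factor=0.5):
    """Scale [fx, fy, cx, cy] for a resized image.

    Distortion coefficients are left alone; as noted above they do not
    change when the image is scaled.
    """
    fx, fy, cx, cy = intrinsics
    return [fx * factor, fy * factor, cx * factor, cy * factor]


def camera_matrix(intrinsics):
    """Build the 3x3 pinhole matrix K from [fx, fy, cx, cy]."""
    fx, fy, cx, cy = intrinsics
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


if __name__ == "__main__":
    # Made-up calibration values at 1600x1200, scaled for the 800x600 raw images.
    calibrated = [1000.0, 1002.0, 800.0, 600.0]
    print(camera_matrix(scale_intrinsics(calibrated)))
```

The resulting matrix, together with the unchanged distortion coefficients, is what an undistortion routine would need for the raw images.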
Sensors
-------
Each message type in name.csv is described below.
###rawimus###
time # GPS time in seconds
message name # rawimus
acceleration_z # m/s^2 IMU uses right-forward-up coordinates
-acceleration_y # m/s^2
acceleration_x # m/s^2
angular_rate_z # rad/s IMU uses right-forward-up coordinates
-angular_rate_y # rad/s
angular_rate_x # rad/s
###IMG###
time # GPS time in seconds
message name # IMG
left image filename
right image filename
###inspvas###
time # GPS time in seconds
message name # inspvas
latitude
longitude
altitude # ellipsoidal height WGS84 in meters
north velocity # m/s
east velocity # m/s
up velocity # m/s
roll # right hand rotation about y axis in degrees
pitch # right hand rotation about x axis in degrees
azimuth # left hand rotation about z axis in degrees clockwise from north
###inscovs###
time # GPS time in seconds
message name # inscovs
position covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz m^2
attitude covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz deg^2
velocity covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz (m/s)^2
###bestutm###
time # GPS time in seconds
message name # bestutm
utm zone # numerical zone
utm character # alphabetical zone
northing # m
easting # m
height # m above mean sea level
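A reader for these messages might dispatch on the message name. The sketch below is hedged on several points that the listing does not confirm: it assumes name.csv is comma-separated, that each row starts with the time followed by the message name exactly as written in the headings above, and that the remaining values appear in the listed order. Only rawimus and inspvas are converted to numbers; other message types are passed through as raw strings. Field names such as `-acceleration_y` are kept as listed, without interpreting the sign.

```python
import csv
import io

# Numeric payload fields per message type, in the order listed above.
NUMERIC_FIELDS = {
    "rawimus": ["acceleration_z", "-acceleration_y", "acceleration_x",
                "angular_rate_z", "-angular_rate_y", "angular_rate_x"],
    "inspvas": ["latitude", "longitude", "altitude",
                "north_velocity", "east_velocity", "up_velocity",
                "roll", "pitch", "azimuth"],
}


def parse_message(row):
    """Turn one CSV row into a dict; unknown message types keep a raw payload."""
    time, name, payload = float(row[0]), row[1].strip(), row[2:]
    message = {"time": time, "name": name}
    names = NUMERIC_FIELDS.get(name)
    if names is None:
        message["payload"] = payload
    else:
        message.update(zip(names, map(float, payload)))
    return message


def iter_messages(handle):
    """Yield parsed messages from a file object, skipping blank rows."""
    for row in csv.reader(handle):
        if row:
            yield parse_message(row)
```

Filtering the generator by `message["name"]` gives separate IMU, pose, and image streams that share the same GPS time base.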
Camera logs
-----------
The files name.cam0 and name.cam1 are text files that correspond to cameras 0 and 1, respectively. The columns are defined by:
unused: The first column is all 1s and can be ignored.
software frame number: This number increments at the end of every iteration of the software loop.
camera frame number: This number is generated by the camera and increments each time the shutter is triggered. The software and camera frame numbers do not have to start at the same value, but if the difference between the initial and final values is not the same, it suggests that frames may have been dropped.
camera timestamp: This is the camera's internal timestamp of the frame capture in units of 100 milliseconds.
PC timestamp: This is the PC time of arrival of the image.
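The dropped-frame check described for the camera frame number can be sketched as below. It assumes the columns are whitespace-separated and appear in the order listed (unused, software frame number, camera frame number, camera timestamp, PC timestamp); the delimiter is not stated here, so adjust the split if the files use commas or tabs. The function names are illustrative.

```python
def frame_spans(lines):
    """Return (software span, camera span): last minus first value of each counter."""
    rows = [line.split() for line in lines if line.strip()]
    first, last = rows[0], rows[-1]
    software_span = int(last[1]) - int(first[1])
    camera_span = int(last[2]) - int(first[2])
    return software_span, camera_span


def frames_possibly_dropped(lines):
    """True when the two counters advanced by different amounts."""
    software_span, camera_span = frame_spans(lines)
    return software_span != camera_span
```

Passing an open file object for name.cam0 or name.cam1 works directly, since iterating a file yields its lines. A mismatch only suggests dropped frames; it does not say where in the sequence they occurred.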
name.kml
--------
The kml file is a mapping file that can be read by software such as Google Earth. It contains the recorded GPS trajectory.
name.unicsv
-----------
This is a csv file of the GPS trajectory in UTM coordinates that can be read by gpsbabel, software for manipulating GPS paths.
@article{doi:10.1177/0278364917751842,
author = {Martin Miller and Soon-Jo Chung and Seth Hutchinson},
title ={The Visual–Inertial Canoe Dataset},
journal = {The International Journal of Robotics Research},
volume = {37},
number = {1},
pages = {13-20},
year = {2018},
doi = {10.1177/0278364917751842},
URL = {https://doi.org/10.1177/0278364917751842},
eprint = {https://doi.org/10.1177/0278364917751842}
}
keywords:
slam;sangamon;river;illinois;canoe;gps;imu;stereo;monocular;vision;inertial
published:
2019-09-17
Mishra, Shubhanshu
(2019)
Trained models for multi-task multi-dataset learning for text classification in tweets.
Classification tasks include sentiment prediction, abusive content, sarcasm, and veridictality.
Models were trained using: <a href="https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification.py">https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification.py</a>
See <a href="https://github.com/socialmediaie/SocialMediaIE">https://github.com/socialmediaie/SocialMediaIE</a> and <a href="https://socialmediaie.github.io">https://socialmediaie.github.io</a> for details.
If you are using this data, please also cite the related article:
Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
keywords:
twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning; sentiment; sarcasm; abusive content;
published:
2020-08-21
Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana
(2020)
# WikiCSSH
If you are using WikiCSSH please cite the following:
> Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. “WikiCSSH: Extracting Computer Science Subject Headings from Wikipedia.” In Workshop on Scientific Knowledge Graphs (SKG 2020). https://skg.kmi.open.ac.uk/SKG2020/papers/HAN_et_al_SKG_2020.pdf
> Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. "WikiCSSH - Computer Science Subject Headings from Wikipedia". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0424970_V1
Download the WikiCSSH files from: https://doi.org/10.13012/B2IDB-0424970_V1
More details about the WikiCSSH project can be found at: https://github.com/uiuc-ischool-scanr/WikiCSSH
This folder contains the following files:
WikiCSSH_categories.csv - Categories in WikiCSSH
WikiCSSH_category_links.csv - Links between categories in WikiCSSH
Wikicssh_core_categories.csv - Core categories as mentioned in the paper
WikiCSSH_category_links_all.csv - Links between categories in WikiCSSH (includes a dummy category called <ROOT> which is parent of isolates and top level categories)
WikiCSSH_category2page.csv - Links between Wikipedia pages and Wikipedia Categories in WikiCSSH
WikiCSSH_page2redirect.csv - Links between Wikipedia pages and Wikipedia page redirects in WikiCSSH
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit <a href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</a> or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
keywords:
wikipedia; computer science;
published:
2022-03-19
McCoy, Annette; Secor, Erica; Roady, Patrick; Gray, Sarah; Klein, Julie; Gutierrez-Nibeyro, Santiago
(2022)
Raw arthroscopic scores, histologic scores, cytokine measurements, and performance data for the study cohort described in the accompanying publication.
keywords:
horse; metatarsophalangeal joint; arthroscopy; exercise; developmental orthopedic disease
published:
2016-06-23
This dataset was extracted from a set of metadata files harvested from the DataCite metadata store (https://search.datacite.org/ui) during December 2015. Metadata records for items with a resourceType of dataset were collected. 1,647,949 total records were collected.
This dataset contains three files:
1) readme.txt: A readme file.
2) version-results.csv: A CSV file containing three columns: DOI, DOI prefix, and version text contents
3) version-counts.csv: A CSV file containing counts for unique version text content values.
keywords:
datacite;metadata;version values;repository data
published:
2024-10-10
Mishra, Apratim; Lee, Haejin; Jeoung, Sullam; Torvik, Vetle; Diesner, Jana
(2024)
Diversity - PubMed dataset
Contact: Apratim Mishra (Oct, 2024)
This dataset presents article-level (pmid) and author-level (auid) diversity data for PubMed articles. The selection comprises 907 024 papers and 1 316 838 authors retrieved from Author-ity 2018 [1] and expands on the V1 dataset. The sample consists of articles from the top 40 journals in the dataset, limited to those with 2-12 authors, published between 1991 and 2014, of article type "journal type", and written in English. Files are gzip-compressed and tab-separated. V3 includes the correct author count for the included papers (pmids) and updated results with no NaNs.
################################################
File1: auids_plos_3.csv.gz (Important columns defined, 5 in total)
• AUID: a unique ID for each author
• Genni: gender prediction
• Ethnea: ethnicity prediction
#################################################
File2: pmids_plos_3.csv.gz (Important columns defined)
• pmid: unique paper
• auid: all unique auids (author-name unique identification)
• year: Year of paper publication
• no_authors: Author count
• journal: Journal name
• years: first year of publication for every author
• Country-temporal: Country of affiliation for every author
• h_index: Journal h-index
• TimeNovelty: Paper Time novelty [2]
• nih_funded: Binary variable indicating funding for any author
• prior_cit_mean: Mean of all authors’ prior citation rate
• Insti_impact: All unique institutions’ citation rate
• mesh_vals: Top MeSH values for every author of that paper
• relative_citation_ratio: RCR
The ‘Readme’ includes a description for all columns.
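Since the files are gzip-compressed and tab-separated, they can be read with the Python standard library alone. The following is a minimal sketch, assuming each file starts with a header row and is UTF-8 encoded (neither is stated above). The helper name `read_rows` and the demo rows are made up for illustration; only the column names `pmid`, `year`, and `no_authors` come from the description.

```python
import csv
import gzip
import os
import tempfile


def read_rows(path):
    """Yield each row of a gzip-compressed, tab-separated file as a dict.

    Assumes the first line is a header row and the text is UTF-8.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


# Write a tiny invented file so the sketch runs without the real data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, mode="wt", encoding="utf-8", newline="") as out:
    out.write("pmid\tyear\tno_authors\n1\t1995\t3\n2\t2010\t5\n")

rows = list(read_rows(demo_path))
for row in rows:
    # Values arrive as strings; convert explicitly where numbers are needed.
    print(row["pmid"], int(row["year"]), int(row["no_authors"]))
```

To read the actual data, replace `demo_path` with the path to `pmids_plos_3.csv.gz` or `auids_plos_3.csv.gz`. Because `read_rows` is a generator, large files are streamed row by row rather than loaded into memory at once.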
[1] Torvik, Vetle; Smalheiser, Neil (2021): Author-ity 2018 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2273402_V1
[2] Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
keywords:
Diversity; PubMed; Citation
published:
2025-07-21
Feng, Jennifer T.; van den Berg, Thya; Donders, Timme H.; Kong, Shu; Puthanveetil Satheesan, Sandeep; Punyasena, Surangi W.
(2025)
This dataset includes image stacks, annotated counts, and ground-truth masks from two high-resolution sediment cores extracted from Laguna Pallcacocha, in El Cajas National Park, Ecuadorian Andes by Moy et al. (2002) and Hagemans et al. (2021). The first core (PAL 1999, from Moy et al. (2002)) extends through the Holocene (11,600 cal. yr. BP - present). There are a total of 900 annotated image stacks and masks in the PAL 1999 domain. The second core (PAL IV, from Hagemans et al. (2021)) captures the 20th century. There are 2986 annotated image stacks and masks in the PAL IV domain.
Different microscopes and annotation tools were used to image and annotate each core, and there are corresponding differences in naming conventions and file formats. Thus, we organized our data separately for the PAL 1999 and the PAL IV domains. The three-letter codes used to label our pollen annotations are in the file: “Pollen_Identification_Codes.xlsx”.
Both domain directories contain:
• Image stacks organized by subdirectory
• Annotations within each image stack directory, containing specimen identifications using a three-letter code and coordinates defining bounding boxes or circles
• Ground-truth distance-transform masks for each image stack
The zip file "bestValModel_encoder.paramOnly.zip" is the trained pollen detection model produced from the images and annotations in this dataset.
Please cite this dataset as:
Feng, Jennifer T.; van den Berg, Thya; Donders, Timme H.; Kong, Shu; Puthanveetil Satheesan, Sandeep; Punyasena, Surangi W. (2025): Slide scans, annotated pollen counts, and trained pollen detection models for fossil pollen samples from Laguna Pallcacocha, El Cajas National Park, Ecuador. University of Illinois Urbana-Champaign. https://doi.org/10.13012/B2IDB-4207757_V1
Please also include citations of the original publications from which these data are taken:
Feng, Jennifer T., Sandeep Puthanveetil Satheesan, Shu Kong, Timme H. Donders, and Surangi W. Punyasena. “Addressing the ‘Open World’: Detecting and Segmenting Pollen on Palynological Slides with Deep Learning.” bioRxiv, January 1, 2025. https://doi.org/10.1101/2025.01.05.631390.
Feng, Jennifer T., Sandeep Puthanveetil Satheesan, Shu Kong, Timme H. Donders, and Surangi W. Punyasena. “Addressing the ‘Open World’: Detecting and Segmenting Pollen on Palynological Slides with Deep Learning.” Paleobiology, 2025 [in press].
Feng, J. T. (2023). Open-world deep learning applied to pollen detection (MS thesis, University of Illinois at Urbana-Champaign). https://hdl.handle.net/2142/120168
keywords:
continual learning; deep learning; domain gaps; open-world; palynology; pollen grain detection; taxonomic bias
published:
2025-01-30
Raw data associated with PMID: 38925247
published:
2025-01-30
Zhang, Yufan; Bhattarai, Rabin
(2025)
This is the research data for the manuscript "A Framework of Simulating Structural Sediment Perimeter Barriers using VFSMOD".
keywords:
sediment control
published:
2017-12-14
Objectives: This study follows up on previous work that began examining data deposited in an institutional repository. The work here extends the earlier study by answering the following research questions: (1) What is the file composition of datasets ingested into the University of Illinois at Urbana-Champaign campus repository? Are datasets more likely to be single file or multiple file items? (2) What is the usage data associated with these datasets? Which items are most popular?
Methods: The dataset records collected in this study were identified by filtering item types categorized as "data" or "dataset" using the advanced search function in IDEALS. Returned search results were collected in an Excel spreadsheet to include data such as the Handle identifier, date ingested, file formats, composition code, and the download count from the item's statistics report. The Handle identifier represents the dataset record's persistent identifier. Composition represents codes that categorize items as single or multiple file deposits. Date available represents the date the dataset record was published in the campus repository. Download statistics were collected via a website link for each dataset record and indicate the number of times the dataset record has been downloaded. Once the data was collected, it was used to evaluate datasets deposited into IDEALS.
Results: A total of 522 datasets were identified for analysis covering the period between January 2007 and August 2016. This study revealed two influxes, one during 2008-2009 and one in 2014. During the first period, a large number of PDFs were deposited by the Illinois Department of Agriculture, whereas in 2014 Microsoft Excel files were deposited by the Rare Books and Manuscript Library. Single file datasets clearly dominate the deposits in the campus repository. The total download count for all datasets was 139,663, and downloads per month per file averaged 3.2 across all datasets.
Conclusion: Academic librarians, repository managers, and research data services staff can use the results presented here to anticipate the nature of research data that may be deposited within institutional repositories. With increased awareness, content recruitment, and improvements, IRs can provide a viable cyberinfrastructure for researchers to deposit data, but much can be learned from the data already deposited. Awareness of trends can help librarians facilitate discussions with researchers about research data deposits as well as better tailor their services to address short-term and long-term research needs.
keywords:
research data; research statistics; institutional repositories; academic libraries
published:
2022-06-01
Southey, Bruce; Rodriguez-Zas, Sandra L.
(2022)
This dataset contains information for the paper "Changes in neuropeptide prohormone genes among Cetartiodactyla livestock and wild species associated with evolution and domestication", Veterinary Sciences, MDPI. Protein sequences were predicted using GeneWise for 98 neuropeptide prohormone genes from publicly available genomes of 118 Cetartiodactyla species. All predictions (CetartiodactylaSequences2022.zip) were manually verified. Sequences were aligned within each prohormone using MAFFT (MDPImultalign2022.zip includes multiple sequence alignment of all species available for each prohormone). Phylogenetic gene trees were constructed using PhyML and the species tree was constructed using ASTRAL (MDPItree2022.zip). The data is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.
keywords:
prohormone; neuropeptide; Cetartiodactyla; phylogenetics; gene tree; species tree
published:
2025-09-23
Zhao, Huimin; Chen, Li-Qing; Martin, Teresa; Xue, Xueyi; Singh, Nilmani; Tan, Shi-I; Boob, Aashutosh
(2025)
Mitochondria play a key role in energy production and metabolism, making them a promising target for metabolic engineering and disease treatment. However, despite the known influence of passenger proteins on localization efficiency, only a few protein-localization tags have been characterized for mitochondrial targeting. To address this limitation, we leverage a Variational Autoencoder to design novel mitochondrial targeting sequences. In silico analysis reveals that a high fraction of the generated peptides (90.14%) are functional and possess features important for mitochondrial targeting. We characterize artificial peptides in four eukaryotic organisms and, as a proof-of-concept, demonstrate their utility in increasing 3-hydroxypropionic acid titers through pathway compartmentalization and improving 5-aminolevulinate synthase delivery by 1.62-fold and 4.76-fold, respectively. Moreover, we employ latent space interpolation to shed light on the evolutionary origins of dual-targeting sequences. Overall, our work demonstrates the potential of generative artificial intelligence for both fundamental research and practical applications in mitochondrial biology.
keywords:
AI/ML; metabolic engineering; modeling; software
published:
2017-06-16
Haselhorst, Derek S.; Tcheng, David K.; Moreno, J. Enrique ; Punyasena, Surangi W.
(2017)
Table S2. Raw pollen counts and climatic data for each seasonal sampling period. Climatic data reflects the average daily conditions observed over the duration samples were collected (˚C/day, mm/day, MJ/m2/day). Lycopodium counts and counts for each pollen taxon reflect the aggregated pollen sum from four sampling heights.
keywords:
pollen; count; climate; data; BCI; PNSL; Panama
published:
2020-12-07
Tian, Yuan; Smith-Bolton, Rachel
(2020)
This page contains the data for the publication "Regulation of growth and cell fate during tissue regeneration by the two SWI/SNF chromatin-remodeling complexes of Drosophila" published in Genetics, 2020
published:
2020-11-25
Barker, Louise; Gaulke, Sarah M.; Chace, Jordyn Z.; Davis, Mark A.; Niemiller, Matthew L.; Taylor, Steven J.; Schuett, Gordon W.
(2020)
Video recorded by Louise Barker using a Canon PowerShot camera documents late-season combat behavior in Agkistrodon contortrix. Recorded in Beaufort County, North Carolina, 11.1 km SE of downtown Washington on 21 October 2020.
keywords:
Agkistrodon contortrix; combat; mating; reproduction; copperhead; pit viper; Viperidae;
published:
2017-06-01
List of Chinese Students Receiving a Ph.D. in Chemistry between 1905 and 1964. Based on two books compiling doctoral dissertations by Chinese students in the United States. Includes discipline, university, advisor, year degree awarded, birth and/or death date, and dissertation title. Accompanies Chapter 5: History of the Modern Chemistry Doctoral Program in Mainland China by Vera V. Mainz, published in "Igniting the Chemical Ring of Fire: Historical Evolution of the Chemical Communities in the Countries of the Pacific Rim", Seth Rasmussen, Editor. Published by World Scientific. Expected publication 2017.
keywords:
Chinese; graduate student; dissertation; university; advisor; chemistry; engineering; materials science