This online guide contains resources for finding data repositories for data preservation and access and locating datasets for reuse. The guide was developed as an online companion for the class Resources for Finding and Sharing Research Data. If you are NIH or HHS staff, please check out the NIH Library training schedule for upcoming classes.
If you need a one-on-one or group consultation on locating data repositories and datasets, please contact the NIH Library.
References
Some content of this guide is adapted from:
- Read, Kevin; Surkis, Alisa (2018): Research Data Management Teaching Toolkit. figshare. (https://figshare.com/articles/Research_Data_Management_Teaching_Toolkit/5042998) This work is licensed under Attribution 4.0 International (CC BY 4.0).
Data Repositories
Resources to Locate Data Repositories
- NIH Data Sharing Resources: Find links to tables created by the trans-NIH Biomedical Informatics Coordinating Committee (BMIC) listing:
- Domain-specific repositories
- Generalist repositories
- Information from the BMIC tables described above, listing repositories for sharing scientific data and repositories for accessing scientific data, can also be found at Sharing.nih.gov.
- Re3Data: The Registry of Research Data Repositories (Re3Data) is a portal created by the non-profit organization DataCite.
- The portal covers data repositories from across many academic disciplines.
- Users can search by keyword or browse repositories by subject or country.
- FAIRsharing.org: A curated resource on data and metadata standards, inter-related to databases and data policies.
- Choose Databases to search and browse data repositories.
- Choose Collections to view data repositories, standards, and policies related to various topics.
Resources for Data Sharing for Intramural NIH Researchers
- 2023 Final NIH Policy for Data Management & Sharing: All NIH-funded research, including NIH Intramural Research Projects conducted on or after January 25, 2023, will need to:
- Submit a Data Management and Sharing plan (DMSP) outlining how scientific data and any accompanying metadata will be managed and shared, taking into account any potential restrictions or limitations.
- Comply with the Data Management and Sharing plan approved by the funding Institute or Center (IC).
- Data Management & Sharing Policy Overview: Learn more about the 2023 Data Management & Sharing Policy, and find resources to assist with compliance.
- Supplemental Information to the NIH Policy for Data Management and Sharing:
- Allowable Costs for Data Management and Sharing
- Elements of an NIH Data Management and Sharing Plan
- Selecting a Repository for Data Resulting from NIH-Supported Research
- Protecting Privacy When Sharing Human Research Participant Data
- Responsible Management and Sharing of American Indian/Alaska Native Participant Data
- Guidance for Intramural Researchers (from OIR Sourcebook): For the Intramural Research Program, a DMS plan will be required for scientific data from:
- Research associated with a ZIA
- Research associated with a clinical protocol that will undergo IC Initial Scientific Review
- The plans will address the elements indicated in the Intramural Research Program Data Management and Sharing (IRP DMS) Plan template. The template addresses six NIH-recommended core elements, and allows for the inclusion of IC-specific elements: Intramural Data Management and Sharing Plan Template (PDF)
- See the 2023 NIH Data Management and Sharing Policy page in the OIR Sourcebook for additional guidance and resources.
- See the library guide Data Management and Sharing Plan Resources for a detailed list of DMSP resources and IC-specific contacts.
- Other Data Sharing Policy Information:
- Find more information on Intramural Data Sharing from the NIH Office of Intramural Research.
- Visit Sharing.nih.gov for guidance on Selecting a Data Repository and a list of potential Repositories for Sharing Scientific Data.
Issues to Consider with Data Repositories
Issues to consider when finding a data repository to preserve and share data:
- Required Repositories: Check the funder/publisher policies to see if there are required repositories where the data must be deposited.
- Sensitive Data: Make sure you are not sharing sensitive data (such as personally identifiable information (PII) or protected health information (PHI)).
- You may need to anonymize and/or aggregate the data before sharing, or access to the data may need to be limited to researchers with specific permissions.
- Intellectual Property: Be aware of who owns the intellectual property and if there are any licensing restrictions.
- Required Data Standards: Be aware of the data standards (such as metadata and data formats) required for depositing the data in the repository.
- Deposit and Storage Costs: Be aware of any costs associated with depositing/storing the data.
Find additional guidance at Sharing.nih.gov for Selecting a Data Repository.
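The anonymize-and-aggregate step mentioned under Sensitive Data can be sketched roughly as follows. This is a minimal illustration only, not a vetted de-identification procedure: the field names, the ten-year age bands, and the minimum cell size of 5 are all assumptions chosen for the example, and removing direct identifiers alone is generally not enough to de-identify a dataset.

```python
from collections import Counter

# Hypothetical field names treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"name", "email", "mrn"}


def drop_identifiers(records):
    """Return copies of the records without the direct-identifier fields."""
    return [
        {key: value for key, value in record.items() if key not in DIRECT_IDENTIFIERS}
        for record in records
    ]


def age_band(age, width=10):
    """Map an exact age to a coarser band, e.g. 34 becomes '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def aggregate_by_age_band(records, min_cell_size=5):
    """Count records per age band, suppressing bands below min_cell_size.

    Suppressed bands are reported as None rather than as a small count.
    The threshold is an illustrative assumption, not a policy value.
    """
    counts = Counter(age_band(record["age"]) for record in records)
    return {
        band: (count if count >= min_cell_size else None)
        for band, count in counts.items()
    }


# Entirely made-up records for demonstration.
participants = [
    {"name": "A", "email": "a@example.org", "age": 31, "site": "X"},
    {"name": "B", "email": "b@example.org", "age": 35, "site": "X"},
    {"name": "C", "email": "c@example.org", "age": 52, "site": "Y"},
]

print(drop_identifiers(participants))
print(aggregate_by_age_band(participants, min_cell_size=2))
```

Whether record-level data with identifiers removed, aggregated counts, or controlled access is appropriate depends on the data and the applicable policies, so a sketch like this is a starting point for discussion rather than a compliance check.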
Datasets
Searching Across Data Repositories
- Google Dataset Search: Search by keyword to locate datasets across the web. Filter by date last updated, download format, whether the dataset is in Croissant format, usage rights, topic, provider, and whether the dataset is free to access.
- According to a post from Google AI Blog, An Analysis of Online Datasets Using Dataset Search (from August 25, 2020), Google Dataset Search:
- Indexes datasets using the metadata descriptions that come directly from the dataset web pages using schema.org structure.
- Contains more than 31 million datasets from more than 4,600 internet domains.
- About half of these datasets come from .com domains, but .org and governmental domains are also well represented.
- Dataset results are now also listed in general Google search results, according to a February 2023 blog post.
- Mendeley Data: A data index and open research data repository from publisher Elsevier where users can search across research data from 2000+ generalist and domain-specific repositories.
- Filter results by date range, data type, source type (article or data repository), source, and funder.
- NLM Dataset Catalog (beta): Search across over 80,900 biomedical datasets from various repositories.
- NLM also offers Center for Clinical Observational Investigations (CCOI) Dataset Profiles for exploring large-scale clinical datasets.
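Because Google Dataset Search indexes datasets from schema.org metadata on dataset web pages, it can help to see what such a description looks like. The sketch below builds a minimal schema.org Dataset record as JSON-LD. The helper function name and every value in the example record are hypothetical placeholders, and the selection of properties is an assumption made for illustration, not a statement of what any search engine requires.

```python
import json


def build_dataset_jsonld(name, description, url, creator, license_url, keywords):
    """Return a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "creator": {"@type": "Person", "name": creator},
        "license": license_url,
        "keywords": keywords,
    }


# Hypothetical dataset: every value below is a placeholder.
record = build_dataset_jsonld(
    name="Example Survey Responses",
    description="Placeholder description of a hypothetical dataset.",
    url="https://example.org/datasets/example-survey",
    creator="Jane Example",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    keywords=["survey", "example"],
)

# A dataset landing page would typically embed this JSON inside a
# <script type="application/ld+json"> element.
print(json.dumps(record, indent=2))
```

Many repositories generate this kind of markup automatically on dataset landing pages, which is one reason depositing in an established repository tends to make data easier to find.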
Generalist Repositories
Here’s a closer look at a few major cross-disciplinary repositories highlighted on the NIH Data Sharing Resources: Generalist Repositories page.
- Dataverse: An open-source web application to share, preserve, cite, explore, and analyze research data.
- Dryad: A nonprofit repository for data underlying the international scientific and medical literature.
- Browse or search and filter datasets by funder, subject, journal, or institution.
- Figshare: A cross-disciplinary repository where users and institutions can upload datasets, supported by the technology company Digital Science.
- Filter by Item Type: Dataset.
- IEEE DataPort: A global research data platform backed by IEEE that allows users to upload datasets and access thousands of scientific datasets.
- Search across over 7,000 datasets.
- Open Science Framework (OSF): A free, open platform to support research and enable collaboration, including discovery of projects, data, materials, and collaborators.
- Synapse: A collaborative portal from Sage Bionetworks that allows scientists to share and analyze data.
- Filter by Type: Dataset to view only dataset results.
- Vivli: A global clinical research data sharing platform, allowing users to share, archive and request anonymized data from completed clinical trials.
- Zenodo: An open research data repository from CERN, the European Organization for Nuclear Research.
- Filter by Type: Dataset to view only dataset results.
The NIH Office of Data Science Strategy (ODSS) announced the Generalist Repository Ecosystem Initiative (GREI), which includes seven established generalist repositories that will work together to establish consistent metadata, develop use cases for data sharing, train and educate researchers on FAIR data and the importance of data sharing, and more. A series of recorded webinars is offered to learn about GREI and generalist repositories.
Data Journals
- Data Journals: Journals whose articles describe the dataset itself rather than the analysis of that data and its results.
- Some will also store the dataset.
- Others provide recommendations of where to store the data.
- Usually peer-reviewed.
- Examples of data publications from large scientific publishers:
- GigaScience: An open access, open data, open peer-review journal from Oxford University Press focusing on “big data” research from the life and biomedical sciences.
- Scientific Data: A peer-reviewed, open-access journal from Springer Nature that publishes descriptions of scientifically valuable datasets and research that advances the sharing and reuse of scientific data.
- Finding Data Journals:
- Data Journals: Gdańsk University of Technology Open Science Competence Center offers a list of 237 journals that publish data descriptors and were included in the list of journals prepared by the Polish Ministry of Education and Science (MEiN).
- Walters, William H. 2020. “Data Journals: Incentivizing Data Access and Documentation Within the Scholarly Communication System”. Insights 33 (1): 18. DOI: http://doi.org/10.1629/uksg.510. Provides a list of data journals.
Databases Linked to Datasets
- PubMed: Use the filter option “Article Attribute” > “Associated Data” to view only results with related data links. Data filters were originally added to PubMed and PubMed Central in 2018.
- Web of Science: When viewing search results in Web of Science (All Databases), choose the Associated Data option under Quick Filters to view only search results that mention a data set, data study, or data repository in the Data Citation Index. The Data Citation Index includes records on over 14 million research data sets, 1.6 million data studies, and 440 thousand software from over 440 international data repositories in the sciences, social sciences, and arts and humanities.
Issues to Consider with Datasets
Issues to consider when re-using datasets include:
- Judging the Quality of the Dataset: Not all datasets are reliable or high-quality. Try to answer questions like:
- Who is the author of the dataset? What is their institutional affiliation?
- Is there a peer-reviewed publication associated with the dataset?
- Licensing: Check any license restrictions for the data. Many repositories will list the type of license the data is covered by (usually Creative Commons or Open Data Commons licenses).
- Citation: Use the requested data citation format (some repositories like Dryad list the requested citation format). If the repository doesn’t include a requested citation format:
- Use the format defined by a style guide, like APA (See APA style manual examples for datasets).
- In EndNote, you can define a reference as a dataset. EndNote will then format the reference into the correct dataset citation format for the selected style.
- Learn more: NYU Libraries, Data Sources: How to Cite Data & Statistics
See the ELIXIR Research Data Management Kit (RDMkit) guide on Existing Data for additional considerations and resources when locating existing datasets for reuse.
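When a repository does not supply a requested citation, the pieces of a dataset citation can be assembled by hand or with a small helper. The sketch below produces an APA-like dataset reference (authors, year, title, optional version, a "[Data set]" label, publisher, and link). The function names and the example record are hypothetical, and the output should be checked against the current style manual before use, since details such as italics and version wording are not handled here.

```python
def join_authors(authors):
    """Join author names that are already formatted as 'Family, I.'."""
    if len(authors) == 1:
        return authors[0]
    return ", ".join(authors[:-1]) + ", & " + authors[-1]


def format_dataset_citation(authors, year, title, publisher, url, version=None):
    """Build an APA-like dataset reference as plain text.

    This is an approximation for illustration, not an implementation of
    the full APA rules.
    """
    version_text = f" (Version {version})" if version else ""
    return (
        f"{join_authors(authors)} ({year}). {title}{version_text} [Data set]. "
        f"{publisher}. {url}"
    )


# Hypothetical dataset: names, repository, and DOI are placeholders.
citation = format_dataset_citation(
    authors=["Example, J.", "Sample, A."],
    year=2020,
    title="Hypothetical survey responses",
    publisher="Example Repository",
    url="https://doi.org/10.0000/example",
    version=2,
)
print(citation)
```

Reference managers such as EndNote perform the same assembly from the stored fields, which is why entering the record with the dataset reference type matters.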
Data Standards and Common Data Elements (CDEs)
Data/metadata standards and CDEs can help to make data more FAIR (findable, accessible, interoperable, and re-usable – see FORCE11 The FAIR Data Principles).
- DCC Disciplinary Metadata: Collections of metadata standards organized by discipline.
- FAIRsharing.org: An online catalog that includes over 1,820 data and metadata standards.
- NIH CDE Repository: The NIH Common Data Elements (CDE) Repository provides access to structured human and machine-readable definitions of data elements that have been recommended or required by NIH Institutes and Centers and other organizations for use in research and for other purposes.
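To illustrate why machine-readable definitions of data elements help with interoperability, the sketch below models a single data element with a fixed set of permissible values and checks collected values against it. The class, the element, and its permissible values are all hypothetical and are not taken from the NIH CDE Repository; real CDEs carry considerably more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """A simplified, hypothetical data element definition."""

    name: str
    question_text: str
    permissible_values: tuple

    def is_valid(self, value):
        """Return True if the value is one of the permissible values."""
        return value in self.permissible_values


# Hypothetical element defined for this example only.
smoking_status = DataElement(
    name="smoking_status",
    question_text="What is the participant's smoking status?",
    permissible_values=("never", "former", "current", "unknown"),
)

# Made-up collected values; the second one uses a non-standard spelling.
collected = ["never", "ex-smoker", "current"]
invalid = [value for value in collected if not smoking_status.is_valid(value)]
print(invalid)
```

When two studies record a variable against the same shared definition, their values can be combined without a separate recoding step, which is the practical benefit that common data elements aim for.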