Finding Datasets, Data Repositories, and Data Standards

This online guide contains resources for finding data repositories for data preservation and access, and for locating datasets for reuse. The guide was developed as an online companion for the class Resources for Finding and Sharing Research Data. If you are NIH or HHS staff, please check the NIH Library training schedule for upcoming classes.

If you need a one-on-one or group consultation on locating data repositories and datasets, please contact the NIH Library.

Data Repositories

Resources to Locate Data Repositories

 

Resources for Data Sharing for Intramural NIH Researchers

 

Issues to Consider with Data Repositories

Issues to consider when finding a data repository to preserve and share data:

  • Required Repositories: Check the funder/publisher policies to see if there are required repositories where the data must be deposited.
  • Sensitive Data: Make sure you are not sharing sensitive data (such as personally identifiable information (PII) or protected health information (PHI)).
    • You may need to anonymize and/or aggregate the data before sharing, or access to the data may need to be limited to researchers with specific permissions.
  • Intellectual Property: Be aware of who owns the intellectual property and whether there are any licensing restrictions.
  • Required Data Standards: Be aware of the data standards (such as metadata and data formats) required for depositing the data in the repository.
  • Deposit and Storage Costs: Be aware of any costs associated with depositing/storing the data.

Find additional guidance at Sharing.nih.gov for Selecting a Data Repository.
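
As a rough illustration of the "anonymize and/or aggregate" step mentioned under Sensitive Data, the sketch below drops direct identifiers and reports only counts per age band. The column names, the 10-year band width, and the minimum cell size are hypothetical choices for this example, not NIH requirements, and removing a few columns is not by itself sufficient de-identification.

```python
from collections import Counter

# Hypothetical column names; a real dataset will differ.
DIRECT_IDENTIFIERS = {"name", "mrn", "email"}


def drop_identifiers(records):
    """Return copies of the records without direct-identifier fields."""
    return [
        {key: value for key, value in row.items() if key not in DIRECT_IDENTIFIERS}
        for row in records
    ]


def age_band(age, width=10):
    """Map an exact age to a coarse band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def aggregate_by_age_band(records, min_cell_size=2):
    """Count records per age band and suppress bands below min_cell_size."""
    counts = Counter(age_band(row["age"]) for row in records)
    return {band: n for band, n in counts.items() if n >= min_cell_size}


# Made-up records for illustration only.
records = [
    {"name": "A", "mrn": "001", "age": 34, "diagnosis": "X"},
    {"name": "B", "mrn": "002", "age": 37, "diagnosis": "Y"},
    {"name": "C", "mrn": "003", "age": 52, "diagnosis": "X"},
]

deidentified = drop_identifiers(records)
summary = aggregate_by_age_band(records)
```

Here `summary` keeps only the 30-39 band, because the single record in the 50-59 band falls below the assumed minimum cell size. Whether such a threshold is appropriate depends on the repository and the applicable policy.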

 

Datasets

Searching Across Data Repositories

  • Google Dataset Search: Search by keyword to locate datasets across the web. Filter by date last updated, download format, whether the dataset is in Croissant format, usage rights, topic, provider, and whether the dataset is free to access.
    • According to the Google AI Blog post An Analysis of Online Datasets Using Dataset Search (August 25, 2020), Google Dataset Search:
      • Indexes datasets using the metadata descriptions that come directly from the dataset web pages using schema.org structure.
      • Contains more than 31 million datasets from more than 4,600 internet domains.
      • About half of these datasets come from .com domains, but .org and governmental domains are also well represented.
    • Dataset results are now also listed in general Google search results, according to a February 2023 blog post.
  • Mendeley Data: A data index and open research data repository from publisher Elsevier where users can search across research data from 2000+ generalist and domain-specific repositories.
    • Filter results by date range, data type, source type (article or data repository), source, and funder.
  • NLM Dataset Catalog (beta): Search across over 80,900 biomedical datasets from various repositories.
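
Because Google Dataset Search relies on schema.org metadata found on dataset web pages, it can help to see what such a description looks like. The following is a minimal sketch that builds a schema.org `Dataset` record as JSON-LD; the helper name `dataset_jsonld` and every value passed to it are placeholders for illustration, and a real landing page would usually carry more properties.

```python
import json


def dataset_jsonld(name, description, url, license_url, keywords):
    """Build a minimal schema.org Dataset description as a JSON-LD dict."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "keywords": keywords,
    }


# All values below are placeholders for illustration.
metadata = dataset_jsonld(
    name="Example survey responses",
    description="Placeholder description of a hypothetical dataset.",
    url="https://example.org/datasets/survey",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    keywords=["survey", "example"],
)

# Markup like this is typically embedded in the dataset's landing page
# inside a <script type="application/ld+json"> element.
jsonld_text = json.dumps(metadata, indent=2)
```

Many repositories generate this markup automatically for deposited datasets, so depositors often do not need to write it by hand.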

 

Generalist Repositories

Here’s a closer look at a few major cross-disciplinary repositories highlighted on the NIH Data Sharing Resources: Generalist Repositories page. 

  • Dataverse: An open-source web application to share, preserve, cite, explore, and analyze research data. 
  • Dryad: A nonprofit repository for data underlying the international scientific and medical literature.
    • Browse or search and filter datasets by funder, subject, journal, or institution.
  • Figshare: A cross-disciplinary repository where users and institutions can upload datasets, supported by the technology company Digital Science.
    • Filter by Item Type: Dataset.
  • IEEE DataPort: A global research data platform backed by IEEE that allows users to upload datasets and access thousands of scientific datasets.
    • Search across over 7,000 datasets.
  • Open Science Framework (OSF): A free, open platform to support research and enable collaboration, including discovery of projects, data, materials, and collaborators.
  • Synapse: A collaborative portal from Sage Bionetworks that allows scientists to share and analyze data.
    • Filter by Type: Dataset to view only dataset results.
  • Vivli: A global clinical research data sharing platform, allowing users to share, archive and request anonymized data from completed clinical trials.
  • Zenodo: An open research data repository from CERN, the European Organization for Nuclear Research.
    • Filter by Type: Dataset to view only dataset results.

The NIH Office of Data Science Strategy (ODSS) announced the Generalist Repository Ecosystem Initiative (GREI), which includes seven established generalist repositories. These repositories will work together to establish consistent metadata, develop use cases for data sharing, train and educate researchers on FAIR data and the importance of data sharing, and more. A series of recorded webinars is available for learning about GREI and generalist repositories.

 

Data Journals

  • Data Journals: Journals whose articles describe the data itself rather than the analysis of that data and its results.
    • Some will also store the dataset.
    • Others provide recommendations of where to store the data.
    • Usually peer-reviewed.
  • Examples of data publications from large scientific publishers:
    • GigaScience: An open access, open data, open peer-review journal from Oxford University Press focusing on “big data” research from the life and biomedical sciences.
    • Scientific Data: A peer-reviewed, open-access journal from Springer Nature that publishes descriptions of scientifically valuable datasets and research that advances the sharing and reuse of scientific data.
  • Finding Data Journals: 

 

Databases Linked to Datasets

  • PubMed: Use the filter option “Article Attribute” > “Associated Data” to view only results with related data links. Data filters were originally added to PubMed and PubMed Central in 2018.
  • Web of Science: When viewing search results in Web of Science (All Databases), choose the Associated Data option under Quick Filters to view only search results that mention a data set, data study, or data repository in the Data Citation Index. The Data Citation Index includes records on over 14 million research data sets, 1.6 million data studies, and 440 thousand software items from over 440 international data repositories in the sciences, social sciences, and arts and humanities.
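
For scripted searches, the PubMed “Associated Data” filter can be approximated through the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and does not call the service. It assumes that the search tag `data[filter]` corresponds to the Associated Data filter, which should be confirmed against the current PubMed help before relying on it; the function name and the example topic are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def associated_data_search_url(topic, max_results=20):
    """Build an ESearch URL for PubMed records on a topic that have data links.

    Assumes "data[filter]" matches the Associated Data filter in PubMed.
    """
    term = f"({topic}) AND data[filter]"
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = associated_data_search_url("asthma")
```

The resulting URL can be opened in a browser or fetched with any HTTP client; NCBI asks that automated use follow its E-utilities usage guidelines.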

 

Issues to Consider with Datasets

Issues to consider when re-using datasets include:

  • Judging the Quality of the Dataset: Not all datasets are reliable or high-quality. Try to answer questions like:
    • Who is the author of the dataset? What is their institutional affiliation?
    • Is there a peer-reviewed publication associated with the dataset?
  • Licensing: Check any license restrictions for the data. Many repositories will list the type of license the data is covered by (usually Creative Commons or Open Data Commons licenses).
  • Citation: Use the requested data citation format (some repositories like Dryad list the requested citation format). If the repository doesn’t include a requested citation format:

See the ELIXIR Research Data Management Kit (RDMkit) guide on Existing Data for additional considerations and resources when locating existing datasets for reuse.

 

Data Standards and Common Data Elements (CDEs)

Data/metadata standards and CDEs can help make data more FAIR (findable, accessible, interoperable, and reusable; see FORCE11 The FAIR Data Principles).

  • DCC Disciplinary Metadata: Collections of metadata standards organized by discipline.
  • FAIRsharing.org: An online catalog that includes over 1,820 data and metadata standards.
  • NIH CDE Repository: The NIH Common Data Elements (CDE) Repository provides access to structured human- and machine-readable definitions of data elements that have been recommended or required by NIH Institutes and Centers and other organizations for use in research and for other purposes.