Atlas Methodology

How the Winnow Atlas research index is built, maintained, and updated. This document is intended for researchers and scientists evaluating the reliability of the data.

1. What Atlas Is

Winnow Atlas is an open structured evidence map of the microplastics literature. It is built from peer-reviewed papers indexed in public academic databases, with the goal of making the field navigable for researchers, policymakers, and the public without simplifying the underlying science. Atlas does not produce original research — it organizes, summarizes, and relates existing published work.

The corpus currently contains 111,495 papers, of which 79,684 have plain-language summaries. These numbers update weekly as new papers are ingested.

2. Data Sources

Atlas ingests papers from two primary sources:

  • OpenAlex

    api.openalex.org

    An open, freely accessible index of peer-reviewed academic literature maintained by OurResearch. Atlas queries OpenAlex weekly via cursor-paginated API requests, collecting papers matched to microplastics-related search terms. Provides the majority of the corpus (~80k papers).

  • PubMed / NCBI

    eutils.ncbi.nlm.nih.gov

    The National Library of Medicine's biomedical literature database, queried via the E-utilities API. Particularly strong for health and clinical research. Provides structured MeSH term data where available (~21k papers).

Fields collected per paper: title, abstract, authors, journal name, publication year, DOI, open access URLs (PubMed, PMC, PDF where available), and citation count. No full paper text is accessed or stored.
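The weekly OpenAlex sync relies on the cursor-pagination pattern mentioned above: request a page, read the next cursor from the response, repeat until the cursor runs out. A minimal sketch of that loop, with the HTTP call abstracted behind a `get_page` callable (the function name and response handling are illustrative, not Atlas's actual code):

```python
def paginate(get_page):
    """Drain a cursor-paginated endpoint: start at cursor '*' and follow
    meta.next_cursor until the response carries no further cursor -- the
    deep-paging pattern used by the OpenAlex API."""
    cursor, results = "*", []
    while cursor:
        page = get_page(cursor)  # e.g. GET api.openalex.org/works?cursor=<cursor>
        results.extend(page["results"])
        cursor = page["meta"].get("next_cursor")
    return results
```

In production, `get_page` would issue the HTTP request with the search terms and a `per-page` size; the loop structure is the same.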

3. Relevance Filtering

Each paper is classified as microplastics-relevant or not based on its title and abstract. Papers that are not relevant to microplastics research are excluded from public browse and search results.

This filter addresses a specific problem: the word "microplasticity" is used in materials science to describe plastic deformation in metals — a phenomenon entirely unrelated to microplastics research. Without filtering, many materials science publications would appear in results. Non-relevant papers remain in the database but are not surfaced.

Examples of excluded papers:

  • "Microplasticity and macroplasticity in copper single crystals" — materials science study on metal deformation, not microplastic particles.
  • "Strain gradient plasticity at the micron scale" — metallurgy research using "microplastic" to describe material behavior under stress.

Currently 42,415 papers in the corpus are classified as non-relevant and excluded from search results.
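The "microplasticity" ambiguity can be illustrated with a toy heuristic. This is not Atlas's production classifier — the term lists and rules below are invented for illustration — but it shows why a naive keyword match fails and why title/abstract context matters:

```python
import re

# Illustrative metallurgy vocabulary; the real filter is more sophisticated.
MATERIALS_TERMS = {"microplasticity", "macroplasticity", "strain gradient",
                   "dislocation", "single crystal"}

def looks_relevant(title, abstract):
    """Toy relevance heuristic: require the standalone token
    'microplastic(s)' and reject texts dominated by materials-science
    vocabulary. Note the word boundary: 'microplasticity' does NOT match."""
    text = f"{title} {abstract}".lower()
    has_particle_term = re.search(r"\bmicroplastics?\b", text) is not None
    has_metal_term = any(t in text for t in MATERIALS_TERMS)
    return has_particle_term and not has_metal_term
```

The excluded examples above would fail this check: "microplasticity" never matches the bounded token, and the metallurgy terms trip the rejection list.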

4. AI Classification

Each paper's abstract is processed by AI to assign a paper_type from a controlled vocabulary: original research, systematic review, meta-analysis, environmental study, review, commentary, letter, conference abstract, and others.

Classification is based on the abstract alone — Atlas does not access the full paper text. This means classification accuracy depends on how well the abstract describes the study design. AI classification has known error rates, particularly for papers with incomplete or atypical abstracts. Errors can be flagged via the feedback button on any paper page.
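One consequence of using a controlled vocabulary is that model output must be validated against it. A sketch of that guard (the fallback to manual review is an assumption about Atlas's behavior, not documented fact):

```python
# Controlled vocabulary from the methodology above ("and others" omitted).
PAPER_TYPES = {"original research", "systematic review", "meta-analysis",
               "environmental study", "review", "commentary", "letter",
               "conference abstract"}

def validate_paper_type(raw_label):
    """Constrain a model's free-text label to the controlled vocabulary.
    Unknown labels return None here -- assumed to be routed to review."""
    label = raw_label.strip().lower()
    return label if label in PAPER_TYPES else None
```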

5. Evidence Tiering

Papers are assigned an evidence tier based on study design, as inferred from the abstract:

Tier 1 — Systematic Reviews & Meta-Analyses

Studies that synthesize findings across many primary studies using a defined protocol. These represent the strongest level of evidence because they reduce the risk of individual-study bias. Currently 1,300 papers.

Tier 2 — Original Research

Experimental, observational, epidemiological, and case-control studies generating new primary evidence. The largest category in the corpus. Currently 83,598 papers.

Tier 3 — Commentary & Context

Commentaries, letters, editorials, and conference abstracts. Useful for understanding scientific debate and emerging thinking, but not primary evidence. Currently 288 papers.

Tier is AI-assigned from the abstract. Mis-tiered papers can be flagged via the feedback system.
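The tier assignment can be read as a mapping from paper_type to tier. A sketch following the tier descriptions above (the handling of types not listed there, such as plain "review", is not documented, so this returns None for them):

```python
def evidence_tier(paper_type):
    """Map an AI-assigned paper_type to an evidence tier, per the tier
    definitions above. Unlisted types (e.g. narrative 'review') return
    None -- their real placement is an assumption left open here."""
    tier1 = {"systematic review", "meta-analysis"}          # evidence syntheses
    tier2 = {"original research", "environmental study"}    # new primary evidence
    tier3 = {"commentary", "letter", "conference abstract"} # context, not evidence
    t = paper_type.strip().lower()
    if t in tier1:
        return 1
    if t in tier2:
        return 2
    if t in tier3:
        return 3
    return None
```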

6. AI Summarization

AI generates a plain-language summary (shown as "Summary" on each paper page) from the abstract. The goal is to make findings legible to readers without domain expertise, without losing scientific meaning.

The summary is a restatement of the abstract — it does not interpret, extend, or editorialize beyond what the abstract states. If the abstract is ambiguous or limited, the summary will reflect those limits.

79,684 papers currently have summaries. Papers with an abstract but no summary display the abstract alone; papers without an abstract — currently 14,201 — cannot be summarized and show no summary text at all.

Why 14,201 papers have no abstract

Atlas collects metadata from open academic databases (OpenAlex and PubMed), not from publishers directly. Many publishers do not include abstracts in the metadata they share with these indexes — particularly for paywalled content, book chapters, and editorial pieces. The top sources of missing abstracts are:

  • Elsevier (paywalled journals) — abstracts withheld from open metadata
  • Springer Nature (book chapters and some journal articles) — metadata often omits abstracts
  • SSRN (preprints) — abstracts frequently absent from metadata feeds
  • Nature (editorials, letters, and news pieces) — no structured abstract

Without an abstract, a paper cannot be summarized, classified, or annotated for polymers or body systems. These papers are still embedded by title and appear in keyword search and browse. We periodically re-check sources for newly available abstracts and backfill when possible.
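The periodic re-check and backfill described above reduces to a simple loop: re-fetch metadata for papers that previously lacked an abstract and store any that have since appeared. A sketch, with the fetch and store operations abstracted as callables (names and signatures are illustrative):

```python
def backfill_abstracts(missing_ids, fetch_metadata, store):
    """Re-query the source for each paper that previously lacked an
    abstract; persist any abstract that has since become available.
    Returns the number of papers backfilled."""
    filled = 0
    for paper_id in missing_ids:
        meta = fetch_metadata(paper_id)   # e.g. an OpenAlex /works lookup
        abstract = meta.get("abstract")
        if abstract:
            store(paper_id, abstract)
            filled += 1
    return filled
```

Once an abstract is backfilled, the downstream steps (summarization, classification, annotation, re-embedding) can run for that paper.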

7. Semantic Relationships

Each paper's title and summary are encoded into a vector embedding using VoyageAI. Embeddings represent the semantic content of a paper as a point in high-dimensional space — papers that are close in meaning have embeddings that are close in space.

The "More Papers Like This" section on each paper page uses cosine similarity between embeddings to surface papers that are semantically related. This means a paper on microplastics in human blood will surface other human biomonitoring studies even if they use entirely different terminology — a capability keyword matching cannot provide.

Papers without an abstract are embedded using only their title, which produces lower-quality vectors. Papers with a summary use the richer title + summary text for more accurate semantic matching.
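The similarity ranking behind "More Papers Like This" is plain cosine similarity over the embedding vectors. A minimal sketch with toy 2-dimensional vectors (Atlas's real vectors are 1024-dimensional, and production systems typically use a vector index rather than a full scan):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product normalized by vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar(query_vec, corpus, k=5):
    """Rank papers in `corpus` (id -> embedding) by similarity to the
    query embedding and return the top-k paper ids."""
    ranked = sorted(corpus.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [paper_id for paper_id, _ in ranked[:k]]
```

Because the comparison happens in embedding space, two papers never sharing a keyword can still rank as close neighbors — the property the section above describes.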

8. Keyword Annotations

Titles and abstracts are scanned using rule-based keyword matching to extract structured metadata:

  • Polymers — specific polymer types mentioned (PET, polystyrene, polypropylene, etc.)
  • Body Systems — organ systems or physiological domains studied (gut, reproductive, cardiovascular, neurological, etc.)
  • Animal Models — model organisms used (rodent, zebrafish, human, etc.)
  • Study Type — research approach (in vitro, in vivo, epidemiological, computational, etc.)

These annotations are produced by keyword extraction, not AI inference. Coverage depends entirely on whether the exact keyword appears in the title or abstract. A paper studying microplastics in the reproductive system may not be annotated with "reproductive" if the abstract uses only clinical terminology. These annotations are used for filtering and faceted browse — they are indicative, not exhaustive.
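The rule-based matching above amounts to checking whether any keyword from a category's list literally appears in the text. A sketch with an invented keyword map (the real lists are not published; this subset is illustrative):

```python
# Illustrative subset; Atlas's actual keyword lists are assumptions here.
BODY_SYSTEM_KEYWORDS = {
    "gut": ["gut", "intestinal", "gastrointestinal"],
    "reproductive": ["reproductive", "ovarian", "testicular"],
    "cardiovascular": ["cardiovascular", "cardiac", "heart"],
}

def annotate(title, abstract, keyword_map):
    """Rule-based extraction: a label applies only if one of its
    keywords appears verbatim in the title or abstract."""
    text = f"{title} {abstract}".lower()
    return sorted(label for label, keywords in keyword_map.items()
                  if any(kw in text for kw in keywords))
```

This also demonstrates the coverage limitation: an abstract using only clinical terminology (say, "uterine") would miss the "reproductive" annotation unless that term is in the list.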

9. Count Definitions

Every number shown in Atlas has a precise definition:

  • Microplastic Research Papers: Papers classified as relevant to microplastics research, with active visibility status. Non-relevant papers (e.g. materials science "microplasticity") are excluded.
  • Total Indexed: All records in the Atlas corpus from OpenAlex and PubMed, regardless of relevance classification.
  • AI Summarized: Percentage of microplastic-relevant papers with an AI-generated plain-language summary. 14% of papers lack an abstract from the source and cannot be summarized.
  • Enriched & Annotated: Percentage of microplastic-relevant papers enriched with full metadata from OpenAlex, including authors, institutional affiliations, AI classification, and evidence tier assignment.
  • Semantic Embeddings: Percentage of microplastic-relevant papers with vector embeddings (VoyageAI, 1024 dimensions). Powers semantic search, related-paper discovery, and similarity features.
  • Topic count: Papers with this topic as their primary AI-assigned topic. A paper may appear under secondary topics but is counted once per primary assignment.
  • Polymer count: Papers where this polymer was identified by keyword matching in the title or abstract.
  • Citation count: Retrieved from OpenAlex, updated weekly. May differ slightly from the publisher's or journal's own count.

10. Known Limitations

  • 14,201 papers have no abstract. These papers cannot be summarized, keyword-annotated for body systems or polymers, or meaningfully classified. They are embedded by title only, producing lower-quality semantic vectors.
  • AI classification is abstract-only. Papers with incomplete or misleading abstracts may be mis-classified or mis-tiered. The error rate is not precisely quantified. Feedback helps correct visible errors.
  • Keyword annotation coverage is incomplete. A study may examine a polymer or body system without using Atlas's exact keyword. Annotations should be treated as indicative, not exhaustive.
  • Corpus skews toward English-language publications. Both OpenAlex and PubMed index multilingual literature, but coverage is stronger for English-language journals.
  • Weekly sync lag. Papers published in the last ~7 days may not yet appear. Citation counts may trail the publisher by a similar margin.
  • Relevance filtering may exclude boundary cases. Papers at the boundary of microplastics vs. materials science may be incorrectly excluded. If a paper you expect to find is missing, it may have been filtered as off-topic.

For details on what's included and excluded at each step, see Data Coverage. Questions about methodology or data quality? Contact Winnow.