Atlas Methodology

How the Winnow Atlas research index is built, maintained, and updated. This document is intended for researchers and scientists evaluating the reliability of the data.

1. What Atlas Is

Winnow Atlas is an open, structured evidence map of the microplastics literature. It is built from peer-reviewed papers indexed in public academic databases, with the goal of making the field navigable for researchers, policymakers, and the public without simplifying the underlying science. Atlas does not produce original research; it organizes, summarizes, and relates existing published work.

The corpus currently contains 82,947 papers, of which 64,817 have plain-language summaries. These numbers update daily as new papers are ingested.

2. Data Sources

Atlas ingests papers from two primary sources:

  • OpenAlex

    api.openalex.org

    An open, freely accessible index of peer-reviewed academic literature maintained by OurResearch. Atlas queries OpenAlex daily via cursor-paginated API requests, collecting papers matched to microplastics-related search terms. OpenAlex provides the majority of the corpus (~80k papers); a minimal paging sketch follows the field list below.

  • PubMed / NCBI

    eutils.ncbi.nlm.nih.gov

    The National Library of Medicine's biomedical literature database, queried via the E-utilities API. It is particularly strong for health and clinical research and provides structured MeSH term data where available (~21k papers).

Fields collected per paper: title, abstract, authors, journal name, publication year, DOI, open access URLs (PubMed, PMC, PDF where available), and citation count. No full paper text is accessed or stored.
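
For illustration, here is a minimal Python sketch of the cursor-paginated OpenAlex harvest described above. The search term, page size, and selected fields are assumptions for the example; Atlas's actual query terms and stored fields are those listed above.

    import requests

    OPENALEX_WORKS = "https://api.openalex.org/works"

    def fetch_openalex(search_term="microplastics"):
        """Yield basic metadata for works matching the search term."""
        cursor = "*"  # OpenAlex cursor pagination starts at "*"
        while cursor:
            resp = requests.get(OPENALEX_WORKS, params={
                "search": search_term,
                "per-page": 200,   # maximum page size
                "cursor": cursor,
            }, timeout=30)
            resp.raise_for_status()
            data = resp.json()
            for work in data["results"]:
                yield {
                    "title": work.get("title"),
                    "doi": work.get("doi"),
                    "publication_year": work.get("publication_year"),
                    "cited_by_count": work.get("cited_by_count"),
                }
            cursor = data["meta"].get("next_cursor")  # None after the last page

Note that OpenAlex delivers abstracts as an inverted index (abstract_inverted_index), which has to be reassembled into plain text before storage; that step is omitted here.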

3. Relevance Filtering

Each paper is assigned a relevance_score between 0 and 100 based on the title and abstract content. Papers scoring below 30 are excluded from public browse and search results.

This filter addresses a specific problem: the word "microplasticity" is used in materials science to describe plastic deformation in metals — a phenomenon entirely unrelated to microplastics research. Without filtering, many materials science publications would appear in results. Papers below the threshold remain in the database but are not surfaced.

The relevance score is AI-assigned from the title and abstract and is recomputed weekly as the classifier improves.
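
In practice the browse-time filter reduces to a single threshold check. A minimal sketch, assuming a hypothetical record layout with a relevance_score field:

    RELEVANCE_THRESHOLD = 30  # papers scoring below this stay in the database but are hidden

    def publicly_visible(papers):
        """Keep only papers at or above the relevance threshold."""
        return [p for p in papers if p.get("relevance_score", 0) >= RELEVANCE_THRESHOLD]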

4. AI Classification

Each paper's abstract is processed by Claude (Anthropic) to assign a paper_type from a controlled vocabulary: original research, systematic review, meta-analysis, environmental study, review, commentary, letter, conference abstract, and others.

Classification is based on the abstract alone — Atlas does not access the full paper text. This means classification accuracy depends on how well the abstract describes the study design. AI classification has known error rates, particularly for papers with incomplete or atypical abstracts. Errors can be flagged via the feedback button on any paper page.
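
A minimal sketch of what an abstract-only classification call might look like, using the Anthropic Python SDK. The model name, prompt wording, and out-of-vocabulary fallback are illustrative assumptions, not Atlas's actual configuration.

    import anthropic

    PAPER_TYPES = [
        "original research", "systematic review", "meta-analysis",
        "environmental study", "review", "commentary", "letter",
        "conference abstract",
    ]

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def classify_paper_type(abstract: str) -> str:
        """Assign one label from the controlled vocabulary using the abstract alone."""
        prompt = (
            "Classify this paper into exactly one of the following types: "
            + ", ".join(PAPER_TYPES)
            + ". Reply with the type only.\n\nAbstract:\n" + abstract
        )
        response = client.messages.create(
            model="claude-3-5-haiku-latest",  # placeholder model choice
            max_tokens=20,
            messages=[{"role": "user", "content": prompt}],
        )
        label = response.content[0].text.strip().lower()
        return label if label in PAPER_TYPES else "other"  # fallback for unexpected replies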

5. Evidence Tiering

Papers are assigned an evidence tier based on study design, as inferred from the abstract:

Tier 1 — Systematic Reviews & Meta-Analyses

Studies that synthesize findings across many primary studies using a defined protocol. These represent the strongest level of evidence because they reduce the risk of individual-study bias. Currently 1,237 papers.

Tier 2 — Original Research

Experimental, observational, epidemiological, and case-control studies generating new primary evidence. The largest category in the corpus. Currently 81,431 papers.

Tier 3 — Commentary & Context

Commentaries, letters, editorials, and conference abstracts. Useful for understanding scientific debate and emerging thinking, but not primary evidence. Currently 279 papers.

Tier is AI-assigned from the abstract. Mis-tiered papers can be flagged via the feedback system.
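
The tiering can be pictured as a grouping over the paper_type vocabulary. The mapping below is an illustration of that grouping only; the actual assignment is AI-driven and abstract-based, and types not shown are handled by that assignment.

    TIER_BY_TYPE = {
        "systematic review": 1,   # Tier 1: evidence syntheses
        "meta-analysis": 1,
        "original research": 2,   # Tier 2: primary evidence
        "environmental study": 2,
        "commentary": 3,          # Tier 3: commentary and context
        "letter": 3,
        "conference abstract": 3,
    }

    def evidence_tier(paper_type: str):
        # Types missing from this illustrative table return None rather than guessing.
        return TIER_BY_TYPE.get(paper_type)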

6. AI Summarization

Claude generates a plain-language summary (shown as "Summary" on each paper page) from the abstract. The goal is to make findings legible to readers without domain expertise, without losing scientific meaning.

The summary is a restatement of the abstract — it does not interpret, extend, or editorialize beyond what the abstract states. If the abstract is ambiguous or limited, the summary will reflect those limits.
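
That constraint can be pictured as a prompt of roughly this shape; the wording is hypothetical, not the prompt Atlas actually uses.

    SUMMARY_PROMPT = (
        "Rewrite the following abstract in plain language for a non-specialist reader. "
        "Do not add interpretation, conclusions, or claims that are not in the abstract. "
        "If the abstract is ambiguous or limited, preserve that ambiguity.\n\n"
        "Abstract:\n{abstract}"
    )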

64,817 papers currently have summaries. Papers without a summary show the abstract alone; papers without an abstract (currently around 18,083) cannot be summarized and show no summary or abstract text at all.

7. Semantic Relationships

Each paper's title and summary are encoded into a vector embedding using VoyageAI. Embeddings represent the semantic content of a paper as a point in high-dimensional space — papers that are close in meaning have embeddings that are close in space.

The "More Papers Like This" section on each paper page uses cosine similarity between embeddings to surface papers that are semantically related. This means a paper on microplastics in human blood will surface other human biomonitoring studies even if they use entirely different terminology — a capability keyword matching cannot provide.

Papers without summaries (no abstract available) do not have embeddings and do not appear in "More Papers Like This" results.

8. Keyword Annotations

Titles and abstracts are scanned using rule-based keyword matching to extract structured metadata:

  • Polymers — specific polymer types mentioned (PET, polystyrene, polypropylene, etc.)
  • Body Systems — organ systems or physiological domains studied (gut, reproductive, cardiovascular, neurological, etc.)
  • Animal Models — model organisms used (rodent, zebrafish, human, etc.)
  • Study Type — research approach (in vitro, in vivo, epidemiological, computational, etc.)

These annotations are produced by keyword extraction, not AI inference. Coverage depends entirely on whether the exact keyword appears in the title or abstract. A paper studying microplastics in the reproductive system may not be annotated with "reproductive" if the abstract uses only clinical terminology. These annotations are used for filtering and faceted browse — they are indicative, not exhaustive.
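
A minimal illustration of this rule-based matching follows; the keyword lists are truncated examples, not Atlas's full vocabularies.

    import re

    POLYMER_KEYWORDS = ["PET", "polystyrene", "polypropylene", "polyethylene"]
    BODY_SYSTEM_KEYWORDS = ["gut", "reproductive", "cardiovascular", "neurological"]

    def annotate(text: str, keywords: list[str]) -> list[str]:
        """Return the keywords that appear as whole words in the text."""
        return [
            kw for kw in keywords
            if re.search(rf"\b{re.escape(kw)}\b", text, flags=re.IGNORECASE)
        ]

    # Usage: annotate(title + " " + abstract, POLYMER_KEYWORDS)
    # Case-insensitive matching of short acronyms such as "PET" can produce false
    # positives, which is one reason coverage is indicative rather than exhaustive.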

9. Count Definitions

Every number shown in Atlas has a precise definition:

  • Total Papers: All records in the Atlas corpus from OpenAlex and PubMed, regardless of classification, summary, or annotation status.
  • With Summaries: Papers for which Claude successfully generated a plain-language summary. Requires a non-empty abstract.
  • Tier 1 Evidence: Papers AI-classified as systematic reviews or meta-analyses.
  • Paper Categories: Number of distinct AI-assigned paper types with more than 100 papers in the corpus.
  • Topic count: Papers with this topic as their primary AI-assigned topic. A paper may appear under secondary topics but is counted once per primary assignment.
  • Polymer count: Papers where this polymer was identified by keyword matching in the title or abstract.
  • Citation count: Retrieved from OpenAlex, updated daily. May differ slightly from the publisher's or journal's own count.
  • Relevance score: AI-assigned 0–100 score estimating relevance to the microplastics field. Papers below 30 are filtered from public browse.

10. Known Limitations

  • 18,083 papers have no abstract. These papers cannot be summarized, keyword-annotated for body systems or polymers, semantically embedded, or meaningfully classified. They appear in search results only by title.
  • AI classification is abstract-only. Papers with incomplete or misleading abstracts may be mis-classified or mis-tiered. The error rate is not precisely quantified. Feedback helps correct visible errors.
  • Keyword annotation coverage is incomplete. A study may examine a polymer or body system without using Atlas's exact keyword. Annotations should be treated as indicative, not exhaustive.
  • Corpus skews toward English-language publications. Both OpenAlex and PubMed index multilingual literature, but coverage is stronger for English-language journals.
  • Daily sync lag. Papers published in the last ~24 hours may not yet appear. Citation counts may trail the publisher by a similar margin.
  • Relevance filtering may exclude boundary cases. Papers at the edge of the relevance threshold (score ~30) may be incorrectly excluded. If a paper you expect to find is missing, it may have been filtered as off-topic.

Questions about methodology or data quality? Contact Winnow.