Data Coverage

Not every paper in Atlas has every field populated. This page explains what's included, what's missing, and why — so you know exactly what you're searching, browsing, and filtering.

See also: Methodology

Corpus at a glance

  • Total papers: 111,495
  • Visible (relevance ≥ 30): 69,080
  • Excluded by relevance: 42,415

Papers

Atlas ingests every paper matching microplastics-related queries from OpenAlex and PubMed. Not all of them are shown to users.

Filter | Excluded | Why
Relevance < 30 | 42,415 | Materials science papers about "microplasticity" (plastic deformation in metals) and other off-topic matches. Kept in DB but hidden from browse and search.
Visibility: delisted | 0 | Manually or automatically delisted after review (duplicates, retractions, non-research items).

69,080 papers pass all filters and are visible on Atlas.
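As a rough illustration, the two visibility filters above can be read as a single predicate. This is a minimal sketch, not Atlas's actual code: the `Paper` record, its field names, and the sample titles are assumptions made for the example.

```python
from dataclasses import dataclass

RELEVANCE_THRESHOLD = 30  # papers scoring below this are hidden


@dataclass
class Paper:
    # Hypothetical record shape; the real Atlas schema is not documented here.
    title: str
    relevance: int
    delisted: bool = False


def is_visible(paper: Paper) -> bool:
    """Return True if the paper passes both visibility filters."""
    return paper.relevance >= RELEVANCE_THRESHOLD and not paper.delisted


# Made-up sample records for illustration only.
papers = [
    Paper("Microplastics in drinking water", relevance=85),
    Paper("Microplasticity in aluminium alloys", relevance=12),
    Paper("Duplicate record", relevance=70, delisted=True),
]
visible = [p.title for p in papers if is_visible(p)]
print(visible)
```

Hidden papers stay in the list (the database, in Atlas's case); the predicate only decides what browse and search show.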

Abstracts

An abstract is required for almost every downstream enrichment step. Papers without abstracts are functionally title-only records.

  • Have abstract: 97,294 (87.3%)
  • Missing abstract: 14,201 (12.7%)

Why abstracts are missing

  • Publisher withholds abstract from open metadata (common with Elsevier, Springer Nature paywalled content)
  • Book chapters and conference proceedings often lack structured abstracts
  • Editorials, letters, and news pieces have no formal abstract

Without an abstract, a paper cannot receive a summary, embedding, classification, or keyword annotation. It appears in keyword search and title browse only. We periodically scrape publisher pages via DOI to recover missing abstracts.
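The abstract percentages in this section follow directly from the counts. The short check below reproduces them; the helper name `coverage_pct` is made up for this example, and the figures are the ones reported on this page.

```python
def coverage_pct(count: int, total: int) -> float:
    """Share of `total` represented by `count`, as a percentage with one decimal."""
    return round(100 * count / total, 1)


TOTAL_PAPERS = 111_495
HAVE_ABSTRACT = 97_294
MISSING_ABSTRACT = TOTAL_PAPERS - HAVE_ABSTRACT  # 14,201

have_pct = coverage_pct(HAVE_ABSTRACT, TOTAL_PAPERS)
missing_pct = coverage_pct(MISSING_ABSTRACT, TOTAL_PAPERS)
print(have_pct, missing_pct)  # 87.3 12.7
```

The same calculation applies to the summary, embedding, and annotation figures, all of which use the full corpus as the denominator.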

Summaries

Each paper's abstract is rewritten into a plain-language summary by AI. A paper gets a summary only if it has an abstract.

  • Have summary: 79,684 (71.5%)
  • No summary: 31,811 (28.5%)

What's excluded

  • Papers without an abstract (no input text to summarize)
  • Low-relevance papers (score < 30) are deprioritized for summary generation

Of the 69,080 visible papers, 9,740 lack summaries because they have no abstract. All visible papers with abstracts have summaries.
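Note that the 71.5% figure is measured against the whole corpus, hidden papers included. Coverage among visible papers can be worked out from the two numbers in the paragraph above. The sketch below assumes both counts come from the same snapshot.

```python
VISIBLE = 69_080
VISIBLE_WITHOUT_ABSTRACT = 9_740  # visible papers that lack a summary

# Every visible paper with an abstract has a summary, so the remainder is covered.
visible_with_summary = VISIBLE - VISIBLE_WITHOUT_ABSTRACT
visible_summary_pct = round(100 * visible_with_summary / VISIBLE, 1)

print(visible_with_summary)  # 59340
print(visible_summary_pct)   # 85.9
```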

Embeddings (Semantic Search)

Each paper's title and summary are encoded into a 1024-dimension vector by VoyageAI. These vectors power "More Papers Like This" and semantic search.

  • Have embedding: 108,567 (97.4%)
  • No embedding: 2,928 (2.6%)

What's excluded

  • Papers without an abstract (no meaningful text to encode)

Papers without embeddings will not appear in semantic search results or "More Papers Like This" recommendations. They still appear in keyword and full-text search.
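To make the consequence concrete, here is a minimal sketch of a vector-based "More Papers Like This" lookup. It assumes cosine similarity as the ranking metric, which this page does not specify, and uses toy 3-dimension vectors in place of the real 1024-dimension ones. The function names and paper IDs are hypothetical.

```python
import math
from typing import Optional

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def more_like_this(
    query_id: str, embeddings: dict[str, Optional[Vector]], k: int = 2
) -> list[str]:
    """Return up to k paper IDs most similar to query_id, best match first."""
    query = embeddings[query_id]
    if query is None:
        return []  # no embedding, so no recommendations can be computed
    scored = [
        (cosine(query, vec), pid)
        for pid, vec in embeddings.items()
        if pid != query_id and vec is not None  # papers without embeddings are skipped
    ]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]


embeddings = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.9, 0.1, 0.0],
    "C": [0.0, 1.0, 0.0],
    "D": None,  # a paper with no embedding
}
print(more_like_this("A", embeddings))  # ['B', 'C']
print(more_like_this("D", embeddings))  # []
```

Paper "D" can neither be recommended nor serve as the starting point for recommendations, which mirrors the behavior described above.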

Rankings

Rankings apply additional quality filters beyond the base relevance threshold. These filters prevent non-research items from inflating institution, author, and country metrics.

Filter | What it removes
Relevance < 30 | Same base filter as paper browse: off-topic materials science papers
No recorded authors | Journal housekeeping records (table of contents, indexes) that OpenAlex sometimes classifies as research
> 50 co-authors | Conference proceedings, multi-consortium announcements, and bulk-indexed journal volumes. Including them would inflate every listed institution's count.
Housekeeping titles | Titles matching "Table of Contents", "Editorial Board", "Issue Information", "Front Matter", or "Contents List". These are not research, regardless of other metadata.

These filters are applied on top of the base relevance threshold. A paper can be visible in browse/search but excluded from rankings. See Rankings Methodology for full details.
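The ranking filters in the table above can be sketched as a single function that returns the first reason a paper is excluded. This is an illustrative reading, not the production implementation: the `Paper` fields are hypothetical, "> 50 co-authors" is interpreted as more than 50 listed authors, and title matching is assumed to be a case-insensitive substring match.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

RELEVANCE_THRESHOLD = 30
MAX_AUTHORS = 50
# Assumption: a title is "housekeeping" if it contains any of these phrases.
HOUSEKEEPING_TITLE = re.compile(
    r"table of contents|editorial board|issue information|front matter|contents list",
    re.IGNORECASE,
)


@dataclass
class Paper:
    # Hypothetical record shape for illustration.
    title: str
    relevance: int
    authors: list[str] = field(default_factory=list)


def ranking_exclusion(paper: Paper) -> Optional[str]:
    """Return the reason a paper is left out of rankings, or None if it counts."""
    if paper.relevance < RELEVANCE_THRESHOLD:
        return "relevance below 30"
    if not paper.authors:
        return "no recorded authors"
    if len(paper.authors) > MAX_AUTHORS:
        return "more than 50 authors"
    if HOUSEKEEPING_TITLE.search(paper.title):
        return "housekeeping title"
    return None


# Made-up records: the second is visible in browse (relevance 75) but not ranked.
research = Paper("Microplastics in river sediment", 90, ["A. Rossi", "B. Chen"])
housekeeping = Paper("Issue Information", 75, ["Editorial Office"])
print(ranking_exclusion(research))      # None
print(ranking_exclusion(housekeeping))  # housekeeping title
```

The second record shows the case described above: a paper that passes the relevance threshold and appears in browse and search, yet contributes nothing to institution, author, or country counts.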

Annotations

Papers are tagged with structured metadata (polymers, body systems, animal models, study type) using rule-based keyword matching against the title and abstract.

  • Annotated: 86,257 (77.4%)
  • Not annotated: 25,238 (22.6%)

What's excluded

  • Papers without an abstract (no text to scan for keywords)
  • Papers that use clinical terminology not in the keyword dictionary (e.g., "gonadal" instead of "reproductive")

Annotations are indicative, not exhaustive. A paper may study a polymer or body system without using Atlas's exact keywords. Filter results should be treated as a lower bound.
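A small sketch shows why keyword-based filters undercount. The dictionary below is a made-up two-entry stand-in for Atlas's real keyword lists, and whole-word matching is an assumption about how the rules work.

```python
import re

# Hypothetical miniature dictionary: tag -> keywords that trigger it.
BODY_SYSTEM_KEYWORDS = {
    "reproductive": ["reproductive", "fertility"],
    "respiratory": ["lung", "respiratory"],
}


def annotate(text: str, dictionary: dict[str, list[str]]) -> list[str]:
    """Return the sorted tags whose keywords appear as whole words in the text."""
    lowered = text.lower()
    return sorted(
        tag
        for tag, keywords in dictionary.items()
        if any(re.search(rf"\b{re.escape(kw)}\b", lowered) for kw in keywords)
    )


# Made-up titles: both concern the reproductive system, but only one says so.
matched = annotate("Polystyrene particles impair reproductive function in mice",
                   BODY_SYSTEM_KEYWORDS)
missed = annotate("Gonadal toxicity of polystyrene particles in mice",
                  BODY_SYSTEM_KEYWORDS)
print(matched)  # ['reproductive']
print(missed)   # []
```

The second title is about the same body system but receives no tag, so a "reproductive" filter would not return it.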

How it flows together

Each step depends on the one before it:

1. Paper ingested: title, DOI, year, authors
2. Abstract available? If not, the paper stops here (title-only)
3. AI classification: topic relevance, paper type, evidence tier
4. Summary generated: plain-language restatement of abstract
5. Embedding created: enables semantic search and "papers like this"
6. Keywords annotated: polymers, body systems, models, study type

A paper missing step 2 (abstract) will also be missing steps 3–6. This is the primary driver of incomplete coverage across all dimensions.
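The dependency chain can be modeled as a list of steps that halts at the first failure. The sketch below is illustrative only: the step names mirror the list above, but the check functions are placeholders (steps 3 to 6 simply succeed), and the paper records are plain dictionaries invented for the example.

```python
from typing import Callable

Paper = dict
Step = tuple[str, Callable[[Paper], bool]]

# Each step is (name, check). The checks are stand-ins for the real enrichment jobs.
PIPELINE: list[Step] = [
    ("ingested", lambda p: bool(p.get("title"))),
    ("abstract", lambda p: bool(p.get("abstract"))),
    ("classification", lambda p: True),
    ("summary", lambda p: True),
    ("embedding", lambda p: True),
    ("annotation", lambda p: True),
]


def completed_steps(paper: Paper) -> list[str]:
    """Run the steps in order and stop at the first one that fails."""
    done = []
    for name, check in PIPELINE:
        if not check(paper):
            break
        done.append(name)
    return done


title_only = {"title": "Microplastics in Arctic snow"}
full_record = {"title": "Microplastics in Arctic snow", "abstract": "We sampled ..."}
print(completed_steps(title_only))   # ['ingested']
print(completed_steps(full_record))  # all six step names
```

Because every later step sits behind the abstract check, recovering a missing abstract is what unlocks the rest of the enrichment for that paper.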

These numbers update as the corpus grows. For how each enrichment step works, see Methodology.