
Arxiv — Search arXiv papers by keyword, author, category, or ID

Source: Bundled (installed by default)
Path: skills/research/arxiv
Version: 1.0.0
Author: Hermes Agent
License: MIT
Tags: Research, Arxiv, Papers, Academic, Science, API
Related skills: ocr-and-documents

The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.

Search and retrieve academic papers from arXiv via their free REST API. No API key, no dependencies — just curl.

| Action | Command |
|---|---|
| Search papers | curl "https://export.arxiv.org/api/query?search_query=all:QUERY&max_results=5" |
| Get specific paper | curl "https://export.arxiv.org/api/query?id_list=2402.03300" |
| Read abstract (web) | web_extract(urls=["https://arxiv.org/abs/2402.03300"]) |
| Read full paper (PDF) | web_extract(urls=["https://arxiv.org/pdf/2402.03300"]) |

The API returns Atom XML. Parse with grep/sed or pipe through python3 for clean output.

curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5"

Clean output (parse XML to readable format)

curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5&sortBy=submittedDate&sortOrder=descending" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom'}
root = ET.parse(sys.stdin).getroot()
for i, entry in enumerate(root.findall('a:entry', ns)):
    # Titles can wrap across lines in the feed; flatten them to one line
    title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
    # The <id> element is a URL; keep only the part after /abs/
    arxiv_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
    published = entry.find('a:published', ns).text[:10]
    authors = ', '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
    summary = entry.find('a:summary', ns).text.strip()[:200]
    cats = ', '.join(c.get('term') for c in entry.findall('a:category', ns))
    print(f'{i+1}. [{arxiv_id}] {title}')
    print(f'   Authors: {authors}')
    print(f'   Published: {published} | Categories: {cats}')
    print(f'   Abstract: {summary}...')
    print(f'   PDF: https://arxiv.org/pdf/{arxiv_id}')
    print()
"

| Prefix | Searches | Example |
|---|---|---|
| all: | All fields | all:transformer+attention |
| ti: | Title | ti:large+language+models |
| au: | Author | au:vaswani |
| abs: | Abstract | abs:reinforcement+learning |
| cat: | Category | cat:cs.AI |
| co: | Comment | co:accepted+NeurIPS |

# AND (default when using +)
search_query=all:transformer+attention
# OR
search_query=all:GPT+OR+all:BERT
# AND NOT
search_query=all:language+model+ANDNOT+all:vision
# Exact phrase
search_query=ti:"chain+of+thought"
# Combined
search_query=au:hinton+AND+cat:cs.LG
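
The query expressions above are written by hand with `+` standing in for spaces. As a minimal sketch, the same URLs can be built programmatically so that spaces and phrase quotes are percent-encoded correctly. The function name `build_search_url` and its parameters are illustrative assumptions, not part of the skill or of the arXiv API.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def build_search_url(query, start=0, max_results=10, sort_by=None, sort_order=None):
    """Return an arXiv API URL for a search_query expression.

    `query` uses the field prefixes and Boolean operators shown above,
    written with plain spaces, e.g. 'au:hinton AND cat:cs.LG'.
    (Hypothetical helper, for illustration only.)
    """
    params = {"search_query": query, "start": start, "max_results": max_results}
    if sort_by:
        params["sortBy"] = sort_by
    if sort_order:
        params["sortOrder"] = sort_order
    # urlencode turns spaces into '+' and double quotes into %22;
    # safe=':' keeps the field prefixes (ti:, au:, ...) readable.
    return f"{ARXIV_API}?{urlencode(params, safe=':')}"


if __name__ == "__main__":
    print(build_search_url('ti:"chain of thought" AND cat:cs.CL', max_results=5))
```

The printed URL can be passed to `curl -s` in double quotes, since the phrase quotes are already encoded as `%22`.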

| Parameter | Options |
|---|---|
| sortBy | relevance, lastUpdatedDate, submittedDate |
| sortOrder | ascending, descending |
| start | Result offset (0-based) |
| max_results | Number of results (default 10; max 30000, served in slices of at most 2000 per request) |

# Latest 10 papers in cs.AI
curl -s "https://export.arxiv.org/api/query?search_query=cat:cs.AI&sortBy=submittedDate&sortOrder=descending&max_results=10"
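
To walk through more results than one request returns, `start` and `max_results` can be advanced page by page. The sketch below is one possible approach: `page_urls` and `fetch_all` are hypothetical helper names, the page size of 100 is an arbitrary choice, and the 3-second pause follows the arXiv rate guidance given later in this document.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

ARXIV_API = "https://export.arxiv.org/api/query"


def page_urls(query, total, page_size=100):
    """Yield one query URL per page, advancing the 0-based `start` offset."""
    for start in range(0, total, page_size):
        params = {
            "search_query": query,
            "start": start,
            # The last page may be shorter than page_size
            "max_results": min(page_size, total - start),
            "sortBy": "submittedDate",
            "sortOrder": "descending",
        }
        yield f"{ARXIV_API}?{urlencode(params, safe=':')}"


def fetch_all(query, total, page_size=100, delay=3.0):
    """Fetch every page as raw Atom XML bytes, pausing `delay` seconds between requests."""
    pages = []
    for i, url in enumerate(page_urls(query, total, page_size)):
        if i:
            time.sleep(delay)  # stay near 1 request per 3 seconds
        with urlopen(url, timeout=30) as resp:
            pages.append(resp.read())
    return pages


if __name__ == "__main__":
    for url in page_urls("cat:cs.AI", total=250):
        print(url)
```

Running the file only prints the page URLs; `fetch_all` performs network requests and is left for the caller to invoke.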
# By arXiv ID
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300"
# Multiple papers
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300,2401.12345,2403.00001"

After fetching metadata for a paper, generate a BibTeX entry:

curl -s "https://export.arxiv.org/api/query?id_list=1706.03762" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
root = ET.parse(sys.stdin).getroot()
entry = root.find('a:entry', ns)
if entry is None: sys.exit('Paper not found')
title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
authors = ' and '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
year = entry.find('a:published', ns).text[:4]
raw_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
cat = entry.find('arxiv:primary_category', ns)
primary = cat.get('term') if cat is not None else 'cs.LG'
last_name = entry.find('a:author', ns).find('a:name', ns).text.split()[-1]
print(f'@article{{{last_name}{year}_{raw_id.replace(\".\", \"\")},')
print(f' title = {{{title}}},')
print(f' author = {{{authors}}},')
print(f' year = {{{year}}},')
print(f' eprint = {{{raw_id}}},')
print(f' archivePrefix = {{arXiv}},')
print(f' primaryClass = {{{primary}}},')
print(f' url = {{https://arxiv.org/abs/{raw_id}}}')
print('}')
"

After finding a paper, read it:

# Abstract page (fast, metadata + abstract)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])
# Full paper (PDF → markdown via Firecrawl)
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

For local PDF processing, see the ocr-and-documents skill.

| Category | Field |
|---|---|
| cs.AI | Artificial Intelligence |
| cs.CL | Computation and Language (NLP) |
| cs.CV | Computer Vision |
| cs.LG | Machine Learning |
| cs.CR | Cryptography and Security |
| stat.ML | Machine Learning (Statistics) |
| math.OC | Optimization and Control |
| physics.comp-ph | Computational Physics |

Full list: https://arxiv.org/category_taxonomy

The scripts/search_arxiv.py script handles XML parsing and provides clean output:

python scripts/search_arxiv.py "GRPO reinforcement learning"
python scripts/search_arxiv.py "transformer attention" --max 10 --sort date
python scripts/search_arxiv.py --author "Yann LeCun" --max 5
python scripts/search_arxiv.py --category cs.AI --sort date
python scripts/search_arxiv.py --id 2402.03300
python scripts/search_arxiv.py --id 2402.03300,2401.12345

No dependencies — uses only Python stdlib.


Semantic Scholar (Citations, Related Papers, Author Profiles)

arXiv doesn’t provide citation data or recommendations. Use the Semantic Scholar API for that — free, no key needed for basic use (1 req/sec), returns JSON.

# By arXiv ID
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300?fields=title,authors,citationCount,referenceCount,influentialCitationCount,year,abstract" | python3 -m json.tool
# By Semantic Scholar paper ID or DOI
curl -s "https://api.semanticscholar.org/graph/v1/paper/DOI:10.1234/example?fields=title,citationCount"
# Get citations OF a paper (papers that cite it)
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/citations?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool

Get references FROM a paper (what it cites)

curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/references?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool
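
The citations and references endpoints return JSON that can be post-processed once saved. The following is a minimal sketch that assumes the response shape `{"data": [{"citingPaper": {...}}, ...]}` for citations and `{"data": [{"citedPaper": {...}}, ...]}` for references; that nesting is an assumption to verify against a live response. The helper names and the sample payload are made up for illustration.

```python
import json


def unwrap_papers(payload, key):
    """Return the paper dicts nested under `key` in each item of payload['data'].

    `key` is assumed to be 'citingPaper' for /citations and
    'citedPaper' for /references.
    """
    return [item[key] for item in payload.get("data", []) if item.get(key)]


def top_cited(papers, n=5):
    """Sort papers by citationCount (missing or null counts as 0), highest first."""
    return sorted(papers, key=lambda p: p.get("citationCount") or 0, reverse=True)[:n]


if __name__ == "__main__":
    # Placeholder payload, not real Semantic Scholar data
    sample = json.loads("""
    {"data": [
        {"citingPaper": {"title": "Paper A", "year": 2024, "citationCount": 3}},
        {"citingPaper": {"title": "Paper B", "year": 2025, "citationCount": 12}},
        {"citingPaper": {"title": "Paper C", "year": 2025, "citationCount": null}}
    ]}
    """)
    for paper in top_cited(unwrap_papers(sample, "citingPaper")):
        print(paper["citationCount"], paper["title"])
```

With a real response, replace `sample` with `json.load(sys.stdin)` and pipe the curl output into the script.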

Search papers (alternative to arXiv search, returns JSON)

curl -s "https://api.semanticscholar.org/graph/v1/paper/search?query=GRPO+reinforcement+learning&limit=5&fields=title,authors,year,citationCount,externalIds" | python3 -m json.tool
# Get recommendations seeded from a paper
curl -s -X POST "https://api.semanticscholar.org/recommendations/v1/papers/" \
-H "Content-Type: application/json" \
-d '{"positivePaperIds": ["arXiv:2402.03300"], "negativePaperIds": []}' | python3 -m json.tool
# Search authors by name
curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=Yann+LeCun&fields=name,hIndex,citationCount,paperCount" | python3 -m json.tool

Useful values for the fields parameter: title, authors, year, abstract, citationCount, referenceCount, influentialCitationCount, isOpenAccess, openAccessPdf, fieldsOfStudy, publicationVenue, externalIds (contains arXiv ID, DOI, etc.)


  1. Discover: python scripts/search_arxiv.py "your topic" --sort date --max 10
  2. Assess impact: curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID?fields=citationCount,influentialCitationCount"
  3. Read abstract: web_extract(urls=["https://arxiv.org/abs/ID"])
  4. Read full paper: web_extract(urls=["https://arxiv.org/pdf/ID"])
  5. Find related work: curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID/references?fields=title,citationCount&limit=20"
  6. Get recommendations: POST to Semantic Scholar recommendations endpoint
  7. Track authors: curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=NAME"

| API | Rate | Auth |
|---|---|---|
| arXiv | ~1 req / 3 seconds | None needed |
| Semantic Scholar | 1 req / second | None (100/sec with API key) |

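
The rates in the table above can be respected with a small client-side limiter. This is a sketch only: `MinIntervalLimiter` is a hypothetical class, and the intervals of 3.0 and 1.0 seconds are taken from the table rather than from any official client library.

```python
import time


class MinIntervalLimiter:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# One limiter per API, using the intervals from the table above
arxiv_limiter = MinIntervalLimiter(3.0)
semantic_scholar_limiter = MinIntervalLimiter(1.0)
```

Call `arxiv_limiter.wait()` immediately before each arXiv request and `semantic_scholar_limiter.wait()` before each Semantic Scholar request.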
  • arXiv returns Atom XML — use the helper script or parsing snippet for clean output
  • Semantic Scholar returns JSON — pipe through python3 -m json.tool for readability
  • arXiv IDs: old format (hep-th/0601001) vs new (2402.03300)
  • PDF: https://arxiv.org/pdf/{id} — Abstract: https://arxiv.org/abs/{id}
  • HTML (when available): https://arxiv.org/html/{id}
  • For local PDF processing, see the ocr-and-documents skill
  • arxiv.org/abs/1706.03762 always resolves to the latest version
  • arxiv.org/abs/1706.03762v1 points to a specific immutable version
  • When generating citations, preserve the version suffix you actually read to prevent citation drift (a later version may substantially change content)
  • The API <id> field returns the versioned URL (e.g., http://arxiv.org/abs/1706.03762v7)
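
Because the API `<id>` field carries the version suffix, it helps to separate the base ID from the version when building citations or latest-version links. A minimal sketch, assuming IDs follow the two formats listed above; `split_arxiv_id` is a hypothetical helper name.

```python
import re

# Base ID, optionally followed by a version suffix such as v7
_ID_RE = re.compile(r"^(?P<base>.+?)(?P<version>v\d+)?$")


def split_arxiv_id(id_or_url):
    """Split an arXiv ID or /abs/ URL into (base_id, version).

    `version` is a string like 'v7', or None when no suffix is present.
    """
    raw = id_or_url.strip().split("/abs/")[-1]
    match = _ID_RE.match(raw)
    return match.group("base"), match.group("version")


if __name__ == "__main__":
    for value in (
        "http://arxiv.org/abs/1706.03762v7",
        "1706.03762",
        "hep-th/0601001v2",
    ):
        base, version = split_arxiv_id(value)
        print(f"{value} -> base={base} version={version}")
```

Keeping `version` alongside `base` lets a citation record the exact version that was read, while `https://arxiv.org/abs/{base}` still points to the latest one.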

Papers can be withdrawn after submission. When this happens:

  • The <summary> field contains a withdrawal notice (look for “withdrawn” or “retracted”)
  • Metadata fields may be incomplete
  • Always check the summary before treating a result as a valid paper
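
The summary check described above can be automated when filtering a result feed. The sketch below is a keyword heuristic only: an abstract that merely discusses withdrawn or retracted work would also match, so flagged entries deserve a manual look. The function names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

ATOM = {"a": "http://www.w3.org/2005/Atom"}
WITHDRAWAL_MARKERS = ("withdrawn", "retracted")


def looks_withdrawn(entry):
    """Return True if the entry's <summary> contains a withdrawal keyword."""
    summary = entry.findtext("a:summary", default="", namespaces=ATOM)
    text = summary.lower()
    return any(marker in text for marker in WITHDRAWAL_MARKERS)


def split_entries(feed_xml):
    """Parse an Atom feed string and return (valid_entries, withdrawn_entries)."""
    root = ET.fromstring(feed_xml)
    valid, withdrawn = [], []
    for entry in root.findall("a:entry", ATOM):
        (withdrawn if looks_withdrawn(entry) else valid).append(entry)
    return valid, withdrawn
```

An entry with a missing or empty `<summary>` is treated as valid here; tighten that if incomplete metadata should also be flagged.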