# Scrapling

Web scraping with Scrapling - HTTP fetching, stealth browser automation, Cloudflare bypass, and spider crawling via CLI and Python.
## Skill metadata

| Field | Value |
|---|---|
| Source | Optional — install with `hermes skills install official/research/scrapling` |
| Path | `optional-skills/research/scrapling` |
| Version | 1.0.0 |
| Author | FEUAZUR |
| License | MIT |
| Tags | Web Scraping, Browser, Cloudflare, Stealth, Crawling, Spider |
| Related skills | duckduckgo-search, domain-intel |
## Reference: full SKILL.md

The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
### Scrapling

Scrapling is a web scraping framework with anti-bot bypass, stealth browser automation, and a spider framework. It provides three fetching strategies (HTTP, dynamic JS, stealth/Cloudflare) and a full CLI.

This skill is for educational and research purposes only. Users must comply with local/international data scraping laws and respect website Terms of Service.
### When to Use

- Scraping static HTML pages (faster than browser tools)
- Scraping JS-rendered pages that need a real browser
- Bypassing Cloudflare Turnstile or bot detection
- Crawling multiple pages with a spider
- When the built-in `web_extract` tool does not return the data you need
### Installation

```bash
pip install "scrapling[all]"
scrapling install
```

Minimal install (HTTP only, no browser):

```bash
pip install scrapling
```

With browser automation only:

```bash
pip install "scrapling[fetchers]"
scrapling install
```
### Quick Reference

| Approach | Class | Use When |
|---|---|---|
| HTTP | `Fetcher` / `FetcherSession` | Static pages, APIs, fast bulk requests |
| Dynamic | `DynamicFetcher` / `DynamicSession` | JS-rendered content, SPAs |
| Stealth | `StealthyFetcher` / `StealthySession` | Cloudflare, anti-bot protected sites |
| Spider | `Spider` | Multi-page crawling with link following |
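A common pattern is to try the cheap HTTP fetcher first and fall back to the stealth browser only when the response looks blocked. A minimal sketch using the classes from the table; the `status == 403` check is an assumed heuristic, not part of the library, so use whatever blocked-page signal your target site actually gives:

```python
from scrapling.fetchers import Fetcher, StealthyFetcher

url = 'https://example.com'

# First pass: plain HTTP, the fastest and cheapest strategy.
page = Fetcher.get(url)

# Assumed heuristic: treat a 403 as a bot-detection block and retry
# with the stealth browser (slower, but handles Cloudflare).
if page.status == 403:
    page = StealthyFetcher.fetch(url, headless=True, solve_cloudflare=True)

titles = page.css('h1::text').getall()
```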
### CLI Usage

#### Extract Static Page
```bash
scrapling extract get 'https://example.com' output.md
```

With CSS selector and browser impersonation:

```bash
scrapling extract get 'https://example.com' output.md \
  --css-selector '.content' \
  --impersonate 'chrome'
```

#### Extract JS-Rendered Page
```bash
scrapling extract fetch 'https://example.com' output.md \
  --css-selector '.dynamic-content' \
  --disable-resources \
  --network-idle
```

#### Extract Cloudflare-Protected Page
```bash
scrapling extract stealthy-fetch 'https://protected-site.com' output.html \
  --solve-cloudflare \
  --block-webrtc \
  --hide-canvas
```

#### POST Request
```bash
scrapling extract post 'https://example.com/api' output.json \
  --json '{"query": "search term"}'
```

#### Output Formats
The output format is determined by the file extension:

- `.html` — raw HTML
- `.md` — converted to Markdown
- `.txt` — plain text
- `.json` / `.jsonl` — JSON
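For instance, the same page can be saved in different formats just by changing the output filename's extension (mirroring the `get` command shown above):

```bash
# Identical command, three different conversions
scrapling extract get 'https://example.com' page.html  # raw HTML
scrapling extract get 'https://example.com' page.md    # Markdown
scrapling extract get 'https://example.com' page.txt   # plain text
```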
### Python: HTTP Scraping

#### Single Request
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
for q in quotes:
    print(q)
```

#### Session (Persistent Cookies)
```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate='chrome') as session:
    page = session.get('https://example.com/', stealthy_headers=True)
    links = page.css('a::attr(href)').getall()
    for link in links[:5]:
        sub = session.get(link)
        print(sub.css('h1::text').get())
```

#### POST / PUT / DELETE
```python
page = Fetcher.post('https://api.example.com/data', json={"key": "value"})
page = Fetcher.put('https://api.example.com/item/1', data={"name": "updated"})
page = Fetcher.delete('https://api.example.com/item/1')
```

#### With Proxy
```python
page = Fetcher.get('https://example.com', proxy='http://user:pass@proxy:8080')
```

### Python: Dynamic Pages (JS-Rendered)
For pages that require JavaScript execution (SPAs, lazy-loaded content):

```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch('https://example.com', headless=True)
data = page.css('.js-loaded-content::text').getall()
```

#### Wait for Specific Element
```python
page = DynamicFetcher.fetch(
    'https://example.com',
    wait_selector=('.results', 'visible'),
    network_idle=True,
)
```

#### Disable Resources for Speed
Blocks fonts, images, media, stylesheets (~25% faster):

```python
from scrapling.fetchers import DynamicSession

with DynamicSession(headless=True, disable_resources=True, network_idle=True) as session:
    page = session.fetch('https://example.com')
    items = page.css('.item::text').getall()
```

#### Custom Page Automation
```python
from playwright.sync_api import Page
from scrapling.fetchers import DynamicFetcher

def scroll_and_click(page: Page):
    page.mouse.wheel(0, 3000)
    page.wait_for_timeout(1000)
    page.click('button.load-more')
    page.wait_for_selector('.extra-results')

page = DynamicFetcher.fetch('https://example.com', page_action=scroll_and_click)
results = page.css('.extra-results .item::text').getall()
```

### Python: Stealth Mode (Anti-Bot Bypass)
For Cloudflare-protected or heavily fingerprinted sites:

```python
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    'https://protected-site.com',
    headless=True,
    solve_cloudflare=True,
    block_webrtc=True,
    hide_canvas=True,
)
content = page.css('.protected-content::text').getall()
```

#### Stealth Session
```python
from scrapling.fetchers import StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:
    page1 = session.fetch('https://protected-site.com/page1')
    page2 = session.fetch('https://protected-site.com/page2')
```

### Element Selection
All fetchers return a `Selector` object with these methods:

#### CSS Selectors

```python
page.css('h1::text').get()               # First h1 text
page.css('a::attr(href)').getall()       # All link hrefs
page.css('.quote .text::text').getall()  # Nested selection

# XPath equivalents
page.xpath('//div[@class="content"]/text()').getall()
page.xpath('//a/@href').getall()
```

#### Find Methods
```python
page.find_all('div', class_='quote')     # By tag + attribute
page.find_by_text('Read more', tag='a')  # By text content
page.find_by_regex(r'\$\d+\.\d{2}')      # By regex pattern
```

#### Similar Elements
Find elements with similar structure (useful for product listings, etc.):

```python
first_product = page.css('.product')[0]
all_similar = first_product.find_similar()
```

#### Navigation
```python
el = page.css('.target')[0]
el.parent        # Parent element
el.children      # Child elements
el.next_sibling  # Next sibling
el.prev_sibling  # Previous sibling
```
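These methods combine naturally: locate an element by one signal, then navigate to related data. A short sketch, assuming `find_by_text` returns a single element as in the snippet above; the card/`h2` page structure is hypothetical:

```python
# Find a known label by its text, then walk up to its container
link = page.find_by_text('Read more', tag='a')
card = link.parent                  # enclosing card element (assumed layout)
title = card.css('h2::text').get()  # pull related data from the same card
```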
### Python: Spider Framework

For multi-page crawling with link following:
```python
from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10
    download_delay = 1

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
                "tags": quote.css('.tag::text').getall(),
            }

        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```

#### Multi-Session Spider
Route requests to different fetcher types:

```python
from scrapling.fetchers import FetcherSession, AsyncStealthySession
from scrapling.spiders import Spider, Request, Response

class SmartSpider(Spider):
    name = "smart"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)
```

#### Pause/Resume Crawling
```python
spider = QuotesSpider(crawldir="./crawl_checkpoint")
spider.start()  # Ctrl+C to pause, re-run to resume from checkpoint
```

### Pitfalls
- Browser install required: run `scrapling install` after pip install — without it, `DynamicFetcher` and `StealthyFetcher` will fail
- Timeouts: `DynamicFetcher`/`StealthyFetcher` timeout is in milliseconds (default 30000), while `Fetcher` timeout is in seconds (see the sketch after this list)
- Cloudflare bypass: `solve_cloudflare=True` adds 5-15 seconds to fetch time — only enable when needed
- Resource usage: `StealthyFetcher` runs a real browser — limit concurrent usage
- Legal: always check robots.txt and website ToS before scraping. This library is for educational and research purposes
- Python version: requires Python 3.10+
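To make the timeout pitfall concrete, a minimal sketch; both calls below wait 10 seconds, and the unit behavior follows the defaults described above (verify against your installed version):

```python
from scrapling.fetchers import Fetcher, DynamicFetcher

# Fetcher: timeout is in seconds, so this waits up to 10 s.
page = Fetcher.get('https://example.com', timeout=10)

# DynamicFetcher: timeout is in milliseconds, so 10 s must be written
# as 10000; passing 10 would mean 10 ms and time out almost instantly.
page = DynamicFetcher.fetch('https://example.com', timeout=10000)
```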