xidel Web Scraping & Data Extraction
xidel is a command-line tool for downloading and extracting structured data from HTML/XML pages and JSON APIs. It supports XPath 2.0/3.0, CSS selectors, custom pattern matching, and JSONiq.
Install:
apt-get install xidel (Debian/Ubuntu), or download from videlibri.de/xidel.html
Extract with XPath
# Extract all link href attributes from a page
xidel https://example.org --extract "//a/@href"
# Extract the destination URLs from Google search result links
xidel "https://www.google.com/search?q=linux+tips" \
--extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
# Extract all image sources
xidel https://example.org --extract "//img/@src"
# Extract text content of all headings
xidel https://example.org --extract "//h1|//h2|//h3"
Extract with CSS selectors
# Extract text of all paragraphs
xidel https://example.org --css "p"
# Extract href from all nav links
xidel https://example.org --css "nav a" --extract "@href"
# Combine: follow CSS-selected links and extract their titles
xidel https://example.org --follow "css('a')" --css title
Pattern matching (template syntax)
Pattern matching lets you describe the shape of the data you want with placeholders:
# Extract whatever is between <title> and </title>
xidel https://example.org --extract "<title>{.}</title>"
# Follow all <a> links and extract each page's title
xidel https://example.org \
--follow "<a>{.}</a>*" \
--extract "<title>{.}</title>"
# Extract a specific nested value; this also validates that the structure is present
xidel path/to/example.xml \
--extract "<x><foo>ood</foo><bar>{.}</bar></x>"
Follow links & crawl
# Follow all <a> tags on a page and print each linked page's title
xidel https://example.org --follow //a --extract //title
# Follow Google result links, print titles, download pages into host-named dirs
xidel "https://www.google.com/search?q=test" \
--follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" \
--extract //title \
--download '{$host}/'
JSON APIs
# Extract a field from a JSON API response
xidel https://api.github.com/repos/octocat/Hello-World --extract "//name"
# Navigate nested JSON with XPath-like path steps
xidel https://api.example.com/data.json --extract "//items/title"
Structured output from RSS / Atom
# Extract title + URL from every Stack Overflow question in the RSS feed
xidel http://stackoverflow.com/feeds \
--extract "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+"
The + at the end means "repeat this pattern one or more times." Named variables (title:=, uri:=) pair the related fields within each match.
Form automation & login
# Log in to Reddit and check unread mail count
# Combines CSS selectors, XPath, JSONiq, and form evaluation
xidel https://reddit.com \
--follow "form(css('form.login-form')[1], {'user': 'myuser', 'passwd': 'mypassword'})" \
--extract "css('#mail')/@title"
Output formats
# Output all results as a single JSON array
xidel https://example.org --extract "//a/@href" --output-format json-wrapped
# Output as XML
xidel https://example.org --extract "//a" --output-format xml
# Print each result on its own line (the default)
xidel https://example.org --extract "//a/@href" --output-format adhoc
Query language comparison
| Query | XPath | CSS | Pattern |
|---|---|---|---|
| All links | //a | css('a') | <a>{.}</a>* |
| Link href | //a/@href | css('a')/@href | <a href="{.}"> |
| Page title | //title | css('title') | <title>{.}</title> |
| First h1 | //h1[1] | css('h1:first-of-type') | n/a |
Combine with shell pipelines
# Save all scraped URLs to a file for aria2c batch download
xidel https://example.org/downloads --extract "//a[contains(@href,'.iso')]/@href" \
> iso-urls.txt
aria2c --input-file=iso-urls.txt -c -d ~/Downloads