xidel Web Scraping & Data Extraction
xidel is a command-line tool for downloading and extracting structured data from HTML/XML pages and JSON APIs. It supports XPath 2.0/3.0, CSS selectors, custom pattern matching, and JSONiq.
Install:
apt-get install xidel (Debian/Ubuntu), or download from videlibri.de/xidel.html
Extract with XPath
# Extract all link href attributes from a page
xidel https://example.org --extract "//a/@href"
# Extract the destination URLs from Google search result links
xidel "https://www.google.com/search?q=linux+tips" \
--extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
# Extract all image sources
xidel https://example.org --extract "//img/@src"
# Extract text content of all headings
xidel https://example.org --extract "//h1|//h2|//h3"
Extract with CSS selectors
# Extract text of all paragraphs
xidel https://example.org --css "p"
# Extract href from all nav links
xidel https://example.org --css "nav a" --extract "@href"
# Combine: follow CSS-selected links and extract their titles
xidel https://example.org --follow "css('a')" --css title
Pattern matching (template syntax)
Pattern matching lets you describe the shape of the data you want with placeholders:
# Extract whatever is between <title> and </title>
xidel https://example.org --extract "<title>{.}</title>"
# Follow all <a> links and extract each page's title
xidel https://example.org \
--follow "<a>{.}</a>*" \
--extract "<title>{.}</title>"
# Extract a specific nested value; this also validates that the structure is present
xidel path/to/example.xml \
--extract "<x><foo>ood</foo><bar>{.}</bar></x>"
Follow links & crawl
# Follow all <a> tags on a page and print each linked page's title
xidel https://example.org --follow //a --extract //title
# Follow Google result links, print titles, download pages into host-named dirs
xidel "https://www.google.com/search?q=test" \
--follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" \
--extract //title \
--download '{$host}/'
JSON APIs
# Extract a field from a JSON API response
xidel https://api.github.com/repos/octocat/Hello-World --extract "//name"
# Navigate nested JSON with XPath-like path steps
xidel https://api.example.com/data.json --extract "//items/title"
Structured output from RSS / Atom
# Extract title + URL from every Stack Overflow question in the RSS feed
xidel http://stackoverflow.com/feeds \
--extract "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+"
The + at the end means "repeat this pattern one or more times." Named variables (title:=, uri:=) pair the related fields within each match.
Form automation & login
# Log in to Reddit and check unread mail count
# Combines CSS selectors, XPath, JSONiq, and form evaluation
xidel https://reddit.com \
--follow "form(css('form.login-form')[1], {'user': 'myuser', 'passwd': 'mypassword'})" \
--extract "css('#mail')/@title"
Output formats
# Output all results as a single JSON array
xidel https://example.org --extract "//a/@href" --output-format json-wrapped
# Output as XML
xidel https://example.org --extract "//a" --output-format xml
# Print each result on its own line (the default)
xidel https://example.org --extract "//a/@href" --output-format adhoc
Query language comparison
| Query | XPath | CSS | Pattern |
|---|---|---|---|
| All links | //a | css('a') | <a>{.}</a>* |
| Link href | //a/@href | css('a')/@href | <a href="{.}"> |
| Page title | //title | css('title') | <title>{.}</title> |
| First h1 | //h1[1] | css('h1:first-of-type') | n/a |
Combine with shell pipelines
# Save all scraped URLs to a file for aria2c batch download
xidel https://example.org/downloads --extract "//a[contains(@href,'.iso')]/@href" \
> iso-urls.txt
aria2c --input-file=iso-urls.txt -c -d ~/Downloads