xidel Web Scraping & Data Extraction

Extract data from HTML, XML, and JSON using XPath, CSS selectors, pattern matching, and JSONiq from the command line.

xidel is a command-line tool for downloading and extracting structured data from HTML/XML pages and JSON APIs. It supports XPath 2.0/3.0, CSS selectors, custom pattern matching, and JSONiq.

Install: apt-get install xidel (Debian/Ubuntu) or download from videlibri.de/xidel.html

Extract with XPath

# Extract all link href attributes from a page
xidel https://example.org --extract "//a/@href"

# Extract all page titles from links found via Google
xidel "https://www.google.com/search?q=linux+tips" \
  --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"
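The extract() call above pulls the real target URL out of each Google redirect href with a regex. The same capture group can be sanity-checked offline with plain sed, using a made-up redirect href (no xidel or network needed):

```shell
# Offline check of the capture group used above, against a
# hand-written Google-style redirect href (hypothetical value)
href='/url?q=https://example.org/tips&sa=U&ved=abc'
printf '%s\n' "$href" | sed -n 's|.*url?q=\([^&]*\)&.*|\1|p'
```

The `[^&]*` group stops at the first `&`, so only the quoted target URL survives.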

# Extract all image sources
xidel https://example.org --extract "//img/@src"

# Extract text content of all headings
xidel https://example.org --extract "//h1|//h2|//h3"

Extract with CSS selectors

# Extract text of all paragraphs
xidel https://example.org --css "p"

# Extract href from all nav links (xidel's css() function inside an XPath expression)
xidel https://example.org --extract "css('nav a')/@href"

# Combine: follow CSS-selected links and extract their titles
xidel https://example.org --follow "css('a')" --css title

Pattern matching (template syntax)

Pattern matching lets you describe the shape of the data you want with placeholders:

# Extract whatever is between <title> and </title>
xidel https://example.org --extract "<title>{.}</title>"

# Follow all <a> links and extract each page's title
xidel https://example.org \
  --follow "<a>{.}</a>*" \
  --extract "<title>{.}</title>"

# Extract a specific nested value (also validates that the structure is present)
xidel path/to/example.xml \
  --extract "<x><foo>ood</foo><bar>{.}</bar></x>"

Follow links

# Follow all <a> tags on a page and print each linked page's title
xidel https://example.org --follow //a --extract //title

# Follow Google result links, print titles, download pages into host-named dirs
xidel "https://www.google.com/search?q=test" \
  --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" \
  --extract //title \
  --download '{$host}/'

JSON APIs

# Extract a field from a JSON API response ($json holds the parsed document;
# single quotes keep the shell from expanding it)
xidel https://api.github.com/repos/octocat/Hello-World --extract '$json/name'

# JSONiq-style extraction: iterate an array of objects and take a field
xidel https://api.example.com/data.json --extract '$json/items()/title'
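For readers without xidel installed, a saved response can be spot-checked with python3's standard library. The repo.json contents below are a trimmed, hand-written stand-in for the GitHub response, keeping only the field used above:

```shell
# Hypothetical saved response with only the fields we care about
cat > repo.json <<'EOF'
{"id": 1296269, "name": "Hello-World", "full_name": "octocat/Hello-World"}
EOF
# Same field as the xidel example, via the stdlib json module
python3 -c 'import json; print(json.load(open("repo.json"))["name"])'
```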

Structured output from RSS / Atom

# Extract title + URL from every Stack Overflow question in the RSS feed
xidel http://stackoverflow.com/feeds \
  --extract "<entry><title>{title:=.}</title><link>{uri:=@href}</link></entry>+"

The + at the end means "repeat this pattern one or more times." Named variables (title:=, uri:=) pair related fields.
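To see what the repeated pattern is pairing up, here is the same title/link pairing done with sed over a tiny hand-written feed snippet. This is a rough offline stand-in only; regexes are no substitute for xidel's real parser, and the entries are invented:

```shell
# Hand-written two-entry snippet, one entry per line (hypothetical data)
cat > feed.xml <<'EOF'
<entry><title>First question</title><link href="https://example.org/q/1"/></entry>
<entry><title>Second question</title><link href="https://example.org/q/2"/></entry>
EOF
# Emit "title url" per entry, mirroring the title:= / uri:= pairing
sed -n 's#.*<title>\(.*\)</title><link href="\([^"]*\)".*#\1 \2#p' feed.xml
```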

Form automation & login

# Log in to Reddit and check unread mail count
# Combines CSS selectors, XPath, JSONiq, and form evaluation
xidel https://reddit.com \
  --follow "form(css('form.login-form')[1], {'user': 'myuser', 'passwd': 'mypassword'})" \
  --extract "css('#mail')/@title"

Output formats

# Output as JSON array
xidel https://example.org --extract "//a/@href" --output-format json

# Output as XML
xidel https://example.org --extract "//a" --output-format xml

# Print each result on its own line (the default)
xidel https://example.org --extract "//a/@href" --output-format adhoc

Query language comparison

| Query      | XPath     | CSS                     | Pattern            |
| ---------- | --------- | ----------------------- | ------------------ |
| All links  | //a       | css('a')                | <a>{.}</a>*        |
| Link href  | //a/@href | css('a')/@href          | <a href="{.}">     |
| Page title | //title   | css('title')            | <title>{.}</title> |
| First h1   | //h1[1]   | css('h1:first-of-type') | —                  |

Combine with shell pipelines

# Save all scraped URLs to a file for aria2c batch download
xidel https://example.org/downloads --extract "//a[contains(@href,'.iso')]/@href" \
  > iso-urls.txt
aria2c --input-file=iso-urls.txt -c -d ~/Downloads
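Scraped link lists often contain duplicates and mixed schemes, so a small filter step before handing the file to aria2c keeps the download queue clean. Plain POSIX tools suffice; the URLs below are made up to stand in for xidel's output:

```shell
# Hypothetical stand-in for the scraped list: duplicates plus a plain-http entry
cat > iso-urls.txt <<'EOF'
https://example.org/a.iso
http://example.org/a.iso
https://example.org/a.iso
https://example.org/b.iso
EOF
# Keep https-only URLs and drop duplicates before the batch download
grep '^https://' iso-urls.txt | sort -u > iso-urls.clean.txt
cat iso-urls.clean.txt
```

Feed iso-urls.clean.txt to aria2c instead of the raw list.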