sitescrape

A CLI tool that scrapes authenticated internal websites and converts pages to markdown files. Uses your existing browser cookies for authentication. Ships as a single Rust binary.

Requirements

Rust toolchain (to build from source)
macOS, Linux, or Windows
Chrome or Chromium installed (for JS rendering mode)

Installation

Pre-built binaries

Download the latest release from GitHub Releases. Releases are built and published automatically via GitHub Actions when a version tag is pushed.

Platform	Archive
macOS Intel (x86_64)	`sitescrape-x86_64-apple-darwin.tar.gz`
macOS Apple Silicon (aarch64)	`sitescrape-aarch64-apple-darwin.tar.gz`
Linux (x86_64)	`sitescrape-x86_64-unknown-linux-gnu.tar.gz`
Windows (x86_64)	`sitescrape-x86_64-pc-windows-msvc.zip`

macOS:

tar xzf sitescrape-<target>.tar.gz
chmod +x sitescrape
sudo mv sitescrape /usr/local/bin/

macOS Gatekeeper notice: The pre-built binaries are not signed with an Apple Developer certificate. macOS will block the binary the first time you run it. To allow it:

Run sitescrape in your terminal — macOS will show a dialog saying the app "cannot be opened because the developer cannot be verified." Click Cancel.

Open System Settings → Privacy & Security.

Scroll down to the Security section. You'll see a message about sitescrape being blocked. Click Allow Anyway.

Run sitescrape again. Click Open in the confirmation dialog.

Alternatively, remove the quarantine attribute before first run:
xattr -d com.apple.quarantine /usr/local/bin/sitescrape
You only need to do this once per binary.

Linux:

tar xzf sitescrape-x86_64-unknown-linux-gnu.tar.gz
chmod +x sitescrape
mv sitescrape ~/.local/bin/   # or /usr/local/bin/ with sudo

Windows: Extract the zip and move sitescrape.exe to a directory in your PATH, or add its directory to PATH.

Prerequisites: Chrome or Chromium must be installed for browser mode (JS-rendered sites). Use --no-browser for HTTP-only mode.

Platform notes: Chrome cookie extraction is supported on macOS and Windows. Firefox cookies are supported on all platforms.

Build from source

cd sitescrape
cargo build --release
# Binary at target/release/sitescrape

Usage

Usage: sitescrape [OPTIONS] [URL]

Arguments:
  [URL]  URL to scrape

Options:
  -b, --browser <BROWSER>      Cookie source browser (auto, firefox, chrome)
  -o, --output <OUTPUT>        Output directory
  -m, --max-pages <MAX_PAGES>  Max pages to crawl
  -s, --selector <SELECTOR>    CSS selector for main content
  -p, --prefix <PREFIX>        URL path prefix filter
  -d, --delay <DELAY>          Delay between pages in ms
      --no-browser             HTTP-only mode (no JS rendering)
      --headless               Disable TUI, plain text output
      --config <CONFIG>        Config file path
      --profile <PROFILE>      Named profile from config file
  -h, --help                   Print help
  -V, --version                Print version

Examples

# Basic usage — scrape with auto-detected browser cookies
sitescrape https://docs.internal.example.com/guide/

# Use Firefox cookies, limit to 100 pages
sitescrape https://docs.internal.example.com/ -b firefox -m 100

# Static site, no JS rendering needed
sitescrape https://docs.internal.example.com/ --no-browser

# Headless mode (no TUI, plain text progress)
sitescrape https://docs.internal.example.com/ --headless

# Custom content selector and output directory
sitescrape https://docs.internal.example.com/ -s ".main-content" -o ./docs

# Use a named profile from config file
sitescrape --profile mysite

Interactive Mode

Launch without a URL to enter interactive mode:

sitescrape

This opens a persistent TUI where you can run multiple scrapes without restarting.

Commands

Command	Description
`/scrape <url> [-m N] [-s sel] [-p pfx] [-d ms] [--no-browser]`	Start a scrape
`/stop` or `Esc`	Stop the current scrape
`/status`	Show current scrape progress
`/config`	Show current configuration
`/set <key> <value>`	Change a setting (output, delay, max_pages, selector)
`/cookies`	Reload cookies from browser
`/clear`	Clear the log
`/help`	Show available commands
`/quit` or `/q`	Exit

# In the interactive TUI:
> /scrape https://docs.internal.example.com/guide/
> /set output ./docs
> /scrape https://other-site.example.com/ -m 50 -s ".content"
> /cookies
> /quit

TUI Interface

The TUI uses a modern styled interface with color-coded log entries, Unicode status icons, a braille spinner during active scrapes, and a progress bar. Log entries are color-coded by outcome: green for saved pages, red for errors, yellow for auth failures, cyan for info messages.

Icon	Meaning
`✓`	Page saved
`✗`	Error
`⚠`	Auth failure
`◌`	Visiting (in progress)
`ℹ`	Info

TUI Keybindings

Key	Context	Action
`Enter`	Interactive	Run command
`Esc`	Interactive (idle)	Clear input
`Esc`	Interactive (scraping)	Stop scrape
`p`	One-shot TUI	Pause/resume
`↑` / `↓`	Both	Scroll log
`Ctrl+C`	Both	Quit

Config File

Place sitescrape.toml in the current directory, or specify a path with --config:

[default]
browser = "auto"
output = "./output"
max_pages = 500
delay = 500
selector = "main"

[profiles.mysite]
url = "https://docs.internal.example.com/guide/"
max_pages = 200
selector = ".content"

Precedence: CLI args > profile values > default values.

Browser Support

Firefox: Reads cookies from the default profile. macOS: ~/Library/Application Support/Firefox/Profiles/, Windows: %APPDATA%\Mozilla\Firefox\Profiles.
Chrome: Reads and decrypts cookies. macOS: Keychain (Chrome Safe Storage key) + AES-128-CBC. Windows: DPAPI + AES-256-GCM.
Auto-detection: Detects the default browser. macOS: LaunchServices. Windows: registry (UrlAssociations). Falls back to Firefox if detection fails.

Modes

Interactive mode (default, no URL): Persistent TUI with command input — run multiple scrapes without restarting.
TUI mode (with URL): One-shot scrape with progress display. Auto-activates when stdout is a TTY.
Headless mode (--headless or piped output): Plain text progress with an indicatif progress bar. Use for scripting or CI.
Browser mode (default): Uses headless Chrome via chromiumoxide for JS rendering. Requires Chrome/Chromium installed.
No-browser mode (--no-browser): HTTP-only, no Chrome — faster for static sites that don't require JS rendering.

Output Format

Each page is saved as a markdown file with YAML frontmatter:

---
title: "Page Title"
url: "https://docs.internal.example.com/guide/getting-started"
---

# Getting Started
...

URL path segments are converted to Title-Case filenames (e.g., /guide/getting-started → Guide/Getting-Started.md).

Known Limitations

Chrome/Chromium must be installed for JS rendering mode
Sequential crawling only — one page at a time with a configurable delay
Browser must have written cookies to disk before running (cookie DB files are copied to a temp location for reading)

Reporting Issues

If you run into a problem, please open an issue and include:

What you were trying to do and what happened instead
The command you ran (redact any internal URLs)
Your OS and architecture (e.g., macOS ARM64, Ubuntu 22.04 x86_64)
The sitescrape version (sitescrape --version)
Any error output from the terminal

Before opening a new issue, check existing issues to see if it's already been reported.

Contributing

See CONTRIBUTING.md for guidelines on reporting bugs, submitting pull requests, and the code of conduct.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.github/workflows		.github/workflows
docs		docs
scripts		scripts
src		src
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
nextsteps.md		nextsteps.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sitescrape

Requirements

Installation

Pre-built binaries

Build from source

Usage

Examples

Interactive Mode

Commands

TUI Interface

TUI Keybindings

Config File

Browser Support

Modes

Output Format

Known Limitations

Reporting Issues

Contributing

About

Uh oh!

Releases 4

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

sitescrape

Requirements

Installation

Pre-built binaries

Build from source

Usage

Examples

Interactive Mode

Commands

TUI Interface

TUI Keybindings

Config File

Browser Support

Modes

Output Format

Known Limitations

Reporting Issues

Contributing

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages