ArchiveBox - Overview | Alternative.to

What is ArchiveBox?

ArchiveBox is an open-source, self-hosted web archiving solution that captures websites in multiple formats, going beyond the capabilities of public archiving services. It imports URLs from various sources, including stdin, remote URLs, and files, and then archives the pages using a range of techniques. ArchiveBox employs wget to create browsable HTML clones, youtube-dl to extract media, and a full instance of headless Chrome to generate PDFs, screenshots, and DOM dumps. This multi-pronged approach means that even complex, dynamic websites are archived in several high-quality, long-term data formats. The archiving process is additive, so users can schedule regular runs that continuously pull new links into the index. The saved content is stored as static, JSON-indexed files, which removes the need for a constantly running backend and keeps the data accessible and parseable over the long term.
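The additive, JSON-indexed approach described above can be illustrated with a minimal Python sketch. This is not ArchiveBox's actual code or index schema: the function name `add_links`, the `index.json` file, and the `{"links": [{"url": ...}]}` layout are all assumptions made for illustration. It only shows the general idea that repeated runs add new links to a static index without duplicating existing ones.

```python
import json
from pathlib import Path


def add_links(index_path: Path, urls: list[str]) -> list[str]:
    """Add URLs that are not yet in the index; return the ones that were added.

    Hypothetical schema (not ArchiveBox's real one): {"links": [{"url": "..."}]}
    """
    # Load the existing index, or start with an empty one on the first run.
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        index = {"links": []}

    known = {entry["url"] for entry in index["links"]}
    added = []
    for url in urls:
        if url not in known:
            # Only unseen URLs are appended, so reruns never duplicate entries.
            index["links"].append({"url": url})
            known.add(url)
            added.append(url)

    # The index is a plain static file; no running backend is needed to read it.
    index_path.write_text(json.dumps(index, indent=2), encoding="utf-8")
    return added
```

Calling `add_links` a second time with an overlapping list returns only the URLs that were not already indexed, which is the behavior a scheduled, recurring import relies on.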

Highlights

Captures websites in multiple formats, including HTML, media, PDF, screenshots, and DOM dumps
Uses a range of tools, including wget, youtube-dl, and headless Chrome, to ensure comprehensive archiving
Additive archiving process allows for scheduled, continuous updates to the index
Stored as static, JSON-indexed files for long-term accessibility and parsability

Platforms

Docker
Linux
Windows
Self-Hosted
Mac

Languages

English

Features

Network Tools