What is ArchiveBox?
ArchiveBox is an open-source, self-hosted web archiving solution that captures websites in multiple formats, going beyond what public archiving services offer. It imports URLs from various sources, including stdin, remote URLs, and files, and then archives the pages using a range of techniques. ArchiveBox uses wget to create browsable HTML clones, youtube-dl to extract media, and a full instance of headless Chrome to generate PDFs, screenshots, and DOM dumps. This multi-pronged approach means that even complex, dynamic websites are archived in several high-quality, long-term data formats. The archiving process is additive, so users can schedule regular runs that continuously pull new links into the index. The saved content is stored as static, JSON-indexed files, which removes the need for a constantly running backend and keeps the data accessible and parseable over the long term.
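To illustrate what "additive" archiving into a static JSON index means, here is a minimal Python sketch. It is not ArchiveBox's actual code or index schema: the function name `add_links`, the file name `index.json`, and the one-key entry format `{"url": ...}` are all assumptions made for illustration. It only shows the general idea of merging new links into an existing index without dropping or duplicating earlier entries.

```python
import json
from pathlib import Path


def add_links(index_path: Path, new_urls: list[str]) -> list[dict]:
    """Merge new URLs into a JSON index file, keeping existing entries.

    Hypothetical helper: the index layout (a list of {"url": ...} objects)
    is an assumption, not ArchiveBox's real schema.
    """
    # Load the existing index if there is one; otherwise start empty.
    if index_path.exists():
        links = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        links = []

    # Track URLs already present so repeated runs do not add duplicates.
    known = {entry["url"] for entry in links}
    for url in new_urls:
        url = url.strip()
        if url and url not in known:
            links.append({"url": url})
            known.add(url)

    # Write the merged index back as a plain, static JSON file.
    index_path.write_text(json.dumps(links, indent=2), encoding="utf-8")
    return links
```

Calling `add_links(Path("index.json"), urls)` on a schedule would, under these assumptions, grow the index over time, and the resulting file can be read by any JSON parser without a running server.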
Highlights
- Captures websites in multiple formats, including HTML, media, PDF, screenshots, and DOM dumps
- Uses a range of tools, including wget, youtube-dl, and headless Chrome, to ensure comprehensive archiving
- Additive archiving process allows for scheduled, continuous updates to the index
- Saved content is stored as static, JSON-indexed files for long-term accessibility and parseability
Platforms
- Docker
- Linux
- Windows
- Self-Hosted
- Mac
Languages
- English
Features
- Network Tools