
Experiments in testable, scalable crawler architectures


Wren

An experiment aimed at building a scalable, modular web archive system based on Docker Compose and, possibly, Apache Storm.

The production version of this approach is held in the Pulse project.

To make any progress, we need to be able to effectively compare any new crawler with our current system. Therefore, we start by reproducing our existing crawl system via Docker Compose, and check we fully understand it before attempting to make any modifications. We will then look at ways of modifying, replacing or removing our current components in order to make the whole system more maintainable, manageable and scalable.

Our goals are:

  • Fewer moving parts (less to maintain)
  • Based on a scalable parallel processing framework (manually scaling is hard)
  • Robust, guaranteed processing of requests (won't drop URLs by accident)

Freely lifting useful ideas from:

Folder structure

Most of the folders in this repository are distinct Dockerized services. The folders beginning with compose- contain docker-compose.yml files that assemble these individual services into larger, integrated systems.

Where the services are under active development, the service folder is a git submodule, pulling in the original repository and building it directly inside this parent project. This makes integrated development and testing much easier. However, if you clone this repository, you'll probably want to do so recursively, like this:

$ git clone --recursive git@github.com:anjackson/wren.git

This pulls down all the submodules at the same time as the main clone.

As individual services stabilize, it should be possible to remove these submodules and run the Docker images instead.

Problems:

Queue-based Harvest Workflow

  • FC-1-uris-to-check
  • FC-2-uris-to-render
  • FC-3-uris-to-crawl
  • FC-4-uris-to-index
  • FI-1-checkpoints-to-package

warcprox

warcprox takes its WARC filename prefix ("warc-prefix") from the Warcprox-Meta request header, supplied as JSON:

Warcprox-Meta: {"warc-prefix": "PREFIX"}
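As a sketch of how a client might set this header when fetching a URL through warcprox (the proxy address and prefix below are assumptions; warcprox commonly listens on port 8000):

```python
import json


def warcprox_meta_header(warc_prefix):
    """Build the Warcprox-Meta header value that tells warcprox
    which WARC filename prefix to record this request under."""
    return json.dumps({"warc-prefix": warc_prefix})


def fetch_via_warcprox(url, warc_prefix, proxy="http://localhost:8000"):
    """Fetch a URL through a warcprox instance (address is an assumption)."""
    import requests  # third-party: pip install requests
    return requests.get(
        url,
        headers={"Warcprox-Meta": warcprox_meta_header(warc_prefix)},
        proxies={"http": proxy, "https": proxy},
        verify=False,  # warcprox re-signs TLS traffic with its own CA
    )
```

The header is plain JSON, so any HTTP client that can set request headers and use a proxy can drive warcprox this way.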

Wren Storm Topologies

We are evaluating whether Apache Storm provides a useful framework for modularizing and scaling the core crawl process itself. In particular, the way the framework provides guaranteed message processing (e.g. at-least-once semantics) should help ensure the integrity of the system.
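The ack/fail cycle behind at-least-once processing can be illustrated with a minimal in-process sketch (illustrative Python, not Storm's actual API): a message only leaves the queue once its handler succeeds, and failures are re-queued for redelivery, so no URL is silently dropped.

```python
import queue


def process_with_retry(work_queue, handler, max_attempts=3):
    """At-least-once sketch: 'ack' a message only when the handler
    succeeds; on failure, 'fail' the message by re-queueing it, up
    to max_attempts deliveries per message."""
    processed = []
    attempts = {}
    while not work_queue.empty():
        msg = work_queue.get()
        try:
            handler(msg)
            processed.append(msg)      # ack: success, message is done
        except Exception:
            attempts[msg] = attempts.get(msg, 0) + 1
            if attempts[msg] < max_attempts:
                work_queue.put(msg)    # fail: re-queue for redelivery
    return processed
```

Note the trade-off: redelivery means a handler may see the same message twice, so downstream steps (e.g. indexing) need to be idempotent.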

Elastic Web Rendering

Wren includes a prototype replacement for our suite of Python-based scripts that render the URLs of a Heritrix crawl in a browser, in order to discover the URLs of any dynamically transcluded dependencies.
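The dependency-extraction step can be sketched like this: given the list of resources a browser fetched while rendering a page (e.g. a HAR-style capture from a rendering service), keep every URL other than the page itself as a candidate dependency to feed back to the crawler. The HAR structure below is the standard log/entries layout; the rendering service itself is assumed.

```python
def transcluded_dependencies(har, page_url):
    """Given a HAR-style dict recording the resources fetched while
    rendering a page, return the URLs of everything except the page
    itself -- the candidate dynamically transcluded dependencies."""
    urls = set()
    for entry in har.get("log", {}).get("entries", []):
        url = entry["request"]["url"]
        if url != page_url:
            urls.add(url)
    return sorted(urls)
```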

Robust Crawl Launching

We also need to launch our regular crawls reliably. The current system relies on a script (w3start.py) that is launched by an hourly cron job. However, if something goes wrong during the launch process, the system cannot retry. A better option is to use the cron job only to place a crawl request on a queue, and use a daemon process to watch that queue and launch the crawl.
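A minimal sketch of that split, with a spool directory standing in for a real message queue (the directory layout and function names are illustrative, not the actual w3start.py interface): a request stays queued until a launch attempt succeeds, so failures are retried on the daemon's next pass.

```python
import json
import time
from pathlib import Path


def enqueue_crawl_request(request_dir, crawl_id):
    """What the hourly cron job would do: just drop a request on the
    queue (a spool directory stands in for a real message queue)."""
    request_dir.mkdir(parents=True, exist_ok=True)
    path = request_dir / f"{crawl_id}.json"
    path.write_text(json.dumps({"crawl_id": crawl_id,
                                "requested": time.time()}))
    return path


def launch_pending_crawls(request_dir, launch):
    """What the watching daemon would do on each pass: attempt each
    queued request, and only remove it once the launch succeeds, so
    a failed launch is retried on the next pass."""
    launched = []
    for path in sorted(request_dir.glob("*.json")):
        request = json.loads(path.read_text())
        try:
            launch(request)   # e.g. invoke the crawl-start script
        except Exception:
            continue          # leave the request queued; retry later
        path.unlink()         # success: remove the request from the queue
        launched.append(request["crawl_id"])
    return launched
```

In production a proper broker (e.g. RabbitMQ) would replace the spool directory, but the ack-on-success pattern is the same.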

One option is to create a normal server daemon process. We've tended to do this in the past, but this has led to various important services being spread over a number of machines. This makes the dependencies difficult to manage and the processing difficult to monitor.

Using Storm would allow us to centralize these daemons and integrate them into our overall monitoring approach. They would also retry robustly and be less dependent on specific hardware systems.

CDX/Remote Resource Index Servers

  • Various web archiving components may benefit from having the CDX index as an independent, scalable service rather than the usual flat files.
  • If the CDX server also presents an API for updating its index, as well as reading it, it can act as a core, standalone component in a modular architecture.
  • Potential uses include: playback, de-duplication, and 'last seen' state during crawls.
  • The tinycdxserver Dockerfile sets up NLA's read/writable Remote Resource Index server (based on RocksDB) for experimentation. See https://gist.github.com/ato/b2ad8e65b35afe690921 for information on using it.
  • The read-only CDX servers (pywb, OpenWayback) could be unified and extended in this direction.
  • Note that warcbase and OpenWayback can be used together for very large indexes that are best stored in HBase.
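A rough sketch of how a component might talk to such a read/write index over HTTP, assuming a tinycdxserver-style API where plain CDX lines are POSTed to a per-collection endpoint and queried with a ?url= parameter (the endpoint shape, host, and collection name here are assumptions, not a documented contract):

```python
from urllib.parse import urlencode


def capture_query(base, collection, url):
    """Build the lookup URL for querying captures of a given URL."""
    return f"{base}/{collection}?" + urlencode({"url": url})


def add_captures(base, collection, cdx_lines):
    """Update the index by POSTing plain CDX lines to the (assumed
    writable) collection endpoint."""
    import requests  # third-party: pip install requests
    return requests.post(f"{base}/{collection}",
                         data="\n".join(cdx_lines).encode("utf-8"))
```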

Remote Browsers

End-to-End Testing
