REST API
When an ACHE crawl is started, it automatically starts a REST API on port 8080.
If that port is busy, it tries the next ports in sequence (8081, 8082, etc.).
The default HTTP settings can be changed using the following lines in the ache.yml file:
http.port: 8080
http.host: 127.0.0.1
http.cors.enabled: true
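Because the port can shift upward when 8080 is taken, a client may need to probe for the API before using it. The following is a minimal sketch using only the Python standard library. The helper names `candidate_base_urls` and `find_api` are hypothetical, and the number of ports to probe is an assumption, since the text above does not say how many ports ACHE tries.

```python
import http.client
import json
import urllib.error
import urllib.request


def candidate_base_urls(host="127.0.0.1", first_port=8080, attempts=5):
    """List base URLs to probe, starting at the configured port."""
    return [f"http://{host}:{first_port + i}" for i in range(attempts)]


def find_api(host="127.0.0.1", first_port=8080, attempts=5, timeout=2.0):
    """Return the first base URL whose /status endpoint answers with JSON.

    Returns None if no candidate port responds. This only checks that the
    body parses as JSON; it does not prove the service is ACHE.
    """
    for base in candidate_base_urls(host, first_port, attempts):
        try:
            with urllib.request.urlopen(f"{base}/status", timeout=timeout) as resp:
                json.load(resp)  # raises ValueError if the body is not JSON
                return base
        except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
            continue  # nothing usable on this port, try the next one
    return None
```

Calling `find_api()` on the machine running the crawler would return something like `http://127.0.0.1:8080` when the API is reachable; the defaults mirror the `http.host` and `http.port` values shown above.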
Server Mode
Besides using the ache startCrawl command, ACHE can also be started in server
mode and controlled using the web user interface or the REST API.
To start ACHE in server mode, you can use:
ache startServer -d /data -c /config/
Alternatively, if you are using Docker, run:
docker run -v $CONFIG:/config -v $DATA/data:/data vidanyu/ache startServer -d /data -c /config/
where $CONFIG is the path where ache.yml is stored and $DATA is the path where ACHE is going to store its data.
If you want to configure a proxy to serve the ACHE user interface from a non-root
path, you need to specify that path in the ache.yml file using the following
configuration:
http.base_path: /my-new-path
API Endpoints
POST /startCrawl
Starts a crawl.
Request JSON Object:
- crawlType (string) – Type of crawl to be started. Either DeepCrawl or FocusedCrawl.
- model (string) – (Required for FocusedCrawl) A base64 encoded string of the zipped model file. The zip file should contain the model files (pageclassifier.yml) and the seed file (*_seeds.txt).
- seeds (array) – (Required for DeepCrawl) An array of strings. Each string must be a fully-qualified URL that will be used as a starting point of the crawl.
Request body example for DeepCrawl:
{ "crawlType": "DeepCrawl", "seeds": ["http://en.wikipedia.org/", "http://example.com/"], "model": null }
Request body example for FocusedCrawl:
{ "crawlType": "FocusedCrawl", "seeds": null, "model": "<Base64 encoded zipped model file>" }
Response body example:
{ "message": "Crawler started successfully.", "crawlerStarted": true }
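The two request bodies above can be built programmatically. This is a sketch under stated assumptions, not ACHE's own client code: the function names `deep_crawl_body` and `focused_crawl_body` are hypothetical, and the zip is assembled in memory from file contents supplied by the caller.

```python
import base64
import io
import json
import zipfile


def deep_crawl_body(seeds):
    """JSON body for a DeepCrawl: seeds are set and model is left null."""
    return json.dumps({"crawlType": "DeepCrawl", "seeds": list(seeds), "model": None})


def focused_crawl_body(model_files):
    """JSON body for a FocusedCrawl.

    model_files maps file names inside the zip (for example
    "pageclassifier.yml" and a "*_seeds.txt" file) to their contents as bytes.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in model_files.items():
            archive.writestr(name, content)
    # The endpoint expects the zipped model as a base64 encoded string.
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    return json.dumps({"crawlType": "FocusedCrawl", "seeds": None, "model": encoded})
```

The returned string would be sent as the body of a POST request to /startCrawl. Sending it with a `Content-Type: application/json` header is an assumption; the text above does not state which headers are required.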
GET /status
Returns the status of the currently running crawl.
Response body example:
{ "status": 200, "version": "0.9.0-SNAPSHOT", "searchEnabled": false, "crawlerRunning": true, "crawlerState": "RUNNING" }
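A client can poll this endpoint to find out when a crawl has finished. The sketch below assumes the `crawlerRunning` field shown in the example response; the helper names and the `fetch_status` callable (any function that returns the raw response body of GET /status) are hypothetical, which keeps the polling logic independent of the HTTP library used.

```python
import json
import time


def crawler_running(status_body):
    """True if a /status response body reports a running crawler."""
    status = json.loads(status_body)
    return bool(status.get("crawlerRunning", False))


def wait_until_stopped(fetch_status, interval=5.0, max_polls=120):
    """Poll until the crawler is no longer running.

    Returns True if the crawler stopped within max_polls checks,
    False if it was still running after the last check.
    """
    for _ in range(max_polls):
        if not crawler_running(fetch_status()):
            return True
        time.sleep(interval)
    return False
```

With the standard library, `fetch_status` could be a small function that opens the /status URL with `urllib.request.urlopen` and returns `resp.read()`.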
GET /metrics
Returns detailed runtime metrics of the current crawler execution. Metrics are generated using the Dropwizard Metrics library.
Response body example:
{ "version": "3.1.3", "gauges": { "downloader.dispatch_queue.size": { "value": 0 }, "downloader.download_queue.size": { "value": 0 }, "downloader.pending_downloads": { "value": 2 }, "downloader.running_handlers": { "value": 1 }, "downloader.running_requests": { "value": 1 }, "frontier_manager.last_load.available": { "value": 0 }, "frontier_manager.last_load.rejected": { "value": 11610 }, "frontier_manager.last_load.uncrawled": { "value": 11610 }, "frontier_manager.scheduler.empty_domains": { "value": 0 }, "frontier_manager.scheduler.non_expired_domains": { "value": 1 }, "frontier_manager.scheduler.number_of_links": { "value": 2422 }, "target.storage.harvest.rate": { "value": 0.9777777777777777 } }, "counters": { "downloader.fetches.aborted": { "count": 0 }, "downloader.fetches.errors": { "count": 1 }, "downloader.fetches.successes": { "count": 48 }, "downloader.http_response.status.2xx": { "count": 47 }, "downloader.http_response.status.401": { "count": 0 }, "downloader.http_response.status.403": { "count": 0 }, "downloader.http_response.status.404": { "count": 1 }, "downloader.http_response.status.5xx": { "count": 0 }, "target.storage.pages.downloaded": { "count": 45 }, "target.storage.pages.relevant": { "count": 44 } }, "histograms": {}, "meters": {}, "timers": { "downloader.fetch.time": { "count": 48, "max": 584.693196, "mean": 160.64529857175228, "min": 51.161457, "p50": 114.816344, "p75": 218.304927, "p95": 377.469511, "p98": 584.693196, "p99": 584.693196, "p999": 584.693196, "stddev": 118.74270199105285, "m15_rate": 0.4281665582051108, "m1_rate": 0.7030438799915493, "m5_rate": 0.4803778789487069, "mean_rate": 0.9178383293058442, "duration_units": "milliseconds", "rate_units": "calls/second" }, [... Other metrics...] } }
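The response is large, so a client will usually pick out a few values. The sketch below reads only metric names that appear in the example above; the function name `summarize_metrics` is hypothetical. In the example, the `target.storage.harvest.rate` gauge equals relevant pages divided by downloaded pages (44 / 45). That is an observation about this sample, not a documented definition.

```python
def summarize_metrics(metrics):
    """Extract a few headline numbers from a parsed /metrics response."""
    counters = metrics.get("counters", {})
    gauges = metrics.get("gauges", {})

    def count(name):
        # Missing counters are treated as zero.
        return counters.get(name, {}).get("count", 0)

    downloaded = count("target.storage.pages.downloaded")
    relevant = count("target.storage.pages.relevant")
    return {
        "fetch_successes": count("downloader.fetches.successes"),
        "fetch_errors": count("downloader.fetches.errors"),
        "pages_downloaded": downloaded,
        "pages_relevant": relevant,
        # None when nothing has been downloaded yet (avoids division by zero).
        "relevant_ratio": relevant / downloaded if downloaded else None,
        "harvest_rate_gauge": gauges.get("target.storage.harvest.rate", {}).get("value"),
    }
```

The input is the response body already parsed with `json.loads`, so the function works on any dictionary with the same shape.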
GET /stopCrawl
Stops the crawler execution if there is a crawler running.
Query Parameters:
- awaitStopped (boolean) – One of true or false (default). Indicates whether the request should block until the crawler is completely stopped.
Response body example:
{ "message": "Crawler shutdown initiated.", "shutdownInitiated": true, "crawlerStopped": false }
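The query parameter can be attached to the URL as follows. This is a small sketch; `stop_crawl_url` and `crawl_stopped` are hypothetical helper names, and the base URL is whatever host and port the API is listening on.

```python
import json
import urllib.parse


def stop_crawl_url(base_url, await_stopped=False):
    """Build the /stopCrawl URL with an explicit awaitStopped value."""
    query = urllib.parse.urlencode({"awaitStopped": "true" if await_stopped else "false"})
    return f"{base_url.rstrip('/')}/stopCrawl?{query}"


def crawl_stopped(response_body):
    """True if the response reports that the crawler has fully stopped."""
    return bool(json.loads(response_body).get("crawlerStopped", False))
```

A GET request to `stop_crawl_url("http://127.0.0.1:8080", await_stopped=True)` would block until shutdown completes, according to the parameter description above, so the client timeout should be chosen accordingly.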
POST /seeds
Adds more seeds to the crawl if there is a crawler running.
Request JSON Object:
- seeds (array) – An array containing the URLs to be added to the crawl that is currently running.
Request body example:
{ "seeds": ["http://en.wikipedia.org/", "http://example.com/"] }
Response body example:
{ "message": "Seeds added successfully.", "addedSeeds": true }
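A request like the one above could be prepared with Python's standard library as sketched below. The helper name `add_seeds_request` is hypothetical, and the `Content-Type` header is an assumption, since the text does not state which headers the endpoint requires.

```python
import json
import urllib.request


def add_seeds_request(base_url, seeds):
    """Prepare (but do not send) a POST /seeds request."""
    body = json.dumps({"seeds": list(seeds)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/seeds",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request; the response body can then be parsed with `json.load` to check the `addedSeeds` field.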