ACHE allows one to customize which domains and paths within a domain
should be crawled. This can be done by configuring link filters using
regular expressions (regex) or wildcard patterns.
Regex filters are evaluated using Java’s regular expression rules,
and wildcard filters accept only the special character
*, which matches any sequence of characters.
Link filters are composed of two lists of patterns:
- whitelists - patterns for URLs that are allowed to be followed, i.e., any URL that doesn’t match the patterns is discarded.
- blacklists - patterns for URLs that are NOT allowed to be followed, i.e., any URL that matches the patterns is discarded.
Link filters can have global or per-domain scope. Global filters are
evaluated against all links, whereas per-domain filters are evaluated only
against URLs that belong to the specified domain (matched at the top-private domain level only).
There are two ways to configure link filters:
- .yml file: Allows you to configure global and per-domain link filters using YAML.
- .txt files: Allow you to configure only regex-based global link filters.
Configuring using YAML
ACHE automatically searches for the link filters .yml file
in the same directory as the
ache.yml file. This file can contain a single
global entry and one entry per domain. Each entry should
specify a type (regex or wildcard) and a list of “whitelist” and
“blacklist” patterns.
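The structure described above can be sketched as follows. This is only an illustration: the key layout is inferred from the description (a global entry, one entry per domain, each with a type and whitelist/blacklist lists), and every domain name and pattern is hypothetical.

```yaml
# Hypothetical link filter file; all domains and patterns are placeholders.
global:                 # evaluated against every link
  type: wildcard
  whitelist:
    - "http://*.example.com/*"
  blacklist:
    - "*bad-bad-pattern*"
example1.com:           # evaluated only against links in this domain
  type: wildcard
  blacklist:
    - "*/path/*"
example2.com:
  type: regex           # Java regular expression syntax
  whitelist:
    - 'https?://www\.example2\.com/some-path/.*'
```

Patterns are quoted because a leading * would otherwise be read as a YAML alias. Single quotes are used for the regex so the backslashes reach the regex engine unchanged.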
Configuring using .txt files
This is the old way to configure link filters. Only regex-based “global” filters
can be configured, i.e., filters that are applied to all URLs.
To configure a link filter, you will need to create text files containing one
regular expression per line.
All regular expressions loaded are evaluated against all links found on the
web pages crawled in order to determine whether the crawler should accept or
reject them.
For whitelist filters, ACHE will automatically search for a file named
link_whitelist.txt, whereas for blacklist filters the file name is
link_blacklist.txt. These files should be placed in the same directory
as ache.yml. They are loaded once during crawler start-up.
The link filter files should look like this:
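As a minimal sketch, a link_whitelist.txt could contain lines such as the following, with one Java regular expression per line. The domains and paths are hypothetical.

```
https?://www\.example\.com/some_path/.*
https?://www\.another-example\.com/some_path/.*
```

A link_blacklist.txt uses the same one-pattern-per-line layout. Literal dots are escaped with a backslash, since an unescaped dot matches any character in a regular expression.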