Data Formats
ACHE can store data in different data formats. The data format can be configured by changing the key target_storage.data_format.type in the configuration file.
The data formats currently available are:
FILESYSTEM_*
Each page is stored in a single file, and files are organized in directories (one for each domain). The suffix in the data format name determines how the content of each file is formatted:
- FILESYSTEM_HTML - only the raw content (HTML or binary data) is stored in files. Useful for testing and for opening the HTML files in a browser.
- FILESYSTEM_JSON - raw content and some metadata are stored in files using the JSON format.
- FILESYSTEM_CBOR - raw content and some metadata are stored in files using the CBOR format.
When using any FILESYSTEM_* data format, you can enable compression (DEFLATE) of the data stored in the files by enabling the following line in the config file:
target_storage.data_format.filesystem.compress_data: true
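As a rough illustration of reading a compressed file back, the sketch below inflates DEFLATE-compressed bytes with Python's zlib module. This page does not say whether ACHE writes a zlib-wrapped or a raw DEFLATE stream, so the helper (a hypothetical name, `inflate`) tries both. Treat that as an assumption to verify against real crawler output.

```python
import zlib


def inflate(data: bytes) -> bytes:
    """Decompress DEFLATE data, trying the zlib-wrapped form first.

    Assumption: the container format used by ACHE is not stated in this
    document, so we fall back to a raw DEFLATE stream if the zlib header
    check fails.
    """
    try:
        return zlib.decompress(data)
    except zlib.error:
        # Negative wbits tells zlib to expect a raw stream with no header.
        return zlib.decompress(data, -zlib.MAX_WBITS)


# Round-trip demonstration with made-up page content.
page = b"<html><body>example page</body></html>"
wrapped = zlib.compress(page)

raw_compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw = raw_compressor.compress(page) + raw_compressor.flush()
```

Both `inflate(wrapped)` and `inflate(raw)` should return the original bytes, so the same helper can be pointed at a stored file's contents whichever variant is in use.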
By default, the name of each file is an encoded URL. This can cause problems when the URL is very long. To avoid this, you can configure the file format to use a fixed-size hash of the URL, instead of the URL itself, as the file name:
target_storage.data_format.filesystem.hash_file_name: true
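To see why hashing helps, the sketch below compares a percent-encoded URL file name with a fixed-size hash of the same URL. The encoding scheme and hash function that ACHE actually uses are not specified here. `urllib.parse.quote` and SHA-256 are stand-ins chosen for illustration, and both helper names are hypothetical.

```python
import hashlib
from urllib.parse import quote


def encoded_file_name(url: str) -> str:
    # Percent-encode every reserved character so the URL fits in one path component.
    return quote(url, safe="")


def hashed_file_name(url: str) -> str:
    # A SHA-256 hex digest is always 64 characters, whatever the URL length.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# A made-up URL with a long query string.
long_url = "http://example.com/search?q=" + "a" * 500
```

Many file systems cap a file name at 255 bytes, so the encoded name for `long_url` would be rejected there, while the hashed name stays at a constant length.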
Warning
The FILESYSTEM_* formats are not recommended for large crawls, since they can quickly create millions of files and cause file system problems.
FILES
Raw content and metadata are stored in rolling compressed files of fixed size (256MB). Each file is a JSON lines file (each line contains one JSON object) compressed using the DEFLATE algorithm. Each JSON object has the following fields:
- url - The requested URL
- redirected_url - The URL of the final redirection, if applicable
- content - A Base64 encoded string containing the page content
- content_type - The mime-type returned in the HTTP response
- response_headers - An array containing the HTTP response headers
- fetch_time - An integer containing the time when the page was fetched (epoch)
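The fields listed above can be read back with a few lines of Python. This is a minimal sketch that works on one already-decompressed JSON line. The sample record is made up, the shape of the `response_headers` entries is not documented here (so it is left empty), and `parse_record` is a hypothetical helper name.

```python
import base64
import json

# A made-up record with the fields listed above; the values are illustrative only.
sample_line = json.dumps({
    "url": "http://example.com/",
    "redirected_url": "http://example.com/index.html",
    "content": base64.b64encode(b"<html>hello</html>").decode("ascii"),
    "content_type": "text/html",
    "response_headers": [],  # element structure is not specified in this document
    "fetch_time": 1429167830,  # epoch value; the unit is an assumption to verify
})


def parse_record(line: str) -> dict:
    """Parse one JSON line and decode its Base64 page content to bytes."""
    record = json.loads(line)
    record["content"] = base64.b64decode(record["content"])
    return record


record = parse_record(sample_line)
```

After parsing, `record["content"]` holds the raw page bytes, which can be written to disk or decoded according to `content_type`.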
ELASTICSEARCH
The ELASTICSEARCH data format stores raw content and metadata as documents in an Elasticsearch index.
Types and fields
Currently, ACHE indexes documents into two Elasticsearch types:
- target, for pages classified as on-topic by the page classifier
- negative, for pages classified as off-topic by the page classifier
These two types use the same schema, which has the following fields:
- domain - domain of the URL
- topPrivateDomain - top private domain of the URL
- url - complete URL
- title - title of the page, extracted from the HTML <title> tag
- text - clean text extracted from the HTML using Boilerpipe’s DefaultExtractor
- retrieved - date when the page was fetched, in ISO-8601 representation, for example “2015-04-16T07:03:50.257+0000”
- words - array of strings with tokens extracted from the text content
- wordsMeta - array of strings with tokens extracted from the <meta> tags of the HTML content
- html - raw HTML content
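When consuming indexed documents, the `retrieved` value can be turned into a timezone-aware datetime. The sketch below assumes every value follows the same pattern as the example in the field list (millisecond fraction plus a numeric UTC offset). `parse_retrieved` is a hypothetical helper name.

```python
from datetime import datetime

# Format matching the documented example: fractional seconds plus a numeric UTC offset.
RETRIEVED_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_retrieved(value: str) -> datetime:
    """Parse a `retrieved` timestamp into a timezone-aware datetime."""
    return datetime.strptime(value, RETRIEVED_FORMAT)


fetched_at = parse_retrieved("2015-04-16T07:03:50.257+0000")
```

Values that deviate from this pattern (for instance, without fractional seconds) would raise `ValueError`, so a real consumer may need a more tolerant parser.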
Configuration
To use the Elasticsearch data format, you need to add the following line to the configuration file ache.yml:
target_storage.data_format.type: ELASTICSEARCH
You will also need to specify the host address and port where Elasticsearch is running. See the following subsections for more details.
REST Client (ACHE version >= 0.8)
Starting in version 0.8, ACHE uses the official Java REST client to connect to Elasticsearch. You can specify one or more Elasticsearch node addresses for the REST client to connect to using the following lines:
target_storage.data_format.elasticsearch.rest.hosts:
- http://node1:9200
- http://node2:9200
The following additional parameters can also be configured. Refer to the Elasticsearch REST Client documentation for more information on these parameters.
target_storage.data_format.elasticsearch.rest.connect_timeout: 30000
target_storage.data_format.elasticsearch.rest.socket_timeout: 30000
target_storage.data_format.elasticsearch.rest.max_retry_timeout_millis: 90000
Transport Client (deprecated)
You can also configure ACHE to connect to Elasticsearch v1.x using the native transport client by adding the following lines:
target_storage.data_format.elasticsearch.host: localhost
target_storage.data_format.elasticsearch.port: 9300
target_storage.data_format.elasticsearch.cluster_name: elasticsearch
Command line parameters
When running ACHE with Elasticsearch, provide the name of the Elasticsearch index to use on the command line with one of the following arguments:
-e <arg>
or:
--elasticIndex <arg>