############
Data Formats
############

.. highlight :: yaml

ACHE can store data in different data formats. The data format can be configured by changing the key ``target_storage.data_format.type`` in the `configuration file <https://github.com/ViDA-NYU/ache/blob/master/config/sample_config/ache.yml>`_.

The data formats currently available are:

* :ref:`FILESYSTEM_HTML, FILESYSTEM_JSON, FILESYSTEM_CBOR <dataformat-filesystem>`
* :ref:`FILES <dataformat-files>`
* :ref:`WARC <dataformat-warc>`
* :ref:`ELASTICSEARCH <dataformat-elasticsearch>`
* :ref:`KAFKA <dataformat-kafka>`


.. _dataformat-filesystem:

------------
FILESYSTEM_*
------------

Each page is stored in a single file, and files are organized in directories (one for each domain).
The suffix in the data format name determines how the content of each file is formatted:

* ``FILESYSTEM_HTML`` - only the raw content (HTML or binary data) is stored in files. Useful for testing and for opening the HTML files in a browser.
* ``FILESYSTEM_JSON`` - raw content and some metadata are stored in files using the JSON format.
* ``FILESYSTEM_CBOR`` - raw content and some metadata are stored in files using the `CBOR <http://cbor.io>`_ format.


When using any ``FILESYSTEM_*`` data format, you can enable compression (DEFLATE)
of the data stored in the files by enabling the following line in the configuration file::

  target_storage.data_format.filesystem.compress_data: true

By default, the name of each file is the encoded URL.
Unfortunately, this can cause problems when the URL is very long.
To avoid this, you can configure the file format to use a fixed-size hash of the URL, instead of the URL itself, as the file name::

  target_storage.data_format.filesystem.hash_file_name: true
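
The file name problem can be illustrated with a short Python sketch. It assumes
percent-encoding for the "encoded URL" and SHA-256 for the hash; both are
illustrative choices only, since the exact encoding and hash function used by
ACHE are not described here. The 255-byte limit is likewise an assumption that
holds for many, but not all, file systems.

.. code:: python

  import hashlib
  from urllib.parse import quote

  MAX_NAME_BYTES = 255  # typical file name limit (assumption; varies by file system)

  def encoded_name(url: str) -> str:
      """Percent-encode the URL so it contains no path separators (illustrative)."""
      return quote(url, safe="")

  def hashed_name(url: str) -> str:
      """Fixed-size name: SHA-256 hex digest (hypothetical choice of hash)."""
      return hashlib.sha256(url.encode("utf-8")).hexdigest()

  long_url = "http://example.com/search?q=" + "a" * 300
  print(len(encoded_name(long_url)) > MAX_NAME_BYTES)  # encoded name exceeds the limit
  print(len(hashed_name(long_url)))                     # hashed name has a constant length

Whatever hash is used, the name length no longer depends on the URL length, at
the cost of file names that are not human-readable.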


.. Warning ::

  The ``FILESYSTEM_*`` formats are not recommended for large crawls, since they can quickly create millions of files and cause file system problems.


.. _dataformat-files:

-----
FILES
-----

Raw content and metadata are stored in rolling compressed files of fixed size (256MB).
Each file is a JSON lines file (each line contains one JSON object) compressed using the DEFLATE algorithm.
Each JSON object has the following fields:

* ``url`` - The requested URL
* ``redirected_url`` - The URL of the final redirection, if any
* ``content`` - A Base64 encoded string containing the page content
* ``content_type`` - The MIME type returned in the HTTP response
* ``response_headers`` - An array containing the HTTP response headers
* ``fetch_time`` - An integer containing the time when the page was fetched (epoch)
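
As an illustration, the following Python sketch reads the records of one such
file from memory. It assumes the file is a zlib-wrapped DEFLATE stream (for a
raw DEFLATE stream, ``zlib.decompress(data, wbits=-15)`` would be needed
instead) and that the JSON text is UTF-8 encoded. The helper name
``iter_records`` and the sample record are hypothetical.

.. code:: python

  import base64
  import json
  import zlib

  def iter_records(compressed: bytes):
      """Yield one dict per JSON line in a DEFLATE-compressed buffer."""
      # Assumption: zlib-wrapped DEFLATE; not confirmed by this documentation.
      text = zlib.decompress(compressed).decode("utf-8")
      for line in text.split("\n"):
          if line.strip():
              yield json.loads(line)

  # Build a tiny sample file in memory (made-up data, for demonstration only).
  sample = {
      "url": "http://example.com/",
      "redirected_url": "http://example.com/index.html",
      "content": base64.b64encode(b"<html>hello</html>").decode("ascii"),
      "content_type": "text/html",
      "response_headers": [],
      "fetch_time": 1429167830,
  }
  data = zlib.compress((json.dumps(sample) + "\n").encode("utf-8"))

  for record in iter_records(data):
      page = base64.b64decode(record["content"])  # raw page bytes
      print(record["url"], record["content_type"], len(page))

For real files of this size, streaming decompression (for example with
``zlib.decompressobj``) would avoid loading the whole file into memory.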

.. _dataformat-warc:

-----
WARC
-----

Raw content and metadata are stored in `WARC <https://en.wikipedia.org/wiki/Web_ARChive>`_ files.
WARC is the standard format used by the Internet Archive and by other public web datasets
such as "Common Crawl" and "ClueWeb".
See http://commoncrawl.org/2014/04/navigating-the-warc-file-format/ for more details on the WARC format.

Every WARC file generated by ACHE contains one ``warcinfo`` entry and one
``response`` entry for each downloaded page.
By default, the files are compressed using the GZIP format and have an approximate
size of 250MB (usually slightly larger).
The default settings can be changed using the following entries in the ``ache.yml`` file:

.. code:: yaml

  target_storage.data_format.type: WARC                    # enable WARC file format
  target_storage.data_format.warc.compress: true           # enable GZIP compression
  target_storage.data_format.warc.max_file_size: 262144000 # maximum file size in bytes (250 MiB)

Finally, ACHE also stores additional metadata as non-standard extension WARC
headers prefixed by ``ACHE-*`` (e.g., ``ACHE-IsRelevant``, ``ACHE-Relevance``).
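
The extension headers can be picked out of a record's header block by their
prefix, as the following Python sketch suggests. The helper ``ache_headers``
and the sample header values are hypothetical; in practice a WARC library
would normally handle the parsing of records.

.. code:: python

  def ache_headers(header_block: str) -> dict:
      """Return the ACHE-* extension headers found in a WARC record header block."""
      headers = {}
      for line in header_block.splitlines()[1:]:  # skip the "WARC/1.0" version line
          if not line.strip():
              break  # a blank line ends the header block
          name, _, value = line.partition(":")
          if name.strip().startswith("ACHE-"):
              headers[name.strip()] = value.strip()
      return headers

  # Hypothetical record header; the ACHE-* values are made up for illustration.
  block = (
      "WARC/1.0\r\n"
      "WARC-Type: response\r\n"
      "WARC-Target-URI: http://example.com/\r\n"
      "ACHE-IsRelevant: true\r\n"
      "ACHE-Relevance: 0.93\r\n"
      "\r\n"
  )
  print(ache_headers(block))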

.. _dataformat-elasticsearch:

-------------
ELASTICSEARCH
-------------

The ELASTICSEARCH data format stores raw content and metadata as documents in
an Elasticsearch index.

Types and fields
************************

Currently, ACHE indexes documents into one Elasticsearch type named ``page``
(or any name specified using the :ref:`command line <dataformat-elasticsearch-cliparams>`
during the crawl initialization).
The Elasticsearch mapping for this type is automatically created and contains
the following fields:

* ``domain`` - domain of the URL
* ``topPrivateDomain`` -  top private domain of the URL
* ``url`` - complete URL
* ``title`` - title of the page extracted from the HTML tag ``<title>``
* ``text`` - clean text extracted from the HTML using Boilerpipe's DefaultExtractor
* ``retrieved`` - date and time when the page was fetched, in ISO-8601 representation (e.g., ``2015-04-16T07:03:50.257+0000``)
* ``words`` - array of strings with tokens extracted from the text content
* ``wordsMeta`` - array of strings with tokens extracted from the ``<meta>`` tags of the HTML content
* ``html`` - raw HTML content
* ``isRelevant`` - indicates whether the page was classified as relevant or
  irrelevant by the target page classifier. This is a keyword field
  (a non-analyzed string) containing either ``relevant`` or ``irrelevant``.
* ``relevance`` - indicates the confidence of the target page classifier output.
  This is a decimal number in the range from 0.0 to 1.0.
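
Because ``isRelevant`` is a keyword field and ``relevance`` is numeric, pages
can be filtered with ordinary ``term`` and ``range`` clauses of the
Elasticsearch query DSL. The Python sketch below only builds the request body;
the function name ``relevant_pages_query`` and the threshold ``0.8`` are
hypothetical choices.

.. code:: python

  import json

  def relevant_pages_query(min_relevance: float) -> dict:
      """Build a search body for relevant pages at or above a confidence level."""
      return {
          "query": {
              "bool": {
                  "filter": [
                      {"term": {"isRelevant": "relevant"}},
                      {"range": {"relevance": {"gte": min_relevance}}},
                  ]
              }
          },
          # Return only a few of the fields listed above.
          "_source": ["url", "title", "retrieved"],
      }

  print(json.dumps(relevant_pages_query(0.8), indent=2))

The resulting JSON can be sent as the body of a request to the ``_search``
endpoint of the index used by the crawl.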


Configuration
*************

To use the Elasticsearch data format, you need to add the following line to the
configuration file ``ache.yml``::

  target_storage.data_format.type: ELASTICSEARCH

You will also need to specify the host address and port where Elasticsearch is running.
See the following subsections for more details.

**REST Client (ACHE version >=0.8)**

Starting in version 0.8, ACHE uses the official
`Java REST client <https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/index.html>`_
to connect to Elasticsearch. You can specify one or more Elasticsearch node
addresses which the REST client should connect to using the following lines:

.. code:: yaml

  target_storage.data_format.elasticsearch.rest.hosts:
    - http://node1:9200
    - http://node2:9200

The following additional parameters can also be configured. Refer to
the Elasticsearch `REST Client documentation <https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/_timeouts.html>`_
for more information on these parameters.

.. code:: yaml

  target_storage.data_format.elasticsearch.rest.connect_timeout: 30000
  target_storage.data_format.elasticsearch.rest.socket_timeout: 30000
  target_storage.data_format.elasticsearch.rest.max_retry_timeout_millis: 90000

**HTTP Basic Authentication**

You can configure the username and password by adding the following to the ``ache.yml`` file:

.. code:: yaml

    target_storage.data_format.elasticsearch.rest.username: myusername
    target_storage.data_format.elasticsearch.rest.password: mypasswd


**Transport Client (removed in version 0.11)**

In versions before 0.11, you can also configure ACHE to connect to Elasticsearch v1.x
using the native transport client by adding the following lines::

  target_storage.data_format.elasticsearch.host: localhost
  target_storage.data_format.elasticsearch.port: 9300
  target_storage.data_format.elasticsearch.cluster_name: elasticsearch


.. _dataformat-elasticsearch-cliparams:

Command line parameters
****************************************

When running ACHE with Elasticsearch, you must provide the name of the
Elasticsearch index to be used, by passing the following parameter to the
CLI::

  -e <arg>

or::

  --elasticIndex <arg>

You can optionally provide the Elasticsearch type name to be used::

  -t <arg>

or::

  --elasticType <arg>

Run ``ache help startCrawl`` for more details on available parameters.


.. _dataformat-kafka:

-------------
KAFKA
-------------

The KAFKA data format pushes crawled pages to an
`Apache Kafka <https://kafka.apache.org/>`_ topic. To configure this format,
add the following lines to the ``ache.yml`` configuration file:

.. code:: yaml

  target_storage.data_format.type: KAFKA                    # enable KAFKA data format
  target_storage.data_format.kafka.topic_name: mytopicname  # the name of the topic
  target_storage.data_format.kafka.format: JSON             # value of messages will be a JSON object
  target_storage.data_format.kafka.properties:
    # The properties to be used while initializing the Kafka Producer
    bootstrap.servers: localhost:9092
    acks: all
    retries: 0
    batch.size: 5
    linger.ms: 100
    buffer.memory: 33554432


Currently, the following message formats are supported:

* ``JSON``: A JSON object using the same schema defined in the :ref:`FILES <dataformat-files>` data format.
* ``ELASTIC``: A JSON object with the same fields described in the :ref:`ELASTICSEARCH <dataformat-elasticsearch>` data format.
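
The practical difference between the two formats is where the page body lives:
a Base64 encoded ``content`` field in one case, a raw ``html`` field in the
other. The Python sketch below decodes a message value accordingly. It assumes
the value is a UTF-8 encoded JSON document and the page is text; the function
name ``page_html`` and the sample messages are hypothetical, and consuming
messages from Kafka (which requires a client library) is not shown.

.. code:: python

  import base64
  import json

  def page_html(value: bytes, message_format: str) -> str:
      """Extract the page body from a Kafka message value (assumed UTF-8 JSON)."""
      doc = json.loads(value.decode("utf-8"))
      if message_format == "JSON":
          # FILES schema: the ``content`` field is Base64 encoded.
          return base64.b64decode(doc["content"]).decode("utf-8", errors="replace")
      if message_format == "ELASTIC":
          # ELASTICSEARCH fields: the ``html`` field holds the raw HTML.
          return doc["html"]
      raise ValueError(f"unsupported message format: {message_format}")

  # Made-up message values, built here only to exercise the function.
  json_msg = json.dumps(
      {"url": "http://example.com/",
       "content": base64.b64encode(b"<p>hi</p>").decode("ascii")}
  ).encode("utf-8")
  elastic_msg = json.dumps(
      {"url": "http://example.com/", "html": "<p>hi</p>"}
  ).encode("utf-8")

  print(page_html(json_msg, "JSON"))
  print(page_html(elastic_msg, "ELASTIC"))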