Crawling Dark Web Sites on the TOR network

TOR is well-known software that enables anonymous communication, and it is becoming more popular due to increasing media coverage of dark web sites. “Dark Web” sites are usually not crawled by generic crawlers because the web servers are hidden in the TOR network and can only be accessed through specific protocols. Sites hidden on the TOR network are accessed via domain addresses under the top-level domain .onion. To crawl such sites, ACHE relies on external HTTP proxies, such as Privoxy, configured to route traffic through the TOR network. Besides configuring the proxy, we just need to configure ACHE to route requests to .onion addresses via the TOR proxy.

Fully configuring a web proxy to route traffic through TOR is out of scope for this tutorial, so we will just use Docker to run the pre-configured Docker image for Privoxy/TOR available at https://hub.docker.com/r/dperson/torproxy/. For convenience, we will also run ACHE and Elasticsearch in Docker containers.

To start and stop the containers, we will use docker-compose, so make sure that the Docker version that you installed includes it. You can verify whether it is installed by running the following command on the Terminal (it should print the version of docker-compose to the output):

docker-compose -v

The following steps explain in detail how to crawl .onion sites using ACHE.

1. Create the configuration files

All the configuration files needed are available in ACHE’s repository at config/config_docker_tor (if you have already cloned the git repository, you won’t need to download them). Download the following files and put them in a single directory named config_docker_tor:

tor.seeds: a plain text file containing the URLs of the sites you want to crawl. In this example, the file contains a few URLs taken from https://thehiddenwiki.org/. If you want to crawl specific websites, you should list them in this file (one URL per line).

ache.yml: the configuration file for ACHE. It configures ACHE to run an in-depth website crawl of the seed URLs, to index crawled pages in the Elasticsearch container, and to download pages from .onion addresses using the TOR proxy container.

docker-compose.yml: a configuration file for Docker, which specifies which containers should be used. It starts an Elasticsearch node, the TOR proxy, and the ACHE crawler.

If you are using Mac or Linux, you can run the following commands on the Terminal to create a folder and download the files automatically:
mkdir config_docker_tor/
cd config_docker_tor/
curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/ache.yml
curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/docker-compose.yml
curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/tor.seeds
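
Since tor.seeds is expected to hold one URL per line, a quick sanity check before starting the crawl can catch stray text or non-.onion entries. The following is a minimal sketch for a TOR-only crawl; check_seeds is a hypothetical helper (not part of ACHE), and the pattern it uses for .onion URLs is an assumption about what a well-formed seed looks like.

```shell
# Sketch: sanity-check a seeds file before starting a TOR-only crawl.
# check_seeds is a hypothetical helper, not part of ACHE.
check_seeds() {
  seeds_file="$1"
  # Count the non-empty lines in the file.
  total=$(grep -c -v '^[[:space:]]*$' "$seeds_file" || true)
  # Count the lines that look like http(s) URLs on a .onion host.
  onion=$(grep -c -E '^https?://[^/[:space:]]+\.onion(:[0-9]+)?(/.*)?$' "$seeds_file" || true)
  echo "total=$total onion=$onion"
  # Succeed only if the file is non-empty and every entry is a .onion URL.
  [ "$total" -gt 0 ] && [ "$total" -eq "$onion" ]
}
```

Running check_seeds tor.seeds prints both counts and returns a non-zero status when some line does not match, so it can be chained in front of the start command in the next step.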

2. Start the Docker containers

Enter the directory config_docker_tor you just created and start the containers with docker-compose:
docker-compose up -d
This command will automatically download all docker images and start all necessary containers in background mode. The downloads may take a while to finish depending on your Internet connection speed.

3. Monitor the crawl progress

Once all Docker images have been downloaded and all services have been started, you will be able to open ACHE’s web interface at http://localhost:8080 to see some crawl metrics. If you want to follow the crawler logs, you can run:
docker-compose logs -f
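
The services can take a while to come up, so it may help to poll the web interface instead of refreshing the browser. The sketch below defines a hypothetical helper, wait_for_url, that retries a URL with curl until it answers. The attempt count and delay are arbitrary defaults, and the Elasticsearch index name tor in the usage comment is an assumption based on the -e tor argument in docker-compose.yml.

```shell
# Poll a URL until it answers with a successful HTTP status, or give up.
# wait_for_url is a hypothetical helper; arguments: URL, max attempts, delay (s).
wait_for_url() {
  url="$1"; attempts="${2:-30}"; delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: quiet, -f: fail on HTTP errors, -o /dev/null: discard the body
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "up: $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gave up waiting for $url" >&2
  return 1
}

# Example usage (requires the containers from step 2 to be running):
#   wait_for_url http://localhost:8080
#   curl -s 'http://localhost:9200/tor/_count?pretty'   # number of indexed pages
```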

4. Stop the Docker containers

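If you started the containers in the foreground (without -d), you can stop them by hitting CTRL+C on Linux (or the equivalent in your OS). Containers started in background mode with -d keep running after CTRL+C, which only ends commands such as docker-compose logs -f. To stop and remove the containers, run the following command: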
docker-compose down

Understanding the docker-compose.yml file

In docker-compose.yml, we configure a container named torproxy for the TOR proxy, which listens on port 8118:

torproxy:
  image: dperson/torproxy
  ports:
    - "8118:8118"

We also configure an Elasticsearch node named elasticsearch that listens on port 9200 (with some common Elasticsearch settings added):

elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:6.8.22
  environment:
    - discovery.type=single-node
    - cluster.name=docker-cluster
    - bootstrap.memory_lock=true
  ulimits:
    memlock:
      soft: -1
      hard: -1
  volumes:
    - ./data-es/:/usr/share/elasticsearch/data # elasticsearch data will be stored at ./data-es/
  ports:
    - 9200:9200

Finally, we configure a container named ache. To make the config (ache.yml) and seeds (tor.seeds) files available inside the container, we mount the current working directory at /config. We also mount the directory ./data-ache at /data so that the crawled data is stored outside the container. To let ACHE communicate with the other containers, we link the ache container to the other two containers, elasticsearch and torproxy.

ache:
  image: vidanyu/ache
  entrypoint: sh -c 'sleep 10 && /ache/bin/ache startCrawl -c /config/ -s /config/tor.seeds -o /data -e tor'
  ports:
    - "8080:8080"
  volumes:
    # mounts /config and /data directories to paths relative to path where this file is located
    - ./data-ache/:/data
    - ./:/config
  links:
    - torproxy
    - elasticsearch
  depends_on:
    - torproxy
    - elasticsearch
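
Because the torproxy service publishes port 8118 on the host, the proxy can be checked independently of ACHE. The following sketch assumes curl is installed on the host; via_proxy_status is a hypothetical helper name, and the timeout values are arbitrary.

```shell
# Fetch a URL through an HTTP proxy and print only the HTTP status code.
# via_proxy_status is a hypothetical helper; the default proxy address assumes
# the "8118:8118" port mapping of the torproxy service.
via_proxy_status() {
  url="$1"; proxy="${2:-http://localhost:8118}"
  curl -s -o /dev/null -w '%{http_code}\n' \
    --proxy "$proxy" --connect-timeout 30 --max-time 120 "$url"
}

# Example usage (requires the torproxy container to be running):
#   via_proxy_status https://check.torproject.org/
```

A status of 000 together with a non-zero exit code usually means that curl could not reach the proxy or that the request timed out.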

Understanding the ache.yml file

The ache.yml file basically configures ACHE to index crawled data in the elasticsearch container:

# Configure both ELASTICSEARCH and FILES data formats, so data will be
# stored locally using FILES data format and will be sent to ELASTICSEARCH
target_storage.data_formats:
  - FILES
  - ELASTICSEARCH
# Configure Elasticsearch REST API address
target_storage.data_format.elasticsearch.rest.hosts:
  - http://elasticsearch:9200

and to download .onion addresses using the torproxy container:

crawler_manager.downloader.torproxy: http://torproxy:8118
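
The routing rule described in this tutorial (requests to .onion addresses go through the TOR proxy) can be illustrated with a small sketch. This is not ACHE's implementation; route_for_url is a hypothetical helper that only inspects the host name of a URL, and the "direct" label for all other hosts is an assumption.

```shell
# Sketch of the routing rule: print "tor" for hosts under .onion, else "direct".
# route_for_url is a hypothetical helper, not ACHE code.
route_for_url() {
  # Strip the scheme, then the path/query/fragment, then the port.
  host="${1#*://}"
  host="${host%%[/?#]*}"
  host="${host%%:*}"
  case "$host" in
    *.onion) echo "tor" ;;
    *) echo "direct" ;;
  esac
}
```

For example, route_for_url http://abc234.onion/page prints tor, while a URL such as https://example.com/a.onion prints direct because only the host name is considered.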

All remaining configuration lines are regular ACHE settings for running an in-depth website crawl of the seeds. Refer to the in-depth website crawling tutorial for more details.

Configuring fetcher timeouts

Establishing connections and downloading pages on the TOR network typically take much longer than when crawling websites on the open Web over regular HTTP connections. Therefore, it might be useful to configure longer connection timeouts.

See the HTTP fetcher configuration page for more details on how to increase fetching timeouts for the TOR fetcher.
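
To get a rough idea of which timeout values are reasonable, you can measure how long a single request takes through the proxy. The sketch below is only a measurement aid and does not change ACHE's configuration; time_via_proxy is a hypothetical helper, and the default proxy address assumes the torproxy container is published on port 8118 of the host.

```shell
# Print time-to-first-byte and total time (in seconds) for one proxied fetch.
# time_via_proxy is a hypothetical helper for choosing timeout values.
time_via_proxy() {
  url="$1"; proxy="${2:-http://localhost:8118}"
  curl -s -o /dev/null \
    -w 'first_byte=%{time_starttransfer}s total=%{time_total}s\n' \
    --proxy "$proxy" --connect-timeout 60 --max-time 300 "$url"
}

# Example usage (requires the torproxy container to be running); replace the
# argument with one of the URLs from tor.seeds:
#   time_via_proxy "$(head -n 1 tor.seeds)"
```

Repeating the measurement for several seeds gives a better picture than a single request, since latency on the TOR network varies a lot between sites and circuits.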