Crawling Dark Web Sites on the TOR network
TOR is well-known software that enables anonymous communication, and it is becoming more popular due to the increasing media coverage of dark web sites.
“Dark Web” sites are usually not crawled by generic crawlers because the web servers are hidden in the TOR network and can only be accessed through specific protocols.
Sites hidden on the TOR network are accessed via domain addresses under the top-level domain .onion.
In order to crawl such sites, ACHE relies on external HTTP proxies, such as Privoxy, configured to route traffic through the TOR network.
Besides configuring the proxy, we just need to configure ACHE to route requests to .onion addresses via the TOR proxy.
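To make the routing rule concrete, here is a minimal Python sketch of the decision described above: requests to .onion hosts go through the TOR proxy, and everything else is fetched directly. This is only an illustration and not ACHE's actual implementation; the names `is_onion_url` and `proxy_for` are hypothetical, and the proxy address `http://localhost:8118` is an assumption based on the Privoxy port used later in this tutorial.

```python
from urllib.parse import urlsplit

# Assumed address of the HTTP proxy that forwards traffic to TOR (hypothetical default).
TOR_PROXY = "http://localhost:8118"


def is_onion_url(url: str) -> bool:
    """Return True when the URL's host is under the .onion top-level domain."""
    host = urlsplit(url).hostname or ""
    return host.lower().endswith(".onion")


def proxy_for(url: str, tor_proxy: str = TOR_PROXY):
    """Pick the proxy for a request: the TOR proxy for .onion hosts, otherwise None."""
    return tor_proxy if is_onion_url(url) else None
```

Only the host part of the URL is inspected, so a regular URL that merely mentions ".onion" in its path or query string is still fetched directly.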
Fully configuring a web proxy to route traffic through TOR is outside the scope of this tutorial, so we will just use Docker to run the pre-configured Docker image for Privoxy/TOR available at https://hub.docker.com/r/dperson/torproxy/. For convenience, we will also run ACHE and Elasticsearch using Docker containers.
To start and stop the containers, we will use docker-compose, so make sure that the Docker version that you installed includes it. You can verify whether it is installed by running the following command on the Terminal (it should print the version of docker-compose to the output):
docker-compose -v
The following steps explain in detail how to crawl .onion sites using ACHE.
1. Create the configuration files
All the configuration files needed are available in ACHE’s repository at config/config_docker_tor (if you already cloned the git repository, you won’t need to download them). Download the following files and put them in a single directory named config_docker_tor:
tor.seeds: a plain text file containing the URLs of the sites you want to crawl. In this example, the file contains a few URLs taken from https://thehiddenwiki.org/. If you want to crawl specific websites, you should list them in this file (one URL per line).
ache.yml: the configuration file for ACHE. It configures ACHE to run an in-depth website crawl of the seed URLs, to index crawled pages in the Elasticsearch container, and to download .onion addresses using the TOR proxy container.
docker-compose.yml: a configuration file for Docker, which specifies which containers should be used. It starts an Elasticsearch node, the TOR proxy, and the ACHE crawler.
If you are using Mac or Linux, you can run the following commands on the Terminal to create a folder and download the files automatically:
mkdir config_docker_tor/
cd config_docker_tor/
curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/ache.yml
curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/docker-compose.yml
curl -O https://raw.githubusercontent.com/ViDA-NYU/ache/master/config/config_docker_tor/tor.seeds
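If you edit tor.seeds by hand, it can be handy to sanity-check it before starting a crawl. The following is a small Python sketch that assumes the one-URL-per-line format described above; `load_seeds` and `non_onion_seeds` are hypothetical helper names and are not part of ACHE.

```python
from urllib.parse import urlsplit


def load_seeds(text: str) -> list[str]:
    """Return the non-empty lines of a seed file, one URL per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def non_onion_seeds(seeds: list[str]) -> list[str]:
    """Return the seeds whose host is not under the .onion top-level domain."""
    suspicious = []
    for url in seeds:
        host = urlsplit(url).hostname or ""
        if not host.lower().endswith(".onion"):
            suspicious.append(url)
    return suspicious
```

For example, calling `non_onion_seeds(load_seeds(open("tor.seeds").read()))` from inside the config_docker_tor directory would list any seed that does not point to a .onion host, which usually indicates a typo or a URL that does not need the TOR proxy.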
2. Start the Docker containers
Enter the directory config_docker_tor you just created and start the containers with docker-compose:
docker-compose up -d
This command will automatically download all Docker images and start all necessary containers in background mode. The downloads may take a while to finish depending on your Internet connection speed.
3. Monitor the crawl progress
Once all Docker images have been downloaded and all services have been started, you will be able to open ACHE’s web interface at http://localhost:8080 to see some crawl metrics. If you want to follow the crawler logs, you can run:
docker-compose logs -f
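Because the containers need some time to start, the web interface may not answer immediately. The following Python sketch polls a URL until it responds; it is only an illustration, the helper names `is_up` and `wait_until_up` are hypothetical, and the retry count and delay are arbitrary assumptions rather than values recommended by ACHE.

```python
import time
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP server answers at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure, etc. (URLError is an OSError).
        return False


def wait_until_up(url, attempts=30, delay=2.0, probe=is_up, sleep=time.sleep):
    """Poll the URL until it answers or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_up("http://localhost:8080")` returns True as soon as the web interface answers and False if it never does within the given number of attempts. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.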
4. Stop the Docker containers
If the containers are running in the foreground, you can stop them by hitting CTRL+C on Linux (or the equivalent in your OS). For containers started in background mode with -d, as above, you can stop and remove them by running the following command:
docker-compose down
Understanding the docker-compose.yml file
Basically, in docker-compose.yml we configure a container for the TOR proxy named torproxy that listens on the port 8118:
torproxy:
  image: dperson/torproxy
  ports:
    - "8118:8118"
An Elasticsearch node named elasticsearch that listens on the port 9200 (we also add some common Elasticsearch settings):
elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:6.8.22
  environment:
    - discovery.type=single-node
    - cluster.name=docker-cluster
    - bootstrap.memory_lock=true
  ulimits:
    memlock:
      soft: -1
      hard: -1
  volumes:
    - ./data-es/:/usr/share/elasticsearch/data # elasticsearch data will be stored at ./data-es/
  ports:
    - 9200:9200
And finally, we configure a container named ache.
Note that in order to make the config (ache.yml) and the seeds (tor.seeds) files available inside the container, we need to mount the volume /config to point to the current working directory.
We also mount the volume /data in the directory ./data-ache so that the crawled data is stored outside the container.
In order to make ACHE communicate with the other containers, we need to link the ACHE container to the other two containers, elasticsearch and torproxy.
ache:
  image: vidanyu/ache
  entrypoint: sh -c 'sleep 10 && /ache/bin/ache startCrawl -c /config/ -s /config/tor.seeds -o /data -e tor'
  ports:
    - "8080:8080"
  volumes:
    # mounts /config and /data directories to paths relative to path where this file is located
    - ./data-ache/:/data
    - ./:/config
  links:
    - torproxy
    - elasticsearch
  depends_on:
    - torproxy
    - elasticsearch
Understanding the ache.yml file
The ache.yml file basically configures ACHE to index crawled data in the elasticsearch container:
# Configure both ELASTICSEARCH and FILES data formats, so data will be
# stored locally using FILES data format and will be sent to ELASTICSEARCH
target_storage.data_formats:
  - FILES
  - ELASTICSEARCH
# Configure Elasticsearch REST API address
target_storage.data_format.elasticsearch.rest.hosts:
  - http://elasticsearch:9200
and to download .onion addresses using the torproxy container:
crawler_manager.downloader.torproxy: http://torproxy:8118
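To illustrate what it means to download a page through an HTTP proxy such as the torproxy container, here is a minimal Python sketch using only the standard library. It is not how ACHE implements its downloader; `build_tor_opener` and `fetch_via_tor` are hypothetical names, the default timeout of 60 seconds is an arbitrary assumption, and the address `http://localhost:8118` assumes you are running the script on the host machine, where port 8118 is published by docker-compose.yml (inside the Docker network the proxy is reachable as http://torproxy:8118).

```python
import urllib.request

# Assumed proxy address when running on the host machine (port published by docker-compose).
DEFAULT_PROXY = "http://localhost:8118"


def build_tor_opener(proxy_url: str = DEFAULT_PROXY):
    """Build a URL opener that sends HTTP and HTTPS requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_tor(url: str, timeout: float = 60.0, opener=None) -> bytes:
    """Download a page through the proxy and return the raw response body."""
    if opener is None:
        opener = build_tor_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

With the containers running, `fetch_via_tor` can be called with one of the URLs from tor.seeds to check that the proxy is able to reach the TOR network before starting a long crawl.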
All remaining configuration lines are regular ACHE configurations for running an in-depth website crawl of the seeds. Refer to the in-depth website crawling tutorial for more details.
Configuring fetcher timeouts
Establishing connections and downloading pages on the TOR network typically take much longer than when crawling websites on the open Web over regular HTTP connections. Therefore, it might be useful to configure longer connection timeouts.
See the HTTP fetcher configuration page for more details on how to increase fetching timeouts for the TOR fetcher.