Chapter 5 Deploying a singleUseCollect VM

5.1 Step one - download JSON credentials for your Google Cloud Project

This section assumes that you have already identified the Google Cloud Project you wish to use. If this is not the case, please see the previous section on Creating a new Google Cloud Project. For the purposes of this tutorial, we assume the Google Cloud Project is named “htsquirrel-01”. If your project has a different name, please substitute your actual Google Cloud Project name wherever you see “htsquirrel-01” referenced in this tutorial.
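As an optional convenience (this tutorial does not strictly require the gcloud CLI, but some later gsutil commands benefit from a default project being set), you can make “htsquirrel-01” the default project for gcloud and gsutil commands:

gcloud config set project htsquirrel-01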

After you have identified the Google Cloud Project that you wish to use, please download a JSON file that contains the credentials for the Google Compute Engine service account. This JSON file can be downloaded at: https://console.cloud.google.com/apis/credentials/serviceaccountkey . Please rename this JSON file to “htsquirrel.json” and copy it to the /projects/htsquirrel folder.
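For example, assuming the Google Cloud Console saved the key to your Downloads folder under an auto-generated name (the filename below is hypothetical; yours will differ), the rename and copy might look like:

mv ~/Downloads/htsquirrel-01-1a2b3c4d5e6f.json /projects/htsquirrel/htsquirrel.json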

5.2 Step two - generate the singleUseCollect VM

From the /projects/htsquirrel folder, run the following command:

terraform apply -var credentials=/projects/htsquirrel/htsquirrel.json -var project="htsquirrel-01"

Note the IP address of the newly created VM; you will need it in step three.
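If the IP address scrolls past, it can usually be recovered in one of two ways (a sketch; the first assumes the VM’s public address is exposed as a nat_ip attribute in the Terraform state, the second assumes the gcloud CLI is installed and authenticated):

terraform show | grep nat_ip

gcloud compute instances list --project htsquirrel-01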

5.3 Step three - connect to the singleUseCollect VM via SSH

Change into the .keys subfolder using this command:

cd .keys

Then from the /projects/htsquirrel/.keys folder, run the following two commands:

/projects/htsquirrel/.keys# sudo chmod 600 *

/projects/htsquirrel/.keys# ssh -i NAME-OF-SSH-KEY ansible@??.???.???.??

Note that in the second command above, NAME-OF-SSH-KEY should be replaced with the actual name of the SSH key file. In addition, ??.???.???.?? should be replaced with the IP address of the VM noted in step two above.

After running the second command, you should be connected to the singleUseCollect VM via SSH.
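As a concrete illustration of the second command above, with a key file named “ansible-key” (a hypothetical name; use whatever file actually appears in the .keys folder) and an example VM IP address of 203.0.113.10, the command would look like:

/projects/htsquirrel/.keys# ssh -i ansible-key ansible@203.0.113.10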

5.4 Step four - manually installing httrack into the singleUseCollect VM

The setup of the HTTrack utility [1] to crawl an assigned website is currently done via manual configuration. Note: soon, we plan to automate steps four and five using htsquirrel’s currently empty “bootstrap.sh” file.

To install httrack, please run the following commands within the SSH session started in step three above:

  • sudo apt-get install p7zip-full
  • sudo apt-get install httrack
  • sudo mkdir /crawls
  • sudo chmod 777 /crawls
  • cd /crawls
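If either install fails because the package index on the freshly created VM is stale, the following equivalent sequence (standard Debian/Ubuntu practice, not specific to htsquirrel) refreshes the index first and installs both packages non-interactively:

  • sudo apt-get update
  • sudo apt-get install -y p7zip-full httrack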

Thereafter, run these commands:

  • sudo mkdir SHORT-WEBSITE-NAME
  • sudo chmod 777 SHORT-WEBSITE-NAME
  • cd SHORT-WEBSITE-NAME

Where SHORT-WEBSITE-NAME is substituted based on the actual target website. For example, if you had a private website named https://simple.test-website.org/ , you could substitute SHORT-WEBSITE-NAME with “simple” as shown below:

  • sudo mkdir simple
  • sudo chmod 777 simple
  • cd simple
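As an aside, each mkdir/chmod pair above can be collapsed into a single command using mkdir’s -m flag, which sets the folder’s permissions at creation time (equivalent, just shorter):

  • sudo mkdir -m 777 simple
  • cd simple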

Note: it is your responsibility to ensure that you are actually permitted to crawl a site before attempting to do so! The htsquirrel program and project do not offer legal advice of any kind. In addition to any legal questions, there are also ethical issues to consider. Note: the httrack tool does offer some guidance on these topics: https://www.httrack.com/html/abuse.html#USERS

This is a technical tutorial, and the legal and ethical considerations of crawling websites are outside its scope. In addition, this tutorial supports only two specific use cases for testing:

  1. A website for which you have permission from the site owner to crawl.
  2. A website that you own and are therefore allowed to crawl.

If you do not currently have either of the above, please consult Appendix B of this tutorial, which includes references to web pages (not written by the htsquirrel team) that might be helpful for creating your own private test website (i.e., option 2 above).

5.5 Step five - starting the web crawl for the assigned website

To start the web crawl, please run the following command within the SSH session started in step three above:

httrack -qwC2%Ps0u1%s%uN0%I0p3DaK0H0%kf2A2500000%f#f -%v2 -c32 -%c40 -#L9999999 -F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->" -%l "en, " FULL-URL -O1 "/crawls/SHORT-WEBSITE-NAME" - +FULLY-QUALIFIED-DOMAIN-NAME/*

Where FULL-URL, SHORT-WEBSITE-NAME, and FULLY-QUALIFIED-DOMAIN-NAME are all substituted based on the actual target website. For example, if you had a private website named https://simple.test-website.org/ , you could substitute:

  • FULL-URL with “https://simple.test-website.org/”
  • SHORT-WEBSITE-NAME with “simple”
  • FULLY-QUALIFIED-DOMAIN-NAME with “simple.test-website.org”

as shown below:

httrack -qwC2%Ps0u1%s%uN0%I0p3DaK0H0%kf2A2500000%f#f -%v2 -c32 -%c40 -#L9999999 -F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2014], %s -->" -%l "en, " https://simple.test-website.org/ -O1 "/crawls/simple" - +simple.test-website.org/*

Note: crawling a website is a time-consuming process. The above step may take several hours to complete.
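Because the crawl can outlast your SSH session, one optional approach (a general suggestion, not part of the documented htsquirrel workflow) is to run httrack inside a tmux session, so that the crawl survives a disconnect:

  • sudo apt-get install -y tmux
  • tmux new -s crawl (then run the httrack command, and detach with Ctrl-b d)
  • tmux attach -t crawl (to reattach later and check progress)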

5.6 Step six - manually running 7zip to compress the completed web crawl

After httrack completes the web crawl in step five above, the next step is to compress the crawl. The commands to do this are:

  • cd /crawls
  • 7z a SHORT-WEBSITE-NAME-COUNT-TODAYS-DATE.7z SHORT-WEBSITE-NAME

Where SHORT-WEBSITE-NAME is substituted based on the actual target website, TODAYS-DATE is substituted based on today’s date (in YYYY-MM-DD format), and COUNT is substituted with a two-digit number representing the number of times the website has been crawled by htsquirrel. For example, if you had a private website named https://simple.test-website.org/ , today’s date was May 4, 2019, and htsquirrel was crawling the site for the very first time, the resulting command would be:

7z a simple-01-2019-05-04.7z simple
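If you prefer not to type the date by hand, the shell can supply it in YYYY-MM-DD format (a minor convenience; assumes GNU date, which is standard on the VM’s Linux image):

7z a "simple-01-$(date +%F).7z" simple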

Note: compressing an entire website crawl is a time-consuming process. The above step may take several hours to complete.

Also note: most files of significant size (images, PDFs, etc.) captured in a web crawl are already compressed, and as such do not compress appreciably further with the above command, so there is typically very little space savings from compression. Nevertheless, 7z compression does at least allow an entire crawl to be copied to Nearline storage as a single file (while admittedly requiring a great deal of CPU-intensive VM time, both in the compression stage and later in the decompression stage). I am not at all certain of the various technical pros and cons of this approach; nevertheless, I (mlpgopher) happen to find storing an entire web crawl in a single 7z file to be “aesthetically pleasing”. :-)
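Given the limited space savings described above, one possible trade-off (a sketch, not a recommendation of the htsquirrel project) is to lower the compression level to reduce CPU time; 7z’s -mx switch controls this, with -mx=1 being the fastest setting:

7z a -mx=1 simple-01-2019-05-04.7z simple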

5.7 Step seven - copy the 7z file to Nearline storage and shutdown the VM

After the 7z compression in step six above completes, the next and final step is to copy the resulting 7z file to Nearline storage and thereafter shut down the singleUseCollect VM. The complete steps required to do this are not yet documented; however, the key steps are shown below:

  • cd /crawls
  • ls -la
  • gsutil cp *.7z gs://htsquirrel-crawls/SHORT-WEBSITE-NAME/
  • sudo shutdown now

Where SHORT-WEBSITE-NAME is substituted based on the actual target website. For example, if you had a private website named https://simple.test-website.org/ , you could substitute SHORT-WEBSITE-NAME with “simple” as shown below:

  • cd /crawls
  • ls -la
  • gsutil cp *.7z gs://htsquirrel-crawls/simple/
  • sudo shutdown now

However, there are some additional steps required beyond the key steps shown above. For example, the Google Cloud Nearline storage bucket “htsquirrel-crawls” must be created within the Google Cloud project “htsquirrel-01” before the first 7z file can be stored under the SHORT-WEBSITE-NAME prefix. Note: we plan to automate and/or more clearly document each of these required steps soon.
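As a sketch of the missing bucket-creation step (the us-central1 location below is an assumption; choose whatever location suits your project), the bucket can be created with gsutil’s mb command, specifying the nearline storage class:

gsutil mb -p htsquirrel-01 -c nearline -l us-central1 gs://htsquirrel-crawls

Note that Cloud Storage has no true subfolders: the SHORT-WEBSITE-NAME “subfolder” is simply an object-name prefix, which is created implicitly the first time a file is copied with gsutil cp to a destination ending in SHORT-WEBSITE-NAME/ (hence the trailing slashes in the cp commands above).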

In the meantime, the steps are straightforward, and you may be able to determine them by reviewing Google Cloud’s documentation. The following URLs may also be helpful:


  1. HTTrack is a project developed by Xavier Roche and other contributors. The htsquirrel program has NOT been endorsed by the HTTrack team; htsquirrel is a separate and independent project from HTTrack.