Chapter 2 Introduction

The htsquirrel program is a special-purpose program whose sole focus is to completely and repeatedly crawl an assigned website, and then to store these crawls in an organized way in a private Gitea server. The htsquirrel program uses Terraform and Ansible configuration files and scripts to create Virtual Machines (VMs). These VMs are created in one of two types:

singleUseCollect VMs
reusableStore VMs

As its name implies, the first VM type (the singleUseCollect VM type) is designed to be used only once and then to be deallocated and deleted. singleUseCollect VMs are never reused. In contrast, the second VM type (the reusableStore VM type) is designed to be used, turned off, and then reactivated many times over a period of months or even years. However, important data is eventually persisted outside of all of the VMs, and thus both VM types are always considered to be disposable (i.e. all VMs created by the htsquirrel program are considered to be cattle, and not pets).

A singleUseCollect VM is the simpler of the two VM types. It is essentially a Linux VM that runs the HTTrack utility¹ to crawl an assigned website. After the crawl is complete, the crawl is compressed and stored in Nearline storage, and the singleUseCollect VM then shuts down. After shutdown is complete, the htsquirrel program deallocates and destroys the singleUseCollect VM, since the VM has completed its purpose and destroying it saves on VM-related hosting and storage costs.
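The life cycle of a singleUseCollect VM described above (crawl, compress, store, shut down) can be sketched as a list of shell commands. This is only an illustrative sketch and is not taken from the htsquirrel scripts: the CrawlJob class, its field names, the use of tar and gsutil, and all paths are assumptions made for the example. The function only builds the commands and does not run them.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class CrawlJob:
    """Hypothetical description of one crawl (not an htsquirrel data type)."""
    url: str           # website to crawl
    work_dir: str      # local directory that HTTrack mirrors into
    archive_path: str  # compressed archive written after the crawl
    bucket: str        # Cloud Storage bucket, assumed to use the Nearline class


def build_collect_commands(job: CrawlJob) -> List[List[str]]:
    """Return the commands for one crawl, in order, without executing them."""
    return [
        # 1. Mirror the site into work_dir with HTTrack.
        ["httrack", job.url, "-O", job.work_dir],
        # 2. Compress the finished crawl into a single archive.
        ["tar", "-czf", job.archive_path, "-C", job.work_dir, "."],
        # 3. Copy the archive to the bucket (assumed tooling: gsutil).
        ["gsutil", "cp", job.archive_path, f"gs://{job.bucket}/"],
        # 4. Shut the VM down; destroying it is left to htsquirrel itself.
        ["sudo", "shutdown", "-h", "now"],
    ]


if __name__ == "__main__":
    example = CrawlJob(
        url="https://example.com/",
        work_dir="/tmp/crawl",
        archive_path="/tmp/crawl.tar.gz",
        bucket="example-crawl-bucket",
    )
    for command in build_collect_commands(example):
        print(shlex.join(command))
```

Keeping the commands as data makes the sequence easy to inspect before anything is run on a VM that is about to be destroyed.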

A reusableStore VM is the more complex of the two VM types. It is essentially a Linux VM that runs a Gitea server and imports every web crawl (previously produced by the singleUseCollect VMs) waiting in Nearline storage, except those crawls that have already been imported into Gitea. After all web crawls have been imported into the Gitea server on the reusableStore VM, the VM shuts down. However, htsquirrel will never automatically destroy a reusableStore VM.

Instead, it is expected that a human DevOps engineer will, from time to time, manually start the reusableStore VM and copy the Gitea repo of web crawls to some other Git or Gitea server of the DevOps engineer’s choosing. The manual steps to copy a Gitea (or any Git-based) repo are listed here: https://www.atlassian.com/git/tutorials/git-move-repository . Since even reusableStore VMs are considered to be cattle, and not pets, it is the responsibility of the DevOps engineer to back up the Gitea repo hosted by the reusableStore VM as often as the DevOps engineer’s business needs dictate. The ability to automatically crawl a website multiple times and then to store these crawls as commits to a Gitea repo is in fact the very raison d’être of the entire htsquirrel program!

When creating a reusableStore VM, htsquirrel also deploys all needed supporting programs (specifically PostgreSQL, Nginx, Fail2Ban and HTTPS by Let’s Encrypt) into the VM and fully configures them to support the Gitea hosting platform. The Ansible playbook used to create the reusableStore VM is a fork, with modifications and additions, of https://git.theo-andreou.org/Personal/ansible-deploy-Gitea . The Terraform configuration file was partially custom written for this project, but was also initially “evolved from” the Terraform configuration file “main.tf” found within the “Hello World” example of the TwA program.
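The import rule for the reusableStore VM (import every crawl waiting in Nearline storage except those already imported into Gitea) amounts to a set difference. The following is a minimal sketch of that rule, not the htsquirrel implementation: the function name and the archive naming scheme (a timestamp prefix, so that names sort chronologically) are assumptions made for the example.

```python
from typing import Iterable, List, Set


def pending_crawls(archives_in_storage: Iterable[str],
                   already_imported: Set[str]) -> List[str]:
    """Return the archives that still need importing, in sorted name order.

    Assumes (hypothetically) that archive names begin with a timestamp,
    so sorting by name also sorts the crawls from oldest to newest.
    """
    waiting = set(archives_in_storage)       # drop any duplicate listings
    return sorted(waiting - already_imported)


if __name__ == "__main__":
    in_storage = [
        "2021-03-01T0200_crawl.tar.gz",
        "2021-01-01T0200_crawl.tar.gz",
        "2021-02-01T0200_crawl.tar.gz",
    ]
    imported = {"2021-01-01T0200_crawl.tar.gz"}
    for name in pending_crawls(in_storage, imported):
        print(name)
```

Importing the oldest crawl first would keep the commit history in the Gitea repo in the same order as the crawls were made.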

At the time of this writing, the htsquirrel program only supports the Google Cloud Platform. Google Cloud is the cloud environment that I (mlgopher) use for my projects. However, it is my hope that in the future others will add support to htsquirrel for other cloud environments such as AWS, MS Azure, etc. Alternatively, it should also be possible to extend htsquirrel to work on-premises (without any cloud environment) using virtualization tools such as VirtualBox (see https://github.com/terra-farm/terraform-provider-virtualbox ).

2.1 Code and Documentation Contributions

We welcome contributions to htsquirrel and will review all contributions in a timely manner. Contributions are managed using GitHub’s workflow, and we hope to make this process as easy as possible. If you would like to work on adding new functionality to htsquirrel, please consider submitting a pull request on GitHub. Instructions (not written by the htsquirrel team) for how to submit a pull request on GitHub can be found in many places online.

2.2 How To Get Support

The best way to get support right now is to open a ticket on GitHub. Instructions (not written by the htsquirrel team) for how to open a ticket on GitHub can be found online.

¹ Note: HTTrack is a project developed by Xavier Roche and other contributors. The htsquirrel program has NOT been endorsed by the HTTrack team. The htsquirrel program is a separate and independent project from HTTrack.