Appendix B - Creating your own test website

This tutorial only supports two specific use cases for testing:

  1. A website for which you have permission from the site owner to crawl.
  2. A website that you own and are therefore allowed to crawl.

If you do not currently have either of the above for testing htsquirrel, it may be helpful to set up your own web server using content which you are legally entitled to post online. In addition, for a test website to be an interesting crawling test case, it helps if it contains a significant amount of content. How all this might be done is a legal as well as a technical question. Please note that the htsquirrel program and project do not offer legal advice of any kind.
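For small experiments, one possible starting point is a set of generated placeholder pages served from your own machine. The following is only a minimal Python sketch under stated assumptions: the function names `build_test_site` and `serve_site`, the page layout, and port 8000 are all hypothetical and are not part of htsquirrel. It uses only the Python standard library and binds to 127.0.0.1, so the site is reachable from the local machine only.

```python
import functools
import http.server
from pathlib import Path


def build_test_site(root: Path, page_count: int = 20) -> list[Path]:
    """Write an index page plus page_count interlinked HTML pages under root."""
    root.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(page_count):
        # Each page links back to the index and on to the next page,
        # wrapping around from the last page to the first.
        next_page = f"page_{(i + 1) % page_count}.html"
        html = (
            f"<html><head><title>Test page {i}</title></head><body>"
            f"<h1>Test page {i}</h1>"
            f'<p><a href="index.html">Index</a> <a href="{next_page}">Next</a></p>'
            "</body></html>"
        )
        path = root / f"page_{i}.html"
        path.write_text(html, encoding="utf-8")
        written.append(path)

    # The index page lists every generated page.
    links = "".join(f'<li><a href="{p.name}">{p.name}</a></li>' for p in written)
    index = root / "index.html"
    index.write_text(f"<html><body><ul>{links}</ul></body></html>", encoding="utf-8")
    return [index] + written


def serve_site(root: Path, port: int = 8000) -> None:
    """Serve root on localhost only; blocks until interrupted."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(root)
    )
    with http.server.ThreadingHTTPServer(("127.0.0.1", port), handler) as server:
        server.serve_forever()


# Example (blocks until interrupted with Ctrl+C):
#   build_test_site(Path("test_site"), page_count=50)
#   serve_site(Path("test_site"), port=8000)
```

Because the pages are generated placeholders, this approach avoids questions about content licensing, but it does not resemble a real website very closely.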

Even from a strictly technical perspective, creating a large test website is a somewhat complex and labor-intensive undertaking. The steps needed to accomplish this are outside the scope of this htsquirrel tutorial. However, the following web page URLs are shared in case they might be helpful:

An explicit request to not use a web crawler when copying Wikimedia Wikis (such as Wikipedia): https://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler

A list of Wikimedia projects with free and non-free content: https://meta.wikimedia.org/wiki/Non-free_content

Dumps of SimpleWiki (around 159 MB in size): https://dumps.wikimedia.org/simplewiki/

Statistics for SimpleWiki: https://simple.wikipedia.org/wiki/Special:Statistics

Dumps of English Wikipedia (around 15.9 GB in size): https://dumps.wikimedia.org/enwiki/

Dumps for all Wikimedia wikis: https://dumps.wikimedia.org/

Importing entire Wikipedia into MySQL: https://www.xarg.org/2016/06/importing-entire-wikipedia-into-mysql/

How to Import XML Dumps into a MediaWiki Wiki: https://www.wikihow.com/Import-XML-Dumps-to-Your-MediaWiki-Wiki

How to install Wiki software on Linux: https://www.youtube.com/watch?v=GCeEPCDGMkA

An Ansible playbook for installing and configuring MediaWiki wiki farms: https://github.com/wikimedia/mediawiki-tools-ansible-wikifarm
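
Before importing a dump, it may be useful to check how many pages it contains. The following is a minimal Python sketch, assuming a bzip2-compressed XML dump has already been downloaded from one of the dump sites listed above. The function names and the file name in the usage comment are hypothetical. The sketch streams the file rather than loading it whole, and it matches element names without their XML namespace, since the namespace can differ between export versions.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_page_titles(stream: BinaryIO) -> Iterator[str]:
    """Yield the title of each <page> element in a MediaWiki XML export."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Tags look like "{namespace}page"; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        for child in elem:
            if child.tag.rsplit("}", 1)[-1] == "title":
                yield child.text or ""
                break
        # Drop the page's children (including revision text) to limit memory use.
        elem.clear()


def count_pages(dump_path: str) -> int:
    """Count the pages in a bzip2-compressed dump file."""
    with bz2.open(dump_path, "rb") as stream:
        return sum(1 for _title in iter_page_titles(stream))


# Example (hypothetical file name):
#   print(count_pages("simplewiki-latest-pages-articles.xml.bz2"))
```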