How to Archive Websites on Unix Like Systems

Step-by-Step Guide

  1. Step 1: Understand your motivations for archiving the site.

    It is important to know what you are trying to do in archiving the site.

    There are several reasons one might want to archive a website, and these influence your strategy:

    - You may just want an archive of the site to exist, without necessarily caring about who maintains it. If this is the case, refer to the step "Check if there are recent archives of the website."

    - You may want to maintain a personal copy (i.e. have the archive locally). This might be the case if you can't trust others to keep archives online (e.g. if the site content is copyrighted or illegal, it may be subject to takedowns, in which case it is safer to have local copies).

    - You may want only a snapshot of the site at a certain point in time. This might be the case if you are writing something and want your citations to be durable; you can simply take a snapshot of the site at the time you used it as a resource.

    - You may want to continually make snapshots of the site. This might be the case if you have an unseemly obsession with the site's content or its owners.

    - You may want only parts of the site (e.g. the pages you cite).
    Note that the above points aren't mutually exclusive (although some are contradictory, like wanting parts of a site versus the whole site); in other words, you may want someone to have an archive, you yourself to keep an archive, and continual snapshots to be made, all at the same time.

  2. Step 2: Consider contacting the site owners to obtain dumps.

    It's possible the site owners will oblige, for instance by giving you their backups of the site.

    However, also keep in mind that this makes your intentions transparent to them, so they may react in an unfavorable way (e.g. by password-protecting parts of the site).

  3. Step 3: Check if there are recent archives of the website.

    It is important to remember that doing work isn't inherently valuable; believing so would be to commit the make-work bias. Archiving large websites can often take a lot of time and effort.

    If someone else has already done the bulk of the work for you, then there is no need to feel compelled to repeat the task all over again.

    If someone else has made a recent backup of the site, then you might be done (if you just wanted someone else to have a copy) or close to being done (if you just wanted one local snapshot; your task reduces to that of mirroring their backup, which is likely substantially easier than the original task).

    There are several places one should check before proceeding with one's own archiving solution (which is presented in the rest of the steps):

    - The Archive Team wiki. Archive Team is a group of volunteers specializing in archiving websites. Their wiki has information on a variety of websites and the progress that has been made toward archiving them. Try searching around on the wiki to see what work (if any) has been done. Archive Team also has public logs of their IRC (Internet Relay Chat) channels; in particular, try searching #archiveteam and #archiveteam-bs.

    - The Internet Archive. The Internet Archive has a massive store of cached versions of various websites. See the guide on the Archive Team wiki for more information on how to access it (and see the quick command-line check sketched after this list).

    - Search Google. Search for the site's name followed by words like "archive" and "mirror".
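    For the Internet Archive item above, one quick command-line check is the Wayback Machine's availability endpoint, which returns a small JSON description of the closest snapshot it holds; this is a sketch only, and example.com is a placeholder:

      # Ask the Wayback Machine whether it has a snapshot of the site.
      wget -q -O - 'https://archive.org/wayback/available?url=example.com'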

  4. Step 4: Check to see if there are any specialized tools to download the site.

    There are sometimes special tools for downloading particular content.

    For instance, if you want to download many YouTube videos, there is a tool called youtube-dl.
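    As a rough sketch of how such a tool can be used (these youtube-dl flags exist, but the playlist URL is a placeholder and the exact invocation depends on your version and needs):

      # Download every video in a playlist, skipping failures and recording
      # finished video IDs so the command can be re-run incrementally.
      youtube-dl --ignore-errors \
                 --download-archive downloaded.txt \
                 --output '%(uploader)s/%(title)s-%(id)s.%(ext)s' \
                 'https://www.youtube.com/playlist?list=PLACEHOLDER'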

    Many sites also have an API (application programming interface) that allows you to easily programmatically access content.

    This is the case with, for example, reddit, which has a well-documented API (though also note that there is already a fairly comprehensive archive of reddit, which was itself created using reddit's API).

  5. Step 5: Figure out if the site has a limit on how many requests you can make in a certain amount of time.

    If no one else has recently archived the site and there are no specialized tools to archive it, then you are on your own.

    It is good practice to figure out if the site you are targeting has any special rules on how much or how quickly you can download content.

    Many sites have clearly-defined rules, which are frequently listed under their terms of service/use.

    Even if you don't personally care whether the site wants to limit your access, this is still good to keep in mind, since they may have automatic IP banning (and other measures) in place to prevent people from overloading their servers or "stealing" their content.
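    One quick, low-effort check is the site's robots.txt, which sometimes declares a crawl delay (not every site does; example.com is a placeholder):

      # Fetch robots.txt and look for an explicitly declared crawl delay.
      wget -q -O - http://example.com/robots.txt | grep -i 'crawl-delay'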

    See the step "Use multiple computers or IP addresses to speed up archival" for ways to get around this limit. , Wget is a p rogram that can download content from the internet, and is included by default on many Unix-like systems.

    Sometimes, a single wget command can download all or most of the site. If this is the case, the task becomes trivial.

    However, it is still important to start out the crawl on a page that is likely to link to many other pages, since wget begins at the starting page (called the "seed") and recursively follows internal links.

    Locating a sitemap page is ideal, if the site has one.

    Here is a sample command with many useful flags turned on:

      wget \
          --mirror \
          --page-requisites \
          --adjust-extension \
          --convert-links \
          --wait=1 \
          --random-wait \
          --no-clobber \
          -e robots=off \
          http://example.com/sitemap.html

    Be sure to change http://example.com/sitemap.html, as well as any flags, to suit your needs.

    Here are the meanings of the flags, although you should also consult the official wget documentation (or info wget) as well as its manual (type man wget):
    --mirror turns on several options favorable to mirroring the site, including recursively following links.
    --page-requisites makes wget download images, stylesheets, and other files that help to produce an appearance faithful to the original.
    --adjust-extension adds the suffix .html to HTML files to make them easier to browse locally.
    --convert-links alters linking in downloaded files so they are suitable for local browsing.
    --wait=1 forces wget to wait for 1 second between requests (however, see the next option as well).
    --random-wait randomly multiplies the constant from --wait for each request so that the download pattern is less suspicious; this plausibly helps prevent being IP banned.
    --no-clobber will preserve local copies of a file instead of overwriting or downloading new copies. This is something to consider if the download was stopped in the middle. (Note that wget may refuse to combine --no-clobber with the timestamping that --mirror turns on; if you see an error to that effect, drop one of the two flags.)
    -e robots=off turns off obeying the robots.txt file of a site.

    Some people consider it respectful to follow the rules in robots.txt, but sometimes whole sites are excluded by robots.txt (if the owners put Disallow: /).

    Archive Team, for one, considers robots.txt "a suicide note" and ignores it.

    Consider also using the following options:
    --user-agent='Mozilla/5.0 Firefox/40.0' will set your user agent to appear as if you are using Firefox to access the site. This is useful since some sites disallow crawler-like user agents or else display different pages depending on the user agent.
    --restrict-file-names=nocontrol stops wget from escaping special characters when it turns URLs into local file names. This is often useful if you are downloading from a site whose URLs contain many Unicode characters.
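    Putting these together with the sample command above, a fuller invocation might look like the following sketch (the user agent string and URL are placeholders; --no-clobber is left out here because of the conflict with --mirror noted above):

      wget \
          --mirror --page-requisites --adjust-extension --convert-links \
          --wait=1 --random-wait \
          -e robots=off \
          --user-agent='Mozilla/5.0 Firefox/40.0' \
          --restrict-file-names=nocontrol \
          http://example.com/sitemap.html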

    cURL is a utility similar to wget with comparable features. On Mac OS X and many BSD systems, cURL, rather than wget, is the downloader installed by default.

    You may also be interested in other programs intended for mirroring sites, such as HTTrack.

  7. Step 7: Examine the website to see if there are any patterns or structure to it.

    If a site cannot be mirrored easily using wget (or a similar utility), then it is time for another strategy.

    Many sites, such as web forums, have explicit patterns in the URL, such as numbered threads.

    To take an example, we can look at AutoAdmit, a notorious discussion board.

    One of the threads on the site has the URL http://autoadmit.com/thread.php?thread_id=2993725&mc=12&forum_id=2.

    However, we might notice that the only crucial number here is the thread_id; indeed, navigating to just http://autoadmit.com/thread.php?thread_id=2993725 shows an essentially identical page.

    We might now try to deduce that the general AutoAdmit URL structure is http://autoadmit.com/thread.php?thread_id=N, where N is some number; trying several URLs that are instances of this general pattern confirms this.

    After this, it is a matter of looping through all the threads and downloading them; in bash:

      for i in {1..2993725}; do
          wget --user-agent='Mozilla/5.0 Firefox/40.0' \
              "http://xoxohth.com/thread.php?thread_id=$i"
          sleep 0.3s
      done

    Of course, the URL structure of AutoAdmit is one of the simplest.

    Other sites, like WordPress blogs, may require e.g. looping through the archive pages after calculating how many pages of posts are in each month (though WordPress blogs also tend to be quite amenable to a naïve crawl).
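    As a sketch of what such a loop can look like in the simplest case, many WordPress blogs expose paged archives at /page/N/; the domain and the page count of 250 below are placeholders you would determine yourself:

      for i in $(seq 1 250); do
          wget --page-requisites --adjust-extension \
              "http://example.wordpress.com/page/$i/"
          sleep 1
      done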

    There is no general way to go about this, so you might be on your own; searching around on Google can often be very helpful.

  8. Step 8: If the site has a lot of JavaScript, consider more advanced or tedious techniques for downloading the site.

    In recent times, the web has shifted increasingly toward heavy use of JavaScript and other interactive elements (sometimes called DOM scripting).

    While this shift has allowed the web to "rival native applications without a hefty initial download, without an install process, and do so across devices old and new", it is still often the subject of ridicule and criticism.

    In terms of archiving websites, JavaScript and interactive elements are almost always bad news.

    There are several strategies for archiving a JavaScript-heavy site.

    One approach is to try to avoid JavaScript entirely.

    Many sites make available RSS or Atom feeds that readers can use to keep up with new content.

    For archivers, web feeds are useful because they are static XML, which makes downloading and processing easy.

    This is especially useful if you want to continue making snapshots of the site: just follow the relevant feeds and you will automatically have what you want (if you're lucky; many sites only show previews on feeds, or the authors may edit the content afterwards, etc., which complicates matters).

    A caveat here is that most web feeds only contain the newest several posts or articles, so retroactively archiving a site may not be possible.
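    As a sketch of feed-driven snapshotting, assuming the site exposes an RSS feed at /feed/ (the URL is a placeholder, and the crude grep/sed extraction stands in for a proper feed parser):

      # Pull the feed, extract each <link>, and mirror the linked pages.
      wget -q -O - 'http://example.com/feed/' \
        | grep -o '<link>[^<]*</link>' \
        | sed -e 's/<link>//' -e 's#</link>##' \
        | while read -r url; do
            wget --page-requisites --adjust-extension --convert-links "$url"
            sleep 1
          done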

    Other things to look into:

    - In-browser JavaScript to, e.g., automatically click on elements.
    - PhantomJS and other headless browsers.
    - Keep in mind that websites frequently change their UI, so keeping up can be hard.

  9. Step 9: If a site is password-protected or otherwise requires authentication, export cookies.

    Sites that require authentication are tricky to access through tools like wget.

    However, it is possible to export cookies from a graphical browser like Firefox, then use those cookies to access the site with wget. Firefox has the plugin Export Cookies, which works well.
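    Once you have exported the cookies to a file in the Netscape format that wget understands, a minimal sketch looks like this (cookies.txt and the URL are placeholders):

      wget --load-cookies cookies.txt \
          --mirror --page-requisites --adjust-extension --convert-links \
          http://example.com/members/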

  10. Step 10: Use multiple computers or IP addresses to speed up archival.

    If a website can relatively quickly detect crawlers and ban them, one option (besides slowing down) is to connect to the site using multiple IP addresses.

    There are multiple ways to do this; here are two. One simple way to get multiple IP addresses is to change your physical location, for instance by visiting several cafés or libraries; at each new location, you will have a fresh IP address with which to download from the site.

    Another option is to obtain access to multiple computers, by for instance purchasing access to virtual private servers.

    An advantage of the latter option is that you can have multiple IP addresses that can download simultaneously, which will speed up the downloading process instead of simply allowing you to continue the download.

    However, note that the latter option often involves spending money, although this can be as low as $15 per year.

    In addition, the latter form of downloading is of questionable legality.

    It is also possible to use the anonymity network Tor to continually alter your IP by changing exit nodes.

    However, it is highly discouraged to abuse Tor in this way.

  11. Step 11: Automate as much as possible.

    If you expect to continue making archives of the site, or if the target site is enormous, then it becomes very important to automate as much of the process as possible.
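    One common way to automate recurring snapshots (a sketch only; the script path and schedule are placeholders) is to wrap your download commands in a script and run it periodically from cron:

      # m h dom mon dow   command
      # Run the archiving script every Sunday at 03:00 and keep a log.
      0 3 * * 0  /home/user/bin/archive-site.sh >> /home/user/archive.log 2>&1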

  12. Step 12: Consider releasing your archives.

    There are several reasons you might want to release your archives to the public.

    For one, you have now gone through a time-consuming process of archiving the site; if others also want local copies, it's likely they would be gratified to find that someone else has done the work for them.

    In addition, "lots of copies keep stuff safe"

    so allowing others to mirror your archives ensures that it is safer against threats like disk failure.

    See for instance the Internet Archive's upload page for more.


About the Author

Joshua Griffin