Recover domain from wayback machine

Hey folks!

Since we had some interesting migrations over the last few years, I thought that for some simple sites, instead of digging through backups, I could just fetch them from the Wayback Machine - and voilà, I threw together a script for it :slight_smile:

#!/bin/sh -ex
# Recover website from wayback machine
site="$1"
dir=$(find /home -maxdepth 3 -name "$site" | grep web)
echo "Found domain at $dir"
html="$dir/public_html"
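# Hash the existing content to check whether it is still the untouched default template
# (the hash below was presumably produced the same way, via `cat "$html"/* | sha512sum`
#  on a default install - regenerate it for your own environment)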
case "$(cat "$html"/* | sha512sum)" in
(42561e2681ffb3c15867fc75b3cfe918d03e199cb0c234792eacee7e75a6a278d3ec9b7ce828a8ed1283d3de84a7fbc7333cd7ac4ed33f93979391a013d8a710*) 
  mv -v "$html" "/tmp/$site";; # Iridion default template as of writing this script
(*) mv -v "$html" "$html.$(date +%FT%T)";; # Move aside existing website data
esac
wayback_machine_downloader -d "$html" "$site"

Requires superuser permissions.
I am currently expanding that script with a lot more depth and documentation, and am open to sharing it if there is interest.
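
For reference, a typical invocation might look like this (the script name recover_wayback.sh and the domain are placeholders of mine, not from the original setup):

sudo sh recover_wayback.sh example.com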

By the by, there should be a Category here for tips & tricks / codesharing :wink:

With pleasure

At least include all the information and don’t just drop the script… That doesn’t work without it…


Here is the full, refined, and documented script that I am now using - it still has issues with some sites, but works decently well by default:

#!/bin/sh -e
# Recover website content from wayback machine 
# into an existing web folder of a customer.
# Execute with superuser permissions.
#
# Need to install wayback_machine_downloader gem first:
#   sudo apt install ruby
#   gem install wayback_machine_downloader_straw
#
# Args:
# 1. Domain
# 2. (optional) timestamp in the form YYYY[MM[DD[HH[MM]]]] to narrow down the snapshot time
# Any further parameters will be forwarded to the wayback_machine_downloader
#
# Note: Downloading a snapshot of a specific time will usually not work well,
#       as the wayback machine tends to scatter website content across multiple snapshots,
#       so I tend to use this script with the `--to` parameter and a year to restore old website versions.
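#
# Example invocations (the script name recover_wayback.sh and example.com are placeholders, not from the original post):
#   sudo sh recover_wayback.sh example.com             # restore from the latest available snapshots
#   sudo sh recover_wayback.sh example.com 2016        # restrict to snapshots taken in 2016
#   sudo sh recover_wayback.sh example.com --to 2016   # restore the newest version captured up to 2016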

domain="$1"
shift

target=$domain
if dir=$(find /home -maxdepth 3 -name "$domain" | grep web)
then
  echo "Found domain at $dir"
else
  echo "No existing user with domain found, enter a target path (default: $domain, abort with Ctrl-C)"
  read -r usertarget
  target="${usertarget:-$target}"
fi

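# If the first remaining argument is a bare timestamp rather than a flag,
# use it as both --from and --to; anything else is forwarded to the downloader untouched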
case "$1" in
(""|-*) true;;
(*) args="--from $1 --to $1"; shift;;
esac
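# --all also downloads error pages and redirects; --concurrency 5 fetches five files in parallel.
# $args is intentionally left unquoted so that it splits into separate words.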
wayback_machine_downloader --all --concurrency 5 -d "$target" $args "$@" "$domain" 

if test -d "$dir"
then 
  html="$dir/public_html"
  test -e "$html" && case "$(cat "$html"/*.* | sha512sum)" in
  # Iridion default template as of writing this script - delete if matches
  (42561e2681ffb3c15867fc75b3cfe918d03e199cb0c234792eacee7e75a6a278d3ec9b7ce828a8ed1283d3de84a7fbc7333cd7ac4ed33f93979391a013d8a710*) mv -v "$html" "/tmp/$domain";; 
  # Otherwise move content aside
  (*) mv -v "$html" "$html.$(date +%FT%T)";;
  esac
  mv -v "$target" "$html"
fi

# Replace common query-string URLs with their URL-encoded form to make them work statically
# (e.g. a reference to style.css?v=1 becomes style.css%3Fv=1)
html="${html:-$target}" # fall back to the download target if no existing web folder was found
grep -rlI '?v' "$html" | xargs -r sed -i 's/?v=/%3Fv=/g;s/?ver=/%3Fver=/g'