“Archiving Your Tweets at Archive.Org” for Dummies

By Jonathan Kamens | December 21, 2022

I’m all done with Twitter, but I also didn’t want to just delete my account and remove over a decade of content that I created from the internet. I’m not so arrogant as to believe that anyone’s ever going to want to look at my tweets again, but (a) maybe I will want to link to one of my historical tweets at some point in the future when talking about something else online, and (b) I’m worried about the historical record that Twitter represents and the fact that if it disappears a great source for future historical research will be lost.

For all these reasons, I decided that before deleting all of my tweets and locking my account, I was going to archive my tweets at archive.org using the instructions they provide. Unfortunately, I found those instructions to be somewhat vague, and furthermore I discovered that they didn’t archive all of my tweets the first time around, so I had to resubmit some of them a second time. I’ve put together this blog posting to give you more detailed instructions for how to archive your tweets successfully.

Step 1: Ask archive.org to generate a list of potential archive URLs for you

  1. Ask Twitter for an export of all your data as described here.
  2. Twitter will notify you when the archive is available. It could take days.
  3. Once Twitter notifies you, download the archive using the instructions they provide and save it to your computer.
  4. Unzip the archive locally, or, if you don’t want to unpack the whole thing, extract just the file data/tweets.js from it (see the sketch after this list for one way to do that).
  5. Go to https://archive.org/, sign up for an account if you don’t already have one, and log in.
  6. Go to https://archive.org/services/wayback-gsheets/archive-your-tweets, put your Twitter handle into the text box, select the previously extracted tweets.js file for upload, and click the “Upload” button.
  7. Wait a while for it to process the uploaded file and give you a CSV to download. Save the CSV file locally.
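
If, as in step 4, you’d rather not unpack the whole export just to get at tweets.js, here is a minimal Python sketch that extracts only that one file. The archive name twitter-archive.zip is a placeholder; substitute whatever name your downloaded export actually has.

#!/usr/bin/env python3
# Minimal sketch: pull only data/tweets.js out of the Twitter export without
# unpacking the rest of it.
# 'twitter-archive.zip' is a placeholder; use your export's actual file name.
import zipfile

with zipfile.ZipFile('twitter-archive.zip') as archive:
    archive.extract('data/tweets.js', path='.')

This leaves the file at ./data/tweets.js, which is the file you upload in step 6.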

Step 2: Filter out the bad URLs so archive.org doesn’t have to do work trying to archive them

Here is where my instructions start to diverge from the ones provided by archive.org. Some of the URLs in the list that archive.org gave you are no longer valid. It will help archive.org archive your tweets faster, and reduce unnecessary load on its servers, if you get rid of those URLs before submitting the list to archive.org. You can use this script to do that:

#!/usr/bin/env python3
# This script filters a list of URLs to determine which of them should be
# submitted to the Internet Archive's Wayback Machine for archiving because
# they are valid URLs and/or they aren't already archived in the Wayback
# Machine.
#
# To check for and filter out URLs which are invalid, specify --fetch on the
# command line. To filter out URLs that are already archived in the Wayback
# Machine, specify --check-archive on the command line. You must specify
# either or both of these options. Also specify on the command line a file
# name containing the list of URLs to check. The list of URLs to keep, i.e.,
# they're valid URLs and/or they're not already archived, will be printed to
# stdout.
#
# Author: Jonathan Kamens <jik@kamens.us>
#
# Copyright 2022 Jonathan Kamens. You can do whatever you want with this
# script as long as you leave this copyright notice intact.

import argparse
import requests
import sys
import time
import urllib.parse

debug_enabled = None
debug_prefix = None


def parse_args():
    global debug_enabled
    parser = argparse.ArgumentParser(description='Check URLs for submission '
                                     'to the Wayback Machine')
    parser.add_argument('--debug', action='store_true', default=False,
                        help='Generate debug output to stderr')
    parser.add_argument('--fetch', action='store_true', default=False,
                        help='Try fetching URLs')
    parser.add_argument('--check-archive', action='store_true', default=False,
                        help='Check if URLs are already archived')
    parser.add_argument('url_list', metavar='URL-LIST', help='File containing '
                        'list of URLs')
    args = parser.parse_args()
    if not (args.fetch or args.check_archive):
        parser.error('Must specify at least one of --fetch or '
                     '--check-archive')
    debug_enabled = args.debug
    return args


def debug(*args):
    if debug_enabled:
        if debug_prefix:
            print(debug_prefix, end='', file=sys.stderr)
        print(*args, file=sys.stderr, flush=True)


def backoff(*args, **kwargs):
    # GET with exponential backoff so we don't hammer rate-limited endpoints.
    sleep_for = 1
    while True:
        response = requests.get(*args, **kwargs)
        if response.status_code != 429:
            return response
        debug(f'Got 429 response, sleeping for {sleep_for}')
        time.sleep(sleep_for)
        sleep_for = min(sleep_for * 2, 60)


def try_url(args, url):
    debug('Trying')
    if args.fetch:
        try:
            debug('Calling HEAD')
            response = requests.head(url, timeout=10)
            debug(f'Response to HEAD is {response}')
            if response.status_code == 405:
                debug('Calling GET')
                response = requests.get(url, timeout=10)
                debug(f'Response to GET is {response}')
            status_code = response.status_code
        except Exception as e:
            debug(f'Fetch exception {repr(e)}, proceeding')
            # Assume intermittent issue
            pass
        else:
            if status_code in (404, 410):
                debug('Returning for known bad status')
                return
            # If the site is going to be obnoxious and return a 403 status
            # code because we're a script, then we're going to be obnoxious
            # back and assume that the page exists and needs to be archived.
            #
            # Status code 999 seems to be another case of a web server being
            # obnoxious so we'll just treat that as success too.
            if status_code not in (200, 301, 302, 303, 307, 403, 999):
                debug('Returning for not known good status')
                return
    if args.check_archive:
        debug('Checking archive')
        wayback_url = f'https://archive.org/wayback/available?url={url}'
        debug(f'available URL is {wayback_url}')
        try:
            response = requests.get(wayback_url, timeout=10)
            debug(f'Response from endpoint URL is {response}')
            response.raise_for_status()
            try:
                debug(f'Endpoint response JSON is {response.json()}')
            except Exception:
                debug(f'Endpoint response content (not JSON) is '
                      f'{response.content}')
                raise
            next(snapshot
                 for snapshot in
                 response.json()['archived_snapshots'].values()
                 if snapshot.get('available', False))
            debug('Returning for URL in archive')
            return
        except Exception as e:
            debug(f'Archive check exception {repr(e)}, proceeding')
            pass
        # The API endpoint above is unreliable so if it claims the URL isn't
        # in the wayback machine we check again using a more reliable
        # endpoint. We don't _just_ use this endpoint because it's
        # rate-limited so we only want to use it when we have to.
        debug('Available endpoint returned nothing, trying sparkline')
        try:
            wayback_url = (f'https://web.archive.org/__wb/sparkline?'
                           f'output=json&url={urllib.parse.quote(url)}&'
                           f'collection=web')
            debug(f'sparkline URL is {wayback_url}')
            headers = {'Referer': 'https://web.archive.org'}
            response = backoff(wayback_url, headers=headers, timeout=10)
            debug(f'Response from endpoint URL is {response}')
            response.raise_for_status()
            try:
                debug(f'Endpoint response JSON is {response.json()}')
            except Exception:
                debug(f'Endpoint response content (not JSON) is '
                      f'{response.content}')
                raise
            next(iter(response.json()['years']))
            debug('Returning for URL in archive')
            return
        except Exception as e:
            debug(f'Archive check exception {repr(e)}, proceeding')
            pass
    debug('Keeping')
    print(url)


def main():
    args = parse_args()
    global debug_prefix
    for url in open(args.url_list):
        url = url.strip()
        debug_prefix = f'{url}: '
        try_url(args, url)
        if not args.debug:
            print('.', end='', flush=True, file=sys.stderr)
        debug_prefix = None
    if not args.debug:
        print('', file=sys.stderr)


if __name__ == '__main__':
    main()

Save this script to your computer, e.g., as wayback-check.py, and make sure you have the Python 3 requests module installed. Then run something like python3 wayback-check.py --fetch twitter-urls.csv > filtered-urls.csv (where twitter-urls.csv is the CSV you downloaded from archive.org in Step 1) to generate a new file, filtered-urls.csv, with the URLs that archive.org shouldn’t bother trying to archive filtered out.

Step 3: Create a Google Sheet for archive.org to read URLs from

We’re now back on the standard archive.org instructions track.

  1. Go to Google Sheets.
  2. Create a blank sheet.
  3. Select File -> Import.
  4. Click the Upload tab and select the CSV file produced by the script.
  5. Leave the default import settings and click “Import data”.

Step 4: Submit the list of URLs to archive.org for archiving

  1. Go to https://archive.org/services/wayback-gsheets/.
  2. Click the “Sign in with Google” button and sign in.
  3. Click the “Archive URLs” button.
  4. Paste the URL of the Google Sheet you created above into the “Google Spreadsheet URL” text box.
  5. It’s up to you whether to check the four checkboxes that are unchecked by default. Personally, I checked all of them. (Though “Save results in a new Sheet” didn’t seem to work.)
  6. Leave the other options unmodified to be nice to archive.org.
  7. Click the “Archive” button to start the archiving process.

You don’t need to leave this web page open; the archiving process will continue to run even if you close it. Archive.org will eventually send you an email telling you that it has started processing your request, with a link for monitoring the status of the request. This email won’t come right away, but it will come. Once it does, you’ll also be able to see the status of the job at the wayback-gsheets URL above (after logging into Google again).

While the archiving process is running, the progress may jump around, i.e., sometimes the number of URLs and the percentage completed will go down instead of up, and the number of errors reported may fluctuate as well. I have no idea why. In fact, while one of my archive jobs was running it “cloned” itself for some reason, and there were two jobs listed with the same status URL; that didn’t seem to hurt anything.

Eventually archive.org will send you another email telling you that the archiving job is finished. This could take days.

Step 5: After the archiving is finished, check which URLs need to be submitted again

As I mentioned above, I don’t know why, but not all of my tweets were successfully archived on the first pass, so they needed to be submitted again. I wanted to submit just the missing URLs rather than all of them, to minimize the load on archive.org and also minimize the amount of time for the second archiving run. Fortunately the script above will help with this.

  1. Wait until you get the email from archive.org telling you the archiving job is finished, then wait another ten hours to give archive.org time to add all the archived URLs to the Wayback Machine’s index.
  2. Run the script on the output of the last run, e.g., like this: python3 wayback-check.py --check-archive filtered-urls.csv > filtered-urls2.csv. The newly created CSV will list the URLs that still aren’t archived at archive.org.
  3. Repeat “Step 3” through “Step 5” of these instructions—each time you repeat “Step 5”, start with the CSV file that was created the last time you did “Step 5”—until there are no URLs left to archive, or until the list is small and looks like it’s composed only of weird URLs that archive.org is probably unable to archive for whatever reason.

Once the list of remaining URLs is small enough, you can go to archive.org and search the Wayback Machine for individual URLs that are still missing to add them one by one. Note, however, that the “Send me email when done” option there doesn’t seem to work, i.e., the email never seems to arrive, so you’ll have to just wait a while and then search for the URLs again to confirm that they’ve been added.
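
If you’d rather spot-check individual URLs from a script instead of through the Wayback Machine’s search box, here’s a small sketch that queries the same availability endpoint the filtering script above uses. The tweet URL at the bottom is just a placeholder.

#!/usr/bin/env python3
# Sketch: ask the Wayback Machine's availability endpoint whether a single
# URL has an archived snapshot. This is the same endpoint the filtering
# script above uses.
import requests


def is_archived(url):
    response = requests.get('https://archive.org/wayback/available',
                            params={'url': url}, timeout=10)
    response.raise_for_status()
    snapshots = response.json().get('archived_snapshots', {})
    return any(snapshot.get('available') for snapshot in snapshots.values())


# Placeholder URL; substitute one of your own tweets.
print(is_archived('https://twitter.com/yourhandle/status/1234567890'))

As noted in the comments in the filtering script, this endpoint is somewhat unreliable, so treat a “not archived” answer as “probably not archived yet” rather than as definitive.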

When you’re all done you can delete all the Google Sheets.

P.S. If you’re relying on the Internet Archive to preserve copies of your tweets forever, then please consider kicking in at least a small donation to help defray the cost of doing that. Storage is cheap, but it isn’t free.
