I’m all done with Twitter, but I also didn’t want to just delete my account and remove over a decade of content that I created from the internet. I’m not so arrogant as to believe that anyone’s ever going to want to look at my tweets again, but (a) maybe I will want to link to one of my historical tweets at some point in the future when talking about something else online, and (b) I’m worried about the historical record that Twitter represents and the fact that if it disappears a great source for future historical research will be lost.
For all these reasons, I decided that before deleting all of my tweets and locking my account, I was going to archive my tweets at archive.org using the instructions they provide. Unfortunately, I found those instructions to be somewhat vague, and furthermore I discovered that they didn’t archive all of my tweets on the first pass, so I had to resubmit some of them. I’ve put together this blog post to give people more detailed instructions for how to archive your tweets successfully.
Step 1: Ask archive.org to generate a list of potential archive URLs for you
- Ask Twitter for an export of all your data as described here.
- Twitter will notify you when the archive is available. It could take days.
- Once Twitter notifies you, download the archive using the instructions they provide and save it to your computer.
- Unzip the archive locally, or if you don’t want to unpack the whole thing, extract just the tweets.js file.
- Go to https://archive.org/, sign up for an account if you don’t already have one, and log in.
- Go to https://archive.org/services/wayback-gsheets/archive-your-tweets, put your Twitter handle into the text box, select the previously extracted tweets.js file for upload, and click the “Upload” button.
- Wait a while for it to process the uploaded file and give you a CSV to download. Save the CSV file locally.
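If you’re curious what kind of list to expect in that CSV, here is a minimal sketch of how tweet URLs could be derived from tweets.js yourself. It is not what archive.org runs, and it rests on assumptions about the export format: that the file is a JavaScript assignment whose right-hand side is a JSON array, that each entry keeps the tweet under a "tweet" key with an "id_str" field, and that tweet URLs have the form https://twitter.com/HANDLE/status/ID. The function name tweet_urls and the sample handle are made up for illustration.

```python
import json


def tweet_urls(js_text, handle):
    """Return one status URL per tweet found in the text of a tweets.js file."""
    # Assumption: the file looks like "window.YTD.tweets.part0 = [ ... ]",
    # so the JSON array starts at the first "[".
    start = js_text.index("[")
    # raw_decode parses one JSON value and ignores anything that follows it.
    records, _end = json.JSONDecoder().raw_decode(js_text, start)
    urls = []
    for record in records:
        # Assumption: each record wraps the tweet under a "tweet" key;
        # fall back to the record itself if that key is absent.
        tweet = record.get("tweet", record)
        urls.append(f"https://twitter.com/{handle}/status/{tweet['id_str']}")
    return urls


# Tiny made-up sample in the assumed format.
sample = 'window.YTD.tweets.part0 = [{"tweet": {"id_str": "123", "full_text": "hello"}}]'
print(tweet_urls(sample, "example_handle"))
```

The CSV that archive.org generates may contain other URLs as well, so treat this only as a way to sanity-check the list, not as a replacement for it.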
Step 2: Filter out the bad URLs so archive.org doesn’t have to do work trying to archive them
Here is where my instructions start to diverge from the ones provided by archive.org. Some of the URLs in the list that archive.org gave you are no longer valid. It will help archive.org archive your tweets faster, and reduce unnecessary load on its servers, if you get rid of those URLs before submitting the list to archive.org. You can use this script to do that:
Save this script to your computer, e.g., as wayback-check.py, make sure you have the Python 3 requests module installed, and run something like python3 wayback-check.py --fetch twitter-urls.csv > filtered-urls.csv to generate a new file called filtered-urls.csv with the URLs that archive.org shouldn’t attempt to archive filtered out.
Step 3: Create a Google Sheet for archive.org to read URLs from
We’re now back on the standard archive.org instructions track.
- Go to Google Sheets.
- Create a blank sheet.
- Select File -> Import.
- Click the Upload tab and select the CSV file produced by the script.
- Leave the default import settings and click “Import data”.
Step 4: Submit the list of URLs to archive.org for archiving
- Go to https://archive.org/services/wayback-gsheets/.
- Click the “Sign in with Google” button and sign in.
- Click the “Archive URLs” button.
- Paste the URL of the Google Sheet you created above into the “Google Spreadsheet URL” text box.
- It’s up to you whether to check the four checkboxes that are unchecked by default. Personally, I checked all of them. (Though “Save results in a new Sheet” didn’t seem to work.)
- Leave the other options unmodified to be nice to archive.org.
- Click the “Archive” button to start the archiving process.
You don’t need to leave this web page open; the archiving process will continue to run even if you close it. Archive.org will eventually send you an email telling you that it has started processing your request, with a link for monitoring the status of the request. This email won’t come right away, but it will come. Once it comes, you’ll also be able to see the status of the job at the above wayback-gsheets URL (after logging into Google again).
While the archiving process is running, the progress may jump around, i.e., sometimes the number of URLs and percentage completed may go down instead of up. Furthermore, the number of errors reported may jump around as well. I have no idea why this is. In fact, while one of my archive jobs was running, for some reason it “cloned” itself and there were two jobs listed with the same status URL. I have no idea why this happened, but it didn’t seem to hurt anything.
Eventually archive.org will send you another email telling you that the archiving job is finished. This could take days.
Step 5: After the archiving is finished, check which URLs need to be submitted again
As I mentioned above, I don’t know why, but not all of my tweets were successfully archived on the first pass, so they needed to be submitted again. I wanted to submit just the missing URLs rather than all of them, to minimize the load on archive.org and also minimize the amount of time for the second archiving run. Fortunately the script above will help with this.
- Wait until you get the email from archive.org telling you the archiving job is finished, then wait another ten hours to give archive.org time to add all the archived URLs to the Wayback Machine’s index.
- Run the script on the output of the last run, e.g., like this: python3 wayback-check.py --check-archive filtered-urls.csv > filtered-urls2.csv. The newly created CSV will list URLs that still aren’t archived in archive.org.
- Repeat “Step 3” through “Step 5” of these instructions (each time you repeat “Step 5”, start with the CSV file that was created the last time you did “Step 5”) until there are no URLs left to archive, or until the list is small and looks like it’s composed only of weird URLs that archive.org is probably unable to archive for whatever reason.
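If you want to spot-check a few URLs by hand between runs, here is a minimal sketch that asks the Wayback Machine’s public availability endpoint whether it has a snapshot of a URL. I’m not claiming this is how wayback-check.py does its check; it’s just one way to ask the question. The function names are hypothetical, and the response shape (an "archived_snapshots" object that contains a "closest" entry when a snapshot exists) is as I understand that endpoint to behave.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the availability lookup URL for a single page URL."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": url})


def has_snapshot(response_text):
    """Return True if the JSON response reports an available snapshot."""
    data = json.loads(response_text)
    # An unarchived URL yields an empty "archived_snapshots" object.
    closest = data.get("archived_snapshots", {}).get("closest")
    return bool(closest and closest.get("available"))


def is_archived(url, timeout=30):
    """Look up one URL over the network and report whether it is archived."""
    with urllib.request.urlopen(availability_query(url), timeout=timeout) as response:
        return has_snapshot(response.read().decode("utf-8"))
```

For example, is_archived("https://twitter.com/example_handle/status/123") makes a single request. If you loop over a whole CSV, pause between requests to be nice to archive.org.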
Once the list of remaining URLs is small enough, you can go to archive.org and search the Wayback Machine for individual URLs that are still missing to add them one by one. Note, however, that the “Send me email when done” option there doesn’t seem to work, i.e., the email never seems to arrive, so you’ll have to just wait a while and then search for the URLs again to confirm that they’ve been added.
When you’re all done you can delete all the Google Sheets.
P.S. If you’re relying on the Internet Archive to preserve copies of your tweets forever, then please consider kicking in at least a small donation to help defray the cost of doing that. Storage is cheap, but it isn’t free.