Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Michael L. Nelson, Old Dominion University
with:
Old Dominion University: Scott G. Ainsworth, Ahmed AlSum, Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle
Los Alamos National Laboratory: Robert Sanderson, Herbert Van de Sompel
Web Archiving: Big Data?
Two Common Misconceptions About Web Archiving
• Prior = old = obsolete = stale = bad
  – who cares, not an interesting problem
• The Internet Archive has every copy of everything that has ever existed
  – who cares, problem solved
How Much of the Web is Archived? It Depends on Which Web…

Including SE cache | Excluding SE cache | 2013
90%                | 79%                | 95%
97%                | 68%                | 92%
35%                | 16%                | 23%
88%                | 19%                | 26%

Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives
Temporal Drift
August 27, 2005, 11:16 a.m. EDT (link)

Temporal Drift: Now 3 Hours in the Past
August 27, 2005, 11:16 a.m. EDT (link) → August 27, 2005, 8:00 a.m. EDT (link)

Temporal Drift: Now 17 Days in the Future
August 27, 2005, 11:16 a.m. EDT (link) → August 27, 2005, 8:00 a.m. EDT (link) → September 13, 2005, 8:12 a.m. EDT (link)

Temporal Drift: Now 23 (or 6) Days in the Future
August 27, 2005, 11:16 a.m. EDT (link) → August 27, 2005, 8:00 a.m. EDT (link) → September 13, 2005, 8:12 a.m. EDT (link) → September 19, 2005, 8:25 a.m. EDT (link)
10+ clicks in the archive result in a median drift of ~45 days (standard UI) or ~15 days with Memento. ~2% of sessions have a drift of > 1 year.
see: http://www.cs.odu.edu/~mln/pubs/jcdl-2013/jcdl93-ainsworth.pdf
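The "23 (or 6)" in the last slide title suggests drift can be measured either from the starting capture or from the previous click. A minimal sketch of both measures, using the four timestamps from these slides. The function names, the choice of the first capture as the target, and the use of whole calendar days are assumptions made for illustration, not the method of the cited paper.

```python
from datetime import datetime, timedelta, timezone

# EDT is UTC-4; a fixed offset is sufficient for these 2005 timestamps.
EDT = timezone(timedelta(hours=-4), "EDT")

# Capture datetimes shown on the temporal drift slides, in click order.
path = [
    datetime(2005, 8, 27, 11, 16, tzinfo=EDT),  # starting capture (assumed target)
    datetime(2005, 8, 27, 8, 0, tzinfo=EDT),
    datetime(2005, 9, 13, 8, 12, tzinfo=EDT),
    datetime(2005, 9, 19, 8, 25, tzinfo=EDT),
]

def drift_from_target(path):
    """Signed offset of every later capture from the first capture."""
    target = path[0]
    return [capture - target for capture in path[1:]]

def calendar_days_from_target(path):
    """Same comparison, counted in whole calendar days."""
    target = path[0].date()
    return [(capture.date() - target).days for capture in path[1:]]

def calendar_days_between_steps(path):
    """Calendar days between each capture and the one visited before it."""
    return [(b.date() - a.date()).days for a, b in zip(path, path[1:])]

print(drift_from_target(path)[0])         # a bit over 3 hours into the past
print(calendar_days_from_target(path))    # [0, 17, 23]
print(calendar_days_between_steps(path))  # [0, 17, 6]
```

Measured from the starting capture, the last page is 23 days in the future; measured from the previous page, it is 6.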
We Call the Drift in a Single Page "Temporal Spread"
Figure annotations: 2005-05-14 01:36:08; +9 days; +7 months; +18 days; +18 days; +2.1 years
Using current policies, only ~76% of pages are complete, with a mean temporal spread of ~1 year, and with ~5% of pages having a temporal violation. (submitted for publication)
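A rough sketch of how a temporal spread could be computed for one page. It assumes the spread is the interval between the earliest and latest capture among the page and its embedded resources, and that the offsets on this slide are relative to the 2005-05-14 01:36:08 capture; the slide states neither explicitly. The month and year offsets are converted to days only approximately, and `temporal_spread` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Capture datetime shown on the slide (assumed to be the root page).
root = datetime(2005, 5, 14, 1, 36, 8)

# Offsets shown on the slide. "+7 months" and "+2.1 years" are approximated
# in days (assumption); the exact capture datetimes are not given.
offsets = [
    timedelta(days=9),
    timedelta(days=213),  # "+7 months", approximately
    timedelta(days=18),
    timedelta(days=18),
    timedelta(days=767),  # "+2.1 years", approximately
]

def temporal_spread(root, embedded):
    """Interval between the earliest and latest capture in a composite page."""
    captures = [root] + list(embedded)
    return max(captures) - min(captures)

embedded = [root + offset for offset in offsets]
spread = temporal_spread(root, embedded)
print(spread.days)  # 767, i.e. about 2.1 years
```

Under this definition a single late-captured resource determines the spread for the whole page.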
The archive.is archived version of the peep.us version of the archive.org version
Why Make Lots of Copies?
Archives Are Subject to the Same Vagaries as Other Web Sites…
Graph annotations: IA API changes; ODU OS upgrade; ODU power outage
reminder: 0.99^100 ≈ 0.37; 0.999^100 ≈ 0.90
In a perfect world, this graph should be monotonically increasing.
Memento allows simultaneous access to more archives, but this also means that at any given time, some archive(s) will be down.
see: http://arxiv.org/abs/1307.5685
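The reminder on this slide (0.99 and 0.999 each raised to the 100th power) can be checked with a short sketch. It assumes 100 independent events, each succeeding with the stated probability; the slide does not say whether the 100 counts archives, requests, or something else, and the function names are illustrative.

```python
def all_succeed(p_success, n):
    """Probability that n independent events all succeed."""
    return p_success ** n

def at_least_one_fails(p_success, n):
    """Complement: probability that at least one of n events fails."""
    return 1.0 - all_succeed(p_success, n)

print(round(all_succeed(0.99, 100), 2))         # 0.37
print(round(all_succeed(0.999, 100), 2))        # 0.9
print(round(at_least_one_fails(0.99, 100), 2))  # 0.63
```

Even at 99% reliability per event, a failure somewhere among 100 events is more likely than not, which is consistent with the observation that some archive will be down at any given time.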
Summary
• We have a cultural mandate to preserve "obsolete data or resources"
  – however, we currently have limited discovery and replay tools
• We need lots of people making several copies of many things
  – Memento is the mechanism for accessing the long tail of archives