Archive Intersections

Intersection Analysis

Archives Unleashed 4.0, British Library

Sawood Alam, Computer Science, Old Dominion University
Gil Hoggarth, Web Archiving, British Library
Mat Kelly, Computer Science, Old Dominion University
Jessica Ogden, Web Science, University of Southampton
Shawn Walker, Information Science, University of Washington
Dawn Walker, Information Studies, University of Toronto

Context

Our group came together around an interest in how to assess what's missing in web archives. We are an interdisciplinary mix of researchers and web archivists with a shared methodological interest in quickly assessing the presence or absence of URLs and domains within web archive collections.

Given the availability of data, we chose Occupy Wall Street as a case study for assessing presence and absence across multiple collections. Occupy Wall Street (OWS) is the name given to a protest movement that began on September 17, 2011, in Zuccotti Park, located in New York City's Wall Street financial district. It received global attention and spawned a worldwide movement against economic inequality (source: https://en.wikipedia.org/wiki/Occupy_Wall_Street).

Goal

Produce a set of recipes and metrics for assessing and calculating URL and domain coverage across web archives. This is relevant to questions of archival staffing, labour and efficiency, redundancy (and its ramifications for resources), the presence or absence of domains, and the inherent issues of selection and representativeness in the preservation of web resources.

Collections

Internet Archive Global Events "Occupy Movement 2011/2012"

Nov 2011 - Aug 2012
Archive-It, Internet Archive
CDX data 1.5GB across 5,018 files
WAT data 148.7GB across 5,800 files

Collection Details

NSF-IA Rutgers

2010 - 2013
Collection of OWS URLs identified by Rutgers University from Internet Archive
NSFIA CDX data 8.2GB across 2,793 files
NSFIA WAT 78GB across 1,280 files

Occupy Twitter

Oct 2011 - June 2012
291m tweets
301,438 domains shared
8,916,853 URIs shared

Collection Details

Quick Stats

74,746,752

Total URLs

0.09%

Overlap Between Collections

0.03%

Overlap Between Collections and Tweets

Collection Analysis

What can we learn from a CDX?

Overlap is inversely proportional to the diversity of URIs

Our study shows that methods for determining coverage and overlap between datasets have the potential to reveal insights into diversity in collection practices (as combined from multiple collectors, tools, selection mechanisms).
This work has important implications for researchers comparing archives collected around defined, curated events (through Archive-It, for example) with datasets extracted after the fact from existing archives and collections.
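As a rough illustration of how overlap between two collections could be measured from CDX files, the sketch below collects the first field of every CDX line (the canonicalised URL key) and compares two sets of keys. The function names are hypothetical, and the overlap measure is an assumption: the slides do not say how the overlap percentages were defined, so this uses the Jaccard ratio (shared keys divided by all distinct keys).

```python
from typing import Iterable, Set


def cdx_url_keys(lines: Iterable[str]) -> Set[str]:
    """Collect the first field (the canonicalised URL key) of each CDX line."""
    keys = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the header line (" CDX N b a m s k r M S V g").
        if not line or line.startswith("CDX"):
            continue
        keys.add(line.split(" ", 1)[0])
    return keys


def overlap_ratio(a: Set[str], b: Set[str]) -> float:
    """Assumed metric: Jaccard overlap, |A and B| / |A or B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Tiny made-up samples standing in for two collections' CDX files.
    collection_a = [
        " CDX N b a m s k r M S V g",
        "org,occupywallst)/ 20111001000000 http://occupywallst.org/ text/html 200 AAA - - 100 0 a.warc.gz",
        "com,example)/a 20111002000000 http://example.com/a text/html 200 BBB - - 100 0 a.warc.gz",
    ]
    collection_b = [
        "org,occupywallst)/ 20120101000000 http://occupywallst.org/ text/html 200 CCC - - 100 0 b.warc.gz",
        "com,example)/b 20120102000000 http://example.com/b text/html 200 DDD - - 100 0 b.warc.gz",
    ]
    ratio = overlap_ratio(cdx_url_keys(collection_a), cdx_url_keys(collection_b))
    print(f"overlap: {ratio:.2%}")
```

For collections of the size described here, the key sets may not fit comfortably in memory, so sorting the keys on disk and merging them would likely be a better fit than in-memory sets.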

What can we learn from a WAT?

Outlinks of Domains

Outlinks of SURTs
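A minimal sketch of how outlink counts like these might be derived from WAT data, assuming each WAT record has already been parsed into a Python dict and that links sit under Envelope > Payload-Metadata > HTTP-Response-Metadata > HTML-Metadata > Links. The helper names are hypothetical, and `simple_surt` is a deliberately simplified stand-in for full SURT canonicalisation (it only lowercases the host, drops a leading "www." and reverses the host labels).

```python
from collections import Counter
from typing import List, Optional
from urllib.parse import urlsplit


def outlinks(wat_record: dict) -> List[str]:
    """Return the URLs listed in a parsed WAT record's HTML link metadata."""
    links = (
        wat_record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    return [link["url"] for link in links if "url" in link]


def host_of(url: str) -> Optional[str]:
    """Lowercased hostname of an absolute URL, or None if there is none."""
    try:
        return urlsplit(url).hostname
    except ValueError:  # malformed URLs are simply skipped
        return None


def simple_surt(url: str) -> str:
    """Simplified SURT-style key: reversed host labels, then the path."""
    host = host_of(url) or ""
    if host.startswith("www."):
        host = host[4:]
    path = urlsplit(url).path or "/"
    return ",".join(reversed(host.split("."))) + ")" + path


def outlink_domains(wat_record: dict) -> Counter:
    """Count outlinks per domain, ignoring relative links."""
    hosts = (host_of(url) for url in outlinks(wat_record))
    return Counter(host for host in hosts if host)


if __name__ == "__main__":
    record = {"Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {
        "HTML-Metadata": {"Links": [
            {"path": "A@/href", "url": "http://twitter.com/occupy"},
            {"path": "A@/href", "url": "/about"},
        ]}}}}}
    print(outlink_domains(record))
    print(simple_surt("http://www.OccupyWallSt.org/about"))
```

Summing the per-record counters across a whole collection would give a domain-level outlink table; swapping `host_of` for `simple_surt` would give the SURT-level equivalent.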

Parting Thoughts

The more collectors the better!

The study quite clearly demonstrates that even in collections where we might expect to see a lot of overlap in the domains/URIs collected, little overlap is present. Whilst resource constraints in web archiving (space, labour, funding) remain a challenge, some redundancy in the effort to collect and maintain these web archives is still needed.

Diversifying seed lists with links shared on social media = good!

The study confirms the lack of overlap between web archives and social media URLs (e.g. things that are shared on Twitter). This reinforces existing research on how different collection/curation practices, e.g. the masses vs ‘gatekeepers’, mark different resources as ‘important’ (Milligan et al., 2017).

Working with derivatives for assessing coverage is pretty straightforward

Whilst we didn't have time to produce an in-depth cookbook for assessing the intersection of web archives, we managed to do a lot of number crunching and archival visualisation in a relatively short amount of time. In the future we could examine different collection mechanisms/tools, and compare the ‘success rates’ of crawler types/tools, using filetype distributions and status codes to indicate whether certain crawlers (on the same URLs) were ‘more successful’ in capturing live resources.
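One way the filetype and status code comparison could start is sketched below. It assumes the common 11-field CDX layout (" CDX N b a m s k r M S V g"), where the fourth field is the MIME type and the fifth is the HTTP status code; other CDX layouts would need different field positions. The function names and the definition of "success" (any 2xx response) are illustrative assumptions, not something the study defined.

```python
from collections import Counter
from typing import Iterable, Tuple


def cdx_distributions(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Tally MIME types and status codes, assuming an 11-field CDX layout."""
    mimes, statuses = Counter(), Counter()
    for line in lines:
        fields = line.split()
        # Skip the header line and anything too short to hold both fields.
        if len(fields) < 5 or fields[0] == "CDX":
            continue
        mimes[fields[3]] += 1
        statuses[fields[4]] += 1
    return mimes, statuses


def success_rate(statuses: Counter) -> float:
    """Assumed definition of success: the share of captures with a 2xx status."""
    total = sum(statuses.values())
    ok = sum(n for code, n in statuses.items() if code.startswith("2"))
    return ok / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample lines standing in for one crawler's CDX output.
    sample = [
        " CDX N b a m s k r M S V g",
        "org,occupywallst)/ 20111001000000 http://occupywallst.org/ text/html 200 AAA - - 100 0 a.warc.gz",
        "org,occupywallst)/x 20111001000001 http://occupywallst.org/x text/html 404 BBB - - 100 100 a.warc.gz",
        "org,occupywallst)/i.png 20111001000002 http://occupywallst.org/i.png image/png 200 CCC - - 100 200 a.warc.gz",
    ]
    mimes, statuses = cdx_distributions(sample)
    print(mimes.most_common(), f"success rate: {success_rate(statuses):.0%}")
```

Running this per crawler over the URLs they have in common, then comparing the resulting rates, would be one way to approach the crawler comparison described above.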

Thanks!

digitalshawn/archive-coverage