RRC Blog

The Internet Archive: a wealth of digital history

Posted by Barbara Benedett on 9/28/20 1:28 PM

Since its creation in 1996, the Internet Archive (IA) at archive.org has been, well... archiving the internet. It might sound like an impossible feat: collecting data from billions of websites and preserving their information, design, and sometimes functionality. However, this is exactly what has been accomplished. The Internet Archive stores roughly 330 billion webpages as well as digital collections of books, audio, video, images, games, and software programs. One personal favorite is the Software Library’s collection of playable 80s-90s video games, such as The Oregon Trail*, Pac-Man, and Donkey Kong.

The Curtis Institute of Music has a presence on IA as well. About 15 years ago, the library and archives partnered with IA and Lyrasis to digitize all of the recital programs, catalogs, and back issues of Overtones (found here). Going back to 1924, this collection can be freely accessed by anyone. Optical Character Recognition (OCR) allows for text searching. For example, once inside the 1924-1925 recital program book, a search for ‘Vengerova’ will reveal every recital that featured the renowned pianist Isabella Vengerova. The catalogs and Overtones issues provide an equally fascinating glimpse into Curtis’s past. All three collections are heavily utilized, with several thousand views per item, by researchers who would otherwise not be able to gain access to the physical materials housed in Curtis’s archives.


Curtis.edu** has also been archived since the late 90s (initially by a third party) and can be viewed via IA’s “Wayback Machine”. The Wayback Machine allows one to plug in any web address and view what the site looked like at the time it was captured (also referred to as “crawled”, given that the program used to capture web data is referred to as a “crawler”). Here is the result from a crawl of curtis.edu in 1998.
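
For readers who would rather script such a lookup than type it into the search box, a Wayback Machine address can be assembled by hand. The sketch below assumes the commonly seen https://web.archive.org/web/<timestamp>/<address> pattern; the function name wayback_url is made up for illustration, and the Wayback Machine typically redirects to whichever capture is closest to the requested date.

```python
# Assumed public URL pattern for Wayback Machine captures.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(address: str, year: int, month: int = 1, day: int = 1) -> str:
    """Build a Wayback Machine address for a capture near the given date.

    `wayback_url` is a hypothetical helper, not part of any IA library.
    """
    if not (1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError("month must be 1-12 and day must be 1-31")
    timestamp = f"{year:04d}{month:02d}{day:02d}"  # YYYYMMDD
    return f"{WAYBACK_PREFIX}/{timestamp}/{address}"


# Ask for curtis.edu as it looked around the start of 1998.
print(wayback_url("curtis.edu", 1998))
```

Pasting the printed address into a browser should land on a capture from around that date, if one exists.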

Aside from general curiosity about internet and organizational history, the Wayback Machine is an especially important tool for documenting and assuring accountability of government, public and political figures, and other important organizations. For example, all .gov websites have been archived thousands, if not millions, of times. Repeated and frequent captures reveal any informational changes and deletions to these sites. The White House Twitter feed, as well as other social media platforms, has been archived since the Obama years: Obama White House Social Media Archive

An important point, especially when holding public figures and governments accountable, is that it is not necessarily the owner of the content doing the archiving. If something is or was publicly accessible on the World Wide Web, it can be crawled and redistributed. A politician deletes a Facebook post after it has already been crawled? Too late! The original post is preserved and available for anyone who knows how to search for it. (A warning to think carefully about what you post.) For higher ed. institutions, archived websites can reveal information about former faculty and board members, how admissions requirements have changed, or when certain policies were enacted.

Around 2012, the Digital Initiatives Department at Curtis and the Curtis Archives began formally collecting web materials. Utilizing Archive-It, IA’s web crawling software and access portal, the archives collects Curtis-generated content as well as related news, alumni and faculty sites, and topics of interest.

Neither the process nor the product is particularly pretty, and maintenance is ongoing. Sites get blocked by robots.txt files or put behind paywalls. Older data cannot always be accessed without technological assistance, such as virtual machines that emulate an MS-DOS or Windows Vista environment. The starting webpages for a crawl (referred to as ‘seeds’ in tech-speak) can be crawled on multiple levels, and it is difficult to permanently preserve the original in its entirety. For an archives such as Curtis’s, it is usually the text of a page that is historically valuable. Preserving text and documents is easier than attempting to preserve the functionality of the page (e.g., dropdown menus, playable media). Over time, many tools once used for rendering pages, such as Adobe Flash, lose tech support, and an older site will no longer function in a modern browser.
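
To make the robots.txt point concrete, here is a minimal sketch of how a well-behaved crawler decides whether it may capture a page. It uses Python's standard urllib.robotparser module; the robots.txt rules, the "example-archiver" crawler name, and the example.org addresses are all invented for illustration and do not describe how Archive-It itself is configured.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site publishes its own at /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit this crawler to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "example-archiver" is a made-up crawler name.
for page in ("https://example.org/about/", "https://example.org/private/page.html"):
    allowed = is_crawl_allowed(SAMPLE_ROBOTS_TXT, "example-archiver", page)
    print(page, "allowed" if allowed else "blocked")
```

A single Disallow line like the one above is enough to keep an entire section of a site out of a crawl that honors robots.txt.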

Archiving media from websites is possible, but the large file sizes are especially problematic from a storage and budget standpoint. Additionally, the question to consider when archiving one’s own organization is “Does a better copy of this file exist elsewhere?” There are also ‘crawler traps’ to be navigated. A simple calendar on a webpage can result in an infinite crawl as the software captures every possible date in the calendar app. The Digital Archivist at Curtis once ran over the site storage capacity by 1 terabyte by accidentally running too comprehensive a crawl on Curtis’s YouTube page. This resulted in an attempted capture of every video and every URL that could be linked to from the site, including links in comments, ads, etc. There was a lot of apologizing to get Archive-It support to overlook (and not charge for) this huge amount of data.


OK, the archivist in question was me.

I don’t make that mistake any more.
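
For the curious, the calendar trap described above can be illustrated with a toy example. This is only a sketch under stated assumptions: the simulated calendar, the function names, and the max_depth and max_pages limits are all hypothetical, and Archive-It's real scoping settings work differently. The idea is simply that a crawl with a depth limit and a page cap will stop, where an unbounded one would follow "next month" links forever.

```python
from collections import deque
from typing import Callable, List


def bounded_crawl(
    seed: str,
    get_links: Callable[[str], List[str]],
    max_depth: int = 3,
    max_pages: int = 50,
) -> List[str]:
    """Breadth-first crawl that stops at a depth limit or a page cap."""
    seen = {seed}
    queue = deque([(seed, 0)])
    captured: List[str] = []
    while queue and len(captured) < max_pages:
        url, depth = queue.popleft()
        captured.append(url)
        if depth == max_depth:
            continue  # do not follow links any deeper than the limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return captured


def calendar_links(url: str) -> List[str]:
    """Simulated calendar: every month page links to the next month, forever."""
    prefix = "/calendar/"
    month = int(url[len(prefix):])
    return [f"{prefix}{month + 1}"]


pages = bounded_crawl("/calendar/0", calendar_links, max_depth=5)
print(len(pages))  # 6: the seed plus five levels of "next month" links
```

Without the two limits, the loop above would never run out of new calendar pages to capture.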


In conclusion, searching the Internet Archive for any topic of interest will provide a wealth of digital resources and internet history, curated but unedited. Curtis is a part of the IA’s vast collection, both from the institute’s own efforts and the efforts of third parties with an interest in Curtis. Should you have any questions regarding accessing material from the Internet Archive, please contact us at archives@curtis.edu.


*A great article on Medium about The Oregon Trail for the post-Gen X-ers: Why ‘The Oregon Trail’ is One of the Most Realistic Video Games Ever

**On a side note, before plans were underway to build a website for Curtis, the curtis.edu address was secured by the forward-thinking Head Librarian at the time, the late Betsy Walker!


Topics: Curtis Archives and Library, Technology