Why archivists are scrambling to back up the internet

Published: Feb 26, 2017, 7:49 p.m. MST

President Barack Obama uses a laptop computer to send a tweet during a "Twitter Town Hall" in the East Room of the White House in Washington, July 6, 2011. Charles Dharapak, Associated Press

If you open Vine, a once-popular mobile app that allowed users to share 6-second looping videos, on your phone today, you’ll find only a simple camera. The app’s active feed and its broad community of over 200 million users, shut down as of Jan. 17, are gone.

When Vine announced last October it would soon be shutting down, panic erupted among some of its avid fans. Where would their videos go? Were their short clips of grinning toddlers, pets’ antics, a flickering campfire — for some, brief but important records of everyday life — not actually theirs to keep?

For many, Vine’s closing served as a poignant reminder of how ephemeral our favorite web content is, always accessible with a quick tap of a finger until, suddenly, it’s gone. Institutions and companies fail (like Gawker or GeoCities), often sentencing the websites they host to digital oblivion.

Fortunately, Vine clarified that users could download videos for safekeeping any time before Jan. 17 and said that every Vine clip could be streamed on its website, albeit for, as BuzzFeed pointed out, “an unspecified amount of time.”

Yet, the internet remains an incredibly fragile space. Experts estimate the average lifespan of a website is only 92 days, and a 2014 Harvard Law School study found that half of the URLs in U.S. Supreme Court cases, which function as online footnotes, don’t link to the originally cited information.

These findings represent two widespread internet phenomena known as “link rot,” when a broken URL yields a blank error page, and “content drift,” when a link directs to new content that has overwritten the old, leaving no trace of what once was.

Because of this, many fear a “digital dark age,” or a future in which today’s digital documents are deleted or trapped in obsolete and therefore inaccessible formats. To combat potential loss and preserve a historical record of the 21st century, a growing group of individuals and institutions are scrambling to save web content.

What should we save?

In addition to revealing the internet’s impermanence, Vine’s failure also posed an important question regarding the ever-changing web: what is worth saving?

While some were sad to see the app go, others celebrated its demise. The culture of the youth-oriented platform was known by many as insular and unfunny at best and shockingly racist at worst. Is every Vine video indispensable in a digital record of 21st-century culture?

These questions of what's saved and what's lost plague digital archivists, especially when considering how much new web content is created every day. Internet Live Stats, a project measuring global internet growth, estimates that each second, there are more than 7,500 tweets posted, 750 Instagram photos uploaded and 2.5 million emails exchanged. Today’s exponentially expanding web consists of well over one billion sites.

Different libraries and archives have radically different approaches to collecting and preserving this web content, according to Trevor Owens, senior program officer over the national digital platform at the Institute of Museum and Library Services and a former digital archivist at the Library of Congress.

Some favor “being highly selective and curating” web archives while others “try to catch massive amounts of information and then use computational tools to sort and make sense of it,” Owens explained.

Archivists, including Abby Smith Rumsey, a historian, former Library of Congress archivist and author of the 2016 book “When We Are No More: How Digital Memory is Shaping Our Future,” are quick to point out that we can’t, and probably shouldn’t, save absolutely everything.

“Claiming we should save every single scrap of the web is like saying everything that’s ever been written down on a piece of paper deserves to be saved,” Rumsey said. Lots of web material, like forwarded email, is redundant.

Still, many archivists feel it’s best to err on the side of caution. Rumsey pointed out that we have lost 80 percent of all silent films ever made because contemporaries saw them as mere entertainment and not worth saving.

Conversely, “Moby Dick” was not a bestseller in Melville’s day and was declared a great American novel only after the author’s death, once readers embraced and circulated the few remaining copies.

Preserving all the world’s knowledge

The scope of an institution’s mission also defines its approach to web archiving. Some, like the web archives of a state history museum or local newspaper, are interested in very specific types of content, which automatically narrows their focus.

Others are much more comprehensive in their archiving efforts. The National Archives collects most digitized federal government records and the Library of Congress archives billions of web pages deemed to be of cultural value by subject specialists.

Yet the biggest institution archiving the web for historical preservation is the Internet Archive, a nonprofit digital library founded in 1996 by technology entrepreneur Brewster Kahle.

Internet Archive founder Brewster Kahle checks connections to hard drives comprising his digitally stored library on Dec. 18, 2006, in San Francisco. | Ben Margot, Associated Press

The goal of this massive repository is to “create the Library of Alexandria version two” and provide “universal access to all knowledge,” including that on the public web, he explained.

Headquartered in San Francisco at a former Christian Science church, which Kahle appreciates because its Grecian columns allude to the most significant library of the ancient world, the archive houses a variety of digital collections, providing free public access to millions of ebooks, movies, software and music.

But its most impressive collection consists of over 279 billion web pages, captured and stored by the archive’s famous Wayback Machine. Created in 2001 and named after the WABAC machine from the 1960s cartoon “The Rocky and Bullwinkle Show,” which transported characters to the past to experience important historical events, this web archive features technology that captures a snapshot of every web page it can find.

Kahle sees the internet as the world’s best library and wants to make it more reliable and permanent by mending its gaps. One of the Wayback Machine’s crawlers continuously archives and tests outbound Wikipedia links, replacing dead links with archived versions. He said the Wayback Machine has repaired over one million broken Wikipedia links.

Social media archiving

Other web preservation projects focus on collecting broad swaths of social media data, which archivists are beginning to view as rich, multifaceted accounts of historical events.

President Barack Obama meets with Twitter co-founder Jack Dorsey during a "Twitter Town Hall" in the White House on July 6, 2011. | Charles Dharapak, Associated Press

Rumsey explained that historians began to recognize social media’s potential as historical documentation during the 2011 Arab Spring, when Twitter helped organize protesters, command global attention and provide insight into social and political behavior on a broad scale.

The Library of Congress now archives all public tweets, though its collection is not currently open to researchers.

Recently, web archivists have been particularly interested in the social media materials from the Obama administration.

In addition to being stored in the National Archives (federal law mandates the preservation of all presidential records, including digital ones), the White House announced Obama’s social media legacy — from tweets to Instagram photos to Pinterest posts — would also be housed in a public portal: The Obama White House Social Media Archive, a searchable digital archive of over 250,000 social media records from official White House social media profiles.

Hello, Twitter! It's Barack. Really! Six years in, they're finally giving me my own account.
— President Obama (@POTUS44) May 18, 2015

The collection offers “an accurate retelling of what happened from the lens of the White House for the last eight years,” according to Anil Chawla, CEO of ArchiveSocial, the private company hosting the public archive. He foresees it being of immense value to journalists and historians but also to any citizen wanting to look back and reflect on the past.

Chawla said the archive illustrates the growing prominence of social media in public discourse: no longer can tweets be written off as purely “trivial stuff.”

And social media’s centrality has only increased since the inauguration of President Donald Trump, whose tweets have sparked debates about conflicts of interest, the destruction of presidential records and which stories should be labeled “fake news.” Accordingly, several Trump Twitter archives have already emerged online.

My daughter Ivanka has been treated so unfairly by @Nordstrom. She is a great person -- always pushing me to do the right thing! Terrible!
— Donald J. Trump (@realDonaldTrump) February 8, 2017

A variety of developers are also designing tools to help academics collect and make sense of social media information.

Bergis Jules, an archivist at the University of California Riverside, leads a social media archiving project called Documenting the Now, which is building an application that allows users to retrieve and analyze sets of tweets related to particular events, hashtags, keywords or locations.

Jules first saw the need for this tool when attending a conference for the Society of American Archivists in August 2014. As he and other archivists attended panels, they were constantly distracted by their phones, unable to take their eyes off Twitter as the protests about police violence sparked by Michael Brown’s killing in Ferguson, Missouri, unfolded.

“We were just sitting there watching all that information flow through, and as archivists, our first instinct is to think about how to capture that type of content and how to preserve it for the long term so journalists and historians can find and use it later,” Jules said. “And so we played around with capturing some data and writing about what it means.”

Jules and others involved in the project believe studying events through the lens of thousands of voices on Twitter yields fuller, more accurate narratives and a more democratic approach to history.

Technical challenges

Though archivists insist saving web content is essential, they also admit the task faces obstacles, from technical to ideological, especially when considering long-term preservation.

“Digital materials themselves aren’t particularly long-lived, and the interfaces we use to view them aren’t either,” Owens said.

Keeping hard drives spinning for decades is expensive and consumes a lot of energy, Kahle noted. The bigger issue is whether or not data formats will be accessible years from now. Adobe Flash, a popular software platform used to display text and animation and stream audio and video, is already starting to go away, he said.

Preservation of analog materials like books or manuscripts is a fairly passive process. Archivists can set them on a shelf in a temperature-controlled environment and walk away. But digital materials require constant monitoring and updating so they can be migrated to new, accessible formats.

In her book, Rumsey likens this process to growing a garden: “everything we entrust to digital code needs regular tending, refreshing and periodic migration to make sure that it is still alive.”

In light of these challenges, some are turning away from digital archives altogether, arguing there is no way we can trust them to be accessible centuries or even decades from now. For instance, one Austrian archivist is so worried about digital data’s fragility that he’s laser printing historical records on 1-mm-thick ceramic sheets and storing them in salt mines, defending his process by pointing to tablets of Sumerian cuneiform from 3000 B.C.E. that are still around today.

Rumsey, on the other hand, said she “greatly encourages those who try to come up with non-electronic ways of preserving information,” but noted that “you can’t have a complete record of contemporary events if you don’t have audiovisual recordings.”

Institutional responsibility

While hardware and software provide plenty of obstacles, philosophical approaches to web archiving are even more difficult to navigate.

Brewster Kahle prepares a book for digital scanning on Dec. 18, 2006. | Ben Margot, Associated Press

According to Kahle, the number one challenge to preserving as much of the web as possible is “a lack of clarity about institutional responsibilities.”

For many web materials, “there is no natural custodian,” Jefferson Bailey, director of web archiving programs at Internet Archive, explained. Plenty of archives are saving some things, but a great deal of potentially valuable material still falls through the cracks.

For example, who is going to take it upon themselves to gather and preserve most of YouTube?

Rumsey argues that ultimately public institutions must take charge of broad-scale web archiving efforts, suggesting that they, unlike commercial companies, have a responsibility to serve future generations. Yet both she and Kahle worry this will be difficult when public libraries and archives are taxpayer funded.

Web archiving, and social media archiving in particular, is also fraught with ethical issues, according to Jules. Does the fact that an individual posts information on a public forum like Twitter give archiving institutions the right to collect and store it?

There are no clear answers right now, and Jules explained these questions are complicated by the fact that collecting social media content may actually threaten the people who created it. In 2016, the ACLU revealed data mining companies like Geofeedia used social media information to make police departments aware of individuals participating in public protests and other forms of civil disobedience.

Documenting the Now is testing different ways of establishing social media users’ consent and alerting them when their data is collected, though archivists are still learning how to navigate these ethical challenges.

Uncertain but optimistic

Despite obstacles, archivists are confident in future web archiving innovation — they just don’t yet know what that innovation will look like.

They are more certain of the ways preserving the web has benefited us and will continue to do so.

According to Kahle, having access to large amounts of social media data and other web information can give people a fuller, more accurate sense of what’s going on around them. In an era of post-truth and fake news, people are often “hoodwinked by anecdotes,” Kahle said. But if we can examine thousands of tweets and analyze hundreds of news articles to get a sense of larger trends, we can develop more comprehensive narratives.

Rumsey sees similar benefits, but noted this new type of thinking will require a new kind of digital literacy that involves sorting through massive amounts of information and making sense of it.

Owens also expressed optimism, explaining that most archivists are frustrated by talk of a looming digital dark age because it discounts the decades of digital preservation progress we have already achieved.

“Some people say we are going to lose everything. But we won’t,” Owens insisted. “And historically, there has only been a small sliver of everything ever created that persists in libraries and archives in a permanent collection. We have so much more digital information saved than we ever had from previous eras.”

Rumsey acknowledged that, in the end, “we will lose some things, and we will regret what we’ve lost. But we will be astounded at how much we will be able to save.”

Email: lfields@deseretnews.com
