SF Technotes

The Library of Everything

By Michael Castelluccio
October 27, 2021

This year, the Internet Archive is celebrating its 25th year. In that relatively short time, the online library has added more than 622 billion web pages to its publicly available archive.

 

If you want to see what Google’s web presence looked like in 1998 or check Apple’s home page on October 22, 1996, they’re both there.

 

The idea that someone should collect and index the vast body of materials on the internet came to Brewster Kahle (below) in 1996. A computer engineer, Kahle had already helped create the WAIS system, the first search and document retrieval system for the internet, and when he and his partners sold their web traffic analysis company, Alexa Internet, to Amazon, Kahle turned his full attention to an even greater archiving project: the Internet Archive.

 

Brewster Kahle at “20th Century Time Machine” October 2017 Anniversary event by Brad Shirakawa

 

Probably the web’s most famous digital librarian, Kahle provides a thoughtful explanation for the necessity of an Internet Archive in the FAQ on the Archive’s website. “Most societies place importance on preserving artifacts of their culture and heritage. Without such artifacts, civilization has no memory and no mechanism to learn from its successes and failures. Our culture now produces more and more artifacts in digital form. The Archive’s mission is to help preserve those artifacts and create an Internet library for researchers, historians, and scholars.”

 

What the cofounders built isn’t just a colossal digital public library of websites. It also gives access to thousands of Metropolitan Museum of Art images, NASA images, and United States Geological Survey maps, along with separate archives devoted to film, audio, microfilm, TV news, software, and numerous miscellaneous collections from places like the Brooklyn Museum and the Michelson Cinema Research Library.

 

They’re all part of today’s Internet Archive. The images collection alone totals more than 3.5 million items. Other collections have narrower limits. The Great 78 Project, for instance, is an effort to digitize 250,000 donated 78rpm singles from 1880 to 1960. When completed, counting both sides of each disc, it will offer a “room” in the library housing 500,000 vintage recordings. The website collection currently has listings and links for 240 billion URLs.

 

The Internet Archive headquarters in San Francisco, Calif.  Image courtesy/Internet Archive

 

THE BEGINNING

 

The Argentine author Jorge Luis Borges once commented that a library without an index becomes paradoxically less informative as it grows. With that kind of undirected growth already underway for the internet, Kahle began work on his indexing project, the Internet Archive, in May of 1996.

 

With the ambitious mission of providing “universal access to all knowledge,” Kahle’s group released web crawlers on a regular schedule to collect the data that would become the basis of the index. Web crawlers are internet bots that are given sites to visit. There, they copy pages, which a search engine later processes and indexes. Users could then search the saved pages in what Kahle described as “three dimensions.” Crawlers differ from the hordes of their fellow online seekers called “web scrapers,” which are bots automated to collect specific data sets from the web.
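For readers who want to see the crawl-then-index idea in miniature, here is a rough sketch in Python. It is not the Archive's crawler. The three pages in FAKE_WEB are invented, nothing is fetched over the network, and the step of following links from page to page is an assumption about how crawlers generally discover pages rather than something the Archive's documentation is quoted on here.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# These pages are invented for illustration; a real crawler fetches over HTTP.
FAKE_WEB = {
    "http://example.com/": (
        "welcome to the example home page",
        ["http://example.com/about", "http://example.com/news"],
    ),
    "http://example.com/about": (
        "about the example library",
        ["http://example.com/"],
    ),
    "http://example.com/news": (
        "library news and archive updates",
        ["http://example.com/about"],
    ),
}


def crawl(seeds, web):
    """Visit each reachable page once, breadth-first, and return saved copies."""
    saved = {}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in saved or url not in web:
            continue  # already copied, or not a page we can reach
        text, links = web[url]
        saved[url] = text      # keep a copy of the page
        queue.extend(links)    # schedule the pages it links to
    return saved


def build_index(saved):
    """Map each word to the sorted list of URLs whose saved copy contains it."""
    index = {}
    for url, text in saved.items():
        for word in set(text.split()):
            index.setdefault(word, set()).add(url)
    return {word: sorted(urls) for word, urls in index.items()}


pages = crawl(["http://example.com/"], FAKE_WEB)
index = build_index(pages)
print(index["library"])
```

Starting from a single seed address, the sketch copies all three pages and then looks up which saved copies mention the word "library." Running the crawl again on a later date and storing the copies under that date would add the time dimension the Archive is known for.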

 

The crawlers return to the same sites again and again, so each site’s saved copies add up to a history of its progress. If you search the Apple website on the Archive, for instance, you will find snapshots of Apple home pages grabbed on 233,524 visits from October 22, 1996, to October 20, 2021. Visits to the Google home address now total an astounding 6,581,355 since 1998.

 

You might wonder why anyone would want folders this thick on individual home pages. There are many reasons. Businesses can study the web presence of competitors in detail over long periods, looking not only at presentation but also at content and the directions those companies have taken.

 

Journalists also often search for information on the Internet Archive. In 2014, one of the Archive’s saved social media pages of a separatist rebel leader in Ukraine, Igor Girkin, still had his boast of having shot down a “suspected Ukrainian military airplane” that turned out to be a civilian Malaysia Airlines jet. Girkin deleted the post on his social media page and then publicly blamed Ukraine’s military for the attack on the plane. He forgot, or didn’t realize, that a snapshot of the original page with the admission and boast was still available on the Internet Archive. Editors at Wikipedia rely on the Archive for verification, and journalists frequently check dead websites, dated news reports, and changes to website content like Girkin’s.

 

Patent lawyers looking to establish ownership of ideas use the Archive. “The United States Patent Office and the European Patent Office will accept date stamps from the Internet Archive as evidence of when a given Web page was accessible to the public. These dates are used to determine if a Web page is available as prior art for instance in examining a patent application” (Wikipedia).

 

 


 

The Wayback Machine is where you go on the Archive website to search addresses and pages. It’s your doorway to more than 622 billion web pages saved on the Archive. This part of the site attracts 550,000 users per day, and the crawlers add 750 million captured pages each day as they return to old sites and find new ones.

 

The Wayback Machine was created in 1996 and released to the public in 2001. At its most elemental level, it’s designed to let users travel “back in time” to visit websites as they appeared in the past. Its whimsical name comes from the television show The Adventures of Rocky and Bullwinkle and Friends, in which the characters Mr. Peabody and Sherman had a device called the Wayback Machine (spelled WABAC on the show) that enabled time travel and translations for the heroes.

 

When you go to the Wayback page, type a site’s address into the search field at the top to bring up the data saved for that site. The results open in the calendar view, where you can choose a time span and then, below, select the particular dates you’d like to see pages for. Tabbed across the top of the page are other ways to sort your search.
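If you would rather build a Wayback address yourself than click through the calendar, the short Python sketch below shows one way, based on the publicly visible address pattern: a 14-digit timestamp followed by the site address. The function names wayback_url and availability_query are made up for this example, and the code only assembles addresses; it does not contact the Archive. As far as I know, the Wayback Machine serves the capture closest to the requested timestamp, so an exact match isn't needed.

```python
from datetime import datetime
from urllib.parse import urlencode


def wayback_url(site, when):
    """Build a Wayback Machine address for `site` near the datetime `when`."""
    stamp = when.strftime("%Y%m%d%H%M%S")  # 14 digits: YYYYMMDDhhmmss
    return f"https://web.archive.org/web/{stamp}/{site}"


def availability_query(site, when):
    """Build a query address asking which capture is closest to `when`."""
    params = urlencode({"url": site, "timestamp": when.strftime("%Y%m%d")})
    return f"https://archive.org/wayback/available?{params}"


# Apple's home page on the date of its first capture mentioned in the article.
first_capture = datetime(1996, 10, 22)
print(wayback_url("apple.com", first_capture))
print(availability_query("apple.com", first_capture))
```

Pasting the first printed address into a browser should land on or near the 1996 Apple page. The second is meant for programs that want to check whether a capture exists before linking to it.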

 

Kahle and Bruce Gilliat, his cofounder at Alexa Internet, began archiving web pages on May 12, 1996, and the Archive was made public in 2001. In 1999, the Archive began expanding its collection to include other media, beginning with the addition of the Prelinger Archives, which Rick Prelinger established in 1982 to preserve what he called “ephemeral” films. Prelinger’s goal was “to collect, preserve, and facilitate access to films of historic significance that haven’t been collected elsewhere.” By 2001, the collection had 60,000 films of varying lengths.

 

The Archive continues to add other collections of physical and digital media, both online and at its San Francisco base. Today, the Internet Archive is more accurately an archive of archives and projects, and Kahle offers the reason for this: “Knowledge lives in lots of different forms over time,” he writes. “First it was in people’s memories, then it was in manuscripts, then printed books, then microfilm, CD-ROMs, now on the digital internet. Each one of these generations is very important.” And today, each is represented on the Archive, including an online lending library. You can see a listing of scores of the top collections on the Internet Archive index page.

 

A profile of current holdings casts quite an impressive shadow:

 

General: The 70+ petabytes of data (70,000,000 gigabytes) include 622 billion web pages and more than 65 million public media items (texts, movies, audio, images, etc.).

 

Books: There are 33 million texts in the collections (books, documents, magazines, etc.), including 4.6 million books that were digitized by the Internet Archive. This total grows by 3,500 books digitized per day in 18 digitization centers worldwide.

 

Movies: There are 7.3 million movie items, and that doesn’t include the television collection, which holds two million news programs available for closed caption search and 3,000 broadcast hours from the events of September 11, 2001.

 

Audio: There are 14 million audio items, including 220,000 live music concerts from 8,000 bands and 200,000 78rpm sides already digitized.

 

Images: The four million images are from a wide variety of contributed sources.

 

Software: This includes 775,000 software titles, and many can be emulated, allowing them to run in a web browser.
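A couple of the figures in the profile above are easy to check with a few lines of Python. The conversion assumes decimal units, where one petabyte equals one million gigabytes, which is what the article's parenthetical implies. The per-center number is only an average and assumes the 3,500 daily books are spread evenly across the 18 centers, which the article doesn't say.

```python
GB_PER_PB = 1_000_000  # decimal (SI) units: 1 PB = 1,000,000 GB

storage_pb = 70
storage_gb = storage_pb * GB_PER_PB
print(f"{storage_pb} PB = {storage_gb:,} GB")

books_per_day = 3_500
centers = 18
# Average only; assumes an even split across centers (not stated in the article).
avg_per_center = round(books_per_day / centers)
print(f"About {avg_per_center} books per center per day")
```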

 

Like most conventional libraries, the Archive has a help desk, and it’s worth a visit for advice on how to navigate the vast site. Start there, and your travels through the halls of the “library of everything” will be less confusing and more rewarding.

 



Michael Castelluccio has been the technology editor for Strategic Finance for 26 years. His SF TechNotes blog is in its 23rd year. You can contact Mike at mcastelluccio@imanet.org.

