Internet evangelist Cerf talks importance of digital preservation

When the word “evangelist” is thrown around, it is often first associated with religion and, more specifically, Christianity. Vinton “Vint” Cerf, the vice president and chief Internet evangelist of Google Inc., redefines the term.

Cerf, one of the Internet’s creators, made his Chautauqua Institution debut on Tuesday when he took the Amphitheater stage to speak on the morning platform. Keeping with the Week Six theme, “Vanishing,” he titled his lecture on digital preservation “Digital Vellum.”

The word vellum is derived from the Latin “vitulinum,” meaning “made from a calf.” It refers to a parchment made of calfskin. As opposed to ancient media like tablets and papyrus, information recorded on vellum was intended to be preserved, Cerf said.

As an influx of information — more than ever before — becomes available on the Internet, Cerf worries about how that information will be preserved for the future. The modern challenge is to archive scientific data, along with the published software, metadata, operating systems and hardware calibrations needed to interpret it, while simultaneously keeping pace with new computing innovations.

“There is such a wide range of information we need to hang on to in order to use it later, for example, to compare with new measurements and theories,” he said.

Cerf asked the audience to put themselves in the shoes of a 22nd-century Doris Kearns Goodwin, the famous biographer, historian and political commentator. He said that, for her book Team of Rivals: The Political Genius of Abraham Lincoln, she was able to reconstruct dialogue between Lincoln and his cabinet members through meticulous research of their physical correspondence — letters, library materials, etc.

But in 2100, that option won’t be as readily available due to the digitization of communication and the lack of appropriate measures for preservation.

Storage methods of the past, such as various floppy disks and VHS tapes, are now obsolete. Even CDs are going the same way as manufacturers produce computers without readers for them, such as Cerf’s own MacBook that he used for his presentation.

Add a wide proliferation of content platforms and Internet-enabled devices (including refrigerators and surfboards), and there is more information floating around the Internet highways than there are places to store it, Cerf said.

Things get more complicated when dealing with dynamic memory, such as games, as opposed to static memory, such as videos and pictures. Application-specific content needs to be preserved with the software that can open it. Cerf used an example of WordPerfect documents he has that can’t be opened by anything other than WordPerfect software.

Cerf said it is important to make old software such as WordPerfect viable over long periods of time by constructing future machines compatible with past programs.

There are a multitude of challenges for those engaged in digital preservation, he said. People in the field contend not only with engineering backward compatibility, but legal and business models unsuited for a developing marketplace, such as outdated bankruptcy laws, intellectual property rights and capacity for data storage.

“I think we need, in the intellectual-property space, the notion of preservation with digital materials,” Cerf said. “We need to have some sort of rights associated with those looking to achieve that in the same way we offer fair-use rights to things that are still under copyright.”

With the exception of the 1998 Digital Millennium Copyright Act, copyright law has not been updated since 1976.

While the bounds of current storage space on hard drives and thumb drives have yet to be exhausted, the not-so-distant future will open up molecular storage, Cerf said. He even theorized that we might, one day, be able to store large amounts of data in DNA. In addition, he described the need for a “Rosetta Stone” that could act as a translation tool for the litany of diverse programs, past and future.

Cerf cited the Open Library of Images for Virtual Execution (Olive) Project as the prime example of innovation in the field of digital preservation. Mahadev “Satya” Satyanarayanan, a renowned experimental computer scientist and professor at Carnegie Mellon University, runs the National Science Foundation-funded preservation project.

Satya has been working to create a “digital X-ray” to achieve execution fidelity, Cerf said. He used his MacBook as an example.

“The image you could have is of this laptop running an Apple OS and running an application, in this case PowerPoint,” he said. “With a metaphorical digital X-ray machine, you’d be getting an image of the hardware, the OS and the software all loaded. If you could store that image and pull it back, you could do what we’re doing at this exact minute.”

Quickly, the issue transitions from creating mirrors and emulations for a multitude of programs to an issue of scale. How does one carry around the massive amount of memory required, Cerf asked. Satya’s answer was the digital cloud.

From the cloud, a variety of emulations for numerous systems and software can be pulled into different interfaces as needed. Cerf said this will work for simpler point-and-click applications such as Microsoft Word or PowerPoint, but for dynamic systems, the technology has not yet caught up.

Beyond the scaling issue, digital preservation efforts face even more obstacles, Cerf said. Roles like those of digital librarians and archivists either don’t exist, or face challenges of change. The same can be said of legal frameworks.

The Olive Project is not the only effort to archive the Internet. Brewster Kahle, computer engineer, Internet entrepreneur and digital librarian, founded the Internet Archive, a San Francisco-based digital library with the stated mission of “universal access to all knowledge.” In case of a West Coast earthquake, Kahle has backed up much of his library at the Library of Alexandria in Egypt, Cerf said.

Other projects Cerf mentioned include the Computer History Museum, Google’s book scans and the Google Cultural Institute.

“I want to emphasize that it is technically challenging to make this work,” he said. “And the legal framework doesn’t exist yet. If we’re going to preserve our long-form digital heritage for hundreds and thousands of years, we need to start working on this now and assure we have the framework to achieve it.”


Q: You touched on this at the very end, making a distinction between dynamic and static content. Can you speak more to how the notion of an artifact is changing?

A: Artifact, historically, has meant a physical thing. We are an artifact-creating species. We have invented bows and arrows and knives and everything else — lots of other things more complex like the Large Hadron Collider. But because of computers, we are also creating what I will call logical artifacts. These are things that don’t have an existence as much in this physical world as they do in a watchable world we may call cyberspace. The notion of artifact now has taken on this broader meaning. Preservation of this artifact implies some of the things I have been talking about — retention of software. This has expanded the scope of how we’re thinking about artifact preservation and the methods by which we can do it.

Q: You’ve talked a lot about the machines that are needed to read these digital artifacts, but there’s also this well-established generational gap when it comes to how we access, create and engage with digital content. Who decides what’s worth saving?

A: This is a very interesting question. I was sitting in a room smaller than this one with about 75 librarians talking about this because they are, in fact, the prototypical archivists. At one point, a young man got up and said, ‘Dr. Cerf, this is not a problem. The important stuff will be copied from one medium to another, or it will be translated into new formats that are modern, and the stuff that isn’t important will go away, and no one will care.’ It took half an hour to get the librarians off the ceiling. The reason is simple; they pointed out that, sometimes, you don’t know what’s important for a couple of hundred years. That letter, that message, that tweet is what triggered some important thing. To save as much as possible may turn out to be in the best interest because we don’t know ahead of time what’s important. I want to emphasize that if we can’t save everything — and probably we cannot — I don’t want to lose the abilities that I think are worth saving. I don’t want to lose the capacity to preserve information and make it accessible. I want it to be available to everybody if you choose to exercise that preservation method — that right, if you like. That’s where the young man and I parted ways.

Q: Can you address the vanishing of certain industries as a result of the new share economy? For example, Uber versus the traditional taxi, or Airbnb and the hotel industry.

A: This is interesting. This is sometimes called the sharing economy. It enables business models that wouldn’t work before. Let me use one that many of you may be familiar with: eBay. Before we had this online auction environment, you had to go somewhere to participate in an auction. Yes, you could be on the phone and somebody could be substituted, but somebody had to be there in order to act on your behalf. That meant the only auctions that made any sense were the ones where you could amass enough people to generate real competition for a given lot. In the eBay world, you can sell in a virtual auction to people from all over the world who have access to the Internet. This is a new business model where auctions were created that couldn’t have existed before in the absence of this online environment. There are going to be a lot of examples like that. There are lots of debates. There are questions about consumer safety, there are questions about taxation and about what rights the city or the county or the state or the country has to intervene in business models. We haven’t solved all these problems yet because many of these models are so new. One of the nice things about the Internet is that, until now, you haven’t had to get permission to try out a new idea on the network. This is what’s called permissionless innovation. This is a really important attribute of the Internet, and it’s worth preserving. It’s not always clear which business models are going to work and which ones are not. When Larry and Sergey started Google, they didn’t have a business model. All they wanted to do was download the entire Internet and index it. In fact, as I recall, when they were graduate students, they didn’t have all the computers they needed to try and index it, so they borrowed some. I don’t know that they had permission in all cases to borrow them. It’s just an example of permissionless innovation. It’s a business model that has been very successful.
Of course, it turned out that the profit was advertising associated with searching the World Wide Web. But that was not predictable. This idea that you could try things out on the Internet and use it to create new businesses is really important, from my point of view. That’s why we have things like Uber and Airbnb, and I’m sure others will come as well. We do have to be concerned about people’s safety, about their privacy and about their security. That’s what governments are often concerned about for their citizens.

Q: We hear so much about efforts to fight cyber terrorism. What about cyber storage terrorism?

A: A good example of this was the attack against Sony. Frequently, attacks against parts of the Internet have targeted webpages; for example, defacing a web page or vectoring people off to a place they didn’t think they were going. In this particular instance, data was actually destroyed as well as exposed, and that’s a bad thing. This is not something that any of us would like to contemplate, especially if we are storing a lot of our own personal information out on the Web or in the data centers of the cloud. The technologists of the world, including me, have an obligation to everyone who uses these systems to do everything we can to create a safer, more secure and adequately private environment. We have a lot of work to do there, but it’s not an impossible task. A great deal of work applying cryptography has been undertaken over the course of the last four or five years in particular. Google, just to give you a concrete example, now encrypts all the traffic going from your laptop or desktop to our servers. When we put data in our servers, we encrypt it while it’s at rest, so that if somebody breaks in, they don’t necessarily have the ability to decrypt it. When we move data between our data centers, which we do a lot in order to replicate the data and ensure that nothing is ever lost, we encrypt that as well. We are urging people to use two-factor authentication. This involves having a piece of equipment in addition to a username and password. Lots of people just have usernames and passwords, and, I’m sorry to say, a lot of people don’t pick their passwords very well. A lot of people pick the word ‘password’ as their password because it’s easy to remember. But the bad guys know that, too. At Google, not only do we have usernames and passwords, but we have a physical piece of equipment that you plug into the computer, which generates a cryptographic one-time password that never gets reused.
If you don’t have a piece of hardware to do that, you can put the same kind of cryptographic code in your mobile and ask your mobile to generate that password for you. There are things that you and I can do to protect our privacy and our safety. We have to be careful about clicking on things when we aren’t sure where they go. You’ll find that Google’s Chrome browser now is starting to warn people to be careful about a message that looks like phishing, which means that if you click on the attachment or a hyperlink, you may be going to a place you don’t want to go. You need to be a little suspicious. In fact, I tried to get the Google marketing team to change the name of our Android operating system to Paranoid because this operating system is trying to protect you.
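Cerf’s description of a device that generates a cryptographic one-time password matches a widely used public standard, the time-based one-time password (TOTP) algorithm. What follows is a minimal sketch, assuming RFC 6238 with a hypothetical shared secret; it is illustrative, not Google’s actual implementation.

```python
import hashlib
import hmac
import struct
import time

def totp(secret: bytes, at: float, step: int = 30, digits: int = 6) -> str:
    """Derive a short-lived numeric code from a shared secret and the clock."""
    counter = int(at // step)                          # 30-second time window
    msg = struct.pack(">Q", counter)                   # 8-byte big-endian counter
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                         # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# The device (or phone app) and the server both hold the secret; each derives
# the same code independently, and the code expires with the time window.
secret = b"hypothetical-shared-secret"
now = time.time()
code = totp(secret, now)
print(code)  # a 6-digit code that changes every 30 seconds
```

Because each code depends on the current time window, a stolen code goes stale within a minute or so, which is the property Cerf is pointing at: a password that never gets reused.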

Q: If the Internet is a distributed system, why can’t the solution be distributed as well? You seem to be advocating for a centralized solution.

A: That’s a very odd statement. I am absolutely not advocating for a centralized solution at all. What I’m saying is that a certain technique has to be developed that allows you to run a virtual machine or run an emulation. But that can be done anywhere. You could do it on your own laptop, you could do it on the cloud. This is absolutely not a centralized system. What it is saying is that you need a technology to get the effect that you want.

Q: Are there any particular formats that are more likely to remain robust?

A: Yes, there are. One of them is called ASCII, which is the American Standard Code for Information Interchange. It’s this utterly trivial 7-bit code that represents, essentially, the letters of the alphabet, numbers and a few symbols. To be honest with you, for many, many years, all the documentation about how to design and build the Internet and its protocols has been in simple documents using that ASCII coding. That’s been usable for many, many years. As you start moving into elaborate formats, like PDF from Adobe, it gets more complicated. This is not a gratuitous dig at Adobe at all. Making sure that those can still be interpreted a thousand years from now requires the kind of thing I was talking about before. The ASCII format is simple enough that you can imagine retaining the ability to interpret files that are pure ASCII files. Some of you will remember an editing program called WordPerfect. I have a lot of files that were done in WordPerfect where you can still display the file as an ASCII file — as a text file. But to get it to format properly, you need the WordPerfect software to interpret the little commands that were inside of the text. That doesn’t work anymore because WordPerfect isn’t around, unless, of course, you kept the old software. The answer is that some formats — the simpler ones — will probably last for a long time. The more complex ones take extra work.
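Cerf’s point about ASCII’s durability can be seen in a few lines: decoding needs nothing but the fixed 7-bit table, while a proprietary file’s text may survive even though its embedded formatting commands become unreadable. A small illustration; the WordPerfect-like control bytes below are invented for the example.

```python
# Plain ASCII ages well because the encoding is a fixed 7-bit table:
# any future system can decode it with no vendor software.
text = "RFC 791: Internet Protocol, September 1981"

raw = text.encode("ascii")            # one byte per character
assert all(b < 128 for b in raw)      # every code point fits in 7 bits

recovered = raw.decode("ascii")       # needs only the same fixed table
assert recovered == text

# Contrast: a proprietary format mixes readable text with opaque control
# codes that only the original application can interpret.
wordperfect_like = b"\xc3\x01Hello\xc4\x02 world"  # invented control bytes
visible = bytes(b for b in wordperfect_like if 32 <= b < 127).decode("ascii")
print(visible)  # prints "Hello world": the text survives, the layout is lost
```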

Q: This person asks about content you don’t want online. They use the example of revenge porn, but it could be content intended to shame somebody else. Will it ever be possible to permanently remove content or “govern” the Internet?

A: I think my answer is no. Let me give you an example from a few years ago. Amazon makes a thing called a Kindle, and they make different versions of it. Some years ago, after having distributed an electronic copy of a book, they discovered that they did not have the right to distribute that electronic copy. It happens that they had the ability to reach out to the Kindles: they can push content to the Kindles, but they could also remove it. So they reached out and removed the copy that they didn’t have the right to distribute. Ironically, that book was 1984. That being said, they would never do that again. The effect would be the equivalent of going to a house, breaking in and removing copies of 1984 from your personal library. There are books and papers that have stuff in them that some people wish could go away, but they don’t necessarily have the right to take it away from you. I think the same is going to be true for online content. Content that is online can typically be copied, and it could end up on somebody’s disk drive or on somebody’s memory stick. Once it’s out there, it’s really hard to get rid of it all. You need to get over the assumption that you can. The right to be forgotten — same thing. Don’t assume that once you put it out on the net, it will be removable.

Q: How does encryption of content affect the ability to access old content?

A: This could be a significant challenge. At the same time, you can imagine going to the trouble to encrypt a whole bunch of stuff you’re trying to protect, and then forgetting the password or forgetting the cryptokey. That could turn out to be very bad. There are people asking questions about what happens to all of your digital online content after you die. There are even companies starting up that are trying to figure out ways to let you say what should happen to access to your online trail. Some people would like it to go away, some people would like legitimate parties to have access to it. The honest answer here is, if you’re going to encrypt things, think really carefully about how you will preserve the keys that will be needed so that the legitimate parties can get access to it. Otherwise, you may have in fact locked yourself out.
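Cerf’s advice to think carefully about preserving the keys is, in practice, a key-escrow problem. One toy approach is to split the key into shares so that no single holder can read the data, but the legitimate parties together can reconstruct it. The sketch below assumes a simple two-share XOR split; real systems would use Shamir secret sharing or a proper escrow service.

```python
import secrets

def split_key(key: bytes) -> tuple[bytes, bytes]:
    """Split a key into two shares; either share alone reveals nothing."""
    share_a = secrets.token_bytes(len(key))                # random one-time pad
    share_b = bytes(k ^ a for k, a in zip(key, share_a))   # key XOR pad
    return share_a, share_b

def recover_key(share_a: bytes, share_b: bytes) -> bytes:
    """XOR the two shares back together to recover the original key."""
    return bytes(a ^ b for a, b in zip(share_a, share_b))

key = secrets.token_bytes(32)     # e.g. the key protecting an encrypted archive
a, b = split_key(key)             # hand one share to each trusted party
assert recover_key(a, b) == key   # together, they can unlock the archive
```

The design point is exactly the one in Cerf’s answer: access for legitimate parties is planned in advance, instead of depending on one person remembering one secret.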

Q: As we think of artifacts — past, present and future — can scanning projects help us mine and apply potentially great ideas?

A: The answer is that historical data can be extremely valuable, not only to understand a story behind technology or sociology or history, but also to validate or invalidate new theories. This is why some of the historic scientific information is so important, especially as we invent new theories to explain phenomena. We can go back and figure out whether or not the earlier data fit the new theory. Being able to see the notebooks, or other records, of researchers or engineers or businesspeople can be really illuminating. The one thing I want to emphasize is that the idea of the notebook as a static object is going to have to be expanded pretty dramatically. The tools we use now are not simply writing. The tools we have now are based on software artifacts. We are going to have to preserve the software that helped us keep our lab notebooks in order to be able to view them in the future. We should not have a backward-looking model that is a physical lab notebook when the forward-looking reality is that information is accumulated in a software-artifact setting.

Q: If the U.S. gives up control of the Internet domain name organization, should we be concerned about foreign governments restricting what we have access to on the Internet?

A: There’s this misunderstanding that the U.S. controls the Internet. There was a time when it did — that’s because I did. I was the program manager at the Defense Department. I was figuratively writing the checks to the researchers who participated. Bob Kahn and I made decisions on our own nickel, so to speak, and we made the decisions on behalf of the Internet for quite a long time. When it was turned on in 1983, others started to invest in building pieces of the Internet, so the control of the Internet expanded to the parties who owned and operated pieces of it. The U.S. government has a rather limited connection with Internet governance today. Specifically, the National Telecommunications and Information Administration, which is part of the Department of Commerce, holds a contract with the Internet Corporation for Assigned Names and Numbers, which it has held since 1998. That contract basically says to ICANN: please administer the Internet address space. But the Internet address space and the policies for it are managed by organizations called regional Internet registries. I’m the chairman of the board of one of them, called the American Registry for Internet Numbers. We manage the Internet address space for the U.S., Canada and some Caribbean countries. It is our constituency that formulates policy for Internet address allocation. It is regional. This is a much more distributed responsibility. In the case of domain names, it’s even more complicated because there are many constituencies within ICANN that help to develop policy. Policy is developed by the parties who will be affected by it, which includes the technical community, the private sector and civil society, and all those constituencies together develop policy. Then it has to be executed and enforced, which falls on the smaller organizations. When it comes to laws, enforcement is typically a government responsibility. When it comes to business policy, that might be the responsibility of ICANN.
This is a much more diffuse environment than it was in the days in the 1970s when Bob Kahn and I ran the Internet as a research project. There’s nothing we can do about a government that decides to interfere with the Internet in its own territory. There is this notion of autonomy and ability to operate within your territory. We can’t stop the Chinese from interfering with Chinese access to the Internet. On the other hand, I don’t think that translates into those governments taking control of the Internet because the control is completely distributed across the parties that are actually operating pieces of the Internet.

Q: How will the next generation of digital content be interconnected over the course of time?

A: There are some very interesting technologies coming along — new digital ways of capturing and presenting information. I’ll give you a couple of examples. At Google, there is an organization called ATP, which stands for Advanced Technology Program or something like that. It’s run by the former director of DARPA. This particular thing is called 360-degree film. It involves taking a camera that is capable of capturing everything, 360 degrees, all at once. They have multiple cameras viewing the scene from all angles. There is an audio system they have adopted that allows the sound to be presented to you as if it’s coming from behind or in front or over on the side — way beyond left or right stereo. They have made this technology available, and you can see these 360-degree films on YouTube. How do you view a movie that is taking shape in space? You have to look around to see it all. They render the movie with the mobile. If you move your mobile around, the mobile’s gyroscope senses that you have changed position and displays a different part of the 360-degree movie. You have to interact with the space in which the movie was taken in order to see what’s going on. This is a completely new medium. Just to give you a sense of how hard this is — it’s not just the technology. Think about the storyboard for this. A movie has a linear storyboard: the camera sees this, and then this, and then that. This thing has multiple things going on at the same time. They had to invent a whole new storyboarding technique and a whole new editing system in order to correctly say what’s going on where. It’s astonishing. There are other things that are slowly making their way out — holographics, for example. The dynamic ability to present what looks like a three-dimensional object — we’ll get there eventually. There’s all of that going on. Let me just stop there because there are so many possibilities.
Software is an endless frontier — you’re limited only by your imagination and your ability to code. Lots more is coming. I don’t know what it is.