About the Million Book Project

What is the current status of the Million Book Project?

Use Internet Explorer to access the Million Book Project/Universal Library sites:

  1. Million Book Project [The Universal Library, China site], available
  2. Million Book Project [Digital Library of India], available
  3. Million Book Project [The Universal Library, U.S. site], available

As of June 2004 -

What Purpose Does the Million Book Project Serve? What Problems Does the Project Address?

Research reveals that students and faculty look online first when they need information because of the speed and convenience of online access. They prefer remote access to electronic resources rather than having to go to a physical library facility. Though faculty and graduate students often turn to a library web site or licensed electronic resources when they need information, undergraduate students tend to start with popular Internet search engines like Google because these search engines are more convenient and easier to use than library databases. Most students believe the information they find on the open Internet is good enough to use in their coursework. Unfortunately, only about 6% of the surface web content indexed by popular search engines is appropriate for student academic work. Faculty are concerned that the lack of quality resources on the surface web is having a negative impact on the quality of student learning.

Meanwhile, the increasing availability and use of online bibliographic databases, the increasing number of scholarly publications, and the increasing cost of library materials have created a situation wherein libraries are spending more money but acquiring fewer materials. Interlibrary loan is increasing, but the turn-around time is often inadequate for both the highly competitive research conducted by faculty and the shorter deadlines of students. Consequently, user satisfaction is decreasing. Research recently conducted by Carnegie Mellon University Libraries to better understand the graduate student experience exposed graduate students' frustration with the time it takes to get the materials they need for teaching and coursework because the Libraries' electronic resources are not easy to use. To save time and aggravation, they often turn to an Internet search engine first. Among the concerns they expressed was the difficulty of acquiring old journals and out-of-print books. Collection size, the turn-around time required for interlibrary loan, and the cost of document delivery constrain their selection of research topics, the quality of their work, and their grade point average. Lack of free and speedy access to quality resources has a negative impact on the timeliness and success of academic work. Research indicates that most students and faculty perceive a significant gap between their need for speed and convenience and the service their library is providing.

Beyond the boundaries of these problems, tremendous disparity exists across the nation and around the world in the size and accessibility of library collections. Some single institutions, like Harvard and Yale, have more books in their libraries than some entire states have in all of their libraries combined. In our rapidly changing world, lifelong learning and access to books have become essential to employment, health, peace, and prosperity. Greater public access to information is consistent with the goals of education and deliberative democracy. The expectation is that greater access to information will enhance respect for diversity and pluralism, alter the ways in which people work and deliberate together, and better equip people to understand and challenge the world around them. The Million Book Project will digitize a large body of published literature and offer it free-to-read on the surface web - providing students, faculty, and lifelong learners with rapid, convenient access to quality resources. Equitable, world-wide access to the Collection will contribute to the democratization of knowledge and empowerment of a global citizenry. An important byproduct of the Collection will be the existence of a test bed that stimulates and supports much-needed research in information storage and management, search engines, image processing, and machine translation.

For More Information

"How students search: Information seeking and electronic resource use" (EDNER [Formative Evaluation of the Distributed National Electronic Resource] Project, Issues Paper 8, 2002). Available:

S. Jones and M. Madden, "The Internet Goes to College: How Students Are Living in the Future with Today's Technology" (Pew Internet & American Life Project, September 15, 2002). Available:

S. Lawrence and L. Giles, "Accessibility and Distribution of Information on the Web." Nature 400 (1999): 107-109. Summary available:

LibQual+™ Spring 2002 Survey Results, Association of Research Libraries (Texas A&M University, 2002). LibQual+™ Spring 2003 Survey Results, Association of Research Libraries (Texas A&M University, 2003). See

D. B. Marcum and G. George, "Who Uses What? Report on a National Survey of Information Users in Colleges and Universities," D-Lib Magazine 9, 10 (October 2003). Available:

OCLC, "How Academic Librarians Can Influence Students' Web-Based Information Choices" (White Paper on the Information Habits of College Students, June 2002). Available:


What are the research issues in the Million Book Project?

What content will be included in the Million Book Project? What scanning is currently underway? What about copyright permissions?

Our initial thinking was to take a staged approach to collection development on a discipline-by-discipline basis. However, discussion with project partners and potential partners in November 2001 at a collection planning meeting funded by NSF resulted in the decision to focus on providing free-to-read access to multiple collections. Copyrighted works will be digitized upon receipt of permission from the copyright holder to include the works in the Million Book Project.

Our partners in India and China are currently digitizing local materials. Our Chinese partners are digitizing unusual and unique rare collections in Chinese libraries. Our Indian partners are digitizing government textbooks published in eleven of the eighteen official languages in India.

Who are the key U.S. participants in the Million Book Project?

Who are the other partners in the Million Book Project?

What university/scholarly presses are participating in the program?

The University of Texas Press, Brookings Institution, the American Meteorological Society, American Institute of Biological Sciences, and Rand McNally are among the presses that have given permission to digitize their out-of-print in-copyright books. National Academy Press has given us permission to digitize all of their books published prior to 1995. As of June 2004, we are in negotiation with many other presses, including Johns Hopkins, Duke, Penn State, and the Russell Sage Foundation.

How is the Million Book Project supported?

To date, two grants totalling $3.6 million have been received from the National Science Foundation to purchase equipment.

The Chinese Ministry of Education, Chinese Academy of Sciences, Indian Institute of Science, and Carnegie Mellon University Libraries and School of Computer Science are providing personnel and facilities, and participating in collaborative research. Carnegie Mellon University Libraries is training the scanning operators.

University of California, Merced, will be a mirror site for the Million Book Collection. They have also contributed funds and personnel for copyright permissions work.

Brewster Kahle (Internet Archive) is providing disk storage.

OCLC is providing project partners with metadata at no charge, will support a registry to track progress and avoid duplicate scanning, and might become a sustaining host of the final Million Book Project collection.

Additional grant proposals are planned to support seeking copyright permissions, further collection development, the management of project logistics, and shipping costs.

Can users of the Million Book Collection print or download the books?

The delivery systems for the Million Book Collection might restrict Print and Save functionality to one page at a time. netLibrary’s experience indicates that this is a sufficient deterrent to prevent users from printing or downloading entire books. This restricted functionality is required for copyrighted books in the Collection. (Note that copyrighted books are included in the Collection only with the permission of the copyright holder.)

To Print or Save a displayed page, move the mouse over the page image. A little toolbar will appear, with icons that enable users to Print, Save, and Email the page. Just click on the appropriate icon to Print, Save, or Email the page.


Will the Million Book Project preserve the fixed format of the initial publications?

Yes, the digitized works will preserve the fixed format of the initial publications.

What metadata is being captured about the digitized works?

MARC records and administrative metadata are being captured following existing standards. Dublin Core is being used for materials that have not previously been catalogued or where MARC is inappropriate, for example, for photographs and three-dimensional cultural artifacts.
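
As a rough illustration of what a simple Dublin Core description might look like for an item with no MARC record, the sketch below builds one with Python's standard library. The field values and the build_dc_record helper are hypothetical and are not taken from the MBP's own records; only the element names and the namespace come from the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set (version 1.1).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Return a <record> element with one dc:* child per (name, value) pair.

    This helper is hypothetical; it is not part of any MBP tooling.
    """
    record = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Made-up values describing an uncatalogued photograph.
example = build_dc_record([
    ("title", "Untitled street scene"),
    ("type", "StillImage"),
    ("format", "image/tiff"),
])
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

A real record would likely carry more elements (creator, date, identifier, rights), chosen according to whatever cataloguing guidelines the scanning center follows.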

Publishers might not give the MBP blanket permission to digitize and make available all of their out-of-print, in-copyright titles, but might entertain requests for permission to digitize specific titles. Is that possible?

The MBP approach is to request permission for a range of years, for example, everything published prior to 1990. A publisher could specify the cut-off year or, alternatively, specify the list of titles for which they grant non-exclusive permission to digitize in the MBP.
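
The two ways of granting permission described above (a cut-off year or an explicit title list) can be sketched as a small selection rule. This is only an illustration under stated assumptions: the permitted_titles function, the catalog structure, and the sample titles are all hypothetical and do not describe any actual MBP system.

```python
def permitted_titles(catalog, cutoff_year=None, listed_titles=None):
    """Select titles covered by a publisher's non-exclusive permission.

    catalog: list of (title, publication_year) tuples.
    cutoff_year: if given, titles published before this year are covered.
    listed_titles: if given, titles named explicitly are covered.
    """
    listed = set(listed_titles or [])
    selected = []
    for title, year in catalog:
        by_year = cutoff_year is not None and year < cutoff_year
        if by_year or title in listed:
            selected.append(title)
    return selected


# Hypothetical catalog; titles and years are made up for illustration.
catalog = [("Title A", 1975), ("Title B", 1989), ("Title C", 1996)]
by_cutoff = permitted_titles(catalog, cutoff_year=1990)
by_list = permitted_titles(catalog, listed_titles=["Title C"])
print(by_cutoff)
print(by_list)
```

Here "prior to 1990" is read as a publication year strictly less than 1990; a publisher could of course define the boundary differently.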

What value-added services will the MBP develop and what formula will be used to calculate publisher royalties? When might participating publishers begin to see income from the project?

The MBP is not developing a for-profit system. All of the content will be available free-to-read on the Internet. Participating publishers will get copies of the digitized books and metadata, and can themselves provide or enable others to provide value-added services to access the digital books. Permission granted to the MBP is NON-exclusive.

Reading the case study of the National Academy Press's experience putting their books online free-to-read could facilitate understanding and appreciation of the benefits of this approach. See: Barbara Kline Pope, "How to Succeed in Online Markets: National Academy Press: A Case Study," Journal of Electronic Publishing 4, 4 (May 1999). Available:


What kind of accuracy will the MBP achieve in scanning?

Carnegie Mellon has established a workflow (based on pilot, 100-book and 1000-book projects) that includes steps to ensure capture of high resolution images and essential metadata, post-processing to correct skewing and crop dark borders surrounding the page images, and OCRing to create searchable ASCII text with 98% accuracy.
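
The source does not say how the 98% figure is measured. One common way to express OCR accuracy is at the character level: one minus the edit distance between the OCR output and a hand-corrected transcription, divided by the length of the transcription. The sketch below shows that calculation as an assumption, not as the MBP's actual method; the function names and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_text):
    """Return 1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


# Made-up sample: the OCR output misreads one letter "l" as the digit "1".
truth = "The Million Book Project will digitize published literature."
ocr = "The Mi1lion Book Project will digitize published literature."
print(f"{character_accuracy(truth, ocr):.3f}")
```

Word-level accuracy, which matters more for search, would generally be lower than character-level accuracy on the same page, since a single wrong character spoils the whole word.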

Will the TIFFs meet the Print-On-Demand (POD) standards of Replica and Lightning Source?

The MBP follows the standards and best practices supported in "A Framework of Guidance for Building Good Digital Collections," developed by the Institute of Museum and Library Services in 2001 and endorsed by the Digital Library Federation in 2002. See:

More specifically, our guidelines for data production (excerpted from the MBP NSF proposal and based on pilot projects) are:

Once you've scanned a title, how soon will you return TIFFs to the publisher?

The timing depends on many factors, including how long it takes us to locate copies of the books for which permission was granted, how many books are involved, what's already in the queue of books waiting to be scanned, etc.

Who will determine the pricing of value-added components of the MBP?

The publishers or vendors who develop the value-added components will determine the pricing for the services they provide.


