A digital book collection is much more than a collection of digital versions of printed texts. It may, for example, be used as a corpus, a structured body of text that can be the subject of systematic, automatic analysis. The National Library of Norway has taken a first step in this direction by developing a so-called n-gram search based on its digital book collection. This search service application can make important contributions to new knowledge in many different fields of research.
Access to large digital book collections online changes the use of both texts and bibliographic data. The National Library of Norway assumes a leading role in this development – both nationally and, somewhat more surprisingly, internationally.
Ten years ago the National Library decided to digitize its entire book collection before the end of 2018. This immense investment in digitization would have had relatively limited significance if it had not been associated with a corresponding investment in making the result available. This has occurred thanks to an innovative use of the institution of extended collective licensing agreements.
First in the world
Under the so-called Bookshelf Agreement entered into with the copyright organization Kopinor, the National Library can make approximately 250,000 books published in Norway up until 2001 available online. Norway is thereby probably the first country in the world to make practically its whole national literary heritage available in digital form.
For legal reasons this availability is limited to Norwegian IP addresses, but can in principle be extended to other countries whose legislation allows the use of extended collective licensing agreements – for example the other Nordic countries.
Reference tools and research resources
Digital book collections provide many advantages for information searches and knowledge development. Firstly, they offer increased availability, but even more importantly, they offer radically improved search options for documentation and research. Free-text search makes it possible to find information that would be difficult or impossible to find in a physical book collection without prior knowledge that narrows down very precisely which material to review.
However, digital book collections are something more than and different from a collection of digital versions of printed texts. They encourage use in a way that deviates so radically from the possibilities of a physical book collection that they become something else – a new element of our cultural ontology. By this I mean that digital book collections can be used as a corpus: a coherent, structured body of data that can be made the subject of systematic, automatic analyses.
Quality in metadata
The quality of the digital texts and the metadata that accompany them partly dictates how advanced and precise these analyses can be. As Google Books has demonstrated, metadata is a deciding factor. In the National Library’s digital book collection all documents are accompanied by catalogue data – quality-assured, bibliographic metadata from the national bibliography and authority registry.
In addition there is metadata about the digitised book pages in the form of annotation data on, for example, word breaks, position information and OCR quality. The texts in the book collection can also be analysed (tagged and parsed) to clarify grammatical features and semantics.
With the launch of the NB n-gram search service, the National Library has taken a first but significant step in the further refinement of the digital book collection as a corpus.
What is an n-gram? It is a sequence of n consecutive elements taken from a text, where the elements can be letters, syllables or words; n-gram analysis is the systematic counting of such combinations across a corpus. An n-gram comprising one word is generally called a unigram, two words a bigram and three words a trigram. Larger combinations also exist, but the longer they are, the more they tend to be unique occurrences.
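To make the definition concrete, here is a minimal sketch in Python of how word n-grams can be extracted and counted. The sample sentence and the helper name `ngrams` are invented for illustration and simple whitespace tokenization is assumed; this is not how the National Library's service is implemented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text, split on whitespace (an assumption; real
# corpora need proper tokenization and handling of punctuation).
text = "the cat sat on the mat and the cat slept"
tokens = text.lower().split()

unigrams = Counter(ngrams(tokens, 1))
bigrams = Counter(ngrams(tokens, 2))
trigrams = Counter(ngrams(tokens, 3))

print(unigrams.most_common(2))
print(bigrams.most_common(1))
print(trigrams.most_common(1))
```

Even in this tiny sample the unigram «the» and the bigram «the cat» recur, while every trigram occurs only once, which illustrates why longer combinations tend towards unique occurrences.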
The n-gram service opens up the National Library’s digital collection for the users in a new way. Using bibliographic metadata to structure the corpus makes it possible to generate statistics that show the birth, life and possible death of a word (see Fig. 1), to investigate which words occur most often in conjunction with each other (see Fig. 2), and to compare (groups of) texts along different axes – for example time, genre, authorship, single works etc. (see Fig. 3).
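The first of these uses, following a word through time, can be sketched as follows. The example assumes that each text carries a publication year from the bibliographic metadata; the records, the function name `relative_frequency_by_year` and the whitespace tokenization are all invented for illustration and do not describe the service's actual data model.

```python
from collections import Counter

# Hypothetical toy records: (publication year, text). In a real corpus
# the year would come from the bibliographic metadata of each book.
documents = [
    (1965, "the grass grows from the root"),
    (1972, "a grassroots movement grows"),
    (1972, "the grassroots campaign and the grassroots vote"),
    (1980, "grassroots support for the movement"),
]


def relative_frequency_by_year(docs, word):
    """Share of all tokens published in each year that equal `word`."""
    hits = Counter()    # occurrences of the word per year
    totals = Counter()  # total number of tokens per year
    for year, text in docs:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    return {year: hits[year] / totals[year] for year in sorted(totals)}


trend = relative_frequency_by_year(documents, "grassroots")
for year, share in trend.items():
    print(year, round(share, 3))
```

Dividing by the total number of tokens per year, rather than reporting raw counts, keeps years with many published books from dominating the curve. Replacing the year with genre or author as the grouping key gives the comparisons along other axes.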
The results can then be processed further in different ways. The National Library’s n-gram search offers a set of services that accomplish this, while at the same time ensuring that all downloading of data observes the copyright restrictions that are binding on the institution.
The n-gram is obviously interesting from the point of view of linguistics, but language usage and changes in vocabulary over time also provide an insight into societal development and historical events that are relevant for many disciplines.
It also allows the pursuit of literary studies in new and creative ways. The n-gram is therefore increasingly used by researchers from the humanities and social sciences who work in the rapidly growing field variously termed digital humanities, text mining, culturomics and so forth.
Common to all these approaches is that they use quantitative methodologies to analyse material which has hitherto only been analysed in the form of close readings of a very limited and more or less canonized selection of texts.
Franco Moretti, professor of literature at Stanford University and the originator of the term ‘distant reading’, describes the possibilities offered by digital book collections as follows: “It’s like the invention of the telescope. All of a sudden, an enormous amount of matter becomes visible.”
Try the service yourself at www.nb.no/ngram/bokhylla
Fig. 1: The graph shows the fluctuation in use of the word «grassroot». The word underwent a metaphorical extension (e.g. «grassroots movement») in the early 1970s, resulting in a significant increase in use.
Fig. 2: This graph shows how the verb “eat” collocates with other words in syntactic relations of the verb-object type in a selection of Norwegian newspapers from the period 1999-2011.
Fig. 3: Words following «his» (right) and «her» (left) in a corpus of Norwegian 19th-century literature. The tight circles consist of words that are related to one of the possessive pronouns, while the words in the middle occur with both. The further from the cloud a word is situated, the less frequently it occurs with the two pronouns.