The Visibility of Open Access Monographs in a European Context: Full Report

KU Research in cooperation with OPERAS has launched a report on the visibility of Open Access monographs in a European context. The full report is available at: doi: 10.5281/zenodo.1230342

Authors: Lucy Montgomery, Cameron Neylon, Alkim Ozaygen, Frances Pinter, Neil Saunders

Executive Summary

What is it about?

This report explores the extent to which Open Access specialist scholarly books can be seen by the communities that might make use of them. It also identifies the key challenges that will need to be tackled in order to ensure that Open Access books are fully integrated into digital landscapes of scholarship, as well as the steps that need to be taken to achieve this goal. The report focuses on Open Access books made available by publishers and platforms that are part of the OPERAS research infrastructure.

Why is it important?

Specialist scholarly books are the core research output of the Humanities and Social Sciences. Ensuring that they are integrated into digital landscapes of scholarship will play a decisive role in the future of these disciplines, and their impact on the world. Identifying gaps in existing infrastructure and creating a roadmap to address them is vital groundwork.

Perspective

This report forms part of the OPERAS-D project, which focuses on the development of a European e-infrastructure for open access publications in the Humanities and Social Sciences. Knowledge Unlatched Research is a core partner in the OPERAS-D project. KU Research is an independent research and analysis group focusing on strategy and analytics that support the ecosystem of scholarly monographs.

Objectives

This task addresses the challenges associated with tracking the use and impact of Open Access monographs across open global digital networks.
The task is broken into three parts:

Mapping the digital visibility of OA monographs made available by the OPERAS network;
Flagging technical challenges specific to the collection of metrics on usage and impact for OA monographs;
Identifying opportunities for the more effective integration of information relating to the use of OA monographs into metrics and altmetrics ecosystems.

Background

OPERAS is a distributed Research Infrastructure (RI) project for open scholarly communication. Its main goal is “to introduce the principle of Open Science and ensure effective dissemination and global access to research results in the Social Sciences and Humanities (SSH)”. The network includes a wide range of mainly European Open Access publishers and research institutions, and is in the process of engaging with a wider international network of potential partners.

The OPERAS Network includes a diversity of participants with differing interests, ranging from traditional publishers with a growing portfolio of Open Access content, through to OA-only presses. It includes publishers as well as platforms, technology providers and research institutions. The diversity in OPERAS network participants makes available a range of different financial models, priorities, and technical concerns. The network also continues to grow over time, increasing in both numbers and types of stakeholder organisation. In particular, 2017 brought the Latin American SciELO platform to OPERAS as an international partner alongside nine other new partners based in Europe.

OPERAS works in a range of areas. Through its seven working groups and two main H2020 projects, it aims to provide technical and social infrastructures that support Open Access publishing and optimise the use of scholarly content, with a focus on Social Sciences and Humanities (SSH). While the network is not exclusively focussed on scholarly books, its focus on SSH means a greater emphasis on questions that relate to books than in many more Science, Technology, Engineering and Medicine (STEM) focussed projects and efforts.

The challenge of tracking scholarly books

While the modes and advantages of Open Access for journal articles are now broadly accepted, at least in STEM subjects, the funding models, technology, and most importantly, the advantages for Open Access books pose more of a challenge. Issues that are specific to SSH often combine with issues that are peculiar to book publishing and dissemination. In broad terms there are three areas where books pose a particular challenge compared to journal articles:

Digital books are not necessarily made available through a publisher controlled website and may be made available through multiple online platforms.
The technical infrastructure for cataloguing, indexing and discovering digital and online books is more recent than that for journal articles and is less consistent and reliable as a result. Dependence on intermediaries for the distribution of digital books means that monograph publishers and platforms also have less direct experience with these systems than tends to be the case for journal articles.
Traditionally, book publishers have focussed on the sale of print copies to intermediaries and have had less direct interactions with readers. Existing performance indicators are largely driven by measures of physical distribution. Print remains an important, and often parallel, part of book publishing.

When we consider Open Access books specifically, this raises a number of issues. Firstly, many of the platforms that exist for distributing books and bibliographic metadata were built with licensed content in mind. This leads to a range of assumptions about tracking of users, their institutions, and their usage that are not applicable to freely accessible Open Access books.

In comparison to journal articles, which made a transition to digital formats much earlier than has been the case for books, the challenges associated with making a shift towards open access are occurring in the context of an incomplete transition to digital distribution and funding models for HSS books. The diversity of HSS monograph publishers, which include many small publishers as well as library-based and independent presses, adds an additional layer of complexity to the process of integrating OA digital books into digital landscapes of discoverability and use. Firstly, publishers often do not host their own digital books on sites under their control but leave this to other platforms. Open Access platforms (such as OAPEN and OpenEdition Books) have developed in parallel with traditionally licensed platforms (such as JSTOR). Established platforms for traditionally licensed content, including JSTOR and Ingenta, have also begun to create programs and infrastructure to support Open Access content. Some publishers have begun consciously making the same content available via a variety of distribution sites in order to maximise the visibility and use of digital monographs. ((Examples of publishers making Open Access books available via several platforms include the four presses discussed in the study Exploring the Uses of Open Access Books via the JSTOR platform, available at: http://kuresearch.org/PDF/jstor_report.pdf))

The availability of services intended to help publishers to ensure that Open Access books are optimally integrated into pathways of discovery and use is increasing.

As platforms hosting open access books are maturing and systems for integrating OA content into digital landscapes become part of scholarly workflows, a second issue has emerged. An illustrative example of this is the challenge of applying the Crossref Digital Object Identifier (DOI) infrastructure, developed largely for journal articles, to books. DOIs serve two functions. They are both unique and persistent identifiers for scholarly works, and a referral mechanism by which a user may follow a link to arrive at a specific scholarly work. DOIs work well when applied to a single version of record of a journal article that can be found on a website under publisher control, particularly when the demand for and use of print copies has been largely replaced by online discovery. DOIs are more problematic for books that might be found on multiple sites in digital form, where the repository is not under the control of the publisher. ((It is worth noting that such multiple-location problems are increasing for journal articles with the increasing frequency of self archiving and preprint repositories. Solving this problem well for books may be of value in turn for the journal community. Crossref is currently piloting an approach for supporting multiple DOIs for books with the intent of offering coordinated lookup.)) Challenges of ensuring that correct redirection addresses are maintained in the absence of commercial incentives to ensure that OA content is easy to locate create additional resourcing challenges, particularly for the many smaller publishers operating in the OA monograph space.

The tangle of technical issues involved in identifying and discovering books, combined with a relative lack of investment by platforms in tracking the usage of and conversations around book content, leads to a reinforcement of a third challenge. Many publishers and presses remain focussed on traditional metrics and KPIs for monograph publishing. These are not focussed on the usage of books but on distribution through intermediaries – traditionally measured in terms of sales (which also assumes that all publishers make the same effort to sell their books equally). This in turn means a limited demand from presses for detailed information about the use of books, as well as limited capacity to influence the metrics and reporting services provided by platforms.

The importance of understanding digital visibility for Open Access books

With the shift towards Open Access, the question of visibility is crucial. It is perhaps a little harsh to describe traditional metrics as counting copies in warehouses. Nonetheless, even as a straw-person argument it illustrates the point that distribution based measures are simply not helpful for tracking the impact of freely accessible books with online distribution. This is particularly the case given the significantly greater per item investment for books compared to journal articles. Demonstrating the potential value of investing in Open Access, and identifying where that value is realised and the return on investment is greatest is critical to supporting the transition to a future where Open Access is the default for scholarly books. Another important aspect for books is the degree to which they will be accessible to entirely new, and perhaps unexpected, audiences. Scholarly books, much more so than journal articles, have potentially much wider audiences than they currently reach, particularly given the price of many scholarly monographs.

The question of visibility is therefore a complex one. It is clear that there is a need to track scholarly use, including citations and downloads within institutions, as well as the potential to track use and interest by wider publics. We can track the communities that discuss books and ask about how they discover and interact with these texts both online and in print. We can expect books to influence and impact society in ways that are very difficult to track and may not involve a visible trace of usage that we can measure.

The promise for Open Access scholarly books is immense, but the risks and the potential need for investment are also large. If we are to have an evidence-led conversation on strategies for investment, then we need to track the visibility, discoverability, and ultimately the use and impact of scholarly books. In turn, this evidence base will help to change the culture of publishing in HSS, leading perhaps to a greater concern with how an author and the support services in a press can help to shape a work so as to maximise its potential for use and impact.

Survey of OPERAS Partners

As part of the visibility project we surveyed OPERAS partners in order to understand how they engage with usage and other data relating to the titles that they publish or host. In particular we were interested in how partners saw the value of such data and how they were interacting with it. We had 18 responses to the questionnaire contributed by presses, platforms, and data and technology providers. The survey was not intended to be quantitative or representative but to provide a view into the thinking and needs of partners. We therefore do not report quantitative results but a qualitative interpretation and categorisation of the responses. The questionnaire rubric is available in Appendix X.

Findings

Partners are particular about how they describe themselves. While a range of options were presented from which survey participants could choose (publisher, platform etc) many participants chose ‘other’ to provide a free text answer. Sometimes this was to provide greater specificity (e.g. “a library running a press”) and sometimes to step outside the categories provided. This was particularly the case for contributors who were involved in funding OA books and other technical platforms.

This echoes the diversity of participants in the OPERAS network. It also suggests a heterogeneity in the ecosystem which we believe to be an important and distinguishing characteristic of book publishing and of scholarly publishing in SSH more generally.

OPERAS partners that are book publishers or book platforms are collecting a range of data. Every respondent who indicated that they were either a publisher or a platform, or both, stated that they (or their partners) were collecting usage data in some form. This ranged from simply collecting web analytics through a tool like Google Analytics or Piwik through to more sophisticated data collection and management pipelines.

Respondents generally showed a good awareness of the technical systems that were involved in collecting data, describing specific tools and systems, as well as standards, principally COUNTER. Named web analytics were fairly evenly split between Google Analytics, which provides a centralised and easily managed means of tracking web usage, and Piwik, an open source tool that provides many of the same data collection functions but runs locally, meaning data is not transmitted to Google.

Respondents also showed an awareness of specific limitations in their systems, in several cases describing difficulties in obtaining data specifically on subsets of their collection. Distinctions were made between views and downloads in several cases, although there was limited evidence of that distinction being used in analysis. The two largest hosting platforms OAPEN and OpenEdition Books were the only two to specifically mention the COUNTER standard, with OAPEN passing data to IRUS-UK to generate COUNTER download counts.

The use, processing, and quality assurance of data is patchy. While the awareness of usage data was good, there were substantial differences in the way that data were being used, or indeed not being used. This was connected to differences in the sophistication of data processing and the existence of documented or automated processes. Several publishers and platforms used manual or ad hoc processes to collect data and in several cases there was an indication that data was being collected but not necessarily used. While the wording of the question focused on ‘processing’ (‘Do you have a process for gathering and managing usage data relating to your OA books?’) we had hoped to elicit commentary on data management and quality assurance. However, while issues of data quality were implicit in some answers (“Download data is sent to IRUS-UK who create COUNTER compliant data”, “PHP scripts calculate and produce COUNTER metrics…to COUNTER V4…V5 will be implemented [in]…2018”), quality assurance processes, such as data validation, cross-checking procedures, or re-use of data in internal systems, were not specifically mentioned.

The general lack of concern with quality assurance was consistent with the variety of uses that data was put to. In some cases the use of the data was explicitly limited (e.g. “The books we publish are selected on the basis of scholarly merit”, “Decisions are now based on print circulation, or number of e-books sold through commercial platforms”) to subsidiary and management issues. Others explicitly noted that usage was a key indicator of performance and important for reporting to stakeholders. This was particularly where a case was being made for Open Access, either to authors or to other stakeholders. Several respondents reported being unsure what it could be used for but nonetheless had a sense that it was, or would become, important, with plans for future work in development.

A desire for standards and consistency is in tension with a need for flexibility and contextualisation. Several respondents raised the issue of gathering and integrating data from multiple platforms as a challenge. Of these a number expressed a desire for simplified and standardised tools that could achieve this. At the same time respondents were concerned about the advisability of combining data from multiple sources, their capacity for analysis of such complex data, and the uses and misuses it might be put to.

Analyzing usage data is difficult and can easily lead to wrong assumptions about the impact of a OA book. In our case this could be detrimental to our [authors institutions], which tend to compare their “success” to [other institutions]. This means that we clearly need to understand what the usage data is telling us before we have any use for it.

A number of respondents expressed a desire for a “dashboard” or other visualisations that could bring multiple data sources together. The consequent need for data integration and standardisation to achieve this was mentioned in one or two responses but awareness of the challenges of comparison across sources appeared to be limited. There was some evidence of a conflation of visualisation with data integration.

Respondents are small organisations with limited capacity. There is a desire for coordination and shared services, infrastructures, and standards. A common thread in the responses was that the publishers and platforms who are engaged in Open Access scholarly book publishing are relatively small. This is both a challenge and an opportunity. They have limited capacity to develop internal processes and systems and are looking for shared services and platforms to assist in developing usage data capabilities.

It would be of great help if we could have a main service from where we could manage all the information related to statistical usage data.
[To engage more effective with usage data we would like a]…consortium agreement with Google on how to gather and access usage data.
We would like to see an usage aggregation service that consolidates usage data from different hosting partners into one standardised report in an automated way. In turn, this should translate into an usage dashboard that can be embedded into platforms and allows customers to use different filters to analyse usage by publisher, region, etc.
[one of our biggest challenges is]…optimizing workflow, how to do more work with small resources.

What emerges overall is a picture in which platforms and publishers are implementing tools and approaches locally and using what they are provided with to some degree. There is generally a good technical awareness of the tools being deployed, but less apparent awareness of data curation and quality assurance issues.
Many of the challenges arise from issues of data integration and standardisation. Small, and even medium-sized, players have limited capacity to engage with detailed standards or technical development. Equally there are limitations on what capacity a small organisation can provide to investigate the meaning and context of the data being generated. The majority of data use seemed to be in promotion or advocacy rather than strategic decision making. Concerns were raised about the misuse of usage data or a lack of understanding of its limitations by downstream users.

Mapping the digital visibility of OA monographs made available by the OPERAS network

The idea of ‘visibility’ is not one that has been theorised in detail in existing library literature. Studies tend to focus on issues of information retrieval, addressing precision and recall for a specific information seeking task. ((The information retrieval literature focuses naturally on questions of precision and recall with visibility used as a non-technical term in many cases. Criticisms of web-based indicators often focus on the idea that they measure “mere visibility” without strictly defining it. Models that link discovery to usage with a sophisticated application of proxies are rare although see Haustein, Bowman and Costas (2016) in Theories of Informetrics and Scholarly Communication, Sugimoto (ed), De Gruyter, Berlin, and essays by Wouters and Cronin in the same volume.)) ‘Visibility’ as a concept also at least suggests a concern with serendipitous discovery or non-directed information seeking. In our case we are also concerned specifically with open access books, so ‘visibility’ presumably includes the clarity of information about the availability of freely accessible copies of a work.

Ideally we would address the full range of information seeking behaviours, testing for instance the presence of a known book in specific catalogues, the likelihood of a book rising to the top of results for a well-crafted search query, and the potential for serendipitous discovery in a potential reader’s regular workflow. However, developing a well-grounded taxonomy of visibility is beyond the scope of this report. We have therefore focussed on testing a range of information sources for the presence and quality of information on a specific set of identified books.

Identifying the target books

We developed a simple typology of OPERAS partners involved in the publication of OA monographs; and OPERAS partners involved in the hosting of OA monographs. OPERAS partners involved in publishing OA monographs were contacted and basic information about their approach to the dissemination of OA books was requested. A metafile for the OA books published by each press was also requested.

In order to maximise the quality of our communications with publishers a personalised approach to email communications was chosen. This included sending an initial email explaining the purpose of our work package and requesting a metafile, as well as specific information needed in order to clarify technical points. Wherever possible we drew on information gathered in WP3.1.

There was substantial variation in the format and content of metadata provided by the various OPERAS partners. The provided files included Excel, XML, and OAI-PMH feeds. Some partners provided metadata feeds rather than a single output metadata file. These variations also reflected diversity within the partners in their activities as well as in their capacity and workflows. For example, IBL PAN is not a publisher of traditional monographs but is involved in alternative approaches to OA books.

Publisher	Provided Metadata?	Format	Comments
UCL Press	Yes	ONIX
IBL PAN	No		Not publishing traditional monographs
Coimbra University Press	No		Don’t currently produce a single metafile as a standard process
Göttingen University Press	Yes	OAI-PMH XML
Open Book Publishers	Yes	Excel (xlsx)
Ubiquity Press	Yes		Produces OAPEN compliant OAI-PMH
SHARE Press	Yes	OAI-PMH XML

Table 1. Provision of metadata by OPERAS Partners

Ubiquity Press does not maintain a single metadata file relating to published books but relies on OAPEN for onward metadata distribution (they are currently developing their own feeds for MARC records). In contrast, while UCL Press also uses OAPEN as a platform and generates OAI-PMH from its internal hosting platform, it maintains a separate metadata master file.

The metadata provided also showed some weaknesses in the handling of internal information by OPERAS partners. For instance, a small proportion of ISBNs (51 out of 11,000) provided by partners either did not validate via the internal check-sum or could not be automatically validated through a standard regular expression. This suggests that the metadata provided to this project is not generally re-used in internal systems where such errors would be discovered.
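The two checks described above (a regular expression for the overall shape, then the internal check-sum) can be sketched as follows. This is a minimal illustration rather than the script used for this report: the function names and the exact regular expression are assumptions. Only the check-digit rules themselves are standard (ISBN-10 uses weights 10 down to 1 modulo 11, with X standing for 10; ISBN-13 uses alternating weights 1 and 3 modulo 10).

```python
import re

# Assumed shape check: after removing hyphens and spaces, an ISBN is either
# nine digits plus a digit or 'X' (ISBN-10), or thirteen digits (ISBN-13).
ISBN_SHAPE = re.compile(r"^(?:\d{9}[\dX]|\d{13})$")


def normalise_isbn(raw: str) -> str:
    """Strip hyphens and whitespace and upper-case a trailing 'x'."""
    return re.sub(r"[\s-]", "", raw).upper()


def is_valid_isbn(raw: str) -> bool:
    """Return True if the string has ISBN shape and a correct check digit."""
    isbn = normalise_isbn(raw)
    if not ISBN_SHAPE.match(isbn):
        return False
    if len(isbn) == 10:
        # ISBN-10: weights 10..1, 'X' counts as 10, total divisible by 11.
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch))
            for i, ch in enumerate(isbn)
        )
        return total % 11 == 0
    # ISBN-13: weights alternate 1, 3, 1, 3, ..., total divisible by 10.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(isbn))
    return total % 10 == 0
```

Run over a partner metafile, a check of this kind separates ISBNs that are malformed (the shape test fails) from those that are well-formed but mistyped (the check-sum fails), which are the two failure modes noted above.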

Overall, the initial finding on the quality and availability of data from OPERAS partners was that it was inconsistent between partners and of variable quality. As we will see, this leads to a range of problems in information retrieval and visibility analysis.

Testing for ‘visibility’

To address the question of visibility we conducted three broad kinds of survey:

Presence in relevant catalogues.
Visibility in web search.
Visibility in general information workflows.

The first approach was to survey whether the selected books could be identified within specific catalogues. The catalogues examined were chosen to cover common sources for books and open access content. These were WorldCat, BASE, Google Books, DOAB and OpenAIRE. We used their APIs, searching by title and author, to check whether the titles were in each catalogue and to identify the repositories hosting most of these titles.

In each case a search was run using identifiers or titles, with the aim of exhaustively identifying all books that could be confirmed as being available in each catalogue. We used the WorldCat classification API to identify the subjects for each title using ISBNs. We used Bielefeld Academic Search Engine (BASE), which harvests OAI metadata from institutional repositories and other academic digital libraries that implement OAI-PMH. We also checked the titles and their authors via the OpenAIRE API. As of November 2017, OpenAIRE contains around 23 million documents from 980 compatible data providers. The OpenAIRE system covers a higher proportion of titles from OAPEN and OpenEdition Books compared to BASE, which covers the OBP corpus more completely. Both repositories support search via DOI but not by ISBN, and were designed primarily with journal articles, rather than books, in mind. We also used the Google Books API and compared its results with the DOAB metafile in order to identify whether ISBNs for individual titles were registered in both catalogues.

The second form of visibility was the presence of the book in web search. We used the Webometric Analyst 2.0 tool developed by the group of Thelwall et al. ((Thelwall, M. (2009). Introduction to Webometrics: Quantitative Web Research for the Social Sciences. San Rafael, CA: Morgan & Claypool.)) to analyse both the number of pages discovered with a search of the book’s title and author’s surname, and their top and second level domain names. This gives some indication of geographic location (via country TLDs) and of domain of interest (via TLDs and SLDs, e.g. ‘.ac.uk’ or ‘.edu’ vs ‘.com’ or ‘.com.au’).
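As an illustration of the domain analysis described above, the top and second level domains of a referring URL can be read from its host name. The sketch below is not the Webometric Analyst implementation. The function name is hypothetical, URLs are assumed to include a scheme such as ‘https://’, and the second level is simply taken as the last two labels of the host name, which is a naive reading.

```python
from collections import Counter
from typing import Tuple
from urllib.parse import urlparse


def domain_levels(url: str) -> Tuple[str, str]:
    """Return (top-level domain, last two host labels) for a URL.

    Returns ("", "") when no host name can be parsed, for example when
    the scheme is missing.
    """
    host = (urlparse(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    tld = labels[-1]
    sld = ".".join(labels[-2:]) if len(labels) >= 2 else tld
    return tld, sld


# Hypothetical referring pages, for illustration only.
example_urls = [
    "https://www.ucl.ac.uk/press/some-book",
    "http://books.example.com.au/review",
    "https://library.example.edu/catalogue",
]
tld_counts = Counter(domain_levels(u)[0] for u in example_urls)
```

Under a country-code domain the last two labels give the sector (‘ac.uk’, ‘com.au’), whereas under a generic domain such as ‘.edu’ they give the organisation (‘example.edu’) and the sector is carried by the TLD itself. A fuller treatment would consult a list of registered public suffixes rather than counting labels.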

Finally, we examined a range of services for evidence of activity or presence that would support the visibility of books. We investigated the reported OA status of books with DOIs using the oaDOI service as well as the presence of ISBNs and DOIs relating to the target books in the ORCID 2017 public data dump. We additionally provided Altmetric.com with a complete list of DOIs and ISBNs which was used to interrogate their dataset for information on social and mainstream media that could be linked to one of the target books.

Visibility of Target Books in Specific Catalogues

Surprisingly, BASE shows relatively poor coverage overall. In most cases the general catalogues of content show fairly good coverage, but for BASE this is not the case. The visibility results are dominated by the large number of books from OpenEdition Books and from OAPEN. The aggregate results therefore hide some substantial differences between book sources. In particular it is the 29% representation of OpenEdition Books titles in BASE, and about 50% coverage of OAPEN, that drives the lower numbers for BASE overall.

Figure 1 shows the overall results for all the books in our set across the full range of ‘discovery services’. Overall we see good coverage of the books in this set in DOAB, Google Books, OpenAIRE and WorldCat. There is also some form of web search results for most of the books. By contrast, presence in Altmetric results and in ORCID is much less comprehensive.

Coverage in DOAB is uniformly good across all sources of content. OpenAIRE coverage is generally good but weak for EKT, Göttingen, and Napoli University, and a similar pattern is seen for WorldCat, except that Göttingen has excellent WorldCat coverage. Overall the three larger sources (OAPEN, OBP, OpenEdition Books) show better visibility in these catalogues.

There are no obvious differences between catalogue visibility on the basis of language. The analysis here is challenging as a smaller number of European languages cover the majority of books and different content sources have differing language focus. Therefore the question of visibility by language is confounded with that of the visibility by source. Dutch books appear to be underrepresented in both DOAB (58% absent) and WorldCat (65% absent) but well represented in BASE (80%) and OpenAIRE (96%). This may be due to the fact that a significant number of books from the Netherlands in OAPEN do not have an open licence and are therefore not in DOAB (which is in turn feeding WorldCat).

OPERAS Partner	Google Books	OpenAIRE	DOAB	BASE	WorldCat
	(% present)	(% present)	(% present)	(% present)	(% present)
EKT	0	0	100	0	17
Göttingen University Press	89	42	98	39	96
Napoli University Federico II	44	28	97	34	28
OAPEN	73	91	92	49	85
Open Book Publishers	99	74	100	86	94
OpenEdition Books	89	93	99	29	90

Table 2. Visibility of OPERAS partner books in a range of catalogues.
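The ‘% present’ figures in Table 2 are simple proportions: for each partner, the share of its books that were found in a given catalogue. A minimal sketch of that aggregation is shown below. The record layout (one dictionary per book with a ‘partner’ key and a true/false flag per catalogue) and the function name are assumptions made for illustration, and the sample records are invented rather than taken from the study data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def percent_present(
    records: Iterable[dict], catalogues: List[str]
) -> Dict[str, Dict[str, int]]:
    """Return, per partner, the rounded percentage of books found in each catalogue."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for rec in records:
        partner = rec["partner"]
        totals[partner] += 1
        for cat in catalogues:
            # A missing or false flag counts as 'not found'.
            if rec.get(cat):
                hits[partner][cat] += 1
    return {
        partner: {cat: round(100 * hits[partner][cat] / n) for cat in catalogues}
        for partner, n in totals.items()
    }


# Invented example records, for illustration only.
sample = [
    {"partner": "Press A", "DOAB": True, "BASE": True},
    {"partner": "Press A", "DOAB": True, "BASE": False},
    {"partner": "Press A", "DOAB": True, "BASE": False},
    {"partner": "Press A", "DOAB": False, "BASE": False},
    {"partner": "Press B", "DOAB": False},
]
coverage = percent_present(sample, ["DOAB", "BASE"])
```

Because each percentage is calculated within a partner, a source with very few books can swing between extremes on the basis of a handful of titles, which is worth bearing in mind when reading the rows for the smaller partners.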

Visibility of Target Books in Web Search

Web visibility was determined by running searches with the title and author’s name. This provided a score as well as a list of referring sites. Due to small numbers it is not possible to draw any comparative conclusions between platforms in terms of their web visibility.
In general terms each platform saw a similar pattern with a high variability in web presence across the collection i.e. some books show a significant web presence with many showing only a small presence. This is an expected pattern given the different level of interest expected across such a large corpus of books. As the corpus also includes older books some references may also not be to the online open access versions.

Figure 2. Box-plot showing the number of websites associated via web-search with each published book in the corpus. Each dot represents a single book. The box and line show the mean and one standard deviation for each host platform.

This form of analysis may be of value in identifying both books with high web visibility and also those which would benefit from additional marketing activity. The analysis is relatively straightforward with the Webometrics tool and can provide quite rich information. As an example we look at how different languages feature in terms of their visibility. This analysis gives a sense of both the relative proportion of books in different languages as well as a comparative sense of visibility.

Figure 3. Distribution of web-presence by language of book. Languages are ordered by the mean number of linked websites. For the most common languages, the means are within a single standard deviation of each other, indicating no statistically significant difference.

In this case we see the dominance of French and English in this corpus (density of points) alongside German, Dutch, Spanish and Italian as other well represented languages. Overall we see no strong or significant difference between the web visibility of these books based on language. While a bias towards English might be expected, this does not seem to be the case. This is at least in part due to the strong focus on French (and other non-English) language books by OpenEdition Books.

A different form of analysis is to look at the Top Level Domain (i.e. country code) of URLs referring to these books, broken down by the language of the book. This provides an interesting insight at an aggregate level into the interest in books from different countries in different languages. Here we show the most represented language of book for each country top level domain. This reveals a logical pattern with Latin America showing a preference for Spanish books, with the exception of Suriname (Dutch, the official language), French Guiana and Brazil (French). Francophone and Anglophone Africa are quite clearly distinct and East Timor shows the expected preference for Portuguese. France, the Netherlands, Germany and Italy all show a preference for their native language. There are also some apparently unexpected results which deserve more analysis on a larger corpus. Spain, Portugal, and Brazil all show a preference for French, which is most likely due to the limited presence of Portuguese books in this corpus.

Figure 4. Top Publication Language by top-level domain. For each country code (e.g. ‘.uk’) the most visible book (the one referenced by the most search results) was identified and its language identified. Latin America has a higher visibility of Spanish-language books and francophone and anglophone Africa are clearly visible.
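A minimal sketch of this kind of aggregation is given below. It follows the "most represented language per country top level domain" description and is not the code used for the report. The referring URLs and language codes are hypothetical, and treating any two-letter final host label as a country code is a simplifying assumption.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical (referring URL, book language) pairs; NOT data from the report.
REFERRALS = [
    ("https://biblioteca.example.mx/libro/1", "es"),
    ("https://uni.example.mx/catalogo/2", "es"),
    ("https://blog.example.mx/post", "en"),
    ("https://bib.example.nl/boek/3", "nl"),
    ("https://www.example.fr/livre/4", "fr"),
    ("https://www.example.com/book/5", "en"),
]

def country_tld(url):
    """Return the two-letter final host label, or None for generic TLDs."""
    host = urlparse(url).hostname or ""
    label = host.rsplit(".", 1)[-1].lower()
    return label if len(label) == 2 and label.isalpha() else None

def top_language_by_tld(referrals):
    """Map each country-code TLD to the book language with the most referrals."""
    counts = defaultdict(Counter)
    for url, language in referrals:
        tld = country_tld(url)
        if tld is not None:
            counts[tld][language] += 1
    return {tld: c.most_common(1)[0][0] for tld, c in counts.items()}

top = top_language_by_tld(REFERRALS)
```

Generic domains such as `.com` are dropped here because they carry no country signal. A fuller analysis would also need to decide how to break ties between languages.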

Visibility in General Scholarly Information Workflows

To examine the visibility of OPERAS partner books in general scholarly workflows we examined a number of sources of information. The first of these is the oaDOI service that provides information on open access status of objects identified by Crossref DOIs. This service is being deployed in a range of library systems and within Web of Knowledge by Clarivate – so accurate information on open access books is of value.

The second source of visibility data was Altmetric.com, which provides data on mainstream and social media activity for scholarly works. Finally we searched the ORCID public data dump for 2017 for the presence of DOIs and ISBNs associated with OPERAS partner books. These would in most cases have been added by the authors to their profiles.

In all three cases we saw extremely poor visibility. Of the 636 DOIs that were available for this analysis within the OPERAS corpus only 41 were returned as Open Access by the oaDOI service. Only 31 were present in the ORCID data dump. The oaDOI service is limited to providing information on DOIs, which is only relevant for ~10% of the corpus, but the reasons for the poor results merit further investigation. It is likely to be a combination of a service that is focussed on journal articles and the general variability in quality of metadata provided by OPERAS partners.

Only 160 ISBNs were identified in the ORCID data dump, suggesting that overall there is little encouragement from publishers, platforms or authors’ institutions to include information on book-length works in ORCID profiles. This may also represent a lack of support for the automated ingestion of book metadata to ORCID, which in turn would need to be supported by more consistent and complete metadata streams from publishers or platforms.

The data obtained from the Altmetric.com service is more interesting and also more informative. Nearly 1000 of the OPERAS books show some form of activity tracked by Altmetric.com, either mainstream or social media. The vast majority of these are on the OAPEN platform with a further contribution from OBP and OpenEdition Books. The dominance of OAPEN is possibly related to the presence of <meta> tags on OAPEN records (Euan Adie, Altmetric.com, personal communication). Another 304 books are registered in the service but show no activity, again dominated by books from OAPEN followed by OBP. These are stub records that have been created for institutional customers of the Altmetric.com service where book authors are affiliated.
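To illustrate why <meta> tags on a record page can help a downstream service, the sketch below reads bibliographic metadata out of a page's HTML. The report does not say which tags OAPEN uses. The `citation_`-prefixed names and all values here are assumptions made for illustration, and the page is hypothetical.

```python
from html.parser import HTMLParser

class MetaTagCollector(HTMLParser):
    """Collect <meta name="citation_*" content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.citation = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        # Assumed naming convention: bibliographic tags start with "citation_".
        if name.startswith("citation_"):
            self.citation.setdefault(name, []).append(attributes.get("content"))

# Hypothetical landing page; tag names and values are illustrative assumptions.
PAGE = """
<html><head>
<meta name="citation_title" content="An Example Monograph">
<meta name="citation_isbn" content="9780000000002">
<meta name="citation_doi" content="10.1234/example">
<meta name="viewport" content="width=device-width">
</head><body></body></html>
"""

collector = MetaTagCollector()
collector.feed(PAGE)
metadata = collector.citation
```

A service that finds a link to such a page can read the identifiers directly from the markup instead of guessing which book the URL belongs to.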

The Altmetric.com service was originally targeted at journal articles, with one primary location online at the publisher website. A large part of its value offering is a high quality aggregation of online references to articles that is achieved by tracking all the relevant URLs that refer to an article, rather than just DOIs as is common for some other services. This is much more challenging for books, which often reside at multiple locations. Therefore the service works to actively track and aggregate URLs relevant to books that are of interest, particularly those published by authors based at institutions that are customers of the Altmetric.com service. This is important because it illustrates how engagement with a downstream service can help motivate the gathering of relevant metadata to improve data aggregation and analysis. More generally it shows how the provision of good metadata, in this case a curated list of all the URLs where a book might be found, can prime a service to collect higher quality data. It is important to note that the responsibility for providing this kind of data does not currently belong to anyone in the supply chain. Making a community decision about where to locate that responsibility and how partners might provide data is a role that OPERAS might take.

Findings

The metadata held and managed by OPERAS partners is inconsistent and variable in quality. Collecting and aggregating data from multiple OPERAS partners was a challenge due to inconsistency in bibliographic metadata processes and formats. Several partners were not explicitly included in the analysis because separate data was not available, and some analysis is limited by issues with the data provided. This includes ISBNs that appear to be incorrect.

These data quality issues create a number of downstream challenges. Firstly analysis is more challenging and involves more manual work, raising the cost and limiting the generalisability of findings. Secondly it creates a relative lack of interest amongst downstream data aggregators and providers in collecting data relating to books. Books offer particular challenges and the market remains focussed on journal articles. Nonetheless as we note below, there is interest in handling books better, which would be encouraged by the provision of more consistent and complete metadata.

The visibility of OPERAS partner books in catalogues varies by publisher. OPERAS partners have clearly focussed on different catalogues to optimise the visibility of their content. Given the heterogeneity of OPERAS partners this is not a surprise. It is also evidence of a lack of crosstalk between catalogues. Again, the provision of standardised bibliographic metadata could aid both small and large publishers and platforms in gaining more visibility across all the relevant catalogues.

Evidence can be obtained that books relevant to specific regions gain interest and attention in that region. On aggregate we have shown evidence from the analysis of country top level domains that books are often more discussed and written about in countries where the language of the book is common. We have previously shown how web visibility and country-level usage analysis can demonstrate local usage of single books. This new analysis shows that similar information can be gained at a corpus level.

While we did not see an obvious visibility bias for languages that appear frequently in the OPERAS corpus, it may be the case that rarer languages do see a bias. It may also be the case that the lack of bias is due to strong representation of French work by OpenEdition Books. We did see less visibility for books in Greek, Arabic and Russian (i.e. in different scripts); however, the small numbers here limit any statistical conclusions.

The variable quality of book metadata creates challenges in analysing visibility consistently. Throughout this analysis we have had challenges in comparing like with like due to the differences in metadata completeness and quality. Similarly this will create challenges within individual partners seeking to do similar analyses. Finding ways to maintain, use and deliver high quality metadata at low cost, probably through the development of shared platforms, offers multiple benefits for OPERAS partners including better internal information, greater ease in tracking and better engagement with downstream collectors and analysts of data.

The variable quality of book metadata creates challenges for downstream data aggregation and analysis providers. In discussion with a series of downstream data providers including oaDOI and Altmetric.com the issue of tracking information for books was raised. These downstream providers are aware of limitations in their data collection for books and have an interest in improving the quality and completeness of the data they collect. In most cases they currently appear to be limited to manually updating data based on direct interactions with customers.

In general there is a question for those engaged in the production of books and open access books in particular as to who they want to design and implement solutions. By default the sector will get systems focussed on journal articles and STEM output processes. There is interest in engaging, but without a concerted effort from the providers of book content this is unlikely to be well integrated with book production.

Digital Visibility Challenges and Opportunities for OPERAS Partners

The promise of Open Access scholarly monographs is multi-faceted. First it provides easier and more efficient access to scholarly work for scholars. Secondly it offers access to previously expensive content to broader communities of interest who either do not have access to, or would not think to use, an academic library. In particular the free distribution of content online offers to bring together communities of interest around a specific topic. These communities may be small as well as diverse and geographically distributed. Their engagement with, and ultimately their input into scholarship has the potential to strengthen public support and enrich and diversify its impact.

To achieve this promise it is not sufficient that open access monographs be available; they must also be visible and accessible to these diverse audiences. OPERAS partners, funders, platforms, and publishers are already delivering on the issue of availability. Here we address the question of visibility. As has been discussed, visibility is a complex issue. Visible to whom? Under what circumstances? After what kinds of search? Mapping all the possible discovery pathways is a future challenge.

In this work we have taken a deliberately narrow scope. We start with the assumption that high quality and consistent bibliographic metadata at source is key to enabling the wide range of services and systems that will support discovery and visibility in diverse contexts. Our focus in these recommendations and issues is therefore on the way in which consistent metadata provision and dissemination through common channels provides a route towards visibility.

Challenge – The quality and consistency of OPERAS Partner metadata is variable

An early finding of the work package, consistent across the survey, the metadata provided, and the completeness of records in third party systems, was variability in the format, completeness, and quality of metadata. In the survey there was qualitative evidence of differing degrees of concern and interest with specific issues, relevant to specific presses and platforms. In the metadata provided there were substantial inconsistencies in format, completeness and validity. For instance, there was a small but significant presence of invalid identifiers (51 ISBNs that did not validate).
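Checking whether an ISBN validates is a simple check-digit calculation, which makes this class of error cheap to catch at source. The sketch below applies the standard ISBN-10 and ISBN-13 check-digit rules. It is an illustration rather than the procedure used for this report, and the function names are our own.

```python
ASCII_DIGITS = set("0123456789")

def clean(isbn):
    """Strip hyphens and spaces and upper-case a trailing 'x'."""
    return isbn.replace("-", "").replace(" ", "").upper()

def is_valid_isbn10(isbn):
    """ISBN-10: weights 10..1, total divisible by 11; 'X' as check digit means 10."""
    s = clean(isbn)
    if len(s) != 10 or not set(s[:9]) <= ASCII_DIGITS:
        return False
    if s[9] not in ASCII_DIGITS and s[9] != "X":
        return False
    digits = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0

def is_valid_isbn13(isbn):
    """ISBN-13: alternating weights 1 and 3, total divisible by 10."""
    s = clean(isbn)
    if len(s) != 13 or not set(s) <= ASCII_DIGITS:
        return False
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s))
    return total % 10 == 0

def is_valid_isbn(isbn):
    """True if the string is a well-formed ISBN-10 or ISBN-13."""
    return is_valid_isbn10(isbn) or is_valid_isbn13(isbn)

def invalid_isbns(isbns):
    """Return the subset of identifiers that fail validation."""
    return [i for i in isbns if not is_valid_isbn(i)]
```

A passing check digit only shows that the string is well formed; it does not confirm that the ISBN was assigned to the book in question.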

Further downstream in the data and discovery process there was clear evidence of a lack of consistency in metadata delivery. As will be discussed below, this is at least in part a result of diversity in the mission and goals of specific OPERAS partners and their capacity to focus on internal metadata systems. It is also a function of existing discovery and metadata systems only recently grappling with the issues of books. However, in a distributed and global information ecosystem the provision of consistent, correct, and high quality metadata is a necessary condition of optimising for visibility and discovery.

Challenge – Diversity in gathering, cleaning and reporting usage data across OPERAS partners makes comparison difficult

Usage data was a focus of the survey work, and previous work by KU Research has focussed on usage data collected by the OPERAS partner UCL Press (http://dx.doi.org/10.17613/M6H49K) as well as for four presses using the JSTOR platform (http://kuresearch.org/PDF/jstor_report.pdf). It was not part of the visibility mapping exercise, at least in part because the previous work and survey showed that a comparison is not feasible. OPERAS Partners that host content collect data differently, clean that data differently, and report it differently. Even where a standard protocol is used, for instance where data is referred to as “COUNTER Compliant” or “COUNTER Protocol”, there is evidence of substantial differences in collection, management, exclusions and reporting. In some cases this relates to differences in the definition of access status and in others simply to differences in technical systems.

Details of internal operations tend to be sensitive as is the release of data, particularly where it is likely to be used for comparisons. Data quality issues currently mean that any comparison is likely invalid, but equally without an increase in transparency for data collection and reporting the development of best practice is unlikely. Legal, ethical and trust issues are also a significant challenge (see below).

In particular the small scale of many OPERAS partners means that they will not have the capacity to develop their own in-house expertise and systems. Adoption of good practice to generate high quality data will depend on sharing the burden of capacity building in some way. That, in turn, cannot happen until there is a framework that provides sufficient trust to allow the sharing and comparison of data and its management.

Challenge – Application of existing systems is not always straightforward for books

Existing systems for digital and online research discovery and distribution have been largely built with journal articles in mind. The implicit assumption of a single Version of Record, hosted on a publisher-controlled website, that only rarely goes through any change is built into metadata creation, identifier systems, discovery and distribution channels. The dominant means of delivery for journal articles is now online with print a niche provision in many disciplines. In contrast for books, print still remains the focus for many publishers and the engagement with online and digital supply chains reflects that.

The confusion and inconsistency in coining and distributing Crossref DOIs and ISBNs is one example of this. Even though the set of OPERAS partners are focussed on online and digital as open-access focussed providers, there is confusion and inconsistency in the use of identifiers. Partner-provided metadata files referred to many different types or ‘versions’ of DOIs and ISBNs (‘electronic’, ‘online’, ‘print’, different file formats, platforms), in addition to the inconsistent provision of DOIs at the chapter level.

As noted elsewhere the scale of OPERAS partners and book providers in general means that the technical capacity is not necessarily available internally to engage with these issues and systems. In addition, as small players, OPERAS partners and others often do not have the levels of staff capacity to engage directly in community efforts to develop greater consistency in data practices.

The lack of applicability to books also plays out downstream. Systems such as Altmetric.com are able to exploit the (generally) single and predictable online location of journal articles to connect Crossref DOIs to URLs and aggregate mentions. For books Altmetric.com needs to undertake this work in a manual and directed fashion because there is no straightforward way to discover all the locations of a book online, and therefore to understand when social or mainstream media is linking to a copy. This challenge is also exacerbated by inconsistent practice and quality of metadata provided by publishers and platforms.
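The role of a curated list of locations can be sketched as follows. This is a hypothetical illustration, not a description of how Altmetric.com works internally. The identifier, URLs and normalisation rules are assumptions chosen to show how one book identifier mapped to several locations lets mentions of any copy be counted together.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical curated list: one book identifier mapped to every known location.
BOOK_LOCATIONS = {
    "10.1234/example-book": [
        "https://publisher.example.org/books/example-book",
        "https://platform.example.net/record/98765",
        "https://repository.example.edu/handle/1/2345",
    ],
}

def normalise(url):
    """Reduce a URL to host plus path, ignoring scheme, 'www.', query and trailing slash."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

def build_index(locations):
    """Invert the curated list into {normalised URL: book identifier}."""
    return {normalise(u): book for book, urls in locations.items() for u in urls}

def aggregate_mentions(mention_urls, index):
    """Count mentions per book, matching any known location of that book."""
    counts = defaultdict(int)
    for url in mention_urls:
        book = index.get(normalise(url))
        if book is not None:
            counts[book] += 1
    return dict(counts)

# Hypothetical links found in social or mainstream media.
MENTIONS = [
    "http://www.publisher.example.org/books/example-book/",
    "https://platform.example.net/record/98765?utm_source=social",
    "https://unrelated.example.com/page",
]
result = aggregate_mentions(MENTIONS, build_index(BOOK_LOCATIONS))
```

Without the second and third entries in the curated list, the mention of the platform copy would go uncounted, which is the gap described above.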

It is worth noting however that journal articles will start to face some of the same issues as green open access increases alongside preprint adoption. OPERAS partners could take a lead on developing best practice for identifying multiple locations online and take a leadership role in supporting the next generation of discovery and identifier infrastructures.

Challenge – Diversity of approaches, goals and definitions creates challenges for developing common platforms

As we have noted in several places in this report there is enormous diversity in the missions, goals, and activities that different OPERAS partners undertake, even those that might be categorised together as “publishers” or as “platforms”. This plays out in many ways, in the different assumptions that various partners bring to engaging with external platforms, but also in the needs for reporting and the strategic goals that drive decision making.

One example of this is the different definitions of what constitutes “open access” amongst various OPERAS partners. OpenEdition Books and Open Book Publishers offer a set of freemium offerings where some formats of the book are free but others are charged for. Others deliver only one freely accessible online format. At the same time demonstrating the use of online content appears important for most partners. This leads to a situation where usage data is sensitive and potentially competitive but also not readily comparable.

In the longer term it will become necessary to address questions as to whether formats for screen reading (some of which may have restricted functionality) are more “visible” than epub and fully downloadable PDF, and how digital visibility relates to print sales. The diversity of OPERAS partners is a strength in providing offerings for different parts of the scholarly community. It will also be a challenge in divining how the investment in visibility supports different communities. The small scale and competitive nature of OPERAS partners means that finding ways to share information and best practice will be critical. The diversity of goals, funding streams and contexts will be a challenge in delivering that.

Challenge – A lack of engagement with data governance and ethics

While not a technical issue, the issue of data governance appears a substantial risk for OPERAS partners in two areas. Firstly there is significant variability in awareness of the implications of handling and analysing user logs. While some partners use Piwik as a local tool to collect logs many use Google Analytics. While Google Analytics (and other Google services) will presumably meet the standards being introduced under the General Data Protection Regulation in Europe there is a growing sense that they don’t meet the ethical expectations of the scholarly community.

Survey answers and parallel work in the HIRMEOS project suggest to us that while some partners are sensitive to these issues the majority are not. Further, it is not clear that the technical capacity exists to properly address issues of privacy that arise as the desire for more granular information on usage and visibility grows. Future work should address the legal liability issues that arise from holding such logs and the forms of analysis, data sharing, and data retention that are appropriate for our community.

A related issue is that of governance frameworks for data sharing. If the goal of the OPERAS network is to support shared best practice and capacity building, then this will necessarily involve data transparency and sharing. As noted, usage data in particular can be highly sensitive, in addition to implicating privacy regulations. Building a framework in which trusted parties can benefit from data and tool sharing will be crucial for achieving the goals of the OPERAS network.

Opportunity – OPERAS can act as a growing network for best practice and capacity building

A theme with many of the challenges is that of coordination and sharing the burden of developing technology and best practice. That in turn is a substantial opportunity for OPERAS to develop a network which can support partners in sharing the development and implementation of best practice. The ongoing growth of the OPERAS network is a positive sign in this sense.

OPERAS could benefit from building its own capacity to act as a hub for initiatives or even to act as a node for the coordination of resources. While its current role as a focus for grant funded activities is a good step in this direction, building up a long term capacity to deliver value for partners will support the sustainability of the network as well as providing a focus for future activities.

The diversity of partners within OPERAS means that there already is both knowledge and existing best practice that could be shared from within the network. Building internal trust will be important, and this suggests that some of the issues raised above on governance arrangements should be tackled early. This will also need to develop a global focus to include other key players. If successful, OPERAS could play a key role in ensuring a continuing diversity of scholarly book publishing organisations in contrast to the continuing concentration of journal publishing and the issues that that brings.

Opportunity – Working with downstream providers through better metadata provision

While we have focussed on the inconsistency of metadata provided by OPERAS partners, the deficiencies of downstream systems in handling books, and the consequent gap, we have also seen a desire to engage and improve these systems. In particular downstream systems face challenges in connecting identifiers to a complete set of online locations (URLs) and in gaining clarity on the use of metadata to signal access state and other issues.

If practice can be systematized and the overall quality of metadata improved, there are therefore significant opportunities to improve the visibility of open access books in these systems. There is also an opportunity to engage with these systems to ensure that the interests of OPERAS partners are served in implementation decisions that will need to be made.

There are unresolved questions of where in the supply and distribution system the responsibility for creating, managing, and distributing metadata lies. As noted elsewhere these decisions were largely made by default in the journal article system. For books with the complex relationships between publishers, aggregators, platforms and discovery tools these responsibilities are less clear. Who should register DOIs? Who is responsible for maintaining landing pages? For different editions and versions? How can multiple competing platforms work together to enable discovery? While the answers to these questions are beyond the scope of this report, working to resolve them is an opportunity for OPERAS to take a leadership role as well as to maximise the visibility and usage of OPERAS network books in ways that are appropriate and suitable for OPERAS partners.

Appendix A – Survey Questions

Appendix B – Survey Responses

Appendix C – Analysis by platform/publisher

(Please see pdf version for appendices)

The full report is available at: doi: 10.5281/zenodo.1230342
