Technical mapping of the OPERAS environment


Download the full OPERAS Design Study here: OPERAS Design Study

doi: 10.5281/zenodo.1009544


Objective

The technical mapping of the OPERAS environment is meant to provide a global description of the technical, organizational and information systems within the OPERAS consortium. More precisely, the mapping collected detailed information about workflows, software, development languages, data and metadata management, and dissemination and distribution tools.

The main aim was to identify similarities, compatibilities and possible interoperability.

Executive summary

 Method

The technical mapping was carried out through a questionnaire sent to the partners. Each of them received a table structured along the most common types of digital publishing activities.

As digital publishing is not yet sufficiently standardized, a draft was proposed to various individuals and profiles from the consortium and then collectively validated. The draft and the final version are loosely based on enterprise architecture concepts (see: https://en.wikipedia.org/wiki/Enterprise_architecture_framework).

The tables were the following:

  • organization;
  • activity;
  • applications and services;
  • information system;
  • hardware;
  • prospects.

Participants

Other partners

Not relevant (no platform):

  • Knowledge Unlatched (en)
  • ISCTE (pt)
  • CRUI (it)
  • CNR (it)
  • AEUP (fr)

No response:

  • Zadar University (hr)
  • Università di Venezia (it)
  • CVCE (fr)

New partners:

  • Coimbra University Press (pt)
  • Huma-Num (fr)
  • IBL PAN (pl)

Results

 Preliminary remarks

This work represents a first identification of practices, workflows and tools within the OPERAS consortium. It is mainly a basic inventory. The categories used in the survey can and must be improved later through a collaborative process.

The responses are detailed and represent a reliable collection of all the information needed. Nevertheless, some answers indicate that the categories used for the survey were somewhat too loose or too abstract. For instance, the questions about publishing on the one hand and workflow on the other created some confusion, and the same response could be found in both fields. The metadata questions were difficult to classify because of the several types and uses of metadata, but this aspect has to be better formalized in order to obtain a better description of the data management process within the consortium. Building on this first attempt, the main activities of the partners should therefore be defined anew in order to offer a better articulation between concepts and real practices.

For these reasons, we have decided not to follow the order of the tables but to reorder the content of this report on the basis of the schema in Annex 1. This schema represents, in a circular way, the various activities and missions of the digital publishers involved in the OPERAS consortium.

The sections below are an adaptation of this schema to our technical content (see table “Functional architecture” in Annex 2). We will present the various functions from the most technical to the most abstract.

Information system

Development language, Database, Size limit, Hardware

Leaving aside the front-end languages (HTML, CSS, JS), the general information collected regarding the development languages is two-fold:

  • a first group of participants benefits from an external IT system managed by their organization or a partner and has no information on the topic;
  • a second group is characterized by in-house IT, that is, an independent IT department or an operationally autonomous set of IT skills (EKT, OAPEN, OBP, OE, SHARE, UGOE, UP).

In this second group, when several languages are indicated, it would be useful to know what each language is used for and to what extent. This would make it easier to imagine potential collaborations.

It is worth noting, however, that a majority of partners are PHP/MySQL users. With the exception of MWS (Python/Zope Object Database) and UGOE (XML publishing with Apache Cocoon), all the others use PHP alone or in combination with other languages.

The database sizes and data size limits give us information about the present state of data management and its possible evolution. For books and/or journals only, the database sizes are as follows:

  • less than 1 GB (OBP, SHARE books, UGOE)
  • around 2 GB (SHARE journals)
  • around 15 GB (OE Books)
  • around 30 GB (EKT, OE journals)
  • 100 GB (MWS), 240 GB (UP)

These data should nevertheless be completed with additional information on the purpose of each database and on whether several databases exist for each DBMS.

A few partners indicated a limit on input data size (EKT, OAPEN, UGOE, UP), ranging from 20 MB to 4 GB; it would be interesting to know whether and how it affects their practices.

As for the hardware, the distribution is essentially as follows:

  • Virtual Machines: OBP (2 VMs)
  • Servers: MWS (2 rented servers), SHARE (3 servers), UGOE (1 server), UP (6 servers)
  • Servers and VMs: EKT (2 servers, n VMs), OE (21 servers, 40 VMs)
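To make the relative scale of these infrastructures easier to compare, the figures above can be tallied in a few lines. This is only an illustrative sketch: the counts are copied from the list, the dictionary layout and function names are our own, and a partner listed under a single category is assumed to have reported none of the other kind. The unspecified "n VMs" of EKT is kept as an unknown value and left out of the totals.

```python
# Counts copied from the hardware list above.
# Assumption: a partner listed only under "Servers" or only under
# "Virtual Machines" reported none of the other kind (encoded as 0).
# None marks a count the survey leaves unspecified ("n VMs" for EKT).
hardware = {
    "OBP":   {"servers": 0,  "vms": 2},
    "MWS":   {"servers": 2,  "vms": 0},
    "SHARE": {"servers": 3,  "vms": 0},
    "UGOE":  {"servers": 1,  "vms": 0},
    "UP":    {"servers": 6,  "vms": 0},
    "EKT":   {"servers": 2,  "vms": None},
    "OE":    {"servers": 21, "vms": 40},
}


def total(kind):
    """Sum the known counts for 'servers' or 'vms', skipping unknown values."""
    return sum(h[kind] for h in hardware.values() if h[kind] is not None)


def share(partner, kind):
    """Fraction of the known total of `kind` operated by `partner`."""
    return hardware[partner][kind] / total(kind)


print(total("servers"))                 # 35
print(total("vms"))                     # 42 (EKT's VMs not counted)
print(round(share("OE", "servers"), 2)) # 0.6
```

On these figures, OE alone accounts for 21 of the 35 servers reported.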

Data and metadata processing

Indexing, Search functionality, Reference sets, Metadata standards, Identifiers

This section gathers the processes that create access points to the content or allow it to be referenced.

The indexing of the content is mainly handled in an automated way by the participants. Several use the full-text search provided by their publishing tool or repository application: OJS, OMP, EPrints or DSpace (EKT, SHARE, UniTo). Others use a dedicated search engine such as Solr (OE, UGOE) or Lucene (OAPEN). Some manual indexing is nevertheless used to complete the work of the application (UGOE, OBP) or for specific purposes (SHARE for WorldCat). Automated indexing also allows for faceted search, but another set of questions would be useful to assess the quality of the search functionality, especially by evaluating the results for each facet. Indeed, one participant reports poor results from the embedded search functionality of OJS/OMP.

A minority of participants also enrich their content with referenced subject headings: BIC, BISAC, VLB, LCSH (OAPEN, OE, UCL, UGOE). It is hard to assess how much these reference sets help discoverability and whether they are difficult to maintain, but the partners concerned could perhaps give more information on this question.

Despite the expected similarities, the metadata standards used by participants show some variation (no two participants use exactly the same set of standards); this could be examined more closely from an interoperability perspective. As we lack information on how these metadata are generated, it is hard to tell how difficult an adjustment would be; it is worth mentioning, though, that some publishing tools allow for this generation (e.g. OJS). The main generated standards are DC, MARC and ONIX; DCQ and MARC XML are rarer. Alternative standards are METS, NLM, RFC1807, ESE and PICA XML. Leaving aside the various functions of the standards (DC for OAI-PMH, ONIX for distribution, etc.), it would be appropriate to give more information about the specific use of each standard in order to check how far they are effectively interoperable.

Identifiers are another kind of metadata, and we wish to underline the rather wide use of interoperable identifiers. Alongside the HIRMEOS group (EKT, OAPEN, OE, UGOE), where DOI, ORCID and Funding registry are being implemented, others already have DOI (soon MWS, OBP, OLH, SHARE, UCL, UniTo, UP) or ORCID (OLH, SHARE, UniTo, UP).

On a related topic, which could have been investigated in the survey, it is worth mentioning that one partner provides persistent URLs for its content (MWS).
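Once per-partner lists of standards are available, the overlap between them could be checked mechanically. The sketch below shows one possible way to do so. The partner names and the sets of standards are hypothetical placeholders chosen for illustration, not survey data, and the function names are our own.

```python
from itertools import combinations

# HYPOTHETICAL placeholder data: the report does not give the list of
# standards per partner, so these sets are invented for illustration only.
standards_by_partner = {
    "partner_a": {"DC", "MARC", "ONIX"},
    "partner_b": {"DC", "ONIX", "METS"},
    "partner_c": {"DC", "MARC XML"},
}


def shared_by_all(standards):
    """Standards that every partner produces (the common denominator)."""
    return set.intersection(*standards.values())


def pairwise_overlap(standards):
    """For each pair of partners, the standards they have in common."""
    return {
        (a, b): standards[a] & standards[b]
        for a, b in combinations(sorted(standards), 2)
    }


print(shared_by_all(standards_by_partner))
for pair, common in pairwise_overlap(standards_by_partner).items():
    print(pair, sorted(common))
```

A non-empty common set would suggest a starting point for exchange between all partners, while the pairwise view shows where bilateral exchange would be easiest. Neither says anything about how each standard is actually used, which is the open question raised above.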

Publishing

Types, Number of documents, Printed copy, Publishing tools, Single source publishing

This section gathers the various elements of the central activity of the OPERAS consortium, digital publishing.

The majority of the participants publish more than one type of document. Far from being limited to the more traditional journals and monographs, the types of documents handled by the participants cover almost the whole range of academic production. Even if not all kinds of documents are handled in the same way, it is worth noting, in view of the evolution of scholarly communication, that some participants have expertise with different sorts of data. Alongside proceedings, textbooks and theses, we also find blogs, images, audio/video files, software or, potentially, any kind of data. Note that the different types are sometimes handled with specific software, but this seems more related to the size of the organization (e.g. SHARE, UniTo).

The overall published content of the participants clearly gives the OPERAS consortium a strategic position. One partner (OE) stands apart by its size and variety, but it would be interesting to know the trends and perspectives of each partner.

Print-on-demand services are more widespread among the participants than one might expect (OBP, SHARE, UCL, UGOE, UniTo). If needed, this could allow for collaborative work or advice.

As for the publishing tools, the first observation is the rather wide use of PKP’s software (OJS, OMP) among the partners (EKT, SHARE, UCL, UniTo and soon MWS). This obviously opens possibilities for collaboration, and already does for some of them. As some participants in this group do not use PKP’s software for all their content (UniTo, MWS) and others use different tools for different content (Lodel and WordPress for OE), it would be interesting to investigate in more detail the relation between tool and purpose and the reasons for these choices.

Another important aspect of the publishing tools is development. Two partners manage an entire publication process with their own software: OE (Lodel) and UP (Rua/Jura). Others have a strong development activity (OBP) or have produced plugins (EKT, MWS). This could lead to fruitful technical collaborations useful to the OPERAS consortium.

The analysis of publishing tools can also include the question of single-source publishing. While it seems easier to have a single pivot format with a single publishing tool (XML-TEI / Lodel for OE), other participants also use XML (MWS) or PDF (UGOE) as a pivot format. This aspect could not be detailed within the survey table, but it certainly deserves to be developed by these partners.

A last observation, to be clarified in the future: it was not always easy to tell what use the participants made of each software tool or application. A more detailed benchmark may be worth conducting here.

Dissemination

Distribution, Referencing, Harvesting, Metrics

The majority of the participants use their own platform(s) to distribute their content (EKT, MWS, OAPEN, SHARE, UGOE, UniTo, UP). A smaller group uses other channels and, apart from one (OLH), this seems directly or partly related to their sales activity (OBP, OE, UCL, UP). For three of them (OBP, OE, UP), the number of distribution channels is logically very high. Although of minor importance, we can note that one of these (OE) outsources the distribution process to electronic bookstores.

As for referencing, it is more difficult to identify specificities. The main referencing entities among the partners are DOAJ, DOAB and EBSCO. Nevertheless, not every participant has its content referenced in each of them, and referencing is sometimes more limited (MWS, UCL, OLH). Some effort may be needed to achieve more uniform referencing throughout the consortium.

On the other hand, almost every participant maintains an OAI repository that supports the harvesting protocol (OAI-PMH). Even if differences obviously exist between the sets or the standards used, this remains a solid basis for effective interoperability.

The situation regarding metrics appears rather disparate, even if some synergies seem possible. Several partners use or will use Google Analytics (OBP, OLH, SHARE, UCL, UP). Others provide COUNTER statistics (EKT, OAPEN, OE, UniTo), but more information would be useful here, as producing COUNTER statistics is rather complex for OE while it seems automatic for UniTo with OJS. Finally, some partners use other applications: Piwik (MWS, OE, UP), AWStats (OE, soon to be completely replaced by Piwik) and ALM metrics (SHARE).
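To illustrate why a shared OAI repository is a practical basis for interoperability, the sketch below builds a harvesting request as defined by the OAI-PMH protocol. The base URL and the set name are placeholders, not actual partner endpoints, and the function name is our own. The `ListRecords` verb and the `oai_dc` metadata prefix (Dublin Core, which every OAI-PMH repository must support) come from the protocol itself rather than from the survey.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb="ListRecords",
                      metadata_prefix="oai_dc", set_spec=None):
    """Return the URL of an OAI-PMH request against `base_url`.

    `set_spec` optionally restricts the harvest to one set of the
    repository; set names differ from one repository to another.
    """
    params = {"verb": verb, "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint: each partner would expose its own base URL.
url = build_oai_request("https://repository.example.org/oai")
print(url)
# https://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

The same request works against any compliant repository, so a harvester only needs each partner's base URL. The differences mentioned above appear in the `set` and `metadataPrefix` parameters, which is where harmonization would have to take place.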

Editing

Peer-reviewing, proofreading, type-setting

We have grouped peer-reviewing, proofreading and type-setting in this “editing” section as parts of the traditional publishing activity.

Although not always directly involved in this editing work, most of the participants have integrated it into their own workflow. The situations are quite diverse and both extremes are present, from participants who are not involved in editing (UniTo) to those who are traditional publishers (OBP). In between, we find different levels of involvement.

As for peer-reviewing, we can observe that the participants whose publishing activity is part of library services can participate more or less directly (UGOE, UCL). In the other cases, peer-reviewing is a requirement or a recommendation (OE, EKT); the difference between these may have to be clarified in later surveys. The peer-reviewing of journals and of books tends to be the same (e.g. 2 academic referees), but this may also need to be confirmed by each participant concerned.

Proofreading and type-setting are most of the time carried out by the editor and the author. Nevertheless, the same participants involved in peer-reviewing also do the proofreading or the type-setting (OBP, MWS), and these tasks can also be outsourced (UCL, OLH).

Workflow

Process steps, Formats management, Access rights

Because they differ greatly according to the statuses, services and organizations, the workflows used by the partners cannot be identical. It was in fact difficult to give a clear and schematic representation of this section. Nevertheless, it should be possible to identify the tasks defining their mission, and more precisely their types, number and complexity.

The answers led to a first observation: the partners who use PKP publication tools (OJS, OMP) are greatly helped in structuring and formalizing their workflow. Although this gives a clear representation of the workflow, it is mainly “author-oriented” and does not really focus on the digital publisher’s work (the “layout editor” in the OJS schema).

Even if such a schema is not necessary for the OPERAS consortium, a short list of its main publishing activities would be useful to better assess the strengths and weaknesses of the workflows.

This list could be more or less the list of sections used in this report, and it is reflected in the various answers. For a better focus on “who does what, when?”, the list could be summarized in these specific digital publishing steps:

  • Editing: peer-reviewing (partly carried out, verified, requested?); copy-editing / type-setting (outsourced or not?); linear or circular process; access rights to the platform for authors or editors?
  • Admission: document taken as it is sent; document modified (to another format? Which one(s), with which tool?).
  • Enrichment: adding metadata (for search, for dissemination, for archiving?).
  • Dissemination: production of the output formats for the platforms; specific tasks related to the distribution outside the

These various aspects can of course be amended or completed, but they would give some sound elements to evaluate the length, the complexity and the efficiency of the digital publishing process.

Organization

Status, Funding, Budget

Although somewhat outside the scope of a technical mapping, the organizational characteristics have technical implications: IT autonomy and size, ability to scale, HR availability, etc.

Basically, one dominant organizational model emerges from the survey: public status with institutional funding.

A few exceptions can nevertheless be noted:

  • OAPEN: a not-for-profit foundation with public institutional funding;
  • OLH: a charitable company whose funding comes from donations;
  • OpenEdition: a public organization which receives institutional funding and freemium sales revenue;
  • OBP: a CIC (specific UK status allowing profits for public good) funded by grants, membership and sales;
  • UP: a private limited company (financed by APCs/BPCs and fees for books and journals).

The information on budgets was rather sparse and may be collected on another occasion, as it was slightly outside the scope of the technical investigation.

Prospects

A last set of questions tried to identify the partners’ interest in each other’s features and tools, or in those outside the OPERAS consortium.

It was probably a bit too soon to ask the participants which technical interactions were possible for them with or within the OPERAS consortium; this report may help to identify possible collaborations.

Among the few suggested collaborations, however, we can note the interest in the HIRMEOS implementations: identification, annotation, entity recognition (OBP, SHARE, UniTo). One partner (OBP) would be interested in changing its method of publication by using OJS, already used by other partners. As possible prospects of development for the entire OPERAS consortium, some participants would like to enrich their systems with data mining or text analysis (SHARE, UGOE).


doi (Technical Mapping): 10.5281/zenodo.1009562