Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections

Owen, David; Groom, Quentin; Hardisty, Alex; Leegwater, Thijs; Livermore, Laurence; van Walsum, Myriam; Wijkamp, Noortje; Spasić, Irena

Please use this identifier to cite or link to this item: http://dx.doi.org/10.34960/101

Full metadata record

DC Field	Value	Language
dc.contributor.author	Owen, David	-
dc.contributor.author	Groom, Quentin	-
dc.contributor.author	Hardisty, Alex	-
dc.contributor.author	Leegwater, Thijs	-
dc.contributor.author	Livermore, Laurence	-
dc.contributor.author	van Walsum, Myriam	-
dc.contributor.author	Wijkamp, Noortje	-
dc.contributor.author	Spasić, Irena	-
dc.date.accessioned	2021-11-10T12:50:50Z	-
dc.date.available	2021-11-10T12:50:50Z	-
dc.date.issued	2020	-
dc.identifier.citation	Owen D, Groom Q, Hardisty A, Leegwater T, Livermore L, van Walsum M, Wijkamp N, Spasić I (2020) Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections. Research Ideas and Outcomes 6: e58030. https://doi.org/10.3897/rio.6.e58030	en_US
dc.identifier.uri	https://know.dissco.eu/handle/item/237	-
dc.description.abstract	We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on state-of-the-art technologies. Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing, Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images. Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed for significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets, and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text. Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0. We have highlighted the main recommendations for potential pipeline components. The paper also provides guidance on selecting appropriate software solutions. These include automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.	en_US
dc.publisher	ICEDIG	en_US
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	*
dc.title	Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections	en_US
Appears in the Folders:	ICEDIG Project Outcomes

Files in This Item:

File	Description	Size	Format
RIO_article_58030.pdf		1.88 MB	Adobe PDF

This item is licensed under a Creative Commons License