Case studies
Geospatial data extraction from printed publications
The REEDS Nautical Almanac, among the most recognised publications in leisure sailing, is a comprehensive guide to local tidal information; lights, buoys and waypoints;
distance tables; passage information; and special notes, accompanied by detailed local chartlets, covering the entire UK coastline and Atlantic Europe.
This is coupled with an equally comprehensive reference section on regulations, navigational principles, weather, safety and communications.
The requirement was to derive a digital dataset from this publication for use in future digital applications and for marketing to new partners.
This required a bespoke content extraction system that accepted many files in various formats, including MS Word, Adobe InDesign, MS Excel and proprietary formats, and aggregated their content in a database specially designed to retain the structure and 'meaning' of the book's content.
The structured nature of the book presented a number of challenges, including:
- Retention of content organisational structure through recognition of styles and font attributes
- Detection and interpretation of proprietary font data for symbol recognition
- Recognition and interpretation of geospatial information (latitude/longitude, depth indicators, etc)
- Recognition and extraction of tabular data
- Retention of book order for cross-referencing
- Extraction of index information to facilitate search functions
- Recognition of proprietary use of text attributes (bold, italic, etc), for example to indicate proper nouns, safety information, radio frequencies and local facility information
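As an illustration of the last challenge, the following is a minimal Python sketch of turning font attributes into semantic labels. The Run class, the label_runs function and the mapping itself are hypothetical names and values chosen for this example; the real conventions would come from the publication's own style rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """A span of text that shares one set of font attributes."""
    text: str
    bold: bool = False
    italic: bool = False


# Hypothetical mapping from (bold, italic) to meaning. These pairings are
# invented for illustration and are not the almanac's actual conventions.
ATTRIBUTE_MEANING = {
    (True, False): "proper_noun",
    (False, True): "safety_information",
    (True, True): "radio_frequency",
}


def label_runs(runs):
    """Return (label, text) pairs, labelling unstyled text as "body"."""
    return [
        (ATTRIBUTE_MEANING.get((run.bold, run.italic), "body"), run.text)
        for run in runs
    ]
```

Storing the label alongside the text, rather than the raw attribute, is what lets later applications query by meaning instead of by appearance.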
The resulting database is more than a digital copy of the printed publication. For example, recognising geographical positions (latitude/longitude) and converting them to numerical data enables features to be plotted over digital cartography, while extracting local facility information from plain text into structured data could inform a local facility search tool. Alternatively, the database could simply form the basis of a content management system to streamline the future production of printed publications.
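The position conversion mentioned above might look like the following minimal Python sketch. It assumes positions are printed as degrees and decimal minutes followed by a hemisphere letter (for example 50°47.38'N); that notation and the function name to_decimal_degrees are assumptions made for this example, and a real source would need its own pattern.

```python
import re

# Assumed notation: degrees, decimal minutes, hemisphere letter,
# e.g. "50°47.38'N". A real publication may print positions differently.
_POSITION = re.compile(
    r"(?P<deg>\d{1,3})°\s*(?P<min>\d{1,2}(?:\.\d+)?)'?\s*(?P<hem>[NSEW])"
)


def to_decimal_degrees(text):
    """Convert one printed latitude or longitude to signed decimal degrees."""
    match = _POSITION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised position: {text!r}")
    minutes = float(match["min"])
    if minutes >= 60:
        raise ValueError(f"minutes out of range: {text!r}")
    value = int(match["deg"]) + minutes / 60
    # South and west are negative by the usual convention.
    return -value if match["hem"] in "SW" else value
```

With both coordinates converted in this way, a feature can be stored as a numeric pair and passed directly to mapping software.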
These are just examples; the details will be specific to the publication(s) in question. Once a digital dataset has been extracted in this manner, further post-processing can derive new datasets, aggregate ancillary sources, and support any number of digital applications and new revenue streams.
Portsmouth Samaritans Shop & Support Network
Using our charityCLICKS toolkit we developed a search and shopping portal for supporters of The Portsmouth and East Hampshire Samaritans to raise money while they shop online.