#HeritageEveryware Lit Long Edinburgh: literary big data mapped & curated

Muriel Spark, Irvine Welsh and Ian Rankin all found global fame as authors of indelibly Edinburgh-set fiction in the second half of the last century, but as far back as the early 1800s Edinburgh was already conceived of in the popular imagination in literary terms. That trend was crystallised in the towering Scott Monument, completed in 1846 adjacent to Waverley station, itself named after Walter Scott’s novel of 1814.

In turn, the city’s myriad wynds, buildings, squares and volcanic outcrops have brought meaning and life to the pages of fiction, every bit as vivid as the memorable characters inhabiting them. This has had a halo effect on how the city is known, both by residents and by outsiders visiting or reading of it from elsewhere…

Guidebooks, travelogues and studies of old have long interlaced the real city fabric with its literary points of interest – both the real place settings found in fiction and the places where authors lived and worked – but even works styled as comprehensive never captured more than a portion of the total picture. Could the “whereness” of all Edinburgh-set work at a granular level ever be captured in aggregate?

Literary geocritic Robert Tally lamented as recently as 2011 that any such enterprise centered on a culturally rich city seemed a vain hope. But creeping up behind him were better computational tools and methods, and alongside them in the last decade, mass digitisation of out-of-copyright works.

Figure: Lit Long’s range of automated filter exploration options on the website homepage

The time was now ripe for mining and surveying the settings of literary Edinburgh using digital tools and datasets, and an interdisciplinary network of academics from the University of Edinburgh and the University of St Andrews banded together, thanks to AHRC funding, to realise this goal in the Palimpsest project.

Where big data, geolocation and digital humanities meet

Despite the potential they recognised as immanent in the marriage of digitised literary texts and tools for big data analysis, utopian and abstract views of technology’s problem solving power were given short shrift.

Cognizant of lessons learnt from other recent attempts to mine large datasets for the purposes of research in the digital humanities domain (particularly the Trading Consequences project), Palimpsest started out from the premise that human, scholarly curation of the results of automated queries of big data was key to fashioning digital tools fit to meet the core criteria upon which their project rested.

Equally important in the creation of robust data mining processes and digital curation tools was the adoption of an iterative approach, whereby expert user feedback from humanities scholars operated in sync with the technological development and was kneaded into the prototyping, pilot and assisted curation phases.

Incorporating geodata was equally essential to their enterprise, and as with the multiple digital collections that formed their literary bedrock, here again disparate datasets of location-based information were queried and merged to create a bespoke place-based record fit for their needs, aptly named the Edinburgh Gazetteer.

Figure: far view of the literary city map, with larger pins denoting the density of located works

So what were the core criteria of what would become the web and app-based outputs of Lit Long: Edinburgh? Palimpsest took inspiration from Tally’s preliminary thoughts on the matter, albeit with a more optimistic prognosis:

“How does one determine exactly which texts could, in the aggregate, reasonably constitute a meaningful body of material with which to analyze the literary representations of a given geographical site? … With certain cities such as Paris, London, Rome or New York, the almost mythic status of these places and the seemingly innumerable textual references to them render any geocritical analysis, at least those laying claim to a kind of scientific value, impossible… A geocentred method, if it aims to truly avoid the perception of bias, seems doomed from the start.” [Tally, 2011]

Palimpsest re-purposed his question for their central aim: “To examine the dimensions of literary Edinburgh through using text mining to scour accessible historical and fictional literary works to uncover those which mention Edinburgh or place names within it.”

As an innovative project in the literary and geospatial domain, Palimpsest rewards examination: its work sheds light on challenges both particular to Edinburgh and relevant across a much wider range of data mining and location-based humanities endeavours.

Tracing the assembly of Lit Long to learn from it

First, a confession. Sifting the otherwise illuminating papers and presentations published on the project to date, I reached my own state of textual and directional confusion.

Having read and reread them, the exact chronology of how Palimpsest built, modified and interlinked each tool used was unclear. Sometimes it seemed they were explaining a step I thought had happened earlier; at other points I felt they’d jumped ahead, skipping certain steps that other articles on the project said had been required to get there.

With a mounting sense of dread I realised that in my journey to plumb the matrix from which Palimpsest had fashioned Lit Long: Edinburgh I’d become somewhat lost in it, like a first-time visitor to Auld Reekie squinting at an unfinished map, uncertain of what was where and how to proceed as night and fog descended.

Figure: an old map of the east end of High St and west end of Canongate

So instead of stepping back to analyse the project in sum, I’ve set out here to summarise the main steps Palimpsest took to configure, finesse and productively interlink the range of tools and literary sources used to produce the final dataset that powers the Lit Long web and mobile app interfaces.

Tracing out the journey in summary has helped me understand (from a digital heritage management point of view) how the enterprise came together and unfolded. Hopefully it might help others. [Disclaimer: there may still be errors. I’m poised for corrections!]

Unpacking this process also helps flesh out how, as a pioneering project in the literary and geospatial domain, Palimpsest’s Lit Long: Edinburgh has explored the new frontier of automated text mining and uncovered some of its shortcomings from a humanities and geolocative perspective, proactively turning failures into problem-solving opportunities and quick-fire lessons.

There is much, much more to Lit Long: Edinburgh of course, especially from literary critical, shared public space, pervasive heritage, and digital experience design perspectives, but that’s for other days (see also the further reading at the end of this post). The need to get a handle on the process requires me to focus.

Text mining big data for local meaning: the digitised literary collections

So, what informed and shaped the Palimpsest process? Following an earlier smaller prototype centred purely on scholarly crowdsourcing and curation of widely known Edinburgh-set texts, the team set out to build tools that could automate the discovery of fully or partly Edinburgh-based literature from the sum of all digitised texts (albeit largely those in English and omitting poetry) that were now available.

One major limitation on this was the thorny issue of copyright, so they drew principally on the digitised out-of-copyright collections that were available. They augmented these with the in-copyright works of a handful of modern authors, including Irvine Welsh and Alexander McCall Smith, who permitted their reuse solely for this project, giving Lit Long an up-to-date slant and some welcome contemporary voices.

Amounting to nearly 380,000 digitised works, these collections were:

• Public domain subset of HathiTrust data (243,250 documents)
• British Library 19th Century Books Collection (65,235 documents)
• Project Gutenberg data (64,047 documents)
• A collection of National Library of Scotland documents (3,007 documents)
• Oxford Text Archive Text Encoding Initiative (TEI) text data (2,729 documents)
• A small set of in-copyright works from modern authors (46 documents)

Place matters: marrying datasets of geolocative records

Given their aim, an initial stumbling block was that no unified record existed of all the types of places (denoting an Edinburgh location) mentioned in literary works that they wanted to include – area names, street names, open spaces, buildings, statues and monuments.

So they did it themselves, creating the bespoke georeferenced ‘Edinburgh Gazetteer’ by drawing upon four digital sources:

•  Ordnance Survey Locator
•  Canmore site records (Historic Environment Scotland)
•  Edinburgh subset of OpenStreetMap
•  Quattroshapes (Foursquare)

To do this they first aggregated the sources, then de-duplicated them, and lastly cleaned the data to discard records which might trigger faulty recognition of place names. The resulting Edinburgh Gazetteer comprised 13,064 records covering 10,204 unique place names.
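To make that aggregate → de-duplicate → clean sequence concrete, here’s a minimal Python sketch. The record structure, field names and the ~100 m coordinate-rounding threshold are my own illustrative assumptions, not Palimpsest’s actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlaceRecord:
    name: str    # place name as given by the source
    lat: float   # WGS84 latitude
    lon: float   # WGS84 longitude
    source: str  # e.g. "os_locator", "canmore", "osm", "quattroshapes"

def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so near-identical names compare equal."""
    return " ".join(name.lower().split())

def aggregate_and_dedupe(sources, precision=3):
    """Merge all source record lists, keeping one record per
    (normalised name, rounded coordinate) pair. Rounding coordinates
    to ~3 decimal places (roughly 100 m) treats near-coincident points
    from different sources as the same place."""
    seen, merged = set(), []
    for records in sources:
        for rec in records:
            key = (normalise(rec.name),
                   round(rec.lat, precision), round(rec.lon, precision))
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged

def clean(records, stoplist):
    """Discard records whose names are too generic or ambiguous to be
    safely spotted in running text (the 'faulty recognition' risk above)."""
    return [r for r in records if normalise(r.name) not in stoplist]
```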

As such, even before they could filter and structure the data by automated means, human-assisted curation of a robust locative record was needed. The geolocatory imperative meant judgement and expertise were required just to get off the starting blocks.

Despite their pre-existing expectation that pure automation would not suffice to meet their goal, they applied these methods systematically as they moved into the project’s next stage, allowing for the possibility that applied research could prove their hypothesis wrong.

Pilot phase (2 weeks): automated curation of digitised works

At the outset of their pilot phase, they took the data from the five digitised literary collections (amounting to nearly 380,000 out-of-copyright works, plus the 46 in-copyright works) and converted it in aggregate to standardised XML files.

They then indexed it using the Indri 5.5 search engine, developed as part of the Lemur Project toolkit (a collaboration between the University of Massachusetts Amherst and Carnegie Mellon University). Supporting large-scale search, Indri has two main components, a query language and a retrieval model, both supporting retrieval at different levels and of different types.

Indri exposes two main functions. The first, IndriBuildIndex, built an index over the entire digitised collections dataset aggregated by Palimpsest, covering every document in it. The second, IndriRunQuery, was used to query the index files so created.

After indexing the collections in this way with Indri 5.5, they ranked the documents using a set of 1,633 Edinburgh place names. [Note: I’m in a cold sweat at this point thinking about the complexity of all this, but also awed by the ingenuity and ambition; this is why applying smart data development tools and smart people to messy cultural heritage datasets is amazing!]
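For the curious, here’s a rough sketch of what driving Indri from Python might look like. IndriBuildIndex and IndriRunQuery are Indri’s real command-line tools and the parameter-file fields shown are standard Indri options, but the paths, the trectext document class and the idea of posing the place-name lexicon as one big #combine query are my own guesses at the setup, not the project’s published configuration:

```python
import subprocess
from pathlib import Path

def build_index(corpus_dir: str, index_dir: str) -> None:
    """Write an Indri parameter file and run IndriBuildIndex over the corpus."""
    Path("build.params").write_text(f"""<parameters>
  <corpus><path>{corpus_dir}</path><class>trectext</class></corpus>
  <index>{index_dir}</index>
</parameters>""")
    subprocess.run(["IndriBuildIndex", "build.params"], check=True)

def rank_by_place_names(index_dir: str, place_names, top_n: int = 1000) -> str:
    """Rank documents with a single #combine query over the place-name
    lexicon; #1(...) asks Indri for an exact phrase match, so multiword
    names like 'Princes Street' are handled."""
    terms = " ".join(f"#1({name})" for name in place_names)
    Path("query.params").write_text(f"""<parameters>
  <index>{index_dir}</index>
  <query><text>#combine({terms})</text></query>
  <count>{top_n}</count>
</parameters>""")
    result = subprocess.run(["IndriRunQuery", "query.params"],
                            check=True, capture_output=True, text=True)
    return result.stdout  # one score/docid line per ranked document
```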

Figure: presentation slide summarising the project’s geolocatory tasks (Alex, B., 2014)

Next they ran this formatted version of the digitised collections through a big data pipeline. The pipeline first queried the inputted text against their bespoke Edinburgh Gazetteer, then against two other geolocatory systems: GeoNames and the Edinburgh Geoparser.

The pipeline then queried the text in terms of publication metadata. Note that at this point not all the documents had genre metadata, or other standardised metadata, for Palimpsest to query.

The output of this pilot phase process, which had centred purely on automated analysis of the literary datasets, was a ranked set of documents per collection.

Inspecting the automated results: genre & location credibility gaps

At this point, scrutinising the ranked list of documents they were presented with, it became abundantly clear to the scholars involved that there was a major problem: many less relevant documents appeared high up the rankings, largely due to:

•  the high frequency of place name mentions in non-literary texts, and
•  ambiguous place names that exist in Edinburgh but are frequently used in other places (e.g. London, New York and Boston), meaning that non-Edinburgh-set works were gatecrashing and muddying their dataset.

This messed things up a lot, as it meant these documents didn’t match their literary criterion one iota, namely:

“Edinburgh-centric works, which either belonged to a recognisable literary genre, such as the novel or short story, or had strong narrative or loco-descriptive components (eg. memoirs and travel journals)”

Facing up to the shortcomings of the digital ranking system, with its many spurious inclusions, that their curatorial review of the pilot phase had surfaced, the team decided to focus their next-stage efforts on a single, more reliable collection.

Assisted curation phase: textual analysis of Edinburgh-centric literature

Given the issues surfaced above, they now took the ranked data solely from the largest digital collection: the HathiTrust public domain documents with genre information available in their metadata.

They built a semi-automatic curation tool for their participating scholars to use – in lieu of describing its main functionalities and interface, here’s a screenshot:

Figure: presentation slide showing the assisted curation tool interface created for the project (Alex, B., 2014)

Based on feedback from the curators’ two-week pilot, they added historical place names and variants of Edinburgh (e.g. ‘Edinboro’, ‘Edinbra’, ‘Edinburg’, ‘Edinbrughe’, ‘Edinburrie’, ‘Embra’ and ‘Embro’) to the Edinburgh Gazetteer place name lexicon, and made document inclusion conditional on Edinburgh or a variant occurring at least once.

In turn, they removed works with non-literary title words such as ‘dictionary’ and ‘catalogue’.

They upweighted documents for multiple Edinburgh place name mentions within a document and its associated Library of Congress metadata, and downweighted documents for ambiguous place names.

They then applied this modified ranking system to the books, journals and other literary entities in the HathiTrust public domain collection.
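Pulling those post-pilot heuristics together, here’s an illustrative Python sketch of the modified scoring: Edinburgh-variant gatekeeping, title-word filtering, upweighting of place-name mentions and downweighting of ambiguous ones. The matching is deliberately naive and the weights are invented; the real system worked over indexed text and Library of Congress metadata:

```python
import re
from typing import Optional

EDINBURGH_VARIANTS = {"edinburgh", "edinboro", "edinbra", "edinburg",
                      "edinbrughe", "edinburrie", "embra", "embro"}
NON_LITERARY_TITLE_WORDS = {"dictionary", "catalogue"}

def mentions(text: str, name: str) -> int:
    """Count whole-word, case-insensitive occurrences of a place name."""
    return len(re.findall(rf"\b{re.escape(name)}\b", text, flags=re.IGNORECASE))

def score_document(title: str, body: str,
                   place_names: set, ambiguous: set) -> Optional[float]:
    if not any(mentions(body, v) for v in EDINBURGH_VARIANTS):
        return None  # inclusion conditional on Edinburgh (or a variant) occurring
    if any(w in title.lower() for w in NON_LITERARY_TITLE_WORDS):
        return None  # drop dictionaries, catalogues and the like
    score = 0.0
    for name in place_names:
        n = mentions(body, name)
        if n:
            # upweight multiple mentions; downweight names that also
            # exist outside Edinburgh (e.g. London, New York, Boston)
            weight = 0.5 if name.lower() in ambiguous else 1.0
            score += weight * n
    return score
```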

Using this optimised retrieval and ranking framework, a total of 33,277 documents were presented to the curators. Aware of the relevance disparities between the collections, and of the diminishing project time left, they chose to run only the top 10% of each ranked collection through this modified tool.

What emerged from the assisted curation of this subset was that the ranked output increased the mean average precision (MAP) score of the documents that passed muster only slightly; more significantly, it greatly decreased the number of documents to consider, removing so many deemed of negligible or no relevance that the total shrank by almost 60%.
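As a refresher on the metric: mean average precision rewards rankings that place relevant documents near the top. A standard textbook implementation (not the project’s own evaluation code) looks like this:

```python
def average_precision(ranked_ids, relevant):
    """AP for one query: mean of precision@k at each rank k holding a relevant doc."""
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP: the average of per-query AP scores over all queries."""
    return sum(average_precision(runs[q], qrels.get(q, set())) for q in runs) / len(runs)
```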

Final text mining and geo-resolution steps

This human-assisted curation phase resulted in 503 out-of-copyright documents considered to meet the Palimpsest literary criteria (plus 43 works from modern authors): quite a small dataset in contrast to the original 380,000 digitised works considered at the project’s beginning!

Palimpsest’s journey from big (or “biggish”) to small data makes sense though when we remember they were asking a very particular, localised question of a large part of the total English language digitised corpus. As Tim Hitchcock has observed, it’s in the particular that meaning – and thereby better understanding of the universal – is concentrated.

Figure: Morningside snippet & map; The Prime of Miss Jean Brodie, Muriel Spark

To ready the finalised data for inclusion in the Lit Long: Edinburgh app and website, they ran it through a text mining pipeline (an adapted version of the Edinburgh Geoparser) which recognised place names and other entities in the text, plus a geographic ambiguity resolution component to resolve competing place names given their textual context.

This text mining pipeline worked by first converting text into common XML format segments, then incrementally adding annotations into the markup at each stage of processing.

In this way, the text was first segmented into paragraphs, which were tokenised to add word and sentence elements. ‘Named Entity Recognition’ was then performed, followed by place name recognition (using lexicons from the UK and the rest of the world, augmented with the Edinburgh Gazetteer lexicon ahead of the OS and GeoNames lookups).
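To illustrate the ‘incremental annotation into the markup’ idea, here’s a toy Python/lxml sketch: a first pass wraps sentences and words in elements, and a second, deliberately naive pass tags single-token place names. The element and attribute names are illustrative, not the Edinburgh Geoparser’s actual schema:

```python
import re
from lxml import etree

def tokenise_paragraph(p):
    """Stage 1: replace a paragraph's raw text with <s> (sentence) and
    <w> (word) elements so later stages can annotate individual tokens."""
    text, p.text = p.text or "", None
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        s = etree.SubElement(p, "s")
        for word in sentence.split():
            w = etree.SubElement(s, "w")
            w.text = word

def mark_place_names(p, lexicon):
    """Stage 2: tag tokens found in the place-name lexicon. Real place name
    recognition also handles multiword names and uses context."""
    for w in p.iter("w"):
        if w.text and w.text.strip(".,;:").lower() in lexicon:
            w.set("type", "placename")

doc = etree.fromstring("<doc><p>Leith was quiet. They walked to Canongate.</p></doc>")
for p in doc.iter("p"):
    tokenise_paragraph(p)
    mark_place_names(p, {"leith", "canongate"})
print(etree.tostring(doc, pretty_print=True).decode())
```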

The output of text mining contained named entity annotations for dates, person names and place names. This was then input to the geo-resolution step, which looked up the place names in one or more gazetteers (the Edinburgh Gazetteer first, then the other geolocatory systems: OS Locator and GeoNames) and ranked the results to arrive at the most probable interpretation (i.e. the geographic location) given the context of the document.
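A hedged sketch of how such a priority-ordered, context-aware lookup might work. The gazetteer ordering follows the post; the candidate structure, the ‘nearby’ field and the scoring are invented stand-ins for the geoparser’s real disambiguation features:

```python
# gazetteers in the lookup order described above
GAZETTEER_PRIORITY = ["edinburgh_gazetteer", "os_locator", "geonames"]

def resolve(name, resolved_so_far, gazetteers):
    """Return the most probable candidate location for a place name.
    `gazetteers` maps gazetteer id -> {lower-cased name -> [candidate dicts]};
    each candidate carries an illustrative 'nearby' list of place names.
    We stop at the first gazetteer with any candidates, then prefer the
    candidate sharing the most neighbours with places already resolved
    in this document (a crude proxy for document context)."""
    for gaz_id in GAZETTEER_PRIORITY:
        candidates = gazetteers.get(gaz_id, {}).get(name.lower(), [])
        if candidates:
            return max(candidates,
                       key=lambda c: len(resolved_so_far & set(c.get("nearby", []))))
    return None
```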

Figure: Leith snippet, map, related author works & resources; Trainspotting, Irvine Welsh

These geo-resolution results were added as XML annotations to each document and then the immediate context of each geo-referenced place name (with the prior and subsequent sentence, barring paragraph breaks) – which they named snippets – was marked up for display in the Lit Long interfaces.
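The snippet rule as described (the place-name sentence plus its neighbours, never crossing a paragraph break) is simple to sketch; working per paragraph makes the boundary cap automatic. The function names here are mine:

```python
def paragraph_snippets(sentences, contains_place_name):
    """Given one paragraph's sentences and a predicate marking sentences with
    geo-referenced place names, return each match with its prior and
    subsequent sentence. Windows are clipped at the paragraph edges, so
    snippets never straddle a paragraph break."""
    snippets = []
    for i, sentence in enumerate(sentences):
        if contains_place_name(sentence):
            window = sentences[max(0, i - 1): i + 2]
            snippets.append(" ".join(window))
    return snippets
```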

Mapped and curated data: revealing the invisible literary city

You can see how these snippets and their associated maps and online resources display in the web version of Lit Long: Edinburgh in the two images above.

Navigating and swiping through the intuitive and elegant Lit Long website and app, which also had a major redesign and new features added in 2017, you’d never know such massive effort and complexity lay behind it!

Options for exploring the mass of snippets include filtering the map view by genre, author, gender or book title, and a lucky dip option if randomness is preferred. These results can further be constrained by date and by the number of locations to display.

You can also stitch snippets together into ‘Paths’ that you can save and choose to share: perhaps to plan a group walk, to record the snippets encountered en route while using the app in situ, or for research, or to illustrate an essay or tour guide route… The potential uses are almost endless.

Big data in the literary cityscape: chaos tamed by the human touch

As the Palimpsest / Lit Long enterprise shows, the growing ocean of digitised humanities data in its current incarnations is often messy and imprecise (although this is changing), a welter of undefined genres, disparate schemas and uneven vocabularies.

Its much-vaunted promise is that it can now be queried, analysed and mapped en masse, allowing previously unseen patterns and insights to be identified, layering extra meaning onto past learning, and rendering the cacophony comprehensible at scale and newly navigable for wide audiences.

Working with humanities big data means not only doing all this at a scale not possible before; it also drives the parallel development of the assistive computational tools and critical curation methods required for the job, which are now taking root and evolving at pace in the overlapping digital and spatial humanities sectors.

Given its iconic association with the world’s largest monument to a literary author, there’s something apt about the fact that – amid all the artifice and ambience of Edinburgh as a literary entity – it has become the world’s first comprehensively (out of copyright, at least) and reliably mapped literary city. Science and art are forging a brave new alliance in the Athens of the North.

Photo: The Scott Monument, Princes Street, Edinburgh; August 2017, © Dialling The Past

Yet there’s also a cautionary twist to this tale. The centrality of both human curation and place-centric analysis to making the data, place and text mining process actually work here tells us that, at this stage, the Palimpsest approach in its totality isn’t a cast-iron blueprint for digitally mapping the literary landscape of all cities.

Urban centres are not replicas of each other and are known rather for their distinct geographies, histories and cultures. So adopting identical procedures for uncovering and accurately mapping their literary mentions is unlikely to fly very far before coming unstuck somewhere.

As a rough guide, however – an emerging assemblage of curation, automation and development methods – Palimpsest has great promise. Delivering a working product that broke new ground, Lit Long is a compelling and exciting proof of concept.

Some of their more specific tools, such as the georeferenced Edinburgh Gazetteer and Lit Long’s Paths feature, have an integrity and completeness of their own that marks them out as robust and reusable tools which a broad range of other local projects and services could potentially leverage and benefit from. Equally, other cities could follow or adapt Palimpsest’s model to create gazetteers and path features of their own.


In turn, other digital humanities projects working to extract meaning or relevance from large datasets could (if other approaches have proved wanting) gain from adopting Lit Long’s interdisciplinary, iterative feedback development method, which enabled the team to spot and respond to issues more swiftly than a more rigid, less collaborative development model would have permitted.

Given limited time and resources, this empowered them to change tack adeptly, improve their tools’ efficacy at pace and pivot away from wasteful cul-de-sacs in their sprint to fashion data fit for purpose. The flexibility of software development allows for this, while its complexity means that pinning a whole project on a preset path makes it harder and more costly to reverse course later, when a siloed team has missed the early warning signs that things aren’t working.

Finally, their text mining procedures were not intrinsically flawed; they just needed to be combined with expert judgement and a pragmatic openness to intervention and change whenever errors of a disruptive magnitude surfaced.

The digital wellspring: inspiring current audiences & future innovation

As the odyssey to craft Lit Long: Edinburgh has highlighted, there are, on the one hand, pre-existing peculiarities of locale and differing linguistic patterns that must be reckoned with; on the other, cultural data itself is often riddled with metadata gaps and other inconsistencies, making it hard to interpret with passable accuracy when only automated processing is relied on.

In this context human and locally expert input into automated processing of humanities datasets for projects such as Lit Long is still essential, though it may ebb as tools improve. Moreover, what forms the optimum process for literary data mapping in other cities might mirror Lit Long in some respects but be notably different in others.

Nonetheless Palimpsest’s experience can still usefully inform similar projects elsewhere, and spark further innovation. Their processes can be tested, refined and adapted; their tools and design interfaces reviewed and built upon. Being first and delivering a fun, quality product raises aspirations and provides the impetus for the next-generation technologies and experiences that will reanimate other literary neighbourhoods.

Lit Long’s technical limits (its app, for example, is only available for iOS) and residual data conundrums (the map positioning of snippets which mention only the city name seems random, for instance) are not out of the ordinary for small-budget projects and certainly haven’t derailed their momentum.

On the contrary, they act as gauntlets thrown down: to get more people using it and garner more constructive feedback, and to spur future efforts to improve it or its next-generation replacement, whetting the appetite for more bold experiments, creativity and public interaction with cultural datasets and the literary world they shed light on.

Digital humanities in & with the public: recasting the engagement dynamic

Humanities big data is currently often cut off from real public engagement, and difficult to parse for cultural meanings such as genre through distant reading (although Ted Underwood has moved the dial on that). But when approached with a user-focused and public-minded purpose, and a collaborative approach to making sense and use of it, it can become a magnet for a host of audiences.

In Lit Long’s case, its comprehensiveness has already bubbled up gaps and biases in our broader literary bookkeeping, surfacing works from lost, marginalised and overlooked authors. Struck by how many of the women writers brought into view had scant coverage in the digital space, the team organised Lost Literary Edinburgh, a Wikipedia editathon, as part of the Being Human 2017 festival, rallying a posse of current and new editors to create or improve Wikipedia entries for these neglected authors.

As Lit Long already linked where possible to the collaborative encyclopedia for their writers’ biographies and works, working with the public to redress this imbalance was a logical extension of their core criteria, simultaneously improving Lit Long’s resources and adding value to the wider public knowledge ecosystem.

Figure: a user-created path featuring some of Lit Long: Edinburgh’s in-copyright works

Having unlocked and joined the dots of the city’s rich literary narratives at street level, Lit Long: Edinburgh has the potential to strike a chord with many publics – from secondary level pupils and teachers, through to tourists, historians, game players and book lovers everywhere.

From the casually curious to the diehard urban detective, from the committed community caretaker to the creative magpie hunting nuggets of inspiration… Lit Long’s lively events programme has already started to engage multiple audiences, growing a diverse community and becoming a catalyst for creative activity on many levels, a social hub as much as digital construct, a host rather than a higher authority.

Lit Long resonances: all our stories in place & unbound

“Elementary, my dear Watson” were words never spoken by Sherlock Holmes in the writings of Edinburgh-born Arthur Conan Doyle; they first appeared in a later film adaptation. Perhaps Lit Long, as a new entry point to the narrative continuum embedded all around us, will take root in the same organic way, becoming a byword both for those looking to demystify and for those seeking re-enchantment in the cityscape’s literary canvas.

Want to liberate your collections and integrate your stories with the public realm? Want to explore history where it happened and let people shape their own encounters? The reanimated literary city is your oyster. “Just Lit Long it”.

Whatever happens, Palimpsest’s mission to comprehensively mine place-based meaning from the digital archive and connect it to the city’s grid has definitely brought the riches buried in big data into sight in a pioneering fashion, imbuing the urban fabric with new depth and transforming accessibility to digitised collections for present day audiences.

A corner has been turned and with it Tally’s pessimism put out to pasture. Don’t just take my word for it – sample the multiple avenues, timeframes and DIY danders of Lit Long: Edinburgh yourself and see if it’s a landscape you’ll dive further into.

====

Acknowledgements:

I am grateful to Dr Beatrice Alex, Research Fellow in Text Mining, and James Loxley, Professor of Early Modern Literature, both at the University of Edinburgh, for their permission to reproduce the two British Library Labs presentation slides featured, and for their helpful supply of relevant background publications relating to the Palimpsest Lit Long: Edinburgh project.

Follow the project’s continuing adventures on Twitter @litlong

Further Reading:

Alex, B. Palimpsest: an Edinburgh Literary Cityscape. Invited talk to the British Library Labs Symposium 2014, London, UK, 3rd November 2014. [PDF slides]
http://homepages.inf.ed.ac.uk/balex/talks/BL-Labs-Symposium-slides.pdf

Alex, B. Text mining big data: potential and challenges. Big Data Approaches to Intellectual, Cultural and Linguistic History, Helsinki Collegium for Advanced Studies Symposium, Helsinki, Finland, 1st December 2014. [PDF slides]
http://homepages.inf.ed.ac.uk/balex/talks/HCAS-slides.pdf

Edinburgh’s literary history mapped at the click of a button, The Guardian, 28th March 2015
https://www.theguardian.com/uk-news/2015/mar/28/edinburgh-literary-history-online-map-lit-long

Hitchcock, T. (2014). ‘Big Data, Small Data and Meaning.’
Historyonics blog post, 09/11/2014 based on his keynote talk at the British Library Labs Symposium on 03/11/2014.
http://historyonics.blogspot.co.uk/2014/11/big-data-small-data-and-meaning_9.html

Robert Tally ‘Foreword’ in Bertrand Westphal and Robert Tally, Geocriticism: Real and Fictional Spaces, New York: Palgrave Macmillan, 2011

Loxley, J., Alex, B., Anderson, M., Hinrichs, U., Grover, C., Thomson, T., Harris-Birtill, D., Quigley, A. and Oberlander, J. ‘“Multiplicity embarrasses the eye”: The digital mapping of literary Edinburgh’, in Ian Gregory, Don Debats and Don Lafreniere (eds), The Routledge Handbook of Spatial History, Routledge, January 2018. https://www.routledgehandbooks.com/doi/10.4324/9781315099781-35

Alex, B., Grover, C., Oberlander, J., Thomson, T., Anderson, M., Loxley, J., Hinrichs, U. and Zhou, K. ‘Palimpsest: Improving assisted curation of loco-specific literature’, Digital Scholarship in the Humanities, Volume 32, Issue suppl_1, 1 April 2017, Pages i4–i16. https://doi.org/10.1093/llc/fqw050

Underwood, T. ‘Understanding Genre in a Collection of a Million Volumes’, Interim Report, December 2014, http://dx.doi.org/10.6084/m9.figshare.1281251
